Does that mean the query is broken up into 2 processes that are
delegated to unique "chunks" and executed in parallel say like so:
1. find all the documents that have the word bat in them -> v1
2. find all the documents that have the word cat in them -> v2
if thats the case how do they reduce the complexity of:
find the intersection between v1 and v2 where the words (text
patterns) "Bat" and "Cat" are consecutive (and please don't think by
consecutive I mean "Bat" word word word "Cat" or any other form -
just plain old simple consecutive)
Best case join complexity is O(nm) and from there you still have to
make sure that Bat and Cat are consecutive, which is another O(n)
scan.
I'm well aware of the breaking down and distribution of the problem
over large clusters etc, its just I'm unsure of the details of the
final steps mainly the join and the final filtering process. also
another problem would be partial match queries...
Sherrie
Brian Wakem wrote:
> Sherrie Laraurens wrote:
>
> > Hi Brain,
> >
> > I'm not interested in the variations of logical predicates you
> > presented, also my question relates to how said queries be them sql
> > or something else are actually executed in a timely fashion.
> >
> > The phrase "I have written a boolean search engine myself" followed
> > by "It simply generates the SQL queries" basically means you didn't
> > write anything of importance just a DB front end and sql generator
> > for an actual db which is doing all the dirty work for you, I guess
> > anyone with a couple of hours on their hands and an SQL compatible db
> > can say that they too have written a search engine of sorts, but alas
> > this was not the answer(s) I was looking for.
>
>
> Then your question relates to scaling and is nothing to do with boolean
> searches at all.
>
> Google use 'chunck servers'. The data is split into say 10,000 parts, each
> part stored on 1 server. Each of those servers is then replicated by say
> 10 others. When you a query need answering it is sent to the relevant
> chunk servers which process in parallel and then feed the results back
> where they are combined.
>
> For a single query this is slightly less than 10,000 times faster than
> storing everything on a single machine with a large disk array.
>
> For multiple parellel queries it is even faster as the query can be sent to
> any of the 10 chunck servers that hold each chunk of data.
>
> That would be just 1 datacentre, google then copy the whole lot over dozens
> of datacentres (not a copy in reality as they all seem to hold different
> data).
>
> Note: The numbers I have quoted are guesses.
>
>
> --
> Brian Wakem
> Email: http://homepage.ntlworld.com/b.wakem/myemail.png