Go Back   Forum Care Forums > Development Reference Area > SEO & Traffic Generation

Reply
 
LinkBack Thread Tools Display Modes
Boolean query algorithms
Old
  (#1)
Sherrie Laraurens
Guest
 
Posts: n/a
Default Boolean query algorithms - 05-09-2007, 04:28 PM

Hi all,

I have a question relating to how search engines and (in fact
anything else that supports boolean queries) manage to do such
things so efficiently.

my question involves a query were you would like to retrieve the all
the documents in your corpus that have the words "Cat" and "Bat" in
them and that they not only contain both words but that they must be
consecutive for example "Bat Cat"

I can think of a very crude way of doing this which involves hashing
every word in a document into a hash table and storing an index for
said document , then in the query stage to hash both words (hash
join) get the intersection vector of the resulting vectors from the
hashing process, then one by one examine each document from the
intersection vector find the word "Bat" and see if the word "Cat" is
the next word if it is place said document into the final result
vector and once finished p*** back to user.

I believe this will work for AND and OR type queries but I can't
imagine systems like google or yahoo using such a CRUDE method nor
can i see them caching such results, just because the sheer amount
of combinations things they would have to be caching.

Does anyone here have any idea on how these things are done? any
keywords I could use to google etc..




Sherrie

   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote

Re: Boolean query algorithms
Old
  (#2)
Brian Wakem
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

Sherrie Laraurens wrote:

> Hi all,
>
> I have a question relating to how search engines and (in fact
> anything else that supports boolean queries) manage to do such
> things so efficiently.
>
> my question involves a query were you would like to retrieve the all
> the documents in your corpus that have the words "Cat" and "Bat" in
> them and that they not only contain both words but that they must be
> consecutive for example "Bat Cat"



In which case it isn't a boolean query at all, you are looking for "Bat
Cat", not Bat AND Cat.


> I can think of a very crude way of doing this which involves hashing
> every word in a document into a hash table and storing an index for
> said document , then in the query stage to hash both words (hash
> join) get the intersection vector of the resulting vectors from the
> hashing process, then one by one examine each document from the
> intersection vector find the word "Bat" and see if the word "Cat" is
> the next word if it is place said document into the final result
> vector and once finished p*** back to user.
>
> I believe this will work for AND and OR type queries but I can't
> imagine systems like google or yahoo using such a CRUDE method nor
> can i see them caching such results, just because the sheer amount
> of combinations things they would have to be caching.
>
> Does anyone here have any idea on how these things are done? any
> keywords I could use to google etc..



I have written a boolean search engine myself, but didn't do anything as
convoluted as you describe.

It simply generates the SQL queries complete with ANDs ORs and NOTs based on
the user input.

You have to account for brackets and quoted phrases too of course.

The biggest problem we have is explaining to our users that:

A OR B AND C

is the same as:

A OR (B AND C)

not:

(A OR B) AND C

as OR has lower precedence.

Unfortunately this is beyond the comprehension of your average computer
user.


--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png
   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Re: Boolean query algorithms
Old
  (#3)
Sherrie Laraurens
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

Hi Brain,

I'm not interested in the variations of logical predicates you
presented, also my question relates to how said queries be them sql
or something else are actually executed in a timely fashion.

The phrase "I have written a boolean search engine myself" followed
by "It simply generates the SQL queries" basically means you didn't
write anything of importance just a DB front end and sql generator
for an actual db which is doing all the dirty work for you, I guess
anyone with a couple of hours on their hands and an SQL compatible db
can say that they too have written a search engine of sorts, but alas
this was not the answer(s) I was looking for.


Sherrie


Brian Wakem wrote:
> Sherrie Laraurens wrote:
>
> > Hi all,
> >
> > I have a question relating to how search engines and (in fact
> > anything else that supports boolean queries) manage to do such
> > things so efficiently.
> >
> > my question involves a query were you would like to retrieve the all
> > the documents in your corpus that have the words "Cat" and "Bat" in
> > them and that they not only contain both words but that they must be
> > consecutive for example "Bat Cat"

>
>
> In which case it isn't a boolean query at all, you are looking for "Bat
> Cat", not Bat AND Cat.
>
>
> > I can think of a very crude way of doing this which involves hashing
> > every word in a document into a hash table and storing an index for
> > said document , then in the query stage to hash both words (hash
> > join) get the intersection vector of the resulting vectors from the
> > hashing process, then one by one examine each document from the
> > intersection vector find the word "Bat" and see if the word "Cat" is
> > the next word if it is place said document into the final result
> > vector and once finished p*** back to user.
> >
> > I believe this will work for AND and OR type queries but I can't
> > imagine systems like google or yahoo using such a CRUDE method nor
> > can i see them caching such results, just because the sheer amount
> > of combinations things they would have to be caching.
> >
> > Does anyone here have any idea on how these things are done? any
> > keywords I could use to google etc..

>
>
> I have written a boolean search engine myself, but didn't do anything as
> convoluted as you describe.
>
> It simply generates the SQL queries complete with ANDs ORs and NOTs based on
> the user input.
>
> You have to account for brackets and quoted phrases too of course.
>
> The biggest problem we have is explaining to our users that:
>
> A OR B AND C
>
> is the same as:
>
> A OR (B AND C)
>
> not:
>
> (A OR B) AND C
>
> as OR has lower precedence.
>
> Unfortunately this is beyond the comprehension of your average computer
> user.
>
>
> --
> Brian Wakem
> Email: http://homepage.ntlworld.com/b.wakem/myemail.png


   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Re: Boolean query algorithms
Old
  (#4)
Brian Wakem
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

Sherrie Laraurens wrote:

> Hi Brain,
>
> I'm not interested in the variations of logical predicates you
> presented, also my question relates to how said queries be them sql
> or something else are actually executed in a timely fashion.
>
> The phrase "I have written a boolean search engine myself" followed
> by "It simply generates the SQL queries" basically means you didn't
> write anything of importance just a DB front end and sql generator
> for an actual db which is doing all the dirty work for you, I guess
> anyone with a couple of hours on their hands and an SQL compatible db
> can say that they too have written a search engine of sorts, but alas
> this was not the answer(s) I was looking for.



Then your question relates to scaling and is nothing to do with boolean
searches at all.

Google use 'chunck servers'. The data is split into say 10,000 parts, each
part stored on 1 server. Each of those servers is then replicated by say
10 others. When you a query need answering it is sent to the relevant
chunk servers which process in parallel and then feed the results back
where they are combined.

For a single query this is slightly less than 10,000 times faster than
storing everything on a single machine with a large disk array.

For multiple parellel queries it is even faster as the query can be sent to
any of the 10 chunck servers that hold each chunk of data.

That would be just 1 datacentre, google then copy the whole lot over dozens
of datacentres (not a copy in reality as they all seem to hold different
data).

Note: The numbers I have quoted are guesses.


--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png
   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Re: Boolean query algorithms
Old
  (#5)
Paul
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

On 12 Jul 2006 02:38:20 -0700, "Sherrie Laraurens"
<EMAIL REMOVED> wrote:

>Hi Brain,


Some typo's can be quite amusing.
--

----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Re: Boolean query algorithms
Old
  (#6)
Sherrie Laraurens
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

Does that mean the query is broken up into 2 processes that are
delegated to unique "chunks" and executed in parallel say like so:

1. find all the documents that have the word bat in them -> v1
2. find all the documents that have the word cat in them -> v2

if thats the case how do they reduce the complexity of:

find the intersection between v1 and v2 where the words (text
patterns) "Bat" and "Cat" are consecutive (and please don't think by
consecutive I mean "Bat" word word word "Cat" or any other form -
just plain old simple consecutive)

Best case join complexity is O(nm) and from there you still have to
make sure that Bat and Cat are consecutive, which is another O(n)
scan.

I'm well aware of the breaking down and distribution of the problem
over large clusters etc, its just I'm unsure of the details of the
final steps mainly the join and the final filtering process. also
another problem would be partial match queries...


Sherrie




Brian Wakem wrote:
> Sherrie Laraurens wrote:
>
> > Hi Brain,
> >
> > I'm not interested in the variations of logical predicates you
> > presented, also my question relates to how said queries be them sql
> > or something else are actually executed in a timely fashion.
> >
> > The phrase "I have written a boolean search engine myself" followed
> > by "It simply generates the SQL queries" basically means you didn't
> > write anything of importance just a DB front end and sql generator
> > for an actual db which is doing all the dirty work for you, I guess
> > anyone with a couple of hours on their hands and an SQL compatible db
> > can say that they too have written a search engine of sorts, but alas
> > this was not the answer(s) I was looking for.

>
>
> Then your question relates to scaling and is nothing to do with boolean
> searches at all.
>
> Google use 'chunck servers'. The data is split into say 10,000 parts, each
> part stored on 1 server. Each of those servers is then replicated by say
> 10 others. When you a query need answering it is sent to the relevant
> chunk servers which process in parallel and then feed the results back
> where they are combined.
>
> For a single query this is slightly less than 10,000 times faster than
> storing everything on a single machine with a large disk array.
>
> For multiple parellel queries it is even faster as the query can be sent to
> any of the 10 chunck servers that hold each chunk of data.
>
> That would be just 1 datacentre, google then copy the whole lot over dozens
> of datacentres (not a copy in reality as they all seem to hold different
> data).
>
> Note: The numbers I have quoted are guesses.
>
>
> --
> Brian Wakem
> Email: http://homepage.ntlworld.com/b.wakem/myemail.png


   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Re: Boolean query algorithms
Old
  (#7)
Brian Wakem
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

Sherrie Laraurens wrote:

> Does that mean the query is broken up into 2 processes that are
> delegated to unique "chunks" and executed in parallel say like so:
>
> 1. find all the documents that have the word bat in them -> v1
> 2. find all the documents that have the word cat in them -> v2




I very much doubt it.

As per my first answer, the query would be built with ANDs ORs NOTs before
being sent to the chunk servers (or something similar if they don't use
SQL).



> if thats the case how do they reduce the complexity of:
>
> find the intersection between v1 and v2 where the words (text
> patterns) "Bat" and "Cat" are consecutive (and please don't think by
> consecutive I mean "Bat" word word word "Cat" or any other form -
> just plain old simple consecutive)



This depends on what we are actually looking for. If you are looking for
"Bat Cat" then none of this applies, it is a single string.

If you search for Bat AND Cat then, each chunk server will check how far
apart they are on the page and use this as part of the relevance ordering
(along with pagerank, words in anchor text etc). But this certainly does
not guarantee consecutiveness.



--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png
   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Re: Boolean query algorithms
Old
  (#8)
Sherrie Laraurens
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

I don't believe anyone would be stupid enough to index documents
based on composites such as bat and cat, there would litterally be a
combinatorial explosion of things that could be index,

"bat cat", "bat mat", "mat cat", "mat hat" "hat bat" etc... it will
never end, in other words a simple example of the number of indexes
for just the english language would (number of words in the
language)!

Brian I think-you for your input but I think we should let some other
people on this list have a go at explain or find an answer to my
question.


Sherrie


Brian Wakem wrote:
> Sherrie Laraurens wrote:
>
> > Does that mean the query is broken up into 2 processes that are
> > delegated to unique "chunks" and executed in parallel say like so:
> >
> > 1. find all the documents that have the word bat in them -> v1
> > 2. find all the documents that have the word cat in them -> v2

>
>
>
> I very much doubt it.
>
> As per my first answer, the query would be built with ANDs ORs NOTs before
> being sent to the chunk servers (or something similar if they don't use
> SQL).
>
>
>
> > if thats the case how do they reduce the complexity of:
> >
> > find the intersection between v1 and v2 where the words (text
> > patterns) "Bat" and "Cat" are consecutive (and please don't think by
> > consecutive I mean "Bat" word word word "Cat" or any other form -
> > just plain old simple consecutive)

>
>
> This depends on what we are actually looking for. If you are looking for
> "Bat Cat" then none of this applies, it is a single string.
>
> If you search for Bat AND Cat then, each chunk server will check how far
> apart they are on the page and use this as part of the relevance ordering
> (along with pagerank, words in anchor text etc). But this certainly does
> not guarantee consecutiveness.
>
>
>
> --
> Brian Wakem
> Email: http://homepage.ntlworld.com/b.wakem/myemail.png


   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Re: Boolean query algorithms
Old
  (#9)
Big Bill
Guest
 
Posts: n/a
Default Re: Boolean query algorithms - 05-09-2007, 04:28 PM

On 12 Jul 2006 04:22:19 -0700, "Sherrie Laraurens"
<EMAIL REMOVED> wrote:

>I don't believe anyone would be stupid enough to index documents
>based on composites such as bat and cat, there would litterally be a
>combinatorial explosion of things that could be index,
>
>"bat cat", "bat mat", "mat cat", "mat hat" "hat bat" etc... it will
>never end, in other words a simple example of the number of indexes
>for just the english language would (number of words in the
>language)!
>
>Brian I think-you for your input but I think we should let some other
>people on this list have a go at explain or find an answer to my
>question.
>
>
>Sherrie


Paging Roy! Anyone seen Schestowitz?

BB
--

http://www.kruse.co.uk/seo-services.htm
http://www.crystal-liaison.com/baby-gund/index.html
http://www.here-be-posters.co.uk/jul...et-posters.htm
   
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On





Contact Us - Forum Care Forums - Archive - Top