Wednesday, May 28, 2008

Job queue and search engine

Hi,

I want to implement a search engine with Lucene.
To be scalable, I would like to execute search jobs asynchronously (with a job queuing system).

But I don't know if it is a good design... Why?

Search results can be large! (e.g. 100+ pages with 25 documents per page)
With an asynchronous system, I need to store the results of each search job.
I can set a short expiration time (~5 min) on each search result, but it's still a lot to store.
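To make the idea concrete, here is a rough sketch of the worker side of the asynchronous design. The in-memory queue, the map standing in for the result store, and the field names are only placeholders for whatever queue and cache would really be used:

```java
// Sketch of an asynchronous search worker: it takes search jobs from a queue,
// runs the Lucene query once, and keeps the full result list under the job id
// so the front end can page through it. In a real system the map would be a
// shared store with a ~5 minute expiration, not a field on the worker.
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SearchWorker implements Runnable {

    /** A queued search request: an id to look the results up by, and the parsed query. */
    public static class SearchJob {
        final String id;
        final Query query;
        final int maxResults;
        public SearchJob(String id, Query query, int maxResults) {
            this.id = id; this.query = query; this.maxResults = maxResults;
        }
    }

    private final IndexSearcher searcher;                        // opened once on the index
    private final BlockingQueue<SearchJob> jobs;                 // stands in for the real job queue
    private final ConcurrentMap<String, List<String[]>> results  // stands in for the result store
            = new ConcurrentHashMap<>();

    public SearchWorker(IndexSearcher searcher, BlockingQueue<SearchJob> jobs) {
        this.searcher = searcher;
        this.jobs = jobs;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                SearchJob job = jobs.take();                     // blocks until a job arrives
                ScoreDoc[] hits = searcher.search(job.query, job.maxResults).scoreDocs;

                // Collect only the stored id and summary fields for each hit.
                List<String[]> rows = new ArrayList<>();
                for (ScoreDoc hit : hits) {
                    Document doc = searcher.doc(hit.doc);
                    rows.add(new String[] { doc.get("id"), doc.get("summary") });
                }
                results.put(job.id, rows);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // A real worker would record the failure so the front end can report it.
            }
        }
    }
}
```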

What do you think about it?
Which design would you use for that?


Thanks
Mat


Reader Comments (3)

If you have 25 docs per page, the maximum result set per page is 25. Simply add paging (well, not really paging, just next/prev links). Also, shouldn't the results be simply IDs with a very short, human-readable summary rather than complete documents?
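For example, the index itself can be built so that only an ID and a short summary are ever stored; the full text stays searchable but is never returned. This is only a sketch against a recent Lucene API (the API current when this was posted differs) with illustrative field and path names:

```java
// Sketch of indexing documents so that search results stay small: the id and a
// short summary are stored and can be returned with hits, while the full body
// is indexed for searching but not stored in the index.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class LightweightIndexer {

    public static void addDocument(IndexWriter writer, String id, String summary, String body)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));          // stored, returned with hits
        doc.add(new TextField("summary", summary, Field.Store.YES));  // stored short description
        doc.add(new TextField("body", body, Field.Store.NO));         // searchable, never stored
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/example-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            addDocument(writer, "42", "A very short human-readable summary", "the full document text ...");
        }
    }
}
```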

December 31, 1999 | Unregistered Commenter Anonymous

Thanks for your suggestion!
Yes, each document is just a brief description and an ID, so a single document is very light.

With my initial solution: 1 search job == all results.
The search job is executed once, and the paging system shows only part of the results.
The problem is: 25 documents * X pages * Y users may be huge.

With your solution: 1 search job == results for 1 page.
One search job is executed for each page view.
In terms of storage, your solution is good and lightweight.
But the risk is that navigation between pages may be slow, because it generates more jobs.

In terms of CPU, according to the Lucene FAQ:
* How do I implement paging, i.e. showing results from 1-10, 11-20, etc.?
-> Just re-execute the search and ignore the hits you don't want to show. As people usually look only at the first results, this approach is usually fast enough.

Each page view will send the same job to the Lucene engine.
It may be slower for the user experience and more CPU-intensive for the search workers.
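For reference, the FAQ approach boils down to something like this (again a sketch against a recent Lucene API, with illustrative field names): the query is re-executed for every page view and the hits before the requested page are simply skipped.

```java
// Sketch of Lucene-FAQ style paging: re-run the query for each page view,
// fetch hits up to the end of the requested page, and ignore the earlier ones.
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

import java.util.ArrayList;
import java.util.List;

public class ReExecutePaging {

    /** Returns page `page` (0-based) as [id, summary] pairs. */
    public static List<String[]> page(IndexSearcher searcher, Query query,
                                      int page, int pageSize) throws Exception {
        // Ask for everything up to the end of the requested page ...
        ScoreDoc[] hits = searcher.search(query, (page + 1) * pageSize).scoreDocs;

        // ... then keep only the hits that belong to that page.
        List<String[]> rows = new ArrayList<>();
        for (int i = page * pageSize; i < hits.length; i++) {
            Document doc = searcher.doc(hits[i].doc);
            rows.add(new String[] { doc.get("id"), doc.get("summary") });
        }
        return rows;
    }
}
```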

I will try an intermediate approach:

1 search job will generate results for 5 pages.
Each search result will be cached (memcached).
When the user asks for page 4, it will generate a new background search job for pages 5-10 (if they are not already in the cache, of course).

In terms of CPU and memory usage, it will be lighter than the two previous solutions.
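A rough sketch of the cache side of that idea, assuming a memcached client such as spymemcached; the key format, block size, expiration time and job queue are all illustrative choices rather than a finished design:

```java
// Sketch of block-wise caching: results are cached in blocks of 5 pages, and
// reading near the end of a block enqueues a background job for the next one.
import net.spy.memcached.MemcachedClient;

import java.util.List;
import java.util.concurrent.BlockingQueue;

public class BlockCache {
    static final int PAGES_PER_BLOCK = 5;
    static final int TTL_SECONDS = 300;                      // ~5 minutes

    private final MemcachedClient memcache;
    private final BlockingQueue<String> jobQueue;            // stands in for the real job queue

    public BlockCache(MemcachedClient memcache, BlockingQueue<String> jobQueue) {
        this.memcache = memcache;
        this.jobQueue = jobQueue;
    }

    /** Returns the cached block containing `page` (0-based), or null if it is not cached yet. */
    @SuppressWarnings("unchecked")
    public List<String[]> getBlockFor(String queryKey, int page) {
        int block = page / PAGES_PER_BLOCK;
        List<String[]> rows = (List<String[]>) memcache.get(blockKey(queryKey, block));

        // Near the end of the current block? Prefetch the next block in the background,
        // unless it is already in the cache.
        if (page % PAGES_PER_BLOCK == PAGES_PER_BLOCK - 2
                && memcache.get(blockKey(queryKey, block + 1)) == null) {
            jobQueue.offer(blockKey(queryKey, block + 1));    // a worker fills the cache later
        }
        return rows;
    }

    /** Called by a search worker after it has computed a block of pages. */
    public void storeBlock(String queryKey, int block, List<String[]> rows) {
        memcache.set(blockKey(queryKey, block), TTL_SECONDS, rows);
    }

    private static String blockKey(String queryKey, int block) {
        return "search:" + queryKey + ":block:" + block;
    }
}
```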
What do you think about it? Do you see any other possible improvements?
Do you think I'm paranoid in terms of storage? ;)

December 31, 1999 | Unregistered Commenter mat

SERPs have to be fast in order to deliver effective results; otherwise the engine will lose its priority.

December 31, 1999 | Unregistered Commenter farhaj
