Friday, Aug 10, 2007
How do we make a large real-time search engine?

We're building a content-oriented website that will see heavy public traffic, and we need a search engine to index the content (stored in a database, most likely MySQL InnoDB or Oracle) and run queries against those indexes.
The solution we found is to implement a separate service that indexes the contents of the database at regular intervals. However, this is a complex and suboptimal solution, since we would like content to be indexed and searchable in real time.
Could you point me to some examples or articles I could review to design a solution for this context?
Reader Comments (2)
It seems that way anyway. You work so hard on getting your site up and working and then there's this giant search problem to solve that seems as big as everything you've already done. Unfortunately, I don't think there's a way out of that pain for large dynamic sites. :-(
Keeping search away from your main database is the way to go IMHO. You want your database doing the work only it can do: transactions. Loading it with searches and blowing its caches might waste your precious database resources.
You could have the indexer work off read-only slaves. That would isolate the load, but it wouldn't necessarily be real-time.
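One way to picture an indexer working off a read-only slave is a polling loop that only picks up rows changed since its last pass. The sketch below is a rough illustration, not a recommended design: the `contents` table, its `updated_at` column, and the function names are all assumptions made for the example, and SQLite stands in for the MySQL or Oracle slave only so the code runs on its own.

```python
import sqlite3


def fetch_changes(replica, watermark):
    """Return rows modified after `watermark`, plus the new watermark.

    `replica` is a connection to the read-only slave (SQLite here as a
    stand-in). The `contents` table and `updated_at` column are hypothetical.
    Caveat: with a strict `>` comparison, a row committed later with the same
    timestamp as the watermark would be missed.
    """
    rows = replica.execute(
        "SELECT id, body, updated_at FROM contents "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Rows are ordered by updated_at, so the last row carries the newest value.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


def index_batch(index, rows):
    """Re-index each changed document in a toy dict-based index."""
    for doc_id, body, _updated_at in rows:
        index[doc_id] = body.lower().split()
```

A scheduler would call `fetch_changes` every few seconds and pass the rows to `index_batch`. How fresh the search results are then depends on both the polling interval and the replication lag, which is why this approach is not necessarily real-time.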
Here's a good discussion of using Lucene (http://lucene.apache.org) for real-time updates at http://www.gossamer-threads.com/lists/lucene/java-user/51517. There seem to be a lot of interesting issues (batching, garbage collection) around making Lucene update indexes quickly, but it seems possible.
I also wonder if the Google Custom Search engine at http://google.com/coop/cse/ might be an option? If you kept a parallel tree of documents for Google to search, then Google would probably search faster than your traditional options. There's also an API available when using the paid versions. What I don't know is how fast they would respond to changes. They have this linked CSE product now and that might do the trick. It's worth a look anyway.
By real-time search, I take it you mean that as soon as a user creates a record, read requests for it start arriving as well.
Not sure if it is possible, but you could try this as a POC:
As soon as your application receives a request to create a new document, assign a unique GUID to the document and start two threads. One thread puts the document in storage. The other splits the document into tokens and builds an inverted index in the application's in-memory cache (which helps if you use sticky sessions) and also puts it into a distributed cache (which helps with non-sticky sessions). Meanwhile, your backend indexing process will still be triggered at some point.
When a new read request comes in, you can again fire two threads, one to look in your cache and one to query your primary index storage, and then return the merged result from your application.
This adds CPU overhead, because every read or write uses two threads, but it might help meet the need for immediately searchable data.
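The POC above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RecentIndex`, `create_document`, and `search` are hypothetical names, a plain dict stands in for document storage, `primary_search` is any callable that queries the primary index, and the distributed cache is left out entirely.

```python
import re
import threading
import uuid
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class RecentIndex:
    """In-memory inverted index for documents the backend has not indexed yet."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of document ids
        self._lock = threading.Lock()

    def add(self, doc_id, text):
        tokens = set(re.findall(r"\w+", text.lower()))
        with self._lock:
            for token in tokens:
                self._postings[token].add(doc_id)

    def search(self, term):
        with self._lock:
            # Copy the set so callers never see later mutations.
            return set(self._postings.get(term.lower(), ()))


def create_document(text, storage, recent, pool):
    """Store and index a new document in parallel, then return its GUID."""
    doc_id = str(uuid.uuid4())
    store_job = pool.submit(storage.__setitem__, doc_id, text)
    index_job = pool.submit(recent.add, doc_id, text)
    store_job.result()  # wait for both, and surface any exception
    index_job.result()
    return doc_id


def search(term, recent, primary_search, pool):
    """Query the in-memory index and the primary index in parallel, then merge."""
    recent_job = pool.submit(recent.search, term)
    primary_job = pool.submit(primary_search, term)
    return recent_job.result() | set(primary_job.result())
```

With `pool = ThreadPoolExecutor(max_workers=2)`, a document created through `create_document` is returned by `search` straight away, before the backend indexer has run. A fuller version would also need to evict entries from `RecentIndex` once the primary index has caught up, otherwise the cache grows without bound.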