Sunday
Jun082008
Search fast in million rows

I have a table .This table has many columns but search performed based on 1 columns ,this table can have more than million rows.
The data in these columns is something like funny,new york,hollywood
User can search with parameters as funny hollywood .I need to take this 2 words and then search on column whether that column contain this words and how many times .It is not possible to index here .If the results return say 1200 results then without comparing each and every column i can't determine no of results.I need to compare for each and every column.This query is very frequent .How can i approach for this problem.What type of architecture,tools is helpful.
I just know that this can be accomplished with distributed system but how can i make this system. I also see in this website that LinkedIn uses Lucene for search .Is Lucene is helpful in my case.My table has also lots of insertion ,however updation in not very frequent.
The data in these columns is something like funny,new york,hollywood
User can search with parameters as funny hollywood .I need to take this 2 words and then search on column whether that column contain this words and how many times .It is not possible to index here .If the results return say 1200 results then without comparing each and every column i can't determine no of results.I need to compare for each and every column.This query is very frequent .How can i approach for this problem.What type of architecture,tools is helpful.
I just know that this can be accomplished with distributed system but how can i make this system. I also see in this website that LinkedIn uses Lucene for search .Is Lucene is helpful in my case.My table has also lots of insertion ,however updation in not very frequent.
Reader Comments (6)
I don't understand why you cannot use an index. To be blunt, without the use of indexes you cannot search through millions of rows quickly. How do you expect a database engine to do that unless everything is laid out in a structured logical order i.e. accessed using indexes?
If funny or hollywood is a category, then you would normally represent these as categoryid's from a related lookup table. Then you would search the database with the where predicate based on a selected categoryid. With a clustered index on this column this would normally be very fast. In Sql Server 2005, I have a huge indexed view with a similar amount of columns and perform such queries in under a second. Without the use of indexes on the source tables and the view however, the query time can go up to 10 seconds! You cannot query millions of rows without indexes and expect high performance, especially if you have anything other than narrow tables containing only ints.
Yes, this would be a good application of an inverted index like Lucene.
Thanks for your reply
In column ,data is like "funny ;hollywood ;fast ;youtube;".
For every column i need to compare each and every word contained in that column .If i use substring kind of function then in that case database don't use index.I want to store tags in column and use it for searching.
you can use "innodb" for mysql or oracle to billion rows but you need some information to use or download from web site innodb
Is there any other solution like Lucene or Hadoop.I heard Lucene is very good in searching but not sure about its scalability.Is we can use Hadoop for online search or it is only suitable for indexing our data offline.
Check out Sphinx search. I think if you can find a way of pushing your keyword values from that column into attribute fields you'll be able to get it to do what you want.
http://www.sphinxsearch.com/