Entries in search (7)

Monday
Oct 11, 2021

Scaling indexing and search - Algolia New Search Architecture Part 2

What would a totally new search engine architecture look like? Who better than Julien Lemoine, Co-founder & CTO of Algolia, to describe what the future of search will look like? This is the second article in a series. Here's Part 1.

Search engines need to support fast scaling for both Read and Write operations. Rapid scaling is essential in most use cases: for example, adding a vendor in a marketplace generates a spike of indexing operations (Write), while a marketing campaign generates a spike of queries (Read). Both Read and Write operations scale, but rarely at exactly the same moment, so the architecture needs to handle all of these situations efficiently as the balance between them varies over time.

Until now, search engines have scaled with Read and Write operations colocated on the same VMs. This scaling method brings drawbacks, such as Write operations unnecessarily hurting Read performance and a significant amount of CPU being duplicated at indexing time. This article explains those drawbacks and introduces a new way to scale more quickly and efficiently by splitting Read and Write operations.
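The rest of the post works through that split. As a very rough sketch of the idea only (not Algolia's actual design; every name below is invented for illustration), a write tier can build immutable index segments and publish them to shared storage, while stateless search workers only read those segments, so each tier scales on its own:

```python
# Hypothetical sketch of a read/write split: an indexer tier builds immutable
# segments and publishes them to shared storage; stateless searchers only read.
# Names (SharedStorage, Indexer, Searcher) are illustrative, not Algolia's API.
from collections import defaultdict


class SharedStorage:
    """Stands in for object storage holding published index segments."""
    def __init__(self):
        self.segments = []          # list of immutable {term: [doc_ids]} maps

    def publish(self, segment):
        self.segments.append(segment)


class Indexer:
    """Write tier: batches documents into a segment, then publishes it."""
    def __init__(self, storage):
        self.storage = storage
        self.buffer = []

    def add(self, doc_id, text):
        self.buffer.append((doc_id, text))

    def flush(self):
        segment = defaultdict(list)
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                segment[term].append(doc_id)
        self.storage.publish(dict(segment))
        self.buffer = []


class Searcher:
    """Read tier: stateless, answers queries from whatever segments exist."""
    def __init__(self, storage):
        self.storage = storage

    def search(self, term):
        hits = []
        for segment in self.storage.segments:   # scale Reads by adding Searchers
            hits.extend(segment.get(term.lower(), []))
        return hits


storage = SharedStorage()
indexer = Indexer(storage)
indexer.add(1, "marketplace vendor onboarding")
indexer.add(2, "marketing campaign traffic")
indexer.flush()
print(Searcher(storage).search("vendor"))   # -> [1]
```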

1. Anatomy of an index


Monday
Aug 02, 2021

Evolution of search engines architecture - Algolia New Search Architecture Part 1

What would a totally new search engine architecture look like? Who better than Julien Lemoine, Co-founder & CTO of Algolia, to describe what the future of search will look like? This is the first article in a series.


Search engines, and more generally information retrieval systems, play a central role in almost all of today's technical stacks. Information retrieval dates back to the beginning of computer science, and research accelerated in the early 90s with the introduction of the Text REtrieval Conference (TREC). After more than 30 years of evolution since TREC, search engines continue to grow and evolve, leading to new challenges.

In this article, we look at some key milestones in the evolution of search engine architecture and describe the challenges those architectures face today. As you'll see, we grouped the engines into four architecture categories. This is a simplification, as in reality there are many different engines with various mixes of architectures, but it lets us focus on the most important characteristics of each architecture.

1. The Inverted Index —  the early days of search
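The series picks up with the inverted index itself. For readers who haven't built one, here is a minimal sketch of the structure in Python (a toy illustration, not the article's implementation): each term maps to the documents that contain it, and a query intersects those posting lists.

```python
# Toy inverted index: maps each term to the set of documents containing it.
# Purely illustrative; real engines add tokenization, compression, and ranking.
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs are rare",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Return IDs of documents containing all query terms."""
    result = None
    for term in terms:
        postings = index.get(term.lower(), set())
        result = postings if result is None else result & postings
    return sorted(result or [])

print(search("quick", "brown"))   # -> [1, 3]
```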


Thursday
Nov 26, 2009

Kngine Snippet Search New Indexing Technology

While Kngine has just announced some improvements and new features, I would like to take you on a small trip through the Snippet Search research project at Kngine.


Friday
May 01, 2009

FastBit: An Efficient Compressed Bitmap Index Technology

Data mining and fast queries are always in that bin of hard-to-do things where doing something smarter can yield big results. Bloom Filters are one such do-it-smarter strategy, and compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric: our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database, FastBit outran MySQL by 10 to 1,000 times.

FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure that is efficient for on-line analytical processing (OLAP), and it uses compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing methods, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries, whereas other optimal methods cannot.
It's not all just map-reduce and add more servers until your attic is full.
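As a rough illustration of why bitmap indices combine so cheaply for multi-dimensional queries, the sketch below builds one bitmap per (column, value) pair and ANDs them to answer a two-column predicate. It is deliberately uncompressed and uses made-up data; FastBit's real advantage comes from its Word-Aligned Hybrid (WAH) compression, which is not reproduced here.

```python
# Toy bitmap index: one bit per row for each (column, value) pair.
# Multi-dimensional predicates reduce to bitwise AND/OR over the bitmaps.
# Uncompressed for clarity; FastBit's WAH-compressed bitmaps are the real trick.
rows = [
    {"sender": "ken", "year": 2001},
    {"sender": "jeff", "year": 2001},
    {"sender": "ken", "year": 2000},
    {"sender": "ken", "year": 2001},
]

def build_bitmap(column, value):
    """Bitmap with bit i set when rows[i][column] == value."""
    bits = 0
    for i, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << i
    return bits

# WHERE sender = 'ken' AND year = 2001 becomes a single AND of two bitmaps.
matches = build_bitmap("sender", "ken") & build_bitmap("year", 2001)
print([i for i in range(len(rows)) if matches >> i & 1])   # -> [0, 3]
```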

Related Articles

  • FastBit: Digging through databases faster. An excellent description of how FastBit works, especially compared to b-trees.


Sunday
Jun 08, 2008

Search fast in million rows

I have a table. This table has many columns, but search is performed on one column, and the table can have more than a million rows. The data in this column is something like "funny, new york, hollywood". A user can search with parameters such as "funny hollywood". I need to take these two words and then check whether that column contains them and how many times. It is not possible to index here. If the search returns, say, 1200 results, then I can't determine the number of results without comparing each and every row; I need to compare every row. This query is very frequent. How can I approach this problem? What type of architecture and tools would be helpful? I know that this can be accomplished with a distributed system, but how can I build such a system? I also see on this website that LinkedIn uses Lucene for search. Is Lucene helpful in my case? My table also has lots of insertions, though updates are not very frequent.
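A term-level inverted index is the usual answer to this shape of query, and it is the structure Lucene maintains under the hood. The toy Python sketch below (illustrative only; it is not Lucene and not tuned for a million rows) shows the idea of mapping each word to per-row occurrence counts so a query like "funny hollywood" never scans every row.

```python
# Toy term-frequency inverted index: maps each word to {row_id: count}, so a
# multi-word query is answered by intersecting posting lists instead of
# scanning the whole table. Illustrative only; a production setup would use
# Lucene/Solr or similar rather than this sketch.
from collections import defaultdict, Counter

index = defaultdict(Counter)     # term -> Counter({row_id: occurrences})

def add_row(row_id, text):
    for term in text.lower().replace(",", " ").split():
        index[term][row_id] += 1

def search(query):
    """Rows containing every query term, with total occurrence counts."""
    terms = query.lower().split()
    row_ids = set(index[terms[0]])
    for term in terms[1:]:
        row_ids &= set(index[term])
    return {r: sum(index[t][r] for t in terms) for r in row_ids}

add_row(1, "funny,new york,hollywood")
add_row(2, "funny,funny,hollywood gossip")
add_row(3, "new york news")
print(search("funny hollywood"))   # row 1 -> 2 hits, row 2 -> 3 hits
```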


Wednesday
May 28, 2008

Job queue and search engine

Hi, I want to implement a search engine with Lucene. To be scalable, I would like to execute search jobs asynchronously (with a job queuing system), but I don't know if it is a good design... Why? Search results can be large! (e.g., 100+ pages with 25 documents per page). With an asynchronous system, I need to store the results for each search job. I can set a short expiration time (~5 min) for each search result, but it's still large. What do you think about it? Which design would you use for that? Thanks, Mat
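The design described in the question can be sketched in a few lines: a submitted query goes on a queue, a worker runs the search and stores the hits under a job id with a short expiration, and the client fetches one page at a time. This is only a toy illustration of the question's own proposal, with invented names and an assumed 5-minute TTL, not a recommendation.

```python
# Sketch of the asynchronous design described above: queue a search job, have a
# worker store the result under a job id with a short expiration, and let the
# client poll for pages. All names here are invented for illustration.
import time
import uuid
from queue import Queue

RESULT_TTL = 5 * 60                      # ~5 minute expiration, as in the question

jobs = Queue()                           # pending search jobs
results = {}                             # job_id -> (expires_at, hits)

def submit(query):
    job_id = str(uuid.uuid4())
    jobs.put((job_id, query))
    return job_id

def worker(run_search):
    """Drain the queue, run each search, keep results only until they expire."""
    while not jobs.empty():
        job_id, query = jobs.get()
        results[job_id] = (time.time() + RESULT_TTL, run_search(query))

def fetch(job_id, page, per_page=25):
    """Return one page of a finished job's results, or None if missing/expired."""
    entry = results.get(job_id)
    if entry is None or entry[0] < time.time():
        results.pop(job_id, None)
        return None
    return entry[1][page * per_page:(page + 1) * per_page]

job = submit("scalability")
worker(lambda q: [f"doc-{i}" for i in range(2500)])   # stand-in for the Lucene search
print(fetch(job, page=0)[:3])                         # -> ['doc-0', 'doc-1', 'doc-2']
```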


Sunday
Feb 24, 2008

Yandex Architecture

Update: Anatomy of a crash in a new part of Yandex written in Django. Writing to a magic session variable caused an unexpected write into an InnoDB database on every request. Writes took 6-7 seconds because of index rebuilding. Lots of useful details on the sizing of their system, what went wrong, and how they fixed it.

Yandex is a Russian search engine with 3.5 billion pages in their search index. We only know a few fun facts about how they do things, nothing at a detailed architecture level. Hopefully we'll learn more later, but I thought it would still be interesting. From Allen Stern's interview with Yandex's CTO Ilya Segalovich, we learn:

  • 3.5 billion pages in the search index.
  • Over several thousand servers.
  • 35 million searches a day.
  • Several data centers around Russia.
  • Two-layer architecture.
  • The database is split in pieces and when a search is requested, it pulls the bits from the different database servers and brings them together for the user (a generic scatter-gather sketch follows this list).
  • Languages used: C++, Perl, some Java.
  • FreeBSD is used as their server OS.
  • $72 million in revenue in 2006.
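That "split in pieces" bullet is the scatter-gather pattern. The sketch below shows its shape with made-up data (a generic illustration, not Yandex's code): the query fans out to every shard holding a piece of the index, and the partial hits are merged before being returned.

```python
# Generic scatter-gather sketch (not Yandex's code): send the query to every
# shard, collect the partial (doc_id, score) hits, and merge them for the user.
shards = [
    {"yandex": [(1, 0.9)], "search": [(1, 0.4), (7, 0.2)]},   # shard 0 postings
    {"search": [(12, 0.8)], "russia": [(15, 0.6)]},           # shard 1 postings
]

def query_shard(shard, term):
    """One shard returns its own (doc_id, score) hits for the term."""
    return shard.get(term, [])

def scatter_gather(term, top_k=3):
    hits = []
    for shard in shards:                 # in production this fan-out is a parallel RPC
        hits.extend(query_shard(shard, term))
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]

print(scatter_gather("search"))   # -> [(12, 0.8), (1, 0.4), (7, 0.2)]
```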
