Thursday, January 17, 2008

Database People Hating on MapReduce

Update: Typical Programmer tackles the technical issues in Relational Database Experts Jump The MapReduce Shark. The culture clash is still what fascinates me.

David DeWitt writes in The Database Column that MapReduce is a major step backwards:
  • A giant step backward in the programming paradigm for large-scale data intensive applications
  • A sub-optimal implementation, in that it uses brute force instead of indexing
  • Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  • Missing most of the features that are routinely included in current DBMS
  • Incompatible with all of the tools DBMS users have come to depend on

    Listening to databasers and map reducers talk is like eavesdropping on your average family holiday mashup. Every holiday, people who have virtually nothing in common are thrown together because they incidentally share a little DNA or are married to the shared DNA. In desperation everyone gravitates to some shared enemy they can all confidently bash. But once that moment passes and awkward silence looms again, nothing is left but more drinking and tackling sensitive topics you just know will end badly.

    Database folks love their schemas, relational purity, and their Swiss Army knife indexes. You soon learn that map reduce is really just another form of index, and that indexes really can scale to any heights with just a little tweaking. Map reducers love their pure functional models, their self-healing, cluster-filled ecosystems, and the sheer joy of the semi-organized chaos of letting 10,000 CPUs simultaneously bloom.

    I for one stand firmly by the relish tray. Transactions have a place and so does structured data. That's why Google contributes heavily to MySQL. Yet I too like my map reduce engine and distributed file system combo platter. With map reduce I can implement any complex behavior over any data set, and with enough machines that work can be performed in a predictable amount of time. You aren't limited to set logic, SQL types, and tweaked indexes. That's pretty good stuff too.

    Much like a staunchly conservative, nail-crunching father and his too-soft, pansy-liberal son, these two camps will never understand each other. Every sign of beauty in one person's eyes is just another confirmation to the other side of impending senility. Why even try? Just hug in a manly way and agree to meet again next year.
Reader Comments (14)

    Slight typo, first paragraph after the bullet list, you say "ease dropping" when I think you mean "eavesdropping". :)

    Cheers - Callum (http://www.callum-macdonald.com/).

    December 31, 1999 | Unregistered Commenter chmac

    PS> Todd, whatever you've been smoking over the last few weeks that has led to this new philosophical, Art of War-quoting author, I want me some! :)

    December 31, 1999 | Unregistered Commenter chmac

    > You soon learn that really map reduce is just another form of an index and indexes really can scale to any heights with just a little tweaking.

    How so? Map reduce iterates over every data element, akin to a table scan. There's nothing sorted or indexed about it at all.

    After reading the article, I'm convinced the author has no idea what MapReduce is used for. The main purpose of a database is to find particular pieces of information quickly among a large data set. MapReduce is used to run distributed calculations on every piece of information in a data set. Indexes are largely useless if you need to hit every data element (unless it's a covering index, but I digress).
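
    To make that contrast concrete, here's a toy sketch in plain single-process Python (all data and names are invented for illustration): an index answers a lookup without touching most records, while a map reduce job scans everything but can run arbitrary code while doing it.

        from collections import defaultdict

        records = [("page1", "the quick brown fox"), ("page2", "the lazy dog")]

        # Database style: pay once to build an index, then lookups are cheap.
        index = defaultdict(list)
        for doc_id, text in records:
            for word in text.split():
                index[word].append(doc_id)
        print(index["lazy"])  # ['page2'] -- no scan needed

        # MapReduce style: every job is a full scan, but map and reduce are
        # arbitrary code and shard trivially across machines.
        def map_fn(doc_id, text):
            for word in text.split():
                yield word, 1

        def reduce_fn(word, counts):
            return word, sum(counts)

        groups = defaultdict(list)
        for doc_id, text in records:  # touches every record
            for word, count in map_fn(doc_id, text):
                groups[word].append(count)
        print(dict(reduce_fn(w, c) for w, c in groups.items()))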

    I also doubt that Google runs a MapReduce job when you perform a search query... In fact, I'm positive they have indexes that would make most of our heads spin ;)

    Two completely different problems. Or, put another way, when all you have is a hammer, everything looks like a nail.

    Sean

    December 31, 1999 | Unregistered Commenter Sean

    > Slight typo

    Ack, thanks.

    > Todd, whatever you've been smoking

    Just sniffing blue sky and sipping organically distilled rain water. :-)

    > How so?

    This take stems from an actual overheard conversation between two entrenched advocates from both sides. Really fascinating. Databases with columnar and adaptive indexes were capable of great things, and map reduce didn't need no stinkin' indexes to deliver. So I think each side "understands" the other; they just don't understand each other.

    December 31, 1999 | Unregistered Commenter Todd Hoff

    >I also doubt that Google runs a MapReduce job when you perform a search query

    Huh? How else do you think it's done? I just did a search for "The quick brown fox jumps over the lazy dog". Do you really think that the above search query is hitting a database?

    The amount of data that is being searched is HUGE. The number of requests per second is HUGE. The response time is TINY.

    How do you index every single word on a page using database indexes? Is this scalable to the number of requests that Google handles? What about the response times it provides? (0.14 secs for me for the above search term)

    I'll go back over to the relish tray.. :)

    December 31, 1999 | Unregistered Commenter vlod

    I don't think that MapReduce is invoked for the query. I don't work for Google, so obviously this is an educated guess.

    Reviewing what technology they have opened to the public, I would imagine that MapReduce is used to generate the index. Once the index is created, a search is really a query against the index, combining the results for relevance and presentation.
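
    A rough sketch of what such an index-building job might look like (a single-process Python toy; the real system shards both the map and reduce phases across thousands of machines, and every name here is invented):

        from collections import defaultdict

        def map_fn(doc_id, text):
            # Emit (word, doc_id) for every distinct word on the page.
            for word in set(text.lower().split()):
                yield word, doc_id

        def reduce_fn(word, doc_ids):
            # The finished posting list for this word, sorted for fast merging.
            return word, sorted(doc_ids)

        crawl = {"page1": "The quick brown fox", "page2": "The lazy dog"}

        # The framework's "shuffle" step: group mapper output by key.
        shuffled = defaultdict(list)
        for doc_id, text in crawl.items():
            for word, d in map_fn(doc_id, text):
                shuffled[word].append(d)

        inverted_index = dict(reduce_fn(w, ids) for w, ids in shuffled.items())
        print(inverted_index.get("lazy"))  # ['page2'] -- serving needs no MapReduce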

    December 31, 1999 | Unregistered Commenter Noah Campbell

    >>I also doubt that Google runs a MapReduce job when you perform a search query
    > Huh? How else do you think its done?

    They use a specialized, massively scalable, massively parallelized database called BigTable.

    > I just did a search for "The quick brown fox jumps over the lazy dog". Do you really think that the above search query is hitting a database?

    Yes.

    > How do you index every single word on a page using a database using indexes?

    By using a specialized, massively scalable, massively parallelized database.

    > Is this scalable to the number of requests that google handles?

    Apparently it is.

    > What about response times it provides?

    Achievable if a) the search is parallelized, and b) extensive memory caching (and other optimizations) are used. Think about it: after stripping out irrelevant words, each word in your query gets sent to a different DB machine. Each machine looks its word up in its index and gets back a list of pages that contain that word. In most cases the lookup can be served from a memory cache. When it can't, they've optimized it so that it takes only one disk seek, which is the next fastest thing. They then union the lists of pages together and serve you up your results.
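
    A toy model of that fan-out (Python again; every name is invented, and a real serving system would use RPCs, caches, and far smarter ranking):

        from concurrent.futures import ThreadPoolExecutor

        # Each "shard" owns the posting lists for some of the words.
        shards = [
            {"quick": ["page1", "page9"]},
            {"lazy": ["page2", "page9"]},
        ]

        def lookup(word):
            # Stand-in for an RPC to the machine owning this word's posting
            # list; ideally a memory-cache hit, otherwise one disk seek.
            for shard in shards:
                if word in shard:
                    return shard[word]
            return []

        def search(query):
            words = query.lower().split()  # irrelevant words assumed stripped
            with ThreadPoolExecutor() as pool:  # one lookup per word, in parallel
                posting_lists = list(pool.map(lookup, words))
            pages = set()
            for posting_list in posting_lists:  # union the per-word lists
                pages.update(posting_list)
            return sorted(pages)

        print(search("quick lazy"))  # ['page1', 'page2', 'page9']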

    Read the BigTable paper at http://labs.google.com/papers/bigtable.html if you want to learn more. (And be amazed!)

    December 31, 1999 | Unregistered Commenter DAR

    The previous commenter is correct. MapReduce is used to gather the data for indexing and to build the index. BigTable is used to store that data and is what is searched against.

    I haven't met any mapreduce people, but I do know DBAs and they really don't like their little kingdoms threatened in any way. I once made the (humorous) mistake of telling a DBA that a database is just a place to store data. You could see the sweat beading up, the veins in his head throbbing as he was trying to control his outrage.

    Heaven forbid if there is another solution available. Besides, no RDBMS has come close to being as massively scalable at processing as a MapReduce implementation. I love the "tools" argument. Do you really think some Google suit wants to run Crystal Reports against the kind of data their index is built from? It isn't a business application, after all.

    December 31, 1999 | Unregistered Commenter Robert

    Actually, I am pretty sure that Google's index data is not stored in BigTable. BigTable is used for a variety of other tasks related to search (as well as other apps), such as storing your search history for personalization purposes. BigTable does plug very nicely into Google's architecture for providing versioned, lightweight database functionality. Also check out Sawzall. Building an index that provides near-instant lookup times is a very different problem. I can speak from experience using Hadoop and HBase to build search technology.

    MapReduce is used to process massive amounts of data. I have used it to process data in relational databases although I don't let the job talk to the database directly.

    Interesting discussion though.

    December 31, 1999 | Unregistered Commenter Steve Severance

    No inside knowledge here, but my understanding was that BigTable was a storage mechanism and MapReduce was a distributed calculation infrastructure. One of the big issues in getting the M/R and RDBMS folks talking is that most of the M/R folks come from or work in the NLP field, where answers are inherently subjective and fuzzy matching algorithms are king. I find similar frustrations getting strongly typed and dynamically typed language people to see that each has its sweet spot.

    I've found "Managing Gigabytes" (Witten) and "Foundations of Statistical Natural Language Processing" (Manning/Schütze) to be the best inoculation for RDBMS folks trying to think about NLP and text search problems.

    BTW - One criticism I would have of M/R is that it seems horribly inefficient in terms of computation. Their goal was probably development agility for parallel computation, so that's not a big ding. I worry about inter-node bandwidth, though. When working on a text search engine back in '93, I "cleaned up" some code in a way that pushed one data map from L1 cache to main memory and dropped indexing speed by 40%.

    Cheers,
    Clark

    December 31, 1999 | Unregistered Commenter Clark Breyman

    I think this document is comparing things that are not comparable. They are talking about MapReduce as if it were a distributed database, but that's completely wrong. Hadoop is a distributed computing platform, not a distributed database prepared for OLAP. In some situations where scalability is important, Hadoop could be used instead of a database, but those cases are very specific.

    They said that distributed databases were invented a long time ago. Maybe that is true, but it seems they did not succeed. Otherwise, why isn't there a distributed database that scales properly now?

    December 31, 1999 | Unregistered Commenter Iván de Prado

    The spider populates BigTable once a month.
    MapReduce searches BigTable for your search.

    Map takes two params.
    In this case (allTheDocs, yourSearchWords).

    Reduce builds and sorts the viewable results.

    December 31, 1999 | Unregistered Commenter Will S

    I have built numerous search engines from scratch, and while I don't have inside knowledge, I can tell you based on my experience that my guess is that MapReduce is used to compile a search index, and then they have a very specialized hybrid solution for storing and accessing the index. BigTable is probably just used for metadata, possibly in the retrieval of titles/abstracts, but probably just for the crawlers and other systems getting discrete pieces of data.

    It would be nice if Hadoop could evolve into a solution that can index data as well as churn through all of it. Right now the most popular use of Hadoop seems to be processing log files, something that doesn't need indexing. But I see a great need to use Hadoop in other fields, such as the financial world, where maybe we need to process all the data from a particular year, or all data from the US, etc. That kind of job could greatly benefit from an index, but it's also a ton of data that requires massive parallelization. I am working on a proof-of-concept project that wants to take about 20 billion rows of data and produce analytical results in less than 10 seconds. Not sure Hadoop by itself is going to do the trick (unless we are talking about an ungodly number of nodes).

    I understand why DB people hate MapReduce. I hate it too, but I also hate DB's. I'm a search engine guy.

    December 31, 1999 | Unregistered Commenter dumbfounder

    @vlod,

    MapReduce is a batch system. Consider that MR jobs take more than a minute just to start up, and anywhere from seconds to weeks to run beyond that, depending on the amount of data crunched. Google returns results to you by the dozens in about 0.2 seconds. This immediately proves that a Google query cannot be a MapReduce job.

    Disclaimer: I used to work for Google in an unrelated capacity, but I understood that they used word-occurrence indexing even before I joined the company.

    October 22, 2010 | Unregistered Commenter Monica Anderson
