Wednesday, July 27, 2011

Making Hadoop 1000x Faster for Graph Problems

Dr. Daniel Abadi, author of the DBMS Musings blog and cofounder of Hadapt, which offers a product that improves Hadoop performance by 50x on relational data, is now taking his talents to graph data in Hadoop's tremendous inefficiency on graph data management (and how to avoid it), an article that shares the secrets of getting Hadoop to perform 1000x better on graph data.

TL;DR:

  • Analysing graph data is at the heart of important data mining problems.
  • Hadoop is the tool of choice for many of these problems.
  • Hadoop-style MapReduce works best on key-value processing, not graph processing, and can be well over a factor of 1000 less efficient than it needs to be.
  • Hadoop's inefficiency has real-world consequences. Inefficiency on graph data problems like improving power utilization, minimizing carbon emissions, and improving product designs leaves a lot of value on the table, in the form of negative environmental consequences, increased server costs, increased data center space, and increased energy costs.
  • 10x improvement by using a clustering algorithm to graph-partition data across the nodes in the Hadoop cluster. By default Hadoop distributes data randomly around the cluster, which means data that's close together in the graph can be very far apart on disk. This is very slow for common operations like sub-graph pattern matching, which prefers neighbors to be stored on the same machine (a sketch after this list illustrates the idea).
  • 10x improvement by replicating data on the edges of partitions so that vertices are stored on the same physical machine as their neighbors. By default Hadoop replicates all data 3 times; treating all data equally is inefficient (the same sketch shows boundary replication).
  • 10x improvement by replacing the physical storage system with graph-optimized storage. HDFS, which is a distributed file system, and HBase, which is an unstructured data store, are not optimal data stores for graph data (a second sketch after this list contrasts the layouts).
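To make the first two ideas concrete, here's a minimal Python sketch (not Hadoop code) of what locality-aware partitioning and boundary replication buy you. The toy graph, the greedy BFS partitioner, and the character-sum "hash" placement are all illustrative assumptions standing in for a real partitioner like METIS and Hadoop's actual block placement; the point is only to show how clustering cuts cross-machine edges and how replicating boundary vertices keeps 1-hop neighbors local.

```python
from collections import deque

# Toy undirected graph as adjacency lists (illustrative data, not from the paper).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}

def random_placement(graph, num_parts):
    """Stand-in for structure-oblivious placement: vertices scatter arbitrarily."""
    return {v: sum(map(ord, v)) % num_parts for v in graph}

def clustered_placement(graph, num_parts):
    """Greedy BFS fill: keep connected neighborhoods on the same partition.
    A real system would use a graph partitioner such as METIS instead."""
    target = len(graph) // num_parts + 1
    assignment, sizes, part = {}, [0] * num_parts, 0
    for start in graph:
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if v in assignment:
                continue
            assignment[v] = part
            sizes[part] += 1
            if sizes[part] >= target and part < num_parts - 1:
                part += 1
            queue.extend(graph[v])
    return assignment

def cut_edges(graph, assignment):
    """Edges whose endpoints land on different machines force network hops."""
    return sum(assignment[v] != assignment[u]
               for v in graph for u in graph[v]) // 2

def replicate_boundaries(graph, assignment, num_parts):
    """Give each partition local copies of its vertices' remote neighbors,
    so 1-hop lookups never leave the machine (a simplified 1-hop guarantee)."""
    parts = {p: set() for p in range(num_parts)}
    for v, p in assignment.items():
        parts[p].add(v)
        parts[p].update(graph[v])  # neighbors copied locally if remote
    return parts

for name, placement in [("random", random_placement(graph, 2)),
                        ("clustered", clustered_placement(graph, 2))]:
    print(name, placement, "cut edges:", cut_edges(graph, placement))

for machine, vertices in replicate_boundaries(graph, clustered_placement(graph, 2), 2).items():
    print("machine", machine, "stores", sorted(vertices))
```

On this toy graph the random placement cuts five edges while the clustered placement cuts one, and after boundary replication each "machine" holds every neighbor of its own vertices.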
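For the third idea, a rough sketch of why physical layout matters: a flat edge file (the natural fit for HDFS-style storage) forces a scan of everything to answer "who are v's neighbors?", while a store keyed by vertex with adjacency lists answers the same question with one lookup. Both classes below are toy, in-memory stand-ins, not the storage layer described in the paper.

```python
class FlatEdgeStore:
    """Edges kept as one undifferentiated list, like lines in a big file."""
    def __init__(self, edges):
        self.edges = list(edges)

    def neighbors(self, v):
        # Must scan every edge; cost grows with the whole dataset.
        return [b for a, b in self.edges if a == v] + \
               [a for a, b in self.edges if b == v]

class AdjacencyStore:
    """Graph-optimized layout: neighbors co-located under their vertex key."""
    def __init__(self, edges):
        self.adj = {}
        for a, b in edges:
            self.adj.setdefault(a, []).append(b)
            self.adj.setdefault(b, []).append(a)

    def neighbors(self, v):
        # One keyed lookup; cost is independent of total graph size.
        return self.adj.get(v, [])

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
print(FlatEdgeStore(edges).neighbors("c"))   # scans all edges
print(AdjacencyStore(edges).neighbors("c"))  # single lookup
```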

Voila! That's a 10x * 10x * 10x = 1000x performance improvement on graph problems, using techniques that make a lot of sense. What may be less obvious is the whole idea of keeping the Hadoop shell while making the component parts more efficient for graph problems. Hadoop stays Hadoop externally, but internally it has graph super powers. These are strategies you can use.

What I found most intriguing is thinking about the larger consequences of Hadoop being inefficient. There's more in play than I had previously considered. From the most obvious angle, money, we are used to thinking this way about mass-produced items. If a widget can be cost-reduced by 10 cents and millions of them are made, we are talking real money. If Hadoop is going to be used for the majority of data mining problems, then making it more efficient adds up to real effects. Going to the next level, the more efficient Hadoop becomes, the quicker important problems facing the world will be solved. Interesting.

For more details please read the original article and the paper describing the work: Scalable SPARQL Querying of Large RDF Graphs.


Reader Comments (4)

I'm not sure if Hadoop should implement all that. It would be great for graph problems but inefficient (slower) for everything else that isn't graphs. The first two improvements would require extra work in Hadoop's replication mechanism, which, in effect, would slow down the replication of data in this operation.
Now, that would be a nice configuration flag to turn on/off for graphs.
If I remember right, there is a company doing that for Hadoop already (sorry, forgot the name). It's open-source too.

July 27, 2011 | Unregistered CommenterTim Pham

What would be a "graph-optimized storage" solution today? This is a little unclear.

July 27, 2011 | Unregistered CommenterGeorge

The next generation of Hadoop (aka MapReduce 2.0, Apache JIRA MAPREDUCE-279) changes the landscape completely by separating generic and application-specific resource management, i.e., MapReduce becomes a userland library. Application-specific solutions like this can be implemented with no change to the Hadoop installation.

July 28, 2011 | Unregistered Commentervicaya

HDFS is optimized to transfer data in 64 MB chunks. The latency required for each I/O, which could be random, is enough to kill many graph traversal algorithms. You'd be much better off with a dedicated graph database management system. If you were looking at HDFS because it can handle large amounts of data you'd be better off with a scalable, distributed graph database product, such as InfiniteGraph.

July 29, 2011 | Unregistered CommenterLeon
