1 Billion Reasons Why Adobe Chose HBase

Tuesday, March 16, 2010 at 11:46AM

Cosmin Lehene wrote two excellent articles on Adobe's experiences with HBase: Why we’re using HBase: Part 1 and Why we’re using HBase: Part 2. Adobe needed a generic, real-time, structured data storage and processing system that could handle any data volume, with access times under 50ms, no downtime, and no data loss. The articles go into great detail about the team's experiences with HBase and their evaluation process, providing a "well reasoned impartial use case from a commercial user". They cover failure handling, availability, write performance, read performance, random reads, sequential scans, and consistency.

One of the knocks against HBase has been its complexity, as it has many parts that need installation and configuration. All is not lost, according to the Adobe team:

HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that “HBase is not a good choice because it is complex” is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nice with the Unix philosophy: do one thing and do it well. Distributed storage is delegated to HDFS, so is distributed processing, cluster state goes to Zookeeper. All these systems are developed and tested separately, and are good at what they do. More than that, this allows you to scale your cluster on separate vectors. This is not optimal, but it allows for incremental investment in either spindles, CPU or RAM. You don’t have to add them all at the same time.

Highly recommended, especially if you need some sort of balance to the recent gush of Cassandra articles.

HighScalability Team |

Example,

nosql

Reader Comments (2)

I know Cosmin and have intimate knowledge of their project.

Their setup is marvelously scalable but there's a catch ... they didn't actually need it. Those 1 billion reasons were mostly imaginary.

They tried providing a common infrastructure for various internal projects (storing common stuff). But the projects are different enough that data sharing doesn't really happen; each project's space is practically an island that would have been better served by its own individual database.

There are other advantages to that, like having reusable storage, but you need your own database anyway because only certain kinds of data end up being stored in HBase (this happened for one of the clients Cosmin's talking about). And then you end up with consistency/local performance issues (to work around that, a lot of metadata is replicated in the local database, and there are synchronization issues anyway, only without guarantees ... like in the case of a master-master setup that dies loudly in case of a conflict).

This is not to downplay their effort (they are really top-notch engineers) ... but my impression is that they had way too much time on their hands, and their particular project is just a solution looking for a problem.

Of course, when you have the resources of Adobe, this may actually be a good idea. But in a smaller company that has to deliver before its competition does, this is yak shaving.

March 17, 2010 | Amt

Our first project was 40 million records but the system never reached its capacity. As mentioned in the article:

"The system never reached its planned capacity.[…] In reality, all we had would have been easy to handle with a MySQL cluster and just a little operational overhead."

The scale of our latest solution definitely isn’t imaginary. In the second half of the article I talk about the present where there are currently 560 million fairly fat records (~4.5 TB) in one table and 2.4 billion in a different cluster.

Multiple clients are using the current system today and there are no local databases or syncing issues.

Talking about how the new system replaces the older system that you’re familiar with was beyond the scope of this article but that might be a good idea for a future post.

We’ll follow up with more posts on our work and I’d be happy to talk more about this with you offline at work if you’d like.

March 18, 2010 | Cosmin Lehene
