How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data
Tuesday, January 28, 2014 at 8:56AM
HighScalability Team in Example

This is a guest post by Eric Czech, Chief Architect at Next Big Sound, about some unique approaches taken to solving scalability challenges in music analytics.

Tracking online activity is hardly a new idea, but doing it for the entire music industry isn't easy. Half a billion music video streams, track downloads, and artist page likes occur each day and measuring all of this activity across platforms such as Spotify, iTunes, YouTube, Facebook, and more, poses some interesting scalability challenges. Next Big Sound collects this type of data from over a hundred sources, standardizes everything, and offers that information to record labels, band managers, and artists through a web-based analytics platform.
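To make the "standardizes everything" step concrete, here is a minimal sketch (in Python, with made-up field names and parsers, not Next Big Sound's actual schema) of how heterogeneous per-source payloads can be reduced to one common record shape before storage:

```python
# Illustrative only: hypothetical per-source normalizers that reduce every
# provider's payload to a common (source, entity_id, metric, day, value) record.
from datetime import date

def normalize_youtube(raw):
    # e.g. raw = {"channelId": "UC123", "views": 48211, "date": "2014-01-27"}
    return {
        "source": "youtube",
        "entity_id": raw["channelId"],
        "metric": "video_views",
        "day": date.fromisoformat(raw["date"]),
        "value": int(raw["views"]),
    }

def normalize_facebook(raw):
    # e.g. raw = {"page": "9988", "like_count": "1520", "ts": "2014-01-27"}
    return {
        "source": "facebook",
        "entity_id": raw["page"],
        "metric": "page_likes",
        "day": date.fromisoformat(raw["ts"]),
        "value": int(raw["like_count"]),
    }

NORMALIZERS = {"youtube": normalize_youtube, "facebook": normalize_facebook}

def standardize(source, raw):
    return NORMALIZERS[source](raw)
```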

Many of our applications use open-source systems like Hadoop, HBase, Cassandra, Mongo, RabbitMQ, and MySQL, and for the most part our usage of them is fairly standard, but there is one aspect of what we do that is pretty unique. We collect or receive information from 100+ sources, and we struggled early on to find a way to deal with how data from those sources changed over time, ultimately deciding that we needed a data storage solution that could represent those changes. Basically, we needed to be able to "version" or "branch" the data from those sources in much the same way that we use revision control (via Git) to manage the code that creates it. We did this by adding a logical layer to a Cloudera distribution, and after integrating that layer with Apache Pig, HBase, Hive, and HDFS, we now have a basic version control framework for large amounts of data in Hadoop clusters.
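The post doesn't publish the implementation, but the idea of a logical versioning layer can be sketched roughly like this (a simplified illustration with hypothetical names; the real system is integrated with Pig, HBase, Hive, and HDFS rather than living in application code):

```python
# A minimal sketch, not the actual NBS implementation: logical dataset names
# resolve, per reader, to a concrete versioned location, so new ETL output can
# be written alongside old output and promoted or reverted without rewrites.

class VersionRegistry:
    def __init__(self):
        self._versions = {}   # dataset -> list of version ids, oldest first
        self._published = {}  # dataset -> version id visible to normal users

    def new_version(self, dataset):
        vid = len(self._versions.setdefault(dataset, [])) + 1
        self._versions[dataset].append(vid)
        return vid                      # e.g. write output under /data/<dataset>/v<vid>/

    def publish(self, dataset, vid):
        self._published[dataset] = vid  # the atomic "swap" seen by readers

    def resolve(self, dataset, developer=False):
        if developer:
            return self._versions[dataset][-1]  # newest, possibly unpublished
        return self._published[dataset]         # last confirmed version

registry = VersionRegistry()
v = registry.new_version("twitter_retweets_by_country")
# ... run the ETL job and write its output under version v ...
registry.publish("twitter_retweets_by_country", v)
```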

As a sort of "Moneyball for Music," Next Big Sound has grown from a single-server LAMP site tracking plays on MySpace (it was cool when we started) for a handful of artists to building industry-wide popularity charts for Billboard and ingesting records of every song streamed on Spotify. The growth rate of data has been close to exponential and early adoption of distributed systems has been crucial in keeping up. With over 100 sources tracked, coming from both public and proprietary providers, dealing with the heterogeneous nature of music analytics has required some novel solutions that go beyond the features that come for free with modern distributed databases.

Next Big Sound has also transitioned between full cloud providers (Slicehost), hybrid providers (Rackspace), and colocation (Zcolo), all while running with a small engineering staff using nothing but open-source systems. There was no shortage of lessons learned in this process and we hope others can take something away from our successes and failures.

Stats

Platform

 

Storage Architecture

Storing timeseries data is relatively simple with distributed systems like Cassandra and HBase, but managing how that data evolves over time is much less so. At Next Big Sound, aggregating data from 100+ sources involves a somewhat traditional Hadoop ETL pipeline where raw data is processed via MapReduce applications, Pig, or Hive and the results are written to HBase for later retrieval via Finagle/Thrift services, but with a twist: all data stored within Hadoop/HBase is maintained by a special version control system that supports changes in ETL results over time, allowing changes in the code that defines the processing pipeline to stay aligned with the data itself.
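One common way to make room for such versions in HBase, sketched below purely for illustration (the actual Next Big Sound schema isn't described in the post; table, column family, and key layout here are assumptions), is to fold a version id into the column qualifier so that re-running part of the pipeline writes new cells next to the old ones instead of overwriting them:

```python
# Illustrative sketch of a versioned HBase key layout (hypothetical names).
# Row key: entity/metric/day; column qualifier: version id.
import happybase  # requires a reachable HBase Thrift server

conn = happybase.Connection('localhost')
table = conn.table('metrics')

def row_key(entity_id, metric, day):
    return f"{entity_id}:{metric}:{day:%Y%m%d}".encode()

def write_datapoint(entity_id, metric, day, version, value):
    # New pipeline runs write under a new version rather than clobbering old cells.
    table.put(row_key(entity_id, metric, day),
              {f"d:v{version}".encode(): str(value).encode()})

def read_datapoint(entity_id, metric, day, version):
    row = table.row(row_key(entity_id, metric, day))
    cell = row.get(f"d:v{version}".encode())
    return float(cell) if cell is not None else None
```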

Our "versioning" approach for managing data within Hadoop is an extension of techniques like those used in the LinkedIn data cycle, where results from Hadoop are recomputed, in full, on a recurring basis and atomically swapped in to replace old result sets in Voldemort in a revertible, versioned way. The difference in our system is that versioning doesn't just occur at the global level; it occurs on a configurable number of deeper levels. This means, for example, that if we're recording the number of retweets by country for an artist on Twitter and we find that our logic for geocoding tweet locations was wrong for a few days, we can simply create new versions of the data for just those days rather than rebuilding the entire dataset. Different values are then associated with each of these new versions, and access to each version can be restricted to certain users; developers might see only the newest versions while normal users see the old version until the new data is confirmed as accurate. "Branching" data like this has been critical for keeping up with changes in data sources and customer requests, as well as for supporting efficient, incremental data pipelines.
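As a rough illustration of that per-slice versioning and per-user visibility (hypothetical data and names, not the production system), only the affected days gain a new version, and which version a reader sees depends on whether the fix has been confirmed:

```python
# Sketch of per-day version resolution: most days keep their original version,
# the re-geocoded days get a v2, and promotion to normal users happens per day.
from datetime import date

day_versions = {                      # day -> candidate versions, oldest first
    date(2014, 1, 20): [1],
    date(2014, 1, 21): [1, 2],        # re-geocoded
    date(2014, 1, 22): [1, 2],        # re-geocoded
    date(2014, 1, 23): [1],
}
confirmed = {d: 1 for d in day_versions}  # version currently shown to users

def visible_version(day, developer=False):
    if developer:
        return day_versions[day][-1]  # newest, even if unconfirmed
    return confirmed[day]             # last version promoted to users

# Once the corrected days check out, promote just those versions:
for d in (date(2014, 1, 21), date(2014, 1, 22)):
    confirmed[d] = 2
```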

For some extra detail on this system, the diagram below portrays some of the key differences described above.

For even more detail, check out the paper "HBlocks: A Hadoop Subsystem for Iterative Data Engineering."

The Hadoop infrastructure aside, there are plenty of other challenges we face as well. Mapping the relationships of entities within the music industry across social networks and content distribution sites, building web applications for sorting and searching through millions of datasets, and managing the collection of information over millions of API calls and web crawls all require specialized solutions. We do all of this using only open-source tools; a coarse overview of how these systems relate to one another is shown below.

 

Hosting

 

Lessons Learned

 

The Next Big Sound Blog also has some more interesting posts on data mining in the music industry, and if you're really into that sort of thing, we're always hiring!
