Sunday, January 13, 2008

Google Reveals New MapReduce Stats

The Google Operating System blog has an interesting post on Google's scale, based on an updated version of Google's MapReduce paper.

The input data for the MapReduce jobs run in September 2007 totaled 403,152 TB (terabytes), the average number of machines allocated to a MapReduce job was 394, and the average completion time was about six and a half minutes. The paper mentions that Google's indexing system processes more than 20 TB of raw data.



Niall Kennedy calculates that the average MapReduce job runs across a $1 million hardware infrastructure, assuming that Google still uses the same cluster configurations from 2004: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link.

Greg Linden notes that Google's infrastructure is an important competitive advantage. "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time."

It is interesting to compare this to Amazon EC2:

  • $0.40 Large Instance price per hour x 400 instances x 10 minutes (1/6 of an hour) = $26.67

  • 1 TB data transfer in at $0.10 per GB = $100



For a little over a hundred bucks you could also process a TB of data!
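If you want to play with the assumptions, the back-of-envelope math is easy to script. Here's a quick sketch in Python using the 2008 prices quoted above (the instance count and runtime are rounded from Google's averages; none of these figures reflect current pricing):

    # Rough EC2 cost for a Google-scale MapReduce job, at 2008 prices.
    instance_price_per_hour = 0.40   # Large Instance, $/hour
    transfer_price_per_gb = 0.10     # data transfer in, $/GB

    instances = 400          # Google's average was 394 machines per job
    job_hours = 10.0 / 60    # ~6.5 minute average, rounded up to 10 minutes
    input_gb = 1000          # 1 TB of input data

    compute = instance_price_per_hour * instances * job_hours
    transfer = transfer_price_per_gb * input_gb

    print("compute:  $%.2f" % compute)               # ~$26.67
    print("transfer: $%.2f" % transfer)              # $100.00
    print("total:    $%.2f" % (compute + transfer))  # ~$126.67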

Reader Comments (1)

Amazon charges by the full hour:
Pricing is per instance-hour consumed for each instance type. Partial instance-hours consumed are billed as full hours.

So the calculation is:


  • $0.40 Large Instance price per hour x 400 instances = $160

  • 1 TB data transfer in at $0.10 per GB = $100

Plus, you need to have an efficient way of bringing 400 instances online simultaneously, and somewhere to store 1 TB of data that can serve it to EC2 in under 5 minutes. That in itself is a pretty significant problem.

So if you assume you upload the data to S3, you add the cost of storing the data, which would be $150 per month. Assuming you can upload the data, run the process and then delete the source data in 24 hours, the cost of storage would be reduced to around $5 - $10.

It's still an awful lot less than Google's $1m investment though.
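The adjusted estimate is easy to reproduce. A quick sketch in Python, assuming the $0.15 per GB-month S3 rate implied by the $150/month figure and a one-day retention of the input data:

    # EC2 cost with partial hours billed as full hours, plus rough S3 storage.
    instance_price_per_hour = 0.40   # Large Instance, $/hour
    transfer_price_per_gb = 0.10     # data transfer in, $/GB
    s3_price_per_gb_month = 0.15     # assumed 2008 S3 storage rate, $/GB-month

    instances = 400
    input_gb = 1000
    days_stored = 1                  # upload, process, delete within 24 hours

    compute = instance_price_per_hour * instances * 1   # each instance billed a full hour
    transfer = transfer_price_per_gb * input_gb
    storage = s3_price_per_gb_month * input_gb * days_stored / 30.0

    print("compute:  $%.2f" % compute)    # $160.00
    print("transfer: $%.2f" % transfer)   # $100.00
    print("storage:  $%.2f" % storage)    # ~$5.00
    print("total:    $%.2f" % (compute + transfer + storage))  # ~$265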

Cheers - Callum (http://www.callum-macdonald.com/).

December 31, 1999 | Unregistered Commenter chmac
