Saturday
May 31, 2008
Biggest Under-Reported Story: Google's BigTable Costs 10 Times Less than Amazon's SimpleDB

Why isn't Google's aggressive new database pricing strategy getting more pub? That's what Bill Katz, instigator of the GAE Meetup and prize-winning science fiction author, is wondering:
It's surprising that the blogosphere hasn't picked up the biggest difference in pricing:
Google's datastore is less than a tenth of the price of Amazon's SimpleDB while offering a better API.
If money matters to you then the burn rate under GAE could be convincingly lower. Let's compare the numbers:
GAE pricing:
* $0.10 - $0.12 per CPU core-hour
* $0.15 - $0.18 per GB-month of storage
* $0.11 - $0.13 per GB outgoing bandwidth
* $0.09 - $0.11 per GB incoming bandwidth
SimpleDB pricing:
* $0.14 per Amazon SimpleDB Machine Hour consumed
* Structured Data Storage - $1.50 per GB-month
* $0.100 per GB - all data transfer in
* $0.170 per GB - first 10 TB / month data transfer out (more on the site)
Clearly Google priced their services to be competitive with Amazon. We may see a response from Amazon in the near future, but the database storage cost for GAE is dramatically cheaper at $0.15 - $0.18 per GB-month vs $1.50 per GB-month.
Interestingly, Google's price is the same as Amazon's S3 (file storage) pricing. Google seems to think of database storage as more like file storage. That makes a certain amount of sense because BigTable is a layer on the Google File System. File system pricing may be the more appropriate price reference point.
On SimpleDB a 1TB database costs $1,500/month and BigTable costs in the $180/month range. As you grow into ever larger data sets the difference becomes even more compelling.
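To make that concrete, here is a quick back-of-the-envelope sketch in Python using the per-GB-month rates listed above (a decimal 1,000 GB terabyte is assumed for simplicity):

    # Rough storage-cost comparison using the published per-GB-month rates above.
    SIMPLEDB_RATE = 1.50                      # $/GB-month, SimpleDB structured data storage
    GAE_RATE_LOW, GAE_RATE_HIGH = 0.15, 0.18  # $/GB-month, GAE datastore storage

    def monthly_storage_cost(gigabytes, rate_per_gb_month):
        return gigabytes * rate_per_gb_month

    TB = 1000  # GB, decimal terabyte for simplicity
    print "SimpleDB, 1 TB:      $%.0f/month" % monthly_storage_cost(TB, SIMPLEDB_RATE)
    print "GAE datastore, 1 TB: $%.0f-$%.0f/month" % (monthly_storage_cost(TB, GAE_RATE_LOW),
                                                      monthly_storage_cost(TB, GAE_RATE_HIGH))

That prints $1,500/month for SimpleDB versus $150-$180/month for the GAE datastore, which is where the "in the $180/month range" figure comes from.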
If you are a startup, your need for funding just dropped another notch. It's hard to self-finance many thousands of dollars a month, but hundreds of dollars is an easy nut to make.
Still, Amazon's advantage is that they support application clusters that can access the data for free within AWS. GAE excels at providing a scalable two-tier architecture for displaying web pages. Doing anything else with your data has to be done outside GAE, which kicks up your bandwidth costs considerably. How much obviously depends on your application. But if your web site is of the more vanilla variety, the cost savings could be game-changing.
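To picture what that two-tier setup looks like, here is a minimal sketch of the kind of page-serving app GAE hosts, written against the webapp framework and datastore API in the SDK; the Greeting model and the handler are purely illustrative, not anything from a real application:

    from google.appengine.ext import webapp, db
    from google.appengine.ext.webapp.util import run_wsgi_app

    class Greeting(db.Model):
        # Illustrative entity stored in the datastore (backed by BigTable).
        content = db.StringProperty()
        date = db.DateTimeProperty(auto_now_add=True)

    class MainPage(webapp.RequestHandler):
        def get(self):
            # Tier one handles the request; tier two is the datastore read.
            greetings = Greeting.all().order('-date').fetch(10)
            for g in greetings:
                self.response.out.write('<p>%s</p>' % g.content)

    application = webapp.WSGIApplication([('/', MainPage)], debug=True)

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()

That request-then-datastore round trip is essentially all GAE asks of a vanilla site; anything heavier has to live elsewhere, which is where the extra bandwidth charges come in.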
Reader Comments (24)
It IS more like file storage, in that you aren't getting the expensive JOIN operations that relational databases do. BigTable could probably be better thought of as a file system with a limited locking capability and a unique take on indexing. All its operations are designed to be cheap, which I don't think is true of most databases.
Not to be picky, but I think there is an intangible downside to using Amazon's DB or Google's DB.
If you look at the trend of web 2.0, it is all about data. Whoever hoards massive amounts of data and can effectively parse or analyze it over a cluster wins. For years, Google has been "forcibly" collecting web data using web robots (crawlers). What if customers started storing all their valuable data in Google's database? There isn't anything to prevent Google from periodically scanning your data and making monetary decisions based on it. Sure, they will say they didn't do it. But do you really trust your data store to a third party? Credit card numbers? Social security numbers? Anything?
You are right that Google's App Engine is great only when your scale demands it. But the other side of the coin is that when your application demands the kind of horizontal scaling Google App Engine provides, you should be able to install Hadoop instead.
The trick, of course, is dependence: any application created on top of a Google or Amazon platform will instantly be locked onto that infrastructure, making it difficult to move off later.
To add to the comment above, Google's infrastructure is great only when you reach the point where it MUST be distributed. Right now you can cheaply build a 2-socket, 8-core system with 64GB of RAM, attach 2 Dell MD3000s, and make it a 30-drive SAS database server running MySQL or PostgreSQL. Such a system with 30 36GB SAS drives in RAID10 can handle a database in the area of 100-300GB with decent speed.
If you need more than that, the value of the data itself is reason enough to not trust it to a third party.
Look into EnterpriseDB's GridSQL, which is a shared-nothing parallel SQL server. It is free for noncommercial use, and it is pretty cheap for a production license too (less than $2K).
Just to correct myself, the datastore pricing is about a tenth of the cost of SimpleDB. The costs even up a little when bandwidth costs dominate, so it depends on the usage scenario.
Also, I'd rather design a system upfront to scale than wait until "you reach a point when it MUST be distributed," especially if the upfront design isn't too cumbersome. I felt SimpleDB was too simple. App Engine might hit my sweet spot because it removes a lot of infrastructure hassles and still provides enough DB support (references, query sorts, etc).
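For anyone wondering what "references, query sorts, etc." look like in practice, a small hedged sketch against the SDK's db API; the Author and Post models are made up purely for illustration:

    from google.appengine.ext import db

    class Author(db.Model):
        name = db.StringProperty(required=True)

    class Post(db.Model):
        author = db.ReferenceProperty(Author)   # a reference between entities
        title = db.StringProperty()
        created = db.DateTimeProperty(auto_now_add=True)

    # Store an author and a post that points at it.
    bill = Author(name='Bill')
    bill.put()
    Post(author=bill, title='Comparing the pricing').put()

    # Query sort: this author's latest posts, newest first.
    latest = Post.all().filter('author =', bill).order('-created').fetch(20)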
I did pick it up =).
See the "Google App Engine Has a Better Database" section:
http://adamfisk.wordpress.com/2008/05/28/where-google-app-engine-spanks-amazons-web-services-s3-ec2-simple-db-sqs/
-Adam
Sorry Adam, you sure did. BTW, several people have said the BigTable API is better. Is it primarily the sorting and GQL you find better or is there something else?
It would be great to see Amazon offer HBase as a service with cheaper rates than SimpleDB (since it makes fewer guarantees). Yes, you can build a cluster from instances, but it would be better if it 'just worked' in the cloud. Actually, it would be great if they did this with PostgreSQL and memcached as well.
Spot on. The Google App Engine technology is going to shake up the entire web hosting marketplace. I got to try Google App Engine and I have to say I'm impressed and excited so far. I'm familiar with web programming concepts, but with no prior Python experience whatsoever I was able to follow the documentation and get a basic app up and running.
I made my app in my spare time without a huge coding team, so imagine what a huge coding team could accomplish. The app is SellStuff!, an online marketplace; the proof of concept is at http://sellstuff.appspot.com/ if you're interested (Google Maps integration coming soon).
First, GAE is not as reliable as AWS; second, when you move data inside Amazon's cloud, you don't pay anything. This means that in a real scenario, with Amazon Web Services (and SimpleDB), you could actually spend LESS and have a more reliable environment.
Also, you can have GAE on top of AWS, but not the other way around :-)
I'm almost sure that by December of this year, somewhere, somehow, we will have an open source implementation of Google's architecture, based maybe on MySQL, or on some kind of new custom-made "bigtable", perhaps a better-supported version of dev_appserver.py that is ready for production and ready to install on any vanilla box. Maybe it won't be very scalable (like LAMP), but I doubt we will be tied to Google's system forever.
Being a long-term user of MySQL and a manager of very large datasets, I know how quick and easy MySQL is, but I also know how awful and problematic its replication is. The clustering side is also nowhere near the level of ease it should be.
I hope these developments from Amazon/Google drive MySQL (Sun) to actually rethink this area of their engine, because compared to the likes of Berkeley DB, it is lagging behind.
LOL, Coming to a newspaper near you, "GOOGLE TAKES OVER THE WORLD". LMAO
JJ
Both the sorting and GQL are significant. I don't necessarily think GQL is better, but it's more familiar, so you can get up and running quicker. The fact that you just know it will scale is huge too, though. I'm not sure about the "eventual consistency" of SimpleDB versus BigTable because I haven't played with BigTable enough. SimpleDB can sometimes take a while (as much as 10 seconds, although I haven't observed anything that high in practice) for things to replicate. I would guess BigTable is better, but I haven't confirmed that.
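For reference, the GQL being discussed is a SQL-like, SELECT-only syntax over the datastore. A small sketch of the sorted-query case, assuming the illustrative Post model from the earlier sketch:

    from google.appengine.ext import db

    # SELECT-only: no joins or aggregates, but ORDER BY and LIMIT feel familiar.
    newest = db.GqlQuery("SELECT * FROM Post ORDER BY created DESC LIMIT 20")
    for post in newest:
        print post.title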
@Simone Brunozzi
> First, GAE is not as reliable as AWS;
??? Based on what? You're basically saying Google infrastructure is not as reliable as whatever you custom-build on Amazon. That's a ridiculous statement.
> second, when you move data inside Amazon's cloud, you don't pay anything.
You don't pay for internal *bandwidth* within Amazon. You still pay for whatever CPU and disk space you use.
> This means that in a real scenario, with Amazon Web Services (and SimpleDB) you could actually spend LESS, having a more reliable environment.
Totally wrong. That's the whole point -- disk space costs far more on Amazon. Look up the SimpleDB numbers, or just actually read this post. The "more reliable environment" claim is simply not true. Not sure why you're saying these things.
Yeah, it may be cheaper, but it comes at the cost of choice and flexibility.
Unless Google gives you more flexibility with their hosting services, this doesn't really cause me to bat an eye.
@Brian
I totally agree, but it depends on what you're building. If you're just building a web site, App Engine will take way less time (as long as you can hammer out some Python). If you need a lot of custom, tricky stuff, you need to run at least some of your stuff on AWS or your own servers. I predict a hybrid approach will be very common. That's what we'll be doing with LittleShoot (http://www.littleshoot.org/site.html) -- moving from all AWS to a hybrid because it's cheaper and easier to build something that will scale.
Three things to note.
BigTable was designed to hold massive data; SimpleDB was not, and in fact the SimpleDB info pages specifically warn you against it. The 1TB-per-month pricing example is in no way a recommended use of SimpleDB. With AWS you use S3 to store large data.
If you are storing dynamic website data and your site has more than 5 million page views per month, bandwidth costs are likely to dwarf your storage costs by a factor greater than 10 (a rough sketch of the numbers follows these three points). Comparing hosting plans along a single dimension, without the perspective of likely usage, may not yield much insight.
GAE doesn't give you access to BigTable. You get access to a DataStore API that is backed by BigTable. BigTable has a lot of major features and options that are not exposed via the DataStore API. GAE and BigTable are not synonymous.
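To put rough numbers on the bandwidth-versus-storage point above, a sketch under stated assumptions; the 100 KB average page size and the 20 GB of stored data are purely illustrative, and the dollar rates are the GAE figures quoted earlier:

    # Rough check of the bandwidth-vs-storage claim for a busy dynamic site.
    PAGE_VIEWS = 5000000        # 5 million page views/month, from the point above
    AVG_PAGE_KB = 100.0         # assumed average page size (illustrative only)
    OUT_RATE = 0.12             # $/GB outgoing, midpoint of GAE's published range
    STORAGE_RATE = 0.18         # $/GB-month, high end of GAE datastore storage
    STORED_GB = 20.0            # assumed datastore size (illustrative only)

    bandwidth_gb = PAGE_VIEWS * AVG_PAGE_KB / (1024 * 1024)   # KB -> GB
    bandwidth_cost = bandwidth_gb * OUT_RATE                  # ~ $57/month
    storage_cost = STORED_GB * STORAGE_RATE                   # ~ $3.60/month

    print "Outgoing bandwidth: ~%.0f GB, ~$%.0f/month" % (bandwidth_gb, bandwidth_cost)
    print "Storage: ~$%.2f/month" % storage_cost

Under those assumptions the bandwidth bill is more than ten times the storage bill, which is the point being made; heavier pages or a smaller datastore would tilt it even further.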
OK, it seems to me that Amazon's store is more portable because you can choose not to use S3 and SimpleDB directly. Why couldn't you run a Hadoop cluster on a bunch of EC2 instances? Then, if you ever wanted to change, you could just move the cluster over to some other Linux cluster. Nothing too proprietary there, right?
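As a sketch of that do-it-yourself route, here is roughly how you could bring up a few EC2 instances for a Hadoop cluster with the boto Python library; the AMI id, key pair, and instance count are placeholders, and the Hadoop configuration itself is left out:

    import boto

    # Placeholders: substitute your own credentials, AMI, and key pair.
    conn = boto.connect_ec2('YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY')
    reservation = conn.run_instances('ami-00000000',   # a Hadoop-ready AMI of your choosing
                                     min_count=4, max_count=4,
                                     instance_type='m1.large',
                                     key_name='my-keypair')

    for instance in reservation.instances:
        print instance.id, instance.state

    # From here you would push your Hadoop config to the nodes and start the daemons.
    # If you later leave EC2, the same cluster layout moves to any other set of Linux boxes.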
@cbmeeks I totally agree. For starting up it's certainly more work, but to me you've nailed the major alternative. In fact, that's my current plan if I ever have to move things off App Engine. The biggest downside, of course, is that it takes more time. You also still have to set up some sort of load balancing with something like HAProxy or Pound. Amazon has their Elastic IPs; otherwise you'd have to deal with that too.
I'd say HBase on top of Hadoop is more accurate, but we're talking about the same thing. It's probably no BigTable, but it's close.
I thought about this some more. Apparently, Google and Amazon are pretty good at masking the cost to their users.
Look at this cost structure closely:
* $0.10 - $0.12 per CPU core-hour
* $0.15 - $0.18 per GB-month of storage
* $0.11 - $0.13 per GB outgoing bandwidth
* $0.09 - $0.11 per GB incoming bandwidth
The bandwidth cost is about 11 cents ($0.11) per GB on average. Since 1 Mbit of sustained bandwidth is about 300GB/month, Google is charging you roughly $30 per Mbit, which isn't bad when you are small. When your bandwidth needs grow to 100 Mbit and above, you can negotiate the cost down to $10-15 per Mbit. So on bandwidth, GAE favors the small guys, which, by the way, runs against their scaling argument. The whole point of GAE is unlimited scaling, but Google takes roughly a 50% cut once your bandwidth reaches about 50-100 Mbit (the point where you could negotiate your own contracts at market rates).
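Spelling out that conversion as a rough sketch, assuming a fully saturated 1 Mbit/s link and a 30-day month; it lands in the same ballpark as the ~300GB and ~$30 figures above:

    # How a per-GB price maps to the per-Mbit-per-month figure quoted above.
    RATE_PER_GB = 0.11                    # average of GAE's outgoing bandwidth rates
    SECONDS_PER_MONTH = 30 * 24 * 3600

    bytes_per_second = 1e6 / 8            # a saturated 1 Mbit/s link
    gb_per_month = bytes_per_second * SECONDS_PER_MONTH / 1e9
    print "1 Mbit/s sustained is ~%.0f GB/month" % gb_per_month                 # ~324 GB
    print "at $%.2f/GB that is ~$%.0f per Mbit per month" % (RATE_PER_GB,
                                                             gb_per_month * RATE_PER_GB)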
The most compelling argument for GAE is the storage. A 1TB database costs only about $180/month.
The killer, of course, is the 11 cents per CPU core-hour. First of all, without the details being hammered out, you can only assume that Google's "core" is equivalent to Amazon's core, which is a "1.7Ghz" Xeon equivalent. The thing about equivalence is that when each generation of new cores (Nehalem, for example) comes out, the "vendor", i.e. Google or Amazon, can fudge the virtualization factor a little bit. Say a new 8-core Nehalem comes out: they might give you a 1.2Ghz slice and mark it as a core, since a 1.2Ghz Nehalem would be equivalent IPC-wise to an older-generation Xeon at 1.7Ghz. If you look at the CPU market today, an 8-core 2.5Ghz Xeon system can be had cheaply for about $2,000. 11 cents per CPU core-hour (1.7Ghz) means Google is charging you about $1-$1.20 per hour for an 8-core system colocated in their data center. Assuming 720 hours in a month, you are paying close to $800 a month to buy that CPU power from Google/Amazon. So in 3-4 months, Google can recoup their initial investment in the box itself. That is simply too good for them.
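A quick check of that payback math under the commenter's own assumptions (8 cores, ~720 hours a month, a $2,000 box; power, bandwidth, and management costs are ignored here, as a later comment points out):

    CORES = 8
    HOURS_PER_MONTH = 720
    RATE_LOW, RATE_HIGH = 0.11, 0.14      # $/core-hour: GAE's ~$0.11, SimpleDB's $0.14
    BOX_COST = 2000.0                     # the 8-core Xeon system quoted above

    low = CORES * HOURS_PER_MONTH * RATE_LOW
    high = CORES * HOURS_PER_MONTH * RATE_HIGH
    print "Monthly CPU bill: $%.0f - $%.0f" % (low, high)            # ~$634 - $806
    print "Months to recoup a $%.0f box: %.1f - %.1f" % (BOX_COST,
                                                         BOX_COST / high, BOX_COST / low)

That works out to roughly two and a half to three months of rented core-hours per box, which is the same ballpark as the recoup time described above.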
This is actually the "business model" for Google/Amazon: they make their money from CPU hours. CPU is bound by Moore's law, so the cost to the vendors will roughly halve every 18 months. Even assuming Google/Amazon pass some of the cost savings on to the customer, they are certainly not going to cut their prices in half every 18 months.
The best use for GAE and Amazon SimpleDB is actually a historical archive database, where you don't query the data very often but need the storage cheap. Given the recent Sarbanes-Oxley regulations, historical data warehousing will be a whole lot more expensive. Imagine storing 20PB of historical data on GAE: it's pretty cheap. But then again, that much data with a third-party vendor is kind of scary. Google is, after all, after all kinds of data anyway, if you know what I am talking about. So that reduces the utility of GAE to non-proprietary historical archive databases.
I think it's amusing to see folks here say things like "you can just buy an 8-core system and put MySQL on it." There are no doubt issues inherent in GAE and AWS as services (such as backend batch processing on GAE -- ergo the suggested hybrid solution is sound). But hosting your own server has its costs too: power, bandwidth, and management are the three biggies that come to mind. Using GAE or AWS is by far the better way to go when you're starting out (not that GAE and AWS are an apples-to-apples comparison). In the event you need or want more pricing leverage... that's when you host yourself.
-Dan
For those suggesting Hadoop for horizontal scalability: Hadoop is not a real-time processing engine. It's a MapReduce implementation designed for parallel, data-intensive offline processing. It will not help you scale the serving of your web service; it will help you scale the data processing performed on behalf of your web service.
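To make the offline-processing distinction concrete, here's a hedged sketch of the kind of batch job Hadoop is built for: a word-count mapper and reducer for Hadoop Streaming, written in Python (the hadoop jar invocation and the cluster setup are omitted):

    import sys

    def mapper():
        # Map: emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print "%s\t1" % word

    def reducer():
        # Reduce: Hadoop Streaming delivers mapper output sorted by key,
        # so the counts for each word can be summed in a single pass.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print "%s\t%d" % (current, count)
                current, count = word, 0
            count += int(n)
        if current is not None:
            print "%s\t%d" % (current, count)

    if __name__ == "__main__":
        # Run as "wordcount.py map" for the map step, "wordcount.py reduce" for the reduce step.
        mapper() if sys.argv[1:] == ["map"] else reducer()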
If nothing else, this makes one thing clear: Amazon is definitely beating Google in some of their marketing strategies. Google had better start thinking more about this and throw fewer "Google dance" parties.