Friday
Jun 24 2011
Stuff The Internet Says On Scalability For June 24, 2011

Submitted for your scaling pleasure:
- Achievements:
- Watson uses tens of thousands of watts; the computer between the ears uses 20. With only 200 million pages and 2TB of data, Watson is BigInsights, not BigData.
- That Google is pretty big: 1 billion unique monthly visitors
- tweetimages: We peaked at 22m avatars yesterday. Bandwidth peaked at 9GB of @twitter avatars in a single hour.
- Foursquare Surpasses 10 Million Users
- Reddit Hits 1.2B Monthly Pageviews, More Than Doubles Its Engineering Staff
- Twitter: 185 million tweets are posted daily; 1.6 billion search queries daily; indexing latency is less than 10 seconds.
- Quotable quotes:
- skr: OH: "people wait their whole lives for a situation where they can use bloom filters"
- joeweinman: @Werner at #structureconf : as of Nov 10, 2010, all Amazon.com traffic was served from AWS. <-- The child surpasses the parent.
- bbatsov: A compiled language does not scalability make -- Yoda
- swredman: If I read the marketing buzzwords 'scalability' or 'leverage your data' one more time, gonna lose my sh*t.
- ipeirotis: Most tasks are too arbitrary to even decompose in atomic steps, while handling quality, cost, scalability and interactions.
- ArmonDadgar: Some people. You had to use NoSQL for 2 TPS and a DB that can fit on my iPhone? This is some serious #BigData.
- aphethean: For a guy who started Dev life programming applications on hierarchical databases #nosql looks remarkably familiar
- "Software is an entropic system whose arrow of time flows in the direction of failure, aided and abetted by human bullsh*t"
- bstg: Wonder if the best way to improve insights from #bigdata isn't better analytics, but a fundamental change in the way we capture it? #in
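The bloom-filter quip above lands better if you've actually seen one. A minimal sketch in Python (bit-array size, hash count, and the double-hashing-via-SHA-256 trick are arbitrary illustrative choices, not a tuned implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k independent positions by salting one strong hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The punchline of the joke is real: the structure answers "definitely not present" or "maybe present" in constant space, which is exactly the situation people wait their whole lives for.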
- Apple has made their WWDC 2011 Videos available. Apple is normally as closed as a counter-insurgency cell, but their WWDC videos are always top notch.
- James Hamilton hits on another big change in the database landscape, moving away from crazy enterprise pricing schemes to more sustainable and rational models.
- Great discussion on Reddit of Eben Moglen's fascinating The alternate net we need, and how we can build it ourselves. Our net has been turned against us. How do we get it back? Without anonymity the human race will not be human anymore. We need smart routers that work for us.
- In The NoSQL Fad, Alex Popescu says NoSQL won't be countered by a relational database with relaxed semantics, as that's just recreating NoSQL in the first place.
- Spark, in-memory cluster computing that aims to make data analytics fast — both fast to run and fast to write.
- In The State of Management Scalability at Stack Exchange, Kyle Brandt talks about scaling their ability to manage their Linux and Windows environments through automation. The idea is that if you have to do something more than once on multiple servers, automate it. The cool part is they have a chart of where their current process doesn't meet this goal and a plan for how to get there.
- PortLand: Scaling Data Center Networks to 100,000 Ports and Beyond, a great discussion of how we need to be able to scale the network layer as easily as we currently scale the CPU and storage layers. We are being held back by IO.
- Disruptor - a Concurrent Programming Framework, is a general-purpose mechanism that solves a complex problem in concurrent programming in a way that maximizes performance.
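The Disruptor's core trick is replacing locks and queues with a pre-allocated ring buffer indexed by ever-increasing sequence numbers. The real thing is Java with careful memory barriers; this single-producer, single-consumer Python sketch only illustrates the sequencing arithmetic:

```python
class RingBuffer:
    """Toy Disruptor-style ring: cursors instead of locks."""

    def __init__(self, size=8):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.size = size
        self.slots = [None] * size
        self.producer_seq = -1   # last published sequence number
        self.consumer_seq = -1   # last consumed sequence number

    def publish(self, event):
        next_seq = self.producer_seq + 1
        if next_seq - self.consumer_seq > self.size:
            raise BufferError("ring full; consumer has not caught up")
        self.slots[next_seq & (self.size - 1)] = event  # cheap modulo
        self.producer_seq = next_seq                    # publishing = advancing the cursor

    def consume(self):
        if self.consumer_seq == self.producer_seq:
            return None                                 # nothing new yet
        self.consumer_seq += 1
        return self.slots[self.consumer_seq & (self.size - 1)]
```

Because slots are pre-allocated and sequence numbers only ever increase, producer and consumer never contend on the same memory word, which is where the performance comes from.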
- Content Delivery Summit Videos Now Available For Viewing. Secrets behind the real heart of the web, CDNs.
- Velocity 2011 Speaker Slides & Video are available.
- Tired of the NoSQL love fest? Here you go: Scaling with MongoDB (or how Urban Airship abandoned it for PostgreSQL). They found MongoDB was fast until data and indexes no longer fit in memory, and that auto-sharding and Replica Sets were too scary to trust, so they decided to move their data to a manually partitioned PostgreSQL. To learn how Foursquare uses Mongo, take a look at Practical Data Storage: MongoDB at foursquare.
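"Manually partitioned PostgreSQL" usually boils down to a stable hash routing each key to a fixed shard. A hypothetical sketch (the DSNs and shard count are invented for illustration):

```python
import hashlib

# Invented shard list; in practice these are connection strings for
# separate PostgreSQL instances.
SHARDS = ["postgres://db0/app", "postgres://db1/app",
          "postgres://db2/app", "postgres://db3/app"]

def shard_for(user_id: str) -> str:
    # Stable hash so a given user always lands on the same shard.
    # Adding shards later requires a data migration plan, which is
    # the price of doing the partitioning by hand.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]
```

The appeal is that every shard is just boring, well-understood PostgreSQL; the cost is that resharding is now your problem, not the database's.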
- Nice summary (in Japanese) of How do I improve the scalability of a database? Hard to beat the auto-translation capability in Chrome. Ain't the web grand?
- When Watson needs to be fed it dines at chez Hadoop. A Hadoop backend is used to crunch through the documents to prep for the interactive Jeopardy matches. There is no other system flexible enough to allow for the flexible knowledge extraction that we need.
- More videos for you. NDC 2011 Video Torrent, a torrent of all the NDC 2011 videos (Norwegian Developers Conference) is now available. If that's not enough here are videos from Jfokus 2011. Emil Eifrem talks NoSQL and there are talks on GWT, Scala, Java EE 6, and TDD.
- Performance is a Feature says Jeff Atwood. To be fast: Follow Yahoo's Guidelines, Optimize for Anonymous Users, Make Performance a Point of Public Pride.
- Intel takes wraps off 50-core supercomputing coprocessor plans reports Jon Stokes. It's the age-old general-purpose (slower, easier to use) vs. specialized (faster, harder to use) tradeoff, and Intel is betting that since Tesla has so far been the only real option there are plenty of potential users out there who are in the market for something less specialized.
- Lift - a web framework built on Scala to create concise, secure, scalable, highly interactive web applications that provide a full set of layered abstractions on top of HTTP and HTML.
- Greg Weber talks High Performance Ruby Part 3: non-blocking IO and web application scalability. What I am hoping the Ruby community can achieve is the same ease of programming as Rails, but with easier deployment and much better scalability. But let's step back and look at the situation we are in.
- Windows Azure Storage Abstractions and their Scalability Targets. A single queue is targeted to be able to process up to 500 messages per second. The target throughput of a single blob is up to 60 MBytes/sec. The throughput target for a single partition is up to 500 entities per second.
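Targets like these turn capacity planning into simple division: figure out your peak rate and round up. A back-of-envelope sketch using the per-queue and per-partition numbers from the post (the workload figures in the test are invented):

```python
import math

# Published per-unit targets from the post.
QUEUE_MSGS_PER_SEC = 500          # messages/sec per queue
PARTITION_ENTITIES_PER_SEC = 500  # entities/sec per partition

def queues_needed(peak_msgs_per_sec):
    # Round up: a fractional queue still means one more queue.
    return math.ceil(peak_msgs_per_sec / QUEUE_MSGS_PER_SEC)

def partitions_needed(peak_entities_per_sec):
    return math.ceil(peak_entities_per_sec / PARTITION_ENTITIES_PER_SEC)
```

The practical consequence is that partition keys have to be chosen so load actually spreads across that many partitions; the math is useless if one hot key takes all the traffic.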
- Jeremiah Peschka with a good overview of Resolving Conflicts in the Database. Some options: Manual intervention, Logging conflicts, Master write server, Last write wins, Write partitioning.
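"Last write wins" is the simplest of those options, and also the easiest to get burned by. A minimal sketch: each replica's write carries a timestamp, and merging keeps the newest value (concurrent writes within clock skew silently lose data, which is the known trade-off):

```python
def lww_merge(a, b):
    """a and b are (timestamp, value) pairs; return the survivor."""
    return a if a[0] >= b[0] else b

def resolve(versions):
    # Fold all replica versions down to a single winner.
    winner = versions[0]
    for v in versions[1:]:
        winner = lww_merge(winner, v)
    return winner
```

The other options in the list (logging conflicts, manual intervention) exist precisely because this one throws away the losing writes without telling anyone.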
- A unspoken law of the Internet is that all of Google's infrastructure must be recreated outside Google in open source form. GoldenOrb is doing their part by creating an open source version of Pregel, used for massive-scale graph processing. If you are unsure what Pregel is or how to use it, Michael Nielsen has a very good article on Pregel that's worth a look.
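Pregel's model is "think like a vertex": in each superstep a vertex reads its incoming messages, updates its value, and sends messages to neighbors; the computation halts when no messages remain. A toy sketch of that loop, propagating the maximum value through a graph (a hypothetical stand-in, not GoldenOrb's or Pregel's actual API):

```python
def pregel_max(graph, values):
    """graph: {vertex: [neighbors]}, values: {vertex: number}.
    Runs vertex-centric supersteps until no messages are in flight."""
    values = dict(values)
    active = set(graph)                   # every vertex starts active
    messages = {v: [] for v in graph}
    while active:
        outgoing = {v: [] for v in graph}
        for v in active:
            new_val = max([values[v]] + messages[v])
            changed = new_val != values[v]
            values[v] = new_val
            # Send on the first superstep, or whenever our value grew.
            if changed or not messages[v]:
                for n in graph[v]:
                    outgoing[n].append(new_val)
        messages = outgoing
        active = {v for v in graph if outgoing[v]}
    return values
```

The point of the model is that each superstep is embarrassingly parallel: every vertex only touches its own state and inbox, so the runtime can spread billions of vertices across a cluster.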
- Riak Pipe details shared. Pipe allows you to specify work in the form of a chain of function pairs. It will be used to supercharge Riak's MapReduce feature.
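The "chain of function pairs" shape is easy to picture in miniature. A hypothetical sketch, not Riak Pipe's Erlang API: each stage is a (process, finish) pair, where process handles each item and finish runs once when the stage's input stream ends:

```python
def run_pipe(stages, inputs):
    """stages: list of (process, finish) pairs.
    process(item) -> list of outputs; finish() -> trailing outputs."""
    stream = list(inputs)
    for process, finish in stages:
        out = []
        for item in stream:
            out.extend(process(item))   # per-item work for this stage
        out.extend(finish())            # end-of-stream work for this stage
        stream = out                    # feed the next stage
    return stream
```

The finish hook is what makes a MapReduce-style reducer expressible: it can accumulate during process calls and emit everything at end-of-stream.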
- Learn more about Parallel Algorithms from Guy Blelloch's 15-499: Parallel Algorithms course.
- A diagram of Amazon's Multi-AZ setup.
Reader Comments (2)
Amazon is lying about AWS, and has been from the beginning. When they first announced AWS they said that you could "rent the same infrastructure that Amazon.com uses!" That was a lie, unless by "same infrastructure" you mean hosting in the same data centers as a small fraction of Amazon.com, using the same hardware configuration. None of the custom services, such as S3, were derived from work for Amazon.com, and Amazon.com at that time didn't use any of them.
If Werner Vogels is now saying that 100% of Amazon.com is served by AWS, that is either a complete lie, or he means that they are provisioning generic servers (EC2 VPSs) for Amazon.com.
The retail operation uses software that is so crufty and so... massive... that there hasn't been enough time to re-write it to *really* be based on AWS's custom services.
Amazon is a dishonest company with great customer service. People like them because they treat their customers well. But their engineering is crappy, they do not treat their employees well, and they are not a "high tech" company by any means... they are a sales organization. Their primary product is bullshit and business is good.
The AWS infrastructure has great lock-in, is way overpriced, and has proven to not be reliable. (Has everyone forgotten the unforgivable outage they just had?) It astounds me that people spend time writing software to get around the terrible performance of EBS... rather than saving time and money and going with any number of competitive hosts. But then, I shouldn't be surprised that people are stupid.
If anyone from Amazon wishes to dispute what I've said, please forward my comments to your legal department. I'd love to back up my claims publicly in court.
Maybe it's just me (and other people who work at large, non-NoSQL, and thus not-so-cool companies). But most of these numbers never seem very impressive. Granted, if my personal website hit those numbers I'd be stoked (and broke).
But our CDN does 9Gb/s during off hours and over 13Gb/s during peak hours. Likewise we're serving 2,500-4,000 page views a second on our main site, our server farm maintains 1Gb/s+ of just compressed HTML output, and our banners cluster serves over a billion impressions a day. And that's just normal operations for us.
So far none of the NoSQL stuff I've tested out has been able to scale to something useful enough for us. Which is a shame since I really like where some of it's going. We're still just plain old LAMP (MySQL/mod_perl).