Entries by HighScalability Team (1576)

Thursday
Oct212010

Machine VM + Cloud API - Rewriting the Cloud from Scratch

Write a little "Hello World" program these days and it runs inside a bewildering Russian Doll of nested environments, each layer adding its own special performance and complexity tax. First, a language executes in its own environment of data structure libraries, memory management, and so on. That, more often than not, will run inside a language VM like the JVM, CLR, or V8. The language VM will in-turn run inside a process that runs inside an OS. An application will run in one or more threads inside a process. And the whole thing will run inside a machine sharing VM layer like Xen. And across all of that are frameworks for monitoring, elasticity, storage, and so on. That's a lot of overhead for a such a little program.

What if we could remove all these taxes and run directly on the new bare metal, which some consider to be a combination of Machine VM + Cloud API? That's exactly what a system called Mirage, described in the paper Turning down the LAMP: Software Specialisation for the Cloud, sets out to do by treating the cloud virtual hardware as a compiler target, and converting high-level language source code directly into kernels that run on it.

Click to read more ...

Tuesday
Oct192010

Sponsored Post: Playfish, Electronic Arts, Tagged, Undertone, Box.net, Wiredrive, Joyent, DeviantART, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

  • Membase Meetups Coming to Major US Cities. The first of these technical meetups is on October 28 at Zynga’s San Francisco offices.

Cool Products and Services

Click to read more ...

Friday
Oct152010

Troubles with Sharding - What can we learn from the Foursquare Incident?

For everything given something seems to be taken. Caching is a great scalability solution, but caching also comes with problems. Sharding is a great scalability solution, but as Foursquare recently revealed in a post-mortem about their 17 hours of downtime, sharding also has problems. MongoDB, the database Foursquare uses, also contributed their post-mortem of what went wrong too.

Now that everyone has shared and resharded, what can we learn to help us skip these mistakes and quickly move on to a different set of mistakes?

Click to read more ...

Friday
Oct082010

4 Scalability Themes from Surgecon

Robert Haas in his SURGE Recap of the Surge conference, reflected a bit, and came up with an interesting checklist of general themes from what he was seeing. I'm directly quoting his post, so please see the post for a full discussion. He uses this framework to think about the larger picture and where PostgreSQL stands in its progression.

  1. Make use of the academic literature. Inventing your own way to do something is fine, but at least consider the possibility that someone smarter than you has thought about this problem before.
  2. Failures are inevitable, so plan for them.  Try to minimize the possibility of cascading failures, and plan in advance how you can operate in degraded mode if disaster (or the Slashdot effect) strikes.
  3. Disk technology matters. Drive firmware bugs are common and nightmarish, and you can expect very limited help from the manufacturer, especially if the drive is billed as consumer-grade rather than enterprise-grade. SSDs can save you a lot of money, both because a given number of dollars buys more IOs-per-second, and because electricity isn't free.
  4. Large data sets require horizontal scalability.  In the era of 1TB drives, "large" doesn't mean quite what it used to,  but even though the amount of data you can manage with one machine is growing all the time, the amount of data people want to manage is growing even faster.
Thursday
Oct072010

Hot Scalability Links For Oct 8, 2010

Tuesday
Oct052010

Sponsored Post: Box.net, Wiredrive, Joyent, DeviantART, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Cool Products and Services

Click to read more ...

Monday
Oct042010

Paper: An Analysis of Linux Scalability to Many Cores  

An Analysis of Linux Scalability to Many Cores, by a number of MIT researchers, is a refreshingly practical paper on what it takes to scale Linux and common applications like Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce to run on 48 core systems. A very timely paper given moderately massive multicore systems are reportedly the near future of computing.

This paper must have taken a lot of work. They both tracked down bottlenecks in a number of applications and the Linux kernel and they also tried to fix them. Modestly speaking the authors said they made "modest" changes to the kernel and applications, but there's nothing modest about what they did. It's excellent work.

After the next bit, which is the abstract, there is a list of the problems they found and how they fixed them.

Click to read more ...

Friday
Oct012010

Hot Scalability Links For Oct 1, 2010

Click to read more ...

Friday
Oct012010

Google Paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications

This paper, Large-scale Incremental Processing Using Distributed Transactions and Notifications by Daniel Peng and Frank Dabek, is Google's much anticipated description of Percolator, their new real-time indexing system.

The abstract:

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

 

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%. 
Thursday
Sep302010

More Troubles with Caching

As a tasty pairing with Facebook And Site Failures Caused By Complex, Weakly Interacting, Layered Systems, is another excellent tale of caching gone wrong by Peter Zaitsev, in an exciting twin billing: Cache Miss Storm and More on dangers of the caches. This is fascinating case where the cause turned out to be software upgrade that ran long because it had to be rolled back. During the long recovery time many of the cache entries timed out. When the database came back, slam, all the clients queried the database to repopulate the cache and bad things happened to the database. The solution was equally interesting: 

Click to read more ...