Entries by HighScalability Team (1576)

Friday
Feb 04, 2011

Stuff The Internet Says On Scalability For February 4, 2011

Submitted for your reading pleasure...

  • Super Bowl Prediction: Pittsburgh 27, Green Bay 24. I'll be rooting for Green Bay, but the Pittsburgh defense will eventually win the day, beating back the fleet footed, quick tossing, and sharp shooting Aaron Rodgers. Roethlisberger will make exactly 3 plays that matter, but they'll be the right 3 plays.
  • Reddit is now at 1 billion page views a month. Congratulations!
  • Amazon S3 Cloud Stores 262 Billion Objects.  My god, it's full of stars...
  • Quora’s Technology Examined by Phil Whelan. Excellent detective work answering the question: How Does Quora Work?
  • Quotable Quotes:
    • @timoreilly: When hardware became commoditized, software was valuable. Now that software being commoditized, data is valuable. #strataconf
    • @coldfusionPaul: "Write someone a query, they'll go away for a day. Teach someone to query, they'll just go away." so, I use #NoSQL 555
    • @squarecog: To go *really* fast, you want to get rid of spokes in your wheels, and ditch tires. Also, turning is overrated. #nosql

Click to read more ...

Wednesday
Feb 02, 2011

Piccolo - Building Distributed Programs that are 11x Faster than Hadoop

Piccolo (not this or this) is a system for distributed computing. Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centers. Unlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Where traditional data-centric models (such as Hadoop) present the user a single object at a time to operate on, Piccolo exposes a global table interface which is available to all parts of the computation simultaneously. This allows users to specify programs in an intuitive manner very similar to that of writing programs for a single machine.
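To make the shared-table idea concrete, here is a minimal single-process sketch in Python. It is not Piccolo's actual API: the `PartitionedTable` class, the `accumulate` function used to merge updates to the same key, and the `pagerank_kernel` helper are all hypothetical names invented for illustration, and the "machines" are simulated by running two kernel instances one after the other.

```python
from typing import Callable, Dict, List


class PartitionedTable:
    """Toy stand-in for a shared, mutable key-value table split into partitions."""

    def __init__(self, num_partitions: int,
                 accumulate: Callable[[float, float], float]) -> None:
        self.partitions: List[Dict[str, float]] = [{} for _ in range(num_partitions)]
        # Assumption of this sketch: updates to the same key are merged
        # with a user-supplied function rather than overwritten.
        self.accumulate = accumulate

    def _partition(self, key: str) -> Dict[str, float]:
        return self.partitions[hash(key) % len(self.partitions)]

    def update(self, key: str, value: float) -> None:
        part = self._partition(key)
        part[key] = self.accumulate(part[key], value) if key in part else value

    def get(self, key: str, default: float = 0.0) -> float:
        return self._partition(key).get(key, default)


def pagerank_kernel(graph_slice: Dict[str, List[str]], curr: PartitionedTable,
                    nxt: PartitionedTable, damping: float = 0.85) -> None:
    """One kernel instance: reads any key from `curr`, writes any key in `nxt`."""
    for page, out_links in graph_slice.items():
        share = damping * curr.get(page) / len(out_links)
        for target in out_links:
            nxt.update(target, share)


def run_iteration() -> Dict[str, float]:
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    add = lambda old, new: old + new
    curr, nxt = PartitionedTable(2, add), PartitionedTable(2, add)
    for page in graph:
        curr.update(page, 1.0 / len(graph))
    # Two kernel instances, each owning a slice of the graph.
    for graph_slice in ({"a": graph["a"]}, {"b": graph["b"], "c": graph["c"]}):
        pagerank_kernel(graph_slice, curr, nxt)
    for page in graph:
        nxt.update(page, (1 - 0.85) / len(graph))  # teleport term
    return {page: nxt.get(page) for page in graph}


ranks = run_iteration()
```

The point of the sketch is that each kernel writes to whichever key it needs, not only to keys it "owns", which is what distinguishes the table model from handing each worker one object at a time.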

Using an in-memory key-value store is a very different approach from the canonical map-reduce, which is based on using distributed file systems. The results are impressive:

Experiments have shown that Piccolo is fast and provides excellent scaling for many applications. The performance of PageRank and k-means on Piccolo is 11× and 4× faster than that of Hadoop. Computing a PageRank iteration for a 1 billion-page web graph takes only 70 seconds on 100 EC2 instances. Our distributed webcrawler can easily saturate a 100 Mbps internet uplink when running on 12 machines.

Piccolo was presented at OSDI '10. For the paper, take a look at Piccolo: Building Fast, Distributed Programs with Partitioned Tables. There's also a slide deck and a video of the talk (very good).

Click to read more ...

Tuesday
Feb 01, 2011

Google Strategy: Tree Distribution of Requests and Responses

If a large number of leaf node machines send requests to a central root node then that root node can become overwhelmed:

  • The CPU becomes a bottleneck, for either processing requests or sending replies, because it can't possibly deal with the flood of requests.
  • The network interface becomes a bottleneck because a wide fan-in causes TCP drops and retransmissions, which add latency. Clients then start retrying requests, which quickly causes a death spiral in an undisciplined system.

One solution to this problem is a strategy given by Dr. Jeff Dean, Head of Google's School of Infrastructure Wizardry, in this Stanford video presentation: Tree Distribution of Requests and Responses.

Instead of having a root node connected to leaves in a flat topology, the idea is to create a tree of nodes. So a root node talks to a number of parent nodes and the parent nodes talk to a number of leaf nodes. Requests are pushed down the tree through the parents and only hit a subset of the leaf nodes.
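The difference in fan-in can be sketched in a few lines of Python. This is only an illustration under stated assumptions: leaves are modeled as plain functions, the names `flat_query`, `tree_query` and `fanout` are hypothetical, and the sketch assumes parents combine their leaves' responses before replying to the root, a detail the excerpt above does not spell out.

```python
from typing import Callable, List, Tuple

Leaf = Callable[[str], int]


def chunk(items: List[Leaf], size: int) -> List[List[Leaf]]:
    """Split the leaves into groups of at most `size`, one group per parent."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def flat_query(leaves: List[Leaf], request: str) -> Tuple[int, int]:
    """Flat topology: the root receives one response per leaf."""
    responses = [leaf(request) for leaf in leaves]
    return sum(responses), len(responses)  # (result, root fan-in)


def tree_query(leaves: List[Leaf], request: str, fanout: int) -> Tuple[int, int, int]:
    """Tree topology: each parent handles `fanout` leaves and replies once."""
    parent_results = []
    max_parent_fan_in = 0
    for group in chunk(leaves, fanout):
        partial = [leaf(request) for leaf in group]
        max_parent_fan_in = max(max_parent_fan_in, len(partial))
        parent_results.append(sum(partial))  # assumed: parent aggregates
    return sum(parent_results), len(parent_results), max_parent_fan_in


# 1000 leaves that each report a single matching document.
leaves = [lambda request: 1 for _ in range(1000)]
flat_total, flat_fan_in = flat_query(leaves, "q")
tree_total, root_fan_in, parent_fan_in = tree_query(leaves, "q", fanout=32)
```

With 1000 leaves and a fanout of 32, the root in the flat layout receives 1000 responses, while in the tree layout neither the root nor any parent receives more than 32.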

With this solution:

Click to read more ...

Tuesday
Feb 01, 2011

Sponsored Post: Karmasphere, Kabam, Opera Solutions, Percona, Appirio, Newrelic, Cloudkick, Membase, EA, Joyent, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

  • Percona Live to be held in San Francisco February 16th, 2011. A one-day event run by the experts behind the MySQL Performance Blog.
  • A new round of Membase meetups has been planned for January 2011 for San Diego, Denver, Seattle, Vancouver and Chicago.

Cool Products and Services

Click to read more ...

Friday
Jan 28, 2011

Stuff The Internet Says On Scalability For January 28, 2011

Submitted for your reading pleasure...

Thursday
Jan 27, 2011

Comet - An Example of the New Key-Code Databases

Comet is an active distributed key-value store built at the University of Washington. The paper describing it is Comet: An active distributed key-value store; there are also slides and an MP3 of a presentation given at OSDI '10. Here's a succinct overview of Comet:

Today's cloud storage services, such as Amazon S3 or peer-to-peer DHTs, are highly inflexible and impose a variety of constraints on their clients: specific replication and consistency schemes, fixed data timeouts, limited logging, etc. We witnessed such inflexibility first-hand as part of our Vanish work, where we used a DHT to store encryption keys temporarily. To address this issue, we built Comet, an extensible storage service that allows clients to inject snippets of code that control their data's behavior inside the storage service.

I found this paper quite interesting because it takes the initial steps of colocating code with a key-value store, which turns it into what might be called a key-code store. This is something I've been exploring as a way of moving behavior to data in order to overcome network limitations in the cloud and provide other benefits. An innovator in this area is the Alchemy Database, which has already combined Redis and Lua. A good platform for this sort of thing might be Node.js integrated with V8. This would allow complex JavaScript programs to run in an efficient evented container. This sort of architecture has a lot of implications (more about that later), but the Comet paper describes a very interesting start.
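As a rough illustration of what "key-code" means, here is a toy in-memory store in Python where a client attaches a small handler to a value and the store runs it on every read. This is not Comet's interface or Alchemy's: `KeyCodeStore`, `ActiveObject`, the `on_get` hook and the `read_once` handler are hypothetical names made up for this sketch.

```python
from typing import Any, Callable, Dict, Optional


class ActiveObject:
    """A stored value plus the client-supplied code that governs it."""

    def __init__(self, value: Any,
                 on_get: Optional[Callable[["ActiveObject"], Any]] = None) -> None:
        self.value = value
        self.on_get = on_get
        self.expired = False  # a handler may set this to ask for deletion


class KeyCodeStore:
    """Toy key-value store that runs the object's handler inside the store."""

    def __init__(self) -> None:
        self._data: Dict[str, ActiveObject] = {}

    def put(self, key: str, value: Any,
            on_get: Optional[Callable[[ActiveObject], Any]] = None) -> None:
        self._data[key] = ActiveObject(value, on_get)

    def get(self, key: str) -> Any:
        obj = self._data.get(key)
        if obj is None:
            return None
        # The behavior lives next to the data: no extra network round trip.
        result = obj.on_get(obj) if obj.on_get else obj.value
        if obj.expired:
            del self._data[key]
        return result


def read_once(obj: ActiveObject) -> Any:
    """Handler: hand out the value a single time, then self-destruct."""
    obj.expired = True
    return obj.value


store = KeyCodeStore()
store.put("session-key", "s3cret", on_get=read_once)
first = store.get("session-key")
second = store.get("session-key")
```

Here the first read returns the value and the second returns `None`, a policy the client chose per object instead of one fixed by the storage service.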

From the abstract and conclusion:

Click to read more ...

Wednesday
Jan 26, 2011

Google Pro Tip: Use Back-of-the-envelope-calculations to Choose the Best Design

How do you know which is the "best" design for a given problem? If, for example, you were given the problem of generating an image search results page of 30 thumbnails, would you load images sequentially? In parallel? Would you cache? How would you decide?

If you could harness the power of the multiverse you could try every possible option in the design space and see which worked best. But that's crazy impractical, isn't it?

Another option is to consider the order (the big-O complexity) of the various algorithm alternatives. As a prophet for the Golden Age of Computational Thinking, Google would definitely do this, but what else might Google do?

Use Back-of-the-envelope Calculations to Evaluate Different Designs

Jeff Dean, Head of Google's School of Infrastructure Wizardry and instrumental in many of Google's key systems (ad serving, BigTable, search, MapReduce, ProtocolBuffers), advocates evaluating different designs using back-of-the-envelope calculations.
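For the 30-thumbnail question above, such a calculation might look like the following Python sketch. The input numbers are assumptions chosen for illustration (a 10 ms disk seek, 30 MB/s sequential read, 256 KB per image, one image per disk in the parallel case); they are not given in this excerpt, and the function names are hypothetical.

```python
# Assumed rough costs; swap in measurements from your own system.
SEEK_MS = 10.0          # one disk seek
READ_MB_PER_S = 30.0    # sequential read throughput
IMAGE_KB = 256          # size of one stored image
THUMBNAILS = 30         # images on the results page


def read_one_image_ms() -> float:
    """Seek to the image, then stream it off disk."""
    transfer_ms = (IMAGE_KB / 1024) / READ_MB_PER_S * 1000
    return SEEK_MS + transfer_ms


def serial_ms() -> float:
    """Design 1: read the images one after another."""
    return THUMBNAILS * read_one_image_ms()


def parallel_ms() -> float:
    """Design 2: issue all reads at once, assuming each hits a different disk
    and ignoring variance between disks."""
    return read_one_image_ms()


estimate_serial = serial_ms()      # roughly 550 ms
estimate_parallel = parallel_ms()  # roughly 18 ms
```

Under these assumptions the serial design costs about 550 ms and the parallel one about 18 ms, a 30x gap that is visible before any code is written. Caching could be compared the same way by adding its own assumed cost.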

Click to read more ...

Thursday
Jan 20, 2011

75% Chance of Scale - Leveraging the New Scaleogenic Environment for Growth

"I'll never need to scale so why bother? We aren't Twitter or Facebook or Google after all." This is the most common email I get, a question in the form of a thinly disguised rationalization for not having to worry about scaling. And in these days of giant transformer-like machines they are probably right. But what if there are Barry Bonds enhancing type forces at work that argue for the chances of your needing to scale being higher than you think?

And if that happens, how will you cross the scalability chasm? Will you want to completely change your architecture or evolve it from a tool-chain that was meant to scale from the start? Architecturally, that's the question you have to answer. Today's tool-chains are making it possible to grow a system from small to large without needing to implement complete architectural phase changes at various scale inflection points, but that's a different topic. Here the question is why you may actually need to scale.

Tumblr is a good example of a product that grew beyond expectation because they managed both to execute and harness powerful growth factors. Tumblr is a "light" blogging service that probably didn't think they were Twitter or Facebook or Google either, but need to scale they did. From Tumblr:

Click to read more ...

Wednesday
Jan 19, 2011

Sponsored Post: Percona, Appirio, Newrelic, Cloudkick, Membase, EA, Joyent, CloudSigma, ManageEngine, Site24x7

Who's Hiring?

Fun and Informative Events

  • Percona Live to be held in San Francisco February 16th, 2011. A one-day event run by the experts behind the MySQL Performance Blog.
  • A new round of Membase meetups has been planned for January 2011 for San Diego, Denver, Seattle, Vancouver and Chicago.
  • O'Reilly Strata: Making Data Work conference, February 1-3, 2011, Santa Clara, CA. Strata is a new conference from O'Reilly, focusing on the business and practice of data.

Cool Products and Services

Tuesday
Jan 18, 2011

Paper: Relational Cloud: A Database-as-a-Service for the Cloud

The Relational Cloud Project is an effort by a group of researchers at MIT to investigate technologies and challenges related to Database-as-a-Service in cloud computing. They are trying to figure out how the advantages of the DaaS (Database-as-a-Service) model, which we've seen arise in other areas like OLAP and NoSQL, can be applied to relational databases. The DaaS advantages as they see them are: 1) predictable costs, proportional to the quality of service and actual workloads, 2) lower technical complexity, thanks to a unified and simplified service access interface, and 3) virtually infinite resources ready at hand. Their approach is described in the paper Relational Cloud: A Database-as-a-Service for the Cloud. From the abstract:

Click to read more ...