Stuff The Internet Says On Scalability For March 16, 2012
Friday, March 16, 2012 at 9:15AM
HighScalability Team in hot links
HighScalability is What We Do:
454,400: Number of Amazon servers; 45PB: Facebook Data Warehouse, grows exponentially; 5 Atoms: Ultimate limit of thermodynamics; YouTube: 4 billion views/day, 60 hours of video uploaded every minute, revenue doubled in 2010
Quotable quotes:
@adrianco: Walmart labs run large single region Cassandra clusters with Intel SSDs and have been in production for two years. Working well for them.
Cassandra and Solid State Drives. DataStax's Rick Branson with a sweet explanation of how Cassandra was built for a world of spinning disks, which means it only writes sequentially, which turns out to be a good way to use SSDs too.
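The sequential-write idea can be sketched in a few lines. This is a toy illustration only, loosely modeled on the buffer-in-memory-then-flush pattern, and is not Cassandra's code; the class name `TinyLogStore`, its methods, and the file layout are all hypothetical, and keys and values are assumed to contain no tabs or newlines.

```python
import os
import tempfile


class TinyLogStore:
    """Toy store: buffer writes in memory, flush them as one sequential file write."""

    def __init__(self, directory):
        self.directory = directory
        self.memtable = {}   # recent writes, held in memory
        self.segments = []   # paths of immutable, sorted files (oldest first)

    def put(self, key, value):
        # Writes only touch memory; nothing on disk is modified in place.
        self.memtable[key] = value

    def flush(self):
        """Write the buffered data in key order as a new file; old files are never edited."""
        if not self.memtable:
            return
        path = os.path.join(self.directory, f"segment-{len(self.segments)}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self.memtable):
                f.write(f"{key}\t{self.memtable[key]}\n")
        self.segments.append(path)
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check memory first, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.segments):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    k, v = line.rstrip("\n").split("\t", 1)
                    if k == key:
                        return v
        return None


store = TinyLogStore(tempfile.mkdtemp())
store.put("user:1", "alice")
store.flush()
store.put("user:1", "alicia")  # an update becomes a new write, not an in-place edit
store.flush()
```

Because an update is appended as a new segment instead of overwriting the old one, the disk only ever sees sequential writes; a real system would also need to merge old segments over time.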
Improving Performance by 1000x. Josiah Carlson explains how they went from a slow and expensive follower list storage solution in Redis to custom-built code that, by removing hash tables and shrinking storage overhead, became 1000x faster.
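As a rough illustration of trading hash tables for compact storage, here is a minimal sketch that packs follower IDs into a sorted fixed-width array and looks them up with binary search. It is not the code from the article; the function names are hypothetical and IDs are assumed to fit in unsigned 64-bit integers.

```python
import bisect
from array import array


def build_follower_list(follower_ids):
    """Pack IDs into a sorted, fixed-width array of unsigned 64-bit ints."""
    return array("Q", sorted(set(follower_ids)))


def is_follower(packed, user_id):
    """Binary search in the sorted array instead of hashing."""
    i = bisect.bisect_left(packed, user_id)
    return i < len(packed) and packed[i] == user_id


followers = build_follower_list([42, 7, 1001, 7])
```

Each entry costs exactly `followers.itemsize` bytes with no per-entry object or bucket overhead, at the price of O(log n) lookups and more expensive inserts.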
Adrian Cockcroft with an epic slide deck on all things Netflix on the cloud. Netflix has the most evolved architecture (that we know of) on the cloud, and here it all is in one presentation.
Where does Big Data meet Big Database. Really good talk by Ben Stopford on the Big Data landscape. The conclusion, pick the right tool for the job, isn't new, but he takes a well-thought-out path to get there.
Optimize Performance and Scalability with Parallelism and Concurrency. Bob Hancock with an epic talk on how the operating system handles your requests and on design principles for using concurrency and parallelism to optimize your program's performance and scalability. It covers processes, threads, generators, coroutines, non-blocking IO, and the gevent library.
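To make the generator and coroutine part concrete, here is a minimal sketch of cooperative multitasking using plain Python generators. It is not taken from the talk and does not use gevent; `worker` and `run_round_robin` are hypothetical names chosen for illustration.

```python
from collections import deque


def worker(name, steps, log):
    """A task that hands control back to the scheduler after each step."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield


def run_round_robin(tasks):
    """Run generator-based tasks one step at a time until all have finished."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)           # run the task until its next yield
        except StopIteration:
            continue             # task is done; do not reschedule it
        queue.append(task)       # otherwise put it at the back of the line


log = []
run_round_robin([worker("a", 2, log), worker("b", 1, log)])
```

The two tasks interleave on a single thread: each `yield` is a voluntary switch point, which is the same idea that non-blocking IO libraries build on.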
Speeding up Mongoose queries by requesting only the fields you need. Nick Fishman explains why returning a subset of fields yields such a big performance improvement: The problem isn’t so much that MongoDB can’t return the data quickly enough. Rather, Node.js has to spend much of its time parsing extra JSON into JavaScript objects, which is both unnecessary and time-consuming.
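In Mongoose the field selection is passed to the query (for example with `select`), so the trimming happens before the data reaches Node.js. The Python sketch below is not Mongoose code; it only illustrates the underlying effect, that a projected document gives the client far less JSON to parse. The document contents and the `project` helper are hypothetical.

```python
import json


def project(document, fields):
    """Keep only the requested top-level fields of a document."""
    return {name: document[name] for name in fields if name in document}


# Hypothetical user document with one large field the caller does not need.
full_doc = {
    "_id": 1,
    "name": "Ada",
    "email": "ada@example.com",
    "history": list(range(5000)),
}

full_payload = json.dumps(full_doc)
small_payload = json.dumps(project(full_doc, ["name", "email"]))

# The client parses whatever it receives; fewer bytes means less parsing work.
parsed = json.loads(small_payload)
```

Comparing `len(full_payload)` with `len(small_payload)` shows how much of the parsing work was spent on a field the caller never reads.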
Spark - Lightning-Fast Cluster Computing: Spark provides an abstraction called resilient distributed datasets (RDDs) to support cluster programming applications efficiently. RDDs are stored in memory between queries (as long as enough RAM is available), without requiring replication for fault tolerance.
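One way to picture fault tolerance without replication is a dataset that remembers how it was derived and recomputes itself when its in-memory copy is lost. The sketch below is a single-process toy under that assumption, not Spark's API; `LineageDataset` and its methods are hypothetical names.

```python
class LineageDataset:
    """Toy dataset that remembers how it was derived, in the spirit of an RDD."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # base data, assumed to live in durable storage
        self.parent = parent   # dataset this one was derived from
        self.fn = fn           # transformation applied to the parent's items
        self.cache = None      # in-memory copy, kept between queries

    def map(self, fn):
        # Record the transformation instead of computing it eagerly.
        return LineageDataset(parent=self, fn=fn)

    def collect(self):
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = [self.fn(x) for x in self.parent.collect()]
        return self.cache

    def lose_cache(self):
        """Simulate a node failure that drops the in-memory copy."""
        self.cache = None


base = LineageDataset(source=[1, 2, 3])
squares = base.map(lambda x: x * x)
first = squares.collect()
squares.lose_cache()
recomputed = squares.collect()  # rebuilt from lineage, not from a replica
```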
Fortress - worth a look if you are interested in languages specially designed to program peta-scale supercomputers and distributed systems. Does it have enough "worse is better" to succeed? Also, How to Think about Parallel Programming: Not!
Transactional Memory Everywhere: 2012 Update for HTM. Transactional memory isn't for everything, but what is it for: HTM is likely to be at its best for large in-memory data structures that are difficult to statically partition but that are dynamically partitionable, in other words, the conflict probability is reasonably low.
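To get a feel for "reasonably low" conflict probability, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is not from the article: two transactions each touch `k_touched` distinct elements chosen uniformly at random from a structure of `n_elements`, and any overlap counts as a conflict.

```python
from math import comb


def conflict_probability(n_elements, k_touched):
    """Chance that two transactions, each touching k random elements of n, overlap."""
    if 2 * k_touched > n_elements:
        return 1.0  # not enough elements for two disjoint sets
    # Probability the second set avoids every element of the first, then complement.
    disjoint = comb(n_elements - k_touched, k_touched) / comb(n_elements, k_touched)
    return 1.0 - disjoint


small_structure = conflict_probability(10, 1)
large_structure = conflict_probability(1_000_000, 4)
```

Under this model the conflict chance falls roughly as k²/n, which matches the intuition that large, dynamically partitionable structures are where transactions rarely collide.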
Ad targeting at Yahoo. Greg Linden with a good review of a paper by Yahoo on their ad targeting. Daily isn't real-time.
Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.