Friday, April 18, 2014

Stuff The Internet Says On Scalability For April 18th, 2014

Hey, it's HighScalability time:


Scaling to the top of "Bun Mountain" in Hong Kong
  • 44 trillion gigabytes: size of the digital universe by 2020; 6x: we have six times more "stuff" than the generation before us.
  • Quotable Quotes:
    • Facebook: Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.
    • @windley: The problem with the Internet of Things is right now it’s more like the CompuServe of Things
    • Chip Overclock: If you want to eventually generate revenue, you must first optimize for developer productivity; everything else is negotiable.
    • @igrigorik: if you are gzipping your images.. you're doing it wrong: http://bit.ly/1pb8JhK  - check your server configs! and your CDN... :)
    • Seth Lloyd: The arrow of time is an arrow of increasing correlations.
    • @kitmacgillivray: When will Google enable / require all android apps to have full deep search integration so all content is visible to the engine?
    • Neal Ford: Yesterday's best practices become tomorrow's anti-patterns.
    • Rüdiger Möller: just made a quick sum up of concurrency related issues we had in a 7-15 developer team soft realtime application (clustered inmemory data grid + GUI front end). 95% of threads created are not to improve throughput but to avoid blocking (e.g. IO, don't block messaging receiver thread, avoid blocking the event thread in a GUI app, ..).
    • Ansible: When done correctly, automation tools are giving them time back -- and helping out of this problem of needing to wear many hats.

  • Amazon and Google are in an epic battle to dominate the cloud—and Amazon may already have won: If Amazon’s entire public cloud were a single computer, it would have five times more capacity than those of its next biggest 14 competitors—including Google—combined. Every day, one-third of people who use the internet visit a site or use a service running on Amazon’s cloud.

  • What books would you select to help sustain or rebuild civilization? Here's Neal Stephenson’s list. He was too shy to do so, but I would certainly add his books to my list. 

  • 12-Step Program for Scaling Web Applications on PostgreSQL. Great write-up of lessons learned by wanelo.com, which they used to sustain tens of thousands of concurrent users at 3K req/sec. Highly detailed. 1) Add more cache, 2) Optimize SQL, 3) Upgrade hardware and RAM, 4) Scale reads by replication, 5) Use more appropriate tools, 6) Move write-heavy tables out, 7) Tune Postgres and your file system, 8) Buffer and serialize frequent updates, 9) Optimize DB schema, 10) Shard busy tables vertically, 11) Wrap busy tables with services, 12) Shard services backend horizontally.
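
To make step 4 concrete, here's a minimal Go sketch of splitting reads and writes across a primary and a replica at the application layer. The lib/pq driver and the DSN parameters are assumptions for illustration, not anything from the wanelo write-up:

```go
package dbsplit

import (
	"database/sql"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

// DB sends writes to the primary and reads to a replica.
type DB struct {
	primary *sql.DB
	replica *sql.DB
}

// Open connects to both nodes; the DSNs are placeholders.
func Open(primaryDSN, replicaDSN string) (*DB, error) {
	p, err := sql.Open("postgres", primaryDSN)
	if err != nil {
		return nil, err
	}
	r, err := sql.Open("postgres", replicaDSN)
	if err != nil {
		return nil, err
	}
	return &DB{primary: p, replica: r}, nil
}

// Exec routes writes to the primary.
func (db *DB) Exec(query string, args ...interface{}) (sql.Result, error) {
	return db.primary.Exec(query, args...)
}

// Query routes reads to the replica. Caveat: replication lag means a
// read issued right after a write may not see that write.
func (db *DB) Query(query string, args ...interface{}) (*sql.Rows, error) {
	return db.replica.Query(query, args...)
}
```

In production this routing often lives in a proxy layer rather than application code, but the idea is the same.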

  • Is this a soap opera? It turns out Google, not Facebook, is buying Titan Aerospace, maker of high-flying solar-powered drones. Google's fiber network would make a great backbone for a drone- and Loon-powered wireless network, completely routing around the telcos.

  • Building Carousel, Part I: How we made our networked mobile app feel fast and local. Dropbox moved to an eventually consistent, optimistic-replication syncing model to make their app "feel fast, responsive, and local, even though the data on which users operate is ultimately backed by the Dropbox servers." Lesson: don't block the user by propagating changes to the server synchronously. Local and remote photos are treated as equivalent objects. Actions take effect immediately on the device and then work their way out globally, with change events keeping local and remote state consistent. A fast hash is used to tell which photos have not been backed up to Dropbox, and remote operations happen asynchronously.
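
The core of that pattern fits in a few lines. A hypothetical Go sketch (none of these names are Dropbox's): apply the action locally right away, then let a background goroutine push it to the server.

```go
package main

import (
	"fmt"
	"time"
)

// Action is any user edit: hide a photo, create an album, etc.
type Action struct{ Desc string }

// Syncer applies actions locally at once and propagates them to the
// server asynchronously, so the UI never blocks on the network.
type Syncer struct {
	pending chan Action
}

func NewSyncer() *Syncer {
	s := &Syncer{pending: make(chan Action, 1024)}
	go s.loop() // background uploader
	return s
}

// Apply returns immediately; the server catches up later.
func (s *Syncer) Apply(a Action) {
	applyLocally(a) // user sees the change instantly
	s.pending <- a  // queued for remote propagation
}

func (s *Syncer) loop() {
	for a := range s.pending {
		sendToServer(a) // retry/conflict handling omitted
	}
}

func applyLocally(a Action) { fmt.Println("local: ", a.Desc) }
func sendToServer(a Action) { fmt.Println("remote:", a.Desc) }

func main() {
	s := NewSyncer()
	s.Apply(Action{Desc: "hide photo 42"})
	time.Sleep(50 * time.Millisecond) // give the uploader a moment
}
```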

  • The Rise and Fall of AIM, the Breakthrough AOL Never Wanted. An amazing story of what happens when you have it all and then don't trust your people. Here's the scalability angle: Eric Bosco joined AOL in August 1996. Two months later, AOL switched from an hourly rate to a flat fee, so people could suddenly spend as long as they liked online, and AOL's infrastructure had trouble handling the transition. Despite the effort, Bosco said they never sold a dollar of ad space on AIM and instead just ran AOL promos. To a subscription business, a free program that could not be monetized was worthless. AOL was not about to change its entire business model for AIM.

  • All the videos for the Fluent Conference 2014 are now available. 

  • Considering a low-end platform? HN has a fruitful discussion: The New Linode Cloud: SSDs, Double RAM and much more. rjknight with the gist: It looks like Linode are still leaving the "incredibly cheap tiny box" market to DO. Linode's cheapest option is $20/month, which makes it slightly less useful for the kind of "so cheap you don't even think about it" boxes that DO provide.

  • Jeff Darcy sacrifices some unicorns in his solution to fix GlusterFS replication split brain problems. Sometimes it really does take blood magic to make things work.

  • Interesting example of a company moving from a fully distributed to a centralized architecture. Spotify is shutting down its massive P2P network. Why? It's not clear. One thought is they want to control the quality of the experience end-to-end, and they are now large enough to afford the necessary machines and bandwidth. Bandwidth is now much cheaper, especially at larger scales. Perhaps they want to keep all the BigData? Perhaps the move to mobile, where P2P is not allowed, motivated the push to centralization? Perhaps to add centralized DRM? Perhaps to stop being banned by organizations that don't like P2P? Or perhaps it's just easier to manage? Google Finds: Centralized Control, Distributed Data Architectures Work Better Than Fully Decentralized Architectures.

  • Lucas Gonze makes an insightful point about Spotify on the decentralization email list: The difference with remaining P2P apps like Bitcoin and BitTorrent Sync is that P2P is necessary for the value prop rather than an implementation detail. Clay Shirky echoes and amplifies: Having some central dispatch servers, as with Napster and SETI@home, and now Spotify and Skype, works better than mesh unless the core of the design problem is "How do we avoid having central dispatch?" BitTorrent seems like the one great counter-example, where cooperation gain is used to smooth spiky temporal demand across the whole of the routing fabric.

  • Don Hopkins with one of those great posts-in-a-comment type posts on TomTom's experience building a system for distributing maps via BitTorrent. Maps are a great use of the BitTorrent protocol: they are large, and many people in a region need them. Moving to BitTorrent and away from Akamai would have saved a million euros a year. Then Akamai unilaterally lowered prices for TomTom, and that was that.

  • Revolv is taking another turn, going from NoSQL to PostgreSQL. Reasons: Our data is relational; We need better querying; We have access to better resources. In short: I've realized the value of an established ecosystem. It was fun to write a cutting-edge document database using a shiny beta NoSQL storage engine. It was decidedly less fun to write tools to export data, automate tasks, run complex queries, and manage security. And it was downright painful to write one-off programs every time sales, marketing, or somebody else needed to "see some numbers". SQL may not be sexy, but it sure is useful.

  • Comparison of different concurrency models: Actors, CSP, Disruptor and Threads. Rüdiger Möller doesn't like Java's threads-and-locks approach to concurrent programming, so he benchmarked some popular alternatives. Drum roll please...the result is: Although the Disruptor worked best for this example, I think looking for "the concurrency model to go for" is wrong. If we look at the real world, we see all 4 patterns used depending on the use case. So a broad concurrency library would ideally integrate the assembly-line pattern (~Disruptor), queued messaging (~Actors) and unqueued communication (~CSP).
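
The benchmarks are Java, but the CSP style in that list is easiest to see in Go, where channels are the native construct. A toy three-stage pipeline, purely illustrative:

```go
package main

import "fmt"

// Each stage owns its own state and talks to its neighbors only over
// channels: the "unqueued communication (~CSP)" style from the article.
func main() {
	nums := make(chan int) // unbuffered: sender and receiver rendezvous
	squares := make(chan int)

	go func() { // producer stage
		for i := 1; i <= 5; i++ {
			nums <- i
		}
		close(nums)
	}()

	go func() { // worker stage: no shared memory, no locks
		for n := range nums {
			squares <- n * n
		}
		close(squares)
	}()

	for s := range squares { // consumer stage
		fmt.Println(s)
	}
}
```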

  • The DARPA Spectrum Challenge is interesting: Perhaps the biggest outcome was the further validation that cooperating radios can share spectrum, without any central authority or database.  Of course, we already knew that, but this is an extra data point to give regulators courage to make TV white space and other shared spectrum resources available for general use.

  • Here's a nice meat-world example of deadlock: I recall reading that this was a big problem in downtown Denver back in the '60s or '70s. The whole downtown area was covered in parking lots because people were sitting on land waiting for it to go up in value, which hurt efforts to turn downtown into an attractive hub that would draw in more business.
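
For the software version of that gridlock, here's a minimal Go sketch of the classic two-party deadlock: each side holds one resource while waiting forever for the other's. Go's runtime notices and aborts with "all goroutines are asleep - deadlock!".

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var land, capital sync.Mutex

	go func() { // the speculator: holds land, wants capital
		land.Lock()
		time.Sleep(100 * time.Millisecond) // ensure both sides hold one lock
		capital.Lock()                     // blocks forever
	}()

	capital.Lock() // the city: holds capital, wants land
	time.Sleep(100 * time.Millisecond)
	land.Lock() // blocks forever; the runtime reports the deadlock
}
```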

  • GopherCasts. Short screencast tutorials on Go from Jeremy Saenz. Good production values and content. It's free for now. 

  • More goings-on: Go Performance Tales; Do you feel that golang is ugly?; Go seems to be the most modern and well-featured language in existence today. Is that the case, or are there major drawbacks that I'm not seeing?; Why I went from Python to Go (and not node.js).

  • CASSANDRA MIGRATION YIELDS INSANE PERFORMANCE IMPROVEMENTS: To summarize, the slowest of response times on Cassandra (for real-time profiling and campaign delivery) average more than 10x better than the fastest we had utilizing the MongoDB code and infrastructure. We’re now able to intelligently deliver dynamic, tailored content to a visitor’s browser in less than 1/200th of a second. Consistently, and without spikes.

  • In their distributed query implementation, DataStax found Netty gave a 50% drop in higher-percentile latencies compared to Tomcat. Netty uses non-blocking IO. Local in-VM transport is used to query the local index, avoiding expensive HTTP localhost requests. A smart connection pooling algorithm fairly allocates connections and avoids concentrating too many connections on too few hosts.
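
A hypothetical Go sketch of that per-host fairness idea, using one counting semaphore per host so no single host can hog the pool (not DataStax's actual code):

```go
package pool

import "net"

// Pool caps the number of open connections per host.
type Pool struct {
	slots map[string]chan struct{} // one counting semaphore per host
}

// New allots maxPerHost connection slots to each host.
func New(maxPerHost int, hosts []string) *Pool {
	p := &Pool{slots: make(map[string]chan struct{})}
	for _, h := range hosts {
		sem := make(chan struct{}, maxPerHost)
		for i := 0; i < maxPerHost; i++ {
			sem <- struct{}{}
		}
		p.slots[h] = sem
	}
	return p
}

// Dial blocks until the host has a free slot, then connects, so demand
// spreads across hosts instead of piling onto a few.
func (p *Pool) Dial(host string) (net.Conn, error) {
	<-p.slots[host]
	c, err := net.Dial("tcp", host)
	if err != nil {
		p.slots[host] <- struct{}{} // give the slot back on failure
		return nil, err
	}
	return c, nil
}

// Release closes the connection and returns its slot.
func (p *Pool) Release(host string, c net.Conn) error {
	err := c.Close()
	p.slots[host] <- struct{}{}
	return err
}
```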

  • And AOL Migrates from MySQL to Apache Cassandra for 8X Improvement.

  • OLTP Database Systems for Non-Volatile Memory: We explored two possible architectures for using non-volatile memory (i.e., NVM+DRAM and NVM-only architectures). Our analysis shows that memory-oriented systems are better suited to take advantage of NVM and outperform their disk-oriented counterparts. However, in both architectures the throughput of memory-oriented systems decreases as workload skew is decreased, while the throughput of disk-oriented systems increases. Because of this, we conclude that the ideal system for both NVM-only and NVM+DRAM architectures will possess features of both memory-oriented and disk-oriented systems.

  • This is why you never believe a systems analysis. Effects of Higher Orders of Uncertainty: Risk management failed on several levels at Fukushima Daiichi. Both TEPCO and its captured regulator bear responsibility. First, highly tailored geophysical models predicted an infinitesimal chance of the region suffering an earthquake as powerful as the Tohoku quake. This model uses historical seismic data to estimate the local frequency of earthquakes of various magnitudes; none of the quakes in the data was bigger than magnitude 8.0. Second, the plant’s risk analysis did not consider the type of cascading, systemic failures that precipitated the meltdown. TEPCO never conceived of a situation in which the reactors shut down in response to an earthquake, and a tsunami topped the seawall, and the cooling pools inside the reactor buildings were overstuffed with spent fuel rods, and the main control room became too radioactive for workers to survive, and damage to local infrastructure delayed reinforcement, and hydrogen explosions breached the reactors’ outer containment structures. Instead, TEPCO and its regulators addressed each of these risks independently and judged the plant safe to operate as is. -- Nick Werle, n+1, published by the n+1 Foundation, Brooklyn, NY

Reader Comments (3)

Re: Revolv moving from NoSQL to SQL

At one point, things got so bad that it was faster to dump the entire database into CSV files and use Excel! Maybe this really is the right solution for ad hoc queries, but it just seems dirty and inelegant.

Clearly the amount of data they had was not large. Even a completely inappropriate data model would be sufficient to handle the amount of data you can fit in an Excel file. A SQL database might be the most appropriate database for their data, but there is also some other significant technological fail going on at Revolv.

April 18, 2014 | Unregistered Commenter orbitz

As always, thanks so much for the shout out. And lest anyone think otherwise, all of the bullet items in my "Observations on Product Development" series of articles are based on real-life experiences over the past nearly forty years (!!!) of my career. They represent both best and worst practices I see in my day to day work developing (and, with any luck, shipping) a successful product.

April 18, 2014 | Unregistered Commenter Chip Overclock

orbitz, I agree with the statement about technology fail at Revolv. If they needed ad hoc query capabilities, a SQL database is a very fine choice. But that doesn't mean it's your main production database. They could have replicated their data in real-time in a variety of ways into a SQL DB.

April 22, 2014 | Unregistered Commenter Nils
