Stuff The Internet Says On Scalability For September 16, 2011
Friday, September 16, 2011 at 9:01AM
Between love and madness lies HighScalability:
- Google now 10x better: MapReduce sorts 1 petabyte of data using 8000 computers in 33 minutes; 1 Billion on Social Networks; Tumblr at 10 Billion Posts; Twitter at 100 Million Users; Testing at Google Scale: 1800 builds, 120 million test suites, 60 million tests run daily.
- From the Dash Memo on Google's Plan: Go is a very promising systems-programming language in the vein of C++. We fully hope and expect that Go becomes the standard back-end language at Google over the next few years. On GAE Go can load from a cold start in 100ms and the typical instance size is 4MB. Is it any wonder Go is a go? Should we expect to see Java and Python deprecated because Go is so much cheaper to run at scale?
- Potent Quotables:
- @caciufo : 30x more scalability w/ many-core. So perf doesn't have to level out or vex programmers. #IDF2011
- @joerglew : Evaluating divide&conquer vs. master-slave architecture for workflows scalability in #vCO. I wonder what fits better to cust. needs.
- @canicoll : Think biggest issues w/mobile backhaul is access to the site (fiber run expensv - FSO?) and scalability of bandwidth at site #ciscospchat
- Walmart uses Muppet labor to power their real-time social shopping systems: You can’t do MapReduce computing every time (with every Tweet). You’ll die. How do you do it in real-time? We built MapUpdate, or what we call Muppet. We could map a huge amount of data and handle a huge firehose with little latency across millions of entities… We can monitor 100 million (items) at scale. That could be products, stores, anything. It’s the equivalent of MapReduce for fast data.
- Like humans, this AI software is always seeking relations. TextRunner produces facts by digesting 500 million web pages and billions of lines of text. Peter Norvig, director of research at Google: "The significance of TextRunner is that it is scalable because it is unsupervised. It can discover and learn millions of relations, not just one at a time. With TextRunner, there is no human in the loop: it just finds relations on its own."
- How to process a million songs in 20 minutes. Paul Lamere with a fascinating explanation of how to process huge music datasets using MapReduce. Apparently one of the things you want to do is determine a song’s density - where density is defined as the average number of notes or atomic sounds (called segments) per second in a song. Mrjob is used to analyze the tracks in parallel on hundreds of machines (a rough sketch of that kind of job appears after this list). Really cool. Open datasets are processed using open frameworks on easy-to-use cloud infrastructure. So many things are possible these days.
- Dmitriy Samovskiy with a Spock-like look at troubleshooting: 1) Never waste your time checking A if you are observing NOT B and you know that A -> B. 2) Never assume that NOT A causes NOT B if you only know that A -> B. 3) Never assume causality from mere correlation of two events.
- MongoDB for a large queuing system. Andrew Lyon with a great writeup on his experience creating a workflow-type system that runs jobs in dependency order. With a hundred thousand fake users they ran into a write bottleneck. The solution was to manually shard in order to ensure data was distributed evenly (a toy version of that idea is sketched after this list). Not what they were expecting. Advice: Don’t use MongoDB for evenly-distributed high-write applications. One of the biggest problems is that there is a global write lock on the database. But schemaless design cut development time in half and replica sets worked well.
- StorageMojo breaks down VMWorld 2011. Some trends: flash is everywhere; huge storage scalability issues for virtual machines, which require more creative answers beyond deduplication and flash.
- Would 114 HP EcoPOD server systems really power all of Google? Hm, maybe 115, but 114 just doesn't sound like enough...
- Automated Tiered Storage for Databases. Brent Ozar with an interesting proposal: use tiered storage instead of partitioning. In a single drive, some of your data might reside on fast solid state memory, and the rest would live on slower, more capacious magnetic Frisbees. Tiered storage can be cheaper, easier, and more effective than partitioning. Tier 1: Mirrored pair of Solid State Drives; Tier 2: Four 15K drives in RAID 10; Tier 3: Eight cheap 7200RPM SATA drives in RAID 5.
- New GAE pricing actually not so bad says JH on Google Groups. AWS, Heroku, and Dotcloud are expensive too. Plus GAE is PaaS, which is easy to develop for and you get a lot of cool services like task queues.
- Configuring SQL Server in Amazon EC2: Training Video. Jeremiah Peschka with a great analysis of some of the problems of moving to EC2 and how to mitigate them. More information in this post. High cost: yep, EC2 costs a lot but it's easy to start and scale. Noisy Neighbor: move to a better neighborhood. Crashes: redundancy is expensive but possible. Number of drives: spread data across instances. Network Throughput: move to a better neighborhood, but there's still just a single gigabit NIC per host, so be less chatty. And much more. Very well done and informative.
- How GitHub Works. Zach Holman reveals: Hours are Bullshit; Be Asynchronous; Creativity is Important. Ah, but it's the mundanely synchronous seconds that really matter...
- StackOverflow: Amazon EC2 Backup Strategy. Good straightforward advice.
- Jeff Darcy with a great pull, finding a brilliant reference to Distributed Databases in 1965 from the book The Decision Makers by Joseph Green: Our group memory is an accumulated mass of knowledge which is impressed on the memory areas of young individuals at birth, at least three such young ones for each memory segment. We are a short-lived race, dying of natural causes after eight of your years. As each individual who carries a share of the memory feels death approaching he transfers his part to a newly born child, and thus the knowledge is transferred from generation to generation, forever.
- Riak explains their secondary index feature. Indexing is real-time and atomic; the results show up in queries immediately after the write operation completes, and all indexing occurs on the partition where the object lives, so the object and its indexes stay in sync. A major limitation: only single index queries are supported, and only for exact matches or range queries. A quick sketch of the HTTP interface follows this list.
- PacketPushers shows why security is so hard: it's confusing and there's more than one way to do it. Securing an Internet-Facing App – Part 2 – Border Routers, Firewalls, IDS/IPS. The big trend is attacks moving to L7, the application layer. As networks and operating systems become more secure the bad guys are moving upstack and hitting the tender vittles: applications.
- Read consistency a la Cassandra in MongoDB. Nope, can't do that.
- Concurrency, Scalability & Fault-tolerance 2.0 with Akka Actors & STM. Mario Fusco shows how to use Actors to build systems. A very underutilized architecture.
- Node.js: Asynchronous I/O for Fun and Profit. Stefan Tilkov with a good presentation on what asynchronous I/O means and how it can be performed on servers and web clients using Node.js. Also on InfoQ: Membase NoSQL: Clustered by Erlang.
- The interesting case of outdated packages and multiple servers being silently started. Diagnosing why Redis slows down.
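For the million-song item above, here's a minimal mrjob sketch of the kind of density job Lamere describes. This is not his actual code: the tab-separated input layout, the field names, and the DensityJob class are assumptions for illustration (the real Million Song Dataset ships in its own format), but the mapper/reducer shape is standard mrjob.

```python
# A minimal sketch, assuming each input line is a tab-separated record of
# track id, duration in seconds, and segment count (hypothetical layout).
from mrjob.job import MRJob


class DensityJob(MRJob):
    """Find the track with the highest density (segments per second)."""

    def mapper(self, _, line):
        track_id, duration, segment_count = line.split('\t')
        # Emit everything under one key so a single reducer sees all tracks.
        yield 'density', (float(segment_count) / float(duration), track_id)

    def reducer(self, key, values):
        # values is a stream of (density, track_id) pairs; keep the maximum.
        yield key, max(values)


if __name__ == '__main__':
    DensityJob.run()
```

Run locally with `python density_job.py tracks.tsv`, or point mrjob at EMR to get the hundreds-of-machines version.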
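And for the MongoDB queuing item, here's a toy version of manual sharding to spread writes evenly. This is illustrative only, not Lyon's code: the hostnames, database and collection names, and the two-shard setup are made up, and it uses pymongo.

```python
# A sketch of hash-based manual sharding across independent mongod instances,
# so writes (and the per-process global write lock) are spread evenly.
import hashlib
from pymongo import MongoClient

# Hypothetical shard hosts -- each is a separate mongod process.
SHARDS = [
    MongoClient('mongo-shard-0.example.com', 27017),
    MongoClient('mongo-shard-1.example.com', 27017),
]


def shard_for(user_id):
    # Stable hash so the same user always lands on the same shard.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]['queue']['jobs']


def enqueue_job(user_id, payload):
    shard_for(user_id).insert_one({'user_id': user_id, 'payload': payload})


def fetch_jobs(user_id):
    return list(shard_for(user_id).find({'user_id': user_id}))
```

The point is simply that routing by a stable key keeps each user's data on one node while distributing the overall write load across processes.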
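Finally, a rough sketch of what the Riak secondary index HTTP interface looks like, using only the Python standard library. The host, port, bucket, key, and index name are assumptions for illustration; the `x-riak-index-<name>_bin` header and the `/buckets/.../index/...` query path follow Riak's documented conventions.

```python
# Store an object tagged with a string ("_bin") secondary index, then query it.
import http.client
import json

conn = http.client.HTTPConnection('localhost', 8098)  # assumed local Riak node

# Write the object; the index is supplied as a header at write time.
conn.request('PUT', '/riak/users/ann',
             body=json.dumps({'name': 'Ann'}),
             headers={'Content-Type': 'application/json',
                      'x-riak-index-email_bin': 'ann@example.com'})
conn.getresponse().read()

# Exact-match query: returns the keys whose email index equals the value.
conn.request('GET', '/buckets/users/index/email_bin/ann@example.com')
print(conn.getresponse().read())

# Range query over the same index (start and end values in the URL).
conn.request('GET', '/buckets/users/index/email_bin/a/n')
print(conn.getresponse().read())
```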
Reader Comments (2)
Are there more details on that Walmart MapUpdate/Muppet project?
Re: automated tiered storage, I'm not sure that middle tier makes sense -- I'd go right from Flash to 7.2K-10K in RAID 10. In the in-between zone you're paying a lot for a smaller gain in IOPS. If you're starved for lower-tier reads, increase the size of your upper tier or have more replicas. If you're starved for writes, grow your upper tier or optimize. :)