Entries by Todd Hoff (380)

Monday
Feb092009

Paper: Consensus Protocols: Two-Phase Commit  

Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: The consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network. In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written with lots of pretty and informative pictures. He did a really good job. In conclusion we learn 2PC is very efficient, a minimal number of messages are exchanged and latency is low. The problem is when a co-ordinator fails availability is dramatically reduced. This is why 2PC isn't generally used on highly distributed systems. To solve that problem we have to move on to different algorithms and that is the subject of other articles.

Click to read more ...

Thursday
Feb052009

Product: HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer

Update: Load Balancing in Amazon EC2 with HAProxy. Grig Gheorghiu writes a nice post on HAProxy functionality and configuration: Emulating virtual servers, Logging, SSL, Load balancing algorithms, Session persistence with cookies, Server health checks, etc. Adapted From the website: HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting tens of thousands of connections is clearly realistic with todays hardware. Its mode of operation makes its integration into existing architectures very easy and riskless, while still offering the possibility not to expose fragile web servers to the Net. Currently, two major versions are supported : * version 1.1 - maintains critical sites online since 200 The most stable and reliable, has reached years of uptime. Receives no new feature, dedicated to mission-critical usages only. * version 1.2 - opening the way to very high traffic sites The same as 1.1 with some new features such as poll/epoll support for very large number of sessions, IPv6 on the client side, application cookies, hot-reconfiguration, advanced dynamic load regulation, TCP keepalive, source hash, weighted load balancing, rbtree-based scheduler, and a nice Web status page. This code is still evolving but has significantly stabilized since 1.2.8. Unlike other free "cheap" load-balancing solutions, this product is only used by a few hundreds of people around the world, but those people run very big sites serving several millions hits and between several tens of gigabytes to several terabytes per day to hundreds of thousands of clients. They need 24x7 availability and have internal skills to risk to maintain a free software solution. Often, the solution is deployed for internal uses and I only know about it when they send me some positive feedback or when they ask for a missing feature ;-) According to many users HAProxy competes quite well with the likes of Pound and Ultramonkey.

Click to read more ...

Tuesday
Feb032009

10 More Rules for Even Faster Websites

Update:How-To Minimize Load Time for Fast User Experiences. Shows how to analyze the bottlenecks preventing websites and blogs from loading quickly and how to resolve them. 80-90% of the end-user response time is spent on the frontend, so it makes sense to concentrate efforts there before heroically rewriting the backend. Take a shower before buying a Porsche, if you know what I mean. Steve Souders, author of High Performance Websites and Yslow, has ten more best practices to speed up your website:

  • Split the initial payload
  • Load scripts without blocking
  • Don’t scatter scripts
  • Split dominant content domains
  • Make static content cookie-free
  • Reduce cookie weight
  • Minify CSS
  • Optimize images
  • Use iframes sparingly
  • To www or not to www Sadly, according to String Theory, there are only 26.7 rules left, so get them while they're still in our dimension. Here are slides on the first few rules. Love the speeding dog slide. That's exactly what my dog looks like traveling down the road, head hanging out the window, joyfully battling the wind. Also see 20 New Rules for Faster Web Pages.

    Click to read more ...

  • Tuesday
    Feb032009

    Paper: Optimistic Replication

    To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted independently at different sites). The result is a highly available, well performing, and scalable system. Partitioning is required, but it's a pain to do efficiently and correctly. Until Quantum teleportation becomes a reality how data is kept consistent across a bewildering number of failure scenarios is a key design decision. This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) of different approaches to achieving consistency. What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master operation-transfer, centralized, manual conflict resolution). The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradesoffs is priceless. The abstract:

    Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems — ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence—and provides a comprehensive survey of techniques developed for addressing these challenges.
    If you can't wait to know the ending, here's the summary of the paper:
    We summarize some of the lessons learned from our own experience and in reviewing the literature. Optimistic, asynchronous data replication is an appealing technique; it indeed improves networking flexibility and scalability. Some environments or application areas could simply not function without optimistic replication. However, optimistic replication also comes with a cost. The algorithmic complexity of ensuring eventual consistency can be high. Conflicts usually require application-specific resolution, and the lost update problem is ultimately unavoidable. Hence our recommendations: (1) Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflictfree, and scales well in practice. State transfer using Thomas’s write rule works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution. (2) Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur. (3) Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set. When operations are dependent upon each other, represent the invariants explicitly.

    Related Articles

  • The End of an Architectural Era (It’s Time for a Complete Rewrite)
  • Big Table
  • Google's Paxos Made Live – An Engineering Perspective
  • Dynamo: Amazon’s Highly Available Key-value Store
  • Eventually Consistent - Revisited by Werner Vogels

    Click to read more ...

  • Sunday
    Feb012009

    More Chips Means Less Salsa

    Yes, I just got through watching the Superbowl so chips and salsa are on my mind and in my stomach. In recreational eating more chips requires downing more salsa. With mulitcore chips it turns out as cores go up salsa goes down, salsa obviously being a metaphor for speed. Sandia National Laboratories found in their simulations: a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added. The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor. The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight multicores. After that you'll need to consider some sort of partitioning strategy. What's interesting is the research on where the cutoff point will be.

    Click to read more ...

    Monday
    Jan262009

    Paper: Scalability by Design - Coding for Systems With Large CPU Counts

    The multi-cores are coming and software designed for fewer cores usually doesn't work on more cores without substantial redesign. For a taste of the issues take a look at No new global mutexes! (and how to make the thread/connection pool work), which shows some of the difficulties of making MySQL perform on SMP servers. In this paper, Richard Smith, a –Staff Engineer at Sun, goes into some nice detail on multi-core issues. His take home lessons are:

  • Use fine-grained locking or lock-free strategy
  • Document the strategy, including correctness criteria (invariants)
  • Keep critical sections short
  • Profile the code at both light and heavy load
  • Collect HW performance counter data
  • Identify bottleneck resource (there's always at least one!)
  • Eliminate or ameliorate it

    Click to read more ...

  • Tuesday
    Jan202009

    Product: Amazon's SimpleDB

    Update 35: How and Why Glue is Using Amazon SimpleDB instead of a Relational Database. Discusses a key design decision that required duplicating data in order to mimic RDBMS joins: Given the trade off between potential inconsistencies and scalability, social services have to choose the latter. Update 34: Apparently Amazon pulled this article. I'm not sure what that means. Maybe time went backwards or something? Amazon dramatically drops SimpleDB pricing to $0.25 per GB per month from $1.50 per GB. This puts SimpleDB on par with Google App Engine. They also announced a few new features: a SQL-like SELECT API as well as a Batch Put operation to streamline uploading of multiple items or attributes. One of the complaints against SimpleDB is that programmers end up writing too much code to do simple things. These features and a much cheaper price should help considerably. And you can store lots of data now. GAE is still capped. Update 33: Amazon announces Elastic Block Store (EBS), which provides lots of normal looking disk along with value added features like snapshots and snapshot copying. But database's may find EBS too slow. RightScale tells us Why Amazon’s Elastic Block Store Matters. Update 32: You can now get all attributes for a property when querying. Previously only the ID was returned and the attributes had to be returned in separate calls. This makes the programmer's job a lot simpler. Artificial levels of parallelization code can now be dumped. Update 31: Amazon fixes a major hole in SimpleDB by adding the ability to sort query results. Previously developers had to sort results by hand which was a non-starter for many. Now you can do basic top 10 type queries with ease. Update 30: Amazon SimpleDB - A distributed, highly-scalable, light-weight, query-able, attribute store by Sebastian Stadil. It introduces the CAP theorem and the basics of SimpleDB. Sebastian does a lot of great work in the AWS world and in what must be his limited free time, runs the AWS Meetup group. Update 29: A stroll down the history of a previous RDBMS killer, object databases. Lots of fond memories of the new kid on the block showing us how objects and code were one, the endless OO vs. relational wars, writing a OODBMS training course, dealing with object migration and querying etc, and the slow decline followed by groveling in front of the old master. It would be a terrible irony if a hash table succeeded where OODBMSs failed. Update 28: I didn't make the beta program :-( Update 27: IBM has hired CouchDB creator Damien Katz as their player in the game. Teams Microsoft, IBM, and Amazon have all entered the race. Amazon is 10 furlongs ahead, but watch for team Google, a fast finisher on the outside. Update 26: Red Monk says Microsoft's Astoria project is SDBish, but developers are afraid of lock-in. Update 25: Nati Shalom thinks SDB isn't even a database. Update 24: Igvita asks why do you need SDB when Thrudb is faster and cheaper? It provides a memcached layer in front of a database storing data in S3. And even better, all its service names start with "thru" instead of "S". Update 23: For all you Perl haters, the Perl interface to SDB is clean and beautiful. Update 22: On an Erlang email list Jim Larson says the proper model is to store bulk data in S3 and indexable metadata in SimpleDB. The cost of SimpleDB is 10x for storing data versus S3. We are supposed to build our own inverted index for text searching, which is one of those decisions that sounds good in the meeting room (yay, we don't have to do all that work), but is not a good decision in the real world. Update 21: Sensepost is already creating attack models to drain your bank account through repeated queries. Update 20: Grow some stones, smoothspan says Eventual Consistency Is Not That Scary. Update 19: Jacob Harris in A First Look at Amazon SimpleDB offers up some beta Ruby libraries for accessing SDB. Update 18: Erlang folks hope to get some run, but Erlang the language is too different to go mainstream, though Erlang's concurrency model rocks. A while back I talked about how The Solution to C++ Threading is Erlang and how Java's concurrency approach is fundamentally broken. Update 17: Subbu tirelessly provides a A RESTful version of Amazon's SimpleDB. Update 16: Snarfed sees it as a sort of tuplespace implementation. Compare it to Facebook's API. Ning also has a data API. Update 15: Uncom thinks Winer & Scoble Fail In Tandem. SDB's XML response has 1,755% transmission overhead, which is genius for a per byte pricing model. And I love this one: if you are starting a business whose success hinges on scalability of a data store, you had best figure out how to shard across N machines before you launch. Using a single instance of MySQL for the whole thing is a strong indicator that you have failed at life. Update 14: Styled Bits sees SDB as more of a way to add metadata to S3 objects. Update 13: Bex Huff makes the point you'll still need a caching layer in front of SDB. Update 12: Shahzad Bhatti has been coding for SimpleDB for a few months and gives us a cool Java and Erlang API for basic CRUD operations. Update 11: DBA4Life says Amazon has just flux capacited us back to 1980s style database management. Update 10: Bob Warfield of SmoothSpan explains Why the Amazon SimpleDB is a Huge Next Step. It helps achieve the necessary "16:1 operations cost advantages over conventional software." Update 9: SimpleDB is berkleyDB and 90% of all computing will live in cloud city. Will the Troglyte's revolt? Update 8: Dave Winer says Amazon removes the database scaling wall by adding a storage ramp that scales up when needed and scales down when unneeded. You no longer need to buy expensive VC funded database talent to take your product to the next level. Update 7: Kevin Burton in Google vs Amazon in Open Infrastructure has doubts about the entire hosted model. Bandwidth costs too much, it might hurt your acquisition chances, and you can't trust 'em. He just wants to lease managed raw machine power. Update 6: Amazon SimpleDB and CouchDB compared. Some key differences: SimpleDB is hosted. CouchDB is REST/JSON and SimpleDB is REST/SOAP/XML. In SimpleDB attribute updates are atomic in CouchDB record updates are atomic. CouchDB supports JSON data types and SimpleDB thinks everything is a string. CouchDB has much more flexible indexing and queries. Update 5: Sriram Krishnan gives a more technical overview of SimpleDB. He likes the big hash table approach and brings up how the query language allows for parallelization. Update 4: Mark from areyouwatchingthis.com makes a really insightful point: I run a startup that gets 75% of our traffic from our API. The ability to move that processing and storage into a cloud _might_ save me a lot on hosting. Update 3: Marcelo Calbucci thinks SimpleDB is more of a directory service than a database because records can contain different attributes (no schema) and attributes can have multiple values. Update 2: Smug Mugs' Don MacAskill likes the service, but is concerned that field sizes are limited to 1024 characters and latency from far away datacenters. He thinks most queries will be easy to convert as they are predominantly hash like lookups anyway. Update: Scoble asks if SimpleDB kills MySQL, Oracle, et al. The answer is no. Google has a similar service internally and they are still major users of and contributors to MySQL. Sometimes you just need structured data. So RDBMSs aren't dead. They just may not be the starting point as the barrier to entry for doing the simplest thing to start a website has plummeted. No more setup or admin. Just code and go. The cherry missing from Amazon's AWS hot fudge sundae was a database service. They had a CPU scoop with EC2, they had storage scoop with S3, they had a work distribution scoop with their queue, but the database cherry was missing. Now they've added it and it's dessert time. News of SimpleDB is everywhere. Apparently it's been in development for a while. You can read about it inside looking out, GIGAOM, Innowave, SimpleDB Developer's Guide, and the SimpleDB Home Page. It seems to be a simple properties like store implemented on Erlang (as is CouchDB). It has simple query capabilities on attributes. It's fast and scalable. And At $0.14 per hour it's quite competitive with other options. What it doesn't have is a text search or complex RDBMS style queries for structured data. It's not clear if the data are geographically distributed, in case you are interested in fast response times from different parts of the world. I would be very curious on the relationship between SimpleDB and Dynamo. Even with these limitations it's a disruptive service. Most high speed websites use a property store for unstructured data and that's been hard for smaller groups to implement at scale. But if you're losing your mind trying to figure out how to store your data at scale, maybe you can now turn your attention to more productive problems.

    Click to read more ...

    Monday
    Jan192009

    Papers: Readings in Distributed Systems

    Marton Trencseni has collected a wonderful list of different papers on distributed systems. He's organized them into the following sections: The Google Papers, Distributed Filesystems, Non-relational Distributed Databases, The Lamport Papers, and Implementation Issues. Many old favorites on the list and some that are likely new to you. My new favorite is "Frangipani: A Scalable Distributed File System." How can you not love "Frangipani" as a word?

    Click to read more ...

    Saturday
    Jan172009

    Scaling in Games & Virtual Worlds  

    "Online games and virtual worlds have familiar scaling requirements, but don’t be fooled: everything you know is wrong." Jim Waldo, Sun Microsystems Laboratories * The computational environment for online games or virtual worlds is close to the exact inverse of that found in most markets serviced by the high-tech industry. * The need for a heavyweight client is, in part, an outcome of the evolution of these games. * Latency is the enemy of fun—and therefore the enemy of online games and virtual worlds. * The game server is used both to discourage cheating (by making it much more difficult) and to detect cheating (by seeing patterns of divergence between the game state reported by the client and the game state held by the server). Peer-to-peer technologies might seem a natural fit for the first role of the game server, but this second role means that few if any games or worlds trust their peers enough to avoid the server component. * Using multiple servers is a basic mechanism for scaling the server component of a game to the levels that are being seen in the online world today. * Having multiple servers means that part of building the game is deciding how to partition the load over these servers. The first technique is to exploit the geography of the game or world. The second technique is known as sharding. * While shards allow scale, they do so at the price of player interaction. * The problem is that the culture that has grown up around games and virtual worlds is not one that understands or is overly familiar with the programming techniques that are required to exploit the parallelism inherent in these systems. * It is for these reasons that we started Project Darkstar (http://www.projectdarkstar.com), a research effort attempting to build a server-side infrastructure that will exploit the multithreaded, multicore chips being produced and scaled over a large group of machines, while presenting the programmer with the illusion that he or she is developing in a single-threaded, single-machine environment. *The model is a simple event-based one in which input from the client is received by the server, which then sets off a task in response to that event. * This mechanism for concurrency control does require that all tasks access all of their data through the Darkstar data service. Our current implementation uses the Berkeley Database. we believe that we can keep the penalty for accessing through a data service small by caching data in intelligent ways. We also believe that by using the inherent parallelism in these games, we can increase the overall performance of the game as the number of players increases, even if there is a small penalty for individual data access. * We found that additional machines lowered the capacity of the overall system. We are working on removing the choke points so that adding equipment actually adds capacity.

    Click to read more ...

    Friday
    Jan162009

    Just-In-Time Scalability: Agile Methods to Support Massive Growth (IMVU case study)

    Before
    We started with a small site, a mess of open source, and a small team that didn't know much about scaling.

    After
    We ended with a large site, a medium sized team, and an architecture that has scaled.

    We never stopped. We used a roadmap and a compass, made weekly changes in direction, regularly shipped code on Wednesday to handle the next weekend's capacity constraints, and shipped new features the whole time.

    These are excerpts from the IMVU PDF presentation of their architecture which can be viewed or downloaded here.
    IMVU is an online destination where adults and teens meet new people in 3D. IMVU won the 2008 Virtual Worlds Innovation Award and was also named a Rising Star in the 2008 Silicon Valley Technology Fast 50 program.

    Click to read more ...