High Scalability -

Permalink |

distributed,

erlang

Thursday

Apr042013

Paper: A Web of Things Application Architecture - Integrating the Real-World into the Web

Thursday, April 4, 2013 at 9:25AM

How do you layer a programmable Internet of smart things on top of the web? That's the question addressed by Dominique Guinard in his ambitious dissertation: A Web of Things Application Architecture - Integrating the Real-World (slides). With the continued siloing of content, perhaps we can keep our things open and talking to each other?

In the architecture things are modeled using REST, they will be findable via search, they will be social via a social access controller, and they will be mashupable. Here's great graphical overview of the entire system:

Abstract:

Permalink |

Paper

Thursday

Mar072013

It's a VM Wasteland - A Near Optimal Packing of VMs to Machines Reduces TCO by 22%

Thursday, March 7, 2013 at 9:31AM

In Algorithm Design for Performance Aware VM Consolidation we learn some shocking facts (gambling in Casablanca?):

Average server utilization in many data centers is low, estimated between 5% and 15%. This is wasteful because an idle server often consumes more than 50% of peak power.
Surely that's just for old style datacenters? Nope. In Google data centers, workloads that are consolidated use only 50% of the processor cores. Every other processor core is left unused simply to ensure that performance does not degrade.

It's a VM wasteland. The goal is to reduce waste by packing VMs onto machines without hurting performance or wasting resources. The idea is to select VMs that interfere the least with each other and places them together on the same server.

It's a NP-Complete problem, but this paper describes a practical method that performs provably close to the optimal. Interestingly they can optimize for performance or power efficiency, so you can use different algorithms for different workloads.

The result when optimizing for performance are utilizations between 75% and 80% compared to 50% from the naıve method. This gives a 22% reduction in TCO, a significant savings at scale:

1 Comment |

Permalink |

Paper

Tuesday

Dec182012

Georeplication: When Bad Things Happen to Good Systems

Tuesday, December 18, 2012 at 9:40AM

Georeplication is one of the standard techniques for dealing when bad things--failure and latency--happen to good systems. The problem is always: how do you do that? Murat Demirbas, Associate Professor at SUNY Buffalo, has a couple of really good posts that can help: MDCC: Multi-Data Center Consistency and Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary.

In MDCC: Multi-Data Center Consistency Murat discusses a paper that says synchronous wide-area replication can be feasible. There's a quick and clear explanation of Paxos and various optimizations that is worth the price of admission. We find that strong consistency doesn't have to be lost across a WAN:

The good thing about using Paxos over the WAN is you /almost/ get the full CAP (all three properties: consistency, availability, and partition-freedom). As we discussed earlier (Paxos taught), Paxos is CP, that is, in the presence of a partition, Paxos keeps consistency over availability. But, Paxos can still provide availability if there is a majority partition. Now, over a WAN, what are the chances of having a partition that does not leave a majority? WAN has a lot of redundancy. While it is possible to have a data center partitioned off the Internet due to a calamity, what are the chances of several knocked off at the same time. So, availability is also looking good for MDCC protocol using Paxos over WAN.

In Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary Murat describes a paper that tries to hide the price of WAN latency for some classes of operations. In particular:

To alleviate this latency versus consistency tension, this paper proposes RedBlue consistency, which enables blue operations to be fast/asynchronous (and eventually consistent) while the remaining red operations are strongly-consistent/synchronous (and slow). So a program is partitioned into red and blue operations, which run with different consistency levels. While red operations must be executed in the same order at all sites (which make them slow), the order of execution of blue operations can vary from site to site (allowing them to be executed without requiring coordination across sites). "In systems where every operation is labeled red, RedBlue consistency is equivalent to serializability; in systems where every operation is labeled blue, RedBlue consistency allows the same set of behaviors as eventual consistency."

Just a little fun holiday reading :-)

Murat also has number of excellent posts that are a great boon for understanding the innards of distributed systems:

Permalink |

paxos

Thursday

Oct182012

Save up to 30% by Selecting Better Performing Amazon Instances

Thursday, October 18, 2012 at 9:15AM

If you like the idea of exploiting market inconsistencies to lower your costs then you will love this paper and video from the Hot Cloud '12 conference: Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2.

The conclusion is interesting and is a source of good guidance:

Amazon EC2 uses diversified hardware to host the same type of instance.
The hardware diversity results in performance variation.
In general, the variation between the fast instances and slow instances can reach 40%. In some applications, the variation can even approach up to 60%.
By selecting fast instances within the same instance type, Amazon EC2 users can acquire up to 30% of cost saving, if the fast instances have a relatively low probability.

The abstract:

1 Comment |

Permalink |

amazon

Thursday

Oct112012

RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store

Thursday, October 11, 2012 at 9:15AM

RAMCube is a datacenter oriented design for RAM-based key-value store that supports thousands or tens of thousands of servers to offer up to hundreds of terabytes of RAM storage. Here's the PDF Paper describing the system and here's a video of the presentation given at HotCloud.

The big idea is: RAMCube exploits the proximity of a BCube network to construct a symmetric MultiRing structure, restricting all failure detection and recovery trafﬁc within a one-hop neighborhood, which addresses problems including false failure detection and recovery trafﬁc congestion. In addition, RAMCube leverages BCube’s multiple paths between any pairs of servers to handle switch failures.

A few notes:

75% of Facebook data is stored in memcache.
RAM is 1000 time faster than disk
RAM is used in caches, but this increases application complexity as applications are responsible for cache consistency.
Under a high work load a 1% cache miss rate can lead to a 10x performance penalty.
...

Permalink |

Paper

Thursday

Aug162012

Paper: A Provably Correct Scalable Concurrent Skip List

Thursday, August 16, 2012 at 9:15AM

In MemSQL Architecture we learned one of the core strategies MemSQL uses to achieve their need for speed is lock-free skip lists. Skip lists are used to efficiently handle range queries. Making the skip-lists lock-free helps eliminate contention and make writes fast.

If this all sounds a little pie-in-the-sky then here's a very good paper on the subject that might help make it clearer: A Provably Correct Scalable Concurrent Skip List.

From the abstract:

We propose a new concurrent skip list algorithm distinguished by a combination of simplicity and scalability. The algorithm employs optimistic synchronization, searching without acquiring locks, followed by short lock-based validation before adding or removing nodes. It also logically removes an item before physically unlinking it. Unlike some other concurrent skip list algorithms, this algorithm preserves the skiplist properties at all times, which facilitates reasoning about its correctness. Experimental evidence shows that this algorithm performs as well as the best previously known algorithm under most circumstances.

2 Comments |

Permalink |

Paper

Monday

Aug062012

Paper: High-Performance Concurrency Control Mechanisms for Main-Memory Databases

Monday, August 6, 2012 at 9:15AM

If you stayed up all night watching the life reaffirming Curiosity landing on Mars, then this paper, High-Performance Concurrency Control Mechanisms for Main-Memory Databases, has nothing to do with that at all, but it is an excellent look at how to use optimistic MVCC schemes to reduce lock overhead on in-memory datastructures:

A database system optimized for in-memory storage can support much higher transaction rates than current systems. However, standard concurrency control methods used today do not scale to the high transaction rates achievable by such systems. In this paper we introduce two efficient concurrency control methods specifically designed for main-memory databases. Both use multiversioning to isolate read-only transactions from updates but differ in how atomicity is ensured: one is optimistic and one is pessimistic. To avoid expensive context switching, transactions never block during normal processing but they may have to wait before commit to ensure correct serialization ordering. We also implemented a main-memory optimized version of single-version locking. Experimental results show that while single-version locking works well when transactions are short and contention is low performance degrades under more demanding conditions. The multiversion schemes have higher overhead but are much less sensitive to hotspots and the presence of long-running transactions.

This stuff isn't just for databases. Many applications have huge in-memory datastructures that are also accessed concurrently and can benefit from some of these ideas.

Permalink |

Paper

Wednesday

Jun272012

Paper: Logic and Lattices for Distributed Programming

Wednesday, June 27, 2012 at 9:15AM

Neil Conway from Berkeley CS is giving an advanced level talk at a meetup today in San Francisco on a new paper: Logic and Lattices for Distributed Programming - extending set logic to support CRDT-style lattices.

The description of the meetup is probably the clearest introduction to the paper:

Developers are increasingly choosing datastores that sacrifice strong consistency guarantees in exchange for improved performance and availability. Unfortunately, writing reliable distributed programs without the benefit of strong consistency can be very challenging.

In this talk, I'll discuss work from our group at UC Berkeley that aims to make it easier to write distributed programs without relying on strong consistency. Bloom is a declarative programming language for distributed computing, while CALM is an analysis technique that identifies programs that are guaranteed to be eventually consistent. I'll then discuss our recent work on extending CALM to support a broader range of programs, drawing upon ideas from CRDTs (A Commutative Replicated Data Type).

If you have an eye towards understanding the future then this is for you.

1 Comment |

Permalink |

Paper

Thursday

Jun072012

Case Study on Scaling PaaS infrastructure

Thursday, June 7, 2012 at 10:00AM

In his blog post, Scaling WSO2 Stratos, Srinath Perera explains the scaling architecture of the WSO2 Stratos Platform as a Service (PaaS) infrastructure. It is explained as a series of solutions where every solution adds a new concept to solve a specific problem found in the earlier solution.

Overall, WSO2 Stratos uses a combination of intelligent Load balancing and lazy loading to scale up the architecture. More details about Stratos can be found from the paper WSO2 Stratos: An Industrial Stack to Support Cloud Computing.

Problem

Srinath Perera |

Permalink |