Entries in Strategy (358)

Wednesday
Sep 10, 2014

10 Common Server Setups For Your Web Application

If you need a good overview of different ways to set up your web service, then Mitchell Anicas has written a good article for you: 5 Common Server Setups For Your Web Application.

We've even included a few additional possibilities at no extra cost.

  1. Everything on One Server. Simple. Potential for poor performance because of resource contention. Not horizontally scalable. 
  2. Separate Database Server. There's an application server and a database server. Application and database don't share resources. Can independently vertically scale each component. Increases latency because the database is a network hop away.
  3. Load Balancer (Reverse Proxy). Distributes workload across multiple servers. Native horizontal scaling. Protection against DDoS attacks using rules. Adds complexity. Can be a performance bottleneck. Complicates issues like SSL termination and sticky sessions (see the sketch after this list).
  4. HTTP Accelerator (Caching Reverse Proxy). Caches web responses in memory so they can be served faster. Reduces CPU load on web server. Compression reduces bandwidth requirements. Requires tuning. A low cache-hit rate could reduce performance. 
  5. Master-Slave Database Replication. Can improve read and write performance. Adds a lot of complexity and failure modes.
  6. Load Balancer + Cache + Replication. Combines load balancing of the caching servers and the application servers with database replication. Nice explanation in the article.
  7. Database-as-a-Service (DBaaS). Let someone else run the database for you. Amazon RDS is one example, and there are hosted versions of many popular databases.
  8. Backend as a Service (BaaS). If you are writing a mobile application and you don't want to deal with the backend component then let someone else do it for you. Just concentrate on the mobile platform. That's hard enough. Parse and Firebase are popular examples, but there are many more.
  9. Platform as a Service (PaaS). Let someone else run most of your backend, but you get more flexibility than you have with BaaS to build your own application. Google App Engine, Heroku, and Salesforce are popular examples, but there are many more.
  10. Let Someone Else Do It. Do you really need servers at all? If you have a store, then a service like Etsy saves a lot of work for very little cost. Does someone already do what you need done? Can you leverage it?
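
To make setup #3 concrete, here is a toy round-robin load balancer in Python. It is a sketch only: real deployments use something like HAProxy or Nginx, the backend addresses are assumptions, and it forwards GETs without any of the health checking, SSL termination, or sticky-session handling mentioned above.

```python
# Toy round-robin load balancer (illustrating setup #3).
# Backend addresses are hypothetical; run app servers there first.
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKENDS = ["http://127.0.0.1:8081", "http://127.0.0.1:8082"]
pool = itertools.cycle(BACKENDS)  # rotate through backends in order

class LoadBalancer(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(pool)  # pick the next backend in the rotation
        with urllib.request.urlopen(backend + self.path) as resp:
            body = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), LoadBalancer).serve_forever()
```
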
Wednesday
Sep 3, 2014

Strategy: Change the Problem

James T. Kirk's infamous gambit in Starfleet's impossible-to-win Kobayashi Maru test was to redefine the problem into a challenge he could beat.

Interestingly, an article titled Shifts In Algorithm Design says much the same gambit is the modern method of solving algorithmic problems.

In the past: 

I, Dick, recall the “good old days of theory.” When I first started working in theory—a sort of double meaning—I could only use deterministic methods. I needed to get the exact answer, no approximations. I had to solve the problem that I was given—no changing the problem.


In the good old days of theory, we got a problem, we worked on it, and sometimes we solved it. Nothing shifty, no changing the problem or modifying the goal. 

Today:

Click to read more ...

Wednesday
Jul 30, 2014

Preventing the Dogpile Effect - Problem and Solution

This is a guest repost by Przemek Sobstel, who believes that the dogpile effect is not covered enough, especially in the PHP world. Original article: Preventing dogpile effect.

The dogpile effect occurs when a cache expires and a website is hit by numerous requests at the same time. From my own experience working on big-traffic websites, this is what I consider the best solution. It was used successfully in the wild and it worked. Many people mention storing two redundant values, FRESH + STALE, but for big-traffic websites that approach was killing our network. We thought it worth sharing our solution and starting a discussion to share experiences.

Preventing Dogpiles
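
The author's full solution is behind the link below, but to ground the discussion, here is a sketch of one common guard against dogpiles: a short-lived lock so only one client recomputes an expired value while everyone else keeps serving the stale copy. The key names, TTLs, and the Redis-based approach are illustrative assumptions, not necessarily the article's exact scheme.

```python
# Sketch of a lock-based dogpile guard (illustrative, not the article's
# exact solution). Values are kept past their logical TTL so stale data
# can be served while a single winner of the lock recomputes.
import time
import redis

r = redis.Redis()

def get_with_dogpile_guard(key, ttl, recompute, grace=30):
    value = r.get(key)
    soft_expiry = r.get(key + ":soft")
    if value is not None and soft_expiry is not None and time.time() < float(soft_expiry):
        return value  # still logically fresh

    # Stale or missing: only the client that wins the lock recomputes.
    if r.set(key + ":lock", 1, nx=True, ex=10):
        value = recompute()  # assumed to return str/bytes in this sketch
        pipe = r.pipeline()
        pipe.set(key, value, ex=ttl + grace)  # hard TTL outlives the soft one
        pipe.set(key + ":soft", time.time() + ttl, ex=ttl + grace)
        pipe.delete(key + ":lock")
        pipe.execute()
        return value

    if value is not None:
        return value          # lost the race: serve the stale copy
    time.sleep(0.1)           # cold cache and lost the race: wait and retry
    return get_with_dogpile_guard(key, ttl, recompute, grace)
```
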

Click to read more ...

Monday
Jul 28, 2014

The Great Microservices vs Monolithic Apps Twitter Melee 

Once upon a time a great Twitter melee was fought for the coveted title of Consensus Best Way to Structure Systems. The competition was between Microservices and Monolithic Apps. 

Flying the logo of Microservices, from a distant cloud-covered land, is the Kingdom of Netflix, whose champion was Sir Adrian Cockcroft (who has pledged fealty to another). And for the Kingdom of ThoughtWorks we have Sir Sam Newman as champion.

Flying the logo of the Monolithic App is champion Sir John Allspaw, from the fair Kingdom of Etsy.

Knights from the Kingdom of Digital Ocean and several independent realms filled out the list.

To the winner goes a great prize: developer mindshare and the favor of that most fickle of ladies, Lady Luck.

May the best paradigm win.

The opening blow was wielded by the highly ranked Sir Cockcroft, a veteran of many tournaments:

Click to read more ...

Wednesday
Jul 16, 2014

10 Program Busting Caching Mistakes

While Ten Caching Mistakes that Break your App by Omar Al Zabir is a few years old, it is still a great source of advice on using caches, especially on the differences between using a local in-memory cache and a distributed cache.

Here are the top 10 mistakes (summarized):
  1. Relying on a default serializer. Default serializers can use a lot of CPU, especially for complex types. Give some thought to the best serialization and deserialization method for your language and environment.
  2. Storing large objects in a single cache item. Because of serialization and deserialization costs, under concurrent load, frequent access to large object graphs can kill your server's CPU. Instead, break up the larger graph into smaller subgraphs and cache them separately. Retrieve only the smallest unit you need.
  3. Using cache to share objects between threads. When writes are involved, race conditions develop if parts of a program access the same cached items simultaneously. Some sort of external locking mechanism is needed.
  4. Assuming items will be in cache immediately after storing them. Never assume an item will be in a cache, even after it was just written, because a cache can flush items when memory gets tight. Code should always check for a null return value from a cache (see the sketch after this list).
  5. Storing entire collection with nested objects. Storing an entire collection when you need to get a particular item results in poor performance because of the serialization overhead. Cache individual items separately so they can be retrieved separately. 
  6. Storing parent-child objects together and also separately. Sometimes an object will simultaneously be contained in two or more parent objects. To avoid storing the same object in two different places in the cache, store it on its own under its own key. The parent objects will then read the object when access is needed.
  7. Caching Configuration settings. Store configuration data in a static variable that is local to your process. Accessing cached data is expensive so you want to avoid that cost when possible.
  8. Caching live objects that have open handles to streams, files, the registry, or the network. Don't cache objects that have references to resources like files, streams, memory, etc. When the cached item is removed from the cache, those resources will not be released and system resources will leak.
  9. Storing the same item using multiple keys. It can be convenient to access an item by both a key and an index number. This can work when a cache is in-memory, because the cache holds a reference to the same object, which means changes to the object will be seen through both access paths. With a remote cache any updates won't be visible, so the copies will get out of sync.
  10. Not updating or deleting items in the cache after updating or deleting them in persistent storage. Items in a remote cache are stored as copies, so updating an object won't update the cache. The cache must specifically be updated for the changes to be seen by anyone else. With an in-memory cache, changes to an object will be seen by everyone. The same goes for deletion: deleting an object in persistent storage won't delete it from the cache. It's up to the program to make sure cached items are updated and deleted correctly.
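
Mistake #4 in particular is worth a picture. Below is a minimal cache-aside sketch in Python; the cache.get/cache.set signatures are assumptions standing in for whatever memcached or Redis client you use.

```python
# Defensive cache read (mistake #4): never assume a just-written item
# is still cached; the cache may have evicted it at any moment.
def get_user(cache, db, user_id):
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:                     # miss: item may have been evicted
        user = db.load_user(user_id)     # fall back to the source of truth
        cache.set(key, user, ttl=300)    # repopulate; may be evicted again
    return user
```
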
Wednesday
Jun 25, 2014

The Secret of Scaling: You Can't Linearly Scale Effort with Capacity

The title is a paraphrase of something Raymond Blum, who leads a team of Site Reliability Engineers at Google, said in his talk How Google Backs Up the Internet. I thought it a powerful enough idea that it should be pulled out on its own:

Mr. Blum explained common backup strategies don't work for Google for a very googly-sounding reason: typically they scale effort with capacity.

If backing up twice as much data requires twice as much stuff to do it, where stuff is time, energy, space, etc., it won’t work, it doesn’t scale. 

You have to find efficiencies so that capacity can scale faster than the effort needed to support that capacity.

A different plan is needed when making the jump from backing up one exabyte to backing up two exabytes.
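
A toy model makes the point (the numbers are mine, not Google's): with linear effort, doubling capacity doubles the work; with effort that grows only logarithmically, doubling capacity adds a constant.

```python
# Illustrative only: linear vs. logarithmic effort as capacity doubles.
import math

def linear_effort(capacity_pb, staff_per_pb=0.5):
    return staff_per_pb * capacity_pb            # effort scales with capacity

def automated_effort(capacity_pb, base_staff=10):
    return base_staff * math.log2(capacity_pb)   # effort scales with log(capacity)

for capacity in (1_000, 2_000, 4_000):           # petabytes
    print(capacity, linear_effort(capacity), round(automated_effort(capacity), 1))
```
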

When you hear the idea of not scaling effort with capacity it sounds so obvious that it doesn't warrant much further thought. But it's actually a profound notion. Worthy of better treatment than I'm giving it here:

Click to read more ...

Monday
Jun 23, 2014

Performance at Scale: SSDs, Silver Bullets, and Serialization

This is a guest post by Aaron Sullivan, Director & Principal Engineer at Rackspace.

We all love a silver bullet. Over the last few years, if I were to split the outcomes that I see with Rackspace customers who start using SSDs, the majority of the outcomes fall under two scenarios. The first scenario is a silver bullet—adding SSDs creates near-miraculous performance improvements. The second scenario (the most common) is typically a case of the bullet being fired at the wrong target—the results fall well short of expectations.

With the second scenario, the file system, data stores, and processes frequently become destabilized. These demoralizing results, however, usually occur when customers are trying to speed up the wrong thing.

A common phenomenon at the heart of the disappointing SSD outcomes is serialization. Despite the fact that most servers have parallel processors (e.g. multicore, multi-socket), parallel memory systems (e.g. NUMA, multi-channel memory controllers), parallel storage systems (e.g. disk striping, NAND), and multithreaded software, transactions still must happen in a certain order. For some parts of your software and system design, processing goes step by step. Step 1. Then step 2. Then step 3. That’s serialization.

And just because some parts of your software or systems are inherently parallel doesn’t mean that those parts aren’t serialized behind other parts. Some systems may be capable of receiving and processing thousands of discrete requests simultaneously in one part, only to wait behind some other, serialized part. Software developers and systems architects have dealt with this in a variety of ways. Multi-tier web architecture was conceived, in part, to deal with this problem. More recently, database sharding also helps to address this problem. But making some parts of a system parallel doesn’t mean all parts are parallel. And some things, even after being explicitly enhanced (and marketed) for parallelism, still contain some elements of serialization.

How far back does this problem go? It has been with us in computing since the inception of parallel computing, going back at least as far as the 1960s(1). Over the last ten years, exceptional improvements have been made in parallel memory systems, distributed database and storage systems, multicore CPUs, GPUs, and so on. The improvements often follow after the introduction of a new innovation in hardware. So, with SSDs, we’re peering at the same basic problem through a new lens. And improvements haven’t just focused on improving the SSD, itself. Our whole conception of storage software stacks is changing, along with it. But, as you’ll see later, even if we made the whole storage stack thousands of times faster than it is today, serialization will still be a problem. We’re always finding ways to deal with the issue, but rarely can we make it go away.
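
That last claim is Amdahl's law in action, and a quick calculation shows it: with even a small serialized fraction, making the parallel part thousands of times faster barely moves the overall speedup. The 5% figure below is illustrative.

```python
# Amdahl's law: overall speedup is capped by the serialized fraction.
def amdahl_speedup(serial_fraction, parallel_speedup):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)

for s in (1_000, 10_000, 1_000_000):
    print(f"{s}x faster parallel part, 5% serialized -> "
          f"{amdahl_speedup(0.05, s):.1f}x overall")   # converges on 20x
```
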

Parallelization and Serialization

Click to read more ...

Wednesday
May 21, 2014

9 Principles of High Performance Programs

Arvid Norberg on the libtorrent blog has put together an excellent list of principles of high performance programs, obviously derived from hard-won experience programming BitTorrent:

Two fundamental causes of performance problems:

  1. Memory Latency. A big performance problem on modern computers is the latency of SDRAM. The CPU sits idle waiting for a read from memory to come back (see the sketch after this list).
  2. Context Switching. When a CPU switches context "the memory it will access is most likely unrelated to the memory the previous context was accessing. This often results in significant eviction of the previous cache, and requires the switched-to context to load much of its data from RAM, which is slow."
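
Point #1 is easy to demonstrate. The sketch below (assuming numpy is available; sizes and the measured gap will vary by machine) sums the same number of elements twice, once contiguously and once with a large stride that defeats the CPU cache.

```python
# Rough memory-latency demo: strided access misses cache, contiguous doesn't.
import time
import numpy as np

a = np.ones(16_000_000, dtype=np.float64)  # ~128 MB

def timed_sum(view):
    start = time.perf_counter()
    view.sum()
    return time.perf_counter() - start

contiguous = timed_sum(a[:1_000_000])  # 1M adjacent elements
strided = timed_sum(a[::16])           # 1M elements, 128-byte stride
print(f"contiguous: {contiguous:.4f}s  strided: {strided:.4f}s")
```
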

Rules to help balance the forces of evil:

Click to read more ...

Wednesday
May 14, 2014

Google Says Cloud Prices Will Follow Moore’s Law: Are We All Renters Now?

After Google cut prices on their Google Cloud Platform, Amazon quickly followed with their own price cuts. Even more interesting is what the future holds for pricing. The near future looks great. After that? We'll see.

Adrian Cockcroft highlights that Google thinks prices should follow Moore’s law, which means we should expect prices to halve every 18-24 months.
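
Taken literally (these projections are mine, just running out the claim), a halving period of 18-24 months implies prices like this:

```python
# Project the Moore's-law pricing claim: price halves every 18-24 months.
def projected_price(price_now, months, halving_period_months):
    return price_now * 0.5 ** (months / halving_period_months)

for years in (1, 3, 5):
    fast = projected_price(100, years * 12, 18)  # 18-month halving
    slow = projected_price(100, years * 12, 24)  # 24-month halving
    print(f"year {years}: a $100/month bill falls to ${fast:.2f}-${slow:.2f}")
```
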

That's good news. Greater cost certainty means you can make much more aggressive build out plans. With the savings you can hire more people, handle more customers, and add those media-rich features you thought you couldn't afford. Design is directly related to costs.

Without Google competing with Amazon there's little doubt the price reduction curve would be much less favorable.

As a late cloud entrant Google is now in a customer acquisition phase, so they are willing to pay for customers, which means lower prices are an acceptable cost of doing business. Profit and high margins are not the objective. Getting market share is what is important.

Amazon, on the other hand, has been reaping the higher margins earned from recurring customers. So Google's entrance into the early product life cycle phase is making Amazon eat into their margins and is forcing down prices overall.

But there's a floor to how low prices can go. Alen Peacock, co-founder of Space Monkey, has an interesting graphic telling the story. This is Amazon's historical pricing for 1TB of storage in S3, graphed as a multiple of the historical pricing for 1TB of local hard disk:

Alen explains it this way:

Cloud prices do decrease over time, and have dropped significantly over the timespan shown in the graph, but this graph shows cloud storage prices as a multiple of hard disk prices. In other words, hard disk prices are dropping much faster than datacenter prices. This is because, right, datacenters have costs other than hard disks (power, cooling, bandwidth, building infrastructure, diesel backup generators, battery backup systems, fire suppression, staffing, etc). Most of those costs do not follow Moore's Law -- in fact energy costs are on a long trend upwards. So over time, the gap shown by the graph should continue to widen.


The economic advantage of building your own (but housed in datacenters) is there, but it isn't huge. There is also some long-term strategic advantage to building your own, e.g., GDrive dropped their price dramatically at will because Google owns their datacenters, but Dropbox couldn't do that without convincing Amazon to drop the price they pay for S3.

Costs other than hardware began dominating in datacenters several years ago, so Moore's Law-like effects are dampened. Energy and cooling costs do not follow Moore's Law, and those costs make up a huge component of the overall picture in datacenters. This is only going to get worse, barring some radical new energy production technology arriving on the scene.

What we're [Space Monkey] interested in, long term, is dropping the floor out from underneath all of these, and I think that only happens if you get out of the datacenter entirely.

As the cloud market is still growing there will still be a fight for market share. When growth slows and the market is divided between the major players, a collusive pricing phase will take over. Cloud customers are sticky customers. It's hard to move off a cloud. The need for higher margins to justify the cash flow drain during the customer acquisition phase will reverse the favorable trends we are seeing now.

Until then it seems the economics indicate we are in a rent, not a buy world.

Related Articles 

  • IaaS Series: Cloud Storage Pricing – How Low Can They Go? - "For now it seems we can assume we’ve not seen the last of the big price reductions."
  • The Cloud Is Not Green
  • Brad Stone: “Bill Miller, the chief investment officer at Legg Mason Capital Management and a major Amazon shareholder, asked Bezos at the time about the profitability prospects for AWS. Bezos predicted they would be good over the long term but said that he didn’t want to repeat “Steve Jobs’s mistake” of pricing the iPhone in a way that was so fantastically profitable that the smartphone market became a magnet for competition.” 
Monday
May 12, 2014

4 Architecture Issues When Scaling Web Applications: Bottlenecks, Database, CPU, IO

This is a guest repost by Venkatesh CM at Architecture Issues Scaling Web Applications.

In this post I will cover architecture issues that show up while scaling and performance-tuning large-scale web applications.

Let's start by defining a few terms to create a common understanding and vocabulary. Later on I will go through the different issues that pop up while scaling a web application, like:

  • Architecture bottlenecks
  • Scaling Database
  • CPU Bound Application
  • IO Bound Application

Determining the optimal thread pool size of a web application will be covered in the next post.
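
Until then, one widely cited rule of thumb (a placeholder sketch, not the author's forthcoming treatment) sizes a pool at cores * (1 + wait time / compute time), which captures the CPU-bound vs IO-bound distinction above:

```python
# Rule-of-thumb thread pool sizing: cores * (1 + wait/compute).
import os

def thread_pool_size(wait_ms, compute_ms, cores=None):
    cores = cores or os.cpu_count() or 1
    return round(cores * (1 + wait_ms / compute_ms))

print(thread_pool_size(wait_ms=1, compute_ms=99))   # CPU bound: ~1 thread per core
print(thread_pool_size(wait_ms=90, compute_ms=10))  # IO bound: ~10 threads per core
```
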

Performance

Click to read more ...
