High Scalability -

Strategy

Monday

Nov242014

A Flock of Tasty Sources on How to Start Learning High Scalability

Monday, November 24, 2014 at 8:56AM

This is a guest repost by Leandro Moreira.

distributed systems

When we usually are interested about scalability we look for links, explanations, books, and references. This mini article links to the references I think might help you in this journey.

DISCLAIMER:

You don’t need to have N machines to build/test a cluster/high scalable system, currently you can use Vagrant and up N machines easily.

THE REFERENCES:

Now that you know you can empower yourself with virtual servers, I challenge you to not only read these links but put them into practice.

First of all, motivate yourself by watching this tutorial using nodejs + nginx + applying static caching + load balancing + testing, all this in 7 minutes.
Add these words and their meaning to your vocabulary: scalability, failover, single point of failure (SPOF), sharding, replication and load balancing; even if you don’t understand them completely.
In order to have a general overview and the reasons/whys about scalable systems, I strongly recommend you to read Scalable Web Architecture and Distributed Systems. This is a great introduction.
After you get the general idea you can move on to understand how to use a load balancer and what decisions and problems you will face. And then you can try to run a haproxy and make it not a single point of failure too.
Dare yourself to serve 3 million requests per second but for this task you’ll need to generate 3 million requests, fine tune your web server and finally scale and test it.
Your application is already scalable, now you need to scale your databases. They are very important part of your application, here I recommend you to read at least how MongoDB scales with sharding and replication and Cassandra with its almost linear scalability and the ease of adding nodes to the cluster.
Since your application and database are scalable and fault tolerant, it’s good to save your servers unnecessary workload and also make the responses to the user faster. Learn that a good request is the one that never reached the “real server”.
Let’s assume we’re deploying the whole infrastructure within a single data center, now we have another SPOF. Since all servers are in the same space, some natural disaster might happen or even the simple power outages. Good news is that Cassandra have support to multiple data center out of the box and you can see how google face this issue. If your user is on Brazil, don’t make him travel longer than he needs and remember even with the best situation we still have latency.

Good questions to test your knowledge:

Click to read more ...

3 Comments |

Permalink |

Strategy

Wednesday

Nov192014

We are leaving 3x-4x performance on the table just because of configuration.

Wednesday, November 19, 2014 at 8:56AM

Performance guru Martin Thompson gave a great talk at Strangeloop: Aeron: Open-source high-performance messaging, and one of the many interesting points he made was how much performance is being lost because were aren't configuring machines properly.

This point comes on the observation that "Loss, throughput, and buffer size are all strongly related."

Here's a gloss of Martin's reasoning. It's a problem that keeps happening and people aren't aware that it's happening because most people are not aware of how to tune network parameters in the OS.

The separation of programmers and system admins has become an anti-pattern. Developers don’t talk to the people who have root access on machines who don’t talk to the people that have network access. Which means machines are never configured right, which leads to a lot of loss. We are leaving 3x-4x performance on the table just because of configuration.

We need to workout how to bridge that gap, know what the parameters are, and how to fix them.

So know your OS network parameters and how to tune them.

1 Comment |

Permalink |

Performance,

Strategy

Monday

Nov032014

Improve small job completion times by 47% by running full clones.

Monday, November 3, 2014 at 8:56AM

The idea is most jobs are small. Researchers found 82% of jobs on Facebook's cluster were less than 10 tasks. Clusters have a median utilization of under 20%. And since small jobs are particularly sensitive to stragglers the audacious solution is to proactively launch clones of a job as they are submitted and pick the result from the earliest clone. The result is an average completion time of all the small jobs improved by 47% using cloning, at the cost of just 3% extra resources.

For more details take a look at the very interesting Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly.

Google: Taming The Long Latency Tail - When More Machines Equals Worse Results

Microservices in Production - the Good, the Bad, the it Works

Monday, October 27, 2014 at 8:56AM

This is a guest repost written by Andrew Harmel-Law on his real world experiences with Microservices. The original article can be found here.

It’s reached the point where it’s even a cliche to state “there’s a lot written about Microservices these days.” But despite this, here’s another post on the topic. Why does the internet need another? Please bear with me…

We’re doing Microservices. We’re doing it based on a mash-up of some “Netflix Cloud” (as it seems to becoming known - we just call it “Archaius / Hystrix”), a gloop of Codahale Metrics, a splash of Spring Boot, and a lot of Camel, gluing everything together. We’ve even found time to make a bit of Open Source ourselves - archaius-spring-adapter - and also contribute some stuff back.

Lets be clear; when I say we’re “doing Microservices”, I mean we’ve got some running; today; under load; in our Production environment. And they’re running nicely. We’ve also got a lot more coming down the dev-pipe.

All the time we’ve been crafting these we’ve been doing our homework. We’ve followed the great debate, some contributions of which came from within Capgemini itself, and other less-high-profile contributions from our very own manager. It’s been clear for a while that, while there is a lot of heat and light generated in this debate, there is also a lot of valid inputs that we should be bearing in mind.

Despite this, the Microservices architectural style is still definitely in the honeymoon period, which translates personally into the following: whenever I see a new post on the topic from a Developer I respect my heart sinks a little as I open it and read… Have they discovered the fatal flaw in all of this that everyone else has so far missed? Have they put their finger on the unique aspect that mean 99% of us will never realise the benefits of this new approach and that we’re all off on a wild goose chase? Have they proven that Netflix really are unicorns and that the rest of us are just dreaming?

Despite all this we’re persisting. Despite always questioning every decision we make in this area far more than we normally would, Microservices still feel right to us for a whole host of reasons. In the rest of this post I hope I’ll be able to point out some of the subtleties which might have eluded you as you’ve researched and fiddled, and also, I’ve aimed to highlight some of the old “givens” which might not be “givens” any more.

The Good

Click to read more ...

3 Comments |

Permalink |