
Monday
Mar 30, 2015

How We Scale VividCortex's Backend Systems

This is a guest post by Baron Schwartz, Founder & CEO of VividCortex, the first unified suite of performance management tools specifically designed for today's large-scale, polyglot persistence tier.

VividCortex is a cloud-hosted SaaS platform for database performance management. Our customers install agents that measure the work their servers perform (queries, processes, etc.) and generate metrics and events from it at high frequency. The agents send the resulting data to our APIs, where we host our analysis backend. The backend system is a collection of databases, internal services (quasi-microservices), and web-facing APIs. These APIs also power our AngularJS frontend application.
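
To make that ingest path concrete, here is a minimal Go sketch of the agent side of the loop: batch up measurements and POST them to an ingest API. The endpoint URL, payload shape, and Metric type are illustrative assumptions, not VividCortex's actual agent or wire format.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Metric is an illustrative data point an agent might emit.
type Metric struct {
	Host  string  `json:"host"`
	Name  string  `json:"name"`
	TS    int64   `json:"ts"`
	Value float64 `json:"value"`
}

// sendBatch posts one batch of metrics to a hypothetical ingest endpoint.
func sendBatch(url string, batch []Metric) error {
	body, err := json.Marshal(batch)
	if err != nil {
		return err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// Pretend we sampled one second of work on a database host, then ship it.
	batch := []Metric{{Host: "db1", Name: "queries.per_sec", TS: time.Now().Unix(), Value: 1234}}
	if err := sendBatch("https://api.example.com/v1/metrics", batch); err != nil {
		log.Println("send failed:", err)
	}
}
```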

We deal with a lot of data. We ingest metrics and events at high speed. We also perform analytics that touch large amounts of data interactively. We are not unique and I don't want to imply we are somehow impressive in the scheme of things. We don't yet operate at "web scale." Nevertheless, our workload has some relatively unusual characteristics, and we've been able to scale as far as we have, while remaining pretty efficient in terms of cost and infrastructure. And my career in consulting has taught me that building systems like this is usually a challenge for a company (as it has been for us). Our story might be useful to others. For that reason I will go into unnecessary detail on specific parts of our workload and the challenges it brings.

What We Do


Monday
Mar 16, 2015

How and Why Swiftype Moved from EC2 to Real Hardware

This is a guest post by Oleksiy Kovyrin, Head of Technical Operations at Swiftype. Swiftype currently powers search on over 100,000 websites and serves more than 1 billion queries every month.

When Matt and Quin founded Swiftype in 2012, they chose to build the company’s infrastructure using Amazon Web Services. The cloud seemed like the best fit because it was easy to add new servers without managing hardware and there were no upfront costs.

Unfortunately, while some of the services (like Route53 and S3) ended up being really useful and incredibly stable for us, the decision to use EC2 created several major problems that plagued the team during our first year.

Swiftype’s customers demand exceptional performance and always-on availability, and our ability to provide that is heavily dependent on how stable and reliable our basic infrastructure is. With Amazon we experienced networking issues, hanging VM instances, unpredictable performance degradation (probably due to noisy neighbors sharing our hardware, but there was no way to know), and numerous other problems. No matter what problems we experienced, Amazon always had the same solution: pay Amazon more money by purchasing redundant or higher-end services.

The more time we spent working around the problems with EC2, the less time we could spend developing new features for our customers. We knew it was possible to make our infrastructure work in the cloud, but the effort, time and resources it would take to do so was much greater than migrating away.

After a year of fighting the cloud, we made a decision to leave EC2 for real hardware. Fortunately, this no longer means buying your own servers and racking them up in a colo. Managed hosting providers facilitate a good balance of physical hardware, virtualized instances, and rapid provisioning. Given our previous experience with hosting providers, we made the decision to choose SoftLayer. Their excellent service and infrastructure quality, provisioning speed, and customer support made them the best choice for us.

After more than a month of hard work preparing the inter-data center migration, we were able to execute the transition with zero downtime and no negative impact on our customers. The migration to real hardware resulted in enormous improvements in service stability from day one, provided a huge (~2x) performance boost to all key infrastructure components, and reduced our monthly hosting bill by ~50%.

This article will explain how we planned for and implemented the migration process, detail the performance improvements we saw after the transition, and offer insight for younger companies about when it might make sense to do the same.

Preparing for the switch


Monday
Mar 9, 2015

AppLovin: Marketing to Mobile Consumers Worldwide by Processing 30 Billion Requests a Day

This is a guest post from AppLovin's VP of engineering, Basil Shikin, on the infrastructure of its mobile marketing platform. Major brands like Uber, Disney, Yelp and Hotels.com use AppLovin's mobile marketing platform. It processes 30 billion requests a day and 60 terabytes of data a day.
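
As a rough consistency check on those figures (a back-of-the-envelope illustration, not anything from AppLovin), 30 billion requests per day averages out to roughly 347,000 requests per second, in line with the 300,000 sustained / 500,000 peak requests per second listed in the stats below.

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope check: 30 billion requests/day as an average rate.
	const requestsPerDay = 30e9
	const secondsPerDay = 24 * 60 * 60
	fmt.Printf("average: %.0f requests/second\n", requestsPerDay/secondsPerDay) // ~347,222
}
```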

AppLovin's marketing platform provides marketing automation and analytics for brands who want to reach their consumers on mobile. The platform enables brands to use real-time data signals to make effective marketing decisions across one billion mobile consumers worldwide.

Core Stats

  • 30 Billion ad requests per day

  • 300,000 ad requests per second, peaking at 500,000 ad requests per second

  • 5ms average response latency

  • 3 Million events per second

  • 60TB of data processed daily

  • ~1000 servers

  • 9 data centers

  • ~40 reporting dimensions

  • 500,000 metrics data points per minute

  • 1 PB Spark cluster

  • 15GB/s peak disk writes across all servers

  • 9GB/s peak disk reads across all servers

  • Founded in 2012, AppLovin is headquartered in Palo Alto, with offices in San Francisco, New York, London and Berlin.

 

Technology Stack


Monday
Mar 9, 2015

The Architecture of Algolia’s Distributed Search Network

Guest post by Julien Lemoine, co-founder & CTO of Algolia, a developer friendly search as a service API.

Algolia started in 2012 as an offline search engine SDK for mobile. At this time we had no idea that within two years we would have built a worldwide distributed search network.

Today Algolia serves more than 2 billion user-generated queries per month from 12 regions worldwide, our average server response time is 6.7ms, and 90% of queries are answered in less than 15ms. Our unavailability rate on search is below 10⁻⁶, which represents less than 3 seconds per month.
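
As a quick sanity check on that claim (my own arithmetic, not Algolia's), an unavailability rate of 10⁻⁶ over a 30-day month comes to about 2.6 seconds, which is indeed under 3 seconds:

```go
package main

import "fmt"

func main() {
	// An unavailability rate of 1e-6 applied to a 30-day month, in seconds.
	const secondsPerMonth = 30 * 24 * 60 * 60 // 2,592,000
	fmt.Printf("%.2f seconds of downtime per month\n", 1e-6*secondsPerMonth) // ~2.59
}
```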

The challenges we faced with the offline mobile SDK were technical limitations imposed by the nature of mobile. These challenges forced us to think differently when developing our algorithms because classic server-side approaches would not work.

Our product has evolved greatly since then. We would like to share our experiences with building and scaling our REST API built on top of those algorithms.

We will explain how we use distributed consensus for high availability and synchronization of data across different regions around the world, and how we route queries to the closest locations via anycast DNS.
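
Anycast routing happens in the network rather than in application code, but the effect can be sketched from a client's point of view: probe a few regional endpoints, prefer the lowest-latency one, and keep the rest as failover targets. The hostnames below are hypothetical and the code is a simplified illustration, not Algolia's actual client, DNS setup, or consensus implementation.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"time"
)

// probe uses TCP connect time as a rough proxy for distance to an endpoint.
func probe(host string) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", host+":443", 2*time.Second)
	if err != nil {
		return 0, err
	}
	conn.Close()
	return time.Since(start), nil
}

func main() {
	// Hypothetical regional endpoints; real anycast hides this behind a single name.
	hosts := []string{"us.search.example.com", "eu.search.example.com", "asia.search.example.com"}

	type candidate struct {
		host string
		rtt  time.Duration
	}
	var reachable []candidate
	for _, h := range hosts {
		if rtt, err := probe(h); err == nil {
			reachable = append(reachable, candidate{h, rtt})
		}
	}

	// Query the closest region first; the others remain as failover targets.
	sort.Slice(reachable, func(i, j int) bool { return reachable[i].rtt < reachable[j].rtt })
	for _, c := range reachable {
		fmt.Printf("%-25s %v\n", c.host, c.rtt)
	}
}
```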

The data size misconception


Monday
Feb 23, 2015

HappyPancake: a Retrospective on Building a Simple and Scalable Foundation

This is a guest repost by Rinat Abdullin, who worked on HappyPancake, the largest free dating site in Sweden. Initially written in ASP.NET with a Microsoft SQL Server database, it eventually became overly complex and expensive to scale. This is the last post in a nearly two-year-long series of engaging articles on the evolution of the project. For the complete list please see the end of this article.

Our project at HappyPancake completed this week. We delivered a simple and scalable foundation for the next version of the largest free dating web site in Sweden (with a presence in Norway and Finland).

Journey

Below is the short map of that journey. It lists technologies and approaches that we evaluated for this project. Yellow regions highlight items which made their way into the final design.

Project Deliverables


Monday
Feb 16, 2015

Scaling Kim Kardashian to 100 Million Page Views

The team at PAPER estimated their article (NSFW) containing pictures of a very naked Kim Kardashian would quickly receive over 100 million page views. The very definition of bursty viral driven traffic.

As a comparison in 2013 it was estimated Google processed over 500 million searches a day. So a nude Kim Kardashian is worth one-fifth of a Google. Strangely, I can believe it.

How did they handle this traffic gold mine? A complete recounting of the unusual behind the scenes story is told by Paul Ford in How PAPER Magazine’s web engineers scaled their back-end for Kim Kardashian (SFW).  (BTW, only one butt pun was made intentionally in this story, all others are serendipity).

What can we learn from the experience? I know what you are thinking. This is just a single static page with a few static pictures. It’s not a complex problem like search or a social network. Shouldn’t any decent CDN be enough to handle that? And you would be correct, but that’s not the whole of the story:


Monday
Feb 9, 2015

Vinted Architecture: Keeping a busy portal stable by deploying several hundred times per day

This is a guest post by Nerijus Bendžiūnas and Tomas Varaneckas of Vinted.

Vinted is a peer-to-peer marketplace to sell, buy and swap clothes. It allows members to communicate directly and has the features of a social networking service.

Started in 2008 as a small community for Lithuanian girls, it developed into a worldwide project that serves over 7 million users in 8 different countries, and is growing non-stop, handling over 200M requests per day.

Stats


Wednesday
Jan 14, 2015

StackExchange's Performance Dashboard

StackExchange created a very cool performance dashboard that looks to be updated from real system metrics. Wouldn't it be fascinating if every site had a similar dashboard?

The dashboard shows figures like 560 million page views per month, 260,000 sustained connections, 34 TB of data transferred per month, and 9 web servers with 48GB of RAM handling 185 req/s at 15% CPU usage. There are 4 SQL servers, 2 Redis servers, 3 tag engine servers, 3 Elasticsearch servers, and 2 HAProxy servers, along with stats on each.

There's also an excellent discussion thread on reddit that goes into more interesting details, with questions being answered by folks from StackExchange. 

StackExchange is still doing innovative work and is very much an example worth learning from. They've always danced to their own tune and it's a catchy tune at that. More at StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It's All About Performance.

Monday
Jan 12, 2015

The Stunning Scale of AWS and What it Means for the Future of the Cloud

James Hamilton, VP and Distinguished Engineer at Amazon, and long time blogger of interesting stuff, gave an enthusiastic talk at AWS re:Invent 2014 on AWS Innovation at Scale. He’s clearly proud of the work they are doing and it shows.

James shared a few eye popping stats about AWS:

  • 1 million active customers
  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner in 2013)
  • 449 new services and major features released in 2014
  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).
  • S3 has 132% year-over-year growth in data transfer
  • 102Tbps network capacity into a datacenter.

The major theme of the talk is that the cloud is a different world. It’s a special environment that allows AWS to do great things at scale, things you can’t do, which is why the transition from on-premises x86 servers to the public cloud is happening at a blistering pace. With so many scale-driven benefits to the public cloud, it's a transition that can't be stopped. The cloud will keep getting more reliable, more functional, and cheaper at a rate that you can't begin to match with your limited resources, generalist gear, bloated software stacks, slow supply chains, and outdated innovation paradigms.

That's the PR message at least. But one thing you can say about Amazon is they are living it. They are making it real. So some doubt is healthy, but extrapolating out the lines of fate would also be wise.

One of the fickle finger of fate advantages AWS has is resources. At one million customers they have the scale to keep the engine of expansion and improvement going. Profits aren't being taken out, money is being reinvested. This is perhaps the most important advantage of scale.

But money without smarts is simply waste. Amazon wants you to know they have the smarts. We've heard how Google and Facebook build their own gear; Amazon does too. They build their own networking gear, networking software, and racks, and they work with Intel to get faster versions of processors than are available on the market. The key is they know everything and control everything about their environment, so they can build simpler gear that does exactly what they want, which turns out to be cheaper and more reliable in the end.

Complete control allows quality metrics to be built into everything. Metrics drive a constant quality increase in all parts of the system, which is why against all odds AWS is getting more reliable as the pace of innovation quickens. Great pools of actionable data turned into knowledge are another huge advantage of scale.

Another thing AWS can do that you can't is the Availability Zone architecture itself. Each AZ is its own datacenter and AZs within a region are located very close together. This reduces messaging latencies, which means state can be synchronously replicated between AZs, which greatly improves availability compared to the typical approach where redundant datacenters are very far apart. 
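
Here is a hedged sketch of what that buys you: because the AZs are so close, a write can wait for a replica in another AZ before acknowledging, so losing a zone loses no acknowledged data. The Replica interface and in-memory stand-ins below are illustrative, not an AWS API.

```go
package main

import (
	"fmt"
	"sync"
)

// Replica stands in for durable storage in one availability zone.
type Replica interface {
	Apply(record string) error
}

// memReplica is an in-memory stand-in used only for illustration.
type memReplica struct {
	mu  sync.Mutex
	log []string
}

func (r *memReplica) Apply(record string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.log = append(r.log, record)
	return nil
}

// syncWrite acknowledges only after every AZ has applied the record.
// Waiting like this is affordable only because inter-AZ round trips are
// a millisecond or two, which is the point of packing AZs close together.
func syncWrite(record string, replicas []Replica) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(replicas))
	for _, r := range replicas {
		wg.Add(1)
		go func(r Replica) {
			defer wg.Done()
			if err := r.Apply(record); err != nil {
				errs <- err
			}
		}(r)
	}
	wg.Wait()
	close(errs)
	return <-errs // nil if every replica applied the write
}

func main() {
	azA, azB := &memReplica{}, &memReplica{}
	if err := syncWrite("order:42", []Replica{azA, azB}); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("acknowledged after both AZs applied the write")
}
```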

It's a talk rich with information and...well, spunk. The real meta-theme of the talk is how Amazon consciously uses scale to their competitive advantage. For Amazon scale isn't just an expense to be dealt with, scale is a resource to exploit, if you know how.

Here's my gloss of James Hamilton's incredible talk...

Everything in the Talk has a Foundation in Scale


Monday
Dec 15, 2014

The Machine: HP's New Memristor Based Datacenter Scale Computer - Still Changing Everything

The end of Moore’s law is the best thing that’s happened to computing in the last 50 years. Moore’s law has been a tyranny of comfort. You were assured your chips would see a constant improvement. Everyone knew what was coming and when it was coming. The entire semiconductor industry was held captive to delivering on Moore’s law. There was no new invention allowed in the entire process. Just plod along on the treadmill and do what was expected. We are finally breaking free of these shackles and entering what is the most exciting age of computing that we’ve seen since the late 1940s. Finally we are in a stage where people can invent and those new things will be tried out and worked on and find their way into the market. We’re finally going to do things differently and smarter.

-- Stanley Williams (paraphrased)

HP has been working on a radically new type of computer, enigmatically called The Machine (not this machine). The Machine is perhaps the largest R&D project in the history of HP. It’s a complete rebuild of both hardware and software from the ground up. A massive effort. HP hopes to have a small version of their datacenter scale product up and running in two years.

The story began when we first met HP’s Stanley Williams about four years ago in How Will Memristors Change Everything? In the latest chapter of the memristor story, Mr. Williams gives another incredible talk: The Machine: The HP Memristor Solution for Computing Big Data, revealing more about how The Machine works.

The goal of The Machine is to collapse the memory/storage hierarchy. Computation today is energy inefficient. Eighty percent of the energy and vast amounts of time are spent moving bits between hard disks, memory, processors, and multiple layers of cache. Customers end up spending more money on power bills than on the machines themselves. So The Machine has no hard disks, DRAM, or flash. Data is held in power-efficient memristors, an ion-based nonvolatile memory, and data is moved over a photonic network, another very power-efficient technology. When a bit of information leaves a core it leaves as a pulse of light.

On graph processing benchmarks The Machine reportedly performs 2-3 orders of magnitude better based on energy efficiency and one order of magnitude better based on time. There are no details on these benchmarks, but that’s the gist of it.

The Machine puts data first. The concept is to build a system around nonvolatile memory with processors sprinkled liberally throughout the memory. When you want to run a program you send the program to a processor near the memory, do the computation locally, and send the results back. Computation uses a wide range of heterogeneous multicore processors. By transmitting only the program and its results, the savings are enormous compared to moving terabytes or petabytes of data around.
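
The "send the program to the data" idea can be shown with a toy in-process example: ship a small query function to the node that holds the data and get back a single number, instead of pulling the whole data set across the network. This is only a sketch of the general pattern with made-up types; it says nothing about The Machine's actual programming model.

```go
package main

import "fmt"

// Query is the small piece of "program" shipped to where the data lives.
type Query func(data []float64) float64

// node stands in for a pool of nonvolatile memory with a processor beside it.
type node struct {
	data []float64
}

// Run executes the query next to the data, so only the query itself and a
// single result cross the fabric instead of the data set.
func (n *node) Run(q Query) float64 {
	return q(n.data)
}

func main() {
	// Pretend this node holds a large data set in local memory.
	n := &node{data: make([]float64, 10_000_000)}
	for i := range n.data {
		n.data[i] = float64(i % 100)
	}

	// Ship a tiny aggregate "program" instead of moving the data.
	sum := n.Run(func(data []float64) float64 {
		var s float64
		for _, v := range data {
			s += v
		}
		return s
	})
	fmt.Printf("shipped a closure, received one number back: %.0f\n", sum)
}
```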

The Machine is not targeted at standard HPC workloads. It’s not a LINPACK buster. The problem HP is trying to solve for their customers is where a customer wants to perform a query and figure out the answer by searching through a gigantic pile of data: problems that need to store lots of data and analyze it in real time as new data comes in.

Why is a very different architecture needed for building a computer? Computer systems can’t keep up with the flood of data that’s coming in. HP is hearing from their customers that they need the ability to handle ever greater amounts of data. The number of bits being collected is growing exponentially faster than the rate at which transistors are being manufactured. It’s also the case that information collection is growing faster than the rate at which hard disks are being manufactured. HP estimates there are 250 trillion DVDs worth of data that people really want to do something with. Vast amounts of data being collected in the world are never even looked at.

So something new is needed. That’s at least the bet HP is making. While it’s easy to get excited about the technology HP is developing, it won’t be for you and me, at least until the end of the decade. These will not be commercial products for quite a while. HP intends to use them for their own enterprise products, internally consuming everything that’s made. The idea is we are still very early in the tech cycle, so high cost systems are built first, then as volumes grow and processes improve, the technology will be ready for commercial deployment. Eventually costs will come down enough that smaller form factors can be sold.

What is interesting is HP is essentially building its own cloud infrastructure, but instead of leveraging commodity hardware and software, they are building their own best of breed custom hardware and software. A cloud typically makes available vast pools of memory, disk, and CPU, organized around instance types which are connected by fast networks. Recently there’s a move to treat these resource pools as independent of the underlying instances. So we are seeing high level scheduling software like Kubernetes and Mesos becoming bigger forces in the industry. HP has to build all this software themselves, solving many of the same problems, along with the opportunities provided by specialized chips. You can imagine programmers programming very specialized applications to eke out every ounce of performance from The Machine, but what is more likely is HP will have to create a very sophisticated scheduling system to optimize how programs run on top of The Machine. What's next in software is the evolution of a kind of Holographic Application Architecture, where function is fluid in both time and space, and identity arises at run-time from a two-dimensional structure. Schedule optimization is the next frontier being explored on the cloud.

The talk is organized in two broad sections: hardware and software. Two-thirds of the project is software, but Mr. Williams is a hardware guy, so hardware makes up the majority of the talk.  The hardware section is based around the idea of optimizing the various functions around the physics that is available: electrons compute; ions store; photons communicate.

Here’s my gloss on Mr. Williams’ talk. As usual with such a complex subject much can be missed. Also, Mr. Williams tosses huge interesting ideas around like pancakes, so viewing the talk is highly recommended. But until then, let’s see The Machine HP thinks will be the future of computing….

