
Tuesday
Mar 11, 2014

Building a Social Music Service Using AWS, Scala, Akka, Play, MongoDB, and Elasticsearch

This is a guest repost by Rotem Hermon, former Chief Architect for serendip.me, on the architecture and scaling considerations behind making a startup music service.

serendip.me is a social music service that helps people discover great music shared by their friends, and also introduces them to their “music soulmates” - people outside their immediate social circle who share a similar taste in music.

Serendip is running on AWS and is built on the following stack: Scala (and some Java), Akka (for handling concurrency), the Play framework (for the web and API front ends), MongoDB, and Elasticsearch.

Choosing the stack

One of the challenges of building serendip was the need to handle a large amount of data from day one, since a main feature of serendip is that it collects every piece of music being shared on Twitter from public music services. So when we approached the question of choosing the language and technologies to use, an important consideration was the ability to scale.

The JVM seemed like the right basis for our system, given its proven performance and tooling. It's also the platform of choice for a lot of open source systems (like Elasticsearch), which lets us use their native clients - a big plus.

When we looked at the JVM ecosystem, Scala stood out as an interesting language option that allowed a modern approach to writing code while keeping full interoperability with Java. Another argument in favour of Scala was the Akka actor framework, which seemed to be a good fit for a stream processing infrastructure (and indeed it was!). The Play web framework was just starting to get some adoption and looked promising. Back when we started, at the very beginning of 2011, these were still kind of bleeding edge technologies. So of course we were very pleased when, by the end of 2011, Scala and Akka consolidated under Typesafe, with Play joining in shortly after.

MongoDB was chosen for its combination of developer friendliness, ease of use, feature set and possible scalability (using auto-sharding). We learned very soon that the way we wanted to use and query our data would require creating a lot of big indexes on MongoDB, which would cause us to hit performance and memory issues pretty fast. So we kept using MongoDB mainly as a key-value document store, also relying on its atomic increments for several features that required counters.
With this type of usage MongoDB turned out to be pretty solid. It is also rather easy to operate, but mainly because we managed to avoid sharding and went with a single replica set (the sharding architecture of MongoDB is pretty complex).
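For a sense of what that key-value-plus-counters usage looks like, here is a minimal sketch using the pymongo client. The connection string, collection, and field names are hypothetical, not Serendip's actual code:

```python
# Minimal sketch of key-value document reads plus atomic counters in MongoDB.
# Collection, field names, and connection string are made up for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["music_demo"]

# Key-value style lookup of a document by its id.
track = db.tracks.find_one({"_id": "track:12345"})

# Atomic increment of a counter; upsert creates the document if it is missing.
db.tracks.update_one(
    {"_id": "track:12345"},
    {"$inc": {"share_count": 1}},
    upsert=True,
)
```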

For querying our data we needed a system with full blown search capabilities. Out of the possible open source search solutions, Elasticsearch stood out as the most scalable and cloud-oriented system. Its dynamic indexing schema and the many search and faceting possibilities it provides allowed us to build many features on top of it, making it a central component in our architecture.
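To give a flavor of the kind of search and faceting being described, here is a small sketch using the Python Elasticsearch client (older 7.x-style `body=` calls). The index, fields, and query are invented for illustration and are not Serendip's actual setup:

```python
# Sketch of a search-plus-faceting (aggregation) query against Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# With dynamic mapping, new fields in a document are indexed without a
# predeclared schema.
es.index(index="shares", id="share-1", body={
    "artist": "Radiohead",
    "track": "Reckoner",
    "shared_by": "user-42",
    "genres": ["alternative", "rock"],
})

# Full-text search filtered to a set of users, faceted by genre.
resp = es.search(index="shares", body={
    "query": {"bool": {
        "must": {"match": {"track": "reckoner"}},
        "filter": {"terms": {"shared_by": ["user-7", "user-99"]}},
    }},
    "aggs": {"by_genre": {"terms": {"field": "genres.keyword"}}},
})
print(resp["aggregations"]["by_genre"]["buckets"])
```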

We chose to manage both MongoDB and Elasticsearch ourselves and not use a hosted solution for two main reasons. First, we wanted full control over both systems. We did not want to depend on another element for software upgrades/downgrades. And second, the amount of data we process meant that a hosted solution was more expensive than managing it directly on EC2 ourselves.

Some numbers

Click to read more ...

Wednesday
Feb 26, 2014

The WhatsApp Architecture Facebook Bought For $19 Billion

In an upcoming March talk titled That's 'Billion' with a 'B': Scaling to the next level at WhatsApp, Rick Reed reveals some eye-popping WhatsApp stats:

What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM, and hopes to serve the billions of smartphones that will soon be a reality around the globe? The Erlang/FreeBSD-based server infrastructure at WhatsApp. We've faced many challenges in meeting the ever-growing demand for our messaging services, but we continue to push the envelope on size (>8000 cores) and speed (>70M Erlang messages per second) of our serving system.

But since we don’t have that talk yet, let’s take a look at a talk Rick Reed gave two years ago on WhatsApp: Scaling to Millions of Simultaneous Connections.

Having built a high performance messaging bus in C++ while at Yahoo, Rick Reed is not new to the world of high scalability architectures. The founders are also ex-Yahoo guys with not a little experience scaling systems. So WhatsApp comes by their scaling prowess honestly. And since they have a Big Hairy Audacious Goal of being on every smartphone in the world, which could be as many as 5 billion phones in a few years, they'll need to make the most of that experience.

Before we get to the facts, let’s digress for a moment on this absolutely fascinating conundrum: How can WhatsApp possibly be worth $19 billion to Facebook?

As a programmer, if you ask me whether WhatsApp is worth that much I'll answer expletive no! It's just sending stuff over a network. Get real. But I'm also the guy that thought we don't need blogging platforms, because how hard is it to remote login to your own server, edit the index.html file with vi, and then write your post in HTML? It has taken quite a while for me to realize it's not the code, stupid - it's getting all those users to love and use your product that is the hard part. You can't buy love.

What is it that makes WhatsApp so valuable? The technology? Ignore all those people who say they could write WhatsApp in a week with PHP. That’s simply not true. It is as we’ll see pretty cool technology. But certainly Facebook has sufficient chops to build WhatsApp if they wished.

Let's look at features. We know WhatsApp is a no-gimmicks (no ads, no games, no gimmicks) product with loyal users from across the world. It offers free texting in a cruel world where SMS charges can be abusive. As a sheltered American, it has surprised me the most to see how many real people use WhatsApp to really stay in touch with family and friends. So when you get on WhatsApp it's likely people you know are already on it, since everyone has a phone, which mitigates the empty social network problem. It is aggressively cross platform, so everyone you know can use it and it will just work. "It just works" is a phrase often used. It is full featured (shared locations, video, audio, pictures, push-to-talk, voice messages and photos, read receipts, group chats, sending messages over WiFi, and all of it works regardless of whether the recipient is online or not). It handles the display of native languages well. And using your cell number as identity and your contacts list as a social graph is diabolically simple. There's no email verification, no username and password, and no credit card number required. So it just works.

All impressive, but that can’t be worth $19 billion. Other products can compete on features.

That Google wanted it is one possible reason. It's a threat. It's for the $0.99 a user. Facebook is just desperate. It's for your phone book. It's for the metadata (even though WhatsApp keeps none).

It's for the 450 million active users, with a user base growing at one million users a day and a potential for a billion users. Facebook needs WhatsApp for its next billion users. Certainly that must be part of it. And a cost of about $40 a user doesn't seem unreasonable, especially with the bulk paid out in stock. Facebook acquired Instagram for about $30 per user. A Twitter user is worth $110.
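As a rough back-of-the-envelope check on that per-user figure (ignoring growth and the stock component of the deal):

```python
# Back-of-the-envelope per-user price at acquisition.
price = 19e9          # $19 billion
active_users = 450e6  # 450 million active users
print(round(price / active_users, 2))  # ~42.22, i.e. "about $40 a user"
```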

Benedict Evans makes a great case that mobile is a $1+ trillion business and that WhatsApp is disrupting the SMS part of that industry - a part with over $100 billion in global revenue - by carrying 18 billion messages a day, when the entire global SMS system carries only 20 billion messages a day. With the fundamental shift from PCs to nearly universal smartphone adoption, the opportunity is a much larger addressable market than where Facebook normally plays.

But Facebook has promised no ads and no interference, so where’s the win?

There's the interesting development of business use over mobile. WhatsApp is used to create group conversations for project teams, and venture capitalists carry out deal flow conversations over it.

Instagram is used in Kuwait to sell sheep.

WeChat, a WhatsApp competitor, launched a taxi-cab hailing service in January. In the first month 21 million cabs were hailed.

With the future of e-commerce looking like it will be funneled through mobile messaging apps, it must be an e-commerce play?

It’s not just businesses using WhatsApp for applications that were once on the desktop or on the web. Police officers in Spain use WhatsApp to catch criminals. People in Italy use it to organize basketball games.

Commerce and other applications are jumping onto mobile for obvious reasons. Everyone has a mobile phone, and these messaging applications are powerful, free, and cheap to use. No longer do you need a desktop or a web application to get things done. A lot of functionality can be overlaid on a messaging app.

So messaging is a threat to Google and Facebook. The desktop is dead. The web is dying. Messaging + mobile is an entire ecosystem that sidesteps their channel.

Facebook needs to get into this market or become irrelevant?

With the move to mobile we are seeing the deportalization of Facebook. The desktop web interface for Facebook is a portal-style interface providing access to all the features made available by the backend. It's big, complicated, and creaky. Who really loves the Facebook UI?

When Facebook moved to mobile they tried the portal approach and it didn’t work. So they are going with a strategy of smaller, more focussed, purpose built apps. Mobile first! There’s only so much you can do on a small screen. On mobile it’s easier to go find a special app than it is to find a menu buried deep within a complicated portal style application.

But Facebook is going one step further. They are not only creating purpose built apps, they are providing multiple competing apps that provide similar functionality and these apps may not even share a backend infrastructure. We see this with Messenger and WhatsApp, Instagram and Facebook’s photo app. Paper is an alternate interface to Facebook that provides very limited functionality, but it does what it does very well.

Conway's law may be operating here. The idea that “organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.” With a monolithic backend infrastructure we get a Borg-like portal design. The move to mobile frees the organization from this way of thinking. If apps can be built that provide a view of just a slice of the Facebook infrastructure then apps can be built that don’t use Facebook’s infrastructure at all. And if they don't need Facebook's infrastructure then they are free not to be built by Facebook at all. So exactly what is Facebook then?

Facebook CEO Mark Zuckerberg has his own take, saying in a keynote presentation at the Mobile World Congress that Facebook's acquisition of WhatsApp was closely related to the Internet.org vision:

The idea is to develop a group of basic internet services that would be free of charge to use — “a 911 for the internet." These could be a social networking service like Facebook, a messaging service, maybe search and other things like weather. Providing a bundle of these free of charge to users will work like a gateway drug of sorts — users who may be able to afford data services and phones these days just don't see the point of paying for those data services. This would give them some context for why those services are important, and that will lead them to paying for more services like this — or so the hope goes.

This is the long play, which is a game that having a huge reservoir of valuable stock allows you to play. 

Have we reached a conclusion? I don't think so. It's such a stunning dollar amount with such tenuous apparent immediate rewards that the long term play explanation actually does make some sense. We are still in the very early days of mobile. Nobody knows what the future will look like, so it pays not to try to force the future to look like your past. Facebook seems to be doing just that.

But enough of this. How do you support 450 million active users with only 32 engineers? Let’s find out...

Click to read more ...

Monday
Feb 17, 2014

How the AOL.com Architecture Evolved to 99.999% Availability, 8 Million Visitors Per Day, and 200,000 Requests Per Second

This is a guest post by Dave Hagler, Systems Architect at AOL.

The AOL homepages receive more than 8 million visitors per day.  That’s more daily viewers than Good Morning America or the Today Show on television.  Over a billion page views are served each month.  AOL.com has been a major internet destination since 1996, and still has a strong following of loyal users.
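As a quick aside, the 99.999% availability figure in the title translates into a strikingly small downtime budget. A rough back-of-the-envelope calculation:

```python
# Downtime budget implied by "five nines" availability.
minutes_per_year = 365.25 * 24 * 60
allowed_downtime = minutes_per_year * (1 - 0.99999)
print(f"{allowed_downtime:.1f} minutes per year")  # ~5.3 minutes
```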

The architecture for AOL.com is in its 5th generation. It has essentially been rebuilt from scratch 5 times over two decades. The current architecture was designed 6 years ago. Pieces have been upgraded and new components have been added along the way, but the overall design remains largely intact. The code, tools, and development and deployment processes have been highly tuned over 6 years of continual improvement, making the AOL.com architecture battle tested and very stable.

The engineering team is made up of developers, testers, and operations and totals around 25 people.  The majority are in Dulles, Virginia with a smaller team in Dublin, Ireland.

In general, the technologies in use are Java, JavaServer Pages, Tomcat, Apache, CentOS 5, Git, Jenkins, Selenium, and jQuery. There are some other technologies used outside that stack, but these are the main components.

Design Principles

Click to read more ...

Monday
Feb 10, 2014

13 Simple Tricks for Scaling Python and Django with Apache from HackerEarth

HackerEarth is a coding skill practice and testing service that, in a series of well-written articles, describes the trials and tribulations of building their site and how they overcame them: Scaling Python/Django application with Apache and mod_wsgi, Programming challenges, uptime, and mistakes in 2013, Post-mortem: The big outage on January 25, 2014, The Robust Realtime Server, 100,000 strong - CodeFactory server, Scaling database with Django and HAProxy, Continuous Deployment System, HackerEarth Technology Stack.

What characterizes these articles and makes them especially helpful is a drive for improvement and an openness towards reporting what didn't work and how they figured out what would work.

As they say, mistakes happen when you are building a complex product with a team of just 3-4 engineers, but investing in infrastructure allowed them to take more breaks and roam the streets of Bangalore while their servers happily serve thousands of requests every minute - all while reaching a 50,000-strong user base with ease.

Here's a gloss on how they did it:

Click to read more ...

Monday
Feb 3, 2014

How Google Backs Up the Internet Along With Exabytes of Other Data

Raymond Blum leads a team of Site Reliability Engineers charged with keeping Google's data secret and keeping it safe. Of course Google would never say how much data this actually is, but from comments it seems that it is not yet a yottabyte, but is many exabytes in size. GMail alone is approaching low exabytes of data.

Mr. Blum, in the video How Google Backs Up the Internet, explained that common backup strategies don't work for Google for a very googly-sounding reason: typically they scale effort with capacity. If backing up twice as much data requires twice as much stuff to do it - where stuff is time, energy, space, etc. - it won't work; it doesn't scale. You have to find efficiencies so that capacity can scale faster than the effort needed to support that capacity. A different plan is needed when making the jump from backing up one exabyte to backing up two exabytes. And the talk is largely about how Google makes that happen.

Some major themes of the talk:

  • No data loss, ever. Even the infamous GMail outage did not lose data, but the story is more complicated than just a lot of tape backup. Data was retrieved from across the stack, which requires engineering at every level, including the human.

  • Backups are useless. It’s the restore you care about. It’s a restore system not a backup system. Backups are a tax you pay for the luxury of a restore. Shift work to backups and make them as complicated as needed to make restores so simple a cat could do it.

  • You can’t scale linearly. You can’t have 100 times as much data require 100 times the people or machine resources. Look for force multipliers. Automation is the major way of improving utilization and efficiency.

  • Redundancy in everything. Google stuff fails all the time. It’s crap. The same way cells in our body die. Google doesn’t dream that things don’t die. It plans for it.

  • Diversity in everything. If you are worried about site locality, put data in multiple sites. If you are worried about user error, have levels of isolation from user interaction. If you want protection from a software bug, put it on different software. Store stuff on different vendor gear to reduce the effect of a large vendor bug.

  • Take humans out of the loop. How many copies of an email are kept by GMail? It's not something a human should care about. Some parameters are configured by GMail and the system takes care of it. This is a constant theme. High level policies are set and systems make it so. Only bother a human if something outside the norm occurs.

  • Prove it. If you don't try it, it doesn't work. Backups and restores are continually tested to verify they work (a toy sketch of the idea follows this list).
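To make the "prove it" principle concrete, here is a toy Python sketch - emphatically not Google's tooling - of continuous restore verification. The restore command, catalog format, and paths are all hypothetical:

```python
# Toy sketch: restore a randomly chosen backup and verify it against a known
# checksum, bothering a human only when something is wrong.
import hashlib
import random
import subprocess

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_random_restore(catalog):
    """catalog: backup_id -> (expected_sha256, restore_target_path)."""
    backup_id = random.choice(list(catalog))
    expected, target = catalog[backup_id]
    # Exercise the same path an emergency restore would use.
    subprocess.run(
        ["restore-tool", "--backup-id", backup_id, "--to", target],
        check=True,
    )
    if sha256_of(target) != expected:
        raise RuntimeError(f"restore verification failed for {backup_id}")
```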

There’s a lot to learn here for any organization, big or small. Mr. Blum’s talk is entertaining, informative, and well worth watching. He does really seem to love the challenge of his job.

Here’s my gloss on this very interesting talk where we learn many secrets from inside the beast:

Click to read more ...

Tuesday
Jan 28, 2014

How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data

This is a guest post by Eric Czech, Chief Architect at Next Big Sound, about some of the unique approaches taken to solving scalability challenges in music analytics.

Tracking online activity is hardly a new idea, but doing it for the entire music industry isn't easy. Half a billion music video streams, track downloads, and artist page likes occur each day and measuring all of this activity across platforms such as Spotify, iTunes, YouTube, Facebook, and more, poses some interesting scalability challenges. Next Big Sound collects this type of data from over a hundred sources, standardizes everything, and offers that information to record labels, band managers, and artists through a web-based analytics platform.

Many of our applications use open-source systems like Hadoop, HBase, Cassandra, Mongo, RabbitMQ, and MySQL, and our usage of them is fairly standard, but there is one aspect of what we do that is pretty unique. We collect or receive information from 100+ sources, and we struggled early on to find a way to deal with how data from those sources changed over time. We ultimately decided that we needed a data storage solution that could represent those changes. Basically, we needed to be able to "version" or "branch" the data from those sources in much the same way that we use revision control (via Git) to control the code that creates it. We did this by adding a logical layer to a Cloudera distribution, and after integrating that layer with Apache Pig, HBase, Hive, and HDFS, we now have a basic version control framework for large amounts of data in Hadoop clusters.
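To make the branching idea a little more concrete, here is a toy, in-memory sketch of what version-aware writes and reads can look like. It is only an illustration of the concept, not Next Big Sound's actual layer over HBase/HDFS, and all names and versions are made up:

```python
# Toy sketch: metric rows keyed by a dataset version as well as by entity and
# date, so a corrected feed can coexist with the original and readers pin the
# version they trust. Reads fall back through parent versions, git-style.
from collections import defaultdict

class VersionedStore:
    def __init__(self):
        self._rows = defaultdict(dict)  # (version, entity, date) -> metrics
        self._parents = {}              # version -> parent version

    def branch(self, new_version, from_version=None):
        self._parents[new_version] = from_version

    def put(self, version, entity, date, metrics):
        self._rows[(version, entity, date)].update(metrics)

    def get(self, version, entity, date):
        while version is not None:
            row = self._rows.get((version, entity, date))
            if row is not None:
                return row
            version = self._parents.get(version)
        return None

store = VersionedStore()
store.branch("v1")
store.put("v1", "artist:123", "2014-01-01", {"plays": 1000})
store.branch("v2-backfill", from_version="v1")  # corrected feed from a source
store.put("v2-backfill", "artist:123", "2014-01-01", {"plays": 1150})
print(store.get("v2-backfill", "artist:123", "2014-01-01"))  # {'plays': 1150}
print(store.get("v1", "artist:123", "2014-01-01"))           # {'plays': 1000}
```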

As a sort of "Moneyball for Music," Next Big Sound has grown from a single-server LAMP site tracking plays on MySpace (it was cool when we started) for a handful of artists to building industry-wide popularity charts for Billboard and ingesting records of every song streamed on Spotify. The growth rate of the data has been close to exponential, and early adoption of distributed systems has been crucial in keeping up. With over 100 sources tracked, coming from both public and proprietary providers, dealing with the heterogeneous nature of music analytics has required some novel solutions that go beyond the features that come for free with modern distributed databases.

Next Big Sound has also transitioned between full cloud providers (Slicehost), hybrid providers (Rackspace), and colocation (Zcolo) all while running with a small engineering staff using nothing but open source systems. There was no shortage of lessons learned in this process and we hope others can take something away from our successes and failures.

How does Next Big Sound make beautiful data out of all that music? Step inside and see...

Click to read more ...

Monday
Jan 13, 2014

NYTimes Architecture: No Head, No Master, No Single Point of Failure

Michael Laing, a Systems Architect at NYTimes, gave this great description of their use of RabbitMQ and their overall architecture on the RabbitMQ mailing list. The closing sentiment marks this as definitely an architecture to learn from:

Although it may seem complex, Fabrik has simple components and is mostly principles and plumbing. The key point to grasp is that there is no head, no master, no single point of failure. As I write this I can see components failing (not RabbitMQ), and we are fixing them so they are more reliable. But the system doesn't fail, users can connect, and messages are delivered, regardless - all within design parameters.
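For readers unfamiliar with RabbitMQ, here is a small, hedged sketch (using the pika Python client) of the client-side basics that a "no single point of failure, messages are delivered regardless" design builds on. The host names, exchange, and routing key are invented, and Fabrik itself of course layers clustering, federation, and plumbing well beyond this:

```python
# Toy sketch: a publisher with no single broker dependency. It is given several
# RabbitMQ nodes to try, publishes to a durable exchange, and asks the broker
# to confirm receipt before moving on.
import pika

params = [
    pika.ConnectionParameters(host="rabbit-a.example.internal"),
    pika.ConnectionParameters(host="rabbit-b.example.internal"),
]
# pika tries the nodes in order, so losing one broker does not stop the client.
connection = pika.BlockingConnection(params)
channel = connection.channel()

channel.exchange_declare(exchange="fabrik.events", exchange_type="topic",
                         durable=True)
channel.confirm_delivery()  # broker must acknowledge taking the message

channel.basic_publish(
    exchange="fabrik.events",
    routing_key="news.alerts.us",
    body=b'{"headline": "example"}',
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```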

Since it's short, to the point, and I couldn't say it better, I'll just reproduce several original sources here:

Click to read more ...

Monday
Jan 6, 2014

How HipChat Stores and Indexes Billions of Messages Using ElasticSearch and Redis

This article is from an interview with Zuhaib Siddique, a production engineer at HipChat, makers of group chat and IM for teams.

HipChat started in an unusual space, one you might not think would have much promise - enterprise group messaging - but as we are learning, there is gold in them there enterprise hills. Which is why Atlassian, makers of well-thought-of tools like JIRA and Confluence, acquired HipChat in 2012.

And in a tale not often heard, the resources and connections of a larger parent have actually helped HipChat enter an exponential growth cycle. Having reached the 1.2 billion message storage mark they are now doubling the number of messages sent, stored, and indexed every few months.

That kind of growth puts a lot of pressure on a once adequate infrastructure. HipChat exhibited a common scaling pattern. Start simple, experience traffic spikes, and then think what do we do now? Using bigger computers is usually the first and best answer. And they did that. That gave them some breathing room to figure out what to do next. On AWS, after a certain inflection point, you start going Cloud Native, that is, scaling horizontally. And that’s what they did.

But there's a twist to the story. Security concerns have driven the development of an on-premises version of HipChat in addition to its cloud/SaaS version. We'll talk more about this interesting development in a post later this week.

While HipChat isn't Google scale, there is good stuff to learn about how they index and search billions of messages in a timely manner, which is the key difference between something like IRC and HipChat. Indexing and storing messages under load while not losing messages is the challenge.
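As a rough illustration of how the two systems can be paired, here is a small Python sketch: buffer incoming messages in a Redis list so spikes don't drop anything, then drain them into an Elasticsearch index for search. This is one common pattern, not necessarily HipChat's exact pipeline, and all index and key names (and the 7.x-style client calls) are assumptions:

```python
# Toy sketch: Redis as a loss-avoiding buffer in front of Elasticsearch indexing.
import json
import redis
from elasticsearch import Elasticsearch

r = redis.Redis(host="localhost", port=6379)
es = Elasticsearch("http://localhost:9200")

def enqueue_message(room_id, sender, text):
    # Accept the message immediately, even if the indexer is behind.
    r.rpush("chat:incoming", json.dumps(
        {"room_id": room_id, "sender": sender, "text": text}))

def drain_to_index(batch_size=500):
    # Periodically move buffered messages into the search index.
    for _ in range(batch_size):
        raw = r.lpop("chat:incoming")
        if raw is None:
            break
        es.index(index="messages", body=json.loads(raw))

def search_room(room_id, phrase):
    return es.search(index="messages", body={
        "query": {"bool": {
            "must": {"match": {"text": phrase}},
            "filter": {"term": {"room_id": room_id}},
        }}
    })
```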

This is the road that HipChat took, what can you learn? Let’s see…

Click to read more ...

Monday
Dec 2, 2013

Evolution of Bazaarvoice’s Architecture to 500M Unique Users Per Month

This is a guest post written by Victor Trac, Cloud Architect at Bazaarvoice.

Bazaarvoice is a company that people interact with on a regular basis but have probably never heard of. If you read customer reviews on sites like bestbuy.com, nike.com, or walmart.com, you are using Bazaarvoice services. These sites, along with thousands of others, rely on Bazaarvoice to supply the software and technology to collect and display user conversations about products and services. All of this means that Bazaarvoice processes a lot of sentiment data on most of the products we all use daily.

Bazaarvoice helps our clients make better products by using a combination of machine learning and natural language processing to extract useful information and user sentiments from the millions of free-text reviews that go through our platform. This data gets boiled down into reports that clients can use to improve their products and services. We are also starting to look at how to show personalized sortings of reviews that speak to what we think customers care about the most. A mother browsing for cars, for example, may prefer to read reviews about safety features as compared to a 20-something male, who might want to know about the car’s performance. As more companies use Bazaarvoice technology, consumers become more informed and make better buying decisions.
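To make the personalized-sorting idea concrete, here is a toy sketch of profile-weighted review ranking. The profiles, topics, and keyword lists are invented for illustration and bear no relation to Bazaarvoice's actual models:

```python
# Toy sketch: rank reviews by how much they discuss the topics a given shopper
# profile cares about (e.g. safety vs. performance for car reviews).
PROFILE_TOPIC_WEIGHTS = {
    "safety_focused": {"safety": 3.0, "reliability": 2.0, "performance": 0.5},
    "performance_focused": {"performance": 3.0, "handling": 2.0, "safety": 0.5},
}

TOPIC_KEYWORDS = {
    "safety": ["airbag", "crash", "safety"],
    "reliability": ["reliable", "maintenance"],
    "performance": ["horsepower", "acceleration", "fast"],
    "handling": ["handling", "cornering"],
}

def score(review_text, profile):
    text = review_text.lower()
    weights = PROFILE_TOPIC_WEIGHTS[profile]
    # Add a topic's weight once per matching keyword found in the review.
    return sum(weight
               for topic, weight in weights.items()
               for kw in TOPIC_KEYWORDS.get(topic, [])
               if kw in text)

def personalized_sort(reviews, profile):
    return sorted(reviews, key=lambda r: score(r, profile), reverse=True)
```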

Here's how it works...

Click to read more ...

Monday
Nov 25, 2013

How To Make an Infinitely Scalable Relational Database Management System (RDBMS)

This is a guest post by Mark Travis, Founder of InfiniSQL.

InfiniSQL is the specific "Infinitely Scalable RDBMS" to which the title refers. It is free software, and instructions for getting, building, running and testing it are available in the guide. Benchmarking shows that an InfiniSQL cluster can handle over 500,000 complex transactions per second with over 100,000 simultaneous connections, all on twelve small servers. The methods used to test are documented, and the code is all available so that any practitioner can achieve similar results. There are two main characteristics which make InfiniSQL extraordinary:
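A quick sanity check on what those headline benchmark numbers imply per server, as rough averages only:

```python
# Rough per-server averages from the benchmark figures above.
transactions_per_sec = 500_000
connections = 100_000
servers = 12
print(transactions_per_sec // servers)  # ~41,666 complex transactions/sec per server
print(connections // servers)           # ~8,333 simultaneous connections per server
```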

Click to read more ...