Monday, February 8, 2010

How FarmVille Scales to Harvest 75 Million Players a Month

Several readers had follow-up questions in response to this article. Luke's responses can be found in How FarmVille Scales - The Follow-up.

If real farming were as comforting as it is in Zynga's mega-hit FarmVille, then my family would probably never have left those harsh North Dakota winters. None of the scary bedtime stories my Grandma used to tell about farming are true in FarmVille. Farmers make money, plants grow, and animals never visit the red barn. I guess it's just that keep-your-shoes-clean, back-to-the-land charm that has helped make FarmVille the "largest game in the world" in such an astonishingly short time.

How did FarmVille scale a web application to handle 75 million players a month? Fortunately, FarmVille's Luke Rajlich has agreed to let us in on a few of their challenges and secrets. Here's what Luke has to say...

The format of the interview was that I sent Luke a few general questions and he replied with this response:

FarmVille has a set of scaling challenges that are unique to the application. The game has had to scale fast and far: it had 1 million daily players after 4 days and 10 million after 60 days. At the time of launch, the largest social game had 5 million daily players. Currently, 9 months after launch, FarmVille has 28 million daily players and 75 million monthly players. That makes the monthly player base of FarmVille larger than the entire population of France. There are two fundamental characteristics that make FarmVille a unique scaling challenge: it is the largest game in the world and it is the largest application on a web platform. Both of these aspects present a unique set of scaling challenges that FarmVille has had to overcome. In terms of technology investment, FarmVille primarily utilizes open source components and is at its core built on the LAMP stack.

In order to make FarmVille scale as a game, we have to accommodate the workload requirements of a game. A user's state contains a large amount of data with subtle and complex relationships. For example, objects on a farm cannot collide with each other, so if a user places a house on their farm, the backend needs to check that no other object on that user's farm occupies an overlapping space. Unlike most major sites like Google or Facebook, which are read heavy, FarmVille has an extremely heavy write workload. The ratio of data reads to writes is 3:1, which is an incredibly high write rate. A majority of the requests hitting the backend for FarmVille modify the state of the user playing the game in some way. To make this scalable, we have worked to make our application interact primarily with cache components. Additionally, the release of new content and features tends to cause usage spikes since we are effectively extending the game. The load spikes can be as large as 50% on the day of a new feature's release. We have to be able to accommodate this spiky traffic.
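To make the cache-first, write-heavy workload concrete, here is a minimal sketch of what placing an object on a farm could look like. This is an illustration only: the names (FarmCache, place_object, overlaps) are invented, FarmVille itself runs on the LAMP stack rather than Python, and how state is persisted from cache to database is not described in the article, so it is left out.

```python
import json

class FarmCache:
    """Hypothetical cache-first store for a user's farm state.

    Reads and writes go straight to memcache (any client with get/set
    works); persistence to the database is not covered in the article,
    so it is omitted here.
    """

    def __init__(self, memcache_client):
        self.mc = memcache_client

    def load_farm(self, user_id):
        raw = self.mc.get("farm:%s" % user_id)
        return json.loads(raw) if raw else {"objects": []}

    def save_farm(self, user_id, farm):
        # Like most FarmVille requests, placing an object is a write.
        self.mc.set("farm:%s" % user_id, json.dumps(farm))

def overlaps(a, b):
    """Axis-aligned bounding-box test: True if two objects collide."""
    return not (a["x"] + a["w"] <= b["x"] or b["x"] + b["w"] <= a["x"] or
                a["y"] + a["h"] <= b["y"] or b["y"] + b["h"] <= a["y"])

def place_object(cache, user_id, new_obj):
    """Place a house/tree/etc. only if it overlaps nothing already there."""
    farm = cache.load_farm(user_id)
    if any(overlaps(new_obj, existing) for existing in farm["objects"]):
        return False  # reject the placement; no state change
    farm["objects"].append(new_obj)
    cache.save_farm(user_id, farm)
    return True
```

The point of the sketch is that the whole read-check-write cycle touches only the cache, which is how a 3:1 read-to-write ratio stays serviceable.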

The other piece is making FarmVille scale as the largest application on a web platform, an application as large as some of the largest websites in the world. Since the game runs inside the Facebook platform, we are very sensitive to the latency and performance variance of the platform. As a result, we've done a lot of work to mitigate that latency variance: we heavily cache Facebook data and gracefully ratchet back usage of the platform when we see performance degrade. FarmVille has deployed an entire cluster of caching servers for the Facebook platform. The amount of traffic between FarmVille and the Facebook platform is enormous: at peak, roughly 3 Gigabits/sec of traffic flow between FarmVille and Facebook while our caching cluster serves another 1.5 Gigabits/sec to the application. Additionally, since performance can be variable, the application has the ability to dynamically turn off any calls back to the platform. We have a dial we can tweak that turns off incrementally more calls back to the platform. We have additionally worked to make all calls back to the platform avoid blocking the loading of the application itself. The idea here is that, if all else fails, players can continue to at least play the game.
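The "dial" for ratcheting back platform calls could be as simple as a ranked list of call types plus one integer saying how many are currently allowed. The sketch below is a guess at the shape of such a mechanism, not FarmVille's implementation; the call names and priorities are made up.

```python
# Platform call types ordered from most to least essential (names invented).
PLATFORM_CALLS = ["auth", "friends_list", "publish_feed", "profile_pics", "notifications"]

class PlatformDial:
    """One integer controls how many call types are currently allowed;
    turning the dial down sheds the least essential calls first."""

    def __init__(self):
        self.level = len(PLATFORM_CALLS)  # everything enabled by default

    def allows(self, call_name):
        if call_name not in PLATFORM_CALLS:
            return False  # unknown call types stay off
        return PLATFORM_CALLS.index(call_name) < self.level

    def turn_down(self):
        self.level = max(1, self.level - 1)  # never shed the most essential call

dial = PlatformDial()
if dial.allows("publish_feed"):
    pass  # make the platform call here, without blocking the game's load
# When Facebook latency degrades, dial back with dial.turn_down()
```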

For any web application, high latency kills your app and highly variable latency eventually kills your app. To address high latency, FarmVille has put a lot of caching in front of high latency components. Highly variable latency is another challenge, as it requires rethinking how the application relies on pieces of its architecture that normally have acceptable latency. Just about every component is susceptible to this variable latency, some more than others. Because of FarmVille's nature, where the workload is very write and transaction heavy, variability in latency has a magnified effect on user experience compared with a traditional web application. The way FarmVille has handled these scenarios is by treating every single component as a degradable service. Memcache, the database, REST APIs, etc. are all treated as degradable services. Services degrade by rate limiting errors to that service and by implementing service usage throttles. The key ideas are to isolate troubled and highly latent services so they don't cause latency and performance issues elsewhere, through error and timeout throttling, and, if needed, to disable functionality in the application using on/off switches and functionality-based throttles.
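Treating a component as a degradable service with error and timeout throttling is essentially what is now commonly called a circuit breaker. Here is a minimal sketch of one, assuming a sliding error window, a cool-off period, and a manual on/off switch; the thresholds and class name are invented for illustration and are not FarmVille's code.

```python
import time

class DegradableService:
    """Wrap a backend call (memcache, DB, REST API) so a burst of errors
    or timeouts trips a throttle instead of slowing every request."""

    def __init__(self, call, max_errors=50, window=10.0, cooloff=30.0):
        self.call = call              # the real service call
        self.max_errors = max_errors  # failures tolerated per window
        self.window = window          # seconds over which failures are counted
        self.cooloff = cooloff        # how long to stay degraded once tripped
        self.errors = []              # timestamps of recent failures
        self.degraded_until = 0.0
        self.enabled = True           # the manual on/off switch

    def __call__(self, *args, **kwargs):
        now = time.time()
        if not self.enabled or now < self.degraded_until:
            return None               # fail fast while degraded
        try:
            return self.call(*args, **kwargs)
        except Exception:             # client timeouts surface here too
            self.errors = [t for t in self.errors if now - t < self.window]
            self.errors.append(now)
            if len(self.errors) >= self.max_errors:
                self.degraded_until = now + self.cooloff  # isolate the service
            return None
```

Once the throttle trips, calls to the troubled service fail fast and the rest of the request path stops paying its latency.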

To help manage and monitor FarmVille's web farm, we utilize a number of open source monitoring and management tools. We use Nagios for alerting, Munin for monitoring, and Puppet for configuration. We heavily utilize internal stats systems to track the performance of the services the application uses, such as Facebook, the DB, and Memcache. Additionally, when we see performance degradation, we profile a request's IO events on a sampled basis.
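Profiling IO on a sampled basis keeps the measurement overhead negligible at this request volume. A minimal sketch, assuming a 1% sample rate; record_timing is a placeholder for whatever internal stats system collects the data, not a real API.

```python
import random
import time

SAMPLE_RATE = 0.01  # profile roughly 1 in 100 requests

def record_timing(path, timings):
    # Stand-in for the internal stats system; in production this would
    # ship the sampled IO timings to a collector.
    print(path, timings)

def handle_request(path, io_calls):
    """io_calls: list of (name, fn) pairs for the IO a request performs."""
    profiled = random.random() < SAMPLE_RATE
    timings, results = [], []
    for name, fn in io_calls:
        start = time.time()
        results.append(fn())  # memcache get, DB query, platform call, ...
        if profiled:
            timings.append((name, time.time() - start))
    if profiled:
        record_timing(path, timings)  # only sampled requests pay the reporting cost
    return results
```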

Lessons Learned

There's not quite as much detail as I would like about some things, but there are still a number of interesting points that I think people can learn from:

  1. Interactive games are write heavy. Typical web apps read more than they write, so many common architectures may not be sufficient. Read heavy apps can often get by with a caching layer in front of a single database. Write heavy apps will need to partition so writes are spread out (see the sharding sketch after this list) and/or use an in-memory architecture.
  2. Design every component as a degradable service. Isolate components so increased latencies in one area won't ruin another. Throttle usage to help alleviate problems. Turn off features when necessary.
  3. Cache Facebook data. When you are deeply dependent on an external component consider caching that component's data to improve latency.
  4. Plan ahead for usage spikes related to new releases.
  5. Sample. When analyzing large streams of data, to look for problems for example, not every piece of data needs to be processed. Sampling data can yield the same results for much less work.
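For point 1, spreading writes out usually means partitioning by user so that all of a given user's writes land on the same database. A generic sketch follows; the shard list and hashing scheme are illustrative, not a description of FarmVille's actual partitioning.

```python
import hashlib

# Hypothetical pool of write databases; the real topology isn't public.
DB_SHARDS = ["db01", "db02", "db03", "db04"]

def shard_for(user_id):
    """Map a user to one shard deterministically, so all of that user's
    writes always land on the same database."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return DB_SHARDS[int(digest, 16) % len(DB_SHARDS)]

# Every write for user 12345 goes to the same place:
# save_farm(shard_for(12345), farm_state)
```

A production system would typically use consistent hashing or a shard directory so shards can be added without remapping every user.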

I'd like to thank Zynga and Luke Rajlich for making the time for this interview. If anyone else has an architecture that they would like to feature please let me know and we'll set it up.

Related Articles

  1. Building Big Social Games - Talks about the game play mechanics behind FarmVille.
  2. How BuddyPoke Scales on Facebook Using Google App Engine
  3. Strategy: Sample to Reduce Data Set
  4. HighScalability posts on caching and memcached.
  5. HighScalability posts on sharding.
  6. HighScalability posts on Memory Grids.
  7. How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book
  8. Scaling FarmVille by James Hamilton

Reader Comments (20)

Wow! FarmVille just made it from "most annoying thing on the Web" to "coolest example of a technical challenge". In my personal dictionary, that is :)

February 8, 2010 | Unregistered CommenterBèr Kessels

The last thing I want to do is belittle what the fine folks at Zynga have accomplished but there is a massive error of omission here... Zynga runs a 200 node Vertica cluster. The scalability of the datastore is pretty important, no? Quoth a colleague:

"200 nodes running Vertica...if you can't get THAT to scale out you maybe should work at a 7-11 counter instead"


Disclosure: I don't work for Vertica, heh.

February 8, 2010 | Unregistered Commenterbos

@bos

Interesting. I assume 200 nodes isn't too expensive to run? Otherwise it seems like it'd be worth pointing out the decision of develop-your-own vs. pay for off-the-shelf.

February 8, 2010 | Unregistered Commenterkimbo305

Disappointingly too generic :(
It would be fascinating to actually read some of the details.
Do they use cloud or dedicated servers? How many servers? What DB? Which "P" in LAMP (PHP, Perl, Python, etc.)? Examples of degraded services?
More meat, less fluff.. :)
At least something as detailed as the previous CNBC article..

February 8, 2010 | Unregistered CommenterMxx

I doubt the games hit the Vertica cluster. That's an offline processing system.

February 8, 2010 | Unregistered Commenterpaul

The article doesn't clearly explain which datastore is used to hold each user and their farm. Is it MySQL? MySQL Cluster? Some other NoSQL DB?

February 8, 2010 | Unregistered CommenterChad

It's a good brief.

But as always, the 'Devil' is in the details.

Where can we get more metrics about the actual app stack/hardware that would give a good idea of scalability intended vs. scalability achieved?

February 8, 2010 | Unregistered CommenterRaghuraman

It's clearly running on Amazon's EC2. Just put a sniffer on it and you'll see the details.

February 8, 2010 | Unregistered CommenterChad

So, their web 2.0 strategy is to run off the users drive and resources as much as possible and cut down net traffic.

Wow. I would have never considered that to be 2.0, since all 0.1 games ran locally and MMORPGs are relatively new, but what do I know.

In other news, 75M people need jobs, apparently. I've tried several web-app-games and find them to be tedious, limited, and generally dull. I also despise the inexcusably repetitive and unavoidable notifications on facebook to the point that I've quit using it other than as a monthly check in on distant friends.

It's amazing how a poorly designed blog engine made news as myspace, and how massive advertising pulled millions into facebook. I still get facebook invites from people who DO NOT EXIST. I've had over a hundred of these in the last two years alone. Nobody seems to say much about that.

Rule #1 in business ethics:

If a company has to lie, cheat, or steal to get your business, you'll do well to avoid them. Unless, of course, you like being compared to them, and to friendfinder [worlds largest spammer at one point], yahoo [spam enabling company]. These companies did pretty well for themselves by providing almost no service whatsoever.

Welcome to America, where hype is EVERYTHING.

February 8, 2010 | Unregistered Commenterdemopoly

@paul I think you've missed something. The games are most definitely hitting the Vertica cluster. Some of Vertica's biggest clients are high frequency traders who use it to crunch real time data streams. Why would you think it's an 'offline processing system'? With a cluster that size I'm sure Zynga has no problem rolling up their 3TB of daily farmville data.

Here's an article with a few more numbers for reference:

http://tdwi.org/Blogs/WayneEckerson/2010/02/Zynga.aspx

February 8, 2010 | Unregistered Commentertrun

I believe they are using Wink Streaming's CDN to do this. I guess it's all about load distribution.

Mike

February 8, 2010 | Unregistered CommenterMichael

75mil sounds cool, but imagine the challenges on Chinese sites. My fiance is Chinese and the farming game they play has hundreds of millions of active players. Respect to the coding team of that game : )

I think it's a very interesting area (high traffic online games)

thanks

February 9, 2010 | Unregistered CommenterArtur Ejsmont

I liked the post, however I would love to know more about how this was implemented. What high level view of the "glue" that keeps all the abstract pieces together was used? The LAMP stack was a good start, but a quick fly-over of the major pieces (with bonus points for equipment used) would be extremely cool.

Either way, Farmville is definitely an impressive example of scalability. I love it!

February 9, 2010 | Unregistered CommenterRob

it is the largest application on a web platform

Mmmmm... smells like made up bullshit to me!

There is NO way he can actually KNOW this for a fact. Sure does 'sound nice' though, even if unprovable.

February 10, 2010 | Unregistered CommenterRobert Schultz

The concept of degradable services looks pretty nice. Does anyone know of any other case studies where this has been explained? I would certainly like to know more on this.

February 10, 2010 | Unregistered CommenterAayush Puri

@Robert Schultz
It's BS only in the sense that what is largest is not clearly defined. And even when defined, it's probably gonna be a debatable metric.

February 10, 2010 | Unregistered Commenterkimbo305

This is the other side of the story:
http://techcrunch.com/2009/10/31/scamville-the-social-gaming-ecosystem-of-hell/

February 14, 2010 | Unregistered Commenterspyder

Spyder,

The article on tech crunch is full of misinformation. The bad offers were yanked immediately after they were discovered. A process was put in place to ensure that it never happens again.

February 18, 2010 | Unregistered CommenterAnonymous Coward

Interesting post, but it would be great to know more about the technical side of the LAMP scaling they are using. Are they using MySQL Cluster? MySQL replication? How do they manage failovers? Apparently they use EC2 too.
This would be a great example for all of us.

March 2, 2010 | Registered CommenterTom Murphy

I really like the fact that a web site scales up to 28M users per day, which is a considerably large number, and handles a heavy write workload in the gaming world.

March 29, 2010 | Unregistered CommenterNiranjan Sarvi
