Monday, October 15, 2012

Simpler, Cheaper, Faster: Playtomic's Move from .NET to Node and Heroku

This is a guest post by Ben Lowry, CEO of Playtomic. Playtomic is a game analytics service integrated into about 8,000 mobile, web and downloadable games played by approximately 20 million people daily.

Here's a good summary quote by Ben Lowry on Hacker News:

Just over 20,000,000 people hit my API yesterday 700,749,252 times, playing the ~8,000 games my analytics platform is integrated in for a bit under 600 years in total play time. That's just yesterday. There are lots of different bottlenecks waiting for people operating at scale. Heroku and NodeJS, for my use case, eventually alleviated a whole bunch of them very cheaply.

Playtomic began with an almost exclusively Microsoft .NET and Windows architecture which held up for 3 years before being replaced with a complete rewrite using NodeJS. During its lifetime the platform grew from shared space on a single server to a full dedicated server, then spread to a second dedicated server, then the API server was offloaded to a VPS provider on 4 – 6 fairly large VPSs. Eventually the API settled on 8 dedicated servers at Hivelocity, each a quad core with hyperthreading + 8 GB of RAM + dual 500 GB disks, running 3 or 4 instances of the API stack.
 
These servers routinely serviced 30,000 to 60,000 concurrent game players and received up to 1500 requests per second, with load balancing done via DNS round robin.

In July the entire fleet of servers was replaced with a NodeJS rewrite hosted at Heroku for a significant saving.

 

Scaling Playtomic with NodeJS

There were two parts to the migration:

  1. Dedicated to PaaS:  Advantages include price, convenience, leveraging their load balancing and reducing overall complexity.  Disadvantages include no New Relic for NodeJS, very inelegant crashes, and a generally immature platform.
  2. .NET to NodeJS: Switching architecture from ASP.NET / C# with local MongoDB instances and a service that pre-processed event data locally before sending it to a centralized server, to NodeJS on Heroku + Redis with pre-processing at SoftLayer (see their Catalyst program).

Dedicated to PaaS

The reduction in complexity is significant; we had 8 dedicated servers each running 3 or 4 instances of the API at our hosting partner Hivelocity.  Each ran a small suite of software including:

  • MongoDB instance
  • log pre-processing service
  • monitoring service
  • IIS with the API sites

Deploys were done via an FTP script that uploaded new API site versions to all servers.  The services were more annoying to deploy but changed infrequently.

MongoDB was a poor choice for temporarily holding log data before it was pre-processed and sent off.  It offered a huge speed advantage because writes initially went to memory, so write requests were “finished” almost instantly, which was far superior to the common message queues on Windows.  But it never reclaimed the space left by deleted data, which meant the database would balloon to 100+ gigabytes if it wasn’t compacted regularly.
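For context, “compacting” here means manual maintenance along these lines, shown as a hedged sketch in the mongo shell; the database and collection names are invented for the example, and both commands block, so they would have had to run in a quiet window:

// Illustrative only: "logbuffer" and "events" are made-up names, not Playtomic's schema.
use logbuffer
db.runCommand({ compact: "events" })   // defragment the collection in place
// or rebuild the data files to actually return disk space to the OS:
db.repairDatabase()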

The advantages of PaaS providers are pretty well known.  They all seem quite similar, although it’s easiest to have confidence in Heroku and Salesforce since they seem the most mature and have broad technology support.

The main challenge in transitioning to PaaS was shaking the mentality that we could run assistive software alongside the website as we did on the dedicated servers.  Most platforms provide some sort of background worker you can leverage, but that means routing data and tasks from the web threads through a 3rd party service or server, which seems unnecessary.

We eventually settled on a large server at SoftLayer running a dozen purpose-specific Redis instances and some middleware rather than background workers.  Heroku doesn’t charge for outbound bandwidth and SoftLayer doesn’t charge for inbound, which neatly avoided the significant bandwidth costs involved.

Switching from .NET to NodeJS

Working with JavaScript on the server side is a mixed experience.  On the one hand the lack of formality and boilerplate is liberating.  On the other hand there’s no New Relic and no compiler errors, which makes everything harder than it needs to be.

There are two main advantages that make NodeJS spectacularly useful for our API.

  1. Background workers in the same thread and memory as the web server
  2. Persistent, shared connections to Redis and MongoDB (etc.)

Background workers

NodeJS has the very useful ability to continue working independently of requests, allowing you to prefetch data and perform other operations that let you terminate a request very early and finish processing it afterwards.

It is particularly advantageous for us to replicate entire MongoDB collections in memory, periodically refreshed, so that entire classes of work have access to current data without having to go to an external database or a local/shared caching layer.

We collectively save 100s – 1000s of database queries per second using this in:

  • Game configuration data on our main api
  • API credentials on our data exporting api
  • GameVars which developers use to store configuration or other data to hotload into their games
  • Leaderboard score tables (excluding scores)

The basic model is:

var cache = {};

module.exports = function(request, response) {
   // Requests are served straight from the in-memory cache; no database call on the request path.
   response.end(cache["x"]);
};

function refresh() {
   // Fetch updated data from the database, store it in the cache object,
   // then schedule the next refresh.
   cache["x"] = "foo";
   setTimeout(refresh, 30000);
}

refresh();

The advantages of this are a single connection (per dyno or instance) to your backend databases instead of one per user, and a very fast local memory cache that always has fresh data.

The caveats are that your dataset must be small, and that this runs on the same thread as everything else, so you need to be conscious of blocking the thread or doing CPU work that is too heavy.
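As a slightly fuller illustration, here is a hedged sketch of the same pattern refreshing from MongoDB using the current node-mongodb-native driver; the connection URL, database and collection names, and the gameconfig key are invented for the example rather than taken from Playtomic's code:

const { MongoClient } = require("mongodb");

const cache = { gameconfig: [] };
// One client per process, reused for every refresh; nothing touches it on the request path.
const client = new MongoClient("mongodb://localhost:27017"); // hypothetical connection URL

async function refresh() {
   try {
      // Re-read the (small) collection and swap the snapshot into memory.
      cache.gameconfig = await client
         .db("playtomic")                    // illustrative database name
         .collection("gameconfiguration")    // illustrative collection name
         .find({})
         .toArray();
   } catch (err) {
      // On failure, keep serving the previous snapshot.
      console.error("cache refresh failed", err);
   }
   setTimeout(refresh, 30000); // refresh every 30 seconds
}

module.exports = function(request, response) {
   // Requests read from memory; they never wait on MongoDB.
   response.end(JSON.stringify(cache.gameconfig));
};

client.connect().then(refresh);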

Persistent connections

The other massive benefit NodeJS offers over .NET for our API is persistent database connections.  The traditional method of connecting in .NET (etc.) is to open your connection and do your operation, after which the connection is returned to a pool to be re-used shortly or expired if it’s no longer needed.

This is very common, and until you reach very high concurrency it will Just Work.  At high concurrency the connection pool can’t re-use connections fast enough, so it creates new connections that your database servers have to scale to handle.

At Playtomic we typically have several hundred thousand concurrent game players sending event data that needs to be pushed back to our Redis instances in a different datacenter.  With .NET that would require a massive volume of connections, which is why we ran MongoDB locally on each of our old dedicated servers.

With NodeJS we have a single connection per dyno/instance which is responsible for pushing all the event data that particular dyno receives.  It lives outside of the request model, something like this:

var redis = require("redis");

// Created once, outside the request handler: one connection per dyno/instance.
var redisclient = redis.createClient(/* … */); // connection details elided in the original

module.exports = function(request, response) {

   var eventdata = "etc";

   // Every request pushes its event onto a shared Redis list over the one connection.
   redisclient.lpush("events", eventdata);

};
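The post doesn't show the receiving end, but a minimal hedged sketch of the pre-processing middleware draining that list might look like the following; the blocking BRPOP loop and the handleEvent function are assumptions for illustration, not Playtomic's actual implementation (it assumes the node-redis v4 client API):

const { createClient } = require("redis");

async function main() {
   const client = createClient({ url: "redis://localhost:6379" }); // illustrative URL
   await client.connect();

   for (;;) {
      // Block until an event arrives, then hand it to the pre-processing step.
      const item = await client.brPop("events", 0);
      if (item) {
         handleEvent(item.element);
      }
   }
}

// Hypothetical placeholder for the aggregation / counting work the middleware
// performs before results are written to MongoDB.
function handleEvent(eventdata) {
   console.log("processing", eventdata);
}

main().catch(console.error);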

The end result

High load:

REQUESTS IN LAST MINUTE

_exceptions: 75 (0.01%)
_failures: 5 (0.00%)
_total: 537,151 (99.99%)
data.custommetric.success: 1,093 (0.20%)
data.levelaveragemetric.success: 2,466 (0.46%)
data.views.success: 105 (0.02%)
events.regular.invalid_or_deleted_game#2: 3,814 (0.71%)
events.regular.success: 527,837 (98.25%)
gamevars.load.success: 1,060 (0.20%)
geoip.lookup.success: 109 (0.02%)
leaderboards.list.success: 457 (0.09%)
leaderboards.save.missing_name_or_source#201: 3 (0.00%)
leaderboards.save.success: 30 (0.01%)
leaderboards.saveandlist.success: 102 (0.02%)
playerlevels.list.success: 62 (0.01%)
playerlevels.load.success: 13 (0.00%)

This data comes from some load monitoring that operates in the background on each instance and pushes counters to Redis, where they’re aggregated and stored in MongoDB.  You can see it in action at https://api.playtomic.com/load.html.

There are a few different classes of requests in that data:

  • Events that check the game configuration from MongoDB, perform a GeoIP lookup (an open-sourced, very fast implementation at https://github.com/benlowry/node-geoip-native), and then push to Redis
  • GameVars, Leaderboards and Player Levels all check the game configuration from MongoDB and then whichever MongoDB database is relevant
  • Data lookups are proxied to a Windows server because of poor NodeJS support for stored procedures
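The per-instance side of that load monitoring is just simple counters; here is a minimal sketch, assuming a per-minute key scheme and metric names like those in the output above (the key layout and expiry are invented for illustration):

const { createClient } = require("redis");

const client = createClient(); // connection details omitted
const ready = client.connect();

// Record one request outcome, e.g. countRequest("events.regular.success").
async function countRequest(metric) {
   await ready;
   const minute = Math.floor(Date.now() / 60000);  // bucket counters by minute
   const key = "load:" + minute + ":" + metric;    // hypothetical key layout
   await client.incr(key);
   await client.expire(key, 600);                  // keep a few minutes of history
}

module.exports = countRequest;

// An aggregator elsewhere would sum the previous minute's keys across all
// instances and store the totals in MongoDB for the load page linked above.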

The result is 100,000s of concurrent users causing spectacularly light Redis load of 500,000 – 700,000 lpushes per minute (which are pulled out on the other end):


 1  [||                                                                                      1.3%]     Tasks: 83; 4 running
 2  [|||||||||||||||||||                                                                    19.0%]     Load average: 1.28 1.20 1.19
 3  [||||||||||                                                                              9.2%]     Uptime: 12 days, 21:48:33
 4  [||||||||||||                                                                           11.8%]
 5  [||||||||||                                                                              9.9%]
 6  [|||||||||||||||||                                                                      17.7%]
 7  [|||||||||||||||                                                                        14.6%]
 8  [|||||||||||||||||||||                                                                  21.6%]
 9  [||||||||||||||||||                                                                     18.2%]
 10 [|                                                                                       0.6%]
 11 [                                                                                        0.0%]
 12 [||||||||||                                                                              9.8%]
 13 [||||||||||                                                                              9.3%]
 14 [||||||                                                                                  4.6%]
 15 [||||||||||||||||                                                                       16.6%]
 16 [|||||||||                                                                               8.0%]
 Mem[|||||||||||||||                                                                 2009/24020MB]
 Swp[                                                                                    0/1023MB]

 PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
12518 redis     20   0 40048  7000   640 S  0.0  0.0  2:21.53  `- /usr/local/bin/redis-server /etc/redis/analytics.conf
12513 redis     20   0 72816 35776   736 S  3.0  0.1  4h06:40  `- /usr/local/bin/redis-server /etc/redis/log7.conf
12508 redis     20   0 72816 35776   736 S  2.0  0.1  4h07:31  `- /usr/local/bin/redis-server /etc/redis/log6.conf
12494 redis     20   0 72816 37824   736 S  1.0  0.2  4h06:08  `- /usr/local/bin/redis-server /etc/redis/log5.conf
12488 redis     20   0 72816 33728   736 S  2.0  0.1  4h09:36  `- /usr/local/bin/redis-server /etc/redis/log4.conf
12481 redis     20   0 72816 35776   736 S  2.0  0.1  4h02:17  `- /usr/local/bin/redis-server /etc/redis/log3.conf
12475 redis     20   0 72816 27588   736 S  2.0  0.1  4h03:07  `- /usr/local/bin/redis-server /etc/redis/log2.conf
12460 redis     20   0 72816 31680   736 S  2.0  0.1  4h10:23  `- /usr/local/bin/redis-server /etc/redis/log1.conf
12440 redis     20   0 72816 33236   736 S  3.0  0.1  4h09:57  `- /usr/local/bin/redis-server /etc/redis/log0.conf
12435 redis     20   0 40048  7044   684 S  0.0  0.0  2:21.71  `- /usr/local/bin/redis-server /etc/redis/redis-servicelog.conf
12429 redis     20   0  395M  115M   736 S 33.0  0.5 60h29:26  `- /usr/local/bin/redis-server /etc/redis/redis-pool.conf
12422 redis     20   0 40048  7096   728 S  0.0  0.0 26:17.38  `- /usr/local/bin/redis-server /etc/redis/redis-load.conf
12409 redis     20   0 40048  6912   560 S  0.0  0.0  2:21.50  `- /usr/local/bin/redis-server /etc/redis/redis-cache.conf

and very light MongoDB load of 1,800 – 2,500 CRUD operations a minute:

insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    2      9      5      2       0       8       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     7k   116   01:11:12
    1      1      5      2       0       6       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     2k     3k   116   01:11:13
    0      3      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     6k   114   01:11:14
    0      5      5      2       0      12       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     5k   113   01:11:15
    1      9      7      2       0      12       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     4k     6k   112   01:11:16
    1     10      6      2       0      15       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     1|0     4k    22k   111   01:11:17
    1      5      6      2       0      11       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k    19k   111   01:11:18
    1      5      5      2       0      14       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     3k   111   01:11:19
    1      2      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     2k   111   01:11:20
    1      7      5      2       0       9       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     2k   111   01:11:21
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    2      9      8      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     5k   111   01:11:22
    3      8      7      2       0       9       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     9k   110   01:11:23
    2      6      6      2       0      10       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     4k   110   01:11:24
    2      8      6      2       0      21       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k    93k   112   01:11:25
    1     10      7      2       3      16       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     4m   112   01:11:26
    3     15      7      2       3      24       0  6.67g  14.8g  1.23g      0      0.2          0       0|0     0|0     6k     1m   115   01:11:27
    1      4      8      2       0      10       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     2m   115   01:11:28
    1      6      7      2       0      14       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     3k   115   01:11:29
    1      3      6      2       0      10       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k   103k   115   01:11:30
    2      3      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k    12k   114   01:11:31
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    0     12      6      2       0       9       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k    31k   113   01:11:32
    2      4      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     9k   111   01:11:33
    2      9      6      2       0       7       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k    21k   111   01:11:34
    0      8      7      2       0      14       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     9k   111   01:11:35
    1      4      7      2       0      11       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     5k   109   01:11:36
    1     15      6      2       0      19       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     5k    11k   111   01:11:37
    2     17      6      2       0      19       1  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     6k   189k   111   01:11:38
    1     13      7      2       0      15       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     1|0     5k    42k   110   01:11:39
    2      7      5      2       0      77       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     2|0    10k    14k   111   01:11:40
    2     10      5      2       0     181       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0    21k    14k   112   01:11:41
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    1     11      5      2       0      12       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     4k    13k   116   01:11:42
    1     11      5      2       1      33       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     3|0     6k     2m   119   01:11:43
    0      9      5      2       0      17       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     1|0     5k    42k   121   01:11:44
    1      8      7      2       0      25       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     6k    24k   125   01:11:45


Reader Comments (11)

Redis and MongoDB hmmm....never thought of that. Great Article

October 15, 2012 | Unregistered Commenter@mojo706

That's a great writeup, and thank you for sharing.

So you switched three things: architecture, deployment/ops model, and runtime.
The juicy headline refers to the third, which is probably the least significant part of the change, and the one that probably had the highest cost (as it required a complete rewrite).

don't get me wrong - I love me some node.js, and I am a big fan of Heroku, however all the caveats pointed out in the .NET impl are easily (or not-so, but still easier than a complete rewrite) addressable. so you *might* have been able to achieve similar results by optimizing the existing codebase (instead of a rewrite) and moving to an easy-to-deploy model with AppHarbor, Azure, (or running Mono on Heroku). And you'd still have New Relic :)

would love to hear your take on the cost-benefit balance of a complete rewrite

October 15, 2012 | Unregistered CommenterKen Egozi

Good write up! We used to do something similar. When our collections got too large, the Node GC would pause for extended periods of time, and at a point would end up crashing with a JS Out of Memory Error (around 2 Gigs). We ended up creating a NodeJS C++ plugin to store the data outside of the V8 heap, which let us scale the memory-resident cache to well over 10+ Gigs.

October 15, 2012 | Unregistered Commenterdevdazed

I don't know much about Node.JS, but devdazed's comment raises the point I can't get around. On what planet is the GC built into NodeJS better than what's in the JVM (or .NET, for that matter)?

Can't argue with success, I guess -- this is working for them. But it's difficult to believe that the same thing couldn't be achieved easily in the JVM or .NET environments, where you aren't going to have GC-related gotchas waiting out there. Or at least the ones you find will be well-known.

October 15, 2012 | Unregistered CommenterRoss Judson

That's a good read but I don't get some of the points:
1. Background Workers. ASP.NET can do that easily as well by creating a new thread. The only thing that should be watched is not throwing unhandled exceptions in that thread.
2. Persistent Connections. Why not just have a static MongoServer instance, which is a thread-safe class? The MongoCollection class is also thread-safe, so we can even have a static instance of the collection.

October 16, 2012 | Unregistered CommenterDmitry

What on earth made you replace .NET with Node.js? ASP.NET supports asynchrony to the fullest. You also take a *heavy* (like 3-10x) perf hit by switching to node from .NET. If your app got faster after switching to node.js you did something wrong on the .net side in the first place.

You are also free to handle your DB connections however you want. You could have kept one open per thread for example with [ThreadStatic] or some other mechanism.

October 16, 2012 | Unregistered Commentertobi

@Dmitry:

Not that I'm an expert in this field (nor that I think Node is a panacea) but maybe I'll clarify things a bit for you by saying that:

1. I wouldn't rely on ASP.NET for executing background tasks since the implementation is often very fragile. ASP.NET simply wasn't designed for this and I guess you need a very, very good reason if you want to implement background workers via threading API within ASP.NET application context. I'd rather go for WCF service hosted as a Windows service which is much more reliable.

2. That's true, having a static MongoServer/MongoDatabase instance would make you have a persistent connection for a lifetime of an app domain. The last part is important since ASP.NET application can be restarted for many reasons (scheduled app pool recycle, web.config or application folder changes, etc.). And I guess Node is just more reliable in that regard.

But in general, I agree with others that a complete rewrite (and a move to Node) was probably unnecessary (but again, I think they knew what they were doing).

October 17, 2012 | Unregistered CommenterPavel

What's with all the node.js articles lately?

Yes node.js is fast but it lacks so many basic functions that by the time you rewrite all the modules the "magical" speed advantage won't be there, plus the entire code base will become a bunch of callbacks that keep on calling itself, making simple changes requires you to analyze the entire sequence which feels like threading a needle.

Node.js is a headache.

October 18, 2012 | Unregistered CommenterAlex Yam

We are using StatsD + Graphite with our enterprise Node app. With this you can easily achieve what you can with New Relic, for free.

November 5, 2012 | Unregistered CommenterGreg Tczap

Wow, these guys really like to over-engineer things! Using Boost in the API client (and forcing Android users to compile with a custom SDK; it's even more pain for Marmalade SDK users).

November 18, 2012 | Unregistered CommenterAlex D
