Facebook Uses Non-Stored Procedures to Update Social Graphs

Facebook's Ryan Mack gave a MySQL Tech Talk where he talked about using what he called Non-stored Procedures for adding edges to Facebook's social graph. The question is: how can edges quickly be added to the social graph? The answer is ultimately one of deciding where logic should be executed, especially when locks are kept open during network hops.
Ryan explained a key element of the Facebook data model are the connections between people, things they've liked, and places they've checked-in. A lot of their writes are adding edges to the social graph.
Currently this is a two step process, run inside a transaction:
- add a new edge into the graph
- if the add was successful then increment the number of edges on a node
This approach works until there's a very hot node that is being added to rapidly. For example, a popular game adds a new character and everyone likes it at the same time or a new album comes out and everyone likes it at the same time.
They were limited to the rate they could add edges by a row lock taken to increment the edge count. The PHP code looked something like...
mysql_query("START TRANSACTION");
mysql_query("INSERT INTO graph...");
if (mysql_affected_rows() == 1)
mysql_query("INSERT INTO counts ...");
// Now a lock is taken until the commit
mysql_query("COMMIT");
Each of these queries is being sent over the network. So the row is locked until the commit happens, which means they are bounded by network latency. At peak they are limited to 600 adds per second, which increases linearly with the amount of network latency. When they were doing updates across country the latency was 100msec and they could only get 10 writes per second.
The traditional approach around this problem is to use stored procedures, which sends one command to the database and the database executes the same sort of logic as above inside the database server. No extra network latency. A trigger can be used to the same effect. This approach more than doubles the rate edges can be added, but unfortunately it causes problems with operations.
Operations Work Flow Matters
Every time a developer makes a stored procedure or a trigger change, a DB-Op must be asked to update the schema. The implications at Facebook for this request are hard to imagine. At X,000 databases * 20 shards each * 50 edge tables each for each shard, that's X million stored procedures to update. That's a lot of updating, especially during test and development. This won't work. Your shop may not be quite as large (LOL), but there are probably similar operations implications.
Use a Multiple Statement Query as a Non-Stored Procedure
The solution was to use a feature of the MySQL API called a multiple statement query, which acts like a run-time stored procedure. Basically, the application appends all the lines of the previous program together into one query which is sent to the server. All the logic runs on the server side, with no intermediate network hops.
With this approach:
- Performance is happy because performance is on par with the previous approaches.
- Ops is happy because they don't have to continually update millions of stored procedures.
- Development is happy because they can change things on the fly.
Where Should Application Logic Run?
Now let's consider this problem in the broader context of deciding where application logic should run.
Fine grain in the app with pessimistic locking. Logic runs in the application and it modifies data remotely and incrementally using network operations. There were obvious issues in the scenario when logic, remoteness, and locks are combined. However, the use of pessimistic locking via transactions made it simple to update multiple tables without a lot of programmer overhead.
Stored procedures. The trigger and stored procedure approach localizes locks and logic and removes the latency limit on operations per second. The downside is the the stored procedure language is very primitive and there's a split brain between logic in the application and logic in the database.
Non-stored procedures. Recall that this is the solution arrived upon by Facebook. The application writer gets to write code in the application, not the database, but it's still written in a primitive language and is completely outside the application framework.
Key-value with optimistic locking. An application gets a record, modifies it using a preferred application language, and writes the value back. The write is transactional in the sense that it's all or nothing, but dealing with updates in an environment with many distributed writers is difficult. Imagine a bunch of different clients wanting to add/delete items in a shared list.
How do we prevent conflicting writes? One approach is an optimistic concurrency mechanism like memcache's CAS (Check And Set operator), for scenarios where reads dominate writes. When you do a get you get also get a version number of the value. The version number is incremented on every set. When you write you include your original version number, if the version number hasn't changed then your value is written. Otherwise you have to read the value again, merge the values, and rewrite, whereupon the whole process could happen again. This why CAS works best for read dominated workloads. If you are a slow writer and there are lots of other writers then you could be live locked out of ever getting your value written. A slow node won't be able to compete with faster nodes on these operations. There's also a lot of pressure on the application to not mess up the merge. Also notice there's no concept of a transaction across multiple keys, it's only one at a time. So the example of adding and edge and updating a count could suffer from partial failures.
Key-value with handy built-in operations. Redis solves the problem of multiple list editors by natively implementing list operations. Clients are freed from the read-edit-write cycle. You tell Redis to add an item to a list. This provides server side transactions across multiple operations, just like stored procedures. You are, however, limited only to built-in operations.
Key-value with custom operations. Like Redis but with your own operations. Download a custom "stored procedure" written in a language like JavaScript that makes direct database calls. This gives you a little more convenient development, plus the potential to have transactions across multiple records. The application is still split into parts, some in the application and some in the database.
Key-function database. Behavior and state are collocated, not in a tightly bound object sense, but something like the S4 model with Node.js for implementing application logic and event responses. Functions can be assigned to data and invoked dynamically. In addition functions can be coded in the application and sent to execute remotely on the data.
Reader Comments (11)
Facebook's answer is pragmatic, but wrong. Their real problem is that they aren't able to update database code (e.g., PROCs) with the same agility as UI code. There may be X,000 databases to be updated when a stored procedure changes, but there are also Y,000 webservers and/or application servers to be updated when a UI program changes. Somehow, that's not a problem for them.
So instead of fixing their operational problem and making it easier to deploy future changes to their databases, they coded around it for this one and moved some data-specific logic ("edge" counting) into the application. It gets the job done, but not very well.
"Non-stored Procedures" LOL. That's just a Sql batch and it's been used for DECADES because it saves network round trips. Nothing new here. If the database is worth its salt it will also turn it into a parameterized stored procedure to speed things up a bit more.
"the stored procedure language is very primitive"... Yeah, right... Ever heard of PL/SQL? I wouldn't dare call this SQLized subset of Ada "primitive". And it runs not only in Oracle, recent versions of DB2 UDB support it, too. And PL/PgSQL is very similar...
Non-stored procedures are for non-databases, which MySQL is still one of. Now that Oracle owns it, maybe things will change to the better...
@Ross: code on the database has been a pain the a** to update, update at a large scale, test, etc for decades. We are still waiting for a real mainstream solution. FB has so specific needs/constraints, there is no surprise they play a different game. FB's strategy may be wrong from a conceptual view, but from teams who want the work to be done's point of view, it is the least wrong.
And it runs not only in Oracle, recent versions of DB2 UDB support it, too. And PL/PgSQL is very similar... . The site Cheap Skyland Avenger Breitling Watches
is also good,have a view please!!! "
The list of options is missing Dynamo vector clock style multiple versions.
Really nice, I did similar stuff 12 years ago. Delphi in combination with Oracle. Such non-stored procedures are called anonomous blocks in Oracle.
The point about their "non-stored procedures" - or using query batches from the client instead of a server-side stored procedure, is that at scale, an application needs to be able to be able to be upgraded gradually. Using a server-side stored procedure basically forces you to upgrade the whole infrastructure at the same time, which is clearly impossible for someone like Facebook.
@Mark R: Absolutely true. But exactly the same problem occurs when using their batched statements solution. There are still copies of the statements scattered across thousands of servers, all of which need to be upgraded. The obviously implied difference is that the Facebook deployment team has a combination of processes and technologies that allow them to upgrade the application servers painlessly, and that they don't have that for their databases. So they did what they had to do. And they'll do it again, and again, and again. Which is a shame, but it must make business sense to Facebook.
@Mark R: Enter Edition-based Redefinition, a new Oracle 11g Release 2 feature targeted exactly at this kind of challenge. In a few words, the database becomes multi-versioned not only in its concurrency model, but in application versions installed into it (Oracle calls them editions.) You install new edition (which includes changed DDL, new code, cross-edition transition stuff like views and triggers that enable both editions to run simultaneously,) test it and then eventually just flip the switch and all new sessions automatically use new edition. All of this live, without any service interruption. And if your testing didn't catch some serious issue before you went live, you can easily flip the switch back to previous stable edition and continue as if nothing really happened (or just flashback the database if the issue is really that serious...) Sounds like magic, but it really works, especially for simple cases like updating the stored procedures in live database (try to do that on a loaded pre-11.2 database and you will immediately find that it's close to impossible without application downtime because code currently being in use can't be changed and it's always in use...)
I wish I saw an argument here against stored procedures that I could use.