IT Hare: Ultimate DB Heresy: Single Modifying DB Connection. Part I. Performanc
Tuesday, November 22, 2016 at 8:56AM
HighScalability Team

Sergey Ignatchenko continues his excellent book series with a new chapter on databases. This is a guest repost

The idea of single-write-connection is used extensively in the post, as it's defined elsewhere I asked Sergey for a definition so the article would make a little more sense...

As for single-write-connection - I mean that there is just one app (named "DB Server" in the article) having a single DB connection to the database which is allowed to issue modifying statements (UPDATEs/INSERTs/DELETEs). This allows to achieve several important simplifications - first of all, all fundamentally non-testable concurrency issues (such as missing SELECT FOR UPDATE and deadlocks) are eliminated entirely, second - the whole thing becomes deterministic (which is a significant help to figure out bugs - even simple text logging has been seen to make the system quite debuggable, including post-mortem), and last but not least - this monopoly on updates can be used in quite creative ways to improve performance (in particular, to keep always-coherent app-level cache which can be like 100x-1000x more efficient than going to DB).

After we finished with all the preliminaries, we can now get to the interesting part – implementing our transactional DB and DB Server. We already mentioned implementing DB Server briefly in Chapter VII, but now we need much more detailed discussion on this all-important topic.

Hare pointing out:Transactional / operational DB is a place where all the automated decisions are made about your game (stock exchange, bank, etc.)First of all, let’s re-iterate what we’re speaking about. Transactional/operational DB is a place where all the automated decisions are made about your game (stock exchange, bank, etc.).

It stores things such as player accounts, with all their persistent attributes etc. etc.; it also stores communications related to payment processing, and so on, and so forth. And “DB Server” is our app handling access to DBMS (as noted in Chapter VII, I am firmly against having SQL statements issued directly by your Game Servers/Game Logic, so an intermediary such as DB Server is necessary).

As discussed above, ACID properties tend to be extremely important for transactional/operational DB. We don’t want money – or that artifact which is sold for real $20K on eBay – to be lost or duplicated. For this and some other reasons, we’ll be speaking about SQL databases for our transactional/operational DB (while it is possible to use NoSQL for transactional/operational DB – achieving strict guarantees is usually difficult, in particular because of lack of multi-object ACID transactions in most of NoSQL DBs out there, see discussion in [[TODO]] section above).

And now, we’re finally ready to start discussing interesting things.

Multi-Connection DB Access

As it was mentioned in Chapter VII, there are two very different approaches to organizing the access to your DB: multi-connection one and single-write-connection one. Let’s start withmuch more usual (and IMO quite deficient) approach of having multiple database connections to your DB.

Multiple DB connections are so common out there, that very few people will even think about considering anything different. The idea is on the surface – we have multiple connection to the DBMS, and throw all the requests we may have, at the database, and it will handle these requests for us. More importantly, scalability is not expected to be our problem – scaling is presented as a mere problem of buying more hardware (though this is not really the case, see below).

In spite of the multi-connection DB access being pretty much ubiquitous, it needs to be noted that it has quite a few shortcomings:

As you can see – I’m not a big fan of the multi-connection approach for your OLTP database (pretty much for the same reason why I’m not a fan of massive multi-threading, though for DBs these problems are admittedly a tad milder than for multi-threading). Still, I admit that you CAN write OLTP with multi-connections.

That is, as long as you have that OLTP-DB-guru with real-world experience with millions-writing-transactions-per-day DBs.

 

Without such a person (and no, 20-year experience with multi-terabyte mostly-read DB is not a substitute) – multi-connection DB access for your OLTP database is very risky at the very least. To make things worse (and for reasons described above) – your DB Server will pretend to work while in testing and in “beta”, but after deployment and after reaching certain level of load – it can easily become a never-ending story of lost money/artifacts, player complaints etc. etc. etc. 


1 Depending on DBMS, there can be implicit locks such as locking next row when performing an index scan. If this is the case – may Codd have mercy upon your soul  , as there is no practical way to avoid deadlocks entirely.

2 and yes, I observed it long enough to see more than one RDMBS crash 

 

On TP Monitors

Surprised hare:if we try to run all those outstanding requests exactly in parallel – we’ll have a severe performance degradation due to excessive thread context switches and fighting for resources.

 One practical note in case if you still want to go for multi-connection DB access: more likely than not, sooner or later you’ll need to run a “transaction processing monitor” a.k.a. TP Monitor. The idea behind TP monitor is simple – if we try to run all those outstanding requests exactly in parallel – we’ll have a severe performance degradation due to excessive thread context switches and fighting for resources. In other words (putting it very roughly) – TP monitor makes sure thatonly a few requests are executed against DB at any given time (in practice, number of requests to be run in parallel, is related to hardware where your DB is running, such as number of disks and/or number of CPU cores).

These days, two most popular TP monitors out there are Microsoft COM+ and Oracle Tuxedo (formerly BEA Tuxedo).

On the other hand, being a big fan of Reactors, I usually prefer to build my own kinda-TP-monitor from them. This approach is discussed in more detail in Chapter VII, but very shortly it goes along the lines of having several DB Server Working Reactors (with one connection to DB each), and one DB Server Proxy Reactor, which receives all the requests from the rest of the system, and forwards them to a “free” (least loaded) DB Server Working Reactor. Overall, there is no rocket science involved in TP Monitors, and most of the time I feel that DIY is quite appropriate here (though as always, YMMV).

The Ultimate DB Heresy – Single-Write-Connection DB Access

Ok, now as I’ve sufficiently criticized the multi-DB-connection approach ;-), it is time to speak about the single-write-connection one. In original form it is very simple – we have one DB Server app, which keeps a single DB connection; this DB Server app receives incoming requests, chooses relevant prepared SQL statement, binds its parameters and issues an API call to execute this SQL. On receiving the reply, it packages obtained result to form the reply, sends the reply back to requestor, and waits for the next incoming request.

One important thing to be noted here is that we’re speaking about single WRITE connection; in parallel to this single write connection, it is perfectly viable to run any number of read-only connections. On the other hand, for performance reasons (and to avoid locking) you’ll want to use the lowest transaction isolation level provided by your RDBMS; in other words, if you’re using RDBMS with a lock-based concurrency (see discussion in [[TODO]] section above), these parallel read-only connections may need to use Read Uncommitted transaction isolation level; for MVCC-based RDBMS (with a notable exception being MySQL+InnoDB) it will usually be Read Committed. This has quite a few implications, but for requests about historical data – and 99% of all the reporting falls under this category – both these isolation levels will work perfectly fine.

As we can see, with single-write-DB-connection there is no concurrency at all – at every given moment there is one and only one SQL statement executed. It means that all the strange problems with isolation levels mentioned above, go away by themselves with absolutely zero efforts from our side (and regardless of transaction isolation level used(!)).

Hare asking question:Of course, this simplicity comes at a price – and this price is an apparent lack of scalability.

 Of course, this simplicity comes at a price – and this price is an apparent lack of scalability. In fact, this apparent lack of scalability is a Big Fat Reason why 99% of people out there consider single-write-connection stuff for OLTP DBs as an ultimate DB heresy (i.e. you will be told “it cannot possibly work”). Still, I’ve seen heavily loaded games (and stock exchanges too) working exactly this way, more than once. Moreover, at some point, one of such DBs was referred to as “the most loaded DB/2 instance on Windows we know” – by no less than people from IBM Toronto Labs (this is where DB/2 UDB was developed at the time – and to the best of my knowledge, is still developed). So, in real world things don’t look that grim for single-write-connection DBs; let’s see the reasons why this gap between theory (which tells “it cannot possibly work”) and practice (which proves that this statement is wrong by demonstrating a counter-case ;-)) exists.

Single-Write-DB-Connection: per-connection Performance

First, let’s start with discussing performance of single-write-DB-connection architectures. While, strictly speaking, no kind of performance can be used as a substitute for scalability, it certainly can affect the point when we need to start thinking about scalability.

App-Level Cache – a BIG help performance-wise

As soon as we have exactly one DB connection which can modify DB3 – we can add an app-level cache, and we can make this app-level cache perfectly coherent with DB. This becomes possible because with single-write-connection DB, we have 100% of the information about all the DB modifications – simply because we’re the only ones who is causing these modifications  .

Hare thumb up:In practice, I’ve seen these app-level caches to help a LOT with performance.

 In practice, I’ve seen these app-level caches to help a LOT with performance; I’ve seen the overall performance improvements from using app-level caches of the order of 5x-10x (!). This result is not that surprising if we take a look at the processes involved in both cases. If we’re using app-level cache (the most popular such cache will be a cache of PLAYERS table), then all we need to do for getting necessary player data (which is needed all the time and then more) – is to calculate hash of the player ID, and then to get the Player in-memory structure by this hash. Overall, we’re speaking about 100-200 CPU clocks (or of the order of 0.1μs).

On the other hand, if going for the same thing to DBMS – we need to (a) bind a prepared statement, (b) to issue an API call, (c) the API call will marshal our data, then (d) it will go to a different process over some IPC (most likely – with usermode-kernelmode-usermode transition on the way, and at least one thread context switch), there our request will be (e) unmarshaled, (f) an execution plan which corresponds to the prepared statement will be found, then (g) execution plan will be executed – getting and parsing (!) several index pages, and at least one data page, then (h) data page will be parsed, (i) data of our user will be retrieved, (j) marshaled, (k) sent back (causing once again usermode-kernelmode-usermode transition and another thread context switch) – then it will be (l) unmarshaled, and (m) delivered to our app. As a result, there should be no surprise that going to DB takes MUCH longer than checking things in-app; in practice, for DB access we’re usually speaking about 10-100μs or so – a 100x-1000x difference from a search in app-level cache. The reason why it doesn’t speed things up more than 5x-10x I’ve mentioned earlier, is because there are other things which need to be done via DB anyway (in particular, DB transactions do need to go through DB synchronously as long as we’re going to provide durability – though see below on kinda-write-back caching).

Still, even 5x-10x of overall improvement which can be observed in practice, is a Very Good Number :-). It should be noted, though, that app-level cache is not the first thing you should do to optimize your DB Server; we’re discussing app-level cache here more as a concept to understand and justify single-write-DB-connection approach as a valid temporary step before full-scale scalability will need to be reached. The whole process of optimizing your DB (including indexes, physical DB layout and RAID levels, denormalisation, and app-level caching) will be discussed in more detail in Vol.3, tentatively Chapter [[TODO]].


3 additional read-only connections are ok, we’ll discuss them a bit later

 

Kinda-write-back App-level Cache with Durability

Wtf hare:It is further possible to make this app-level cache a sorta-write-back cache – and providing 100% correct ACID-style Durability too.

 BTW, this 5x improvement mentioned above, is for write-through app-level cache. It is further possible to make this app-level cache a sorta-write-back cache – and providing 100% correct ACID-style Durability too. This can provide additional performance benefit (however, I didn’t try it myself, so I cannot tell how much additional benefit can be obtained this way).

The idea in this case goes along the following lines:

This whole thing provides strict Durability, because we don’t release replies until appropriate transactions are committed and Durable at the DB level (in a sense, Delayed Replies are just “tentative replies”). On the other hand, in this processing model we’re saving a LOT on committing transactions – and the cost of commit is very high latency-wise (see discussion below). As noted above – I haven’t tried this particular kinda-write-back-cache model in real world, so I cannot provide an estimate of “how much it may help”; on the other hand, I’d make a not-so-uneducated guess that (a) there is a reasonable chance of saving 1.5x-2x, and (b) that most likely, there isn’t much sense in combining more than 5-10 requests into one Larger Transaction.

[[TODO: DIY fault tolerance]]

Performance Caveat: Latencies and More Latencies

Hare with hopeless face:single-write-connection DB configurations are very sensitive to latencies.

 When speaking about (IMO Really Good ;-)) performance of single-write-connection DBs, it would be unfair not to mention one important caveat of such configurations: single-write-connection DB configurations are very sensitive to latencies.

Latencies we’ll be talking about, are two-fold: communication latency and “DB log flush()/sync() latency”. The first one is simple – it is a latency of communication between our DB Server app and DBMS; dealing with it is also trivial – let’s just put our DB Server app onto the same hardware box as DBMS, play a bit with connectivity options to find the best one, and we’re done.

The second latency (“DB log flush()/sync()” one) requires more elaborate explanation. To understand it, we’ll need to describe how your usual production-ready RDBMS works with transactions:

As a result, when working with high-performance single-write-connection DBs, we should aim for in-server HDDs/SSDs (or direct-attached SCSI/SATA storage), and a BBWC RAID (or NVMe) for our server hardware (software RAID won’t do!); fortunately, it is rarely a problem (BBWC RAID cards are readily available from all major server manufacturers, and are often – though not universally – available from hosting ISPs too). With cost of BBWC RAID cards being of the order $1K (that’s a price from major server manufacturers), and you need it only for a very few DB Servers, they’re not going to eat too much out of your budget.


4 usually, these modifications aren’t written to disk until the transaction is committed

 

Single-Write-DB-Connection: Real-World Performance

Ok, all this theorizing is good – but what about real-world numbers? I have a few of them up my sleeve to share ;-).

I’ve seen a few real-world systems (with hundreds of different real-world OLTP transactions happening, most of transactions were modifying multiple rows, adding even more rows for audit trails, etc. etc. etc.) based on single-write-DB-connections.

One of such real-world systems was consistently processing over 30M real-world write transactions/day over one single DB connection (and without an in-memory DB), supporting ~100K simultaneous players

(that is, after optimizing DB and adding app-level cache for USERS table). Average time per transaction was in the range of 800 μs (this would allow to process 86400 seconds/day*1000 milliseconds/second /0.8 milliseconds/write transaction ~= 100M write transactions / day if load is even, but due to intra-day load variations and not-100%-load during peak time, it was only 30-50M write DB transactions per day in practice). Number of different SQL statements was in hundreds, and average number of rows modified and/or added per ACID transaction, was (very roughly) around 10 (these included rather complicated inter-player interactions, and various audits). In short – it was a kind of system you can expect to run yourself (opposed to artificial tests such as TPC-C which are not exactly representative of real-world systems); in fact, it was serving a game with hundreds of thousands of simultaneous players – and BTW was by far the most stable among direct competition too.

And knowing how it was implemented and the data involved – I feel that there was nothing too special about the data they processed. In other words –

I’m pretty sure that your OLTP DB can achieve the same order of magnitude of performance on a single-write-DB-connection.

Of course, YMMV (let’s say, your results can easily differ by half an order of magnitude5), and of course, achieving this kind of performance will take a lot of effort (including both DB-level optimizations and app-level cache) – but I’m confident that for most of OLTP DBs out there it is perfectly doable.


5 of course, we’re speaking about one standard deviation being half an order of magnitude, so confidence level to fit within this range is mere 68% or so 

 

Single-Write-DB Connection: Scalability

Inquisitive hare:Of course, Performance (even pretty good one as mentioned above) is NOT a substitute for Scalability

 Of course, as noted above, Performance (even pretty good one as mentioned above) is NOT a substitute for Scalability. In other words – at some point, even 30M DB transactions/day won’t be enough. On the other hand, there is a way (which has been used in real-world too) to implement Scalable system based on multiple Single-Write-DB-Connection(s) in a share-nothing fashion; in a sense – it can be seen as a rather close cousin of three microservices-related patterns: Database-Per-Service pattern (see [Fowler] and[Richardson.DatabasePerService]), accompanied by Event-Driven Architecture (see [Richardson.EventDrivenArchitecture]) and Application Publishing Events (see [Richardson.ApplicationEvents]). Moreover, there is a way (also shown to be successful on a serious real-world system) to migrate from simplistic single-write-DB-connection architecture to this Share-Nothing lots-of-single-write-DB-connections one gradually. We’ll discuss all of it in a jiff :-).

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.