Migrating to XtraDB Cluster in EC2 at PagerDuty
Monday, June 16, 2014 at 9:01AM
HighScalability Team

This is a guest post by Doug Barth, a software generalist who currently finds himself doing operations work at PagerDuty. Prior to joining PagerDuty, he worked for Signal in Chicago and for Orbitz, an online travel company.

A little over six months ago, PagerDuty switched its production MySQL database to XtraDB Cluster running inside EC2. Here's the story of why and how we made the change.

How the Database Looked Before

PagerDuty's MySQL database was a fairly typical deployment. We had:

- A primary/secondary pair of MySQL servers, with the data replicated between the two hosts by DRBD and manual failover from one to the other
- Binary logs stored on a separate, non-DRBD-backed disk, with sync_binlog disabled for performance
- A number of asynchronous replication slaves used for disaster recovery, backups, and similar purposes

Problems With the Old Setup

Our database setup served us well for several years, but the failover architecture it provided didn't meet our reliability goals. Plus, changing hosts was a hassle: to flip from one DRBD host to the other, we had to stop MySQL on the primary, unmount the DRBD volume, change the secondary to primary status, mount the volume on the secondary, and start MySQL. That procedure required downtime - and once MySQL was up and running on the new server, we had a cold buffer pool that needed to warm up before the application's performance returned to normal.
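
For concreteness, a manual flip looks roughly like this (a sketch only; the resource name r0, the device path, and the service names are placeholders rather than our actual configuration):

    # On the current primary:
    service mysql stop               # stop MySQL so the filesystem can be released
    umount /var/lib/mysql            # unmount the DRBD-backed volume
    drbdadm secondary r0             # demote this host's DRBD resource

    # On the host being promoted:
    drbdadm primary r0               # promote the DRBD resource
    mount /dev/drbd0 /var/lib/mysql  # mount the replicated volume
    service mysql start              # start MySQL with a cold buffer pool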

We tried using Percona's buffer-pool-restore feature to shorten the downtime window, but our buffer pool was prohibitively large: restoring the saved buffer pool pages consumed more system resources than simply serving incoming traffic slowly while the pool warmed up on its own.
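
For reference, the buffer pool save/restore mechanism is driven by a couple of server settings; a minimal my.cnf sketch using the variable names from MySQL/Percona Server 5.6 and later (the exact variables depend on the server version in use):

    [mysqld]
    # Write out the list of buffer pool pages at shutdown...
    innodb_buffer_pool_dump_at_shutdown = ON
    # ...and read those pages back in at startup to warm the cache.
    innodb_buffer_pool_load_at_startup  = ON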

Another issue: our async slaves became unrecoverable if an unplanned flip occurred. We were storing the binlogs on a separate, non-DRBD-backed disk and had disabled sync_binlog (due to the performance penalty it introduced). Since the binlogs weren't replicated to the standby, an unplanned flip left the slaves with no consistent position to resume replication from, which meant we had to restore our async slaves from backup afterwards.
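
Sketched as a my.cnf fragment, the relevant pieces of that configuration looked something like this (paths are illustrative):

    [mysqld]
    # Binlogs lived on a separate disk that was not replicated by DRBD.
    log_bin     = /mnt/binlogs/mysql-bin
    # Don't fsync the binlog on every commit; faster, but the binlog can
    # lose events relative to InnoDB after a crash or unplanned flip.
    sync_binlog = 0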

What We Liked About XtraDB Cluster

There were a few things about XtraDB Cluster that stood out.

What We Did to Prepare

Getting our application ready for XtraDB Cluster did involve a few new constraints. Some were simple MySQL tweaks, mostly hidden from the application logic, while others were more fundamental changes.

On the MySQL side:
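
As a rough illustration of the MySQL-side requirements a Galera-based cluster such as XtraDB Cluster imposes (the values shown are generic examples, not our exact configuration):

    [mysqld]
    # Galera replicates row-based events only.
    binlog_format            = ROW
    # Only InnoDB tables are replicated, and every table needs a primary key.
    default_storage_engine   = InnoDB
    # Interleaved auto-increment locking, required for multi-writer operation.
    innodb_autoinc_lock_mode = 2

    # Cluster wiring (example addresses).
    wsrep_provider        = /usr/lib/libgalera_smm.so
    wsrep_cluster_address = gcomm://10.0.0.1,10.0.0.2,10.0.0.3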

Besides these MySQL configuration changes, which we were able to test out in isolation on the DRBD server side, the application logic needed to change:
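
For illustration, one common application-level change with Galera-based clusters is handling transactions that fail certification: when more than one node accepts writes, a conflicting transaction is rolled back and surfaces to the client as a deadlock error, so the caller must be able to retry. A minimal sketch of that retry pattern in Python, with pymysql standing in as an example client:

    import pymysql

    DEADLOCK_ERRNO = 1213  # ER_LOCK_DEADLOCK; also returned on certification failure

    def run_with_retry(conn, statements, max_attempts=3):
        """Run a list of (sql, params) statements in one transaction, retrying on deadlock."""
        for attempt in range(1, max_attempts + 1):
            try:
                with conn.cursor() as cur:
                    for sql, params in statements:
                        cur.execute(sql, params)
                conn.commit()
                return
            except pymysql.MySQLError as exc:
                conn.rollback()
                # Re-raise anything that isn't a deadlock, or a deadlock on the last try.
                if exc.args[0] != DEADLOCK_ERRNO or attempt == max_attempts:
                    raise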

How We Handled Schema Changes

Schema changes are especially impactful in XtraDB Cluster. There are two options for controlling how schema changes are applied in the cluster: total order isolation (TOI) and rolling schema upgrade (RSU).

RSU allows you to upgrade each node individually: the node desyncs from the cluster while the DDL statement is running, then rejoins once it finishes. But this approach can introduce instability - and RSU wouldn't have avoided the operational problems of schema changes on large tables, since the desynced node buffers writesets in memory until the DDL statement completes.

TOI, by contrast, applies schema upgrades simultaneously on all cluster nodes, blocking the cluster until the change completes. We decided to use TOI with Percona's online schema change tool (pt-online-schema-change): the tool keeps the DDL statements it issues short, so the cluster is only ever blocked briefly.
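
In practice that looks something like the following (a sketch; the database, table, and column names are invented, and wsrep_OSU_method is shown only to make the TOI setting explicit, since TOI is already the default):

    # DDL under TOI replicates to every node in total order.
    mysql -e "SET GLOBAL wsrep_OSU_method = 'TOI'"

    # pt-online-schema-change keeps its own DDL short: it copies rows into a
    # shadow table in small chunks and finishes with a quick atomic RENAME.
    pt-online-schema-change \
        --alter "ADD COLUMN acknowledged_at DATETIME NULL" \
        --execute \
        D=pagerduty,t=incidents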

The Migration Process

Having established the constraints that XtraDB Cluster would introduce, we started the rollout process.

First, we stood up a cluster in production as a slave of the existing DRBD database. By having that cluster run in production and receive all write traffic, we could start seeing how it behaved under real production load. We also set up metrics collection and dashboards to keep an eye on the cluster.
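
Much of the cluster-specific monitoring comes down to graphing Galera's wsrep status counters; for example (the counters called out below are a reasonable starting set, not an exhaustive list of what we tracked):

    -- Overall view of Galera state on a node.
    SHOW GLOBAL STATUS LIKE 'wsrep_%';

    -- A few counters worth graphing:
    --   wsrep_cluster_size          number of nodes currently in the cluster
    --   wsrep_local_recv_queue      writesets waiting to be applied on this node
    --   wsrep_flow_control_paused   fraction of time replication was paused by flow control
    --   wsrep_local_state_comment   Synced / Donor / Joining, etc.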

In parallel to that effort, we spent some time running load tests against a test cluster to quantify how it performed relative to the existing DRBD setup.
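
As an illustration of what such a load test can look like, here is a hypothetical sysbench OLTP run (sysbench is only an example tool, and the hostnames, credentials, and sizes are placeholders):

    # Populate a test table, then drive read/write OLTP traffic at the cluster.
    sysbench --test=oltp --mysql-host=cluster-test-1 --mysql-user=bench \
             --mysql-password=secret --oltp-table-size=10000000 prepare
    sysbench --test=oltp --mysql-host=cluster-test-1 --mysql-user=bench \
             --mysql-password=secret --oltp-table-size=10000000 \
             --num-threads=32 --max-time=600 --max-requests=0 run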

Running those benchmarks led us to discover that a few MySQL configs had to be tweaked to get the same level of performance that we’d been enjoying:
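
As a general illustration of the kind of settings that tend to matter for Galera write performance (these are typical examples, not necessarily the exact tweaks we made):

    [mysqld]
    # Apply replicated writesets with several threads instead of one.
    wsrep_slave_threads            = 16
    # Relax redo-log flushing on commit; the other cluster nodes provide redundancy.
    innodb_flush_log_at_trx_commit = 0
    # A larger gcache lets a briefly-offline node rejoin via IST instead of a full SST.
    wsrep_provider_options         = "gcache.size=2G"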

After satisfying ourselves that XtraDB Cluster would be able to handle our production load, we started the process of moving it into production use.

Switching all of our test environments to a clustered setup (and running load and failure tests against them) came first. In the event of a cluster hang, we needed a way to quickly reconfigure our system to fall back to a one-node cluster. We scripted that procedure and tested it while under load.
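
The heart of such a fallback script is forcing one healthy node to form a new primary component by itself; a sketch of that step (which node to keep, and how traffic gets repointed, is of course situation-specific):

    # On the node we want to keep, force it to become a primary component of one:
    mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=true'"

    # Point the application's database endpoint at that node only, and keep the
    # other nodes stopped until the cluster can be rebuilt from this node.
    service mysql stop   # run on each of the other nodes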

Actually moving the production servers to the cluster was a multi-step process. We had already set up a production cluster as a slave of the existing DRBD server, so we stood up and slaved another DRBD pair. (This DRBD server was there in case the switch went horribly wrong and we needed to fall back to a DRBD-based solution, which thankfully we didn't end up having to do.)

We then moved the rest of our async slaves (disaster recovery, backups, etc.) behind XtraDB Cluster. With those slaves sitting behind XtraDB Cluster, we executed a normal slave promotion process to move production traffic to the new system.
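
A promotion of that kind looks roughly like the following (a sketch only; host names are omitted and the real procedure includes more safety checks and connection draining):

    -- On the old master (the DRBD primary): stop accepting new writes.
    SET GLOBAL read_only = ON;

    -- On the XtraDB Cluster node acting as slave: wait until replication has
    -- fully caught up, then stop consuming from the old master.
    SHOW SLAVE STATUS\G        -- wait for Seconds_Behind_Master to reach 0
    STOP SLAVE;
    RESET SLAVE ALL;           -- plain RESET SLAVE on older servers

    -- Finally, repoint the application (and the async slaves, already
    -- replicating from the cluster) at the new system and allow writes again.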

Real-World Performance: Benefits

With a bit over six months of production use, we have found a number of benefits to running on XtraDB Cluster:

Real-World Performance: Annoyances

Along the way, we have also found a few frustrating issues:

 
