Have you been experiencing slow MySQL startup times in GTID mode? We recently ran into this issue on one of our MySQL hosting deployments and set out to solve the problem. In this blog, we break down the issue that could be slowing down your MySQL restart times, how to debug for your deployment, and what you can do to decrease your start time and improve your understanding of GTID-based replication.
We were investigating slow MySQL startup times on a low-end, disk-based MySQL 5.7.21 deployment which had GTID mode enabled. The system was part of a master-slave pair and was under a moderate write load. When restarting during a scheduled maintenance, we noticed that the database server took 5-10 minutes to start up and begin accepting connections. That kind of delay didn’t make sense, so we set out to investigate.
We used the popular Percona tool pt-ioprofile to see what the database was doing. pt-ioprofile is a very important utility in Percona’s popular toolkit that’s used to debug MySQL issues, and you can see the complete list of features in their documentation. The pt-ioprofile tool uses strace and lsof to watch a process’s I/O and print out a table of files and I/O activity.
So, we started MySQL, waited for the mysqld process to get spawned, and started pt-ioprofile to see what the issue might be:
```
# pt-ioprofile --profile-process mysqld --run-time 200
Tue Oct 9 15:42:24 UTC 2018
Tracing process ID 18677
     total      pread       read     pwrite      write      fsync  fdatasync       open      close   getdents      lseek      fcntl  filename
...
216.550641   0.000000 216.550565   0.000000   0.000000   0.000000   0.000000   0.000015   0.000040   0.000000   0.000021   0.000000  /mysql_data/binlogs/mysql-bin.000014
...
```
On running this multiple times, we observed that nearly all of the I/O time was spent reading a single file: the latest binary log. We also noticed that if that file was small, as it is when the log is flushed just before a shutdown, the startup was fast. These observations clearly pointed to the fact that MySQL was spending most of its startup time reading the latest binary log file.
As it turns out, in order to populate the values of gtid_executed and gtid_purged, the MySQL server must parse the binary log files.
Here’s a summary of the method recommended by the MySQL 5.7 documentation, depending on whether binlog_gtid_simple_recovery is set to FALSE or TRUE:
To compute gtid_executed, the server iterates over the binary log files from the newest to the oldest, stopping at the first file that contains a Previous_gtids_log_event. This process can be very time consuming if there is a large number of binary log files without GTIDs, for example, those created when gtid_mode=OFF.
Similarly, to compute gtid_purged, the server iterates over the binary log files from the oldest to the newest, stopping at the first file that contains either a nonempty Previous_gtids_log_event or at least one GTID event.
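If you want to see what the server computed after a restart, you can inspect the relevant system variables from the mysql client. This is a minimal sketch assuming a running MySQL 5.7+ server:

```
mysql> SHOW GLOBAL VARIABLES LIKE 'binlog_gtid_simple_recovery';
mysql> SELECT @@GLOBAL.gtid_executed, @@GLOBAL.gtid_purged;
```

Comparing these values against the Previous_gtids_log_event entries shown by mysqlbinlog can confirm which files the server had to parse at startup.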
So, this forms the basis of our understanding of how things used to work in older versions. However, certain optimizations can be made when binlog_gtid_simple_recovery is TRUE, which is the default in MySQL 5.7.7 and later. This is the case we are interested in:
So, for MySQL versions 5.7.7 and above, only the newest and the oldest binary log files are read during system startup to correctly initialize the GTID system variables. Reading the oldest binary log file isn’t expensive, since the event MySQL is looking for, Previous_gtids_log_event, is always the first event in a binary log file.
However, in order to correctly compute gtid_executed, the server must read through the entire latest binary log file and collect all the events in that file. So, the system startup time becomes directly proportional to the size of the latest binary log file.
Note that the situation is even worse when binlog_gtid_simple_recovery is FALSE. Since it is no longer the default option in recent releases, it is not much of a concern.
Having understood the cause of the issue we were running into, the solution we decided on was fairly obvious: reduce the size of the binary log files. The default size of binary log files is 1GB. It takes time to parse through a file of this size during startup, so it makes sense to decrease the value of max_binlog_size to a lower value.
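As a sketch, max_binlog_size can be lowered either in the server configuration file (taking effect on the next restart) or dynamically. The 256MB value below is an illustrative choice for this kind of workload, not a recommendation from the MySQL documentation; tune it to your own write rate and acceptable restart time.

```
# my.cnf — illustrative value, not a universal recommendation
[mysqld]
max_binlog_size = 268435456   # 256MB instead of the 1GB default
```

The same change can be applied without a restart via the mysql client with `SET GLOBAL max_binlog_size = 268435456;`, though it should still be persisted in the configuration file so it survives the next restart.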
If decreasing the size of the binary log file is not an option, then flushing the binary log files just before a maintenance shutdown of the mysqld process can help to decrease the binlog GTID recovery times.
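Flushing the logs rotates the server onto a fresh, nearly empty binary log file, so the file parsed during the next startup is small. A minimal sketch of a pre-maintenance sequence, assuming client access to the server (the SHUTDOWN statement is available in MySQL 5.7.9 and later):

```
mysql> FLUSH BINARY LOGS;
mysql> SHUTDOWN;
```

Issuing the two statements back to back keeps the window for new writes into the fresh binary log as short as possible.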