This is a guest post by Jeff Behl, VP Ops @ LogicMonitor. Jeff has been a bit herder for the last 20 years, architecting and overseeing the infrastructure for a number of SaaS-based companies.
An inevitable part of disaster recovery planning is making sure customer data exists in multiple locations. In the case of LogicMonitor, a SaaS-based monitoring solution for physical, virtual, and cloud environments, we wanted copies of customer data files both within a data center and outside of it. The former was to protect against the loss of individual servers within a facility, and the latter for recovery in the event of the complete loss of a data center.
Like most everyone who starts off in a Linux environment, we used our trusty friend rsync to copy data around.
The obvious solution was to store this backup job information in a database: a repository for backup job metadata, where jobs themselves can report their status, and where other backup components can get the information needed to coordinate tasks such as removing old jobs. It would also let us monitor backup job status via simple queries: the number of jobs running (in total and per server), the time since the last backup, the size of the backup jobs, etc.
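To give a feel for the sort of queries involved, here is a rough sketch against a hypothetical jobs collection, using MongoDB's Python driver for illustration (collection, field, and host names are all made up, not our actual schema):

```python
from datetime import datetime
from pymongo import MongoClient, DESCENDING

# Hypothetical metadata repository -- names are illustrative only.
client = MongoClient("mongodb://backup-db:27017/")
jobs = client.backups.jobs

# A backup job registers itself when it starts...
job_id = jobs.insert_one({
    "server": "db3.example.com",
    "path": "/data/customer42",
    "status": "running",
    "started": datetime.utcnow(),
}).inserted_id

# ...and reports the outcome when it finishes.
jobs.update_one({"_id": job_id},
                {"$set": {"status": "complete",
                          "finished": datetime.utcnow(),
                          "bytes": 536870912}})

# Operational questions then become one-liners:
running_total = jobs.count_documents({"status": "running"})
running_per_server = jobs.aggregate([
    {"$match": {"status": "running"}},
    {"$group": {"_id": "$server", "count": {"$sum": 1}}},
])
last_backup = jobs.find_one({"status": "complete"},
                            sort=[("finished", DESCENDING)])
```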
So the first idea was to keep using rsync but track the status of jobs in MongoDB. It was a kludge, though, to have to wrap all sorts of reporting and querying logic in scripts around rsync. The backup job metadata and the actual backed-up files were still separate and decoupled, with the metadata in MongoDB and the backed-up files residing on a disk on some system (not necessarily the same one). How nice it would be if the data and the database were combined: if I could query for a specific backup job, then use the same query language to retrieve the actual backed-up file; if restoring data files were just a simple query away... Enter GridFS.
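A minimal sketch of what that looks like with GridFS and PyMongo (file names, hosts, and metadata fields are illustrative, not our production code):

```python
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://backup-db:27017/")  # hypothetical host
db = client.backups
fs = gridfs.GridFS(db)

# Backing up: stream the file into GridFS, tagged with job metadata,
# so the file and its metadata live in the same database.
with open("/data/customer42/datafile.db", "rb") as f:
    fs.put(f,
           filename="customer42/datafile.db",
           metadata={"server": "db3.example.com", "job": "nightly"})

# Restoring: the file comes back with the same query language used
# to find the job -- here, simply the latest version by filename.
latest = fs.get_last_version("customer42/datafile.db")
with open("/tmp/restore/datafile.db", "wb") as out:
    out.write(latest.read())
```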
And of course MongoDB replication works with GridFS, meaning backed-up files are immediately replicated both within the data center and off-site. With a replica inside of Amazon EC2, snapshots can be taken to keep as many historical backups as desired. Our setup now looks like this:
Advantages
LogicMonitor believes all aspects of your infrastructure, from the physical level to the application level, should be in the same monitoring system: UPSs, chassis temperature, OS statistics, database statistics, load balancers, caching layers, JMX statistics, disk write latency, etc., etc. It should all be there, and this includes backups. To that end, LogicMonitor can not only monitor general MongoDB statistics and health, but can also execute arbitrary queries against MongoDB. These queries can be for anything from login statistics to page views to (guess what?) backup jobs completed in the last hour.
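For instance, a check along these lines (field names again hypothetical) is all it takes to feed an alert when backups stop completing:

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://backup-db:27017/")
jobs = client.backups.jobs

# Backup jobs completed in the last hour -- a number worth alerting on
# if it ever drops to zero.
one_hour_ago = datetime.utcnow() - timedelta(hours=1)
recent = jobs.count_documents({"status": "complete",
                               "finished": {"$gte": one_hour_ago}})
print(recent)
```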
Now that our backups are all done via MongoDB, I can keep track of (and more importantly, be alerted on):