Synchronizing databases in different geographic locations
Friday, December 7, 2007 at 11:32AM
Our company offers a web service that is provided to users from several different hosting centers across the globe.
The content and functionality at each of the servers are almost exactly the same, and we could have based them all in a single location. However, we chose to distribute the servers geographically to offer our users the best performance, regardless of where they might be.
Up until now, the only content on the servers that has had to be synchronized is the server software itself. The features and functionality of our service are being updated regularly, so every week or two we push updates out to all the servers at basically the same time. We use a relatively manual approach to do the updating, but it works fine.
Sometime soon, however, our synchronization needs are going to get a bit more complex.
In particular, we'll soon start offering a feature at our site that will involve a database with content that will change on an almost second-by-second basis, based on user input and activity. For performance reasons, a complete instance of this database will have to be present locally at each of our server locations. At the same time, the content of the database will have to be synchronized across all server locations, so that users get the same database content, regardless of the server they choose to visit.
We have not yet chosen the database that we'll use for this functionality, although we are leaning towards MySQL. (We are also considering PostgreSQL.)
So, my question for the assembled experts is: What approach is the best one for us to use to synchronize the database instances across our servers?
Ideally, we'd like a solution that is resilient to a server location becoming unavailable, and we'd also prefer a solution that makes efficient use of bandwidth. (Processing power doesn't cost us a lot; bandwidth, on the other hand, can get expensive.)
FWIW ...
(1) Our servers run Apache and Tomcat on top of CentOS.
(2) I've found the following "how to" that suggests an approach involving MySQL that could address our needs: http://capttofu.livejournal.com/1752.html
Thanks!
Reader Comments (9)
The master-master approach is a good option. Another approach is to elect one server in one of the data centers as the write master. Writes go through that server, and you read from local caches. On data center failure you fail over to another server. This can work fine with a good network.
One of the major issues is how you decide when to fail over. Is it automated or is it a human decision? There are a lot of people who believe the consequences of a false failure are high, so a failover must always be triggered by a human who has verified the failure.
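To make the "single write master, local reads" idea concrete, here is a rough sketch of the application-side routing (Python with the MySQLdb driver; the host names, credentials, and the activity table are made up for illustration):

    # Sketch of application-side read/write routing for a "single write master,
    # local read replicas" setup. Host names, credentials, and the MySQLdb driver
    # are assumptions -- adapt to whatever driver and topology you actually use.
    import MySQLdb

    WRITE_MASTER = {"host": "db-master.example.com", "user": "app",
                    "passwd": "secret", "db": "service"}
    LOCAL_REPLICA = {"host": "127.0.0.1", "user": "app",
                     "passwd": "secret", "db": "service"}

    def connect(params):
        return MySQLdb.connect(**params)

    def execute_write(sql, args=None):
        """All writes go through the single write master."""
        conn = connect(WRITE_MASTER)
        try:
            cur = conn.cursor()
            cur.execute(sql, args)
            conn.commit()
        finally:
            conn.close()

    def execute_read(sql, args=None):
        """Reads hit the local replica; fall back to the master if it is down."""
        try:
            conn = connect(LOCAL_REPLICA)
        except MySQLdb.OperationalError:
            conn = connect(WRITE_MASTER)
        try:
            cur = conn.cursor()
            cur.execute(sql, args)
            return cur.fetchall()
        finally:
            conn.close()

    # Example usage (hypothetical table):
    # execute_write("INSERT INTO activity (user_id, action) VALUES (%s, %s)", (42, "login"))
    # rows = execute_read("SELECT action FROM activity WHERE user_id = %s", (42,))

The failover question above then comes down to changing which host WRITE_MASTER points at, and deciding whether a script or a human gets to make that change.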
We select the data as XML and save it to a file.
Then a scheduled application on the second server retrieves that file (over the file system or FTP) and bulk loads it into the database.
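As a rough sketch of the idea (the table, its columns, and the file path are just examples), where only rows changed since the last run are exported and then upserted on the other side:

    # Sketch of an incremental XML sync: ship only rows changed since the last
    # run, rather than a full dump. The "activity" table, its columns, and the
    # MySQLdb driver are all assumptions for illustration.
    import MySQLdb
    import xml.etree.ElementTree as ET

    def export_changes(conn, since, path):
        """Write rows modified after `since` to an XML file."""
        cur = conn.cursor()
        cur.execute(
            "SELECT id, user_id, action, updated_at FROM activity "
            "WHERE updated_at > %s", (since,))
        root = ET.Element("rows")
        for id_, user_id, action, updated_at in cur.fetchall():
            row = ET.SubElement(root, "row")
            row.set("id", str(id_))
            row.set("user_id", str(user_id))
            row.set("action", action)
            row.set("updated_at", str(updated_at))
        ET.ElementTree(root).write(path)

    def apply_changes(conn, path):
        """Load an XML file produced by export_changes and upsert each row.
        Assumes `id` is the primary key of the table."""
        cur = conn.cursor()
        for row in ET.parse(path).getroot():
            cur.execute(
                "INSERT INTO activity (id, user_id, action, updated_at) "
                "VALUES (%s, %s, %s, %s) "
                "ON DUPLICATE KEY UPDATE user_id=VALUES(user_id), "
                "action=VALUES(action), updated_at=VALUES(updated_at)",
                (row.get("id"), row.get("user_id"),
                 row.get("action"), row.get("updated_at")))
        conn.commit()

    # The file itself can be moved between sites with scp or rsync on a cron job;
    # rsync -z also compresses it on the wire, which helps with bandwidth.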
Roberto Passeri
Thanks for the comments, Todd and Roberto.
Your suggestion of a central database server for writes is a useful one, Todd. What holds us back from that approach is the possibility that it could introduce a single point of failure. But one could imagine a setup with two central databases for writes, synchronized through a master-master approach, while the other servers are used for reads.
The XML approach you suggest, Roberto, would save on bandwidth, so it definitely has appeal. I like it particularly because the typical master-master approach -- as I understand it -- generally involves updating that takes place essentially in real time. And that's probably overkill for our purposes.
However, the thing that concerns me about the XML approach is its custom character -- i.e., the fact that we'll have to roll our own. Sure, we could do that, and it would very specifically meet exactly our needs. But I have the feeling that it wouldn't be as robust as something that comes "out of the box", and I also feel a bit like we'd be re-inventing the wheel.
Any other suggestions -- or specific solutions -- that people can recommend?
Dump data to XML and reload it?
All data? Or only updated data?
If you dump all the data, then to load it you first have to delete the existing rows and reload everything. You will get moments with no data in your tables; it could be under 1 second or over 10 minutes, depending on your data volume. It can also be a waste of bandwidth, since you always send all the information, not only the updated part. FTP is slow (login, change dir, etc.). I'm not sure about this solution for "real" sync.
As for master-master replication and bandwidth, you can set up a VPN with compression enabled. If no data is updated, you spend no traffic; if data is updated, you only need a minimal amount. Depending on your needs, master-master MySQL replication will do, as long as asynchronous replication is OK for you.
If you need synchronous replication, that's another story :)
What I would suggest is one master at each location, plus some read slaves at each location. You can scale a lot, and you can fall back to another master when one dies. Or even write to all of them :D
FYI ... Howto Forge has a tutorial on MySQL Master Master Replication here:
http://www.howtoforge.com/mysql_master_master_replication
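If you go that route, a small health check on each server helps catch a stopped replication thread or growing lag early. A rough sketch (Python with the MySQLdb driver; the host names, credentials, and lag threshold are placeholders):

    # Sketch of a health check for two-way (master-master) replication: on each
    # server, verify the slave threads are running and the lag is acceptable.
    # Host names, credentials, and the lag threshold are placeholders.
    import MySQLdb
    import MySQLdb.cursors

    SERVERS = ["db1.example.com", "db2.example.com"]
    MAX_LAG_SECONDS = 30

    def check_replication(host):
        conn = MySQLdb.connect(host=host, user="monitor", passwd="secret",
                               cursorclass=MySQLdb.cursors.DictCursor)
        try:
            cur = conn.cursor()
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status is None:
                return "%s: not configured as a slave" % host
            if status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
                return "%s: replication threads stopped" % host
            lag = status["Seconds_Behind_Master"]
            if lag is None or lag > MAX_LAG_SECONDS:
                return "%s: replication lag is %s seconds" % (host, lag)
            return "%s: OK (%s seconds behind)" % (host, lag)
        finally:
            conn.close()

    if __name__ == "__main__":
        for server in SERVERS:
            print(check_replication(server))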
Not exactly on topic, but has anyone researched what happens in a split-brain scenario with MySQL (or other databases') master-master replication?
What if the masters can't communicate between them, but can still reach some clients? (not an unlikely scenario on the internet)
Do you shut off write access? To one master? Or to both?
How will the OP's application cope with a read-only and possibly out-of-date DB?
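One hand-rolled way I could imagine coping (just a sketch of the idea, not a tested fencing mechanism; the host names and credentials are placeholders) is to have each master demote itself to read-only whenever it can see neither its peer nor a third "witness" host, so at most one side keeps accepting writes:

    # Sketch of a crude split-brain guard: if this master loses contact with its
    # peer AND with a third witness host, assume we are the isolated side and
    # stop accepting writes by setting the server read-only.
    # Host names and credentials are placeholders; this is not a tested solution.
    import socket
    import MySQLdb

    PEER_MASTER = "db2.example.com"
    WITNESS_HOST = "witness.example.com"

    def reachable(host, port=3306, timeout=3):
        """True if we can open a TCP connection to host:port."""
        try:
            sock = socket.create_connection((host, port), timeout)
            sock.close()
            return True
        except socket.error:
            return False

    def enforce_write_policy():
        isolated = not reachable(PEER_MASTER) and not reachable(WITNESS_HOST, 22)
        conn = MySQLdb.connect(host="127.0.0.1", user="admin", passwd="secret")
        try:
            cur = conn.cursor()
            # SET GLOBAL read_only does not affect users with the SUPER privilege,
            # so the application account must not have SUPER.
            cur.execute("SET GLOBAL read_only = %s" % (1 if isolated else 0))
        finally:
            conn.close()

    if __name__ == "__main__":
        enforce_write_policy()

Whether the isolated side should really go read-only automatically, or just page a human, comes back to Todd's earlier point about false failures.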
Btw, Amazon has released details on their SimpleDB offering:
"The service runs within Amazon's high-availability data centers to provide strong and consistent performance. To prevent data from being lost or becoming unavailable, your fully indexed data is stored redundantly across multiple servers and data centers."
I'm certainly not (yet) advocating putting your database at amazon, but it might be worth looking into.
The XML approach will fail as soon as you get any kind of data volume. Are you going to FTP 500 GB? That's going to be a long downtime. Even just FTPing a 2 GB data difference per day, you'll be behind by 2 GB of transactions... way too much for me. If you only have a few GB of data, maybe; but then again, if you don't have much data in the first place, I don't see how you could have performance issues that would justify the need for geographic dispersal.
You can hit any site in the world in seconds. You're better off caching or using a CDN and just staying in one data center, two for disaster recovery only.
Even a mid-sized site in a single location can get awesome performance with new servers. You can pick up a single server with four quad-core CPUs and 96 GB of RAM for less than $23k. With 96 GB you can cache nearly all small to mid-sized sites in RAM, and with 16 cores, lighttpd, and memcached you'll have no problem handling 15k requests/second, or roughly 1.2 billion hits a day.
Just my opinion here, but look at the PlentyOfFish architecture. He gets it right. You don't need all of these servers that most sites have. They either don't know how to architect or they have some seriously crappy hardware. Unless you are in the top 100 sites on the planet, you shouldn't need more than the 4 servers I spec'ed out above. That would be 64 Intel Xeon MP cores @ 2.93 GHz and 384 GB of RAM for about $92k. That would handle around 30 billion page views a month, in my estimation.
How do sites like eBay do multi-data-center active/active setups? I've always been curious about details like this in their setup.
Also - you don't need four servers unless you don't know how to architect? Doesn't that kinda depend upon what you're actually doing? :)
I was interested in implementing MySQL master-master replication to solve the same problem as the OP, but after researching it further, it just doesn't seem to be a very well-supported solution. There's very little documentation on it, and I could not find any mention of how to deal with syncing issues between the two masters (for example, the split-brain issue Joris mentioned).
The dual-master approach seems to be just a temporary hack until MySQL implements multi-source (i.e. multi-master) replication which they've already stated they're working on. So far, the best currently available solution seems to be master-slave replication with a failover to a slave if the master dies. Even then, the solution isn't so simple. There's no standard way to detect failure and select a slave to switch to. Everyone basically has to roll their own solution, which I'm planning to do as well, but it sure would be nice if this was standardized so everyone didn't have to reinvent the wheel.
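For what it's worth, the detection half of what I'm planning looks roughly like this (the host names and thresholds are made up, and the actual promotion and repointing of the application is the hard part I'm glossing over):

    # Sketch of the detection half of a hand-rolled failover: poll the master,
    # and if it stays unreachable, pick the least-lagged slave as the promotion
    # candidate. Actually promoting it and repointing the application and the
    # other slaves is deliberately left out. All names are placeholders.
    import time
    import MySQLdb
    import MySQLdb.cursors

    MASTER = "db-master.example.com"
    SLAVES = ["db-slave1.example.com", "db-slave2.example.com"]
    FAILURES_BEFORE_ALARM = 3

    def master_alive():
        try:
            conn = MySQLdb.connect(host=MASTER, user="monitor", passwd="secret",
                                   connect_timeout=5)
            conn.close()
            return True
        except MySQLdb.OperationalError:
            return False

    def slave_lag(host):
        """Replication lag in seconds; infinite if the slave is broken."""
        conn = MySQLdb.connect(host=host, user="monitor", passwd="secret",
                               cursorclass=MySQLdb.cursors.DictCursor)
        try:
            cur = conn.cursor()
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            lag = status and status["Seconds_Behind_Master"]
            return lag if lag is not None else float("inf")
        finally:
            conn.close()

    def watch():
        failures = 0
        while True:
            failures = 0 if master_alive() else failures + 1
            if failures >= FAILURES_BEFORE_ALARM:
                candidate = min(SLAVES, key=slave_lag)
                print("master down, best promotion candidate: %s" % candidate)
                break  # a human (or further automation) takes over from here
            time.sleep(10)

    if __name__ == "__main__":
        watch()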
PostgreSQL replication, which I looked into as well, is even further behind than MySQL's. All the solutions are at best complex and/or don't work across geographically separate sites.
I'd love to be wrong and have someone point out a simple way to synchronize data across sites in real time (or close to it) with support for failover. Anyone?