Tuesday
Sep 18, 2007
Sync data on all servers

I have a few Apache servers (around 11 at the moment) serving a small amount of data (around 44 GB right now).
For some time I have been using rsync to keep the content identical on all servers, but the amount of data has been growing, and rsync now takes too long to "compare" all the data between source and destination, and it generates a lot of I/O.
I have been looking at MogileFS. It seems a good and reliable option, but since the FUSE module is not finished, we would have to rewrite all our apps, and that is not an option at the moment.
Any ideas?
I just want a "real-time, non-resource-hungry" alternative to rsync. If I get more features along the way, they are welcome :)
Why do I prefer a distributed file system over NAS + NFS?
- I would need 2 NAS units to avoid a single point of failure, and NAS hardware is expensive.
- Non-shared hardware: every server has its own local disks.
- Since files are replicated, I can save a lot of money; RAID is not a must.
Thanks in advance for your help, and sorry for my English :)
Reader Comments (14)
What kind of data are you replicating? Can your database replicate it instead? Rdist is a popular alternative. DRBD (http://www.drbd.org/) is used to mirror devices. Maybe the Lustre (http://www.clusterfs.com/) cluster file system would work. A cheap redundant 1TB NAS may still be your easiest option. Prices have come down quite a bit.
Those files are basically images: JPGs.
DRBD could be an interesting option :D
Has anyone tried it?
DRBD does not allow mounting secondary nodes, not even read-only!
So it doesn't solve the problem.
Todd, NAS can get very expensive. Think about 12 servers in four different data centers: you would need 4 NAS units, times 2 for redundancy, so 8 NAS units.
There should be an "easy" way to sync data. I have been looking at Coda, but it has some rough edges.
Ah, well, when you put it that way it is a lot of money :-) http://www.openafs.org/ is another option. Have you thought about using Amazon's S3 for storage from all your data centers? That might make sense, especially if you insert a local caching layer for often used content in each data center.
After all, it just might be simplest to cross-mount all servers and simply write each file to every server as it comes in. The code would be pretty simple, straightforward, and efficient. You know which files change, so there's no reason to incur the overhead of a generalized syncing mechanism.
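A minimal Python sketch of that "write to every server" idea, assuming each remote server is already cross-mounted under a local directory. The function name `replicate_file` and the mount layout are hypothetical, and the temp-file-then-rename step is an added precaution rather than something the comment asks for:

```python
import shutil
from pathlib import Path
from typing import Dict, List


def replicate_file(source: Path, relative_name: str,
                   mount_roots: List[Path]) -> Dict[Path, bool]:
    """Copy `source` to `relative_name` under every mount root.

    Returns a mapping of mount root -> True if that copy succeeded.
    """
    results: Dict[Path, bool] = {}
    for root in mount_roots:
        target = root / relative_name
        tmp = target.with_name(target.name + ".tmp")
        try:
            target.parent.mkdir(parents=True, exist_ok=True)
            # Copy under a temporary name, then rename, so a web server
            # reading the directory does not pick up a half-written file.
            shutil.copyfile(source, tmp)
            tmp.replace(target)
            results[root] = True
        except OSError:
            # A dead or unreachable mount shows up as an OSError; record
            # the failure and keep going with the remaining servers.
            results[root] = False
    return results
```

The returned mapping tells the caller which servers missed the file, which is the information needed to decide what to do about a failed copy.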
I'm thinking about writing a small program using FAM to monitor the 2 dirs that get "new content" and scp those new/changed files directly.
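A rough Python sketch of that plan, with one loud assumption: instead of FAM, it polls the directory and compares modification times and sizes between two scans, because the standard library has no FAM bindings. The helper names, host name, and remote path are hypothetical, and the scp command is only built, not executed:

```python
import os
from pathlib import Path
from typing import Dict, List, Tuple

Signature = Tuple[int, int]  # (mtime in nanoseconds, size in bytes)


def snapshot(root: Path) -> Dict[str, Signature]:
    """Map each file path (relative to root) to its (mtime_ns, size)."""
    state: Dict[str, Signature] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            try:
                st = full.stat()
            except OSError:
                continue  # file vanished between the walk and the stat
            state[full.relative_to(root).as_posix()] = (st.st_mtime_ns, st.st_size)
    return state


def changed_files(before: Dict[str, Signature],
                  after: Dict[str, Signature]) -> List[str]:
    """Return files that are new or whose signature differs."""
    return sorted(rel for rel, sig in after.items() if before.get(rel) != sig)


def scp_command(root: Path, rel: str, host: str, remote_root: str) -> List[str]:
    """Build (but do not run) an scp command for one changed file."""
    return ["scp", "-p", str(root / rel), f"{host}:{remote_root}/{rel}"]
```

A caller would take a snapshot, sleep, take another, and pass each entry of `changed_files(...)` to something like `subprocess.run(scp_command(...), check=True)`. Polling still walks the tree on every pass, so a real notification mechanism such as FAM would be cheaper on a large tree.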
I've never seen FAM (File Alteration Monitor, http://www.penguin-soft.com/penguin/man/1/fam.html) before. Interesting, thanks. Any ideas on how you'll handle the failure scenarios, or is the loss of an occasional image OK?
If we can't sync to one of the servers, I will probably bring it "down" and take it out of the load balancer, since it's possible that there are more problems. It will probably already be out, thanks to heartbeat doing that first :)
I have already coded a basic program with those features. Now it's time to test it and, if it does the job, finish it: real and useful logging, etc.
But anyway, I'm still open to any suggestions :D
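That failure policy could look roughly like the sketch below. The `push` callable stands in for whatever actually copies the files (scp, for example) and is an assumption, as is the function name. Removing a server from the load balancer is left to the caller, since that step depends on the load balancer in use:

```python
from typing import Callable, Iterable, List, Tuple


def sync_and_partition(servers: Iterable[str],
                       push: Callable[[str], None]) -> Tuple[List[str], List[str]]:
    """Try to sync every server; split them into (in_rotation, pulled).

    `push(server)` is expected to raise an exception when the sync fails.
    """
    in_rotation: List[str] = []
    pulled: List[str] = []
    for server in servers:
        try:
            push(server)
        except Exception:
            # A failed sync is treated as a sign of deeper trouble, so the
            # server is marked for removal rather than retried here.
            pulled.append(server)
        else:
            in_rotation.append(server)
    return in_rotation, pulled
```

The caller would then remove every server in `pulled` from the load balancer and re-sync it before putting it back.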
Maybe this could help improve all this:
http://kerneltrap.org/Linux/Generic_Filesystem_Caching_Facility
1. Why not just use RAID mirrors?
2. Export all your primary/secondary servers as iSCSI and create virtual RAID-1 devices. This might be better than rsync.
Isn't the downside of this approach that only one machine can own the device, so the files aren't usable on the other side? Don't you also need a transaction mechanism to handle sequencing and write failures? Or is simply failing over and running fsck sufficient (though it may take a long time on large disks)? Though IBM's HAGEO (http://unixarticles.com/articles/102/1/IBM-High-Availability-Geographic-Clustering-software) seems to have solved all the problems.
>Isn't the downside of this approach that only one machine can own the device, so the files aren't usable on the other side?
That's right. You can connect an iSCSI device to many machines, but with no caching on the clients. The same problem appears when connecting to a SAN, although I believe using a SAN-oriented filesystem might help (OCFS, GPFS, PolyServe Matrix, etc.).
Anyway, since it's only 44 gigs, using a DFS or SAN is somewhat overkill. DRBD could be just fine (but it takes time to switch to the slave server, and it's still a block device, so some filesystems may perform poorly).
Have to say one thing.
If you had written a File Service gateway into your architecture, you would simply be able to swap that code out and start using MogileFS or something else easily.
Not to be harsh, but this architectural approach is what would have allowed you to upgrade very easily.
ged
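To illustrate the gateway idea, here is a minimal Python sketch. The interface and class names (`FileStore`, `LocalDirStore`, `save_upload`) are hypothetical and do not come from MogileFS or any other library. Application code talks only to the interface, so a different backend could be swapped in later by adding another subclass:

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileStore(ABC):
    """The only storage API the application is allowed to call."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None:
        ...

    @abstractmethod
    def get(self, name: str) -> bytes:
        ...


class LocalDirStore(FileStore):
    """Backend that keeps files in a plain local directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def put(self, name: str, data: bytes) -> None:
        path = self.root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, name: str) -> bytes:
        return (self.root / name).read_bytes()


def save_upload(store: FileStore, name: str, data: bytes) -> None:
    """Application code: depends on FileStore, not on a concrete backend."""
    store.put(name, data)
```

This only helps when you control the application code, which is exactly the limitation with third-party apps.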
A friend of mine pointed me to:
http://furquim.org/chironfs/
It's a basic but simple solution! I think it could help a lot in many basic configurations :D
Ged, sometimes you are not running your own apps, and that's totally out of your control :)