Reader Comments (4)
The thing I thought was really clever about the Google File System paper was that they didn't use erasure coding for replication. Copy-based replication is simpler, and you can read any replica without having to know about any other replica. With storage being so cheap these days, I think you'd really have to justify using erasure-code-based replication over a simple copy-based scheme.
People should look more at erasure coding for bulk transfer and distribution of data, though. There it could be really awesome, although I think there are patents on the "rateless" variants.
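To make the "rateless" idea concrete, here is a toy LT-style fountain-code sketch in Python. It is not any particular (or patented) scheme, and it uses a naive uniform degree distribution rather than a real soliton distribution; the point is only to show why this is attractive for bulk distribution: the sender streams an endless supply of encoded symbols, and the receiver rebuilds the data from whichever symbols happen to arrive.

```python
# Toy "rateless" (LT-style) fountain code sketch. Naive degree distribution,
# single-byte blocks, from-scratch decoding each round -- illustration only.
import random

def encode_symbol(blocks, rng):
    """XOR a random subset of source blocks into one encoded symbol."""
    k = len(blocks)
    degree = rng.randint(1, k)                  # naive degree choice (toy only)
    indices = set(rng.sample(range(k), degree))
    value = 0
    for i in indices:
        value ^= blocks[i]
    return indices, value

def decode(symbols, k):
    """Peeling decoder: repeatedly resolve symbols with exactly one unknown block."""
    decoded = {}
    progress = True
    while progress and len(decoded) < k:
        progress = False
        for indices, value in symbols:
            unknown = indices - decoded.keys()
            if len(unknown) == 1:
                i = unknown.pop()
                v = value
                for j in indices:
                    if j != i:
                        v ^= decoded[j]
                decoded[i] = v
                progress = True
    return decoded if len(decoded) == k else None

# Usage: 8 source blocks (single bytes for brevity); keep collecting encoded
# symbols until decoding succeeds -- no coordination about *which* symbols
# arrive is needed, only that enough of them do.
rng = random.Random(42)
source = [rng.randrange(256) for _ in range(8)]
received = []
while True:
    received.append(encode_symbol(source, rng))
    result = decode(received, len(source))
    if result is not None:
        break
assert [result[i] for i in range(len(source))] == source
print(f"recovered all {len(source)} blocks from {len(received)} received symbols")
```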
I guess GFS already chooses which replica is better, i.e. nearer, so it always knows about all replicas for a given file. The major question here, I think, is whether Google packs as many hard drives into one box as possible. If so, then the method might help reduce the number of boxes, clusters, power usage, and replication bandwidth; if not, then the bottleneck is somewhere else, i.e. bandwidth or computing power, and it is unlikely to be suitable.
BTW, I did not find any real implementation of erasure-code replication, just some PDFs about how it might be useful.
Parity is an erasure code; thus, all RAID is based on erasure code replication.
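To make that point concrete, here is a minimal byte-wise parity sketch in Python, in the style of RAID-4/5 single parity (illustrative only, not any particular RAID implementation): the parity block is the XOR of the data blocks, and any single lost block can be rebuilt by XOR-ing the survivors with the parity.

```python
from functools import reduce

def parity_block(blocks):
    """RAID-4/5-style parity: the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block by XOR-ing the survivors with the parity."""
    return parity_block(list(surviving_blocks) + [parity])

# Usage: three data blocks plus one parity block; losing any one data block
# (one "erasure") is survivable, which is exactly the erasure-code property.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p = parity_block([d0, d1, d2])
assert rebuild_missing([d0, d2], p) == d1
```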
I am not sure about the technology behind it (erasure codes, schmerasure codes!), but a year or so ago I met with the folks at Cleversafe, who have both a commercial and an open-source offering. Check it out here:
http://www.cleversafe.org/dispersed-storage
According to their CTO, they take the original data and split it up into 11 slices, each slice about 10% of the original data. For retrieval, it is sufficient to have 6 of the 11 slices accessible (i.e., 5 can be down). An added security benefit is that the slices are prepared so that no individual slice, if captured separately, carries any recognizable data.
Regards,
-- Peter
www.3tera.com
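For readers wondering how "any 6 of 11 slices" can be enough, below is a toy k-of-n dispersal sketch in Python. To be clear, this is not Cleversafe's algorithm, and the 6-of-11 parameters are only borrowed from the description above; it encodes each small chunk of data as the coefficients of a polynomial over a prime field and stores one evaluation per slice, so any k slices determine the polynomial (and hence the chunk), while no single slice contains the plaintext bytes verbatim. Real systems typically use Reed-Solomon codes over GF(2^8), and this toy makes no claim of cryptographic secrecy.

```python
# Toy k-of-n information dispersal over the prime field GF(257).
# NOT Cleversafe's algorithm -- just a sketch of the general "any k of n
# slices reconstruct the data" idea described in the comment above.
P = 257            # prime modulus; byte values 0..255 fit as field elements
K, N = 6, 11       # any K of the N slices are sufficient to rebuild the data

def split(data: bytes) -> dict:
    """Treat each K-byte chunk as polynomial coefficients and store, in slice x,
    the polynomial's value at point x (for x = 1..N)."""
    chunks = [data[i:i + K].ljust(K, b"\0") for i in range(0, len(data), K)]
    slices = {x: [] for x in range(1, N + 1)}
    for chunk in chunks:
        for x in range(1, N + 1):
            y = sum(c * pow(x, e, P) for e, c in enumerate(chunk)) % P
            slices[x].append(y)
    return slices                      # slice number -> list of field elements

def solve_mod(A, b, p):
    """Solve the linear system A c = b over GF(p) by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])   # pick a pivot row
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, p)
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(vr - f * vc) % p for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def combine(available: dict, length: int) -> bytes:
    """Rebuild the original bytes from any K of the N slices."""
    points = sorted(available)[:K]
    vandermonde = [[pow(x, e, P) for e in range(K)] for x in points]
    out = bytearray()
    num_chunks = len(available[points[0]])
    for i in range(num_chunks):
        values = [available[x][i] for x in points]
        out.extend(solve_mod(vandermonde, values, P))   # recovered chunk bytes
    return bytes(out[:length])

# Usage: disperse a message into 11 slices, then rebuild from just 6 of them.
msg = b"dispersed storage demo"
slices = split(msg)
some_six = {x: slices[x] for x in (2, 3, 5, 7, 9, 11)}
assert combine(some_six, len(msg)) == msg
```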