Cheap storage: how Backblaze takes matters in hand
Tuesday, September 1, 2009 at 6:13PM
Backblaze blogs about how they built their own storage infrastructure on the cheap to run their cloud backup service. This episode: the hardware.
Sorry, just a link this time.
Reader Comments (7)
Yeah, I just read the blog.
They are going to run into a lot of trouble down the line doing it this way. I can see a lot of weaknesses in the design.
1. Bit rot is going to be an issue at the 60TB range. I am surprised they picked JFS over Solaris ZFS for this.
2. Syba SATA controllers on a SiliconImage chipset. Big no-no.
3. 4 SATA adapters controlling 45 SATA drives. Obviously they didn't know about SAS expanders and the SATA Tunneling Protocol.
4. If one Syba SATA controller dies, their entire software RAID6 goes bust.
5. No mirrored OS drives
6. A dual-core Core 2 Duo with 4GB of RAM for 60TB of storage?
7. No hotswappability. Big oops.
8. A single 760W Zippy power supply for 46+ hard drives, including the OS drive. In the long run, it will have spin-up issues.
Those are just the issues I can think of right now. Basically this is a very risky cheap solution that I wouldn't even trust for storing my personal stuff, let alone someone else's data. Hope they don't get into trouble for data loss down the line, which is absolutely possible given the hardware design.
Backblaze is an online backup provider, so their choices make some sense in that context.
Addressing each point that I can:
2. Syba SATA controllers - I agree they are cheap, but they can be stable with the right drivers. You'll still see failures, though.
3. SAS expanders - Probably a cost choice here. SAS with SATA tunneling is cheaper than some of the other enterprise alternatives, but it might not have been cheap enough.
4. SATA controller/RAID failure - I don't think I would have used RAID; I would instead have tried to spread the data across nodes at the application level, to allow nodes to fail with no data loss (see the sketch at the end of this comment). RAID 6 in a box that is going to have flaky hardware does seem risky.
5. No mirrored OS drives - It's an embedded system; if the OS drive fails, swap in another. Personally I would have used a CompactFlash card and an adapter. Cheaper and more stable.
6. CPU - For a cold-storage backup filer, it's adequate processing power.
7. No hotswap - If they had spread across nodes I wouldn't have a problem with this, but in the end, pulling a unit and swapping the drives while it's running is seldom a good idea. Front-panel access wouldn't have held enough drives.
8. 760W power supply - I'd need to look at the specs on their drive choice, but some of the "green" consumer drives use 1 watt idle and 5 watts active. Even without staggering the spin-up they are probably OK.
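To make point 4 concrete, here is a minimal sketch of application-level replication in Python. The pod names, the 3x replication factor, and the dict-based "store" are all illustrative assumptions, not Backblaze's actual scheme:

import hashlib

# Hypothetical application-level replication instead of in-box RAID.
NODES = ["pod-01", "pod-02", "pod-03", "pod-04", "pod-05"]
REPLICAS = 3

def placement(blob_id, nodes=NODES, replicas=REPLICAS):
    # Deterministically pick `replicas` distinct nodes for a blob.
    start = int(hashlib.sha1(blob_id.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

def write_blob(blob_id, data, store):
    # `store` is a dict of dicts standing in for the per-node back ends.
    for node in placement(blob_id):
        store.setdefault(node, {})[blob_id] = data

def read_blob(blob_id, store):
    # Any surviving replica is enough to serve the read.
    for node in placement(blob_id):
        if blob_id in store.get(node, {}):
            return store[node][blob_id]
    raise IOError("all replicas lost")

With something like this sitting above the boxes, a dead controller or even a dead pod becomes a re-replication job rather than a data-loss event.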
To the anonymous poster above:
2. Anyone who uses Syba controllers is someone who doesn't know anything about storage, or doesn't care about their data. Even you agree that the Sybas will fail.
3. A SAS expander is cheaper than SATA port multipliers. Look into it. 24-port SAS expander ICs are cheaper than 5x 5-port SATA port multipliers while providing far better shared aggregate 4x SAS bandwidth. SAS HBAs can be had new for as low as $120 now (LSI 1068E-based cards, which is what Sun uses in their Thumpers and Thors), so it is actually only slightly more than going with Syba.
4. Syba cards don't do RAID6. So if they are doing RAID6, it is stupid Linux fake RAID. If you lose one Syba controller (it is not if, it is when), you lose all 15 drives it fronts at once; RAID6 can only survive two missing members, so the entire array is gone immediately.
5. Good luck swapping in another OS drive without interrupting service. A little too much cost cutting?
6. A good file system is all about caching. For a fake RAID6, computing the parity in software over 45 drives would have killed a Core 2 Quad, let alone a Core 2 Duo (see the parity sketch after this comment). 4GB of RAM is how much RAM people put in their laptops.
7. I don't get why you think hotswapping a drive is a bad idea. With 45 Seagate 7200.11s in a cage, you will see drives fail weekly. Front-panel hotswap JBODs would have given you about half the density (24 bays in 4U), but you can use 2 of those JBODs to get 48 drives while keeping front-panel hot swappability. By the way, 4U 24-bay SAS expander JBODs from Supermicro cost about $1200 each with redundant power supply paths. That beats a custom non-hotswappable case job with red paint any day in reliability.
8. I had an error in my first post. Obviously, they had 2 Zippys, each powering half of the system. This design is still crap. If the supply that powers half of the drives fails, you lose the RAID6. If the supply that powers the system fails, you lose the whole system. BTW, even a Western Digital Green idles at about 4.5W and draws 7.5W under load. I haven't seen any 3.5-inch drives that idle at 1W (only SSDs idle at 1W). The choice of Seagate 7200.11 consumer-grade drives is funny since they are the ones that had the firmware issues. They draw 8W idle and about 10-12W measured in the real world (see the quick check after this list).
1. You still don't have an answer for item 1 I posted above. Bit rot and silent bit corruption on consumer-grade hard drives will rear their ugly head going this way. 1TB is 0.8*10^13 bits, so 60TB is 4.8*10^14 bits of capacity. Those Seagates are rated at what, a 10^-14 non-recoverable bit error rate? So silent data corruption is not a probability, but a certainty. Hope that data is spread to at least 3 filers or they will lose data in the long run.
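For what it's worth, the arithmetic behind points 8 and 1 is easy to check with a few lines of Python. This is a minimal sketch using the per-drive wattage and error-rate figures quoted above; the ~25W spin-up surge is an assumed ballpark for a 7200rpm drive, not a number from the post:

# Power budget for one pod, using the per-drive figures quoted above.
drives = 45
idle_w, active_w = 8.0, 12.0   # Seagate 7200.11: idle / real-world active
spinup_w = 25.0                # assumed ballpark surge for a 7200rpm drive at start-up
print(f"all drives active:    {drives * active_w:.0f} W")   # ~540 W of drive load alone
print(f"unstaggered spin-up:  {drives * spinup_w:.0f} W")   # ~1125 W surge if all start at once

# Expected unrecoverable read errors over one full pass of the pod.
capacity_bits = 60 * 8e12      # 60 TB = 4.8 * 10^14 bits
ure_per_bit = 1e-14            # consumer SATA class: ~1 unrecoverable error per 10^14 bits read
print(f"expected UREs per full read: {capacity_bits * ure_per_bit:.1f}")        # ~4.8
print(f"chance of at least one:      {1 - (1 - ure_per_bit) ** capacity_bits:.2f}")  # ~0.99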
This whole thing looks like the designer, Nufire, wants some internet fame. He is going to get the opposite. It is very obvious to me that it is a Sun Thumper copycat design built from Newegg's "lowest priced" components. And he didn't even know that what makes the Sun Thumper tick is not the vertical mounting of the drives, but the ZFS file system underneath, which makes 48-drive arrays across 6 LSI 1068E SAS HBAs stable and fast, and able to tolerate up to 2 1068E HBA failures (not, you know, Syba SATA cards driving SiliconImage SATA port multipliers, Linux fake RAID, and JFS).
I know my comments sound cruel. But truthfully, Nufire knows nothing about storage.
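To put the parity point (item 6 above) in concrete terms, here is a minimal sketch of the math a software RAID6 has to run for every stripe it writes: one XOR pass for the P parity and one GF(2^8) multiply-and-XOR pass for the Q syndrome, across every data block. This is a bare per-byte illustration, not the Linux md implementation (which does the same algebra with heavily optimized code), and the 45-drive, 4KB-chunk layout is assumed from the comments above:

def gf_mul2(x):
    # Multiply by 2 in GF(2^8), reducing by the polynomial 0x11D commonly used for RAID6.
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF

def raid6_pq(stripe):
    # stripe: list of equal-length byte blocks, one per data drive in the array.
    p = bytearray(len(stripe[0]))   # P parity: plain XOR of all data blocks
    q = bytearray(len(stripe[0]))   # Q syndrome: sum of 2^i * D_i over GF(2^8)
    for block in reversed(stripe):  # Horner's rule across the data drives
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return p, q

# One stripe over 43 data blocks (a 45-drive RAID6 keeps two drives' worth of
# parity per stripe), with small 4KB chunks to keep the demo quick.
stripe = [bytes([d]) * 4096 for d in range(43)]
p, q = raid6_pq(stripe)
print(len(p), len(q))   # 4096 4096

Real implementations vectorize this, but the cost per stripe still scales with the number of data drives, which is the point being argued above.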
A point has been missed, though: they say that the box by itself is not redundant - what makes their system work is all the application-layer stuff they have above it - presumably, at the very least, each person's data is stored on 2 of these devices.
So in some ways it's a bit like Google's solution - cheap, commodity storage, but data stored in many places, and you get your redundancy that way.
I encourage all you armchair quarterbacks to read the following blog post:
http://perspectives.mvdirona.com/2009/09/03/SuccessfullyChallengingTheServerTax.aspx
This guy did the architecture for Amazon S3. I think he knows what he is talking about.
I think the point here is that these are not designed to be self-contained redundant units. Rather, they are treated as one big disk by the application architecture that runs at a higher level. While the drives themselves are not hot-swappable, the units are. When you are running tens or even hundreds of them, making single drives hot-swappable is akin to making the platters in your hard drive swappable. It's too fine a granularity to really matter. Better to take the whole unit offline for repair and replace it with a new one.
I have to disagree with you naysayers...
Isn't the most important characteristic of storage that you lose nothing and can retrieve what you stored?
Do you know anyone who has lost data or been unable to get their data back?
Me either.
Sounds like Backblaze has met the criteria I would value in my storage.