Why You Don't Want to Aim for 100% Uptime According to Google's Urs Hölzle
Tuesday, June 2, 2015 at 8:56AM
Wait, you don't want 100% uptime? Who said such a crazy thing? Risk taker Urs Hölzle, Google's senior VP for technical infrastructure, in "Google's Infrastructure Chief Talks SDN":
Whenever you try something new, there are going to be problems with it....We were willing to take the risk to get the innovation. Our VP who runs our site reliability gave a great talk about not aiming for 100% uptime....The easiest way to make it be at 100% is to resist change, because change is when bad things happen. Looks great for your SLA, but it's bad for your business because you slow down innovation.... In the first year of running B4, [we asked] "Will we have an outage?" Realistically, yes there's a high chance because it was all new code. Are we going to be perfect? Probably not. You have to have a willingness to take a little risk.
Reader Comments (3)
There's a great talk online that I can't find about the structure of SRE at Google by the head...of...SREing, not sure what the title is. But the notions were, as I remember them:
1. Apps have to stay within an error budget. You blow your budget, you don't get to release any features for a while.
2. SREs for an app and devs for an app come from the same budget. If you make your app so that it requires a staff of 200 SREs, well, you're going to have fewer devs.
3. Dev teams have to spend about 5% of their time serving as SREs. Nothing like having to deal with your app's headaches yourself to prioritize fixes.
4. The SRE team can disband: everyone can go to other products or apply to be engineers or whatever. It's very rare, but it happens. Then the devs have to do SREing for their own product for at least a while, which is as difficult as you might imagine.
I post it because item 1 is like "don't have 100% uptime" but the whole list is awesome. It's basically entirely people-centered ways to encourage devs to care about developing apps safely and with minimal operational madness, without totally constricting the ability to move quickly and (slightly) break things.
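To make item 1 concrete, here is a minimal sketch of how an error budget could be tracked. It is not Google's implementation: the ErrorBudget class, its field names, and the 99.9% target are all assumptions chosen for illustration. The idea is simply that the budget is the share of requests the SLO allows to fail, and releases stop once that share is used up.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error-budget tracker; names and numbers are illustrative."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served in the current window
    failed_requests: int   # requests that failed in the same window

    def allowed_failures(self) -> float:
        # The budget is the fraction of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures() - self.failed_requests

    def can_release(self) -> bool:
        # Feature releases are allowed only while some budget is left.
        return self.remaining() > 0


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    blown = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
    print(healthy.can_release())
    print(blown.can_release())
```

Under these assumptions, a 100% target leaves a budget of zero, so can_release() never returns True. That is the same point the post makes: aiming for 100% amounts to never shipping changes.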
Yup. We use this model where I work. SRE stands for 'Site Reliability Engineer'.
The talk is at https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre.