Strategy: Rule of 3 Admins to Save Your Sanity
The idea came up in this Hacker News thread, commenting on a 37signals interview, that three is the minimum sensible number of system administrators. Everyone wants to lower costs by having each admin manage a lot of machines. The problem is that with fewer than three admins, you can never get a break from the constant, corrosive pressure of always being on call. When you spend every moment of your life dreading the next emergency, it eats at you. Having three admins solves that problem. With three admins you can:
- Go on a real vacation. The two remaining admins can switch off being on call.
- Not be on call all the time.
A larger shop will naturally have more admins so it's not as big an issue, but at smaller shops trying to minimize head count, carrying three admins (or people in those roles) might be something to consider.
Reader Comments (6)
Well, even two admins are better than one in terms of coverage. In fact, it's probably best to hire two off the bat, so that they can squash overly "creative" solutions before they get incorporated into the core infrastructure.
For 24/7 coverage in a NOC, typically you need 5 people to cover the 21 weekly shifts (if everyone works 8-hour shifts). 3 admins might do it, but it means you're either on for an entire weekend or 2 weekends in 3, and 1-2 nights per week as well.
In this market, it may be hard to justify 5, but 3 is a recipe for burnout.
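The arithmetic behind those numbers can be sketched as follows. This is only a rough illustration: it assumes one person per shift, three 8-hour shifts per day, and a normal load of five shifts per person per week, and it ignores vacation, sick leave, and handover overlap. The function names are made up for the example.

```python
import math


def weekly_shifts(shifts_per_day=3, days_per_week=7):
    """Total shifts to cover in a week (3 x 8-hour shifts per day by default)."""
    return shifts_per_day * days_per_week


def min_people(total_shifts, shifts_per_person=5):
    """Smallest head count if nobody works more than shifts_per_person a week."""
    return math.ceil(total_shifts / shifts_per_person)


def shifts_each(total_shifts, people):
    """Average shifts per person per week for a given head count."""
    return total_shifts / people


total = weekly_shifts()
print(total)                  # 21 shifts per week
print(min_people(total))      # 5 people at five shifts each
print(shifts_each(total, 3))  # 7.0 shifts each with only three people
```

With three people, each one averages seven shifts a week, which is where the extra nights and weekends come from.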
The benchmark given to me by Didier Charvet, former CTO of Dutch cellco Dutchtone, was that you need 7 admins to provide 24/7 coverage. Of course, telcos have higher overheads and are more gold-plated than web companies.
Wow, it's finally been said.
This is why I'm no longer an admin even though I love the work.
Jerry and Fazal: You're both right that it definitely takes more than 3 people to run a 24/7 NOC type environment. The environment we have at 37signals is orders of magnitude smaller than that type of environment however, and the volume of alerts that we have to respond to (low) means that we can get by with 3 people just fine.
We currently do 1-week on-call rotations, but on-call for us simply means that we need to be able to respond to an alert in low single-digit minutes, not that we need to stay home and watch a monitoring system. We all have 3G cards and laptops and are able to respond pretty quickly while still being able to live a mostly normal life even when on-call.
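A weekly rotation like this is easy to express in code. The sketch below is hypothetical: the admin names, the start date, and the `on_call` helper are invented for illustration, and it assumes the rotation simply cycles through the team in a fixed order, one week at a time.

```python
from datetime import date

# Hypothetical team and rotation start (a Monday); not taken from the thread.
ADMINS = ["admin_a", "admin_b", "admin_c"]
ROTATION_START = date(2024, 1, 1)


def on_call(day, admins=ADMINS, start=ROTATION_START):
    """Return who is on call on the given day under a 1-week rotation."""
    weeks_elapsed = (day - start).days // 7
    return admins[weeks_elapsed % len(admins)]


print(on_call(date(2024, 1, 3)))   # admin_a (first week)
print(on_call(date(2024, 1, 8)))   # admin_b (second week)
print(on_call(date(2024, 1, 22)))  # admin_a again (fourth week)
```

With three people in the list, each admin carries the pager one week in three.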
Mark, I find it odd that you are associating normal life with being able to deliver low single digit response times to alerts an average of 1 day out of 3.
I was impressed by the interview, particularly the automation, the maximal utilization of 37signals' infrastructure, and the environment that seems to encourage strong creative thought.
But I'm still puzzled by the 37signals disconnect of being able to publish books like Rework, which rightly force a rethink of work practices, yet forget the equally important and creative operational end of the business. To me it seems that the combination of Rework and a 1:3 ratio devalues your work versus that of the development team that's encouraged to work no more than an 8hr day on average.
Perhaps you can help by quantifying what a "low" number of alerts is? While on call, do you get alerted only for critical or near-critical conditions (what's the SNR)? If you're talking 1-2 interrupts per week that are resolved in a few minutes, that might be OK, but a sustained 20+/wk seems wrong.
Have you discussed what happens if the on-call person can't respond for some reason? Do you have automated escalation or fallback notification for this case? If yes, then it looks to me like 2 people are on call. If no, how do you manage critical events?
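To make the escalation question concrete, here is a minimal sketch of what an automated escalation chain might look like. Everything in it is an assumption for illustration (the `Responder` type, the `escalate` function, the 5-minute timeout); it does not describe how 37signals or any particular paging tool actually works.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Responder:
    name: str
    # Minutes this person takes to acknowledge a page; None means no response.
    ack_after_minutes: Optional[float]


def escalate(chain: List[Responder],
             timeout_minutes: float) -> Tuple[Optional[str], float]:
    """Page each responder in order, moving on after timeout_minutes of silence.

    Returns (name of whoever acknowledged, total minutes elapsed), or
    (None, total minutes elapsed) if nobody in the chain responded.
    """
    elapsed = 0.0
    for responder in chain:
        ack = responder.ack_after_minutes
        if ack is not None and ack <= timeout_minutes:
            return responder.name, elapsed + ack
        elapsed += timeout_minutes  # waited the full timeout, page the next one
    return None, elapsed


chain = [Responder("primary", None), Responder("secondary", 2)]
print(escalate(chain, timeout_minutes=5))  # ('secondary', 7.0)
```

Note that the secondary only helps if they can also respond within a few minutes, which is the point of the question: a working escalation chain effectively puts two people on call.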