Tuesday
May 25, 2010

Strategy: Rule of 3 Admins to Save Your Sanity

The idea came up in this Hacker News thread, commenting on a 37signals interview, that three is the minimum optimal number of system administrators. Everyone wants to lower costs by having each admin administer a lot of machines. The problem is that with fewer than three admins you never get a break from the constant, corrosive pressure of always being on call. When you spend every moment of your life dreading the next emergency, it eats at you. Having three admins solves that problem. With three admins you can:

  • Go on a real vacation. The two remaining admins can switch off being on call.
  • Not be on call all the time.

A larger shop will naturally have more admins so it's not as big an issue, but at smaller shops trying to minimize head count, carrying three admins (or people in those roles) might be something to consider.
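To make the vacation point concrete, here is a minimal sketch of a weekly on-call rotation that skips anyone who is away. The function name `on_call_schedule`, the `away` parameter, and the admin names are all hypothetical, and the one-week rotation length is an assumption chosen for illustration rather than a recommendation.

```python
from itertools import cycle


def on_call_schedule(admins, weeks, away=None):
    """Return (week, admin) pairs for a simple weekly on-call rotation.

    `away` maps an admin's name to the set of week numbers in which
    that admin is unavailable (for example, on vacation).
    All names and parameters here are illustrative assumptions.
    """
    away = away or {}
    rotation = cycle(admins)
    schedule = []
    for week in range(1, weeks + 1):
        # Try each admin at most once per week, skipping anyone who is away.
        for _ in range(len(admins)):
            candidate = next(rotation)
            if week not in away.get(candidate, set()):
                schedule.append((week, candidate))
                break
        else:
            raise ValueError(f"no admin available in week {week}")
    return schedule


# Three admins, one of them on vacation during weeks 4 and 5.
three = on_call_schedule(["ann", "bob", "cat"], 6, away={"ann": {4, 5}})

# Two admins, one of them on vacation during weeks 1 and 2.
two = on_call_schedule(["ann", "bob"], 2, away={"ann": {1, 2}})
```

With three admins, the two who remain alternate while the third is away, so nobody is on call for two consecutive weeks. With only two admins, the one who stays behind carries every week of the other's vacation.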

 

 

 

Reader Comments (6)

Well, even two admins is better than one, in terms of allowing coverage. In fact, it's probably best to hire two off the bat, so that they can keep overly "creative" solutions from being incorporated into the core infrastructure.

May 25, 2010 | Unregistered Commenter Daniel Howard

For 24/7 coverage in a NOC, typically you need 5 people to cover the 21 weekly shifts (if everyone works 8-hour shifts). 3 admins might do it, but it means you're either on for an entire weekend or 2 weekends in 3, and 1-2 nights per week as well.

In this market, it may be hard to justify 5, but 3 is a recipe for burnout.

May 26, 2010 | Unregistered Commenter Jerry Altzman
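The staffing arithmetic in the comment above can be checked with a short sketch. The 21 weekly shifts and 8-hour shift length come from the comment; the assumption that each person works 5 shifts (40 hours) per week is mine, as are the function names.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours of coverage needed for 24/7


def shifts_per_week(shift_hours=8):
    """Number of shifts needed to cover a full week."""
    return HOURS_PER_WEEK // shift_hours


def people_needed(shifts_per_person=5, shift_hours=8):
    """Headcount if each person works `shifts_per_person` shifts a week.

    The default of 5 shifts per person is an assumption (a 40-hour week),
    not a figure from the comment.
    """
    return math.ceil(shifts_per_week(shift_hours) / shifts_per_person)


def hours_per_person(team_size, shift_hours=8):
    """Weekly staffed hours each person carries when shifts are split evenly."""
    return shifts_per_week(shift_hours) * shift_hours / team_size


weekly_shifts = shifts_per_week()      # 21
headcount = people_needed()            # 5
load_with_three = hours_per_person(3)  # 56.0 hours per person per week
```

Under these assumptions, three people splitting a fully staffed schedule would each carry 56 hours a week. This only models staffed NOC shifts; it says nothing about lighter on-call arrangements where the person is merely reachable.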

The benchmark given to me by Didier Charvet, former CTO of Dutch cellco Dutchtone, was that you need 7 admins to provide 24/7 coverage. Of course, telcos have higher overheads and are more gold-plated than web companies.

May 26, 2010 | Unregistered Commenter Fazal Majid

Wow, it's finally been said.

This is why I'm no longer an admin, even though I love the work.

May 28, 2010 | Unregistered Commenter CH

Jerry and Fazal: You're both right that it definitely takes more than 3 people to run a 24/7 NOC-type environment. However, the environment we have at 37signals is orders of magnitude smaller than that, and the low volume of alerts that we have to respond to means that we can get by with 3 people just fine.

We currently do one-week on-call rotations, but on-call for us simply means that we need to be able to respond to an alert in low single-digit minutes, not that we need to stay home and watch a monitoring system. We all have 3G cards and laptops and are able to respond pretty quickly while still being able to live a mostly normal life even when on-call.

June 5, 2010 | Unregistered Commenter Mark Imbriaco

Mark, I find it odd that you are associating normal life with being able to deliver low single-digit response times to alerts an average of 1 day out of 3.

I was impressed by the interview, particularly the automation, the maximal utilization of 37signals' infrastructure, and the environment that seems to encourage strong creative thought.

But I'm still puzzled by the 37signals disconnect of being able to publish books like Rework, which rightly force a rethink of work practices, yet forget the equally important and creative operational end of the business. To me it seems that the combination of Rework and a 1:3 ratio devalues your work versus that of the development team that's encouraged to work no more than an 8hr day on average.

Perhaps you can help by quantifying what a "low" number of alerts is? While on call, do you get alerted only for critical or near-critical conditions (what's the SNR)? If you're talking 1-2 interrupts per week that are resolved in a few minutes, that might be OK, but a sustained 20+/wk seems wrong.

Have you discussed what happens if the on-call person can't respond for some reason? Do you have automated escalation or fallback notification for this case? If yes, then it looks to me like 2 people are on call. If no, how do you manage critical events?

June 7, 2010 | Unregistered Commenter Martin Foster
