Entries in operations (13)

Friday
Sep 11, 2009

The interactive cloud

How many times have you been called in the middle of the night by your operations guys telling you that your application is throwing odd red alerts? How many times did you find out that when those issues happen you don't have enough information to analyze the incident? Have you tried to increase the log level, only to find that your problem became even worse: now your application throws out tons of information on a continuous basis, most of which is complete garbage...

The current separation between the way we implement our applications and the way we manage them leads to many of these ridiculous situations. The cloud makes those things even worse.

In this post I suggest an alternative approach. Why don't we run our application the way we run our business? I refer to this approach as the "interactive cloud", where our application behaves just like our project team and operations behaves just like our managers. As with our business, our application would need to take more responsibility for the way it runs and take corrective actions, such as balancing its own resources and re-assigning tasks to the available resources in case of failure. It would need to involve its manager only when it runs out of resources, and it would need to provide reports in a way that makes sense to our managers.

The first part of this post describes the general concept behind this model, and the second part provides technical background, including code snippets based on our experience at GigaSpaces.
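
To make "taking more responsibility" a little more concrete, here is a minimal, purely illustrative sketch in plain Python (not the GigaSpaces API) of a service that rebalances its own work when a worker fails and only calls its "manager" when it genuinely runs out of capacity:

# Hypothetical sketch of the "interactive cloud" idea: the application
# rebalances its own work and escalates to operations only when it is
# truly out of resources. Names and thresholds are illustrative.

class Worker:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # tasks this worker can hold
        self.tasks = []
        self.alive = True

class SelfManagingService:
    def __init__(self, workers, notify_operations):
        self.workers = workers
        self.notify_operations = notify_operations  # the "manager" callback

    def _available(self):
        return [w for w in self.workers if w.alive and len(w.tasks) < w.capacity]

    def submit(self, task):
        candidates = self._available()
        if not candidates:
            # Only now involve the "manager": we are out of resources.
            self.notify_operations("out of capacity, please add workers")
            return False
        # Corrective action: place the task on the least loaded worker.
        target = min(candidates, key=lambda w: len(w.tasks))
        target.tasks.append(task)
        return True

    def handle_failure(self, dead_worker):
        # Corrective action: re-assign the failed worker's tasks ourselves,
        # without waking anyone up.
        dead_worker.alive = False
        orphans, dead_worker.tasks = dead_worker.tasks, []
        for task in orphans:
            self.submit(task)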

Monday
Jun 29, 2009

How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book


Update 2: Velocity 09: John Allspaw, 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr. Insightful talk. Some highlights: Change is good if you can build tools and culture to lower the risk of change. Operations and developers need to become of one mind and respect each other. An automated infrastructure is the one tool you need most. Common source control. One step build. One step deploy. Don't be a pussy, deploy. Always ship trunk. Feature flags - don't branch code, make features runtime configurable in code. Dark launch - release data paths early without UI component. Shared metrics. Adaptive feedback to prioritize important features. IRC for communication for human context. Best solutions occur when dev and op work together and trust each other. Trust is earned by helping each other solve their problems. Look at what new features imply for operations, what can go wrong, and how to recover. Provide knobs and levers to help operations. Devs should have access to production machines. Fire drills to train. No finger pointing - fix stuff first. Design like you'll get woken up first when there's a problem. Say you're sorry. Not easy - like any relationship.
Update: Operational Efficiency Hacks Web 2.0 Expo 2009 by John Allspaw. 131 picture perfect slides on operations porn. If you're interested in that kind of thing.

Dream with me a little bit. Your startup becomes wildly successful. Hard work and random chance have smiled on you. To keep flirting with lady luck your system must scale. But how much stuff (space, hardware, software, etc.) will you need to handle the growth, when will you need it, and when will you need more?

That's what Flickr's John Allspaw helps you figure out in his groundbreaking new book on capacity planning: The Art of Capacity Planning: Scaling Web Resources.

When I read statements about The Art of Capacity Planning like "capacity planning is a term that to me means paying attention," "all the information you need to make an educated forecast is in your historical metrics," and "startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach," I get the same sea change feeling I felt when the industry ran from waterfall design and embraced agile design. Current capacity planning is heavy. All up-front. Too analytical and too divorced from real life.

Other capacity planning books assault you with models, math, and simulations. Who has the time? John has developed a common sense, low math approach to capacity planning that works using the system you already have. John's goal is to have you say: Oh, right, duh. That's common sense, not voodoo.

Here's my email interview with John Allspaw on The Art of Capacity Planning. Enjoy.

Please tell us who you are and what you've brought to show and tell today?

I'm John Allspaw. I manage the Operations team at Flickr.com, and I've written a book (The Art of Capacity Planning: Scaling Web Resources) about capacity planning for growing websites.

After spending a good chunk of your life writing this book, can you summarize it in just a few sentences so people will know why it should matter to them?


This book is basically a guide to adaptive capacity planning for growing websites. It's an approach that relies much less on benchmarking and simulation than on close observation of production loads to guide future decisions. It's not rocket science, and I'm hoping people can use it to justify the what, why, and when of getting more resources to allow them to grow as fast as they need to. It's worked really well for me at Flickr and other organizations.

Give me your ripple of evil. What happens without capacity planning?

Capacity planning is a term that to me means paying attention. Web applications can fail in all sorts of dramatic ways, and you're not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis. Things like: my database can do X queries per second before it keels over. Or my cache can only keep Y minutes worth of changing objects. You're not going to predict every failure mode of the whole system, but knowing the failure modes of individual pieces should be considered mandatory. Armed with that, you can make decent forecasts about the future.

I'm a guy or gal at a startup who's freaking out because my boss asked me how much hardware we need for the next quarter/year. What do I do now?

Buy my book? ☺ All the information you need to make an educated forecast is in your historical metrics. You do have system and application-level statistics, right? Tying your system level stats (CPU, memory, network, storage, etc.) to application-level metrics (users, posts, photos, videos, page views, widgets sold, etc.) is key, because then you have history to back up the guesstimates. Your business, product or marketing team also has their own guesses for some of those application-level metrics, so you should get the two forecasts together and see where/how they match up or differ. Capacity should enable the business, not hinder it.
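
As a minimal illustration of tying system-level stats to application-level metrics, here is a hedged sketch (the numbers are invented) that fits a simple linear trend between users and database CPU from your own history, then extrapolates it against the business forecast:

# Minimal forecasting sketch in the spirit of the book: tie an
# application-level metric (active users) to a system-level metric
# (database CPU%) using nothing but your own history. Numbers are made up.

def linear_fit(xs, ys):
    """Ordinary least squares for y = slope*x + intercept, no libraries."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

users = [100_000, 150_000, 200_000, 250_000]    # app-level history
db_cpu = [22.0, 31.0, 41.0, 50.0]               # system-level history (%)

slope, intercept = linear_fit(users, db_cpu)

# Marketing forecasts 400k users next quarter; where does that put the DB?
predicted_cpu = slope * 400_000 + intercept
print(f"predicted db cpu at 400k users: {predicted_cpu:.0f}%")   # roughly 78%

If the predicted number is past the ceiling you've measured for that database, the history has just justified the purchase order for you.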

So your book, is it the best book on Capacity Planning or the greatest book on Capacity Planning?

My book is the best book on capacity planning. If it were the 'greatest' book, it'd be a lot bigger than it is. It's good for scaling your hardware. It's not good for flattening out posters.

My site is gonna explode and I don't know at what point it's going to die. What do I do now?

Argh! Panic! Handle the current explosion, and make it priority one to find out the limits of your capacity when the emergency is over.

Try to panic gracefully. If it's dying right now, and you can't easily add any more capacity, then try some of the tried and trusted WebOps 101 tricks mentioned everywhere, including this blog:
- disable features (preferably the heavier load-causing ones)
- cache previously dynamic content into static bits
- avoid loaded backend calls by serving stale content
Of course it's easier to do those things when you have easy config flags to turn things on or off, and a list to run through of what things are acceptable to serve stale and static. We currently have about 195 'features' we can turn off at Flickr in dire circumstances. And we've used those flags when we needed to.
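
For readers who haven't built one, here is a hypothetical sketch of what such config flags can look like in code; the flag names and the db/cache objects are invented, not Flickr's actual flag system:

# Hypothetical config-flag sketch of the "turn heavy features off" trick.
# Flag names and the db/cache interfaces are invented for illustration.

FLAGS = {
    "related_photos": True,        # heavy DB joins: first to go under load
    "realtime_activity": True,     # chatty backend calls
    "serve_stale_profiles": False, # OK to flip on in an emergency
}

def related_photos(photo_id, db, cache):
    if not FLAGS["related_photos"]:
        return []                   # feature disabled in dire circumstances
    if FLAGS["serve_stale_profiles"]:
        cached = cache.get(("related", photo_id))
        if cached is not None:
            return cached           # stale is better than down
    rows = db.query("SELECT ... WHERE photo_id = %s", photo_id)
    cache.set(("related", photo_id), rows)
    return rows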

Having said all of that, knowing when your resources are going to die should be mandatory, not optional. Know how many qps your databases, webservers, caching systems, and storage can handle before degradation. Test that stuff. With production traffic. If at all possible, with live production traffic, not just recorded and replayed production loads.

How do you compare your approach with well known approaches like Guerrilla Capacity Planning? Isn't focusing on queuing theory, Little's Law, and Poisson arrival rates misguided? What will really help people on the ground?

My approach is a bit different from Mr. Gunther's, although his book was an inspiration for mine. I do think that having a general understanding of queuing theory and the mathematics of open and closed systems can be important, don't get me wrong. But many of the startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach, and done right, I think that approach will serve them well. I think some people recognize this, but in my experience, I'd say the development timelines are even tighter than most people realize.

Almost any time and effort in constructing and running a simulation, model, or benchmark that involves the main moving parts of a web site's back-end is pretty much wasted due to how quickly application logic, use cases, and even hardware configurations can change. For example, by the time I could construct a useful model to capture the webserver-database interactions for Flickr, the results won't resemble production load, since the development cycle we have is so tight. As I write this, in the last week there were 50 code deploys of 550 changes by 19 Flickr staff. I realize that's a lot more than a lot of organizations, but that rate of change requires that the capacity planning process is easily adjustable.

I also found the existing books on the topic to be pretty dry with the math. The foundations of queuing theory are interesting, but a proof of Little's Law isn't going to quickly justify to my finance guy that we need 11 more webservers.

I have a cloud, do I really need to capacity plan anymore? That's so old school. Can't I decrease my administrator-to-server ratio and reduce the cost of managing systems by removing capacity planning resources?

Nope, just trust The Cloud™ that everything will be all right. No need to pay any attention at all. Oh wait, that's a magic unicorn talking. Cloud services are a resource just like any in-house capacity: they have limits that should be paid attention to. The best use cases of cloud computing both recognize the benefits (shrinking 'procurement' times, for example) and the limitations. And those limitations are going to vary from application to application, organization to organization.

You can build a real brand around a name like "Guerrilla Capacity Planning." Don't you think you could have come up with a better name for your approach?

Yeah, I guess I should have a name for it. How about "Paying Attention Capacity Planning"?

You recommend testing the limits of your hardware and software on a production site. Are you crazy?

It's very possible that I'm crazy. But using production traffic to define your resource ceilings in a controlled setting allows you to see firsthand what would happen when you run out of capacity in a particular resource. Of course I'm not suggesting that you run your site into the ground, but it's better to know what your real (not simulated) loads are while you're watching, than to find out the hard way. In addition, a lot of unexpected systemic things can happen when load increases in a particular cluster or resource, and playing "find the butterfly effect" is a worthwhile exercise.

Where does capacity planning sit in the stack? It seems to impact the entire application stack, but capacity planning is more of an operations thing and that may not have much impact in engineering.

Capacity planning is a process, and should sit in the stack along with the other things that happen on a regular basis, like bug scrubs, like weekly meetings, etc. I'm spoiled as an Ops manager because we've got developers at Flickr who very much think like operations people. We're all addicted to graphs, and we're all pretty aware of how close each piece of the infrastructure is to becoming too 'hot', and we act accordingly. Planning, procurement, and measurement of capacity might lie on operations' shoulders, but intelligent use of it is everyone's business. I'm also blessed to have product management and customer care teams who are also aware of the state of capacity. As projects pop up that warrant capacity to be part of the considerations, it is. You simply can't launch things like FlickrVideo and the Yahoo Photos migration without capacity being part of the requirements, so we've got a good feedback loop going with respect to operations and capacity.

Studies say 4/5 of people like a trip to the dentist more than they like capacity planning. Why does capacity planning have such a bad rep?

Because no one wants to guess and be wrong? Because most of the cap planning literature out there is filled with boring math?

Does the move to Web 2.0 make traditional approaches to capacity planning more difficult?

I would say yes, but of course I'm biased. In many ways, use of the site is simply out of our hands. We gather metrics on as many aspects of the site as we can get our hands on, and add more all the time. Having an open API makes things interesting, because the use cases can vary wildly. Out of nowhere, a simple application can reveal an edge case that we hadn't foreseen, so we might have to adjust forecasts quickly. As to it being more difficult: I don't know. When I'm wrong, people can't see photos. If the guy at wellsfargo.com is wrong, then people can't get their money. What's more annoying? :)

I really like your dashboard. Most dashboards say what happened, yours has an estimated days left before capacity runs out, which makes it actionable. How accurate have those numbers been?

Well that particular dashboard is still being built here, but the general design is to capture those metrics on a regular basis, and to use it as a guide. The numbers for some clusters have been pretty trustworthy. But they do fluctuate in how accurate they are, since "days left" are obviously affected by things outside of the forecasting process. Sometimes, a new feature is planned that requires pounding the database shards, or bumps CPU on the webservers by a noticeable amount. It's because of these things that the dashboard is only a piece of the puzzle. Awareness and communications with product management and development have to inform planning decisions, as well as the historic metrics. Again: no crystal balls here.
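
A back-of-the-envelope version of that "days left" number might look like the sketch below; the ceiling and the samples are invented, and a real dashboard would use a better trend estimate than first-to-last sample:

# Sketch of a "days left until we hit the ceiling" dashboard number:
# extrapolate the recent growth rate of a resource against its known
# ceiling. Ceiling and samples are illustrative.

def days_left(daily_samples, ceiling):
    """daily_samples: one measurement per day, oldest first."""
    if len(daily_samples) < 2:
        return None
    growth_per_day = (daily_samples[-1] - daily_samples[0]) / (len(daily_samples) - 1)
    if growth_per_day <= 0:
        return float("inf")          # flat or shrinking: no deadline
    return (ceiling - daily_samples[-1]) / growth_per_day

# e.g. disk consumed on a storage cluster, in TB, over the last week
samples = [210.0, 211.4, 212.9, 214.1, 215.8, 217.0, 218.6]
print(f"days left: {days_left(samples, ceiling=260.0):.0f}")   # about 29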

One of the equations you use is a "UR LIMITZ = Ceiling * Factor of Safety". Theo Schlossnagle has noted the evolution of phenomenal spikes in traffic, even for large sites. How do you have enough capacity as a safety factor if traffic is so spikey? What are the implications for forecasting?

Start with the assumption that no one can accurately predict the future. Capacity planning isn't the only part of successful web operations, and not everything is a capacity issue. Theo nails what happens in the real-world, for sure. The answer is: you estimate the best you can with the history that you have. Obviously, one mistake is to make forecasts ignoring the spikes you've experienced in the past. Something important to consider is how expected (or not) your spikes are. If you sell things, you might have a seasonal holiday spike or plateau in traffic. If you're a content site that frequently gains traction with news-related sites (like Theo's example) and from time to time, you get massive unexpected spikes, then your factor of safety should include considerations for those spikes. But of course the flipside of that might mean that you have a boatload of servers doing nothing except wasting money while waiting for massive spikes that may or may not come. So it's a balance. Be reasonable with planning, follow the four guidelines that Theo points out in his post, and you've got yourself a strategy to deal with those unexpected spikes.
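
One common reading of that equation, with the safety factor expressed as the fraction of the measured ceiling you're willing to actually use, works out like this (illustrative numbers only):

# Worked example of the "ceiling * factor of safety" rule of thumb.
measured_ceiling_qps = 4000        # where one web server starts degrading
factor_of_safety = 0.7             # keep 30% headroom for spikes
usable_qps_per_server = measured_ceiling_qps * factor_of_safety   # 2800

peak_traffic_qps = 45_000
servers_needed = -(-peak_traffic_qps // int(usable_qps_per_server))  # ceiling division -> 17
print(servers_needed)

How much headroom you buy for spikes is exactly the balance John describes: too little and the news-driven spike takes you down, too much and the extra servers sit there wasting money.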

At what point does it make sense for me to buy my own equipment?

It's a difficult question. I've seen groups go from self-run colocation to managed hosting and cloud services, and others go the opposite way, for all legit reasons. Just as there are limitations with managed hosting, those limitations might not matter until you're massive. I do believe that there is a point at which owning your own equipment makes a lot of sense, because you've blown past the average needs of a managed hosting or cloud customer. Your TCO might be a lot different than mine, so to state that there's a single point for everyone would just be dumb.

Quick fire round. What's your quick reaction to:

1. Let's just go with a working prototype for now. We can change it when we grow big.

Fine, for some definitions of "working", "change", and "big". I can't complain too much about that approach in some sense because that's how a lot of Flickr's backend evolved. But on the other hand, all you need is to fail a couple of times with prototypes to figure out what sort of homework needs to be done before launching something that is potentially explosive. So again, there's a balance. Don't be lazy, but don't be rigid and too fearful.

2. Our VC told us that we're worrying about scalability too early. They don't want us to blow our scarce resources on preparing for success.

Your VC is smart. If they're really smart, they'll also suggest worrying about scalability before it's too late. This answer isn't a cop-out, it's just reality. Realize that being scalable means being able to easily add capacity wherever you need it, whenever you need it. Buying too much equipment too soon is just as insane as hiring people that sit around and do nothing. The trick is knowing what "too much" and "too soon" is, and it's going to differ from company to company, from product to product.

3. Premature optimization is the root of all evil.

I've not made it a secret that I think this quote has given many an engineer (both dev and ops) reasoning to make dumb decisions. I actually agree with the idea behind the quote, but I also think that it's another one of those balancing acts. Knuth (or Hoare) also said "We should forget about small efficiencies, say about 97% of the time…" Another good question might be: which 3% are you going to pay attention to?

4. We plan to throw hardware at the problem.

Ok. When does that hardware need to come, how much of it, and what will you do when you realize more hardware won't fix a particular problem? The now classic limitations of the single-write-master, many-read-slaves database architecture is a great example of this. Experienced devs and ops people consider the possibility that hardware can't solve all problems.

Note: these questions were adapted from How Important Is Being Scalable?.

Now that you have some free time again, what are you going to do with your life?

Eat. Sleep. Pay attention to my family. ☺

Related Articles

  • Capacity Management for Web Operations by John Allspaw

  • Thursday
    Feb 12, 2009

    MySpace Architecture

    Update: Presentation: Behind the Scenes at MySpace.com. Dan Farino, Chief Systems Architect at MySpace, shares details of some of MySpace's cool internal operations tools. MySpace.com is one of the fastest growing sites on the Internet, with 65 million subscribers and 260,000 new users registering each day. Often criticized for poor performance, MySpace has had to tackle scalability issues few other sites have faced. How did they do it? Site: http://myspace.com

    Information Sources

  • Presentation: Behind the Scenes at MySpace.com
  • Inside MySpace.com

    Platform

  • ASP.NET 2.0
  • Windows
  • IIS
  • SQL Server

    What's Inside?

  • 300 million users.
  • Pushes 100 gigabits/second to the internet. 10Gb/sec is HTML content.
  • 4,500+ web servers running Windows 2003/IIS 6.0/ASP.NET.
  • 1,200+ cache servers running 64-bit Windows 2003. 16GB of objects cached in RAM.
  • 500+ database servers running 64-bit Windows and SQL Server 2005.
  • MySpace processes 1.5 billion page views per day and handles 2.3 million concurrent users during the day.
  • Membership Milestones: - 500,000 Users: A Simple Architecture Stumbles - 1 Million Users:Vertical Partitioning Solves Scalability Woes - 3 Million Users: Scale-Out Wins Over Scale-Up - 9 Million Users: Site Migrates to ASP.NET, Adds Virtual Storage - 26 Million Users: MySpace Embraces 64-Bit Technology
  • 500,000 accounts was too much load for two web servers and a single database.
  • At 1-2 Million Accounts - They used a database architecture built around the concept of vertical partitioning, with separate databases for parts of the website that served different functions such as the log-in screen, user profiles and blogs. - The vertical partitioning scheme helped divide up the workload for database reads and writes alike, and when users demanded a new feature, MySpace would put a new database online to support it. - MySpace switched from using storage devices directly attached to its database servers to a storage area network (SAN), in which a pool of disk storage devices are tied together by a high-speed, specialized network, and the databases connect to the SAN. The change to a SAN boosted performance, uptime and reliability.
  • At 3 Million Accounts - the vertical partitioning solution didn't last because they replicated some horizontal information like user accounts across all vertical slices. With so many replications one would fail and slow down the system. - individual applications like blogs on sub-sections of the Web site would grow too large for a single database server - Reorganized all the core data to be logically organized into one database - split its user base into chunks of 1 million accounts and put all the data keyed to those accounts in a separate instance of SQL Server
  • 9 Million–17 Million Accounts - Moved to ASP.NET which used less resources than their previous architecture. 150 servers running the new code were able to do the same work that had previously required 246. - Saw storage bottlenecks again. Implementing a SAN had solved some early performance problems, but now the Web site's demands were starting to periodically overwhelm the SAN's I/O capacity—the speed with which it could read and write data to and from disk storage. - Hit limits with the 1 million-accounts-per-database division approach as these limits were exceeded. - Moved to a virtualized storage architecture where the entire SAN is treated as one big pool of storage capacity, without requiring that specific disks be dedicated to serving specific applications. MySpace now standardized on equipment from a relatively new SAN vendor, 3PARdata
  • Added a caching tier—a layer of servers placed between the Web servers and the database servers whose sole job was to capture copies of frequently accessed data objects in memory and serve them to the Web application without the need for a database lookup.
  • 26 Million Accounts - Moved to 64-bit SQL server to work around their memory bottleneck issues. Their standard database server configuration uses 64 GB of RAM.
  • Horizontally Federated Database. Databases are partitioned by purpose: there are profile databases, email databases, etc. Partitioning is based on user range; 1 million users live in each database. So you have Profile1, Profile2, all the way up to Profile300, since they have 300 million users.
  • Doesn't use ASP cache because they don't have a high enough hit rate on the front-end. The middle tier cache does have a high hit rate.
  • Failure isolation. Segment requests into web servers by database. Allow only 7 threads per database. So if a database is slow, only those threads will slow down and the traffic in the other threads will keep flowing (see the sketch after this list for how the sharding and the per-database thread cap fit together).
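
    The following sketch, with invented names and connection details rather than MySpace's actual code, shows how the user-range sharding (Profile1 through Profile300) and the 7-threads-per-database cap described above might fit together:

    # Sketch of two ideas from the list above: range partitioning users into
    # one-million-user database shards (Profile1 ... Profile300), and capping
    # concurrent requests per shard so one slow database can't drag down the
    # rest. Names and connection details are illustrative.
    import threading

    USERS_PER_SHARD = 1_000_000
    MAX_THREADS_PER_SHARD = 7

    def profile_shard(user_id):
        """Map a user id to its profile database, e.g. user 2_500_000 -> 'Profile3'."""
        return f"Profile{(user_id - 1) // USERS_PER_SHARD + 1}"

    # One semaphore per shard enforces the per-database concurrency cap.
    _shard_gates = {}
    _gates_lock = threading.Lock()

    def query_profile(user_id, run_query):
        shard = profile_shard(user_id)
        with _gates_lock:
            gate = _shard_gates.setdefault(
                shard, threading.BoundedSemaphore(MAX_THREADS_PER_SHARD))
        # If this shard is slow, at most 7 threads wait here; traffic to the
        # other shards keeps flowing.
        with gate:
            return run_query(shard, user_id)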

    Operations

  • PerfCollector. Centralized collection of performance data via UDP. More reliable than Windows and allows any client to connect and see stats (a minimal UDP sketch follows this list).
  • Web Based Stack Dump Tool. Can right-click on a problem server and get a stack dump of the .NET managed threads. Used to have to RDC into the system, attach a debugger, and get an answer a 1/2 hour later. Slow, nonscalable, and tedious. Not just a stack dump, it gives a lot of context about what the thread is doing. Troubleshooting is easier because you can see that 90 threads are blocked on a database, so the database may be down.
  • Web Based Heap Dump Tool. Dumps all memory allocations. Very useful for developers. Saves hours of doing it by hand.
  • Profiler. Traces a request from start to finish and produces a report. See the URL, methods, status, everything that will help you identify a slow request. Looks at lock contention, whether a lot of exceptions are being thrown, anything that might be interesting. Very lightweight. It runs on one box in every VIP (group of 100 servers) in production. Samples 1 thread every 10 seconds. Always tracing in the background.
  • PowerShell. Microsoft's new shell that runs in process and passes objects between commands instead of parsing text output. MySpace has developed a lot of cmdlets to support operations.
  • Developed their own asynchronous communication technology to get around Windows networking problems and treat servers as a group. Can ship a .cs file, compile it, run it, and ship the response back.
  • Codespew. Pushes code updates over their communication technology. Used to do 5 code pushes a day, now down to 1 a week.
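
    In the spirit of PerfCollector (the wire format, collector host, and port below are invented, not MySpace's), a fire-and-forget UDP stats sender can be as small as this:

    # Sketch of centralized UDP performance collection: every server fires
    # samples at a collector and never blocks or retries. Format and
    # endpoint are invented for illustration.
    import json
    import socket
    import time

    COLLECTOR = ("stats.example.internal", 8125)   # hypothetical collector

    def send_counter(name, value, hostname):
        payload = json.dumps({"host": hostname, "metric": name,
                              "value": value, "ts": time.time()}).encode()
        # UDP: no connection, no retries. Losing a sample is acceptable;
        # blocking the web server is not.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload, COLLECTOR)
        sock.close()

    # Example: send_counter("requests_per_sec", 1240, "web042")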

    Lessons Learned

  • You can build big websites using Microsoft tech.
  • A cache should have been used from the beginning.
  • The cache is a better place to store transitory data that doesn't need to be recorded in a database, such as temporary files created to track a particular user's session on the Web site.
  • Built in OS features to detect denial of service attacks can cause inexplicable failures.
  • Distribute your data to geographically diverse data centers to handle power failures.
  • Consider using virtualized storage/clustered file systems from the start. It allows you to massively parallelize IO access while being able to add disk as needed without any reorganization needed.
  • Develop tools that work in a production environment. You can't simulate everything in a test environment. The scale and variety of uses the APIs are put to can't be simulated in QA during testing. Legitimate users and hackers will run into corner cases that weren't hit in testing, though QA will find most of the problems.
  • Throw hardware at problems. Easier than changing their backend software to a new way of doing things. The example is they add a new database server for every million users. It might be more efficient to change their approach to more efficiently use the database hardware, but it's easier just to add servers. For now.


  • Friday
    Dec 5, 2008

    Sprinkle - Provisioning Tool to Build Remote Servers

    At 37 Signals Joshua Sierles describes how 37 Signals uses Sprinkle to configure their servers within EC2. Sprinkle defines a domain specific meta-language for describing and processing the installation of software. You can find an interesting discussion of Sprinkle's creation story by the creator himself, Marcus Crafter, in Sprinkle Some Powder!. Marcus divides provisioning tools into two categories:

  • Task Based - the tool issues a list of commands to run on the remote system, either remotely via a network connection or smart client.
  • Policy/state Based - the tool determines what needs to be run on the remote system by examining its current and final state.

    Sprinkle combines both models together in a chocolate-in-my-peanut-butter approach, using normal Ruby code as the DSL (domain specific language) to declaratively describe remote system configurations. 37 Signals likes the use of Ruby as the DSL because it makes learning a separate syntax unnecessary. I've successfully done similar things in Perl. You already have a scripting language, why layer another one on top? One reason not to is that you've now tied configuration and execution together so that only one tool can control the process, but the leverage is so high with this approach it's hard to ignore. There are all the usual bits about defining packages, dependencies, installation logic, pre and post actions, etc. The format is compact and clear because that's how Ruby is, and the operations are task specific so there's no fluff. Capistrano is used to communicate with remote systems, though that is pluggable. 37 Signals uses the EC2 security group as a way to specify the role an instance should take on when it boots. A configuration script that can handle all roles is shipped with a nearly complete functional base image. Sprinkle then configures the system the rest of the way based on the passed-in role. Joshua says they like this approach better than Puppet because it doesn't rely on a centralized configuration server or "pushing large sets of commands over SSH manually." There's always more than one way to do "it", and Sprinkle carves out an interesting niche in the provisioning space. The 37 Signals approach doesn't scale to a large organization with many different flavors of servers, but for a specific set of tightly cooperating servers it's a very simple, clean, and robust way of doing business. Related Articles: Product: Puppet the Automated Administration System.


  • Monday
    Nov 24, 2008

    Product: Scribe - Facebook's Scalable Logging System


    In Log Everything All the Time I advocate applications shouldn't bother logging at all. Why waste all that time and code? No, wait, that's not right. I preach logging everything all the time. Doh. Facebook obviously feels similarly, which is why they open sourced Scribe, their internal logging system, capable of logging tens of billions of messages per day. These messages include access logs, performance statistics, actions that went to News Feed, and many others.

    Imagine hundreds of thousands of machines across many geographically dispersed datacenters just aching to send their precious log payload to the central repository of all knowledge. Because really, when you combine all the metadata with all the events you pretty much have a complete picture of your operations. Once in the central repository, logs can be scanned, indexed, summarized, aggregated, refactored, diced, data cubed, and mined for every scrap of potentially useful information.

    Just imagine the log stream from all of Facebook's Apache servers alone. Brutal. My guess is these are not real-time feeds so there are no streaming query issues, but the task is still daunting. Let's say they log 10 billion messages a day. That works out to well over 100,000 messages per second!

    When no off the shelf products worked for them they built their own. Scribe can be downloaded from Sourceforge. But the real action is on their wiki. It's here you'll find some decent documentation and their support forums. Not much activity on the site so you haven't missed your chance to be a charter member of the Scribe guild.

    A logging system has three broad components:

  • Client Code Interface - How does your code interact with the log system? Scribe doesn't do much for you here. There's a simple Thrift interface for logging from a large set of languages, but the bulk of the work is still up to you.
  • Distribution System - This is where Scribe fits. It reliably (mostly) moves large numbers of messages around. A few error cases lead to data loss: 1) if a client can't connect to either the local or central scribe server, the message will be lost; 2) if a scribe server crashes, it could lose a small amount of data that's in memory but not on disk; 3) some multiple component failure cases, such as a resender that can't connect to any central server while its local disk fills up; 4) some rare timeout conditions can lead to duplicate messages.
  • Do Something Usefullizer - How do you do anything useful with over 100,000 messages per second? Good question. Scribe doesn't help here. But Scribe will get your data there.

    I browsed around the source and it's a well crafted, straightforward socket server that forwards messages to other servers and can write messages to disk. Nothing fancy, which is probably why it works for them. Its basic function is:

    Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn't available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination, typically on an nfs filer or a distributed file system, or send them to another layer of scribe servers.
    In some ways it could be fancier. For example, there's no throttle on incoming connections, so a server can chew up memory. And there is a max_msg_per_second throttle on message processing, but this is really too simple. Throttling needs to be adaptive based on local conditions and the conditions of downstream servers. Under load you want to push flow control back to the client so the data stays there until resources become available. Simple configuration file settings rarely work when the world starts getting weird.

    Client Code Interface

    Here's what the Thrift interface looks like:

    enum ResultCode
    {
      OK,
      TRY_LATER
    }

    struct LogEntry
    {
      1: string category,
      2: string message
    }

    service scribe extends fb303.FacebookService
    {
      ResultCode Log(1: list<LogEntry> messages);
    }
    I know, I thought the same thing. Thank God there's another IDL syntax. We simply did not have enough of them. Thrift translates this IDL into the glue code necessary for making cross-language calls (marshalling arguments and responses over the wire). The Thrift library also has templates for servers and clients.

    Here's what a call looks like in PHP:

    // $conn is assumed to be an open Thrift client connection to a scribe server
    $messages = array();
    $entry = new LogEntry;
    $entry->category = "buckettest";
    $entry->message = "something very interesting happened";
    $messages[] = $entry;
    $result = $conn->Log($messages);


    Pretty simple. Usually in C++, for example, there's an elaborate set of macros for logging that provide sophisticated control of log generation. It might look something like:

    MSG(msg) - a simple message. It only prints out msg. None of the other information is printed out.
    NOTE(const char* name, const char* reason, const char* what, Module* module, msg) - something to take note of.
    WARN(const char* name, const char* reason, const char* what, Module* module, msg) - a warning.
    ERR(const char* name, const char* reason, const char* what, Module* module, msg) - an error occurred.
    CRIT(const char* name, const char* reason, const char* what, Module* module, msg) - a critical error occurred.
    EMERG(const char* name, const char* reason, const char* what, Module* module, msg) - an emergency occurred.


    There's lots more to handle streams and behind-the-scenes things like time stamps, thread ids, function names, and line numbers. Scribe has wisely not done any of that. It has an RPC-like interface to send a list of messages and that's it. It's up to you to write the wrappers.

    You'll no doubt have noticed Scribe only logs a category and message, both strings:

    Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server, which allows data stores to be moved by changing the scribe configuration instead of client code. The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path. Flexibility and extensibility is provided through the "store" abstraction. Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server. Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.

    Distribution System

    The payload has whatever structure you give it. Scribe is policy neutral and doesn't push a logging model on you.

    The configuration file looks something like this:

    # BUCKETIZER TEST
    <store>
      category=buckettest
      type=buffer
      target_write_size=20480
      max_write_interval=1
      buffer_send_rate=2
      retry_interval=30
      retry_interval_range=10
      <primary>
        type=bucket
        num_buckets=6
        bucket_subdir=bucket
        bucket_type=key_hash
        delimiter=1
        <bucket>
          type=file
          fs_type=std
          file_path=/tmp/scribetest
          base_filename=buckettest
          max_size=1000000
          rotate_period=hourly
          rotate_hour=0
          rotate_minute=30
          write_meta=yes
        </bucket>
      </primary>
      <secondary>
        type=file
        fs_type=std
        file_path=/tmp
        base_filename=buckettest
        max_size=30000
      </secondary>
    </store>
    The types of stores currently available are:
  • file - writes to a file, either local or nfs.
  • network - sends messages to another scribe server.
  • buffer - contains a primary and a secondary store. Messages are sent to the primary store if possible, and otherwise the secondary. When the primary store becomes available the messages are read from the secondary store and sent to the primary.
  • bucket - contains a large number of other stores, and decides which messages to send to which stores based on a hash.
  • null - discards all messages.
  • thriftfile - similar to a file store but writes messages into a Thrift TFileTransport file.
  • multi - a store that forwards messages to multiple stores.

    Certainly a flexible and useful set of logging capabilities. You can build a hierarchy of log servers to do pretty much anything you want. You could imagine having a log server on each server with a file store to handle upstream server failures. This log server forwards messages on to a centralized server for a datacenter. And all the datacenter servers forward their logs on to the centralized data warehouse. To scale, adjust fan-in and fan-out as necessary.

    Do Something Usefullizer

    You may not have over 100,000 log messages a second to process, but you are likely to have your own tanker truck full of log messages. How do you do something useful with them?
  • Log messages stored in log files are next to useless. Grep'ing on a terabyte of logs to answer simple questions about your data just doesn't work.
  • You may have a sharded data warehouse you can pump log messages into and do a reasonably effective job of querying.
  • Or you can set up a Hadoop/HDFS-style system. The idea here is you need a distributed file system to handle the continual stream of log messages. And once you have all the data stored safely away, you'll need to use map-reduce to do anything with such a large amount of data.

    If you want to ask, for example, how many of your users are from Asia, log files won't work. It's likely your data warehouse can't handle it. Hadoop/HDFS is a practical option.
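
    To make the map-reduce shape of that question concrete, here is a toy, single-machine sketch; real volumes would run this on Hadoop, and the log line format is invented:

    # Toy illustration of the map/reduce shape of "how many of our users are
    # from Asia": a mapper emits (region, 1) per log line, a reducer sums per
    # region. The log format is hypothetical.
    from collections import defaultdict

    def mapper(log_line):
        # hypothetical format: "timestamp user_id region action"
        parts = log_line.split()
        if len(parts) >= 3:
            yield (parts[2], 1)

    def reducer(pairs):
        counts = defaultdict(int)
        for region, one in pairs:
            counts[region] += one
        return dict(counts)

    logs = [
        "1228089600 1001 asia view_photo",
        "1228089601 1002 europe view_photo",
        "1228089602 1003 asia upload",
    ]
    print(reducer(pair for line in logs for pair in mapper(line)))
    # {'asia': 2, 'europe': 1}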

    If that's the direction you are going, what does it imply about your log system? I would say it makes even the simple category-payload system of Scribe overkill. The idea with a scalable backend is to move log payloads from applications to the centralized store as quickly as possible. By definition the central store can handle the load, so there's no reason to use intermediate servers to scale. Applications write directly to the central store, even from multiple datacenters. The payload structure is unimportant until it hits the central store. If the application can't hit the central store, then it queues into the file system until it can. Ideally log messages never hit the file system until HDFS is writing them to their final destination. This makes for low latency and high throughput logging, and it's even simpler than Scribe.
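
    A minimal sketch of that design, assuming a hypothetical CentralStore client that exposes an append() call, might look like this:

    # Sketch of the design described above: the application writes log
    # payloads straight to the central store and only spools to local disk
    # while the store is unreachable. The store interface is hypothetical.
    import json
    import os
    import time

    SPOOL_DIR = "/var/spool/applog"      # local queue used only during outages

    class DirectLogger:
        def __init__(self, central_store):
            self.store = central_store    # assumed to expose .append(bytes)
            os.makedirs(SPOOL_DIR, exist_ok=True)

        def log(self, payload):
            record = json.dumps({"ts": time.time(), "payload": payload}).encode()
            try:
                self.store.append(record)
            except OSError:
                # Central store unreachable: queue on the local file system.
                spool_file = os.path.join(SPOOL_DIR, f"{int(time.time())}.spool")
                with open(spool_file, "ab") as f:
                    f.write(record + b"\n")

        def drain_spool(self):
            # Called periodically: replay anything queued during an outage.
            for name in sorted(os.listdir(SPOOL_DIR)):
                path = os.path.join(SPOOL_DIR, name)
                with open(path, "rb") as f:
                    for line in f:
                        self.store.append(line.rstrip(b"\n"))
                os.remove(path)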

    If you don't have a scalable central store, then Scribe is a good option. It gives you all the flexibility you need to compose your logging system in a way that is mostly reliable and scalable.
  • Wednesday
    Oct 29, 2008

    CTL - Distributed Control Dispatching Framework 

    CTL is a flexible distributed control dispatching framework that enables you to break management processes into reusable control modules and execute them in distributed fashion over the network. From their website: What does CTL do? CTL helps you leverage your current scripts and tools to easily automate any kind of distributed systems management or application provisioning task. It's good for simplifying large-scale scripting efforts or as another tool in your toolbox that helps you speed through your daily mix of ad-hoc administration tasks. What are CTL's features? CTL has many features, but the general highlights are:

  • Execute sophisticated procedures in distributed environments - Aren't you tired of writing and then endlessly modifying scripts that loop over nodes and invoke remote actions? CTL dispatches actions to remote controllers with network transparency (over SSH), parallelism, and error handling already built in.
  • Comes with pre-built utilities - CTL comes with pre-built utilities so you don't have to script actions like file distribution or process and port checking.
  • Define your own automation using the tools/languages you already know - New controller modules are defined in XML, and your scripting can be done in multiple scripting languages (Perl, Python, etc.), *nix shell, Windows batch, and/or Ant.
  • Cross platform administration - CTL is Java-based and works on *nix and Windows.

    Related Articles

  • Blog.Control.Tier
  • Introduction to CTL - a very nice overview of features.
  • Interesting Puppet Thread Mentioning CTL


  • Saturday
    Oct 25, 2008

    Product: Puppet the Automated Administration System

    Update: Digg on their choice and use of Puppet. They chose Puppet over cfengine and bcfg2 because they liked Puppet's resource abstraction layer (RAL), the ability to implement configuration management incrementally, support for bundles, and the overall design philosophy. Puppet implements a declarative (what, not how) configuration language for automating common administration tasks. It's the system every large site writes for themselves, and it's already made for you! iLike was able to "easily" scale from 0 to hundreds of servers using Puppet. I can't believe I've never seen this before. It looks really cool. What is Puppet and how can it help you scale your website operations? From the Puppet website: Puppet has been developed to help the sysadmin community move to building and sharing mature tools that avoid the duplication of everyone solving the same problem. It does so in two ways:

  • It provides a powerful framework to simplify the majority of the technical tasks that sysadmins need to perform.
  • The sysadmin work is written as code in Puppet's custom language, which is shareable just like any other code.

    This means that your work as a sysadmin can get done much faster, because you can have Puppet handle most or all of the details, and you can download code from other sysadmins to help you get done even faster. The majority of Puppet implementations use at least one or two modules developed by someone else, and there are already tens of recipes available in Puppet's CookBook. This sounds good. But does it work in the field? HJK Solutions' Adam Jacob says it does: Puppet enables us to get a huge jump-start on building automated, scalable, easy to manage infrastructures for our clients. Using Puppet, we:

    1. Automate as much of the routine systems administration tasks as possible.
    2. Get 10 minute unattended build times from bare metal, most of which is data transfer. Puppet takes it the rest of the way, getting the machines ready to have applications deployed on them. It's down to two and a half minutes for Xen.
    3. Bootstrap our clients' production environments while building their development environment. I can't stress how cool this really is. Because we are expressing the infrastructure at a higher level, when it comes time to deploy your production systems, it's really a non-event. We just roll out the Puppet Master and an Operating System auto-install environment, and it's finished.
    4. Cross-pollinate between clients with similar architectures. We work with several different shops using Ruby on Rails, all of whom have very similar infrastructure needs. By using Puppet in all of them, when we solve a problem for one client, we've effectively solved it for the others. I love being able to tell a client that we solved a problem for them, and all it's going to cost is the time it takes for us to add the recipe.

    Puppet, today, is a tool that is good enough to handle the vast majority of issues encountered in building scalable infrastructures. Even the places where it falls short are almost always just a matter of it being less elegant than it could be, and the entire community is working on making those parts better.

    Related Articles

  • Operations is a competitive advantage... (Secret Sauce for Startups!) by Jesse Robbins
  • Infrastructure 2.0 by John Willis
  • Puppet, iLike and Infrastructure 2.0 by John Willis
  • Why are people paying 3 to 5 million for configuration management software? by Adam Jacob


  • Tuesday
    Sep 16, 2008

    Product: Func - Fedora Unified Network Controller

    Func is used to manage a large network using bash or Python scripts. It targets easy and simple remote scripting and one-off tasks over SSH by creating a secure (SSL certificates) XMLRPC API for communication. Any kind of application can be written on top of it. Other configuration management tools specialize in mass configuration: they say here's what the machine should look like and keep it that way. Func allows you to program your cluster. If you've ever tried to securely remote script a gang of machines using SSH keys, you know what a total nightmare that can be. Some example commands:

    Using the command line:
    func "*.example.org" call yumcmd update
    
    Using the Python API:
    import func.overlord.client as fc
    client = fc.Client("*.example.org;*.example.com")
    client.yumcmd.update()
    client.service.start("acme-server")
    print client.hardware.info()
    
    Func may certainly overlap in functionality with other tools like Puppet and cfengine, but as programmers we always need more than one way to do it, and I can definitely see how I could have used Func on a few projects.

    Related Articles
  • High Scalability Operations Tag
  • Open source project: Func, the Fedora Unified Network Controller by Michael DeHaan.
  • Func, the Fedora Unified Network Controller by Luca Foppiano
  • Multi-system administration with Func by Jake Edge.


  • Wednesday
    Sep 3, 2008

    Some Facebook Secrets to Better Operations

    Kim Nash in an interview with Jonathan Heiliger, Facebook VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently as it is usually an ontogeny recapitulates phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?

  • Frequent Releases. A major release once a week and a minor releases every few days.
  • Create a Cyber Liability Group. At one time operations was distributed amongst several groups. A permanent operations group was created to isolate problems and revert problem software components back to previously known good states. The ability of a separate team to handle rollbacks speaks to a great deal of standardization and advanced tool building.
  • Distribute Team Across Time Zones. Split the operations team across different time zones so no one has to work the graveyard shift. Facebook has 20 people in their team located in Palo Alto, California and London, England.
  • Be Innovative, Not Safe. Fear of failure often shuts down the organizational brain and makes it hide behind excessive rules and regulations. A technology company should have a bias towards action and innovation. Release software. Don't stifle genius. Rely on your tools and processes to recover from problems.
  • Expect Problems. Software pushed to production will have problems. Expect problems, but don't let that stop you from innovating.
  • Roll Backward. When a problem is detected in a release the changes can either be rolled forward or backward. Rolling back is going to a previously good release. Rolling forward is fixing problems in the new release rather than rolling back. Bugs in production are fixed in production. Roll forward ends up being covered in the press, so prefer roll backs over roll forwards.
  • Roll Out Massive Changes Slowly. Turn on features gradually, for a few percent of users at a time. Use the slow rollout to fix problems that can only be found under real user conditions. This approach gives operations and development a lot of confidence in changes (see the rollout sketch after this list).
  • Encourage Openness and Information Sharing. Design reviews, PR strategy, which servers to buy, etc. are often open for informal debate among employees. Facebook has created an Ideas system where employees can create an Idea by category. There's a discussion tool for discussing the idea and a rating system for rating the idea. The tools are built on top of the Facebook platform so they are available to everyone.
  • Live-blog Key Events. Large company meetings, monthly presentations and weekly Q&As with the management team are transcribed live.
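
    The usual way "a few percent of users at a time" is implemented is a deterministic hash of the user against the feature, so the audience grows without churning as the percentage rises. The sketch below is illustrative, not Facebook's actual gatekeeping code:

    # Sketch of a gradual rollout: hash the user id with the feature name so
    # the same users stay in the rollout as the percentage grows. Feature
    # names and percentages are invented.
    import hashlib

    ROLLOUT_PERCENT = {"new_news_feed": 2}      # start with 2% of users

    def feature_enabled(feature, user_id):
        percent = ROLLOUT_PERCENT.get(feature, 0)
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100           # stable bucket 0..99 per user
        return bucket < percent

    # Raising ROLLOUT_PERCENT["new_news_feed"] to 10, then 50, then 100
    # widens the audience without ever kicking earlier users back out.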

    It sounds like a relatively fun environment for pushing software live. Getting software moved into production is often harder than the original coding and testing. Now I know what you are thinking. You somehow managed to procure the ssh login. So just login remotely and do the install yourself! Nobody will know. Oh so tempting. But it's not really good corporate citizenship. And you just might screw up, then there will be some esplaining to do.

    Emphasizing frequent releases and gutsy release policies makes it actually seem like someone is supporting developers instead of treating them like their software carries the plague. Data centers are often treated like quarantine stations and developers are treated like asymptomatic carriers of some unknown virulent disease. To be safe nothing should ever change, but that's not an attitude that makes things better. Nice to see that recognized.

    To set up or not to set up a separate operations group? Facebook says "to be" and creates a separate group. Amazon says "not to be" and has developers support their own software. Secretly I think Amazon gets better results by requiring developers to support their own software. Knowing it may be you getting the "It's Down!" call gives one proper perspective. But I like not being on call, and I think most developers agree. Plus the idea of "following the sun" to get 24 hour support is a smart one.

    Related Articles

  • HighScalability Operations
  • On Designing and Deploying Internet-Scale Services
  • Amazon Architecture
  • Tuesday
    Mar 25, 2008

    Paper: On Designing and Deploying Internet-Scale Services

    Greg Linden links to a heavily lesson-laden LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how-to-configure-a-web-server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in-the-trenches experience. Many non-obvious topics are covered. And there's a lot to learn from it.

    The paper has too many details to cover here, but the big sections are:

  • Recommendations
  • Automatic Management and Provisioning
  • Dependency Management
  • Release Cycle and Testing
  • Operations and Capacity Planning
  • Graceful Degradation and Admission Control
  • Customer Self-Provisioning and Self-Help
  • Customer and Press Communication Plan

    In the recommendations we see some of our old favorites:
  • Expect failure and design for failure.
  • Implement redundancy and fault recovery.
  • Depend upon a commodity hardware slice.
  • Keep things simple and robust.
  • Automate everything.

    Personally, I'm still trying to figure out how to make something simple.

    Next are some good thoughts on how to design operations friendly software:
  • Quick service health check. This is the services version of a build verification test (a minimal sketch follows this list).
  • Develop in the full environment.
  • Zero trust of underlying components.
  • Do not build the same functionality in multiple components.
  • One pod or cluster should not affect another pod or cluster.
  • Allow (rare) emergency human intervention.
  • Enforce admission control at all levels.
  • Partition services.
  • Understand the network design.
  • Analyze throughput and latency.
  • Treat operations utilities as part of the service.
  • Understand access patterns.
  • Version everything.
  • Keep the unit/functional tests from the last release.
  • Avoid single points of failure.
  • Support single-version software. Have all your customers run the same version.
  • Implement multi-tenancy. Apparently a lot of software requires cloning hardware installations to support multiple customers. Don't do that. Have your software work for multiple customers all on the same hardware.
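
    As a concrete example of the first recommendation, a quick service health check can be as small as the sketch below; the dependency checks are placeholders, and the path and port are invented:

    # Minimal sketch of a "quick service health check": a cheap endpoint that
    # exercises the service's critical dependencies and reports pass/fail,
    # suitable for load balancers and deploy scripts.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def check_database():
        return True       # placeholder: e.g. run "SELECT 1" with a short timeout

    def check_cache():
        return True       # placeholder: e.g. set/get a sentinel key

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            healthy = check_database() and check_cache()
            self.send_response(200 if healthy else 503)
            self.end_headers()
            self.wfile.write(b"OK\n" if healthy else b"UNHEALTHY\n")

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()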

    And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.

    You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can failover to another in a multi-datacenter situation. Is that wrong or right? Only the shadow knows.

    The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right I think.

    My favorite part of the whole paper is:
    We have long believed that 80% of operations issues originate in design and development, so this section
    on overall service design is the largest and most important. When systems fail, there is a natural tendency
    to look first to operations since that is where the problem actually took place. Most operations issues,
    however, either have their genesis in design and development or are best solved there.

    Understand this and I think much of the rest follows naturally.