« On-Demand Infinitely Scalable Database Seed the Amazon EC2 Cloud | Main | Skype Failed the Boot Scalability Test: Is P2P fundamentally flawed? »
Thursday
Aug302007

Log Everything All the Time

This JoelOnSoftware thread asks the age old question of what and how to log. The usual trace/error/warning/info advice is totally useless in a large scale distributed system. Instead, you need to log everything all the time so you can solve problems that have already happened across a potentially huge range of servers. Yes, it can be done.

To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging deathknell: it's an intermittent problem and customers are pissed. Fix it. Now.

So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server, caching server, file server, and a few networks switches and routers along the way. Where did the fault happen? What went wrong?

All you have at this point are your logs. You can't turn on more logging because the Heisenberg already happened. You can't stop the system because your system must always be up. You can't deploy a new build with more logging because that build has not been tested and you have no idea when the problem will happen again anyway. Attaching a debugger to a process, while heroic sounding, doesn't help at all.

What you need to be able to do is trace though all relevant logs, pull together a time line of all relevant operations, and see what happened. And this is where trace/info etc is useless. You don't need function/method traces. You need a log of all the interesting things that happened in the system. Knowing "func1" was called is of no help. You need to know all the parameters that were passed to the function. You need to know the return value from the function. Along with anything else interesting it did.

So there are really no logging levels. You need to log everything that will help you diagnose any future problem. What you really need is a time machine, but you don't have one. But if you log enough state you can mimic a time machine. This is what will allow you to follow a request from start to finish and see if what you expect to be happening is actually happening. Did an interface drop a packet? Did a reply timeout? Is a mutex on perma-lock? So many things can go wrong.

Over time systems usually evolve to the point of logging everything. They start with little or no logging. Then problem by problem they add more and more logging. But the problem is the logging isn't systematic or well thought out, which leads to poor coverage and poor performance.

Logs are where you find anomalies. An anomaly is something unexpected, like operations happening that you didn't expect, in a different order than expected, or taking longer than expected. Anomalies have always driven science forward. Finding and fixing them will help make your system better too. They expose flaws you might not otherwise see. They broaden you understanding of how your system really responds to the world.

So step back and take a look at what you need to debug problems in the field. Don't be afraid to add what you need to see how your system actually works.

For example, every request needs to have assigned to it a globally unique sequence number that is passed with every operation related to the request so all work for a request can be tied together. This will allow you to trace the comment add from the client all the way through the system. Usually when looking at log data you have no idea what work maps to which request. Once you know that debugging becomes a lot easier.

Every hop a request takes should log meta information about how long the request took to process, how big the request was, what the status of the request was. This will help you pinpoint latency issues and any outliers that happen with big messages. If you do this correctly you can simulate the running of system completely from this log data.

I am not being completely honest when I say there are no debugging levels. There are two levels: system and developer. System is logging everything you need to log to debug the system. It is never turned off. There is no need. System logging is always on.

Developers can add more detailed log levels for their code that can be turned on and off on a module by module basis. For example, if you have a routing algorithm you may only want to see the detailed logging for that on occasion. The trick is there are no generic info type debug levels. You create a named module in your software with a debug level for tracing the routing algorithm. You can turn that on when you want and only that feature is impacted.

I usually have a configuration file with initial debug levels. But then I make each process have a command port hosting a simple embedded web server and telnet processor so you can change debug levels and other setting on the fly through the web or telnet interface. This is pretty handy in the field and during development.

I can hear many of you saying this is too inefficient. We could never log all that data! That's crazy! No true. I've worked on very sensitive high performance real-time embedded systems where every nanosecond was dear and they still had very high levels of logging, even in driver land. It's in how you do it.

You would be right if you logged everything within the same thread directly to disk. Then you are toast. It won't ever work. So don't do that.

There are lots of tricks you can use to make logging fast enough that you can do it all the time:

  • Make logging efficient from the start so you aren't afraid to use it.
  • Create a dead simple to use log library that makes logging trivial for developers. Document it. Provide example code. Check for it during code reviews.
  • Log to a separate task and let the task push out log data when it can.
  • Use a preallocated buffer pool for log messages so memory allocation is just pop and push.
  • Log integer values for very time sensitive code.
  • For less time sensitive code sprintf'ing into a preallocated buffer is usually quite fast. When it's not you can use reference counted data structures and do the formatting in the logging thread.
  • Triggering a log message should take exactly one table lookup. Then the performance hit is minimal.
  • Don't do any formatting before it is determined the log is needed. This removes constant overhead for each log message.
  • Allow fancy stream based formatting so developers feel free to dump all the data they wish in any format they wish.
  • In an ISR context do not take locks or you'll introduce unbounded variable latency into the system.
  • Directly format data into fixed size buffers in the log message. This way there is no unavoidable overhead.
  • Make the log message directly queueable to the log task so queuing doesn't take more memory allocations. Memory allocation is a primary source of arbitrary latency and dead lock because of the locking. Avoid memory allocation in the log path.
  • Make the logging thread a lower priority so it won't starve the main application thread.
  • Store log messages in a circular queue to limit resource usage.
  • Write log messages to disk in big sequential blocks for efficiency.
  • Every object in your system should be dumpable to a log message. This makes logging trivial for developers.
  • Tie your logging system into your monitoring system so all the logging data from every process on every host winds its way to your centralized monitoring system. At the same time you can send all your SLA related metrics and other stats. This can all be collected in the back ground so it doesn't impact performance.
  • Add meta data throughout the request handling process that makes it easy to diagnose problems and alert on future potential problems.
  • Map software components to subsystems that are individually controllable, cross application trace levels aren't useful.
  • Add a command ports to processes that make it easy to set program behaviors at run-time and view important statistics and logging information.
  • Log information like task switch counts and times, queue depths and high and low watermarks, free memory, drop counts, mutex wait times, CPU usage, disk and network IO, and anything else that may give a full picture of how your software is behaving in the real world.

    In large scale distributed systems logging data is all you have to debug most problems.
    So log everything all the time and you may still get that call at 2AM, but at least you'll know you'll have a fighting chance to fix any problems that do come up.
  • Reader Comments (24)

    What you're asking me as a developer to do is to write a logging library which is extremely sophisticated but yet extremely simple in its interface. Maybe it is doable, but I sure as hell don't have that time.

    And also, C isn't the only language in the world, to support Python (for example!) you'd need even more code. You also note that each object should be representable, and that (preferably) no logging routine should call another routine? Sure, in Python you can get a representation of every object (repr(o)) but in say C? That'd be some hours of coding right there.

    Ego Consensus:
    Yeah, you do need to log. Your tips aren't very useful in my opinion, though.

    December 31, 1999 | Unregistered CommenterLudvig Ericson

    Do you write/keep the logs on each server generating them, have the logging library write to a single shared log drive, or write-local then have separate sweep process to move files to a shared log drive, or what?

    Are your unique request-IDs system-unique, or just server-unique?

    December 31, 1999 | Unregistered CommenterBill Seitz

    > What you're asking me as a developer to do

    If you are making a large distributed scalable system it's simple ROI. You are certainly free to make a different judgment. I would say a few hours to get the interface ready. A few days to get a base system ready. A week to get a better system ready. And a long time to get a really hot system ready. Start small and build in small increments, as always.

    > And also, C isn't the only language in the world

    The same ideas work in any language, they just have a different expression.

    > Maybe it is doable, but I sure as hell don't have that time.

    When you get that call and 2AM and your customers are screaming and your management is breathing down your neck and your bugs are costing 1000s of dollars a minute, is that when you'll have the time?

    December 31, 1999 | Unregistered CommenterTodd Hoff

    > Do you write/keep the logs on each server generating them

    I would say follow your overall monitoring architecture. Many monitoring packages pull from disk on each server (or shared storage) from a centralized monitoring cluster. So I would go with the flow if that's how metrics and other data are being transported. A local disk copy is useful both in the field and local development because you can just log in to the host and see the logs.

    In my ideal world I would shoot it over the network to the monitoring cluster so it there aren't intermediate disk hops to go wrong. I would use a polling model to poll all the information sources rather than a push model that can overwhelm the monitoring system. There's a hell of a lot of data, so if you don't have a big enough memory pool for all the log data you'll have to use a disk based model anyway. For local development I always have a configuration option to send it to disk.

    If your monitoring cluster is in your corporate network and you are monitoring several large data centers, a disk based system is more practical as it can handle longer periods of outages.

    > Are your unique request-IDs system-unique, or just server-unique?

    I make them system unique by using a 64 bit ID and assigning each server an ID space from the 64 bit space. A new ID space can be reallocated when you run out. In a multi data center architecture it's common for work to be done across data centers so you need to be able to trace work globally.

    December 31, 1999 | Unregistered CommenterTodd Hoff

    Great post, Todd. I know the nay-sayers will doubt the need for all of that info or declare its impracticality, but the alternative is what amounts to flying blind when something inevitably goes wrong. You hit it on the head in your previous comment when you pointed out that it's ludicrous not to have every piece of useful data available during an outage that is costing $1000's a minute. I'd like to see Ops managers try to explain that decision to a CEO/CFO during a major outage post-mortem.

    But while I agree that you should try to "log everything all the time", you should also consider that when you focus on only system or application level logging you are leaving out an important set of data essential to incident resolution. Outages don't just happen. They can always be traced back to some change that was introduced to the environment. Sometimes these changes happen at the system level, but in my experience, they are more often introduced at that application level in the form of a new release (or worse, an in-production modification).

    Info you can't get from normal system and application logs:
    1. When did the application change?
    2. What was changed? What are all of the code, data, and content assets related to that change?
    3. Exactly what procedures were run to produce the change? Who ran the commands?
    4. What nodes did those procedures touch?
    5. What commands can I run to immediately put everything back into a last known good state? (often through a "roll-forward" rather than a true "roll-back" procedure)

    Of course you are going to need to know how all of this data ties back into what you are seeing in you system and application logs.

    This kind of info can only come from using an application change management toolset that not only produces the data but understands its semantics. I would say that this starts with a fully-automated application build and deployment automation system, but then again I am biased (shameless plug: http://open.controltier.com)

    -Damon

    http://dev2ops.org

    December 31, 1999 | Unregistered CommenterDamon Edwards

    Having done exactly this kind of stuff for a few years now (although when our system goes down we tend to measure hundreds of thousands of dollars a min) the desire to know it all isn't very useful I have found. You tend to collect too much data and it sends you off on the wrong path more often than not. All big systems have common fault patterns and you end up designing your logging and monitoring on those. You do encounter new patterns at which forces you add new stuff. But folks tend to log crap that really does not matter. Having to decide if that log entry is useful or not at 2 am too much log data isn't a lot of help. While new systems tend to like more logging to find out what is happening most of the problems we encounter is when data is entering or leaving a system. So Load Balancers aren't doing their job right and you need to find activity in diff zones, timing issues are off, so you need to spot when a write on one set of data completed and when the read expected the written data happened, and firewalls, proxies or ssl certs are just stopping data from getting from point a to point b. If you just log that info I have found you catch most of your solvable problems.

    All subsystems all have a single solution to any local problem (bring down the offending instance and bring it back up) which means that we run end to end tests and any transient problems are easily reset by bouncing a single instance of something, this simple fix captures > 90% of our issues, just restarting an app fixes a myriad of network sins.. We try to collect data coming into and out of any system to test individual systems. But intermediate data isn't required because we have a non production environment modeled on the production ones so we can sample the load (just in case it isn't a border crossing problem.) Even though we have reduced our log trails we tend to have tools to manage log size (we can have > 50 log files to hunt through on sep machines in sep subsystems.)

    Well used software typically works OK within the boundaries, and because I develop API's as well as UIs, I like to watch public facing elements to see how things are being used (so I can remove unused functionality.) BUT generating 15 gigs of log data for an hour of execution is not the quick way to solve a problem. Having talented people know where to put logging and who have fixed systems like this time and again is the best way. Typically when we get the call the Tier 3 engineer has a good idea where to look even before he hits the bridge line, and we have just enough logging in place to verify if his instinct is correct. When the pressure is on the engineer needs to come up with a solution fast, last thing we want to happen is they spot something odd in the log files and miss that it is out sequence or that they have the wrong element or something else and sending everyone on a wild goose chase, this undermines confidence. I am willing to bet that if you look at your true sphincter clenching events that the issues you spot are pretty similar and that you really only need to be watching the problem space not everything.

    December 31, 1999 | Unregistered CommenterFA Jones

    I agree about the need of logs everywhere, but you need at least two levels - production and debug. In our system, if you open all the logs to debug, the server slows down to a crawl and spits out around 10mb/s of logs, so we need 3 levels - production. regular debug and deep debug (or rather, "debug", "info" and "error" which are the most commonly used logging levels here). We also need the ability to turn on/off specific logs - otherwise we'd be swimming in a sea of unneeded data ("I just wanted to see if X works...").
    I'd say "log always, log everywhere, log to the level". Debug means everything you can think of that can be used to find out a problem. Info, on the other hand, means only information you are sure is needed. Error is obvious - only errors (with / without stack traces / with stack traces in a separate file) should appear.

    December 31, 1999 | Unregistered CommenterZaph0d

    Perhaps you can send the log to a separate log server instead of hitting the local disk? If the data is being pushed through a separate thread set a relatively low low priority, it shouldn't impact your performance. And log capacity needs to grow in proportion to the system and the length of time you want to be able to debug a problem that already happened. Small logs mean a small debug horizon. There's nothing really wrong with 10mb/s and a large system should be able to handle it. Just the web server access logs for a large site can be that much.

    I think your deep debug corresponds to my module specific debug settings. You can group these in sets so you can turn on deep debugging across a slice of modules on a slice of servers for the problem you are looking at. But none of this helps after the fact, which is where the toughest problems are. So that's why I don't see logs as unneeded.

    Your ability to debug a problem is proportional to the information you have. If your problems are easy, as mentioned in a previous comment, then no problem. Logging seems like a waste of time. But when your problems get difficult, as when you have real application running in your data center, not just a two or three tier shared nothing web app, you need the data.

    Tricks like querying your logs on sequence numbers and time ranges makes it relatively simple to do a deep dive on a specific problem across hundreds of machines in a data center. If you are going from machine to machine firing up vi on gigabyte sized files then you are indeed screwed.

    December 31, 1999 | Unregistered CommenterTodd Hoff

    Excellent post Todd!

    I've fought this battle too many times with developers.

    If you're building the same system you've built a dozen times and using only packages you've used in each of those instances - then go ahead and skip the logging. You can probably diagnose failures in that system pretty well. The rest of the time log everything. No excuses. No arguments.

    Still, this doesn't go far enough. Any user input that changes the system state should be logged as well. If you doubt the value of this, spend a little time in tech support trying to do forensic engineering.

    December 31, 1999 | Unregistered CommenterBert Armijo

    To those that think this is too complex or takes too much of a performance hit, look at the libraries from Apache like log4j, log4net, log4r, etc. Once I discovered those, it really helped my logging. And I actually ended up doing a lot more logging by using it's debug mode for really detailed stuff since it's pretty easy to check if debug is turned on (or better yet, store it in a local variable in each method) before constructing a message. Check it out and you'll like it. Not enough logging is one of my pet peaves since I write code in the financial industry where you must be able to execute trades before deadlines or material costs start mounting. It's no fun to get a call at 2 am when you have less than an hour to get trades of the door and you get a lot like "application failed" or an object error, or a generic db access error, file error, etc without being able to trace where in the program's process it was when the error occurred.

    December 31, 1999 | Unregistered Commentersclough

    Great Post! Todd

    I'm a software developer and I totally agree with you, I just wrote a blog entry http://jianwu.blogspot.com/2007/09/its-all-about-traceability.html">It is All about the Traceability to further elaborate on this: since in the real production environment, there is always some failure somewhere: router/switch could die or rebooted, connections will timeout if the firewall is jammed, SSL Certificates will expire, and there'll be DB upgrade, OS Patching, ... ..., the failure could
    happen just at the time a new merchant customer just send out his second credit authorization request, or approval of a financial transaction with $10,000 just arrived, ... ... finally, you'll grab all the forensic information you could find to diagnose.

    Jian

    December 31, 1999 | Unregistered CommenterJian Wu

    What you need to do is basically to have a standard coding convention saying that ALL structures must be loggable. If you have a convention with a macro saying "LOG_STRUCT_xxx" where xxx is the name of the structure, which expands to an expression which returns something that is loggable. i.e. a string, you can easily log any structure you have a pointer to which you know the type of.

    It is true what the original poster say, you need a time machine, and what you should log is basically what you would ask the debugger to show you when being able to reproduce it.

    What you are saying is basically that you will not spend a fixed extra amount of time for each code item you produce while not under pressure in order to avoid spending a undetermined amount of time banging on code you are not certain about is the culprit while everybody is asking you when you are done, which you perhaps have not ever seen before. I would rather call it an investment than a burden.

    Some very interesting development is getting ready with virtual machines, where that technology is used for snapshotting. It will most likely not be narrow enough for debugging purposes though. I have seen a time machine debugger which could go back and forth, but I cannot locate the link right now.

    Nother link would be http://www.ghs.com/products/timemachine.html

    December 31, 1999 | Unregistered CommenterThorbjørn Ravn Andersen

    Kudos on a great post.

    I will go even further and say that collecting performance metrics should be done without fear of affecting production. When doing capacity planning, INCLUDE the affect that logging and metric collection will have on performance.

    To help with overhead, we use the Spread Toolkit (http://spread.org) for apache logging (via mod_log_spread) and ganglia (http://ganglia.info) for performance metric collection, which both have network logging facilities to get around logging locally to disk, and really helps with aggregation on an out-of-band analysis server.

    -j

    December 31, 1999 | Unregistered Commenterjohn allspaw

    You mention that "For example, every request needs to have assigned to it a globally unique sequence number that is passed with every operation related to the request so all work for a request can be tied together."
    What are the ways to generate unique id, which will be unique across multiple machines with multiple JVMs per machine?

    Is "userid + timestamp + request type" a reasonable unique id?
    THanks,
    Unmesh
    Is

    December 31, 1999 | Unregistered CommenterAnonymous

    > Is "userid + timestamp + request type" a reasonable unique id?

    Userid wouldn't work for me because I have all processes run under the same user on all servers. You could perhaps add IP address. Timestamps are usually too low a resolution to be unique. It's very easy for things to happen simultaneously from your clock's perspective, so the ID won't be unique. And oddly, profiling shows timestamps can hurt performance noticeably. Each is a system call that stops your process from working. Once scheduled you really want to keep your processes working.

    The request type brings up the whole issue of weather you should use metadata to form IDs. In databases primary keys are usually automatically assigned sequence numbers, they aren't compound keys from the data, and that's how I like my sequence numbers for similar reasons. Each log entry simply needs to log enough meta data (ip, host, data center, etc) so you can put the message in context. There can never be enough metadata in a sequence number, so you have to do that anyway.

    For allocating sequence numbers I use a redundant sequence number service per data center. The data center ID would be part of the ID so you don't have to coordinate across data centers. When a process comes up it asks the service for a block of IDs. The logging system within each process increments the IDs and when it gets close to using up its block it asks for another range. If you use a 64bit ID you won't run out of blocks. If performance is important I would start a task to get a new batch of IDs at a certain threshold before reaching the end of the allocation.

    That's about it. It's really pretty easy.

    December 31, 1999 | Unregistered CommenterTodd Hoff

    If you're running Solaris 10 (or Mac OS 10.5, or FreeBSD (soon)), you can simply use DTrace.

    December 31, 1999 | Unregistered CommenterDavid Magda

    Nice post, only thing I'd add is a) examples and b) don't log ddos prevention stuff. Timeless post. Thanks.

    June 13, 2011 | Unregistered CommenterPete Cooper

    Thanks Pete. I have a lot of example but it's all in C++ so I'm not sure how useful that would be.

    June 14, 2011 | Registered CommenterHighScalability Team

    Todd,

    thanks for the post. I've been trying to find the right combination and permutation for effective logging for a while now. I agree with you, especially, on:
    * Store log messages in a circular queue to limit resource usage.
    * Write log messages to disk in big sequential blocks for efficiency.
    * Every object in your system should be dumpable to a log message. This makes logging trivial for developers.

    One way may be through passing the logging information off to a configurable queueing server, like ZeroMQ, which would place less load on your application and server directly; especially as a lot of logging in applications is file-based and written sequentially, not in batches.

    Great article and thanks for the inspiration.

    Matt

    June 16, 2011 | Unregistered CommenterMatthew Setter

    Great post Todd!

    You mention that all your examples are in C++. I'm curious if you are using streams or low level (C) functions to format log messages?

    As nice and easy to use streams are, I'm discovering lately that they are horrible in multi-threaded environment. It is a question of standard library implementation. I'm sure this issue will be fixed eventually but as it stands right now, if you use streams heavily in an MT environment then you are hosed.

    Best

    November 16, 2011 | Unregistered Commenteranonymous

    For C code there are printf style versions of the macros and for C++ there are stream versions. You can always use the printf style, but for C++ code the stream operators are just so natural for dumping arbitrary objects. They are horrible in many ways. To reduce the memory allocation overhead make the streams use fixed sized preallocated buffers. For the MT issues use thread specific allocators so you don't have to worry about locks.

    November 17, 2011 | Registered CommenterHighScalability Team

    You would be amazed where the badness is coming from!

    Current crop of memory allocators is pretty good. But if you need even more you can use preallocated buffers and per thread allocators. And there is a very natural, easy and standard way to do these in C++.

    But, what you can not work around is the standard mandated reliance on locales. As it turns out each construction of a stream (or rather base_ios) does at least one call to use_facet<>() and possibly multiple calls to locale class constructor. Each call to operator<<() would also make at least one call to use_facet<>() and may be to locale constructor. The issue is that these are all synchronized with one big lock around the locale logic. GCC has managed to remove the lock around locale::locale() by using atomics. MS is doing it in VC11. use_facet issue is still around. Good news is that compiler and library developers are aware and hopefully this would get fixed soon.

    Caveat emptor!

    November 17, 2011 | Unregistered Commenteranonymous

    In a latency or memory constrained environment I wouldn't trust the standard memory or stream classes. You can always create your own stream class, just implement the << operator and not much else.

    November 17, 2011 | Registered CommenterHighScalability Team

    Was Shocked to see people still refer to this post as wisdom...
    Are you guys still sacrificing children to appease the gods as well?

    Logging everything all the time is a TERRIBLE idea! Especially in modern data-rich systems
    - It creates a horrible signal to noise ratio
    - it creates tons of garbage data - that your team will have to spend time supporting pipelines for
    - It costs a lot in licensing, storage and compute resources
    - it will lead you to waste time on cleanups so you can just keep crawling forward

    This is just logging FOMO, and there are better solutions ...
    It's 2019 - use Non breaking Breakpoints ...

    July 24, 2019 | Unregistered CommenterDaniel Debug

    PostPost a New Comment

    Enter your information below to add a new comment.
    Author Email (optional):
    Author URL (optional):
    Post:
     
    Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>