Verbose logging will disturb your sleep

October 28, 2008

When I encounter a system that has already been in development or production for a while, the first sign of real trouble is always a dirty log. You know what I’m talking about. When clicking a single link on a normal flow on a web page results in a deluge of messages in the only log that the system provides. Too much logging is as useless as none at all.

If your systems are like mine, when your job is done, someone else’s job is just starting. After the system has been developed, it will hopefully live a long and prosperous life serving customers. If you’re lucky. How do you know if something goes wrong when the system is in production, and how will you deal with it?

Maybe someone monitors your system for you, or maybe you will monitor it yourself. Either way, what the system logs is going to be critical to know when you want to be woken up in the middle of the night.

For many systems, the first indication that something is wrong is because a log message of the correct level has been written to a log. Mostly, this will be the error log. So do yourself a favor: Make sure from day one that if something is logged the error log, you’re willing to have someone call and wake you in the middle of the night about it. If you can simulate load on your system during system testing, looking at a well-kept error log is also a good first indication that your system is reasonably robust. Or a first warning if it’s not.

For my systems, the best indication that everything is all right, is that the messages at a lower priority are ticking along happily. I want about one INFO-level log message for every significant event for the application.

A cluttered log is an indication that the system will be hard to control once it reaches production. Like with all problems: If you don’t expect anything to show up in the error log, it will be much easier to know what to do when something does show up.

Comments:

Johan - Oct 28, 2008

Doesn’t matter if it’s rails, postgres, tomcat or apache, it’s the idea of actually doing the mundane task of sitting there, watching all those little things scroll by, while possibly missing the bigger picture; the overall system components interacting.

But sure, it all depends on what exactly you’re looking for in those logs…

jhannes - Oct 28, 2008

My whole point is that logs scrolling by with lots of little things you don’t care about is an anti-pattern whether you watch those logs or not.

jsorensen - Oct 28, 2008

And my point is that actually sitting there watching those logs feels like an anti-pattern in the first place (information overload is just the symptom/consequence of that), when you really should be collecting and filtering that data from a bigger perspective in order to find correlations and causes, just like you would with any other system related data (performance graphs over time being a good example).

jhannes - Oct 28, 2008

I’m not watching the logs. I’m having them sent via mail. That is: if they only contain pertinent info.

I think we agree that sed/grep etc on logs is not good. I would add that if I want stats, I don’t think the debug-logs in where I’d get it.

jhannes - Nov 1, 2008

Cool. ‘Nuff said.

[tfnico] - Oct 31, 2008

Well, it is an alternative for us to use SEVERE for waking us up in the night. I feel FATAL is like the last thing that might get logged before a java.lang.Error gets thrown and the JVM dies.

All in all it’s very important that every team decides how they are going to use the log levels, and follow these conventions throughout project and maintenance.

Johan - Oct 28, 2008

The other side of the issue may very well be that your actual approach to viewing the log is flawed. I definitely see your point, but I want to raise it with a “don’t tail -f logfiles”.

The problem with tailing logfiles is that your context is wrong, you’re spend braintime trying to find that particular logging statement (at least if you’re trying to figure out the symptoms of an issue), instead of trying to figure out what actually is wrong when all the cogs run together.
Logging to a system that allows you to filter and match timestamps against other systems’ logs seems like a much nicer way of spending “braintime”. I keep hearing splunk.com is good, and there’s a few other solutions whose names escape me right now.

Granted, I haven’t used such an approach too much, but frankly I’m just tired of shifting and grepping through all those log files. I’d much rather be in a situation where I could log more and filter more afterwards/meanwhile, than sacrificing (useful) log output in order to ease my what-feels-like-stone-age approach to logging (tail + grep/sed).

jhannes - Oct 28, 2008

The

jhannes - Oct 30, 2008

Very cool case. Another nail in the SOA coffin. ;-)

What is the consequence of one such hiccup? If there is some chance that manual intervention is needed, I would log this as ERROR. If there is a sufficient compensation (a retry algorithm, say), I would log this as INFO, and log a permanent failure as ERROR.

If it’s in between (e.g. the user got an error message, but no manual intervention is needed) I would log as WARN and collect errors like you do. Knowing me, I would perhaps even keep track of this in the application database, but that’s probably gold plating.

[tfnico] - Oct 31, 2008

Good idea. I’ll try to get something like that in. Currently our use of WARN is more or less the same as DEBUG :)

New record yesterday, only 17 errors in the log! o/

jhannes - Oct 31, 2008

I’m starting to wonder: Is it just me, or are we all neglecting the “SEVERE/FATAL” error category. Should we use it more?

[tfnico] - Oct 30, 2008

My problem is that I have many different kind of errors. Since I’m lucky enough to live in a SOA world, I’m using 10 different service providers that each have a handful of hickups per day, and therefore my log-file has about 25-50 new lines each day (still small enough to manage with tail, I think).

What I do today is manually count number of errors from each service and store them in an excel sheet. Over time I judge which error is critical enough to do something about. Since

Johannes Brodwall - Nov 1, 2008

Cool. ‘Nuff said.