Logging best practices

Paul Stovell picture Paul Stovell · Feb 23, 2009 · Viewed 168.2k times · Source

I'd like to get stories on how people are handling tracing and logging in real applications. Here are some questions that might help to explain your answer.

Frameworks

What frameworks do you use?

  • log4net
  • System.Diagnostics.Trace
  • System.Diagnostics.TraceSource
  • Logging application block
  • Other?

If you use tracing, do you make use of Trace.Correlation.StartLogicalOperation?

Do you write this code manually, or do you use some form of aspect oriented programming to do it? Care to share a code snippet?

Do you provide any form of granularity over trace sources? E.g., WPF TraceSources allow you to configure them at various levels:

  • System.Windows - settings for all of WPF
  • System.Windows.Animation - override specifically for Animation.

Listeners

What log outputs do you use?

  • Text files
  • XML files
  • Event log
  • Other?

If using files, do you use rolling logs or just a single file? How do you make the logs available for people to consume?

Viewing

What tools to you use for viewing the logs?

  • Notepad
  • Tail
  • Event viewer
  • Systems Center Operations Manager/Microsoft Operations Manger
  • WCF Service Trace Viewer
  • Other?

If you are building an ASP.NET solution, do you also use ASP.NET Health Monitoring? Do you include trace output in the health monitor events? What about Trace.axd?

What about custom performance counters?

Answer

Sly Gryphon picture Sly Gryphon · Jun 2, 2009

Update: For extensions to System.Diagnostics, providing some of the missing listeners you might want, see Essential.Diagnostics on CodePlex (http://essentialdiagnostics.codeplex.com/)


Frameworks

Q: What frameworks do you use?

A: System.Diagnostics.TraceSource, built in to .NET 2.0.

It provides powerful, flexible, high performance logging for applications, however many developers are not aware of its capabilities and do not make full use of them.

There are some areas where additional functionality is useful, or sometimes the functionality exists but is not well documented, however this does not mean that the entire logging framework (which is designed to be extensible) should be thrown away and completely replaced like some popular alternatives (NLog, log4net, Common.Logging, and even EntLib Logging).

Rather than change the way you add logging statements to your application and re-inventing the wheel, just extended the System.Diagnostics framework in the few places you need it.

It seems to me the other frameworks, even EntLib, simply suffer from Not Invented Here Syndrome, and I think they have wasted time re-inventing the basics that already work perfectly well in System.Diagnostics (such as how you write log statements), rather than filling in the few gaps that exist. In short, don't use them -- they aren't needed.

Features you may not have known:

  • Using the TraceEvent overloads that take a format string and args can help performance as parameters are kept as separate references until after Filter.ShouldTrace() has succeeded. This means no expensive calls to ToString() on parameter values until after the system has confirmed message will actually be logged.
  • The Trace.CorrelationManager allows you to correlate log statements about the same logical operation (see below).
  • VisualBasic.Logging.FileLogTraceListener is good for writing to log files and supports file rotation. Although in the VisualBasic namespace, it can be just as easily used in a C# (or other language) project simply by including the DLL.
  • When using EventLogTraceListener if you call TraceEvent with multiple arguments and with empty or null format string, then the args are passed directly to the EventLog.WriteEntry() if you are using localized message resources.
  • The Service Trace Viewer tool (from WCF) is useful for viewing graphs of activity correlated log files (even if you aren't using WCF). This can really help debug complex issues where multiple threads/activites are involved.
  • Avoid overhead by clearing all listeners (or removing Default); otherwise Default will pass everything to the trace system (and incur all those ToString() overheads).

Areas you might want to look at extending (if needed):

  • Database trace listener
  • Colored console trace listener
  • MSMQ / Email / WMI trace listeners (if needed)
  • Implement a FileSystemWatcher to call Trace.Refresh for dynamic configuration changes

Other Recommendations:

Use structed event id's, and keep a reference list (e.g. document them in an enum).

Having unique event id's for each (significant) event in your system is very useful for correlating and finding specific issues. It is easy to track back to the specific code that logs/uses the event ids, and can make it easy to provide guidance for common errors, e.g. error 5178 means your database connection string is wrong, etc.

Event id's should follow some kind of structure (similar to the Theory of Reply Codes used in email and HTTP), which allows you to treat them by category without knowing specific codes.

e.g. The first digit can detail the general class: 1xxx can be used for 'Start' operations, 2xxx for normal behaviour, 3xxx for activity tracing, 4xxx for warnings, 5xxx for errors, 8xxx for 'Stop' operations, 9xxx for fatal errors, etc.

The second digit can detail the area, e.g. 21xx for database information (41xx for database warnings, 51xx for database errors), 22xx for calculation mode (42xx for calculation warnings, etc), 23xx for another module, etc.

Assigned, structured event id's also allow you use them in filters.

Q: If you use tracing, do you make use of Trace.Correlation.StartLogicalOperation?

A: Trace.CorrelationManager is very useful for correlating log statements in any sort of multi-threaded environment (which is pretty much anything these days).

You need at least to set the ActivityId once for each logical operation in order to correlate.

Start/Stop and the LogicalOperationStack can then be used for simple stack-based context. For more complex contexts (e.g. asynchronous operations), using TraceTransfer to the new ActivityId (before changing it), allows correlation.

The Service Trace Viewer tool can be useful for viewing activity graphs (even if you aren't using WCF).

Q: Do you write this code manually, or do you use some form of aspect oriented programming to do it? Care to share a code snippet?

A: You may want to create a scope class, e.g. LogicalOperationScope, that (a) sets up the context when created and (b) resets the context when disposed.

This allows you to write code such as the following to automatically wrap operations:

  using( LogicalOperationScope operation = new LogicalOperationScope("Operation") )
  {
    // .. do work here
  }

On creation the scope could first set ActivityId if needed, call StartLogicalOperation and then log a TraceEventType.Start message. On Dispose it could log a Stop message, and then call StopLogicalOperation.

Q: Do you provide any form of granularity over trace sources? E.g., WPF TraceSources allow you to configure them at various levels.

A: Yes, multiple Trace Sources are useful / important as systems get larger.

Whilst you probably want to consistently log all Warning & above, or all Information & above messages, for any reasonably sized system the volume of Activity Tracing (Start, Stop, etc) and Verbose logging simply becomes too much.

Rather than having only one switch that turns it all either on or off, it is useful to be able to turn on this information for one section of your system at a time.

This way, you can locate significant problems from the usually logging (all warnings, errors, etc), and then "zoom in" on the sections you want and set them to Activity Tracing or even Debug levels.

The number of trace sources you need depends on your application, e.g. you may want one trace source per assembly or per major section of your application.

If you need even more fine tuned control, add individual boolean switches to turn on/off specific high volume tracing, e.g. raw message dumps. (Or a separate trace source could be used, similar to WCF/WPF).

You might also want to consider separate trace sources for Activity Tracing vs general (other) logging, as it can make it a bit easier to configure filters exactly how you want them.

Note that messages can still be correlated via ActivityId even if different sources are used, so use as many as you need.


Listeners

Q: What log outputs do you use?

This can depend on what type of application you are writing, and what things are being logged. Usually different things go in different places (i.e. multiple outputs).

I generally classify outputs into three groups:

(1) Events - Windows Event Log (and trace files)

e.g. If writing a server/service, then best practice on Windows is to use the Windows Event Log (you don't have a UI to report to).

In this case all Fatal, Error, Warning and (service-level) Information events should go to the Windows Event Log. The Information level should be reserved for these type of high level events, the ones that you want to go in the event log, e.g. "Service Started", "Service Stopped", "Connected to Xyz", and maybe even "Schedule Initiated", "User Logged On", etc.

In some cases you may want to make writing to the event log a built-in part of your application and not via the trace system (i.e. write Event Log entries directly). This means it can't accidentally be turned off. (Note you still also want to note the same event in your trace system so you can correlate).

In contrast, a Windows GUI application would generally report these to the user (although they may also log to the Windows Event Log).

Events may also have related performance counters (e.g. number of errors/sec), and it can be important to co-ordinate any direct writing to the Event Log, performance counters, writing to the trace system and reporting to the user so they occur at the same time.

i.e. If a user sees an error message at a particular time, you should be able to find the same error message in the Windows Event Log, and then the same event with the same timestamp in the trace log (along with other trace details).

(2) Activities - Application Log files or database table (and trace files)

This is the regular activity that a system does, e.g. web page served, stock market trade lodged, order taken, calculation performed, etc.

Activity Tracing (start, stop, etc) is useful here (at the right granuality).

Also, it is very common to use a specific Application Log (sometimes called an Audit Log). Usually this is a database table or an application log file and contains structured data (i.e. a set of fields).

Things can get a bit blurred here depending on your application. A good example might be a web server which writes each request to a web log; similar examples might be a messaging system or calculation system where each operation is logged along with application-specific details.

A not so good example is stock market trades or a sales ordering system. In these systems you are probably already logging the activity as they have important business value, however the principal of correlating them to other actions is still important.

As well as custom application logs, activities also often have related peformance counters, e.g. number of transactions per second.

In generally you should co-ordinate logging of activities across different systems, i.e. write to your application log at the same time as you increase your performance counter and log to your trace system. If you do all at the same time (or straight after each other in the code), then debugging problems is easier (than if they all occur at diffent times/locations in the code).

(3) Debug Trace - Text file, or maybe XML or database.

This is information at Verbose level and lower (e.g. custom boolean switches to turn on/off raw data dumps). This provides the guts or details of what a system is doing at a sub-activity level.

This is the level you want to be able to turn on/off for individual sections of your application (hence the multiple sources). You don't want this stuff cluttering up the Windows Event Log. Sometimes a database is used, but more likely are rolling log files that are purged after a certain time.

A big difference between this information and an Application Log file is that it is unstructured. Whilst an Application Log may have fields for To, From, Amount, etc., Verbose debug traces may be whatever a programmer puts in, e.g. "checking values X={value}, Y=false", or random comments/markers like "Done it, trying again".

One important practice is to make sure things you put in application log files or the Windows Event Log also get logged to the trace system with the same details (e.g. timestamp). This allows you to then correlate the different logs when investigating.

If you are planning to use a particular log viewer because you have complex correlation, e.g. the Service Trace Viewer, then you need to use an appropriate format i.e. XML. Otherwise, a simple text file is usually good enough -- at the lower levels the information is largely unstructured, so you might find dumps of arrays, stack dumps, etc. Provided you can correlated back to more structured logs at higher levels, things should be okay.

Q: If using files, do you use rolling logs or just a single file? How do you make the logs available for people to consume?

A: For files, generally you want rolling log files from a manageability point of view (with System.Diagnostics simply use VisualBasic.Logging.FileLogTraceListener).

Availability again depends on the system. If you are only talking about files then for a server/service, rolling files can just be accessed when necessary. (Windows Event Log or Database Application Logs would have their own access mechanisms).

If you don't have easy access to the file system, then debug tracing to a database may be easier. [i.e. implement a database TraceListener].

One interesting solution I saw for a Windows GUI application was that it logged very detailed tracing information to a "flight recorder" whilst running and then when you shut it down if it had no problems then it simply deleted the file.

If, however it crashed or encountered a problem then the file was not deleted. Either if it catches the error, or the next time it runs it will notice the file, and then it can take action, e.g. compress it (e.g. 7zip) and email it or otherwise make available.

Many systems these days incorporate automated reporting of failures to a central server (after checking with users, e.g. for privacy reasons).


Viewing

Q: What tools to you use for viewing the logs?

A: If you have multiple logs for different reasons then you will use multiple viewers.

Notepad/vi/Notepad++ or any other text editor is the basic for plain text logs.

If you have complex operations, e.g. activities with transfers, then you would, obviously, use a specialized tool like the Service Trace Viewer. (But if you don't need it, then a text editor is easier).

As I generally log high level information to the Windows Event Log, then it provides a quick way to get an overview, in a structured manner (look for the pretty error/warning icons). You only need to start hunting through text files if there is not enough in the log, although at least the log gives you a starting point. (At this point, making sure your logs have co-ordinated entires becomes useful).

Generally the Windows Event Log also makes these significant events available to monitoring tools like MOM or OpenView.

Others --

If you log to a Database it can be easy to filter and sort informatio (e.g. zoom in on a particular activity id. (With text files you can use Grep/PowerShell or similar to filter on the partiular GUID you want)

MS Excel (or another spreadsheet program). This can be useful for analysing structured or semi-structured information if you can import it with the right delimiters so that different values go in different columns.

When running a service in debug/test I usually host it in a console application for simplicity I find a colored console logger useful (e.g. red for errors, yellow for warnings, etc). You need to implement a custom trace listener.

Note that the framework does not include a colored console logger or a database logger so, right now, you would need to write these if you need them (it's not too hard).

It really annoys me that several frameworks (log4net, EntLib, etc) have wasted time re-inventing the wheel and re-implemented basic logging, filtering, and logging to text files, the Windows Event Log, and XML files, each in their own different way (log statements are different in each); each has then implemented their own version of, for example, a database logger, when most of that already existed and all that was needed was a couple more trace listeners for System.Diagnostics. Talk about a big waste of duplicate effort.

Q: If you are building an ASP.NET solution, do you also use ASP.NET Health Monitoring? Do you include trace output in the health monitor events? What about Trace.axd?

These things can be turned on/off as needed. I find Trace.axd quite useful for debugging how a server responds to certain things, but it's not generally useful in a heavily used environment or for long term tracing.

Q: What about custom performance counters?

For a professional application, especially a server/service, I expect to see it fully instrumented with both Performance Monitor counters and logging to the Windows Event Log. These are the standard tools in Windows and should be used.

You need to make sure you include installers for the performance counters and event logs that you use; these should be created at installation time (when installing as administrator). When your application is running normally it should not need have administration privileges (and so won't be able to create missing logs).

This is a good reason to practice developing as a non-administrator (have a separate admin account for when you need to install services, etc). If writing to the Event Log, .NET will automatically create a missing log the first time you write to it; if you develop as a non-admin you will catch this early and avoid a nasty surprise when a customer installs your system and then can't use it because they aren't running as administrator.