Web application monitoring best practices

Sergey Golovchenko picture Sergey Golovchenko · Jan 30, 2009 · Viewed 16.3k times · Source

We are finishing up our web application and planning for deployment. Very important aspect of deployment to production is monitoring the health of the system. Having a small team of developers/support makes it very critical for us to get the early notifications of potential problems and resolve them before they have impact on users.

Using Nagios seams like a good option, but wanted to get more opinions on what are the best monitoring tools/practices for web application in general and specifically for Django app? Also would welcome recommendations on what should be monitored aside from the obvious CPU, memory, disk space, database connectivity.

Our web app is written in Django, we are running on Linux (Ubuntu) under Apache + Fast CGI with PostgreSQL database.

EDIT We have a completely virtualized environment under Linode.

EDIT We are using django-logging so we have a way separate info, errors, critical issues, etc.

Answer

Jas Panesar picture Jas Panesar · Jan 30, 2009

Nagios is good, it's good to maybe have system testing (Selenium) running regularily.

Edit: Hyperic and Groundwork also look interesting.

There is probably a test suite system that can keep pressure testing everything as well for you. I can't remember the name off the top of my head, maybe someone can mention one below.

Other things I like to do:

The best motto for infrastructure is always fix, detect, repair. Get it up, get to the root of it, and cure/prevent it if you can.

Since a system exists at many levels, we should test at many levels:

Edit: Have all errors or warnings posted directly to your case manager via email. That way you can track occurrences in one place.

1) Connection : monitor your internet connectivity from the server and from the outside. Log this somewhere

2) Server : monitor all the processes that you need to to ensure they are running and not pinning the server. Use a HP Server or something equivalent with hardware failure notification that it can do from a bios level. Notify and log if they are.

3) Software : Identify the key software that always needs to be running. Set the performance levels if any and then monitor them. Nagios should be able to help with this. On windows it can be a bit more. When an exception occurs, you should be able to run a script from it to restart processes automatically. My dream system is allowing me to interact with servers via SMS if the server sees it as an exception that I have to either permit, or one that will happen automatically unless I cancel by sms. One day..

4) Remote Power : Ensure Remote power-reset capabilities are in your hand. You might want to schedule weekly reboots if you ever use windows for anything.

5) Business Logic Testing : Have regularly running scripts testing the workflow of your system. Selenium can probably achieve some of this, but I like logging the results as well to say this ran at this time and these files had errors. If possible anywhere, have the system monitor itself through your scripts.

6) Backups : Make a backup that you can set and forget. If you can get things into virtual machines it would be ideal as you can scale, move, or deploy any part of your infrastructure anywhere. I have had instances where I moved a dead server onto my laptop, let it run in vmware while I fixed a problem.