Restarting / Auto-Repairing MongoDB in Production

Tilo · Nov 25, 2011 · Viewed 9.5k times

What I want to achieve is an /etc/init.d script which starts MongoDB more reliably, even if the server went down hard -- it should attempt an auto-repair if the database is left in a locked state.

Yes, I could script this myself, but I think somebody out there must have done this already.

I noticed that after a server goes down hard, MongoDB is left in a state where it doesn't restart via the /etc/init.d/mongod script. Obviously the lock file(s) need to be removed and the server needs to be started with the --repair option and the correct --dbpath first, before it can be restarted normally. In some cases one also needs to change the ownership of the db files back to the user who runs mongod. One additional problem is that the standard /etc/init.d/mongod script does not report a failure in this situation, but rather joyfully and incorrectly returns an "OK" status, reporting that mongod was started, although it wasn't.
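
For reference, the manual recovery sequence described above looks roughly like this. This is a sketch only; the dbpath (/data/mongo/db, from the log further down) and the mongod service user may differ on your system:

# Manual recovery after an unclean shutdown (sketch; adjust paths and user).
sudo rm /data/mongo/db/mongod.lock                        # remove the stale lock file
sudo -u mongod mongod --repair --dbpath /data/mongo/db    # repair as the service user
sudo chown -R mongod:mongod /data/mongo/db                # fix ownership, in case repair ran as root
sudo /etc/init.d/mongod start                             # a normal start should now succeed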

$ sudo /etc/init.d/mongod start
Starting mongod: forked process: 9220
all output going to: /data/mongo/log/mongod.log
                                                           [  OK  ]
$ sudo /etc/init.d/mongod status
mongod dead but subsys locked

The OS is either CentOS or Fedora.

Does anybody have modified /etc/init.d scripts, or a pointer to such scripts, which attempt a repair automatically in that situation? Or is there another tool which functions as a watchdog for mongod?

Any opinions on why it might be a bad idea to try to automatically repair MongoDB?

$ sudo /etc/init.d/mongod status
mongod dead but subsys locked

$ sudo ls -l /var/lib/mongo/mongod.lock 
-rw-r--r--. 1 mongod mongod 5 Nov 19 11:52 /var/lib/mongo/mongod.lock
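
Incidentally, the 5-byte lock file above holds the PID of the mongod process that created it, so a stale lock can be detected with a simple check. A minimal sketch, assuming the lock path shown above:

# If the PID recorded in the lock file is no longer running, the lock is stale.
PID=$(sudo cat /var/lib/mongo/mongod.lock)
if [ -n "$PID" ] && ps -p "$PID" > /dev/null 2>&1; then
    echo "mongod (pid $PID) is still running"
else
    echo "stale lock: pid '$PID' is not running"
fi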


$ sudo tail -50 /data/mongo/log/mongod.log
************** 
old lock file: /data/mongo/db/mongod.lock.  probably means unclean shutdown
recommend removing file and running --repair
see: http://dochub.mongodb.org/core/repair for more information
*************
Sat Nov 19 11:55:44 exception in initAndListen std::exception: old lock file, terminating
Sat Nov 19 11:55:44 dbexit: 

Sat Nov 19 11:55:44 shutdown: going to close listening sockets...
Sat Nov 19 11:55:44 shutdown: going to flush oplog...
Sat Nov 19 11:55:44 shutdown: going to close sockets...
Sat Nov 19 11:55:44 shutdown: waiting for fs preallocator...
Sat Nov 19 11:55:44 shutdown: closing all files...
Sat Nov 19 11:55:44     closeAllFiles() finished

Sat Nov 19 11:55:44 dbexit: really exiting now

Answer

Gates VP · Nov 27, 2011

So the first bit to mention is journaling. Journaling is effectively billed as "fast repair": it is on by default in 2.0+, and after an unclean shutdown the server replays the journal at startup, performing that "repair" automatically.

So if your disks can handle the extra write throughput of journaling, this may solve your problem.
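
If you are on an older version, or have journaling switched off, you can enable it explicitly. A sketch, assuming the usual CentOS/Fedora config location /etc/mongod.conf:

# In the 2.x INI-style config file (/etc/mongod.conf):
journal = true

# Or pass the flag directly at startup:
mongod --journal --dbpath /data/mongo/db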

Any opinions on why it might be a bad idea to try to automatically repair MongoDB?

The #1 issue with repairing MongoDB automatically is simply one of time.

If you have a 200GB database, the system will need to do the following when repairing:

  1. Allocate ~200GB of files (do you have the drive space?)
  2. Read all of the data from the existing files into memory (200GB read)
  3. Check each document for validity and write it back to the new files (200GB write)
  4. Re-create all indexes (200GB reads + large number of writes)
  5. Flush everything to disk

Looking at that list, that's a serious amount of drive thrashing to perform a repair -- even at a sustained 100 MB/s, for example, each full 200GB pass takes over half an hour on its own.
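
If drive space for step 1 is a concern, it's worth checking before kicking off a repair. A sketch, assuming the dbpath from the question; --repairpath lets you point the temporary repair files at a different volume:

# Compare free space on the volume against the current data size (step 1).
df -h /data/mongo/db
du -sh /data/mongo/db

# If the dbpath volume is tight, write the repair files elsewhere:
mongod --repair --dbpath /data/mongo/db --repairpath /data/mongo/repair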

But most production installs are running replica sets. In that case, instead of repairing, you can simply restore from a backup. Restoring from a backup writes the data only once, and it's a process you should already have in place.
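
For example, with a recent mongodump backup, the restore onto the broken node might look like this. A sketch; /backup/dump is a hypothetical dump directory, and the --dbpath direct-restore mode is specific to the 2.x-era tools:

# Restore a mongodump backup straight into the data directory (2.x tools).
sudo /etc/init.d/mongod stop
sudo -u mongod mongorestore --dbpath /data/mongo/db /backup/dump
sudo /etc/init.d/mongod start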

Despite the init.d script returning OK, your system monitoring should tell you that the DB is not up.
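
A check like the following asks the server itself rather than trusting the init script's exit status. A minimal sketch, assuming the mongo shell is installed and mongod listens on the default port:

# Ask mongod directly whether it is alive (the init script's "OK" proves nothing).
if mongo --quiet --eval 'db.adminCommand({ping: 1}).ok' > /dev/null 2>&1; then
    echo "mongod is up"
else
    echo "mongod is NOT responding"
    exit 1
fi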