Running a big MongoDB installation requires a certain amount of routine maintenance. Over time, collections in a MongoDB database can become fragmented. This can be a particularly serious problem if your data usage patterns are relatively unstructured. In the long run, this can result in your databases taking up more space on disk and in RAM to hold the same amount of data, it can make many database operations noticeably slower, and it can reduce your overall query capacity significantly.
Conveniently, MongoDB provides two ways to compact your data and restore optimal performance: repairDatabase and compact. repairDatabase is appropriate if your databases are relatively small, or if you can afford to take a node out of rotation for quite a long time. For our database sizes and query workload, it made more sense to run continuous compaction over all our collections.
To do this, we wrote a small utility script that compacts all our databases incrementally. We run this utility on a secondary node in our replica set, and once it has compacted everything, we can rotate that node in to be the primary node with minimal downtime. We also have this secondary node configured as our snapshot backup host, so if we ever need to reconstruct nodes from snapshots, the new nodes are as freshly compacted as possible.
Here's how it works: it first fetches the list of all the databases in your replica set, and then the list of collections in each database. It then goes through all of these collections and runs the compact command on each one. This is a blocking operation that puts the node into the RECOVERING state, so after each collection, it checks whether replication has fallen too far behind, and if so, it waits for replication to catch up before resuming. If it's interrupted or encounters an error, it saves the list of remaining collections to a file and then prints out instructions for how to resume.
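To make that loop concrete, here is a minimal sketch of the control flow in Ruby. It is not the actual mongo_compact.rb source: the method name compact_all, the 60-second lag threshold, the 5-second poll interval and the remaining.json file name are all assumptions for illustration, and the compact command and the replication lag check are passed in as callables rather than tied to a particular driver API.

```ruby
require 'json'

# Hypothetical sketch of the compaction loop (not the real mongo_compact.rb).
#   collections - array of [database, collection] pairs still to compact
#   compact     - callable that runs the compact command for one collection
#   lag         - callable returning the current replication lag in seconds
#   max_lag     - assumed threshold; above this we pause until we catch up
def compact_all(collections, compact:, lag:, max_lag: 60, poll: 5,
                state_file: 'remaining.json', sleeper: ->(s) { sleep(s) })
  remaining = collections.dup
  until remaining.empty?
    db, coll = remaining.first
    compact.call(db, coll)   # blocking; the node is unavailable meanwhile
    remaining.shift          # only drop it from the list once it succeeded
    # Wait for replication to catch up before touching the next collection.
    sleeper.call(poll) while lag.call > max_lag
  end
  true
rescue Interrupt, StandardError => e
  # Save what is left so a later run can pick up where this one stopped.
  File.write(state_file, JSON.generate(remaining))
  warn "Stopped (#{e.class}: #{e.message}); #{remaining.size} collection(s) left."
  warn "To resume, rerun with the list saved in #{state_file}"
  false
end
```

Passing the two operations in as callables keeps the sketch independent of any driver version; in a real script they would wrap the compact command and a lag calculation based on the replica set status.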
Here's how to use it. To compact everything on the localhost mongo instance, you run it with no arguments (note that if you run this on the primary node, it will silently do nothing):
./mongo_compact.rb
To run it on a particular set of your databases (comma separated) you specify them with the -d option:
./mongo_compact.rb -d userdata1,userdata2,userdata3
To run it from cron, you use the -c option, and it will automatically save and resume its collection list in /var/run/mongo_compact/ and check if it's already running using a .pid file in the same directory.
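The cron behavior can be sketched along these lines. This is an illustrative guess at the mechanics rather than the script's real code: the helper names and the mongo_compact.pid and remaining.json file names are assumptions, and only the /var/run/mongo_compact/ directory comes from the description above.

```ruby
require 'fileutils'
require 'json'

# Returns true if the pid stored in pid_file belongs to a live process.
def already_running?(pid_file)
  return false unless File.exist?(pid_file)
  pid = File.read(pid_file).to_i
  return false unless pid > 0   # empty or garbage pid file counts as stale
  Process.kill(0, pid)          # signal 0 checks existence, sends nothing
  true
rescue Errno::ESRCH
  false                         # no such process: stale pid file
rescue Errno::EPERM
  true                          # process exists but belongs to another user
end

# Runs the block while holding a pid file in dir, or returns
# :already_running if another instance appears to be active.
# The block receives the (assumed) path of the saved collection list.
def with_pid_file(dir = '/var/run/mongo_compact')
  FileUtils.mkdir_p(dir)
  pid_file = File.join(dir, 'mongo_compact.pid')   # hypothetical file name
  return :already_running if already_running?(pid_file)
  File.write(pid_file, Process.pid.to_s)
  begin
    yield File.join(dir, 'remaining.json')         # hypothetical file name
  ensure
    FileUtils.rm_f(pid_file)
  end
end

# Loads the saved list of remaining collections, or nil if there is none.
def load_remaining(state_file)
  File.exist?(state_file) ? JSON.parse(File.read(state_file)) : nil
end
```

A cron-driven run would then wrap its work in with_pid_file, start from load_remaining when a saved list exists, and fall back to a fresh listing of all collections otherwise.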
You can see a full list of options with --help:
./mongo_compact.rb --help
We hope you find this useful! If you want to hear more tips like these, we'll be sharing more of our tools and best practices at MongoSF on May 10th.