Parse has been running on MongoDB since version 1.8, so we’ve accumulated a massive amount of operational knowledge about what it takes to run a production MongoDB deployment. Running MongoDB is fairly easy operationally; however, scaling MongoDB can be challenging with things like storage inefficiencies and database-level lock contention. The new modular storage engine API in MongoDB 3.0 is like a magical wand that enables us to run MongoDB on a storage engine optimized for our workloads.
As we said last week, this will be the first in a series of posts where we do a deep dive into our experiences benchmarking MongoDB 3.0 with RocksDB. Parse has had the tremendous benefit of working side by side with the RocksDB team, who have engineered a screaming fast write-optimized storage engine that is already being used to power multiple internal Facebook services.
Flashback is a tool we open-sourced last year to record our production workloads and replay them in an isolated environment at varying speeds. We evaluated performance using Flashback’s per-operation latency metrics. In addition, we also set up a log processing pipeline that allowed us to perform ad-hoc analysis and diagnose performance regressions. At a high level, the pipeline works by first logging every operation in MongoDB, and then parsing the log lines into JSON using mongologtools. The JSON is then shipped to an internal Facebook data diving tool that we used to analyze performance in realtime.
To come up with a replayable production workload, we need to do two things: start capturing all operations of a replica set, and create a consistent snapshot from the start of the recorded session.
Capturing a workload
First we need to begin capturing all the production operations for a set period of time by initiating a “record” session using Flashback. To set up a Flashback record session, we copied the config template (config.template) and modified the following stanzas:
- target_databases: A list of all the databases you would like to record
- oplog_server: A secondary that will be used to tail the oplog for write operations
- profiler_server: The primary in the target replica set to capture profiling data
- duration_sec: Defines how long you want to record
We recommend setting duration_sec to a long time, like 8 to 12 hours, so you get a representative sample of your requests.
Next, we enabled profiling on the primary target node using the set_mongo_profiling.py script. Warning: this can impact the performance of your production replica set because it does an additional write for each operation to the profiling collection, although in practice this has generally been fine for us.
./set_mongo_profiling.py -a enable -n $PRIMARY_HOSTNAME
Now we kick off a record session using
Once the record process has finished, there will be a large JSON file with all of the recorded operations in the order they were executed.
Creating a Consistent Snapshot
Around the same time you start recording operations, you need to generate a consistent backup that you can restore in your isolated test environment. Parse uses EBS snapshots, so generating an initial backup for us is as simple as locking mongod, creating an EBS snapshot of all the RAIDed volumes on /var/lib/mongodb, and unlocking mongod.
Quickly Replaying the same Workload
After replaying a workload, we wanted an efficient way to roll back to a consistent state. We could restore back to the snapshotted EBS volumes each time, but this is very time-consuming because you also need to warm up the EBS volumes and pull the blocks down from S3 for each fresh run. This can take hours or days if you have terabytes of data.
With that in mind, we decided to use LVM on top of EBS to provide a copy-on-write restore point that can be easily rolled back. Although this will incur a small amount of I/O overhead, we can now replay the same workload multiple times quickly and easily. As part of our production provisioning process, we create the logical volumes on top of the RAIDed set of EBS volumes.
#/dev/md0 is the EBS RAID pvcreate /dev/md0 vgcreate mongovg /dev/md0 lvcreate -l 90%%VG -n mongoraid mongovg
When restoring from backup in our test environment, we define a restore point:
lvcreate -l 10%VG -s -n restore_point /dev/mongovg/mongoraid
After finishing a benchmark, we stop mongod, unmount the filesystem and merge the copy-on-write logical volume to rollback to a consistent state:
lvconvert --merge /dev/mongovg/restore_point
It's a simple wash, rinse, and repeat to replay the same workload with different tuning parameters.
Benchmarking Multiple Storage Engines
Normally when we run Flashback benchmarks, we can just restore from backup and replay our captured workload over and over. However in MongoDB 3.0, each storage engine has a different on-disk format. So we also need to run an initial sync of each new storage engine against our restored MMAPv1 backup, and then run benchmarks on each format.
Storage engine efficiency results
We ran an initial sync on 3.0MMAPv1, RocksDB and WiredTiger from our production backup. This particular replica set contained approximately 150 databases and 370k collections. Both WiredTiger and RocksDB had compression enabled using Snappy. The test hosts ran on EC2. We used r3.4xlarge instances with 6TB of EBS storage and 6000 provisioned IOPS.
After the initial sync, RocksDB and WiredTiger cut storage used by more than 10x when compared to MMAPv1, which is pretty insane. Roughly half of the entire data set can now fit in memory (144GB). So for both WiredTiger and RocksDB, an operation that would normally require a disk read in MMAPv1 is more likely to be a simple memory lookup.
After getting over the shock and delight of potentially saving 10x on our storage costs, we wanted to focus on operational latencies and throughput. The second post in this series will be out next week with results of the benchmarking runs broken down by query type.