We would like to share more details about the January 22nd service disruption: what happened, our efforts to restore functionality, and the steps we are taking to prevent this sort of issue from happening again. We are working hard to improve reliability, and we believe transparency about incidents like this is part of that work.
The initial outage began at 8:16 A.M. PST during a routine operation in which we rotated database nodes to defragment and compact our data. This triggered a rare edge case in Cloud Code that caused all Cloud Code requests to time out simultaneously, which in turn created a timeout feedback loop at the app server layer. While restoring service, we also discovered that some indexes had been built incorrectly on one instance, which delayed our recovery. Service was fully restored at 9:24 A.M. PST.
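To illustrate the failure mode: when a downstream dependency starts timing out, upstream servers that keep waiting on it (and retrying) can amplify the load into a feedback loop. A common mitigation is a circuit breaker that fails fast once errors pile up. The sketch below is illustrative only, not Parse's actual implementation; the class name, thresholds, and error handling are all hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive timeouts,
    fail fast for `cooldown` seconds instead of letting callers pile up
    waiting on a dependency that is already timing out."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold  # consecutive failures before opening
        self.cooldown = cooldown    # seconds to stay open before retrying
        self.clock = clock          # injectable clock, eases testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Fail fast: don't add more load to a struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial request
        try:
            result = fn()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # success closes the circuit
        return result
```

The key design point is that rejected requests return immediately rather than holding app server threads until their own timeouts fire, which is what turns one slow dependency into a platform-wide stall.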
In response to this incident, we have taken the following steps:
- We are adding more sanity checks and safeguards around our routine database maintenance tasks and migrations.
- We are adding significantly more monitoring around Cloud Code capacity and handling for unusual error states.
- We are improving our ability to selectively disable individual pieces of Parse functionality, so the impact of any performance degradation is localized and does not affect the majority of Parse apps.
- We are adding post hoc analysis of smart indexing to catch any edge cases.
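As a concrete example of the last point, a post hoc index audit can compare the indexes that should exist against what is actually present on each instance, surfacing the kind of incorrectly built index that slowed our recovery. This is a minimal sketch under assumed inputs (index definitions represented as plain strings), not our production tooling.

```python
def missing_indexes(expected, actual_by_instance):
    """Report, per instance, the expected indexes that are absent.

    `expected` is a set of index definitions (here, plain strings);
    `actual_by_instance` maps an instance name to the set of index
    definitions actually present on it. Instances with nothing
    missing are omitted from the result.
    """
    return {
        instance: sorted(expected - present)
        for instance, present in actual_by_instance.items()
        if expected - present
    }
```

Run periodically, a check like this turns a silent per-instance inconsistency into an alert instead of a surprise during an outage.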
We apologize for the outage. We built this platform for engineers like ourselves, and we know that a platform outage can be terribly disruptive for our customers. Rest assured, the entire team here is committed to the long-term stability of the platform and works hard to avoid events like this. If you have any questions, please reach out to us through our Help & Community portal.