Now that we have fully restored functionality to Parse, we would like to share more details about today's service disruption. We know that service disruptions greatly impact our customers. We are working hard to improve reliability, and in this post we describe what happened, our efforts to restore functionality, and the steps we are taking to prevent this sort of issue from happening again.
The outage was related to the wildcard parse.com SSL certificate, which expired at 5 A.M. Pacific Standard Time. Several months ago, we updated our SSL certificates to avoid this expiration problem. As a general policy, all operational changes to our infrastructure happen via a configuration management system. In this case, a bug in our configuration management prevented the new SSL certificate from being successfully deployed to the production environment. Thus, when we subsequently updated our SSL endpoints, the certificate was not correctly updated.
Our efforts to restore functionality were also delayed by insufficient checks in our automated monitoring systems. Because the simulated mobile environments in our automated testing diverge from actual mobile devices, particularly around SSL verification, our automated systems did not immediately catch this outage. Once we detected the issue, we upgraded the certificate and returned to full functionality at 8:42 A.M. Pacific Standard Time.
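As an illustration of how a simulated client can diverge from a real device around SSL verification, here is a minimal Python sketch. It is an assumption for illustration only: this post does not say how the test environments were actually configured, and the function names `lenient_test_context` and `device_like_context` are hypothetical. A test client built like the first function would keep connecting to an endpoint with an expired certificate, while one built like the second would fail the handshake, as a real device does.

```python
import ssl


def lenient_test_context() -> ssl.SSLContext:
    """A hypothetical test-harness context that skips certificate checks.

    A client using this context will not notice an expired certificate.
    """
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def device_like_context() -> ssl.SSLContext:
    """A context with Python's default verification settings.

    It requires a valid certificate chain and a matching hostname,
    so the handshake fails once the server certificate has expired.
    """
    return ssl.create_default_context()
```

Running end-to-end probes with the stricter context is one way to make a monitoring client fail in the same way that real devices do.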
We are taking several steps to improve Parse’s reliability in response to this service disruption:
- We fixed the proximate cause by upgrading the SSL certificate.
- We are adding a specific monitoring rule to check SSL certificate expiration.
- We are auditing our configuration management process to ensure that verification is sufficiently automated.
- We are auditing our automated end-to-end testing to ensure that failures at this layer are detected by multiple test systems.
- We are adding general monitoring to detect any error that leads to statistical anomalies in traffic.
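A monitoring rule for certificate expiration, like the one in the list above, could look roughly like the following Python sketch. It is not the rule we deployed: the 30-day threshold and the names `days_until_expiry`, `fetch_not_after`, and `should_alert` are assumptions chosen for illustration. It uses only the Python standard library.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical alert threshold, not a value from this post


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days remaining before a certificate's notAfter time.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the server certificate's notAfter.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError here. A monitor
    should treat that exception as an alert as well.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def should_alert(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """Return True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days
```

A scheduled job could call `should_alert(fetch_not_after("parse.com"), datetime.now(timezone.utc))` and page the on-call engineer when it returns `True` or when the connection raises an SSL error.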
Finally, we would like to apologize. We know our service is critical to our customers, and we are committed to working hard to learn from this event in order to improve the reliability of our services. If you have any questions, please reach out to us.