Small Expired Checks Incident
Incident Report for CloudWisdom
Postmortem

During a blue/green deployment of our ingest service, load on the underlying Docker cluster increased. The elevated load slowed communication with our downstream message bus, which made the container cluster unstable. This led to rejected HTTP requests and check failures for a subset of elements. Although we later determined that the issue was not related to a code change, we immediately rolled back the deployment as a precaution. We have reduced overall load on the cluster to provide more headroom during future ingest service deployments, and we plan to deploy the service in smaller batches going forward.

Posted Jan 14, 2020 - 14:36 EST

Resolved
Between 9:35 AM and 9:49 AM Eastern, a small number of checks expired inadvertently. Our investigation so far has found only a small number of false alerts, and we are continuing to investigate the root cause so we can provide a postmortem within one business day.
Posted Jan 13, 2020 - 10:15 EST