Partial Policy Processing Outage
Incident Report for CloudWisdom
Postmortem

Between Sunday 9/20 at 7 PM Eastern and Wednesday 9/23 at 3 PM Eastern, we had an undetected partial data processing outage. Elevated pressure on one of our databases caused internal issues in two of our policy processing application containers, which stopped them from evaluating policies. Messages that would have been evaluated by these containers were not picked up by other containers.

We have alerting in place to catch delays in processing, but a bug in one of our alerts let delays in this message topic slip through undetected. We have already fixed this alerting bug.
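
As an illustration only, a processing-delay check of this kind might look like the sketch below. This report does not describe how the alert is implemented or what the bug was, so everything here is an assumption: the `PartitionStatus` record, the `stalled_partitions` function, the topic name, and the 300-second threshold are all hypothetical. The sketch checks each slice of a topic separately, so that a stall affecting only some consumers is not hidden by healthy ones.

```python
from dataclasses import dataclass


@dataclass
class PartitionStatus:
    """Hypothetical snapshot of one slice of a message topic."""
    topic: str
    partition: int
    oldest_unprocessed_age_s: float  # age of the oldest message not yet evaluated


def stalled_partitions(statuses, max_delay_s=300.0):
    """Return (topic, partition) pairs whose oldest unprocessed message
    is older than max_delay_s.

    Each partition is compared against the threshold on its own, rather
    than averaging across the topic, so a partial stall is still reported.
    """
    return [
        (s.topic, s.partition)
        for s in statuses
        if s.oldest_unprocessed_age_s > max_delay_s
    ]


# Example: one of three partitions has stopped being processed.
statuses = [
    PartitionStatus("policy-messages", 0, 12.0),
    PartitionStatus("policy-messages", 1, 4000.0),
    PartitionStatus("policy-messages", 2, 8.5),
]
stalled = stalled_partitions(statuses)
```

In this example the delay averaged across the three partitions is dominated by one stalled slice while the other two look healthy; reporting per partition makes the affected slice explicit.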

During the incident, roughly 20% of elements across all customer accounts were not evaluated by policies. This means that these elements did not generate alerts, even if they had metrics in a violating range and were not in maintenance mode. However, metric data was not affected during this time, so users can use the Metrics page to review metric data history if they are concerned that an issue went undetected.

Our team has both immediate and long-term plans to address this issue, and some action has already been taken. CloudWisdom is heavily relied upon for accurate and timely alerting, so we take issues like these very seriously and recognize that we must do better. We firmly believe that tackling the problem from all angles with multiple solutions gives us the best opportunity to do this.

Posted Sep 24, 2020 - 08:35 EDT

Resolved
Between Sunday 9/20 at 7 PM Eastern and Wednesday 9/23 at 3 PM Eastern, we had an undetected partial data processing outage. Elevated pressure on one of our databases caused internal issues in two of our policy processing application containers, which stopped them from evaluating policies. Messages that would have been evaluated by these containers were not picked up by other containers.
Posted Sep 20, 2020 - 19:00 EDT