Degraded Ingest and Alerting
Incident Report for CloudWisdom
Postmortem

Yesterday, an ingest pipeline issue cascaded into two alerting issues. A single node in our messaging infrastructure hit an out-of-memory error that took the service on that host down. The downed host degraded ingest performance, destabilizing the service responsible for check ingestion and causing a burst of false-alarm check failures between 6:58 PM and 7:03 PM Eastern.

Once ingestion came back online, these checks cleared, but the resulting wave of clear events increased load on the database that stores policies, and roughly 10% of those clear events were not processed correctly. Consequently, alerts that should have closed remained open.
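
As a rough illustration of that processing gap, the sketch below retries clear-event writes against a busy policy database and parks any event that still cannot be applied, rather than dropping it. This is a minimal example under assumed names (process_clear_event, PolicyDBBusyError, policy_db.close_alert, policy_db.dead_letter), not our production code.

    import random
    import time

    # Hypothetical sketch: retry clear-event writes against a busy policy
    # database instead of dropping them. The class and method names here are
    # illustrative placeholders, not CloudWisdom code.

    class PolicyDBBusyError(Exception):
        """Raised when the policy database rejects a write under load."""

    def process_clear_event(event, policy_db, max_attempts=5):
        """Apply a clear event, backing off and retrying while the DB is overloaded."""
        for attempt in range(1, max_attempts + 1):
            try:
                policy_db.close_alert(event["check_id"])
                return True
            except PolicyDBBusyError:
                if attempt == max_attempts:
                    # Park the event for later replay rather than losing it,
                    # so the corresponding alert does not stay open.
                    policy_db.dead_letter(event)
                    return False
                # Exponential backoff with jitter to avoid hammering the database.
                time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)

Parking unprocessed clear events somewhere replayable would let the corresponding alerts close automatically once the database recovers, instead of requiring a manual sweep like the one described below.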

We have since closed all open check alerts from that time period. We have also begun work to protect against this failure scenario, including more sensitive memory monitoring, better buffering between our messaging infrastructure and the check ingest pipeline, and improved resiliency in our alert processing pipeline.
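
As a sketch of the buffering improvement only, the example below places a bounded in-process queue between the message consumer and the check ingest workers, so a burst from a recovering broker node applies backpressure instead of destabilizing ingestion. The consume() and ingest_check() functions are assumed placeholders for our actual clients.

    import queue
    import threading

    # Illustrative sketch: a bounded in-process buffer between the message
    # consumer and the check ingest workers. consume() and ingest_check() are
    # assumed placeholders for the real client calls.

    BUFFER = queue.Queue(maxsize=10_000)  # bounded: the producer blocks when full

    def consumer_loop(consume):
        """Pull messages from the messaging infrastructure into the buffer."""
        while True:
            message = consume()  # blocks until a message is available
            BUFFER.put(message)  # blocks when the buffer is full (backpressure)

    def ingest_loop(ingest_check):
        """Drain the buffer at the rate the ingest pipeline can sustain."""
        while True:
            ingest_check(BUFFER.get())

    def start(consume, ingest_check, workers=4):
        """Wire the consumer and a small pool of ingest workers together."""
        threading.Thread(target=consumer_loop, args=(consume,), daemon=True).start()
        for _ in range(workers):
            threading.Thread(target=ingest_loop, args=(ingest_check,), daemon=True).start()

The bounded queue trades a small amount of added latency during bursts for a predictable memory ceiling on each ingest host.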

Posted Dec 10, 2019 - 17:45 EST

Resolved
We've resolved the issue and will post a summary on this incident within one business day.
Posted Dec 09, 2019 - 19:58 EST
Update
We're continuing to work through the processing and alerting backlog.
Posted Dec 09, 2019 - 19:25 EST
Update
We saw a burst of false-alarm check failures go out; they have since subsided. We are still monitoring the recovery of the system.
Posted Dec 09, 2019 - 19:14 EST
Monitoring
A fix has been implemented and the system is recovering. We will monitor for 10 minutes before closing this incident.
Posted Dec 09, 2019 - 19:05 EST
Identified
We're currently experiencing an issue with our messaging infrastructure resulting in delayed ingest and alerting.
Posted Dec 09, 2019 - 18:54 EST
This incident affected: Data Ingestion and Alerting.