Yesterday, we had a failure scenario in which an ingest pipeline issue cascaded into two alerting issues. A single node in our messaging infrastructure hit an out-of-memory error that took the service down on that host. The downed host degraded ingest throughput, destabilizing the service responsible for check ingestion and causing a burst of false check-failure alarms between 6:58 PM and 7:03 PM Eastern.
Once ingestion came back online, these checks cleared, but the resulting burst of clear events increased load on the database that stores policies, and roughly 10% of the clears were never processed. Consequently, alerts that should have closed remained open.
We have since closed all open check alerts from that time period. We have also begun work to protect against this failure scenario, including more sensitive memory monitoring, better buffering between our messaging infrastructure and the check ingest pipeline, and improved resiliency in our alert processing pipeline.
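For illustration only, here is a minimal sketch of the kind of resiliency improvement we mean for clear-event processing: retrying transient database failures with backoff and parking events that still fail rather than dropping them. The names (process_clear_event, PolicyDBError, write_clear_to_policy_db) are hypothetical and do not reflect our actual pipeline.

```python
import random
import time
from collections import deque


class PolicyDBError(Exception):
    """Hypothetical error raised when the policy database is overloaded."""


def write_clear_to_policy_db(event):
    """Stand-in for the real write; fails ~10% of the time to mimic overload."""
    if random.random() < 0.1:
        raise PolicyDBError("policy database overloaded")


def process_clear_event(event, max_attempts=5, base_delay=0.5):
    """Retry transient failures with exponential backoff instead of dropping the clear."""
    for attempt in range(1, max_attempts + 1):
        try:
            write_clear_to_policy_db(event)
            return True
        except PolicyDBError:
            if attempt == max_attempts:
                return False
            time.sleep(base_delay * 2 ** (attempt - 1))


dead_letter = deque()  # clears that exhaust retries are parked here, not lost


def handle_clear(event):
    # Events that still fail after retries are queued for later re-drive,
    # so the corresponding alert can eventually close.
    if not process_clear_event(event):
        dead_letter.append(event)
```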