Messaging Outage
Incident Report for CloudWisdom
Postmortem

Saturday morning we experienced a major data ingestion outage between 06:25 AM and 06:55 AM Eastern time. Our messaging infrastructure went down due to a cascading Out of Memory (OOM) failure: two of our nodes failed, placing additional load on the remaining nodes, which in turn caused them to fail as well. During this time incoming data was not ingested and alerts did not go out. After the incident was resolved, AWS data and alerting were backfilled over the next 90 minutes (07:00 AM to 08:30 AM Eastern). Our Linux and Windows agents store data locally when they have trouble posting to the API, and we saw a jump in ingested data from these agents once our messaging infrastructure was repaired.
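For readers curious about the backfill behavior described above, the sketch below illustrates the general spool-and-replay pattern: an agent writes payloads to local disk when a POST to the API fails and replays them once the API accepts traffic again. This is a minimal illustration only; the names (SPOOL_DIR, API_URL, report, flush_spool) and the endpoint are hypothetical and are not taken from our actual agent code.

    # Minimal sketch of a spool-and-backfill agent. All names are illustrative.
    import json
    import os
    import time
    import urllib.request

    SPOOL_DIR = "/var/spool/agent-metrics"      # hypothetical spool location
    API_URL = "https://api.example.com/ingest"  # hypothetical ingest endpoint

    def post_metrics(payload: dict) -> bool:
        """Try to POST a payload to the ingest API; return True on success."""
        data = json.dumps(payload).encode("utf-8")
        req = urllib.request.Request(
            API_URL, data=data, headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    def spool_locally(payload: dict) -> None:
        """Persist a payload to disk so it survives until the API is reachable."""
        os.makedirs(SPOOL_DIR, exist_ok=True)
        fname = os.path.join(SPOOL_DIR, f"{time.time_ns()}.json")
        with open(fname, "w") as fh:
            json.dump(payload, fh)

    def flush_spool() -> None:
        """Replay spooled payloads in order once the API accepts traffic again."""
        if not os.path.isdir(SPOOL_DIR):
            return
        for fname in sorted(os.listdir(SPOOL_DIR)):
            path = os.path.join(SPOOL_DIR, fname)
            with open(path) as fh:
                payload = json.load(fh)
            if post_metrics(payload):
                os.remove(path)   # only delete after a confirmed post
            else:
                break             # API still down; stop and retry later

    def report(payload: dict) -> None:
        """Send a payload, falling back to the local spool on failure."""
        if post_metrics(payload):
            flush_spool()         # drain any backlog while the API is healthy
        else:
            spool_locally(payload)

Once the API came back, every successful post gave the agents a chance to drain their local backlog, which is why ingestion volume jumped after the repair.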

We are continuing to investigate why our messaging infrastructure nodes ran into OOM errors, and whether the failure was load-based or triggered by a set of particularly heavy messages moving through the system, so we can better protect against cascading failures like this and implement a permanent fix for this failure mode. On a positive note, our recent load balancing changes limited this outage to data ingestion and prevented a burst of false check alerts; check ingestion remained stable for the entirety of the incident.
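As one illustration of the kind of protection we are evaluating against cascading OOM failures (a sketch under assumed thresholds, not a description of the fix we will ship), a messaging node can shed load before it runs out of memory by rejecting new messages once its resident memory crosses a threshold, pushing back on producers instead of crashing:

    # Illustrative load-shedding guard; the budget and threshold are assumptions.
    import os

    PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
    MEMORY_LIMIT_BYTES = 6 * 1024 ** 3   # assumed per-node memory budget (6 GiB)
    SHED_THRESHOLD = 0.85                # start rejecting at 85% of the budget

    def current_rss_bytes() -> int:
        """Current resident set size, read from /proc/self/statm (Linux only)."""
        with open("/proc/self/statm") as fh:
            resident_pages = int(fh.read().split()[1])
        return resident_pages * PAGE_SIZE

    def should_accept(message_size: int) -> bool:
        """Reject new messages when memory pressure is high, instead of OOMing."""
        projected = current_rss_bytes() + message_size
        return projected < MEMORY_LIMIT_BYTES * SHED_THRESHOLD

    def handle(message: bytes, enqueue, reject) -> None:
        """Enqueue a message if there is headroom; otherwise push back on the sender."""
        if should_accept(len(message)):
            enqueue(message)
        else:
            reject(message)  # e.g. NACK so the producer retries later or elsewhere

Rejecting a message is recoverable (the producer can retry or fail over); an OOM kill on a node is not, and it shifts that node's load onto its peers, which is exactly the cascade we saw.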

We have more work to do on protecting our messaging infrastructure from failures. We are, however, encouraged by our recent progress around check stability.

Posted Jun 08, 2020 - 16:50 EDT

Resolved
Data ingestion, data collection, and alerting are all operational. We'll post a summary within one business day.
Posted Jun 06, 2020 - 08:35 EDT
Update
We ran into a minor issue while rolling out our most recent change. We've resolved it and resumed rolling out our change to our cloud collection services.
Posted Jun 06, 2020 - 08:24 EDT
Update
We are still working to resolve the cloud collection degradation.
Posted Jun 06, 2020 - 08:06 EDT
Update
We discovered a remaining issue with cloud collection. We are currently applying a fix and monitoring the results.
Posted Jun 06, 2020 - 07:46 EDT
Monitoring
The system is stabilizing. Data ingestion is back online and we are monitoring.
Posted Jun 06, 2020 - 07:16 EDT
Update
We are continuing to work on a fix for this issue.
Posted Jun 06, 2020 - 06:59 EDT
Update
Our messaging infrastructure is back online and we are accepting traffic again. We are working through the backlog. AWS data is being backfilled.
Posted Jun 06, 2020 - 06:56 EDT
Identified
Our messaging infrastructure has had an outage. We are bringing the infrastructure back online. We expect data ingestion delays and false check alarms.
Posted Jun 06, 2020 - 06:48 EDT
This incident affected: Check Ingestion, Data Ingestion, Data Processing, and Alerting.