Saturday morning we had a major data ingestion outage between 06:25 AM and 06:55 AM Eastern time. Our messaging infrastructure suffered a cascading out-of-memory (OOM) failure: two of our nodes failed, which placed additional load on the remaining nodes and, in turn, caused them to fail as well. During this window incoming data was not ingested and no alerts went out. After the incident was resolved, AWS data and alerting were backfilled over the next 90 minutes (07:00 AM to 08:30 AM Eastern). Our Linux and Windows agents store data locally when they have trouble posting to the API, and we saw a jump in ingested data from these agents once our messaging infrastructure was restored.
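The agent behavior described above, buffering to local storage when the API is unreachable and replaying the backlog once it recovers, can be sketched roughly as follows. This is a minimal illustration, not our actual agent code; the `SpoolingAgent` class, the injectable `post` callable, and the JSON-lines spool file are all assumptions made for the example.

```python
import json
import os

class SpoolingAgent:
    """Illustrative sketch: buffer payloads to a local spool file when
    posting fails, and replay the backlog once the API is reachable."""

    def __init__(self, post, spool_path):
        self.post = post            # callable(payload) -> True on success
        self.spool_path = spool_path

    def send(self, payload):
        # Replay any spooled payloads first so data arrives in order.
        if not self._flush_spool():
            self._spool(payload)
            return False
        if self.post(payload):
            return True
        self._spool(payload)
        return False

    def _spool(self, payload):
        # Append the payload as one JSON line in the local spool file.
        with open(self.spool_path, "a") as f:
            f.write(json.dumps(payload) + "\n")

    def _flush_spool(self):
        # Returns True once the spool is empty (or never existed).
        if not os.path.exists(self.spool_path):
            return True
        with open(self.spool_path) as f:
            pending = [json.loads(line) for line in f if line.strip()]
        for i, payload in enumerate(pending):
            if not self.post(payload):
                # Rewrite the spool with whatever is still unsent.
                with open(self.spool_path, "w") as f:
                    for p in pending[i:]:
                        f.write(json.dumps(p) + "\n")
                return False
        os.remove(self.spool_path)
        return True
```

When the API comes back, the first successful `send` drains the spool before posting new data, which is why ingestion shows a burst of backfilled agent data right after recovery.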
We are still investigating why our messaging infrastructure nodes ran into OOM errors, and in particular whether the failure was load-based or triggered by a set of particularly heavy messages moving through the system, so that we can better protect against cascading failures like this one and implement a permanent fix for this failure mode. On a positive note, our recent load balancing changes limited this outage to data ingestion and did not produce a burst of false check alerts; check ingestion remained stable for the entire incident.
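One common guard against this kind of cascading OOM is to bound the memory a node will commit to pending messages and shed load once the bound is reached, so a struggling node rejects work instead of falling over and dumping its load on its peers. The sketch below is a generic illustration of that idea, not a description of our fix; the `BoundedIngestQueue` class and its byte-budget accounting are assumptions made for the example.

```python
from collections import deque

class BoundedIngestQueue:
    """Illustrative sketch: cap the bytes held for pending messages.
    Once the budget is exhausted, new messages are rejected so the node
    degrades gracefully instead of running out of memory."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.queue = deque()

    def offer(self, message: bytes) -> bool:
        # Reject the message if accepting it would exceed the budget;
        # the producer can buffer locally and retry, as our agents do.
        if self.used + len(message) > self.max_bytes:
            return False
        self.queue.append(message)
        self.used += len(message)
        return True

    def poll(self):
        # Remove and return the oldest pending message, freeing budget.
        if not self.queue:
            return None
        message = self.queue.popleft()
        self.used -= len(message)
        return message
```

A byte budget rather than a message count matters here because a few unusually heavy messages, one of the hypotheses above, can exhaust memory long before any message-count limit is reached.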
We have more work to do to protect our messaging infrastructure from failures. We are, however, encouraged by our recent progress on check stability.