On Sunday 3/21 beginning at 3:09 AM Eastern we had a monitoring data ingest outage. One of the nodes in our messaging infrastructure cluster was carrying a disproportionately high load relative to the other nodes. In an attempt to rebalance load across the cluster, three nodes were mistakenly removed from the load balancer that serves the cluster, leaving a single node to absorb all traffic. That node ran out of memory, leaving the cluster with no serving nodes, which delayed both AWS and non-AWS metric data and caused some loss of non-AWS data.
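One safeguard against this failure mode is a minimum-capacity check before any node is deregistered. The sketch below is illustrative only; the function, pool representation, and threshold are hypothetical and not our actual tooling.

```python
# Hypothetical guardrail: refuse to deregister load-balancer targets if doing
# so would leave the pool below a minimum healthy-node count. The threshold
# and names here are illustrative assumptions, not our production values.

MIN_HEALTHY_NODES = 3


def safe_deregister(pool: list[str], to_remove: list[str]) -> list[str]:
    """Return the pool with `to_remove` taken out, or raise if the
    remaining nodes would be too few to absorb the cluster's traffic."""
    remaining = [node for node in pool if node not in to_remove]
    if len(remaining) < MIN_HEALTHY_NODES:
        raise RuntimeError(
            f"refusing to deregister {to_remove}: only {len(remaining)} "
            f"node(s) would remain (minimum {MIN_HEALTHY_NODES})"
        )
    return remaining
```

With a check like this in the deregistration path, removing three of four nodes at once would have been rejected rather than leaving one node to take the full load.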
The incident was resolved Sunday 3/21 at 4:44 AM Eastern, after the team restored all nodes to the cluster and confirmed that metric data collection had resumed. The team determined that all AWS data was eventually backfilled; non-AWS data from 3:09 AM to 3:51 AM Eastern was not backfilled.
As always, we take issues regarding data collection and alerting very seriously. Our team has received additional training to exercise extra caution when rebalancing nodes in the cluster. Longer term, our goal is to rearchitect our load balancing infrastructure to distribute traffic evenly across the cluster and eliminate the need for manual rebalancing.