Data Ingest Outage
Incident Report for CloudWisdom

On Sunday, 3/21, beginning at 3:09 AM Eastern, we experienced a monitoring data ingest outage. One node in our messaging infrastructure cluster was carrying a higher load than the other nodes in the cluster. In an attempt to rebalance load across the cluster, three nodes were mistakenly removed from the load balancer that serves the cluster. The remaining node then ran out of memory, leaving the cluster empty. As a result, AWS and non-AWS metric data was delayed, and some non-AWS data was lost.

The incident was resolved on Sunday, 3/21, at 4:44 AM Eastern, after the team confirmed that the cluster was fully operational with all of its nodes and that metric data collection had resumed. The team determined that all AWS data was eventually backfilled, while non-AWS data from 3:09 AM Eastern to 3:51 AM Eastern was not backfilled.

As always, we take issues regarding data collection and alerting very seriously. Our team has received additional training on exercising extra caution when rebalancing nodes in the cluster. Our goal is to rearchitect our load balancing infrastructure so that traffic is better distributed across the cluster, removing the need to rebalance.

Posted Mar 23, 2021 - 13:36 EDT

The system is in a healthy state again. We'll provide further details soon in a postmortem. Thank you for your patience.
Posted Mar 21, 2021 - 04:44 EDT
Cloud collection is operating again, meaning the system is fully operational. We're continuing to monitor.
Posted Mar 21, 2021 - 04:38 EDT
Linux, Windows, and other non-cloud data collection is operational. We are continuing to work on restoring cloud data collection.
Posted Mar 21, 2021 - 04:15 EDT
We're currently experiencing a data ingest outage across all element types. We've identified the issue and are working to resolve it. Data updates and alerts will be delayed.
Posted Mar 21, 2021 - 03:55 EDT
This incident affected: Data Ingestion.