Delayed Data Ingestion
Incident Report for CloudWisdom
Postmortem

Yesterday, we experienced two batches of false check alerts and degraded data ingestion with a small amount of data loss. The cause was a new failure mode in our messaging infrastructure that produced familiar symptoms.

A single node in our messaging infrastructure had a catastrophic failure, degrading check ingest and causing the first set of false check alerts. We responded quickly by restarting the service on the node. The system appeared to stabilize, but the service had unknowingly been restarted into a degraded state: the node had lost permissions to its data directory, a failure mode we had not seen before, which prevented it from transmitting the data it was receiving downstream. This led to the second incident. We responded by removing the problematic node from the cluster, but the removal was not graceful, which caused the second set of false check alerts. The data directory permission issue was ultimately resolved by stopping and starting the underlying EC2 instance. Once it was fixed, we brought the node back online and backfilled its ingest data.
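
For context only: the class of safeguard this failure suggests is a startup health check that verifies the service can actually write to its data directory before the node reports itself healthy, so a restart into this degraded state would be caught immediately. The sketch below is a minimal illustration of that idea, not a description of our actual tooling; the directory path and the wrapper that would run it are assumptions.

```python
import sys
import tempfile

# Hypothetical path for illustration; the real data directory is not named in this report.
DATA_DIR = "/var/lib/messaging/data"

def data_dir_writable(path: str) -> bool:
    """Return True only if this process can actually create (and remove) a file in `path`.

    A real write attempt catches permission problems that a simple stat or
    os.access() check can miss.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False

if __name__ == "__main__":
    if not data_dir_writable(DATA_DIR):
        print(f"{DATA_DIR} is not writable; refusing to report healthy", file=sys.stderr)
        sys.exit(1)
    print(f"{DATA_DIR} is writable; data directory check passed")
```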

We recognize that, although the root cause of this issue is unique, the symptoms are not. Our check ingest pipeline relies too heavily on our messaging infrastructure. Following this incident, we are taking decisive action to insulate the pipeline from inevitable messaging infrastructure failures. Failures are an expected part of distributed computing, and in the near term we plan to improve our ingestion pipeline to tolerate both known and unknown failure modes.

Posted Apr 14, 2020 - 13:55 EDT

Resolved
This incident has been resolved.

All open check expiration incidents have been closed, and the system is stable.

We will post a summary of the issue on this incident within one business day.
Posted Apr 13, 2020 - 18:11 EDT
Update
Many of the check expiration incidents from earlier remain open. Our team will manually close each one. Users may receive an influx of clear notifications when this happens.
Posted Apr 13, 2020 - 17:32 EDT
Update
The problematic node has been added back to the cluster in a limited capacity. We are being cautious with the data it receives.

There was some data loss. More details will be provided in our summary of the incident tomorrow.

We are continuing to monitor.
Posted Apr 13, 2020 - 17:30 EDT
Monitoring
Data ingestion and the rest of the system remain stable.

We are continuing to monitor.
Posted Apr 13, 2020 - 17:06 EDT
Update
Data ingestion is now stable. We are attempting to bring the problematic node back online.

Our hope is to backfill the queued-up data, but there is potential for some data loss.
Posted Apr 13, 2020 - 16:44 EDT
Identified
The issue has been identified and we're working on a fix.

Check expiration counts are near zero.
Posted Apr 13, 2020 - 16:31 EDT
Update
Some checks have erroneously expired again.

We're continuing to investigate the issue.
Posted Apr 13, 2020 - 16:20 EDT
Investigating
We're investigating an additional issue, stemming from the earlier incident, that is causing delayed data ingestion.

Users can expect to see delayed raw samples.
Posted Apr 13, 2020 - 16:06 EDT
This incident affected: Check Ingestion and Data Ingestion.