Yesterday, we experienced two batches of false check alerts and degraded data ingest with a small amount of data loss. A new failure mode in our messaging infrastructure produced familiar symptoms.
One node in our messaging infrastructure failed catastrophically, degrading check ingest and causing the first set of false check alerts. We responded quickly by restarting the service on the node. The system appeared to stabilize, but the service had unknowingly been restarted into a degraded state: the node had lost permissions to its data directory, a failure mode we had never seen before, which prevented it from transmitting the data it was receiving downstream. This led to the second incident. We responded by removing the problematic node from the cluster, but the removal was not graceful, which caused the second set of false check alerts. The data directory permission issue was ultimately fixed by stopping and starting the underlying EC2 instance. Once that was done, we brought the node back online and backfilled its ingest data.
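One lesson from the restart-into-a-degraded-state step is that a service can come up "running" while unable to use its data directory. A minimal sketch of the kind of preflight check that would have caught this, assuming a hypothetical data directory path and check script (not our actual tooling):

```shell
#!/bin/sh
# Hypothetical preflight check: before (re)starting a messaging node's
# service, verify its data directory exists and is readable, writable,
# and traversable. Refuse to proceed otherwise, so the node never comes
# up in a state where it accepts data it cannot transmit downstream.
check_data_dir() {
  dir="$1"
  if [ -d "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
    echo "OK: $dir is accessible"
    return 0
  fi
  echo "ALERT: $dir is missing or inaccessible; refusing to start" >&2
  return 1
}
```

Wiring a check like this into the service's startup path (or a restart runbook) turns a silent degraded restart into a loud, immediate failure.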
We recognize that although the root cause of this issue was unique, the symptoms are not: our check ingest pipeline relies too heavily on our messaging infrastructure. We're taking decisive action after this incident. Failures are an expected part of distributed computing, and in the near term we plan to improve our ingestion pipeline to work around messaging infrastructure failures, both known and unknown.