Degraded Data Ingest
Incident Report for CloudWisdom
Postmortem

Last Thursday night through Saturday morning we had three separate failures in our ingest pipeline. All three incidents were caused by node failures in our messaging infrastructure, which led to high ingest latencies, eventually causing system-check timers to expire and triggering false check alerts.
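
To illustrate the failure mode, here is a minimal sketch (hypothetical names and values, not our actual ingest code): each check is expected to report within a deadline, and when ingest latency delays delivery past that deadline, the timer expires and an alert fires even though the check itself ran.

```python
import time

# Hypothetical sketch of the timer-expiry failure mode; the deadline value
# and function names are illustrative assumptions, not production code.
CHECK_DEADLINE_SECONDS = 60  # assumed window before a check is considered missed


def evaluate_check(last_result_sent_at: float, ingest_delay: float, now: float) -> str:
    """Return the alert state for a check, given when its result was sent
    and how long the ingest pipeline takes to deliver it."""
    received_at = last_result_sent_at + ingest_delay
    if received_at > now:
        # The result exists but has not been ingested yet; the timer sees
        # only missing data and expires, producing a false alert.
        if now - last_result_sent_at > CHECK_DEADLINE_SECONDS:
            return "ALERT (false positive: result delayed in ingest)"
        return "PENDING"
    return "OK"


if __name__ == "__main__":
    now = time.time()
    # A healthy check whose result is stuck behind 90 seconds of ingest latency:
    print(evaluate_check(last_result_sent_at=now - 75, ingest_delay=90, now=now))
```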

Our high-throughput ingest pipeline requires a low-latency architecture, which in turn is sensitive to any delay introduced by the load-balancing tier. Our original load-balancing strategy performed very well, but it did not sufficiently protect us against certain single-node failure modes. Our interim attempts at a remedy were not fruitful, so this weekend we transitioned our platform to a new load-balancing strategy.
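
As a rough illustration of the kind of change involved (a generic sketch, not our production configuration; node names, thresholds, and the probe mechanism are assumptions): a load-balancing tier that actively health-checks nodes and removes unresponsive ones from rotation trades a small amount of overhead for protection against a single failed node stalling the pipeline.

```python
import random

# Generic sketch of health-check-aware node selection; all identifiers and
# thresholds here are illustrative assumptions.


class NodePool:
    def __init__(self, nodes, failure_threshold=3):
        self.nodes = list(nodes)
        self.failure_threshold = failure_threshold
        self.failures = {node: 0 for node in self.nodes}

    def record_probe(self, node, healthy):
        """Update a node's consecutive-failure count from a health-check probe."""
        if healthy:
            self.failures[node] = 0
        else:
            self.failures[node] += 1

    def healthy_nodes(self):
        """Nodes that have not exceeded the consecutive-failure threshold."""
        return [n for n in self.nodes if self.failures[n] < self.failure_threshold]

    def pick(self):
        """Choose a node for the next message, skipping unhealthy ones."""
        candidates = self.healthy_nodes() or self.nodes  # fall back if all look bad
        return random.choice(candidates)


if __name__ == "__main__":
    pool = NodePool(["mq-node-1", "mq-node-2", "mq-node-3"])
    # Simulate mq-node-2 failing three consecutive probes; it is then skipped.
    for _ in range(3):
        pool.record_probe("mq-node-2", healthy=False)
    print(pool.pick())
```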

This morning's maintenance window was used to provision additional computing resources for the nodes behind our new load-balancing tier. With this new architecture we have struck a better balance between our performance and resilience requirements, and we have addressed the failure mode that caused the recent false check alerts.

Posted May 18, 2020 - 11:43 EDT

Resolved
We've fully rolled out our alternative fix. This should give us more resilience to node failures. We will investigate the node failures further in the morning.
Posted May 16, 2020 - 01:13 EDT
Update
We've nearly completed the rollout of our alternative fix. There is currently no disruption in the system.
Posted May 16, 2020 - 00:45 EDT
Update
The rollout of our alternative fix is going well, and we are continuing to proceed slowly.
Posted May 16, 2020 - 00:25 EDT
Update
All backlogged work has been completed. Testing in our test environment went well, and we are going to roll the fix out slowly in production.
Posted May 16, 2020 - 00:07 EDT
Monitoring
We're continuing to process backlogged work. We're testing alternative fixes in our test environment.
Posted May 15, 2020 - 23:39 EDT
Update
We are still investigating why these nodes are failing. So far no additional nodes have failed. We're working through backlogged work.
Posted May 15, 2020 - 23:20 EDT
Update
More nodes have gone down. We are trying to narrow down why the nodes are running into issues.
Posted May 15, 2020 - 23:07 EDT
Identified
Another node in our messaging infrastructure has gone down. We have restored our messaging infrastructure and are working on resolving the underlying issue. Some false check alerts have gone out.
Posted May 15, 2020 - 22:51 EDT
This incident affected: Check Ingestion and Data Ingestion.