Last Thursday night through Saturday morning we had three separate failures in our ingest pipeline. All three incidents were caused by node failures in our messaging infrastructure, which drove ingest latencies high enough that system-check timers expired, triggering false check alerts.
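To make the failure mode concrete, here is a minimal sketch of a deadline-based system check (not our actual monitoring code; the channel, deadline, and batch names are illustrative). When upstream latency delays a batch past the timer, the check fires exactly as if the data were lost, which is how delayed rather than lost data produced false alerts.

```go
package main

import (
	"fmt"
	"time"
)

// checkIngest simulates a deadline-based system check: if the next batch
// does not arrive before the timer expires, an alert fires even though the
// data is only delayed, not lost. The one-second deadline is illustrative.
func checkIngest(batches <-chan string, deadline time.Duration) {
	timer := time.NewTimer(deadline)
	defer timer.Stop()
	for {
		select {
		case b := <-batches:
			fmt.Println("batch arrived:", b)
			timer.Reset(deadline)
		case <-timer.C:
			// A latency spike upstream looks identical to a real outage
			// here, which is how the false check alerts were raised.
			fmt.Println("ALERT: check timer expired")
			timer.Reset(deadline)
		}
	}
}

func main() {
	batches := make(chan string)
	go checkIngest(batches, time.Second)
	batches <- "batch-1"
	time.Sleep(2500 * time.Millisecond) // simulated latency spike -> false alerts
	batches <- "batch-2"
	time.Sleep(100 * time.Millisecond)
}
```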
Our high-throughput ingest pipeline requires a low-latency architecture, which makes it sensitive to any delay introduced at the load balancing tier. Our original load balancing strategy performed very well, but it did not sufficiently protect us against certain single-node failure modes. Our interim attempts at a remedy were not fruitful, so this weekend we transitioned our platform to a new load balancing strategy.
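As a rough illustration of the direction we took (a sketch under assumed details, not our production setup): a balancer that pairs simple rotation with an active health checker can pull a slow or failed node out of rotation quickly, so a single bad node no longer adds latency to the whole pipeline. The node addresses and the /healthz endpoint below are assumptions for the example.

```go
package main

import (
	"errors"
	"net/http"
	"sync"
	"time"
)

type Balancer struct {
	mu      sync.Mutex
	nodes   []string
	healthy map[string]bool
	next    int
}

func NewBalancer(nodes []string) *Balancer {
	b := &Balancer{nodes: nodes, healthy: make(map[string]bool)}
	for _, n := range nodes {
		b.healthy[n] = true
	}
	return b
}

// Pick returns the next healthy node, skipping any the checker has marked down.
func (b *Balancer) Pick() (string, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for i := 0; i < len(b.nodes); i++ {
		n := b.nodes[b.next]
		b.next = (b.next + 1) % len(b.nodes)
		if b.healthy[n] {
			return n, nil
		}
	}
	return "", errors.New("no healthy nodes")
}

// CheckLoop probes each node on an interval; a slow or failed probe removes
// the node from rotation instead of letting its latency leak downstream.
func (b *Balancer) CheckLoop(interval, timeout time.Duration) {
	client := &http.Client{Timeout: timeout}
	for range time.Tick(interval) {
		for _, n := range b.nodes {
			resp, err := client.Get("http://" + n + "/healthz")
			ok := err == nil && resp.StatusCode == http.StatusOK
			if resp != nil {
				resp.Body.Close()
			}
			b.mu.Lock()
			b.healthy[n] = ok
			b.mu.Unlock()
		}
	}
}

func main() {
	b := NewBalancer([]string{"ingest-1:8080", "ingest-2:8080", "ingest-3:8080"})
	go b.CheckLoop(5*time.Second, 500*time.Millisecond)
	if node, err := b.Pick(); err == nil {
		_ = node // hand the next message batch to this node
	}
}
```

The trade-off in a design like this sits in the probe interval and timeout: tighter values evict a bad node faster at the cost of extra health-check traffic, which is the kind of performance-versus-resilience balance described above.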
This morning's maintenance window was used to add computing resources to the nodes behind the new load balancer. This new architecture strikes a better balance between our performance and resilience requirements, and it addresses the failure mode that caused the recent false check alerts.