Yesterday, our relational database experienced a spike in write operations that increased write latency. Our login, metadata processing, and alert processing services all rely on this database. The higher write latency exhausted the threadpools in those services, which delayed their processing. The system recovered on its own once the write spike subsided: write latency dropped and application processing resumed.
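As a rough illustration of why higher write latency exhausts a threadpool, the sketch below assumes each write holds one thread for the full duration of the write. The pool size, request rate, and latencies are hypothetical values chosen for the example, not measurements from the incident.

```python
def max_write_throughput(pool_size: int, write_latency_ms: int) -> float:
    """Upper bound on writes per second when each write occupies one thread."""
    return pool_size * 1000 / write_latency_ms


def backlog_after(arrival_rate: float, pool_size: int,
                  write_latency_ms: int, duration_s: float) -> float:
    """Requests left waiting after duration_s at a steady arrival rate."""
    capacity = max_write_throughput(pool_size, write_latency_ms)
    # Work queues up only when requests arrive faster than threads free up.
    return max(0.0, arrival_rate - capacity) * duration_s


if __name__ == "__main__":
    # Hypothetical numbers: 20 threads, 300 requests/s, one minute of load.
    print(backlog_after(300, 20, 50, 60))   # 50 ms writes: capacity 400/s, no backlog
    print(backlog_after(300, 20, 200, 60))  # 200 ms writes: capacity 100/s, backlog grows
```

With these assumed numbers, a fourfold increase in write latency cuts the pool's capacity from 400 to 100 writes per second, so the same traffic that was handled comfortably now queues up until latency drops again.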
Because critical parts of our system, such as the check and ingest pipelines, use caching, the synchronous services stayed operational during the incident. This prevented a storm of false check failures and the loss of ingested data.
We are working to better monitor and prevent write operation spikes of this size in the future, and to add protection to the services that rely on the relational database.