Database Connectivity Issue
Incident Report for CloudWisdom
Postmortem

Yesterday, our relational database experienced a spike in write operations that increased write latency. Our login, metadata processing, and alert processing services rely on the relational database, and the increased write latency exhausted their thread pools, delaying processing in those services. The system recovered on its own once the write spike subsided: write latency dropped back to normal and application processing resumed.
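
To illustrate the failure mode, the sketch below shows how a small, fixed-size thread pool backs up when every worker is blocked on a slow database write; the pool size, latencies, and handler are hypothetical, not our production code.

```python
# Minimal sketch of the failure mode: a fixed-size pool whose workers all
# block on slow database writes. Numbers and names are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 8  # assumed worker count shared by request handlers

def handle_request(db_write_latency_s: float) -> str:
    # Each handler performs a synchronous write to the relational database.
    time.sleep(db_write_latency_s)  # stands in for the database round trip
    return "ok"

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# Normal traffic: writes return in ~10 ms and the pool keeps up.
for _ in range(100):
    pool.submit(handle_request, 0.01)

# Write-latency spike: every worker is parked on a slow write, so later
# requests (login, metadata, alert processing) wait in the queue.
spike = [pool.submit(handle_request, 0.5) for _ in range(100)]

start = time.time()
spike[-1].result()
print(f"last queued request waited {time.time() - start:.1f}s")
pool.shutdown()
```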

Thanks to caching in critical parts of our system, such as the check and ingest pipelines, our synchronous services stayed operational during the incident, preventing a false check-failure storm and loss of ingested data.
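
For illustration, a cached read path along the following lines can keep serving checks while the database is degraded; the cache layout, TTL, and function names are assumptions rather than our actual implementation.

```python
# Hypothetical cached read path: serve from a local cache and fall back to a
# stale entry if the relational database times out.
import time

CACHE_TTL_S = 300.0
_cache: dict[str, tuple[float, dict]] = {}  # check_id -> (fetched_at, config)

def load_check_config(check_id: str, fetch_from_db) -> dict:
    """Return check configuration, refreshing from the database only when
    the cached copy is missing or stale."""
    now = time.time()
    cached = _cache.get(check_id)
    if cached is not None and now - cached[0] < CACHE_TTL_S:
        return cached[1]  # hot path: no database round trip at all
    try:
        config = fetch_from_db(check_id)
        _cache[check_id] = (now, config)
        return config
    except TimeoutError:
        if cached is not None:
            # Database slow or unreachable: keep using the stale copy rather
            # than reporting a false check failure.
            return cached[1]
        raise
```
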
We are working to better monitor and prevent write operation spikes of this size in the future, including adding protection to services that rely on the relational database.
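
One form that protection can take is a circuit breaker around database writes, so a latency spike sheds load instead of pinning every worker thread; the thresholds and names below are assumptions, not a committed design.

```python
# Sketch of a simple circuit breaker around relational-database writes.
# Thresholds, names, and fallback behaviour are hypothetical.
import time

class WriteBreaker:
    """Stop issuing writes for a cool-off period after repeated slow or
    failed writes, so worker threads are not held hostage."""

    def __init__(self, failure_threshold: int = 5, cooloff_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooloff_s = cooloff_s
        self.failures = 0
        self.opened_at = None  # time the breaker opened, if it is open

    def call(self, write_fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooloff_s:
                raise RuntimeError("breaker open: shedding database writes")
            self.opened_at = None  # cool-off elapsed, try the database again
            self.failures = 0
        try:
            result = write_fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```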

Posted Jan 24, 2020 - 12:15 EST

Resolved
The system has fully recovered.

We'll provide a summary within one business day.
Posted Jan 23, 2020 - 10:31 EST
Update
We are continuing to monitor for any further issues.
Posted Jan 23, 2020 - 10:25 EST
Update
We are continuing to monitor for any further issues.
Posted Jan 23, 2020 - 10:21 EST
Monitoring
The database has recovered and we are catching up on backlogged data.
Posted Jan 23, 2020 - 10:17 EST
Investigating
We're currently investigating a database connectivity issue affecting cloud ingest, data processing, and data access, including the UI.
Posted Jan 23, 2020 - 10:02 EST
This incident affected: Data Ingestion, Data Processing, Data Access, Alerting, and Login.