Degraded Data Access and Processing
Incident Report for CloudWisdom
Postmortem

Yesterday, 9/23, at 10:45AM Eastern we began having issues with our search database. Under increased load the database began returning results slowly and then stopped returning results at all. This resulted in a UI and query API outage. The database was back online briefly at 10:51AM Eastern, but began returning errors again at 10:53AM. At 11:10AM Eastern the database fully recovered and stayed online.

During the UI and query API outage, data processing was delayed because parts of our data processing pipeline require the search database. Slight delays in some AWS data collection continued until 12:04PM Eastern. By that point search was back online and AWS collection and data processing had fully recovered.

We're continuing to investigate why our search database became overloaded so we can better protect against both the quantity and the complexity of queries.

Posted Sep 24, 2020 - 11:36 EDT

Resolved
We've fully recovered from this degradation and will provide more details soon.
Posted Sep 23, 2020 - 12:04 EDT
Update
Collection delays continue to improve. We're continuing to monitor.
Posted Sep 23, 2020 - 11:52 EDT
Update
We're starting to see improvements in collection delays. We're continuing to monitor.
Posted Sep 23, 2020 - 11:44 EDT
Update
While monitoring the state of the UI, we discovered additional issues related to AWS data collection that caused a roughly one-hour delay in collection. We're in the process of issuing a fix to resolve the collection delay.
Posted Sep 23, 2020 - 11:33 EDT
Update
After another period of degraded access we're still in a monitoring state. We're waiting for delayed data processing to catch up.
Posted Sep 23, 2020 - 11:10 EDT
Monitoring
The database is recovering. Users may experience slowness in the UI while we continue to recover.
Posted Sep 23, 2020 - 10:51 EDT
Identified
We've identified high load on one of our database clusters and are working to resolve the issue.
Posted Sep 23, 2020 - 10:48 EDT
This incident affected: Data Processing and Data Access.