Metadata Search Outage
Incident Report for CloudWisdom
Postmortem

Yesterday, our search cluster's performance was severely degraded for 8 minutes between 1:48PM and 1:56PM Eastern. Extensive garbage collection (GC) across the cluster began at 1:40PM, which lead to one node in the cluster running out of memory. This put additional memory pressure on the remaining nodes, eventually causing more nodes to fail. For remediation, we restarted the nodes and brought the services back online immediately.

We're continuing to investigate root cause of the garbage collection pressure, and will implement additional instrumentation and tuning to prevent the same event from happening in the future.

Posted Jan 30, 2020 - 16:48 EST

Resolved
The system has fully recovered. We will provide more details within a business day.
Posted Jan 29, 2020 - 14:10 EST
Monitoring
The database is recovering. We'll continue to monitor as we work through backlogged jobs.
Posted Jan 29, 2020 - 14:00 EST
Identified
Our metadata database is degraded. We are working to restore it. The UI will be degraded until we resolve the issue.
Posted Jan 29, 2020 - 13:54 EST
This incident affected: Data Processing and Data Access.