Partial outage in our Ingest subsystem
Incident Report for CloudWisdom
Postmortem

Early yesterday morning we had an outage in our messaging subsystem. This subsystem has been a problematic piece of our monitoring infrastructure, but we are happy to see that recent improvements to it resulted in lower customer impact than similar issues have caused in the past.

Between 2:30AM and 2:50AM Eastern time on 9/1/2020, we lost instances behind a load balancer for part of our messaging infrastructure due to a cascading out-of-memory issue. This resulted in a loss of raw samples and rollups for elements whose data is ingested through the API (e.g. Linux and Windows agents) from 2:30AM-3:00AM Eastern time. Historically, an outage of this severity would have caused a false alarm check storm. We're happy that this partial outage did not have that effect.

During this period no analytics were processed; however, after the incident we began to backfill analytics and policy evaluation. We also began to backfill AWS data, resulting in no gap for the 2:30AM-2:50AM period for those elements. Analytic processing was fully caught up at 3:48AM Eastern time.

We apologize for the outage. We plan to make further improvements to protect our messaging infrastructure from cascading failures so we can further reduce the impact of these types of issues.

Posted Sep 02, 2020 - 14:33 EDT

Resolved
Issue has been fully resolved. We will post an incident summary soon.
Posted Sep 01, 2020 - 03:51 EDT
Monitoring
We've resolved the issue with our messaging infrastructure and are working through backlogged work. There will be slight delays in data analytics and alerts as we recover.
Posted Sep 01, 2020 - 03:31 EDT
Update
We are currently having issues with our messaging pipeline. We've taken action to resolve the issue and are currently checking that the system is recovering before going into a Monitoring state.
Posted Sep 01, 2020 - 03:21 EDT
Identified
We've identified the issue. We are actively working to restore the service.
Posted Sep 01, 2020 - 03:01 EDT
Investigating
We are currently experiencing a partial outage of our Ingest subsystem. We are working to identify the root cause.
Posted Sep 01, 2020 - 02:58 EDT
This incident affected: Data Ingestion.