Early yesterday morning we had an outage in our messaging subsystem. This has been a problematic piece of our monitoring infrastructure, but we are happy to see that recent improvements to this subsystem resulted in lower customer impact than similar issues have had in the past.
Between 2:30AM and 2:50AM Eastern Time on 9/1/2020, we lost instances behind a load balancer for part of our messaging infrastructure due to a cascading out-of-memory issue. This resulted in a loss of raw samples and rollups from 2:30AM to 3:00AM Eastern Time for elements whose data is ingested through the API (e.g. Linux and Windows agents). Historically, an outage of this severity would have caused a storm of false check alarms. We're happy that this partial outage did not have that effect.
During this period no analytics were processed; after the incident, however, we began backfilling analytics and policy evaluation. We also backfilled AWS data, so there is no gap for the 2:30AM to 2:50AM period for those elements. Analytics processing was fully caught up by 3:48AM Eastern Time.
We apologize for the outage and plan further improvements to protect our messaging infrastructure from cascading failures so we can continue to reduce the impact of issues like this one.
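For readers curious what protection against cascading memory issues can look like in practice, here is a minimal, hypothetical sketch of one common pattern: a bounded in-memory buffer with load shedding, so a processing backlog causes a few rejected requests rather than unbounded memory growth across instances. The `ingest` handler, buffer size, and processing loop below are illustrative assumptions, not a description of our actual implementation.

```python
import queue
import threading
import time

# Hypothetical bounded buffer between an ingest endpoint and the
# message processor. When the buffer is full, new samples are shed
# instead of accumulating in memory, so a slow consumer cannot drive
# the whole instance out of memory.
BUFFER = queue.Queue(maxsize=10_000)  # hard cap on queued samples

def ingest(sample) -> bool:
    """Accept a sample if there is room; otherwise shed load."""
    try:
        BUFFER.put_nowait(sample)
        return True   # caller would respond 202 Accepted
    except queue.Full:
        return False  # caller would respond 503; client retries later

def processor() -> None:
    """Drain the buffer; in a real system this would write rollups."""
    while True:
        sample = BUFFER.get()
        time.sleep(0.001)  # stand-in for real processing work
        BUFFER.task_done()

threading.Thread(target=processor, daemon=True).start()
```

The key design choice is that the memory cap is enforced at admission: shedding some requests during a spike lets healthy instances keep serving, instead of every instance behind the load balancer exhausting memory and failing together.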