Analytics Service Errors in EU
Incident Report for CloudWisdom
At 8:17 PM GMT, the team noticed a steep increase in element-processing errors in the Analytics service. Two of the cluster's consumers were responsible for these errors. As a result, the element processing queue grew large and consumption from it fell behind. We immediately restarted the consumers to work through the element backlog. This appeared to resolve the issue, but the errors did not subside entirely. To correct this, we stopped all consumers in the cluster and started them again, after which the errors subsided completely and the queued elements were processed correctly. Unfortunately, 5-minute rollup data was lost between 8:00 PM GMT and 9:25 PM GMT for roughly half of the elements in the system. We plan to adjust the related alerting policies to better identify this issue in the future.

Please note this incident affected users of the EU platform only. Users of the U.S. platform were not affected.
Posted Mar 25, 2020 - 16:00 EDT