Yesterday, our search cluster's performance was severely degraded for 8 minutes between 1:48PM and 1:56PM Eastern. Extensive garbage collection (GC) across the cluster began at 1:40PM, which lead to one node in the cluster running out of memory. This put additional memory pressure on the remaining nodes, eventually causing more nodes to fail. For remediation, we restarted the nodes and brought the services back online immediately.
We're continuing to investigate root cause of the garbage collection pressure, and will implement additional instrumentation and tuning to prevent the same event from happening in the future.