The incident was opened on October 14, 2020, at 20:55 UTC.
The impact was largely resolved by October 16, 2020, at 23:00 UTC.
We monitored usage until the incident was closed on October 19, 2020, at 23:35 UTC.
Approximately 3% to 5% of attempts to send new logs to our service timed out. This resulted in intermittent failures to ingest logs from agents, code libraries, and REST API calls.
Most customers use our agents, which resend logs that fail to be ingested. Customers submitting logs by other means, such as code libraries or the REST API, had to rely on their own retry logic.
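For customers submitting logs directly over HTTP, retry logic along the following lines is usually sufficient. This is only a sketch: the endpoint URL, header, and parameter names are illustrative placeholders, not our actual ingestion API.

```python
import random
import time

import requests

# Hypothetical placeholders; substitute your actual ingestion endpoint and key.
INGEST_URL = "https://logs.example.com/v1/ingest"
API_KEY = "YOUR_API_KEY"


def send_logs(batch, max_attempts=5, base_delay=1.0, timeout=10):
    """POST a batch of log records, retrying on timeouts and 5xx responses."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(
                INGEST_URL,
                json=batch,
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=timeout,
            )
            if resp.status_code < 500:
                # 2xx: success; 4xx: a client error that retrying won't fix.
                resp.raise_for_status()
                return resp
        except requests.exceptions.Timeout:
            pass  # treat a timeout like a transient failure and retry
        if attempt == max_attempts:
            raise RuntimeError(f"log ingestion failed after {max_attempts} attempts")
        # Exponential backoff with jitter so retries don't arrive in lockstep.
        time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))
```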
A node was added to our service to handle an increased need for resources. This node had previously been cordoned off because of networking failures. When it became operational, our load balancers directed a percentage of ingestion calls to it. Those calls, which amounted to about 3% to 5% of the total, would fail and eventually time out.
Monitoring revealed the failures were specific to pods running on this node. We found other means to handle the increased need for resources, then stopped the pods running on the problematic node and cordoned it off again. The rate of timeouts returned to normal and ingestion proceeded without errors.
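For reference, that remediation amounts to marking the node unschedulable and letting its pods reschedule onto healthy nodes. Below is a minimal sketch, assuming a Kubernetes cluster and the official Python client; the node name and namespace are placeholders.

```python
from kubernetes import client, config


def quarantine_node(node_name, namespace="ingestion"):
    """Cordon a misbehaving node and delete its pods so they reschedule elsewhere."""
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Mark the node unschedulable (equivalent to `kubectl cordon`).
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

    # Delete the pods currently on the node; their controllers will
    # recreate them on nodes that are still schedulable.
    pods = v1.list_namespaced_pod(
        namespace, field_selector=f"spec.nodeName={node_name}"
    )
    for pod in pods.items:
        v1.delete_namespaced_pod(pod.metadata.name, namespace)


quarantine_node("ingest-node-7")  # hypothetical node name
```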
We’re improving how we identify nodes that have been cordoned off because of problematic behavior and should not be reintroduced to our service.
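One possible way to make that identification explicit is to label quarantined nodes with the reason they were removed from service, and to check for that label before any node is uncordoned. The sketch below assumes the same Kubernetes setup as above; the label key and node name are hypothetical.

```python
from kubernetes import client, config

QUARANTINE_LABEL = "ops.example.com/quarantined"  # hypothetical label key


def mark_quarantined(v1, node_name, reason):
    """Record why a cordoned node was removed from service."""
    v1.patch_node(node_name, {"metadata": {"labels": {QUARANTINE_LABEL: reason}}})


def safe_to_uncordon(v1, node_name):
    """Return True only if the node carries no quarantine label."""
    node = v1.read_node(node_name)
    labels = node.metadata.labels or {}
    return QUARANTINE_LABEL not in labels


config.load_kube_config()
v1 = client.CoreV1Api()
mark_quarantined(v1, "ingest-node-7", "networking-failures")
if safe_to_uncordon(v1, "ingest-node-7"):
    # Only reintroduce the node once the quarantine label has been cleared.
    v1.patch_node("ingest-node-7", {"spec": {"unschedulable": False}})
```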