Users were unable to connect calls due to an AWS Amazon global incident.
Timeline
- 06:55 UTC: We detected elevated failure rates in call-routing attempts. Initial alert issued.
- 08:46 UTC: Engineering team confirmed that the issue spans our call-routing subsystem; investigation initiated.
- 09:11 UTC: It was determined that root cause stems from an external disruption in our cloud provider’s infrastructure, affecting our connectivity and routing flows. -
11:49 UTC: Recovery achieved: calls began routing again.
Root Cause Analysis
A disruption in the communications infrastructure of a third-party provider prevented the proper creation and routing of calls.
The disruption was triggered by a major infrastructure outage at our video call vendor's upstream cloud provider. The provider reported that the incident began in the US-East-1 region, and involved an increase in error rates and latencies across multiple services.
Specifically, the provider indicated a DNS resolution issue impacting the API endpoint for their database service (DynamoDB) in US-East-1, which cascaded into elevated latency and failures for other dependent services