Root Cause Analysis
A deployment introduced changes to the database that were not fully compatible with the application code, resulting in service disruptions. Additionally, database operations during the deployment caused resource contention, leading to temporary outages across multiple features.
Timeline
07:52 UTC - Deployment initiated, leading to increased database load and connection issues.
08:03 UTC - Be My AI (chat) became fully inaccessible, and intermittent errors were observed across other services.
08:10 UTC - A rollback was initiated to stabilize the system.
08:15 UTC - All services, except Be My AI, were restored.
08:25 UTC - Be My AI was brought back online after further fixes.
Impact
Be My AI (chat): Fully unavailable from 08:03 to 08:25 UTC (about 22 minutes).
Other Services: Brief disruptions lasting 5-10 minutes during the deployment.
Resolution
All services were restored by 08:25 UTC, and follow-up checks confirmed that the system was stable.
We are implementing measures to prevent similar incidents, including:
- Enhancing deployment processes to better coordinate application and database changes.
- Improving database operation handling to reduce the risk of resource contention during deployments.
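As a rough illustration of the two measures above, the sketch below shows one way they could look in practice. It is not a description of the actual fix. Every name in it (`is_compatible`, `run_in_batches`, the integer schema version, the batch size and pause) is a hypothetical assumption made for the example.

```python
import time
from typing import Callable, Iterable, Sequence


def is_compatible(db_schema_version: int, app_min: int, app_max: int) -> bool:
    """Hypothetical pre-deploy gate: True if the database schema version
    falls inside the range this application build declares it supports."""
    return app_min <= db_schema_version <= app_max


def chunked(ids: Sequence[int], size: int) -> Iterable[Sequence[int]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def run_in_batches(
    ids: Sequence[int],
    apply_batch: Callable[[Sequence[int]], None],
    batch_size: int = 500,
    pause_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply a database operation in small batches with a pause after each,
    so that no single long-running statement holds resources for the whole
    deployment. Returns the number of batches processed."""
    batches = 0
    for batch in chunked(ids, batch_size):
        apply_batch(batch)  # assumed to run one short transaction per batch
        batches += 1
        sleep(pause_seconds)  # give regular traffic room between batches
    return batches
```

In this sketch, a deployment would stop before rollout if `is_compatible` returns False, and any large data change would go through `run_in_batches` rather than a single statement. The right batch size and pause depend on the workload and would need to be measured.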
We apologize for the inconvenience caused and are committed to improving our systems for better reliability.