Root Cause Analysis
Our production Redis database server became unreachable, causing errors and downtime in the server components that depend on it. The root cause is still under investigation. The Redis server had been upgraded by our hosting provider earlier the same day, and we are verifying whether that upgrade contributed to the problem.
Timeline
12:18 UTC - The Be My Eyes development team noticed that API calls to the backend had started to fail.
12:22 UTC - The logs showed that the failures were caused by failing Redis database connections.
12:26 UTC - We decided to increase the size of the Redis database server. This operation reinstalled and restarted the problematic server.
12:42 UTC - The newly upgraded Redis server came online, which resolved the incident.
Impact
Be My AI (chat): Fully unavailable during the incident.
Calls to volunteers and service directory profiles: Fully unavailable during the incident.
Other Services: Disruptions during the incident.
Resolution
All services were restored by 12:42 UTC, and checks confirmed system stability. No user data was at risk as a result of the incident.
We are implementing measures to prevent similar incidents, including:
- Working with our hosting provider to determine the root cause of the server failure.
- Minimizing the impact if the Redis server becomes unreachable in the future. Since Redis is primarily used to improve service speed, core functionality can be made to remain intact even while Redis is down.
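The second measure can be illustrated with a minimal sketch of a cache lookup that falls back to the primary data source when the cache is unreachable. This is not our actual implementation: `get_with_fallback`, `load_from_source`, `UnreachableCache`, and `InMemoryCache` are hypothetical names, and the sketch assumes the cache client signals failure with Python's built-in `ConnectionError` or `TimeoutError`. A real Redis client may raise its own exception types, which would need to be caught instead.

```python
from typing import Callable, Dict, Optional, Protocol


class Cache(Protocol):
    """Minimal cache interface assumed for this sketch (hypothetical)."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def get_with_fallback(
    cache: Cache, key: str, load_from_source: Callable[[str], str]
) -> str:
    """Serve from the cache when possible, otherwise from the primary source."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except (ConnectionError, TimeoutError):
        # Cache unreachable: skip it and answer from the primary source.
        return load_from_source(key)

    # Cache miss: load from the primary source and try to store the result.
    value = load_from_source(key)
    try:
        cache.set(key, value)
    except (ConnectionError, TimeoutError):
        pass  # Failing to repopulate the cache must not fail the request.
    return value


class UnreachableCache:
    """Stand-in that simulates a cache server refusing connections."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("cache unreachable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache unreachable")


class InMemoryCache:
    """Stand-in for a healthy cache, backed by a dictionary."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value
```

With this shape, a request served through `UnreachableCache` is slower because every lookup reaches the primary source, but it still succeeds. In practice a short connection timeout would also be needed so that an unreachable cache fails fast rather than stalling each request.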
We apologize for the inconvenience caused and are committed to improving our systems for better reliability.