On June 3rd, 2025, several of our services experienced high latency and elevated error rates for certain customers. The incident lasted from approximately 17:00 CEST to 21:20 CEST. Updates were communicated to our team throughout, and the necessary actions were taken to restore normal operations.
Beginning at 17:00 CEST, our on-call team was alerted to degraded performance across several services and database instances. The degradation was triggered by an error in a database script, which led to significant disruption.
All services that depend on these database instances downstream exhibited high latency. To mitigate the disruption, our incident responders temporarily paused some customer operations by stopping the problematic queries; a sketch of this kind of mitigation follows below.
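The report does not name the database engine or the exact mechanism the responders used to stop the problematic queries. As an illustration only, the sketch below shows one common way to do this on a PostgreSQL instance: listing long-running active queries from pg_stat_activity and terminating them with pg_terminate_backend(). The connection string, the five-minute cutoff, and the use of the psycopg2 driver are all assumptions, not details from the incident.

```python
# Illustrative sketch only -- assumes PostgreSQL and the psycopg2 driver.
import psycopg2

DSN = "host=db.internal dbname=app user=incident_responder"  # hypothetical connection string
MAX_QUERY_AGE = "5 minutes"  # hypothetical cutoff; not taken from the incident report


def stop_long_running_queries(dsn: str = DSN) -> None:
    """Terminate active queries that have been running longer than MAX_QUERY_AGE."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Find active backends whose current query started before the cutoff,
            # excluding this maintenance session itself.
            cur.execute(
                """
                SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
                FROM pg_stat_activity
                WHERE state = 'active'
                  AND query_start < now() - %s::interval
                  AND pid <> pg_backend_pid()
                """,
                (MAX_QUERY_AGE,),
            )
            for pid, runtime, query in cur.fetchall():
                print(f"terminating pid={pid} runtime={runtime} query={query!r}")
                # pg_terminate_backend() ends the server process running that query.
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))


if __name__ == "__main__":
    stop_long_running_queries()
```

Where dropping the connection is too aggressive, pg_cancel_backend() is a gentler alternative that cancels only the current query while leaving the session intact.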
Immediate Identification: Our team dedicated all available resources to identifying the root cause of the degraded performance quickly. This focused effort revealed that an error in a database script was causing the issue.
Collaboration and Support: Members from various departments worked together, bringing a wide range of expertise to bear on investigating and resolving the problem. Their teamwork was crucial in uncovering the source of the disruption.
Infrastructure Amendments: Once the problem was identified, the script was rerun correctly across all affected database instances, and services were restored smoothly.
By 21:20 CEST, the required adjustments had been implemented and normal operations were restored.
This incident highlighted vulnerabilities in our approach to script execution and monitoring. Key takeaways include: