We are currently experiencing degraded performance on our systems

Incident Report for Magicline

Postmortem

Incident Summary

On June 3rd, 2025, several of our services experienced high latency and an increased error rate for certain customers. The incident lasted from approximately 17:00 CEST to 21:20 CEST. Updates were communicated continuously within our team, and the necessary actions were taken to restore normal operations.

What Happened?

Beginning at 17:00 CEST, our on-call team was alerted to degraded performance across several services and database instances. This was triggered by an unforeseen issue with a database script, leading to significant disruptions.

Impact on Users

All services that depend on these database instances downstream exhibited high latencies. To mitigate the disruption, our incident responders temporarily paused some customer operations by stopping the problematic queries.
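The report does not name the database engine or the exact commands used; purely as an illustration, the sketch below shows how an operator might locate and stop long-running queries on a MySQL-compatible instance using PyMySQL. The host, credentials, and runtime threshold are hypothetical.

    import pymysql

    LONG_RUNNING_SECONDS = 300  # hypothetical threshold, for illustration only

    # Hypothetical connection details -- the affected instances are not named in this report.
    conn = pymysql.connect(host="db.example.internal", user="ops", password="***")

    with conn.cursor() as cur:
        # Find statements that have been running longer than the threshold.
        cur.execute(
            "SELECT id, user, time, info FROM information_schema.processlist "
            "WHERE command = 'Query' AND time > %s",
            (LONG_RUNNING_SECONDS,),
        )
        for query_id, user, runtime, statement in cur.fetchall():
            print(f"stopping query {query_id} ({user}, {runtime}s): {(statement or '')[:80]}")
            # KILL QUERY aborts the statement but keeps the client connection open.
            cur.execute("KILL QUERY %s", (query_id,))

    conn.close()

Terminating only the offending statements, rather than the connections, is a common way to relieve load without interrupting otherwise healthy sessions.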

Our Response

Immediate Identification: Our team dedicated all available resources to identifying the root cause of the degraded performance and quickly determined that an error in a database script was causing the issue.

Collaboration and Support: Members from several departments collaborated, combining a wide range of expertise to diagnose and address the problem. Their teamwork was crucial in uncovering the source of the disruption.

Infrastructure Amendments: Once the problem was identified, we reran the script correctly across all affected database instances, ensuring that services were restored smoothly.
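The script itself is not described in this report, so the following is only a rough sketch of what "rerunning correctly across all affected areas" can look like for a per-tenant operation: tenants that already completed are skipped so the rerun is safe to repeat, and progress is logged as it goes. All names are hypothetical.

    import logging

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("script-rerun")

    def run_for_tenant(tenant_id: str) -> None:
        """Placeholder for the corrected per-tenant step of the script."""
        ...

    def rerun_script(tenant_ids: list[str], already_done: set[str]) -> None:
        for tenant_id in tenant_ids:
            # Skip tenants that were already processed, so repeating the rerun is harmless.
            if tenant_id in already_done:
                log.info("skipping %s (already done)", tenant_id)
                continue
            log.info("rerunning for tenant %s", tenant_id)
            run_for_tenant(tenant_id)
            already_done.add(tenant_id)
            log.info("finished %s (%d/%d)", tenant_id, len(already_done), len(tenant_ids))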

Resolution: By 21:20 CEST, the required adjustments were implemented, and normal operations were successfully restored.

What We Learned

This incident highlighted vulnerabilities in our approach to script execution and monitoring. Key takeaways include:

  • Enhance the resilience of our scripts and make their operation more visible and transparent.
  • Enforce continuous monitoring of long-running scripts, with a handover plan for operations that involve all tenants (a minimal watchdog sketch follows below).
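As one possible shape for that monitoring, the sketch below wraps a long-running script in a simple watchdog that pages on-call when a runtime budget is exceeded, so the operation is never left running unnoticed across a shift handover. The command, the budget, and the alert hook are assumptions for illustration, not a description of our actual tooling.

    import subprocess
    import threading

    MAX_RUNTIME_SECONDS = 2 * 60 * 60  # hypothetical 2-hour budget

    def alert(message: str) -> None:
        """Placeholder for paging on-call via the alerting system."""
        print(f"ALERT: {message}")

    def run_with_watchdog(command: list[str]) -> int:
        # Start the script and raise an alert if it exceeds its runtime budget.
        proc = subprocess.Popen(command)
        timer = threading.Timer(
            MAX_RUNTIME_SECONDS, alert,
            args=(f"{command!r} still running after {MAX_RUNTIME_SECONDS}s",),
        )
        timer.start()
        try:
            return proc.wait()
        finally:
            timer.cancel()

    # Example (hypothetical script name): run_with_watchdog(["python", "tenant_script.py"])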
Posted Jun 04, 2025 - 19:54 CEST

Resolved

From 16:30 CEST to 21:20 CEST, we saw degraded performance of some of our services.
Our Engineering team has resolved the issue and will shortly publish a postmortem with more details about what led to this incident.
If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Posted Jun 03, 2025 - 21:31 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 03, 2025 - 21:22 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted Jun 03, 2025 - 20:00 CEST

Update

We are continuing to investigate this issue.
Posted Jun 03, 2025 - 19:14 CEST

Investigating

Some of our services are still experiencing degraded performance. We have already resolved some of the related issues, but we still see degradation on some clusters. Our team is engaged in identifying the underlying cause and working towards a resolution. We apologize for any inconvenience this may cause and thank you for your understanding.
Posted Jun 03, 2025 - 19:11 CEST

Update

We are continuing to monitor for any further issues.
Posted Jun 03, 2025 - 18:43 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 03, 2025 - 18:43 CEST

Update

We are continuing to work on a fix for this issue.
Posted Jun 03, 2025 - 18:12 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted Jun 03, 2025 - 18:12 CEST

Update

Some of our services are currently experiencing degraded performance. Our team is engaged in identifying the underlying cause and working towards a resolution. We apologize for any inconvenience this may cause and thank you for your understanding.
Posted Jun 03, 2025 - 18:12 CEST

Update

We are continuing to investigate this issue.
Posted Jun 03, 2025 - 18:02 CEST

Investigating

We are currently investigating this issue.
Posted Jun 03, 2025 - 18:02 CEST
This incident affected: Web Application and MySports API.