Global network experiencing issues

Incident Report for Karla

Postmortem

**Incident:** Network provider outage impact on certificate management

## Summary

A network provider outage revealed that we depended heavily on that provider's certificates for TLS. This created a single point of failure: we could not issue or manage certificates during the outage.

## What We Did

We implemented an alternative approach that automatically provisions certificates for our apps:
- Certificates now auto-renew on our own infrastructure
- Existing network provider certificates remain as a backup
- We can switch between providers by changing a single field
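
As an illustration of the single-field switch, here is a minimal sketch. It is not our actual configuration: the names `CertProvider`, `AppTlsConfig`, `cert_provider`, `certificate_source`, and `switch_provider`, along with the example hostname, are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass, replace
from enum import Enum


class CertProvider(Enum):
    # Hypothetical values; this report does not name the real field or its options.
    NETWORK_PROVIDER = "network_provider"
    SELF_MANAGED = "self_managed"


@dataclass(frozen=True)
class AppTlsConfig:
    hostname: str
    # The "single field" that selects where the app's certificate comes from.
    cert_provider: CertProvider = CertProvider.SELF_MANAGED


def certificate_source(config: AppTlsConfig) -> str:
    """Return a label describing which certificate the app would serve."""
    if config.cert_provider is CertProvider.SELF_MANAGED:
        return f"self-managed certificate for {config.hostname}"
    return f"network provider certificate for {config.hostname}"


def switch_provider(config: AppTlsConfig, provider: CertProvider) -> AppTlsConfig:
    """Return a copy of the config with only the provider field changed."""
    return replace(config, cert_provider=provider)


if __name__ == "__main__":
    primary = AppTlsConfig(hostname="api.example.com")
    backup = switch_provider(primary, CertProvider.NETWORK_PROVIDER)
    print(certificate_source(primary))
    print(certificate_source(backup))
```

Keeping the choice in one field means that falling back to the backup certificates is a configuration change and not a redeploy of certificate logic.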

## Lessons Learned

We were more dependent on our network provider than we realized. We now have a fallback plan and can continue operating if a similar outage occurs.

Posted Nov 18, 2025 - 16:11 UTC

Resolved

The network issues have been resolved and services are operating normally. We're continuing to monitor closely. Thank you for your patience.

Posted Nov 18, 2025 - 14:42 UTC

Update

Our remediation procedure has not succeeded in temporarily removing our dependency on our network provider. While our network provider works on a fix, we are generating our own certificates to bring services back as soon as possible.

Posted Nov 18, 2025 - 13:48 UTC

Update

Our network provider reports that it has found the root cause of the issue. Independently of their fix, we expect our systems to work once DNS changes have propagated.

Posted Nov 18, 2025 - 13:12 UTC

Monitoring

We are continuing to monitor our systems and still see an increased number of network errors.

Posted Nov 18, 2025 - 13:08 UTC

Investigating

We have implemented a fix and we are now switching back to our main network provider. Service will return to normal once DNS changes propagate and client caches update (typically 5-60 minutes depending on TTL settings).

Posted Nov 18, 2025 - 12:50 UTC

Update

We temporarily switched to an alternative DNS provider during the outage.

Posted Nov 18, 2025 - 12:30 UTC

Monitoring

Our monitoring systems have reported that a significant number of requests to our servers are failing.

Posted Nov 18, 2025 - 11:32 UTC

This incident affected: REST API, MCP, Tracking Pages and Resolve, and Portal.