Write-up
Partial outage on Ghost(Pro)

The root cause of this incident was an invisible change made by our upstream provider, which caused one of our database servers to behave as though it was under significantly more load than usual.

We subsequently spent far too long, initially trying to find the source of the load and fix it, and then, after discovering the issue was upstream, waiting for their resolution rather than failing over to a new server.

We completed a detailed postmortem process. The incident lasted for far longer than we feel is acceptable, and although the root cause was upstream, the time to recovery was entirely within our control.

We’re making significant changes to how we respond to similar incidents in the future, including:

  • An improved team approach to analysing and remediating high-impact incidents

  • Better default remediation steps for DB performance issues

  • Changes to our DB cluster setups to make performance issues easier to mitigate

  • Our upstream vendor has also changed their policy on when they will perform changes and how they will communicate them to us

These changes will help us recover service significantly faster in the future.

We apologise to everyone who was impacted by this issue. If you have further questions, please reach out to support@ghost.org.