Hi James,
We looked into the timeline of events and the cause of the issue. The facts are as follows:
- 2023-07-17 17:49:17 UTC
  - a shutdown request was sent to your project's Postgres service
  - internally, we use systemd to manage services
  - the default value for a systemd service's TimeoutStopSec is 90 seconds
  - this setting specifies how long systemd waits, after sending a shutdown signal, for a process to exit before killing it outright
  - 90 seconds is usually enough time for Postgres to exit, but given your project's database size and the number of IO operations it was running when it received the shutdown signal, it proved insufficient
- 2023-07-17 17:50:47 UTC
  - systemd terminated the Postgres process, forcing an ungraceful shutdown of your project's database
- 2023-07-17 17:51:43 UTC
  - your project's Postgres service attempted to boot and replay WAL changes, exactly as you assumed in the outage doc you shared
  - unfortunately, since the default for TimeoutStartSec is also 90 seconds, systemd signalled Postgres to terminate 90 seconds after startup, just as it did when this issue started
  - this effectively put your database service in shutdown mode for another 90 seconds, before systemd killed the Postgres process and the chain of events repeated
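For reference, the arithmetic behind the timeline can be sketched in a few lines of Python. This is only an illustration: it uses the two 90-second defaults and the first timestamp quoted above, and the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Default systemd timeouts cited in the timeline (90 seconds each).
TIMEOUT_STOP = timedelta(seconds=90)   # TimeoutStopSec
TIMEOUT_START = timedelta(seconds=90)  # TimeoutStartSec

# Time the shutdown request was sent, taken from the timeline.
shutdown_requested = datetime(2023, 7, 17, 17, 49, 17, tzinfo=timezone.utc)

# systemd kills the process once TimeoutStopSec has elapsed after the request.
forced_kill = shutdown_requested + TIMEOUT_STOP

# One loop iteration: 90 s allowed for startup, then 90 s allowed for shutdown.
restart_cycle = TIMEOUT_START + TIMEOUT_STOP

print(forced_kill.isoformat())        # 2023-07-17T17:50:47+00:00
print(restart_cycle.total_seconds())  # 180.0
```

The computed kill time matches the second entry in the timeline, and the 180-second cycle matches the roughly 3-minute spacing between Postgres starts.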
You can confirm this in the Postgres service's logs, which you also have access to, by searching for "starting PostgreSQL 15.1".
The time interval between Postgres starts is roughly 3 minutes (90 seconds for shutdown + 90 seconds for startup).
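If it helps to check the spacing yourself, here is a minimal Python sketch that measures the gaps between start messages in exported log lines. It assumes each line begins with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match your export format. The function name and the sample lines are placeholders for illustration and are not copied from your project's logs.

```python
from datetime import datetime

MARKER = "starting PostgreSQL 15.1"  # the search string mentioned above


def start_intervals(log_lines):
    """Return the gaps, in seconds, between consecutive Postgres start lines.

    Assumption: every line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp.
    Adjust the slice and the format string to match the real export.
    """
    starts = [
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in log_lines
        if MARKER in line
    ]
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(starts, starts[1:])
    ]


# Placeholder lines for illustration only; they are NOT real log output.
sample = [
    "2023-07-17 17:51:43 UTC LOG: starting PostgreSQL 15.1",
    "2023-07-17 17:53:00 UTC LOG: some unrelated message",
    "2023-07-17 17:54:43 UTC LOG: starting PostgreSQL 15.1",
]

print(start_intervals(sample))  # [180.0]
```

Gaps close to 180 seconds would be consistent with the 90-second startup and 90-second shutdown timeouts described in the timeline.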
To restore your database service, we had to intervene and apply a set of changes that address this behaviour, namely changes to the systemd process lifecycle timeouts.
We have already addressed this issue for newer projects, but the rollout to projects provisioned prior to these changes is still in progress. Unfortunately, the changes hadn't yet reached your project when this took place.
We've prioritized your project in the rollout and it has now received the changes, which allowed Postgres to run its full recovery cycle and should prevent similar situations from recurring.
Please let me know if I can assist in any other way or provide any further information.
Best regards,
Paul Cioanca
Supabase Engineer