How to prevent/deal with corrupted WAL for Azure Database for PostgreSQL flexible server.

Konrad Wölms 0 Reputation points
2025-02-24T10:47:06.3266667+00:00

We are running a Postgres instance with the TimescaleDB extension on Azure and are currently unable to log into the service from any client. Connection attempts fail with the error
"Connection refused Is the server running on that host and accepting TCP/IP connections?"
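
The symptom can be checked independently of any PostgreSQL client with a plain TCP probe. The following is only a minimal sketch: the function name `probe_tcp` is made up for illustration, the host name in the usage comment is a placeholder, and 5432 is assumed as the port. A "refused" result only shows that nothing is accepting connections on that port. It does not by itself confirm WAL corruption.

```python
import socket


def probe_tcp(host: str, port: int = 5432, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt.

    Returns 'open', 'refused', 'timeout', or 'unreachable'.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (e.g. a firewall dropping packets).
        return "timeout"
    except OSError:
        # DNS failure, no route to host, and similar errors.
        return "unreachable"


# Usage (placeholder host name, replace with the real server name):
# print(probe_tcp("my-test-server.postgres.database.azure.com"))
```

A "timeout" or "unreachable" result would point towards networking or firewall configuration rather than the server process itself.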

We had the same issue a month ago and attempted a restart of the instance, but the service was not able to fully restart and remained stuck in the "restarting" state. At the time we contacted Support, and their analysis concluded that the WAL had somehow become corrupted. They reset it, presumably with something like "pg_resetwal", after which the instance was available again. We reset the database, and afterwards the instance ran fine until last Friday. While the error message does not provide enough detail to be certain that this is the same issue again, it seems likely.

I have the following related questions: Is there any way to determine, from the user's perspective, whether WAL corruption is responsible again? If so, how can we best diagnose what causes the WAL corruption in order to prevent it in the future?

The instance belongs to a non-critical test system and we don't need the actual data. It is currently still left in the problematic state so that we can better learn how to deal with this issue, which has now presumably happened a second time. However, we only have point-in-time restore available until this Thursday. We would like to use that to restore the database if we don't learn of better options by then, so any quick feedback is very much appreciated.

Azure Database for PostgreSQL