Spent my morning figuring out why Nginx was dead on a server with many days of uptime. No reboot, no kernel panic. Just... down. Ubuntu 24.04.
The cause? An automatic unattended-upgrade of libc6. This prompted systemd to work its magic, wisely deciding to restart every running service to apply the patch. Fine.
The problem is, in the exact same minute, the systemd timer for certbot decided it was time to renew certificates.
The result:
- systemd stops Nginx.
- Port 80 becomes free.
- certbot, in standalone mode, immediately grabs it for validation.
- systemd tries to restart Nginx, which fails with "Address already in use".
The web server was knocked offline by its own certificate renewal script.
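The sequence above can be reproduced in isolation. What follows is a minimal Python sketch, not the actual setup: two plain sockets stand in for certbot and Nginx, the helper names `try_bind` and `simulate_race` are made up for illustration, and an OS-assigned port replaces port 80 so it runs without root.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind and listen on (host, port).

    Returns (socket, None) on success or (None, errno) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def simulate_race():
    # Stand-in for certbot in standalone mode: it grabs the free port first.
    # Port 0 asks the OS for any free port (assumption: replaces port 80).
    certbot, _ = try_bind(0)
    port = certbot.getsockname()[1]

    # Stand-in for Nginx being restarted: the port is already taken.
    _, nginx_err = try_bind(port)

    # Once "certbot" releases the port, a later start attempt succeeds.
    certbot.close()
    retry, retry_err = try_bind(port)
    if retry is not None:
        retry.close()
    return nginx_err, retry_err


if __name__ == "__main__":
    nginx_err, retry_err = simulate_race()
    print("first start failed with EADDRINUSE:", nginx_err == errno.EADDRINUSE)
    print("retry after release succeeded:", retry_err is None)
```

The point of the sketch is that nothing is wrong with either process on its own: the second bind fails only because of who got to the port first.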
I swear, this is the kind of cascading failure that has never happened to me in years of running *BSD. With a classic cron job, certbot would have failed, logged an error, and tried again the next day. The web server would have remained untouched.
systemd was doing its job, but the failure came from how the service restart and the renewal timer interacted.
Sometimes, too much automation and too many interconnected parts just create more spectacular ways for things to break.
@stefano How did you even trace this down??
-
@chebra looking carefully at the logs and studying the timing and interactions
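One way to study that kind of timing is to pull the journal entries of the involved units for the same time window into a single timeline. A rough sketch, under assumptions: the unit names `nginx.service` and `certbot.service` and the timestamps are placeholders, not taken from the actual incident, and `build_journalctl_cmd` is a hypothetical helper.

```python
import shutil
import subprocess


def build_journalctl_cmd(units, since, until):
    """Build a journalctl command that interleaves several units in one window."""
    cmd = ["journalctl", "--no-pager", "-o", "short-iso",
           "--since", since, "--until", until]
    for unit in units:
        # Multiple -u options are combined, so entries from all units
        # appear together in chronological order.
        cmd += ["-u", unit]
    return cmd


if __name__ == "__main__":
    # Placeholder unit names and time window (assumptions, adjust to your system).
    cmd = build_journalctl_cmd(
        ["nginx.service", "certbot.service"],
        "2024-01-01 06:00:00",
        "2024-01-01 06:10:00",
    )
    if shutil.which("journalctl") is None:
        print("journalctl not found; command would be:", " ".join(cmd))
    else:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout)
```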
@stefano Oh I need to step up my logging game... a lot...
-
@chebra for me, logs are critical. They saved my life so many times...