Spent my morning figuring out why Nginx was dead on a server with many days of uptime.
-
Spent my morning figuring out why Nginx was dead on a server with many days of uptime. No reboot, no kernel panic. Just... down. Ubuntu 24.04.
The cause? An automatic unattended-upgrade of libc6. This prompted systemd to work its magic, wisely deciding to restart every running service to apply the patch. Fine.
The problem is, in the exact same minute, the systemd timer for certbot decided it was time to renew certificates.
The result:
- systemd stops Nginx.
- Port 80 becomes free.
- certbot, in standalone mode, immediately grabs it for validation.
- systemd tries to restart Nginx, which fails with "Address already in use".
The web server was knocked offline by its own certificate renewal script.
I swear, this is the kind of cascading failure that has never happened to me in years of running *BSD. With a classic cron job, certbot would have failed, logged an error, and tried again the next day. The web server would have remained untouched.
systemd was doing its job, but something failed because of the interactions.
Sometimes, too much automation and too many interconnected parts just create more spectacular ways for things to break.
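The port conflict in that sequence can be reproduced in a few lines. This is only a sketch with plain sockets: the first socket stands in for certbot in standalone mode, the second for the restarted Nginx, and the helper name `try_bind` is made up for illustration. It uses a random free port on localhost instead of port 80, so it runs without root.

```python
import errno
import socket


def try_bind(port: int) -> socket.socket:
    """Bind and listen on 127.0.0.1:port and return the listening socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


# Stand-in for certbot in standalone mode: it grabs the port first.
# Port 0 asks the OS for any free port, so no privileges are needed.
first = try_bind(0)
port = first.getsockname()[1]

# Stand-in for Nginx being started again: the port is already taken.
try:
    second = try_bind(port)
    second.close()
    conflict = None
except OSError as exc:
    conflict = exc.errno

first.close()
print("second bind failed with EADDRINUSE:", conflict == errno.EADDRINUSE)
```

Whichever process binds first wins, and the loser gets the same "Address already in use" error no matter which of the two is the actual web server.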
@stefano Says more about certbot than systemd though.
The web server can just stay up by using the other ACME challenges (which can be DNS or reverse-proxying to the ACME client), so the web server never has to go down.
-
@lanodan when I create some cron jobs, I force the "--nginx" or "--apache" - so it will never start listening. The script shipped with Ubuntu seems to fall back to "standalone" mode if nginx|apache isn't running.
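One way to make the "never fall back" rule explicit is a small wrapper that skips the renewal when Nginx is not running. This is a rough sketch, not the script Ubuntu ships: the function names are hypothetical, and it assumes that `certbot renew --nginx` is how the plugin gets forced, as in the cron jobs mentioned above.

```python
import subprocess
import sys


def nginx_is_active(run=subprocess.run) -> bool:
    """Return True if systemd reports the nginx unit as active."""
    result = run(["systemctl", "is-active", "--quiet", "nginx"])
    return result.returncode == 0


def renewal_command(nginx_active: bool):
    """Return the certbot command to run, or None to skip this run.

    Assumption: "--nginx" forces the nginx plugin, so certbot never
    opens its own listener on port 80.
    """
    if not nginx_active:
        return None
    return ["certbot", "renew", "--nginx"]


def main() -> int:
    """Run the renewal, or skip it with an error if nginx is down."""
    cmd = renewal_command(nginx_is_active())
    if cmd is None:
        print("nginx is not active, skipping renewal", file=sys.stderr)
        return 1
    return subprocess.run(cmd).returncode
```

A cron job or timer would call `sys.exit(main())`. If Nginx is down at that moment, the run fails, logs one line, and the next scheduled run tries again.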
-
@farooqkz looking at the logs, it seems that certbot will run in --nginx mode if it finds an active nginx - but it didn't find one when launched, so it used standalone mode
I agree about the problem with Ubuntu here. But I don't think certbot's behavior is fine here either.
I don't think running
certbot --nginx
and then having it fall back to standalone without an explicit request from the user (here, the sysadmin) aligns well with Unix philosophy and design. To be honest, certbot itself doesn't align very well with Unix philosophy, IMO.
-
@farooqkz I agree. On many of my servers, I'm using acme.sh or lego. Or acme-client on OpenBSD, of course
-
@stefano Wow.. talk about the worst timing!
-
@stefano Call me ol'fashioned ☺️
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
-
@stefano How did you even trace this down??
-
@chebra looking carefully at the logs and studying the timing and interactions
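For this kind of timing analysis, it can help to put the logs from all the involved services on a single timeline. Below is a minimal sketch that assumes each line starts with an ISO timestamp followed by a space. The function names and the sample lines are made up for illustration and are not real output from apt, Nginx or certbot.

```python
import heapq
from datetime import datetime


def parse(line: str, source: str):
    """Split 'ISO-timestamp message' into (datetime, source, message)."""
    stamp, message = line.split(" ", 1)
    return datetime.fromisoformat(stamp), source, message


def merged_timeline(logs: dict) -> list:
    """Merge per-source log lines (each already in time order) into one list."""
    streams = [
        [parse(line, source) for line in lines]
        for source, lines in logs.items()
    ]
    return list(heapq.merge(*streams))


# Made-up sample lines, only to show the interleaving.
sample = {
    "upgrade": ["2024-01-01T06:00:01 restarting services"],
    "nginx": [
        "2024-01-01T06:00:02 stopped",
        "2024-01-01T06:00:05 bind failed",
    ],
    "certbot": ["2024-01-01T06:00:03 listening on port 80"],
}

for when, source, message in merged_timeline(sample):
    print(when.time(), source, message)
```

With everything on one timeline, an event like "service B started listening between service A's stop and start" shows up as three adjacent lines instead of being spread over three files.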
-
@stefano Oh I need to step up my logging game... a lot...
-
@chebra for me, logs are critical. They saved my life so many times...