Spent my morning figuring out why Nginx was dead on a server with many days of uptime.
-
Hmm, I think the problem here is using certbot in standalone mode. Don't you think so?
@farooqkz looking at the logs, it seems that certbot will run in --nginx mode if it finds an active nginx - but it didn't find one when launched, so it used standalone mode
-
@stefano I'm not sure about the renew subcommand of certbot, but I know there is an --nginx flag that will tell certbot to use the already running nginx instance. You would need the python3-certbot-nginx package installed.
@hyperreal it usually runs in --nginx mode - but looking at the logs, it seems it didn't detect a running nginx, so it fell back to standalone mode
-
Spent my morning figuring out why Nginx was dead on a server with many days of uptime. No reboot, no kernel panic. Just... down. Ubuntu 24.04.
The cause? An automatic unattended-upgrade of libc6. This prompted systemd to work its magic, wisely deciding to restart every running service to apply the patch. Fine.
The problem is, in the exact same minute, the systemd timer for certbot decided it was time to renew certificates.
The result:
- systemd stops Nginx.
- Port 80 becomes free.
- certbot, in standalone mode, immediately grabs it for validation.
- systemd tries to restart Nginx, which fails with "Address already in use".
The web server was knocked offline by its own certificate renewal script.
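For anyone who wants to see the last step of that sequence in isolation, here is a minimal Python sketch. It is not certbot or Nginx code: the helper name `try_bind` is made up for illustration, and it uses a throwaway loopback port instead of port 80 so it runs without root. The first listener stands in for certbot's standalone server, the second bind for the Nginx restart.

```python
import errno
import socket


def try_bind(port: int) -> socket.socket:
    """Bind and listen on 127.0.0.1:port, closing the socket if that fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


# Stand-in for the standalone listener: port 0 lets the OS pick a free port.
first = try_bind(0)
port = first.getsockname()[1]

# Stand-in for the restarted web server trying to take the same port back.
try:
    second = try_bind(port)
    second.close()
    outcome = "bound"
except OSError as exc:
    # Map the numeric errno to its symbolic name.
    outcome = errno.errorcode.get(exc.errno, str(exc.errno))
finally:
    first.close()

print(outcome)
```

Whoever binds first keeps the port until it closes the socket; the second process just gets the error, which is the "Address already in use" Nginx logged.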
I swear, this is the kind of cascading failure that has never happened to me in years of running *BSD. With a classic cron job, certbot would have failed, logged an error, and tried again the next day. The web server would have remained untouched.
systemd was doing its job, but something failed because of the interactions.
Sometimes, too much automation and too many interconnected parts just create more spectacular ways for things to break.
@stefano I read this as a simple race condition for port 80, and can't see how this is an "only on Linux" thing.
-
@monospace I didn't say it's a problem "only on Linux". It's more of a "let's make things complex" problem. The fact that it's never happened on BSDs is directly related to the fact that they don't provide that kind of automation - so it can't break anything. 🙂
-
@stefano To me, this is just a coincidence of two scheduled jobs (package upgrades and certificate renewal) running at the same time. Maybe I'm missing something, but port 80 being open to be taken over by certbot would have happened with a traditional cron job on any old Unix system just the same.
-
@monospace a certbot renewal cron job usually enforces --nginx (or --apache), so it would fail if nginx/apache is down. This script tries to detect whether nginx or apache is running and, if not, runs certbot in standalone mode. That is what created the problem - otherwise, it would just fail and retry the next morning.
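A rough sketch of that "fail instead of falling back" idea, in Python. Everything here is an assumption for illustration, not the script Ubuntu ships: the function names are hypothetical, the unit is assumed to be called nginx, and the default command `certbot renew --nginx` is assumed to be the renewal invocation you want. The only decision it makes is to exit non-zero when nginx is not active, rather than picking another mode.

```python
import subprocess
import sys
from typing import Callable, Sequence


def nginx_is_active() -> bool:
    """True when systemd reports the (assumed) nginx unit as active."""
    # `systemctl is-active --quiet UNIT` exits 0 only if the unit is active.
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", "nginx"], check=False
    )
    return result.returncode == 0


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(cmd), check=False).returncode


def renew_or_fail(
    is_active: Callable[[], bool] = nginx_is_active,
    renew_cmd: Sequence[str] = ("certbot", "renew", "--nginx"),  # assumed command
    run: Callable[[Sequence[str]], int] = _run,
) -> int:
    """Renew only while nginx is up; otherwise fail without touching port 80."""
    if not is_active():
        print("nginx is not active; skipping renewal", file=sys.stderr)
        return 1
    return run(renew_cmd)
```

A cron job could call `renew_or_fail()` and pass the return value to `sys.exit`. There is still a small window between the check and the renewal, so this narrows the race rather than removing it.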
-
@stefano Wait, no, I see your point. I had to edit in "post-install restarts", and realized that systemd taking care of that is indeed something special. I will still put the blame on certbot for taking over port 80 even though instructed to use nginx. That should have resulted in a fatal error.
-
@stefano I agree, it's certbot's behaviour that caused the issue in the end, not systemd doing a good job at system maintenance.
-
@monospace Exactly, I agree.
-
@monospace I've updated the original post to clarify that systemd has done its job, but the interaction caused problems
-
@stefano Says more about certbot than systemd though.
The web server can just stay up if you use the other ACME challenges (DNS, or reverse-proxying the ACME client), so it never has to go down.
-
@lanodan when I create cron jobs, I force "--nginx" or "--apache" - so certbot will never start listening on its own. The script shipped with Ubuntu seems to fall back to "standalone" mode if nginx|apache isn't running.
-
I agree about the problem with Ubuntu here. But I don't think certbot's behavior is fine here either.
I don't think running certbot --nginx and then having it fall back to standalone without an explicit request from the user (here, the sysadmin) aligns well with Unix philosophy and design. To be honest, certbot itself doesn't align very well with Unix philosophy, IMO.
-
@farooqkz I agree. On many of my servers, I'm using acme.sh or lego. Or acme-client on OpenBSD, of course.