Spent my morning figuring out why Nginx was dead on a server with many days of uptime.
Hmm, I think the problem here is using certbot in standalone mode. Don't you think so?
@stefano I'm not sure about the renew subcommand of certbot, but I know there is a --nginx flag that will tell certbot to use the already running nginx instance. You would need the python3-certbot-nginx package installed.
Spent my morning figuring out why Nginx was dead on a server with many days of uptime. No reboot, no kernel panic. Just... down. Ubuntu 24.04.
The cause? An automatic unattended-upgrade of libc6. This prompted systemd to work its magic, wisely deciding to restart every running service to apply the patch. Fine.
The problem is, in the exact same minute, the systemd timer for certbot decided it was time to renew certificates.
The result:
- systemd stops Nginx.
- Port 80 becomes free.
- certbot, in standalone mode, immediately grabs it for validation.
- systemd tries to restart Nginx, which fails with "Address already in use".
The web server was knocked offline by its own certificate renewal script.
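The port conflict in the sequence above can be reproduced in isolation. This is only a minimal sketch, not what certbot or Nginx actually run: it assumes two plain TCP listeners on 127.0.0.1 and an OS-assigned port instead of port 80 (so no root is needed), and the helper name try_bind is hypothetical.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind and listen on host:port; return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # Same condition that is reported as "Address already in use".
            return None
        raise


# Stand-in for the standalone validator: it grabs a free port first.
# Port 0 asks the OS for any free port (an assumption to avoid needing port 80).
certbot_like = try_bind(0)
port = certbot_like.getsockname()[1]

# Stand-in for the web server being restarted while the port is still held.
nginx_like = try_bind(port)
print("second bind succeeded while port was held:", nginx_like is not None)

# Once the first listener releases the port, the same bind goes through.
certbot_like.close()
nginx_retry = try_bind(port)
print("bind succeeded after release:", nginx_retry is not None)
if nginx_retry is not None:
    nginx_retry.close()
```

Whichever process binds first wins, and the loser only recovers if something retries after the port is released.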
I swear, this is the kind of cascading failure that has never happened to me in years of running *BSD. With a classic cron job, certbot would have failed, logged an error, and tried again the next day. The web server would have remained untouched.
systemd was doing its job, but something failed because of the interactions.
Sometimes, too much automation and too many interconnected parts just create more spectacular ways for things to break.
@stefano I read this as a simple race condition for port 80, and can't see how this is an "only on Linux" thing.
@stefano To me, this is just a coincidence of two scheduled jobs (package upgrades and certificate renewal) running at the same time. Maybe I'm missing something, but port 80 being open to be taken over by certbot would have happened with a traditional cron job on any old Unix system just the same.
@stefano Wait, no, I see your point. I had to edit in "post-install restarts", and realized that systemd taking care of that is indeed something special. I will still put the blame on certbot for taking over port 80 even though instructed to use nginx. That should have resulted in a fatal error.
@stefano I agree, it's certbot's behaviour that caused the issue in the end, not systemd doing a good job at system maintenance.
@stefano Says more about certbot than systemd though.
The web server can simply stay up if you use one of the other ACME challenges (DNS, or reverse-proxying to the ACME client), so it never has to go down.
I agree about the Ubuntu problem here, but I don't think certbot's behavior is fine either.
I don't think running certbot --nginx and then having it fall back to standalone without an explicit request from the user (here, the sysadmin) aligns well with Unix philosophy and design. To be honest, certbot itself doesn't align very well with Unix philosophy, IMO.
@stefano How did you even trace this down??
@stefano Oh I need to step up my logging game... a lot...