Great, my home server is being hammered and my connection has dropped dramatically in performance. Time to look into fail2banning those assholes. This time it's apparently a huge subnet, 146.174.x.x
- 
Gone for a manual ban for the time being, but I'm going to have to look into something more sophisticated. I'll probably take some ideas from @ansuz
https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/
but with all the IP hopping they do I wonder how effective it could be. The pattern between a real user and a bot is actually pretty easy to detect “in principle”: UAs for actual users, after fetching a web page, also fetch the associated auxiliary files (CSS, possibly JS). These bots don't even do that.
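In code, the auxiliary-file heuristic described above might look something like this rough sketch (the two-field log format and the IPs are invented for illustration; a real version would parse your actual access log):

```python
import re
from collections import defaultdict

# Sketch of the heuristic above: an IP that fetches HTML pages but never
# any CSS/JS assets is likely a scraper. Log lines here are simplified to
# "IP PATH" pairs; adapt the parsing to your real access-log format.
ASSET_RE = re.compile(r"\.(css|js|woff2?|png|jpg|svg|ico)(\?|$)", re.I)

def suspicious_ips(log_lines):
    pages = defaultdict(int)   # page-like requests per IP
    assets = defaultdict(int)  # auxiliary-file requests per IP
    for line in log_lines:
        ip, path = line.split(None, 1)
        if ASSET_RE.search(path):
            assets[ip] += 1
        else:
            pages[ip] += 1
    # Flag IPs that requested pages but zero auxiliary files.
    return {ip for ip in pages if assets[ip] == 0}

log = [
    "198.51.100.7 /index.html",
    "198.51.100.7 /style.css",
    "146.174.3.9 /index.html",
    "146.174.3.9 /post/1.html",
]
print(suspicious_ips(log))  # prints {'146.174.3.9'}
```

The weakness noted above applies: with IP-hopping bots, the per-IP grouping only works if you also aggregate over the surrounding subnet.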
- 
@oblomov I can see two /17 matching that, is that Huawei Cloud or the other one?
- 
I've given the fail2ban conf doc a quick read, but this doesn't seem easy to detect, and it would probably be of limited use unless the detection is done at a wider subnet level.
- 
@Uilebheist no idea. For the time being I've manually banned 146.174.0.0/16 and 202.76.0.0/16, and things are smoother.
- 
@oblomov Right, if one assumes it's 146.174.128.0/17 and 202.76.128.0/17, they are both Huawei Cloud. Which I note I was already banning, and I wonder why.
- 
@Uilebheist OK maybe I exaggerated for this 8-D
- 
@oblomov my approach involves a nodejs service which applies a chain of fairly complicated rules to categorize each request. Depending on how different requests are classified, it then writes offending IPs to a different log, which fail2ban follows. I don't think I could accomplish the same with fail2ban alone, or at least if I could it would be much less readable. Still, write-to-a-log-to-ban is a nice API and I appreciate that fail2ban handles the rest of the details with so little attention.
- 
@ansuz that's very useful information, thanks.
- 
@oblomov @ansuz It's even easier than that, and most bots can be caught on the first request: if the user agent contains Firefox/ or Chrome/, and you're serving on HTTPS, the request will[1] contain a sec-fetch-mode header too when coming from a real browser. Bots don't send it. Pair it with blocking the agents listed in ai.robots.txt, and ~90% of your bot traffic is gone. If you can afford to block Huawei's and Alibaba's ASNs, you've pretty much got rid of all of them. Many of the bots do download CSS, and some even fetch the JS too, by the way. And images? Some of them love 'em.

[1] Exceptions apply: if you put a page in Reader Mode in Firefox and reload while in Reader Mode, no sec-fetch-mode is sent. There are also some applications, like gnome-podcasts, that use a Firefox user agent but don't send sec-fetch-mode. While there will be false positives, most of them can be worked around, and the gain of catching all the lame bots far outweighs the cons, imo.
 
- 
@algernon @ansuz that's useful information too, thanks. I'm actually considering collecting more information about the request headers in general, to see if there are other subtle hints about them. Is there a way to tell Apache to log all request headers for every request? At least while debugging it'd come in handy.
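For what it's worth, stock Apache does ship a module for exactly this: mod_log_forensic writes every incoming request header for every request (file paths below are illustrative):

```apache
# Each request produces a "+id|request-line|Header:value|..." line when
# received and a matching "-id" line when it completes, so the forensic
# log captures the full header set per request.
LoadModule log_forensic_module modules/mod_log_forensic.so
ForensicLog /var/log/apache2/forensic.log
```

The log grows quickly, so this is best enabled only while debugging, which matches the use case above.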
- 
@oblomov @ansuz I'm not an Apache person, but this module might do the trick. I also have about a week's worth of logs from mid-April this year, iirc, with full headers, but I'll have to double check. The bots haven't changed much since. If that'd be useful for you, I'll go and figure out where I put them... they're somewhere on my storage server, just gotta find which bucket.
- 
@algernon @oblomov this aligns closely with my experience. So far I don't block on the absence of those specific headers because I want RSS readers to be able to get through. For the most part they should mainly be fetching the feed URL and maybe the site's favicon, but there are exceptions, as you noted. Some reader software will fetch arbitrary pages (at the user's request) and check for the existence of a <link rel="alternate"> tag. Since I strongly encourage readers to follow via RSS, I'd hate to ban them when they try to do so.
- 
@oblomov I have created a /llm and the bots love it. Behind that /llm is quixotic and link-maze
https://marcusb.org/hacks/quixotic.html
DM if you want me to share my setup and how I poison the well.
- 
@ansuz @oblomov Looking at my logs, most RSS readers are unaffected: they either use their own user agent, and don't try to pretend to be Firefox or Chrome, or they are running within a real browser, in which case the expected headers will be there. Quick look at my logs from yesterday:
- 2582 total requests against atom.xml on my blog
- 105 unique user agents
- Only 24 of those user agents had Chrome/ or Firefox/ in their user agent
- These 24 made 165 requests total
- Out of that 165, 54 did not have sec-fetch-mode
- Out of those 54, the majority came from Cloudflare, Amazon, or another cloud provider
- Still out of those 54, 21 pretended to be Firefox, but the user agent wasn't what a real browser sends: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0 - in real browsers, rv and the Firefox/ version match. (All 21 were from Cloudflare IPs too.)
- Still out of the 54, 27 pretended to be Chrome, but did not send a sec-ch-ua header, nor sec-fetch-mode, and they said they're Chrome/84.0.4147.105 from 2020, coming from Amazon AWS. I don't believe for a second these would be real browsers.

This leaves us with 6 requests that may have come from legit browsers. Five of those were Chrome on Android, coming from a DigitalOcean IP, without sec-fetch-mode or sec-ch-ua. I don't think those were legit. There was one Firefox/, coming from an American residential IP, without sec-fetch-mode... that might have been legit, maybe?

But out of 2.5k requests, 1 false positive[1] is, imo, acceptable. Of course, what's acceptable varies a lot, and the people who visit (or rather, subscribe to) my blog are likely a bit atypical. What I'm trying to convey here is that the majority of RSS readers don't pretend to be Firefox or Chrome, or - because they're running in one - send the appropriate headers anyway.

[1] It is likely a false positive; that IP made a single request the entire day.
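The rv/Firefox version-match test applied above can be sketched in a few lines (the spoofed UA string is the one quoted in the post; the helper name is made up):

```python
import re

# Sketch of the consistency check above: in genuine Firefox user agents
# the "rv:" token and the "Firefox/" version agree; spoofed strings
# often mix versions.
def rv_matches_firefox(ua: str) -> bool:
    rv = re.search(r"rv:([\d.]+)", ua)
    fx = re.search(r"Firefox/([\d.]+)", ua)
    if not rv or not fx:
        return False  # not claiming to be Firefox at all, or malformed
    return rv.group(1) == fx.group(1)

spoofed = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
genuine = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
print(rv_matches_firefox(spoofed))  # False
print(rv_matches_firefox(genuine))  # True
```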
 
- 
@mxfraud very interesting, thanks. I am interested in these kinds of setups. Have you also considered throttling those connections too, as in only having quixotic send at a rate of, like, 60 bytes per second or so?
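The 60-bytes-per-second idea could be prototyped as a chunked generator; this is only a sketch (quixotic itself isn't involved here, and a real server would do the delay through its async framework rather than blocking sleeps):

```python
import time

# Sketch of the throttling idea above: stream a (maze-generated) payload
# to the client in small chunks with a delay between them, so each bot
# connection is cheap for the server but slow for the scraper.
# chunk_size=60 with delay=1.0 approximates 60 bytes per second.
def drip(payload: bytes, chunk_size: int = 60, delay: float = 1.0):
    for i in range(0, len(payload), chunk_size):
        yield payload[i:i + chunk_size]
        time.sleep(delay)

# With delay=0 (for testing) a 150-byte payload splits into 60+60+30.
chunks = list(drip(b"x" * 150, delay=0.0))
print([len(c) for c in chunks])  # prints [60, 60, 30]
```

The trade-off: slow-dripping ties up a connection slot per bot, so it pairs best with an event-driven server rather than one thread per connection.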