Piero Bosio Personal Social Web Site (Fediverse)

Federated social forum with the rest of the world. Instances don't matter, people do.

Great, my home server is being hammered and my connection has dropped dramatically in performance.

  • I've given the fail2ban conf doc a quick read, but this doesn't seem easy to detect, and it would probably be of limited use unless the detection is done at a wider subnet level.

    @oblomov my approach involves a Node.js service that applies a chain of fairly complicated rules to categorize each request.

    Depending on how different requests are classified it then writes offending IPs to a different log which fail2ban follows. I don't think I could accomplish the same with fail2ban alone, or at least if I could it would be much less readable.

    Still, the write-to-a-log-to-ban is a nice API and I appreciate that fail2ban handles the rest of the details with so little attention.
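The write-to-a-log-to-ban pipeline described above can be sketched as a fail2ban filter/jail pair. Everything here is an assumption for illustration: the log path, log line format, and filter name are all hypothetical, not the poster's actual setup.

```ini
# /etc/fail2ban/filter.d/scraper-classifier.conf  (hypothetical name)
# Assumes the Node.js classifier writes one line per offender, e.g.:
#   2025-05-01T12:00:00Z BAN 203.0.113.7 reason=no-assets
[Definition]
failregex = ^\S+ BAN <HOST>\b

# /etc/fail2ban/jail.d/scraper-classifier.local
[scraper-classifier]
enabled  = true
filter   = scraper-classifier
logpath  = /var/log/scraper-classifier/offenders.log
maxretry = 1
bantime  = 1d
```

fail2ban then handles the actual firewall bans, as the poster notes.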

  • @ansuz that's very useful information, thanks.

  • Gone for a manual ban for the time being, but I'm going to have to look into something more sophisticated. I'll probably take some ideas from @ansuz
    https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/
    but with all the IP hopping they do I wonder how effective it could be.

    The difference between a real user and a bot is actually pretty easy to detect, in principle: a real user's browser, after fetching a web page, also fetches the associated auxiliary files (CSS, possibly JS). These bots don't even do that.
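The asset-fetching heuristic can be sketched roughly as follows. This is an illustration only, not anyone's production code; the function and names are mine, and a real version would also have to handle favicons, images, and cache hits.

```python
from collections import defaultdict

# Auxiliary files a real browser would normally fetch after a page load.
ASSET_EXTS = (".css", ".js")

def suspicious_ips(requests):
    """requests: iterable of (ip, path) tuples parsed from an access log.

    Returns the set of IPs that fetched pages but never any CSS/JS asset.
    """
    fetched_page = defaultdict(bool)
    fetched_asset = defaultdict(bool)
    for ip, path in requests:
        if path.endswith(ASSET_EXTS):
            fetched_asset[ip] = True
        else:
            fetched_page[ip] = True
    return {ip for ip in fetched_page if not fetched_asset[ip]}
```

A caveat from later in the thread: some bots do fetch CSS and JS, so this alone is not decisive.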

    @oblomov @ansuz It's even easier than that, and most bots can be caught on the first request: if the user-agent contains Firefox/ or Chrome/ and you're serving over HTTPS, the request will[1] also contain a sec-fetch-mode header when it comes from a real browser. Bots don't send it.

    Pair it with blocking agents listed in ai.robots.txt, and ~90% of your bot traffic is gone. If you can afford to block Huawei's and Alibaba's ASNs, you pretty much got rid of all of them.

    Many of the bots do download CSS, and some even fetch the JS too, by the way. And images? Some of them love 'em.


    1. Exceptions apply: if you put a page in Reader Mode in Firefox and reload while in Reader Mode, no sec-fetch-mode is sent. There are also some applications, like gnome-podcasts, that use a Firefox user-agent but don't send sec-fetch-mode. While there will be false positives, most of them can be worked around, and the gain of catching all the lame bots far outweighs the cons, imo. ↩︎
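A minimal sketch of this first-request check, assuming request headers have been lowercased; the function name is mine, and the header names follow the Fetch Metadata spec mentioned above.

```python
def looks_like_fake_browser(headers):
    """headers: dict of lowercased request header names to values.

    Real Firefox/Chrome over HTTPS send Sec-Fetch-Mode; bots that merely
    copy a browser User-Agent string usually do not.
    """
    ua = headers.get("user-agent", "")
    claims_browser = "Firefox/" in ua or "Chrome/" in ua
    return claims_browser and "sec-fetch-mode" not in headers
```

Note that a client with an honest non-browser user-agent (curl, a feed reader) is not flagged by this check at all.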

  • @algernon @ansuz that's useful information too, thanks. I'm actually considering collecting more information about the request headers in general to see if there are other subtle hints about them. Is there a way to tell Apache to log all request headers for every request? At least while debugging it'd come in handy.
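For reference, stock Apache can log headers in at least two ways; whether either is the unnamed module referenced in the thread is unknown. A sketch, with log paths as assumptions:

```apache
# Option 1: mod_log_forensic logs every request header for every request.
LoadModule log_forensic_module modules/mod_log_forensic.so
ForensicLog ${APACHE_LOG_DIR}/forensic.log

# Option 2: extend the access log with a few headers of interest.
LogFormat "%h %t \"%r\" %>s \"%{User-Agent}i\" \"%{Sec-Fetch-Mode}i\" \"%{Sec-CH-UA}i\"" headerdbg
CustomLog ${APACHE_LOG_DIR}/headers.log headerdbg
```

Option 1 is the debugging-friendly choice; Option 2 is cheaper to keep enabled long-term.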

  • @oblomov @ansuz I'm not an Apache person, but this module might do the trick.

    I also have about a week's worth of logs from mid-April this year, iirc, with full headers, but I'll have to double-check. The bots haven't changed much since. If that'd be useful for you, I'll go and figure out where I put them... they're somewhere on my storage server, just gotta find which bucket.

  • @algernon @ansuz thanks, that looks exactly like what I needed. I think I have enough scrapers attacking me these days that I hopefully won't need other people's logs ;-)

  • @algernon @oblomov This aligns closely with my experience.

    So far I don't block on the absence of those specific headers because I want RSS readers to be able to get through. For the most part they should mainly be fetching the feed URL and maybe the site's favicon, but there are exceptions as you noted.

    Some reader software will fetch arbitrary pages (at the user's request) and check for the existence of a <link rel="alternate"> tag. Since I strongly encourage readers to follow via RSS I'd hate to ban them when they try to do so 😅
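One way to reconcile the header check with the RSS concern above is to exempt feed endpoints from the ban entirely. A sketch with hypothetical paths and names; the feed URL list is an assumption, not the poster's actual configuration.

```python
# Paths feed readers legitimately hit; exempt them from the header check.
FEED_PATHS = ("/atom.xml", "/rss.xml", "/feed", "/favicon.ico")

def should_ban(path, headers):
    """headers: dict of lowercased request header names to values."""
    if any(path == p or path.startswith(p + "/") for p in FEED_PATHS):
        return False  # let feed readers through regardless of headers
    ua = headers.get("user-agent", "")
    claims_browser = "Firefox/" in ua or "Chrome/" in ua
    return claims_browser and "sec-fetch-mode" not in headers
```

Readers fetching arbitrary pages to discover a `<link rel="alternate">` tag would still be at risk, which is exactly the exception raised above.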

  • @oblomov I have created a /llm and the bots love it.

    Behind that /llm is quixotic and link-maze
    https://marcusb.org/hacks/quixotic.html

    DM if you want me to share my setup and how I poison the well.

  • @ansuz @oblomov Looking at my logs, most RSS readers are unaffected: they either use their own user agent and don't try to pretend to be Firefox or Chrome, or they are running within a real browser, in which case the expected headers will be there.

    Quick look at my logs from yesterday:

    • 2582 total requests against atom.xml on my blog.

    • 105 unique user agents

    • Only 24 of those user agents had Chrome/ or Firefox/ in their user agent

    • These 24 made 165 requests total.

    • Out of that 165, 54 did not have sec-fetch-mode.

    • Out of those 54, the majority came from either Cloudflare or Amazon, or another cloud provider.

    • Still out of those 54, 21 pretended to be Firefox, but the user agent wasn't what a real browser sends: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0 - in real browsers, rv and the Firefox/ version match. (All 21 were from Cloudflare IPs too)

    • Still out of the 54, 27 pretended to be Chrome, but did not send a sec-ch-ua header, nor sec-fetch-mode, and they said they're Chrome/84.0.4147.105 from 2020 - coming from Amazon AWS. I don't believe for a second these would be real browsers.

    This leaves us with 6 requests that may have come from legit browsers. Five of those were Chrome on Android, coming from a DigitalOcean IP, without sec-fetch-mode or sec-ch-ua. I don't think those were legit.

    There was one Firefox/, coming from an American residential IP, without sec-fetch-mode... that might have been legit, maybe?

    But out of 2.5k requests, 1 false positive[1] is, imo, acceptable.

    Of course, what's acceptable varies a lot, and the people who visit (or rather, subscribe to) my blog are likely a bit atypical.

    What I'm trying to convey here is that the majority of RSS readers don't pretend to be Firefox or Chrome, and those that do are usually running in a real browser and send the appropriate headers anyway.


    1. It is likely a false positive: that IP made a single request the entire day. ↩︎
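The rv:/Firefox/ consistency check used in the analysis above can be sketched as follows; the regexes and function name are mine, and only major version numbers are compared.

```python
import re

def firefox_ua_consistent(ua):
    """In genuine Firefox user-agent strings, the rv: token and the
    Firefox/ version agree; Cloudflare-style fakes often mismatch.

    Returns True/False, or None if the UA is not Firefox-shaped.
    """
    rv = re.search(r"rv:(\d+)", ua)
    fx = re.search(r"Firefox/(\d+)", ua)
    if not rv or not fx:
        return None  # not a Firefox-style UA; check does not apply
    return rv.group(1) == fx.group(1)
```

Applied to the example from the thread, `rv:109.0` paired with `Firefox/115.0` fails the check.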

  • @mxfraud very interesting, thanks. I am interested in these kinds of setups. Have you also considered throttling those connections, as in having quixotic send at a rate of only about 60 bytes per second?
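The throttling idea can be sketched as a chunked generator that paces output to a target byte rate. Names and parameters are mine; a real deployment would wire this into the web server's response streaming, and the injectable `sleep` is just there to make the sketch testable.

```python
import time

def drip(payload, rate_bytes_per_sec=60, chunk_size=10, sleep=time.sleep):
    """Yield payload in small chunks, pausing between chunks so the
    average throughput approximates rate_bytes_per_sec."""
    delay = chunk_size / rate_bytes_per_sec
    for i in range(0, len(payload), chunk_size):
        yield payload[i:i + chunk_size]
        sleep(delay)
```

At 60 bytes per second, a 6 kB maze page ties up a scraper's connection for well over a minute.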
