Great, my home server is being hammered and my connection has dropped dramatically in performance. Time to look into fail2banning those assholes. This time it's apparently a huge subnet, 146.174.x.x
- 
Gone for a manual ban for the time being, but I'm going to have to look into something more sophisticated. I'll probably take some ideas from @ansuz
https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/
but with all the IP hopping they do I wonder how effective it could be. The pattern between a real user and a bot is actually pretty easy to detect “in principle”: UAs for actual users, after fetching a web page, also fetch the associated auxiliary files (CSS, possibly JS). These bots don't even do that.
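In code, the auxiliary-file heuristic described above might look something like this rough sketch (the two-field log format and the IPs are invented for illustration; a real version would parse your actual access log):

```python
import re
from collections import defaultdict

# Sketch of the heuristic above: an IP that fetches HTML pages but never
# any CSS/JS assets is likely a scraper. Log lines here are simplified to
# "IP PATH" pairs; adapt the parsing to your real access-log format.
ASSET_RE = re.compile(r"\.(css|js|woff2?|png|jpg|svg|ico)(\?|$)", re.I)

def suspicious_ips(log_lines):
    pages = defaultdict(int)   # page-like requests per IP
    assets = defaultdict(int)  # auxiliary-file requests per IP
    for line in log_lines:
        ip, path = line.split(None, 1)
        if ASSET_RE.search(path):
            assets[ip] += 1
        else:
            pages[ip] += 1
    # Flag IPs that requested pages but zero auxiliary files.
    return {ip for ip in pages if assets[ip] == 0}

log = [
    "198.51.100.7 /index.html",
    "198.51.100.7 /style.css",
    "146.174.3.9 /index.html",
    "146.174.3.9 /post/1.html",
]
print(suspicious_ips(log))  # prints {'146.174.3.9'}
```

The weakness noted above applies: with IP-hopping bots, the per-IP grouping only works if you also aggregate over the surrounding subnet.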
- 
@oblomov I can see two /17 matching that, is that Huawei Cloud or the other one?
- 
I've given the fail2ban conf doc a quick read, but this doesn't seem easy to detect, and it would probably be of limited use unless the detection is done at a wider subnet level.
- 
@Uilebheist no idea. For the time being I've manually banned 146.174.0.0/16 and 202.76.0.0/16, and things are smoother.
- 
@oblomov Right, if one assumes it's 146.174.128.0/17 and 202.76.128.0/17, they are both Huawei Cloud. Which I note I was already banning, and I wonder why.
- 
@Uilebheist OK maybe I exaggerated for this 8-D
- 
@oblomov my approach involves a nodejs service which applies a chain of fairly complicated rules to categorize each request. Depending on how different requests are classified, it then writes offending IPs to a different log, which fail2ban follows. I don't think I could accomplish the same with fail2ban alone, or at least if I could it would be much less readable. Still, write-to-a-log-to-ban is a nice API and I appreciate that fail2ban handles the rest of the details with so little attention.
- 
@ansuz that's very useful information, thanks.
- 
@oblomov @ansuz It's even easier than that, and most bots can be caught on the first request: if the user agent contains Firefox/ or Chrome/, and you're serving on HTTPS, the request will[1] contain a sec-fetch-mode header too when coming from a real browser. Bots don't send it. Pair it with blocking the agents listed in ai.robots.txt, and ~90% of your bot traffic is gone. If you can afford to block Huawei's and Alibaba's ASNs, you've pretty much got rid of all of them. Many of the bots do download CSS, and some even fetch the JS too, by the way. And images? Some of them love 'em.

[1] Exceptions apply: if you put a page in Reader Mode in Firefox and reload while in Reader Mode, no sec-fetch-mode is sent. There are also some applications, like gnome-podcasts, that use a Firefox user agent but don't send sec-fetch-mode. While there will be false positives, most of them can be worked around, and the gain of catching all the lame bots far outweighs the cons, imo.
 
- 
@algernon @ansuz that's useful information too, thanks. I'm actually considering collecting more information about the request headers in general, to see if there are other subtle hints about them. Is there a way to tell Apache to log all request headers for every request? At least while debugging it'd come in handy.
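For what it's worth, stock Apache does ship a module for exactly this: mod_log_forensic writes every incoming request header for every request (file paths below are illustrative):

```apache
# Each request produces a "+id|request-line|Header:value|..." line when
# received and a matching "-id" line when it completes, so the forensic
# log captures the full header set per request.
LoadModule log_forensic_module modules/mod_log_forensic.so
ForensicLog /var/log/apache2/forensic.log
```

The log grows quickly, so this is best enabled only while debugging, which matches the use case above.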
- 
@oblomov @ansuz I'm not an Apache person, but this module might do the trick. I also have about a week's worth of logs from mid-April this year, iirc, with full headers, but I'll have to double check. The bots haven't changed much since. If that'd be useful for you, I'll go and figure out where I put them... they're somewhere on my storage server, just gotta find which bucket.
- 
@algernon @oblomov this aligns closely with my experience. So far I don't block on the absence of those specific headers because I want RSS readers to be able to get through. For the most part they should mainly be fetching the feed URL and maybe the site's favicon, but there are exceptions, as you noted. Some reader software will fetch arbitrary pages (at the user's request) and check for the existence of a <link rel="alternate"> tag. Since I strongly encourage readers to follow via RSS, I'd hate to ban them when they try to do so.
- 
@oblomov I have created a /llm and the bots love it. Behind that /llm is quixotic and link-maze
https://marcusb.org/hacks/quixotic.html
DM if you want me to share my setup and how I poison the well.
- 
@ansuz @oblomov Looking at my logs, most RSS readers are unaffected: they either use their own user agent, and don't try to pretend to be Firefox or Chrome, or they are running within a real browser, in which case the expected headers will be there. Quick look at my logs from yesterday:
- 2582 total requests against atom.xml on my blog
- 105 unique user agents
- Only 24 of those user agents had Chrome/ or Firefox/ in their user agent
- These 24 made 165 requests total
- Out of that 165, 54 did not have sec-fetch-mode
- Out of those 54, the majority came from Cloudflare, Amazon, or another cloud provider
- Still out of those 54, 21 pretended to be Firefox, but the user agent wasn't what a real browser sends: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0 - in real browsers, rv and the Firefox/ version match. (All 21 were from Cloudflare IPs too.)
- Still out of the 54, 27 pretended to be Chrome, but did not send a sec-ch-ua header, nor sec-fetch-mode, and they said they're Chrome/84.0.4147.105 from 2020, coming from Amazon AWS. I don't believe for a second these would be real browsers.

This leaves us with 6 requests that may have come from legit browsers. Five of those were Chrome on Android, coming from a DigitalOcean IP, without sec-fetch-mode or sec-ch-ua. I don't think those were legit. There was one Firefox/, coming from an American residential IP, without sec-fetch-mode... that might have been legit, maybe?

But out of 2.5k requests, 1 false positive[1] is, imo, acceptable. Of course, what's acceptable varies a lot, and the people who visit (or rather, subscribe to) my blog are likely a bit atypical. What I'm trying to convey here is that the majority of RSS readers don't pretend to be Firefox or Chrome, or - because they're running in one - send the appropriate headers anyway.

[1] It is likely a false positive; that IP made a single request the entire day.
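The rv/Firefox version-match test applied above can be sketched in a few lines (the spoofed UA string is the one quoted in the post; the helper name is made up):

```python
import re

# Sketch of the consistency check above: in genuine Firefox user agents
# the "rv:" token and the "Firefox/" version agree; spoofed strings
# often mix versions.
def rv_matches_firefox(ua: str) -> bool:
    rv = re.search(r"rv:([\d.]+)", ua)
    fx = re.search(r"Firefox/([\d.]+)", ua)
    if not rv or not fx:
        return False  # not claiming to be Firefox at all, or malformed
    return rv.group(1) == fx.group(1)

spoofed = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
genuine = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
print(rv_matches_firefox(spoofed))  # False
print(rv_matches_firefox(genuine))  # True
```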
 
- 
@mxfraud very interesting, thanks. I am interested in these kinds of setups. Have you also considered throttling those connections too, as in only having quixotic send at a rate of, like, 60 bytes per second or so?
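The 60-bytes-per-second idea could be prototyped as a chunked generator; this is only a sketch (quixotic itself isn't involved here, and a real server would do the delay through its async framework rather than blocking sleeps):

```python
import time

# Sketch of the throttling idea above: stream a (maze-generated) payload
# to the client in small chunks with a delay between them, so each bot
# connection is cheap for the server but slow for the scraper.
# chunk_size=60 with delay=1.0 approximates 60 bytes per second.
def drip(payload: bytes, chunk_size: int = 60, delay: float = 1.0):
    for i in range(0, len(payload), chunk_size):
        yield payload[i:i + chunk_size]
        time.sleep(delay)

# With delay=0 (for testing) a 150-byte payload splits into 60+60+30.
chunks = list(drip(b"x" * 150, delay=0.0))
print([len(c) for c in chunks])  # prints [60, 60, 30]
```

The trade-off: slow-dripping ties up a connection slot per bot, so it pairs best with an event-driven server rather than one thread per connection.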