sysadmins/webmasters of fedi:

jonny (good kind)

@ansuz
We found that blocking them just leads them to return with another IP but not follow the bait, redirecting to a tarpit seems to work, however.

ollibaba

@jonny @ansuz Here's a quite interesting post that also supports this approach: https://maurycyz.com/misc/the_cost_of_trash/

jonny (good kind)

@ollibaba
@ansuz
Yeah some of the bots get stuck in nepenthes and never come out. There have been some tencent bots rattling around in there for months last I checked. We get very little bot traffic on sciop, when I watch the request logs its mostly requests for RSS feeds from torrent clients and I very rarely see the kind of scraping activity I see on my other sites.

I think because

the bots hit the domain root first which only has the hidden crawler link in it and top-level nav links
each of the content-bearing index pages they would find are two-step lazy loads, where htmx triggers the load of more links after an initial page load, and only a subset of the crawlers seem to always have/start with full-browser emulation
many of the crawlers seem to do a "second pass" with a browser emulator if they complete the domain quickly or hit some block, I can't tell if they always do this or if its only below some threshold page count or something.
however since they are still crawling the sweet sweet tarpit, which is served under the same domain from a different machine so it just looks like normal pages, they seem perfectly content to just chow on that and dont seem to try and come back to the main site at least for awhile.

The tarpit is quite soothing, we have it trained on a combination of WWE announcer transcripts and Kropotkin's mutual aid among some other texts: https://sciop.net/crawlers/

Anecdotally, and I haven't tested this in a serious way, but having any kind of block seems to make it worse, since active countermeasures are a decent signal that you have some juicy human text in there you're trying to protect. When I put user agent blocks on my forgejo instance I noticed a substantial increase in traffic.

Also, p much all of the Anubis stuff was done by @ashley , I just watch the logs on sciop

StoneBear :potion_genderqueer:

@Argonel

I would absolutely fsck the hell out of somebody doing that. If it tries to get the bogus .js file, feed it something that grabs the incoming IP and feeds it back on itself .... basically make the bots DDOS themselves.

Like in the old days sending an XTREE packet (which would lock up a Windows box, back when the skript kiddies used such things...)

Don't MESS with Mama Bear, motherfrackers.

@ansuz @oblomov

ansuz / ऐरन

@stonebear2 @Argonel @oblomov the main complication is that it seems there are lots of different scraper projects exhibiting this behaviour, so there's not too many methods that can be applied broadly.

zip-bombs (as others have suggested) are fairly likely to be effective, but they require extra traffic on my part. blocking them is the most universal and scalable response.

that said, if I ever encounter a scraper that I simply can't block, I'll probably take the zip-bomb route.

ansuz / ऐरन

Since so many people are boosting this thread I think I'll take the opportunity to mention that I'm available for hire on a part-time or contract basis.

Feel free to reach out if you like my ideas about computer-related topics and have both the budget and need of someone who has such ideas.

I can be reached by Fediverse DM or the contact form on my website:

https://cryptography.dog/contact/

Colin the Mathmo

@ansuz My random thought is that if you block them quickly, they will notice and adapt. If you tarpit(*) them then it might be effective for longer, and have a greater chance of poisoning the bad actors.

(*) I know, verbing weirds language

CC: @Argonel @oblomov

Androcat

@ansuz

One thing to note is also that the Atlas "Browser" is a distributed scrape/crawl attack. It's an browser that users can use, but it actually scrapes the sites and sends the info to OpenAI.

So, yes, we should all expect extremely erratic scraping to be going on, and countermeasures are important.

Of course, if the "browser" doesn't appear to work, the users will stop using it.

@stonebear2 @Argonel @oblomov

ansuz / ऐरन

I just published a blog post summing up my most pertinent thoughts about dealing with badly-behaved web-scraping bots:

https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/

It isn't exactly a Hallowe'en-themed article, but today is the 31st and the topic is concerned with pranking people who come knocking on my website's ports, so it's somewhat appropriate.

#infosec #bots #halloween #scrapers #AI #someMoreHashtagsHere

ansuz / ऐरन

looks like someone reshared my article to Hacker News where someone has (predictably) already commented on the headline without reading the article 😅

https://news.ycombinator.com/item?id=45773347

ansuz / ऐरन

it's like reddit if every user was a meth-head tech bro

ansuz / ऐरन

my system for catching anomalous HTTP traffic just flagged someone sending me an API key as an "x-api-key" header.

that's about what I'd expect from the average HN reader

Colin the Mathmo

@ansuz Some of the replies/comments in that thread are just insane!

Scraping is fine, even when there's a robots.txt (and other signs) saying not to, but DDoSing is illegal and you should sue people who do that.

Just ... what ?!?

Some of these people are in a totally different reality.

No, not reality ... the other thing.

ansuz / ऐरन

@ColinTheMathmo oh yea, a lot of them have very clearly had way too much silicon-valley-kool-aid to drink and were basically incoherent.

I was actually pleasantly surprised at a few others, though, I can't help but feel that the more sensible ones might be wasting their time commenting on HN and should just join fedi

Neil Brown

@ansuz @ColinTheMathmo

Definitely a "take the rough with the smooth" thing.

I don't tend to read HN commentary on my stuff. (Sometimes, but not always)

Tris

@ansuz If Marginalia is already there then:
https://alpha.mwmbl.org
https://stract.com

Oblomov

@ansuz @ColinTheMathmo this has been my experience also in the rare cases when my writing gets shared on HN (or lobste.rs or reddit, FWIW): there a bunch of ignorami who are full-in on the corporate propaganda, and a few sane voices that try to highlight how bad that is for everyone. 1/2

Oblomov

@ansuz @ColinTheMathmo

On the one hand, I agree that they're mostly wasting their time and would do better to spend it elsewhere; OTOH, I am glad for the work they do, and who knows, maybe their words reach others that can then do better.

2/2

Scared Rune (Halloweeen Edition)

@ansuz responding to someone on hacker news that they're too good for HN and should join fedi would be absolutely hilarious to no one but me

ansuz / ऐरन

@rune I might have tried it if I had an account there, but then imagine if I was wrong and that was their one good take. fedi equivalent of buyer's remorse

Piero Bosio Social Web Site Personale

sysadmins/webmasters of fedi:

Feed RSS

Gli ultimi otto messaggi ricevuti dalla Federazione

Post suggeriti

Oggi mi sento stanco, svuotato, sottile, senza direzione, senza orizzonte, sospeso... spento.

🟢 EOLO ha attivato la nuova rete 5G FWA

Ciao!

Good morning to everyone who says “searching” and not “googling”.