sysadmins/webmasters of fedi:
-
On just this one website I count 23 unique IPs with wildly different user agents. Only 4 identify themselves as bots.
This domain receives far less traffic than any other, as I use it mostly as a personal blog and share it only with friends. I assume this will be much more effective if applied to my more public sites.
Several of my websites now feature a commented-out script tag linking to a non-existent file. Any IP requesting this file will be banned at the firewall level for a significant duration.
I'll give this a few days and report back on how many bad bots it catches.
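A minimal sketch of how the bait requests could be pulled out of an access log, assuming a common/combined log format. The bait path `/js/bait.js`, the function name `find_bait_hits`, and the sample log lines are all hypothetical; the firewall ban itself is not shown.

```python
import re
from collections import defaultdict

# Hypothetical bait path: the file referenced by the commented-out script tag.
BAIT_PATH = "/js/bait.js"

# Matches the start of a common/combined-format access log line:
# IP, identity, user, [timestamp], "METHOD path protocol"
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*"')


def find_bait_hits(lines, bait_path=BAIT_PATH):
    """Return {ip: hit_count} for every client that requested the bait file."""
    hits = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, _method, path = match.groups()
        # Ignore any query string when comparing paths.
        if path.split("?", 1)[0] == bait_path:
            hits[ip] += 1
    return dict(hits)


# Made-up sample lines using documentation-reserved IP ranges.
sample_log = [
    '203.0.113.7 - - [01/Jan/2025:10:00:00 +0000] "GET /js/bait.js HTTP/1.1" 404 153 "-" "Mozilla/5.0"',
    '198.51.100.4 - - [01/Jan/2025:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [01/Jan/2025:10:00:05 +0000] "GET /js/bait.js?v=2 HTTP/1.1" 404 153 "-" "curl/8.0"',
]

offenders = find_bait_hits(sample_log)
print(offenders)  # {'203.0.113.7': 2}
```

The resulting addresses could then be handed to whatever ban mechanism is in use; note that the same IP shows up here with two different user agents, which is the kind of thing a user-agent filter alone would miss.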
-
@ansuz that's a very interesting approach, do let us know, I'm still looking for ways to handle these things.
-
@oblomov I'll probably write a blog post about it.
I'm using a range of different tricks to classify humans/scrapers, and updating them all the time.
I don't expect this will make a huge difference in terms of absolute numbers, but it does seem like it will catch a lot of requests that otherwise look legitimate, so it feels like I've found a missing puzzle piece.
Scrapers are just greedier than any other type of agent, and that can be exploited ❤️
-
@ansuz much appreciated, thanks.
-
@Argonel @oblomov I've thought about it, and I have some code stashed somewhere which would do exactly that, but so far I've decided it's probably not worth the electricity.
The most efficient thing is to terminate the requests as quickly as possible and to let something low-level like the system firewall keep them from coming back for a while.
Maybe I'll change my mind in the future, but if we're lucky the bubble will crash and the problem will just go away.
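The "ban for a while" bookkeeping can be sketched as follows. This is only an in-memory illustration of time-limited bans, not the firewall setup described above; the class name `BanList` and the one-hour duration are assumptions.

```python
import time


class BanList:
    """Tracks time-limited bans; a stand-in for a firewall set with timeouts."""

    def __init__(self, duration_seconds, clock=time.monotonic):
        self.duration = duration_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._expiry = {}   # ip -> time at which the ban lapses

    def ban(self, ip):
        """Ban (or re-ban) an address for the configured duration."""
        self._expiry[ip] = self.clock() + self.duration

    def is_banned(self, ip):
        """Return True while the ban is active; forget it once it lapses."""
        expires_at = self._expiry.get(ip)
        if expires_at is None:
            return False
        if self.clock() >= expires_at:
            del self._expiry[ip]
            return False
        return True


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
bans = BanList(duration_seconds=3600, clock=lambda: now[0])
bans.ban("203.0.113.7")
print(bans.is_banned("203.0.113.7"))  # True
now[0] = 3601.0
print(bans.is_banned("203.0.113.7"))  # False
```

In practice the lookup would live in the firewall rather than in application code, so that banned requests are dropped before they cost anything.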
-
@ansuz
We found that blocking them just leads them to return with another IP but not follow the bait. Redirecting to a tarpit seems to work, however.
-
@jonny @ansuz Here's a quite interesting post that also supports this approach: https://maurycyz.com/misc/the_cost_of_trash/
-
@ollibaba
@ansuz
Yeah, some of the bots get stuck in nepenthes and never come out. There have been some tencent bots rattling around in there for months, last I checked. We get very little bot traffic on sciop; when I watch the request logs it's mostly requests for RSS feeds from torrent clients, and I very rarely see the kind of scraping activity I see on my other sites.
I think because
- the bots hit the domain root first, which only has the hidden crawler link and top-level nav links in it
- each of the content-bearing index pages they would find is a two-step lazy load, where htmx triggers the load of more links after an initial page load, and only a subset of the crawlers seem to always have/start with full-browser emulation
- many of the crawlers seem to do a "second pass" with a browser emulator if they complete the domain quickly or hit some block; I can't tell if they always do this or if it's only below some threshold page count or something
- however, since they are still crawling the sweet sweet tarpit, which is served under the same domain from a different machine so it just looks like normal pages, they seem perfectly content to just chow on that and don't seem to try to come back to the main site, at least for a while
The tarpit is quite soothing, we have it trained on a combination of WWE announcer transcripts and Kropotkin's mutual aid among some other texts: https://sciop.net/crawlers/
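For readers unfamiliar with how a tarpit can be "trained" on texts, here is a minimal sketch of a word-level Markov chain babbler. This is an assumption about the general technique, not a description of how nepenthes or the sciop tarpit is actually implemented; `build_chain`, `babble`, and the tiny corpus are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def babble(chain, length, rng):
    """Generate `length` words by walking the chain at random.

    Assumes the corpus had at least two words, so the chain is non-empty.
    """
    starts = sorted(chain)  # sorted so a seeded rng gives repeatable output
    word = rng.choice(starts)
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word only appeared last in the corpus): restart somewhere.
        word = rng.choice(followers) if followers else rng.choice(starts)
        output.append(word)
    return " ".join(output)


# Placeholder corpus; a real tarpit would be fed much larger texts.
corpus = "the bots crawl the pages and the pages feed the bots more pages"
chain = build_chain(corpus)
print(babble(chain, 12, random.Random(0)))
```

Every page of output is cheap to generate and locally plausible, which is what keeps a crawler following links instead of returning to the real site.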
Anecdotally (I haven't tested this in a serious way), having any kind of block seems to make it worse, since active countermeasures are a decent signal that you have some juicy human text in there you're trying to protect. When I put user agent blocks on my forgejo instance I noticed a substantial increase in traffic.
Also, pretty much all of the Anubis stuff was done by @ashley; I just watch the logs on sciop
-
I would absolutely fsck the hell out of somebody doing that. If it tries to get the bogus .js file, feed it something that grabs the incoming IP and feeds it back on itself .... basically make the bots DDOS themselves.
Like in the old days sending an XTREE packet (which would lock up a Windows box, back when the skript kiddies used such things...)
Don't MESS with Mama Bear, motherfrackers.
-
@stonebear2 @Argonel @oblomov the main complication is that it seems there are lots of different scraper projects exhibiting this behaviour, so there's not too many methods that can be applied broadly.
zip-bombs (as others have suggested) are fairly likely to be effective, but they require extra traffic on my part. blocking them is the most universal and scalable response.
that said, if I ever encounter a scraper that I simply can't block, I'll probably take the zip-bomb route.
-
I was just looking at my webserver logs while sipping coffee (as one does) and I noticed that one of my websites was receiving requests for a js file which I had prototyped but never actually deployed.
The script tag is present in the page, but it's commented out. I investigated, and it seems that scrapers see that tag and are trying to grab it even though it's completely non-functional. I guess they just want every bit of code they can find to help train an LLM.
This seems like a promising pattern for catching scrapers that pretend to be normal browsers.
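A small sketch of why this works, assuming the scrapers extract URLs by pattern-matching the raw HTML rather than parsing it the way a browser does. The page content and file names are hypothetical. An HTML parser treats the commented-out tag as a comment and never reports it as a script, while a naive regex finds it anyway.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: one live script and one commented-out script.
PAGE = """<html><head>
<script src="/js/app.js"></script>
<!-- <script src="/js/prototype.js"></script> -->
</head><body>hello</body></html>"""


class ScriptCollector(HTMLParser):
    """Collects src attributes of real <script> tags, as a browser would see them."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def scripts_seen_by_parser(html):
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


def scripts_seen_by_regex(html):
    # Naive pattern matching over raw text; comments are not excluded.
    return re.findall(r'<script[^>]*\bsrc="([^"]+)"', html)


parsed = scripts_seen_by_parser(PAGE)
scraped = scripts_seen_by_regex(PAGE)
bait = [src for src in scraped if src not in parsed]

print(parsed)   # ['/js/app.js']
print(scraped)  # ['/js/app.js', '/js/prototype.js']
print(bait)     # ['/js/prototype.js']
```

Any URL that only the regex finds is one that a normal browser would never request, so a request for it is a strong hint that the client is not what its user agent claims.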
Since so many people are boosting this thread I think I'll take the opportunity to mention that I'm available for hire on a part-time or contract basis.
Feel free to reach out if you like my ideas about computer-related topics and have both the budget and need of someone who has such ideas.
I can be reached by Fediverse DM or the contact form on my website:
-
One thing to note is also that the Atlas "Browser" is a distributed scrape/crawl attack. It's a browser that users can use, but it actually scrapes the sites and sends the info to OpenAI.
So, yes, we should all expect extremely erratic scraping to be going on, and countermeasures are important.
Of course, if the "browser" doesn't appear to work, the users will stop using it.