I would absolutely fsck the hell out of somebody doing that. If it tries to get the bogus .js file, feed it something that grabs the incoming IP and feeds it back on itself .... basically make the bots DDOS themselves.
Like in the old days sending an XTREE packet (which would lock up a Windows box, back when the skript kiddies used such things...)
Don't MESS with Mama Bear, motherfrackers.
@stonebear2 @Argonel @oblomov the main complication is that it seems there are lots of different scraper projects exhibiting this behaviour, so there's not too many methods that can be applied broadly.
zip-bombs (as others have suggested) are fairly likely to be effective, but they require extra traffic on my part. blocking them is the most universal and scalable response.
that said, if I ever encounter a scraper that I simply can't block, I'll probably take the zip-bomb route.
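To put a rough number on the "extra traffic" trade-off mentioned above, here is a minimal sketch. It assumes the payload is simply gzip-compressed zero bytes, and the 10 MiB size is arbitrary; the thread doesn't say how such a payload would actually be built or served.

```python
import gzip

def make_gzip_payload(uncompressed_size):
    """Compress a run of zero bytes; highly repetitive input shrinks dramatically."""
    return gzip.compress(bytes(uncompressed_size), compresslevel=9)

# 10 MiB of zeros, an arbitrary size chosen only for illustration.
size = 10 * 1024 * 1024
payload = make_gzip_payload(size)
print(f"{size} bytes uncompressed -> {len(payload)} bytes on the wire")
```

The compressed payload is small but not free: the server still has to send it on every hit, whereas a blocked connection costs almost nothing.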
-
I was just looking at my webserver logs while sipping coffee (as one does) and I noticed that one of my websites was receiving requests for a js file which I had prototyped but never actually deployed.
The script tag is present in the page, but it's commented out. I investigated, and it seems that scrapers see that tag and are trying to grab it even though it's completely non-functional. I guess they just want every bit of code they can find to help train an LLM.
This seems like a promising pattern for catching scrapers that pretend to be normal browsers.
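A minimal sketch of how this pattern might be checked against access logs. It assumes the common "combined" log format, and `/js/prototype.js` is a hypothetical stand-in for the commented-out script; the thread names neither the real file nor the log format, and the sample lines are made up.

```python
import re

# Hypothetical path of the script that is referenced only inside an HTML comment.
DECOY_PATH = "/js/prototype.js"

# Matches the start of a common/combined log line: client IP, timestamp, request line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)'
)

def find_decoy_clients(lines, decoy_path=DECOY_PATH):
    """Return the set of client IPs that requested the decoy path."""
    clients = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        # Ignore any query string when comparing paths.
        path = match.group("path").split("?", 1)[0]
        if path == decoy_path:
            clients.add(match.group("ip"))
    return clients

# Made-up sample lines using documentation-reserved IP addresses.
sample = [
    '203.0.113.7 - - [31/Oct/2025:09:00:00 +0000] "GET /js/prototype.js HTTP/1.1" 404 153 "-" "Mozilla/5.0"',
    '198.51.100.2 - - [31/Oct/2025:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
print(find_decoy_clients(sample))
```

A normal browser never fetches a script inside an HTML comment, so any client that shows up here is parsing the markup as raw text, whatever its user-agent string claims.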
Since so many people are boosting this thread I think I'll take the opportunity to mention that I'm available for hire on a part-time or contract basis.
Feel free to reach out if you like my ideas about computer-related topics and have both the budget and need of someone who has such ideas.
I can be reached by Fediverse DM or the contact form on my website:
-
@Argonel @oblomov I've thought about it, and I have some code stashed somewhere which would do exactly that, but so far I've decided it's probably not worth the electricity.
The most efficient thing is to terminate the requests as quickly as possible and to let something low-level like the system firewall keep them from coming back for a while.
Maybe I'll change my mind in the future, but if we're lucky the bubble will crash and the problem will just go away.
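The "keep them from coming back for a while" part boils down to a ban with an expiry. The sketch below models only that bookkeeping in memory; the class name and the one-hour default are assumptions, and a real setup would more likely hand the address to the system firewall, as described above.

```python
import time

class TemporaryBlocklist:
    """Track banned client IPs with an expiry time (in-memory sketch)."""

    def __init__(self, ban_seconds=3600, clock=time.monotonic):
        self.ban_seconds = ban_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._expiry = {}   # ip -> clock value at which the ban lapses

    def ban(self, ip):
        """Ban an IP for ban_seconds, starting now."""
        self._expiry[ip] = self.clock() + self.ban_seconds

    def is_banned(self, ip):
        """Return True while the ban is active; drop the entry once it lapses."""
        expires = self._expiry.get(ip)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expiry[ip]
            return False
        return True

blocklist = TemporaryBlocklist(ban_seconds=3600)
blocklist.ban("203.0.113.7")
print(blocklist.is_banned("203.0.113.7"))
```

Letting bans lapse keeps the list from growing forever and gives reassigned addresses a clean slate.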
-
One thing to note is also that the Atlas "Browser" is a distributed scrape/crawl attack. It's a browser that users can use, but it actually scrapes the sites and sends the info to OpenAI.
So, yes, we should all expect extremely erratic scraping to be going on, and countermeasures are important.
Of course, if the "browser" doesn't appear to work, the users will stop using it.
-
I just published a blog post summing up my most pertinent thoughts about dealing with badly-behaved web-scraping bots:
https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/
It isn't exactly a Hallowe'en-themed article, but today is the 31st and the topic is concerned with pranking people who come knocking on my website's ports, so it's somewhat appropriate.
#infosec #bots #halloween #scrapers #AI #someMoreHashtagsHere
-
looks like someone reshared my article to Hacker News where someone has (predictably) already commented on the headline without reading the article 😅
-
it's like reddit if every user was a meth-head tech bro
-
my system for catching anomalous HTTP traffic just flagged someone sending me an API key as an "x-api-key" header.
that's about what I'd expect from the average HN reader
-
@ansuz Some of the replies/comments in that thread are just insane!
Scraping is fine, even when there's a robots.txt (and other signs) saying not to, but DDoSing is illegal and you should sue people who do that.
Just ... what ?!?
Some of these people are in a totally different reality.
No, not reality ... the other thing.
-
@ColinTheMathmo oh yea, a lot of them have very clearly had way too much silicon-valley-kool-aid to drink and were basically incoherent.
I was actually pleasantly surprised at a few others, though. I can't help but feel that the more sensible ones might be wasting their time commenting on HN and should just join fedi
-
Definitely a "take the rough with the smooth" thing.
I don't tend to read HN commentary on my stuff. (Sometimes, but not always)
-
sysadmins/webmasters of fedi:
I am looking for suggestions of which search engine crawlers I should consider permitting in my robots.txt file.
There can definitely be value in having a site indexed by a search engine, but I would like to deliberately exclude all of those which are using the same data to train LLMs and other genAI. More specifically, I would only like to allow those which have an explicit stance against training on others' data in this fashion.
Currently I reject everything other than Marginalia (https://marginalia-search.com/). Are there any others I should consider?
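For reference, an allow-one-crawler policy like this can be sanity-checked with Python's standard `urllib.robotparser`. This is only a sketch: the `search.marginalia.nu` user-agent token is an assumption that should be verified against Marginalia's own documentation, and `SomeOtherBot` is a made-up name.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler token for Marginalia; verify before relying on it.
ROBOTS_TXT = """\
User-agent: search.marginalia.nu
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named crawler is allowed; everything else falls through to the wildcard rule.
print(parser.can_fetch("search.marginalia.nu", "/blog/"))
print(parser.can_fetch("SomeOtherBot", "/blog/"))
```

robots.txt is purely advisory, so this only affects crawlers that choose to honour it.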
@ansuz If Marginalia is already there then:
https://alpha.mwmbl.org
https://stract.com
-
@ansuz @ColinTheMathmo this has been my experience also in the rare cases when my writing gets shared on HN (or lobste.rs or reddit, FWIW): there are a bunch of ignorami who are full-in on the corporate propaganda, and a few sane voices that try to highlight how bad that is for everyone. 1/2
-
On the one hand, I agree that they're mostly wasting their time and would do better to spend it elsewhere; OTOH, I am glad for the work they do, and who knows, maybe their words reach others that can then do better.
2/2
-
@ansuz responding to someone on hacker news that they're too good for HN and should join fedi would be absolutely hilarious to no one but me
-
@rune I might have tried it if I had an account there, but then imagine if I was wrong and that was their one good take. fedi equivalent of buyer's remorse