sysadmins/webmasters of fedi:

ansuz / ऐरन

I just published a blog post summing up my most pertinent thoughts about dealing with badly-behaved web-scraping bots:

https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/

It isn't exactly a Hallowe'en-themed article, but today is the 31st and the topic is concerned with pranking people who come knocking on my website's ports, so it's somewhat appropriate.

#infosec #bots #halloween #scrapers #AI #someMoreHashtagsHere

ansuz / ऐरन

looks like someone reshared my article to Hacker News where someone has (predictably) already commented on the headline without reading the article 😅

https://news.ycombinator.com/item?id=45773347

ansuz / ऐरन

it's like reddit if every user was a meth-head tech bro

ansuz / ऐरन

my system for catching anomalous HTTP traffic just flagged someone sending me an API key as an "x-api-key" header.

that's about what I'd expect from the average HN reader

Colin the Mathmo

@ansuz Some of the replies/comments in that thread are just insane!

Scraping is fine, even when there's a robots.txt (and other signs) saying not to, but DDoSing is illegal and you should sue people who do that.

Just ... what ?!?

Some of these people are in a totally different reality.

No, not reality ... the other thing.

ansuz / ऐरन

@ColinTheMathmo oh yea, a lot of them have very clearly had way too much silicon-valley-kool-aid to drink and were basically incoherent.

I was actually pleasantly surprised at a few others, though, I can't help but feel that the more sensible ones might be wasting their time commenting on HN and should just join fedi

Neil Brown

@ansuz @ColinTheMathmo

Definitely a "take the rough with the smooth" thing.

I don't tend to read HN commentary on my stuff. (Sometimes, but not always)

Tris

@ansuz If Marginalia is already there then:
https://alpha.mwmbl.org
https://stract.com

Oblomov

@ansuz @ColinTheMathmo this has been my experience also in the rare cases when my writing gets shared on HN (or lobste.rs or reddit, FWIW): there a bunch of ignorami who are full-in on the corporate propaganda, and a few sane voices that try to highlight how bad that is for everyone. 1/2

Oblomov

@ansuz @ColinTheMathmo

On the one hand, I agree that they're mostly wasting their time and would do better to spend it elsewhere; OTOH, I am glad for the work they do, and who knows, maybe their words reach others that can then do better.

2/2

Scared Rune (Halloweeen Edition)

@ansuz responding to someone on hacker news that they're too good for HN and should join fedi would be absolutely hilarious to no one but me

ansuz / ऐरन

@rune I might have tried it if I had an account there, but then imagine if I was wrong and that was their one good take. fedi equivalent of buyer's remorse

ansuz / ऐरन

One of the people that read my article on algorithmic sabotage set up an infinite source of nonsense for LLM scrapers to ingest:

https://shoobot.com

It uses txtgen (https://ndaidong.github.io/txtgen/) to respond to every subdomain and URL with garbage.

There are already other projects to do the same, but it's nice to see more people trying their hand at addressing the problem.

#AI #sabotage

ansuz / ऐरन

I wrote up some commentary about the response to my last article:

"Algorithmic sabotage debrief"

https://cryptography.dog/blog/algorithmic-sabotage-debrief/

ischade

@ansuz I love this, thank you for writing it!

You talk about how people just block IP addresses. Yup, that's me.
I'm blocking /16s at this point, as well as a couple /8s, and many /13s and /14s.
1.5 billion requests over a 24 hour period is not okay. I'd already blocked one offender subnet, when they stepped up their attempts to the billion range. They tried for a solid week. Then they spent 3 months at a rate of 30K per 24 hours, before wandering off to bug someone else. 🙄

ansuz / ऐरन

Somebody posted a Korean-language summary of my "AI scrapers request commented scripts" with some commentary:

https://aisparkup.com/posts/6165

An excerpt, translated via Firefox's local translation models because I can't read Korean:

> Is it self-defense against unauthorized collection of data under the guise of ignoring robots.txt, or is it excessive retaliation? The fact that an AI model that costs billions of dollars can be disabled with a few hundred documents is a pleasant reversal for some, but a worrying situation for others.

They also mention the WWE/Kropotkin tarpit described by @jonny:

> Another is said to have created a contaminant data with text created by mixing the WWE script with Kropotkin’s “inter-infrastructure theory.”

ansuz / ऐरन

I'm trying to work on the fluconf website, but I keep getting nerd-sniped by the behaviour of the scrapers that are already hitting the not-yet-officially-public site.

In my follow-up to the AI-scrapers article I mentioned that there are actors who monitor certificate transparency logs for new domains to crawl/probe, but I've never really paid close attention to the resources that get targeted first.

It's hard to know for sure because they spoof their user-agent strings and do their best to look like legitimate traffic, but it looks like OpenAI, Nomic.ai, and some other minor AI companies are using this technique.

Piero Bosio Social Web Site Personale

sysadmins/webmasters of fedi:

Feed RSS

Gli ultimi otto messaggi ricevuti dalla Federazione

Post suggeriti

Ci ha lasciato Paolo Virno, filosofo del linguaggio

Dal kamikaze che realizzò "La cucina italiana non esiste".

La monarchia in Europa sopravvive sostanzialmente come un residuo del passato

Internet Archive Wayback Machine not working with some Mastodon content