The web has a memory — and we’ve saved 1 trillion pages of it! 🌐
Join the @internetarchive for The Web We’ve Built, a celebration of the people, stories & technology that preserve our digital world.
📅 Oct 22
🎟️ In Person in San Francisco & 🖥️ Virtual worldwide ⤵️
https://blog.archive.org/event/the-web-weve-built-celebrating-1-trillion-web-pages-archived/
-
Evan Prodromou shared this topic
-
@internetarchive I notice my websites are still being scraped by your bots even though I manually denied all bots in my domains' robots.txt files a year or two ago. The Wayback Machine was great before the age of modern AI, but not now. I don't want companies using your APIs to scrape my archived content to train their AI models. I thought I read that if I denied bots in robots.txt, you'd automatically pull the related website content from your archives... but my content remains on your servers.
-
@jay @internetarchive that's not how robots.txt works. It only prevents new downloads. If you want to ask them to remove already archived data, check here: https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/
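To make the "only prevents new downloads" point concrete, here is a minimal sketch using Python's standard urllib.robotparser (the bot name and URLs are hypothetical): a compliant crawler consults robots.txt before each new fetch, so a deny-all rule stops future downloads but says nothing about copies archived before the rule existed.
```python
# Sketch: how a compliant crawler checks robots.txt before a new fetch.
# The user agent and URL below are hypothetical, for illustration only.
from urllib.robotparser import RobotFileParser

# A robots.txt that denies all crawlers -- what "denying all bots" looks like.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler asks before every *new* download...
print(parser.can_fetch("ExampleBot", "https://example.com/post/123"))  # False

# ...but pages it downloaded before this rule existed are unaffected;
# removing already-archived copies requires a request to the archive operator.
```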
-
@jay @internetarchive I don't know if Archive.org makes their archives available for AI training. There's a good how-to from WIRED on how to opt out.
https://www.wired.com/story/how-to-stop-your-data-from-being-used-to-train-ai/
-
@jay @internetarchive you may have been thinking of Common Crawl, which does make its crawl data available for LLM training. See here for opt-out info:
https://commoncrawl.org/blog/balancing-discovery-and-privacy-a-look-into-opt-out-protocols
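As a rough illustration of that opt-out (a sketch only: Common Crawl documents CCBot as its crawler's user agent, and the URLs here are hypothetical), a robots.txt can single out CCBot while leaving other crawlers untouched:
```python
# Sketch: a robots.txt that opts out of Common Crawl's crawler (CCBot)
# while allowing other user agents. URLs are hypothetical examples.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("CCBot", "https://example.com/article"))          # False: opted out
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))   # True: unaffected
```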
-
@evan @internetarchive Thanks for your info, Evan. My main concern is that the Wayback Machine has copies of my websites from 2025, as recent as July, even though I denied bots in 2023 or 2024 for all of my domains. I don't believe many companies are adhering to robots.txt directives, though Google Search currently is. Unfortunately, the only way to really protect a website today is to require a user account login (secure session) to wall off public content from web scraping.
https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit
@jay @internetarchive good luck!