The web has a memory — and we’ve saved 1 trillion pages of it! 🌐
Join the @internetarchive for The Web We’ve Built, a celebration of the people, stories & technology that preserve our digital world.
📅 Oct 22
🎟️ In Person in San Francisco & 🖥️ Virtual worldwide ⤵️
https://blog.archive.org/event/the-web-weve-built-celebrating-1-trillion-web-pages-archived/
-
Evan Prodromou shared this topic
-
@internetarchive I notice my websites are still being scraped by your bots even though I manually denied all bots in my domains' robots.txt files a year or two ago. The Wayback Machine was great before the age of modern AI, but not now. I don't want companies using your APIs to scrape my archived content to train their AI models. I thought I read that if I denied bots in robots.txt, you'd automatically pull the related website content from your archives... but my content remains on your servers.
-
@jay @internetarchive that's not how robots.txt works. It only prevents new downloads. If you want to ask them to remove already archived data, check here: https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/
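To make the "only prevents new downloads" point concrete, here is a minimal sketch using Python's standard urllib.robotparser (the bot name and URLs are hypothetical): a compliant crawler consults robots.txt before each new fetch, so a deny-all rule stops future downloads but says nothing about copies archived before the rule existed.
```python
# Sketch: how a compliant crawler checks robots.txt before a new fetch.
# The user agent and URL below are hypothetical, for illustration only.
from urllib.robotparser import RobotFileParser

# A robots.txt that denies all crawlers -- what "denying all bots" looks like.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler asks before every *new* download...
print(parser.can_fetch("ExampleBot", "https://example.com/post/123"))  # False

# ...but pages it downloaded before this rule existed are unaffected;
# removing already-archived copies requires a request to the archive operator.
```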
-
@jay @internetarchive I don't know if Archive.org makes their archives available for AI training. There's a good how-to from WIRED on how to opt out.
https://www.wired.com/story/how-to-stop-your-data-from-being-used-to-train-ai/
-
@jay @internetarchive you may have been thinking of Common Crawl, which does make its crawl data available for LLM training. See here for opt-out info:
https://commoncrawl.org/blog/balancing-discovery-and-privacy-a-look-into-opt-out-protocols
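As a rough illustration of that opt-out (a sketch only: Common Crawl documents CCBot as its crawler's user agent, and the URLs here are hypothetical), a robots.txt can single out CCBot while leaving other crawlers untouched:
```python
# Sketch: a robots.txt that opts out of Common Crawl's crawler (CCBot)
# while allowing other user agents. URLs are hypothetical examples.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("CCBot", "https://example.com/article"))          # False: opted out
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))   # True: unaffected
```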
-
@evan @internetarchive Thanks for your info, Evan. My main concern is that the Wayback Machine has copies of my websites from 2025, as recent as July, even though I denied bots in 2023 or 2024 for all of my domains. I don't believe many companies are adhering to robots.txt directives, though Google Search currently is. Unfortunately, the only way to really protect a website today is to require a user account login (secure session) to wall off public content from web scraping.
https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit
@jay @internetarchive good luck!