sysadmins/webmasters of fedi:
- 
sysadmins/webmasters of fedi: I am looking for suggestions of which search engine crawlers I should consider permitting in my robots.txt file. There can definitely be value in having a site indexed by a search engine, but I would like to deliberately exclude all of those which are using the same data to train LLMs and other genAI. More specifically, I would only like to allow those which have an explicit stance against training on others' data in this fashion. Currently I reject everything other than Marginalia (https://marginalia-search.com/). Are there any others I should consider?
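For anyone wanting to try the same "reject everything other than Marginalia" policy, here is a minimal sketch that checks such a robots.txt with Python's standard `urllib.robotparser`. The user-agent token `search.marginalia.nu` is an assumption and should be verified against Marginalia's own documentation, and `example.org` is only a placeholder. Keep in mind that robots.txt only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Allow one named crawler and refuse everyone else.
# The token "search.marginalia.nu" is an assumption; check the crawler's docs.
ROBOTS_TXT = """\
User-agent: search.marginalia.nu
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.org/") -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed("search.marginalia.nu")` should be permitted by these rules, while any other crawler name falls through to the catch-all `Disallow: /`.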
- 
I think I will probably have to dig through Wikipedia's comparison of search crawlers[1] for an answer. I'm half-expecting negative results, though. [1]: https://en.wikipedia.org/wiki/Comparison_of_search_engines#Search_crawlers
- 
I looked at Mojeek and learned that they have a dedicated button for searching Substack, so I guess that's off the list.
- 
@ansuz Mojeek has a strange choice of languages too. It did not support Russian or anything using a non-Roman script when I last checked.
- 
I was just looking at my webserver logs while sipping coffee (as one does) and I noticed that one of my websites was receiving requests for a js file which I had prototyped but never actually deployed. The script tag is present in the page, but it's commented out. I investigated, and it seems that scrapers see that tag and are trying to grab it even though it's completely non-functional. I guess they just want every bit of code they can find to help train an LLM. This seems like a promising pattern for catching scrapers that pretend to be normal browsers.
- 
On just this one website I count 23 unique IPs with wildly different user agents. Only 4 identify themselves as bots. This domain receives far less traffic than any of my others, as I use it mostly as a personal blog and share it only with friends. I assume this will be much more effective if applied to my more public sites.
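A count like this can be pulled out of an access log with a few lines of Python. The sketch below assumes the server writes the common "combined" log format; the path `/js/prototype.js` is hypothetical, and the `bot|crawler|spider` pattern is only a rough guess at what "identifies itself as a bot" means.

```python
import re
from collections import defaultdict

# Hypothetical path of the never-deployed script; substitute your own.
HONEYPOT_PATH = "/js/prototype.js"

# Combined log format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Rough heuristic for user agents that announce themselves as bots.
BOT_HINT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def summarize_honeypot_hits(lines, honeypot_path=HONEYPOT_PATH):
    """Return (unique_ips, self_identified_bot_ips) for honeypot requests."""
    agents_by_ip = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        path = match.group("path").split("?", 1)[0]  # ignore query strings
        if path == honeypot_path:
            agents_by_ip[match.group("ip")].add(match.group("agent"))
    self_identified = {
        ip
        for ip, agents in agents_by_ip.items()
        if any(BOT_HINT.search(agent) for agent in agents)
    }
    return set(agents_by_ip), self_identified
```

Passing an open log file to `summarize_honeypot_hits` gives two sets of IPs; comparing their sizes shows how many requesters hid behind browser-like user agents.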
- 
Several of my websites now feature a commented-out script tag linking to a non-existent file. Any IP requesting this file will be banned at the firewall level for a significant duration. I'll give this a few days and report back on how many bad bots it catches.
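The trap-and-ban idea can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the path `/js/trap.js`, the one-week duration (the post only says "significant"), and the in-memory `BanList` class, which stands in for whatever actually feeds the firewall.

```python
import time

# Hypothetical trap markup: browsers ignore it because it is commented out.
TRAP_MARKUP = '<!-- <script src="/js/trap.js"></script> -->'
TRAP_PATH = "/js/trap.js"        # hypothetical; the file does not exist
BAN_SECONDS = 7 * 24 * 60 * 60   # assumed duration, not taken from the post


class BanList:
    """In-memory stand-in for a firewall ban table with expiring entries."""

    def __init__(self, ban_seconds=BAN_SECONDS):
        self.ban_seconds = ban_seconds
        self._expiry = {}  # ip -> timestamp at which the ban ends

    def observe(self, ip, path, now=None):
        """Record a request; ban the IP if it asked for the trap file.

        Returns True if the IP is banned after this request.
        """
        now = time.time() if now is None else now
        if path.split("?", 1)[0] == TRAP_PATH:
            self._expiry[ip] = now + self.ban_seconds
        return self.is_banned(ip, now)

    def is_banned(self, ip, now=None):
        """Return True while the ban for ip has not yet expired."""
        now = time.time() if now is None else now
        expiry = self._expiry.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expiry[ip]  # ban has run out; forget the entry
            return False
        return True
```

In a real deployment the `observe` step would be driven by the webserver log or a request hook, and the ban itself would be applied by the firewall rather than kept in a Python dict.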
- 
@ansuz that's a very interesting approach. Do let us know; I'm still looking for ways to handle these things.
- 
@oblomov I'll probably write a blog post about it. I'm using a range of different tricks to classify humans/scrapers, and updating them all the time. I don't expect this will make a huge difference in terms of absolute numbers, but it does seem like it will catch a lot of requests that otherwise look legitimate, so it feels like I've found a missing puzzle piece. Scrapers are just greedier than any other type of agent, and that can be exploited ❤️
- 
@ansuz much appreciated, thanks.