sysadmins/webmasters of fedi:
-
sysadmins/webmasters of fedi:
I am looking for suggestions of which search engine crawlers I should consider permitting in my robots.txt file.
There can definitely be value in having a site indexed by a search engine, but I would like to deliberately exclude all of those which are using the same data to train LLMs and other genAI. More specifically, I would only like to allow those which have an explicit stance against training on others' data in this fashion.
Currently I reject everything other than Marginalia (https://marginalia-search.com/). Are there any others I should consider?
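For anyone wanting to sanity-check an allowlist like this, here is a minimal sketch using Python's standard-library robots.txt parser. The user-agent token `search.marginalia.nu` is an assumption on my part (check Marginalia's own documentation for the current value), and `example.org` and the `is_allowed` helper are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# Allowlist-style robots.txt: one named crawler is permitted, everything else is refused.
# ASSUMPTION: "search.marginalia.nu" is the token Marginalia's crawler matches on.
ROBOTS_TXT = """\
User-agent: search.marginalia.nu
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.org/") -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical list of agents to check; substitute the ones you care about.
    for agent in ("search.marginalia.nu", "Googlebot", "GPTBot"):
        print(agent, is_allowed(agent))
```

Keep in mind that robots.txt is advisory: it only keeps out crawlers that choose to honour it, so anything that ignores it would need to be blocked at the server level instead.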
-
I think I will probably have to dig through Wikipedia's comparison of search crawlers[1] for an answer.
I'm half-expecting negative results, though.
[1]: https://en.wikipedia.org/wiki/Comparison_of_search_engines#Search_crawlers
-
I looked at Mojeek and learned that they have a dedicated button for searching Substack, so I guess that's off the list.
-
@ansuz Mojeek has a strange choice of languages too. It did not support Russian or anything using a non-Roman script, last I checked.