#decemberAdventure day 9: back to #gopher
A number of years ago I set up my own search tool for the gopherspace. I haven't re-done a full crawl since getting it working, and due to some design issues it eventually stopped working reliably enough to keep running. Because of RSI issues I have prioritized other projects in the intervening years, but have slowly built up a list of notes and plans for fixing it. Today's adventure is the first part of implementing that plan and getting it back to a usable state.
I've rewritten the crawler. The new one is a lot less buggy than the original and has a number of improvements, including a correctly working filter (supporting robots.txt and a defined list of servers not to index), better discovery of servers from the gopher maps, tracking of when servers were last scanned, request rate limiting, and facilities for avoiding duplicate entries.
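This isn't the actual crawler code, but a minimal sketch in Python of the discovery step described above, assuming standard RFC 1436 gophermap lines (type character, display string, selector, host, and port separated by tabs); the robots.txt filtering, rate limiting, and scan tracking are only hinted at in comments.
```
import socket

def fetch_selector(host, port, selector, timeout=10):
    """Fetch one gopher selector and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(selector.encode("utf-8") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def parse_gophermap(raw):
    """Yield (item_type, description, selector, host, port) from a gophermap."""
    for line in raw.decode("utf-8", errors="replace").splitlines():
        if not line or line == ".":
            continue                  # skip blanks and the end-of-map marker
        item_type, rest = line[0], line[1:]
        fields = rest.split("\t")
        if item_type == "i" or len(fields) < 4:
            continue                  # informational or malformed lines: nothing to crawl
        yield item_type, fields[0], fields[1], fields[2], fields[3]

# Discover hosts linked from a server's root gophermap (empty selector).
# A real crawler would also honor robots.txt and an exclusion list,
# rate-limit requests per host, and queue type-1 (directory) selectors
# for further crawling.
seen_hosts = set()
for item_type, desc, sel, host, port in parse_gophermap(fetch_selector("forthworks.com", 70, "")):
    if host not in seen_hosts:
        seen_hosts.add(host)
        print(f"discovered {host}:{port}")
```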
My initial tests have been against my main gopher server (forthworks.com:70) and a number of my private ones, totaling 32k selectors across 3 servers. I'm going to start a broader scan of the public gopherspace soon, and will post an update once I get through the initial scan of a few servers.
My full logs are at https://charles.childe.rs/DA2025
-
Update on the initial scan: 91 servers (of 488 found in the indexes) scanned, 397 pending, 11 unreachable. 579,936 selectors with 2,193,285 descriptions. Data set is 569MiB in size.
I'm stopping my scans for today, will resume them tomorrow.
-
Update on my scan of gopherspace.
725 servers identified
84 unreachable
6 restricted due to robots.txt or manual exclusion requests
325 scanned completely
1 in progress
393 pending
1,269,190 unique selectors, 5,699,901 descriptions.
The scan will continue (slowly). I'm going to start writing the new front end for searching the collected data next week.
