> It might make more sense to seed the robot with a list of known (or
> suspected) relevant pages...
> Still I can't escape the feeling that there must be a better way
> to find _new_ resources for a subject, other than blind crawling
> on the Web, like monitoring newsgroups and mailing lists, parsing
> specific pages, etc., not to mention human editorship :-)
I am interested in just such a search engine. If anyone is working on
something they need a place to test, or is getting ready to offer such a
harvesting/indexing tool, I would be very interested in talking to them.
Unfortunately, I am getting too busy with other things to spend much time
working out the kinks in my own code, especially since I have not worked
extensively with raw TCP/IP coding.
I have a specific list of starting points that I have collected from
newsgroups, mailing lists, and user referrals. The list is extensive, and
I would start building an index manually if only I had a good format for
the database. Again, I do not currently have the time to research existing
tools and file structures, and would much prefer a working package that
automates as much of the collecting, sorting, structuring, and indexing as
possible. Preferably something pre-bundled with a CGI search tool to hunt
through the database, and perhaps admin programs for managing the data
collected.
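For the database itself, even one record per line would do. Here is a
minimal sketch in Python of the kind of format I am imagining; the field
names are just my own guesses at what an index needs, not any existing
tool's layout:

    import json

    def write_record(db, url, title, keywords):
        # One JSON object per line keeps the database greppable and
        # easy for a CGI search script to scan line by line.
        record = {"url": url, "title": title, "keywords": keywords}
        db.write(json.dumps(record) + "\n")

    # Example entry; the URL and terms are placeholders, of course.
    with open("index.db", "a") as db:
        write_record(db, "http://example.com/", "Example page",
                     ["robots", "harvesting"])

A flat file like that would at least be trivial for admin scripts to
sort, merge, and prune.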
I would be especially interested in something that can build new lists of
starting points: find the URIs in the documents I ask it to start from,
then quickly check each one for matching keywords to produce the starting
list for my next harvest, preferably weeding out repeats of items already
in my index. A rough sketch of what I mean follows.
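To make the idea concrete, here is a rough Python sketch of that harvest
step. Every name in it is hypothetical, and a real tool would want a
proper HTML parser and polite fetch delays instead of this crude scrape:

    import re
    import urllib.request

    KEYWORDS = ["robot", "harvest", "index"]   # my subject terms

    def extract_uris(html):
        # Crude href scrape; an actual tool should parse the HTML.
        return re.findall(r'href="(http[^"]+)"', html, re.IGNORECASE)

    def matches(html):
        text = html.lower()
        return any(k in text for k in KEYWORDS)

    def next_starting_points(seed_url, already_indexed):
        seed = urllib.request.urlopen(seed_url).read().decode("latin-1")
        fresh = []
        for uri in extract_uris(seed):
            if uri in already_indexed or uri in fresh:
                continue                # repeat of something I have
            try:
                page = urllib.request.urlopen(uri).read().decode("latin-1")
            except Exception:
                continue                # dead link; skip it
            if matches(page):
                fresh.append(uri)
        return fresh

Feeding the result back in as the seed list for the next run is exactly
the loop I want automated.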
I don't ask for much, do I? Does anyone know of anything that can do the
magic I desire and will run on my Linux box? I have a project that I want
to get rolling here, but can't find the tools to do the job. I hate
pounding nails with my head, and would much prefer a 'hammer' to do the
job right...
Scott