Re: Single tar (Re: Inktomi & large scale spidering)

Erik Selberg (selberg@cs.washington.edu)
27 Jan 1997 12:14:46 -0800


Sigfrid Lundberg <siglun@gungner.ub2.lu.se> writes:

> I'm afraid that off-line harvesting and indexing isn't as easy as
> implied in these contributions. The (tar.gz?) archives have to be
> generated by some program which is capable of reading the
> configuration files of the WWW server used. This is necessary in order
> to handle server-side includes and the like. The programs involved may
> even have to execute scripts, as it goes. Mind you, there might
> potentially be a lot of pages using constructs like
>
> <!--#exec cmd="some_program" -->
> <!--#include file="some_file" -->

DING! DING! DING! We have a winner!

Yup; the ol' "rooting-through-your-filesystem" approach has the
problems that:
1) you're looking at the FS view, when you want to look at the Web view;
2) you're looking at a view from your machine, not from some random machine.

(2) is actually more obnoxious, because if you have different
permission levels (say, areas which are "som.dom access only"), you may
index files that are in protected areas. This gets really bad if you
then include the entire file in the public archive.
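
Just to make the "FS view vs. Web view" point concrete, here's a rough,
purely hypothetical sketch (the DocumentRoot path and hostname are made
up): walking the filesystem hands you raw, unexpanded files from every
directory, while fetching the same documents over HTTP gives you what a
remote robot would actually see: SSI expanded, and a 403 for anything
behind an access restriction.

    import os, urllib.error, urllib.request

    DOCROOT = "/usr/local/apache/htdocs"   # made-up DocumentRoot
    BASE = "http://www.som.dom"            # made-up hostname

    for dirpath, _, files in os.walk(DOCROOT):
        for name in files:
            path = os.path.join(dirpath, name)
            url = BASE + path[len(DOCROOT):]
            raw = open(path, "rb").read()        # FS view: raw bytes, any directory
            try:
                served = urllib.request.urlopen(url).read()   # Web view
            except urllib.error.HTTPError:
                served = None                    # e.g. 403 in a protected area
            if served != raw:
                print("differs (SSI expanded, or protected):", url)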

I think the only reasonable solution in a lot of ways is to make a
spider which is attached to a server. This spider would then create
the tarballs (and you could have one big file, as well as a week's
worth of changes in another for incremental updating). The spider
could also do other useful things, like make sure you got all your
scripting done right, you didn't forget any links, etc. etc. etc.
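
Here's a rough sketch of what I mean, in Python (the hostnames,
filenames, and hashing scheme are all made up, so take it as
illustration only, not a real implementation): the spider fetches pages
over HTTP from the server it sits next to, so SSI and script output end
up in the archive, then writes one full tarball plus a second tarball
containing only the pages whose content changed since the last run.

    import hashlib, io, json, os, tarfile, urllib.parse, urllib.request
    from html.parser import HTMLParser

    ROOT = "http://www.som.dom/"     # server being archived (made-up name)
    MANIFEST = "manifest.json"       # content hashes from the previous run

    class LinkParser(HTMLParser):
        """Collect href targets from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(root):
        """Fetch every same-host page reachable from root, via HTTP
        (so SSI/CGI output is what gets archived, not raw files)."""
        host = urllib.parse.urlparse(root).netloc
        seen, queue, pages = set(), [root], {}
        while queue:
            url = queue.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                body = urllib.request.urlopen(url).read()
            except OSError:
                continue                    # 4xx/5xx or unreachable: skip it
            pages[url] = body
            parser = LinkParser()
            parser.feed(body.decode("latin-1", "replace"))
            for href in parser.links:
                target = urllib.parse.urldefrag(urllib.parse.urljoin(url, href)).url
                if urllib.parse.urlparse(target).netloc == host:
                    queue.append(target)
        return pages

    def archive(pages):
        """Write full.tar.gz with everything, changes.tar.gz with only
        the pages whose content hash differs from the previous run."""
        old = json.load(open(MANIFEST)) if os.path.exists(MANIFEST) else {}
        new = {}
        def add(tar, name, body):
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
        with tarfile.open("full.tar.gz", "w:gz") as full, \
             tarfile.open("changes.tar.gz", "w:gz") as delta:
            for url, body in pages.items():
                name = urllib.parse.quote(url, safe="")
                new[url] = hashlib.md5(body).hexdigest()
                add(full, name, body)
                if old.get(url) != new[url]:    # new or changed since last run
                    add(delta, name, body)
        with open(MANIFEST, "w") as f:
            json.dump(new, f)

    if __name__ == "__main__":
        archive(crawl(ROOT))

Keeping a manifest of per-URL content hashes between runs is what makes
the incremental "changes" tarball cheap: a robot that already holds last
week's full archive only needs to pull the small one.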

Rob --- would this be something the Apache folks would be interested
in?

Thanks,
-Erik

-- 
				Erik Selberg
"I get by with a little help	selberg@cs.washington.edu
 from my friends."		http://www.cs.washington.edu/homes/selberg
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html