Re: Single tar (Re: Inktomi & large scale spidering)

Sigfrid Lundberg (siglun@gungner.ub2.lu.se)
Mon, 27 Jan 1997 12:08:58 +0100 (MET)


On Mon, 27 Jan 1997, Jaakko Hyvätti wrote:

>
> On 26 Jan 1997, Erik Selberg wrote:
> > because you're already stuck with this format. Another option is that
> > webmasters just ship a big tarball of raw text; it would seem to me
> > that would be the best option for all parties. But who knows if anyone
> > will implement it!
>
> I think this would be the only really universal interface. Also on
> simple servers it would be just a simple tar. If I get to do some robot
> work later this year I'll consider implementing this, on some scale here
> in Finland. If robot admins see archives on servers, maybe they'll implement
> them too.. if some robot implements it, maybe servers do it too.. If it
> survives, it was good..
>

I'm afraid that off-line harvesting and indexing isn't as easy as
implied in these contributions. The (tar.gz?) archives would have to be
generated by a program that is capable of reading the configuration
files of the WWW server in use. This is necessary in order to handle
server side includes and the like, and the program may even have to
execute scripts along the way. Mind you, there could be a lot of pages
using constructs like

<!--#exec cmd="some_program" -->
<!--#include file="some_file" -->

or equivalent proprietary ones like Roxen's SPML, preprocessing in
W3-mSQL, PHP/FI or you name it.
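To make the point concrete, here is a minimal sketch of why the packaging
program has to go through the server rather than simply tar'ing the document
root: the raw file still contains the SSI directives, while the same page
fetched through the server comes back with them expanded. The document root,
path and local server address are assumptions for illustration only.

import urllib.request

DOC_ROOT = "/var/www/htdocs"          # assumed document root
PATH = "/index.shtml"                 # assumed page using server side includes

# Raw file from disk: still contains <!--#exec ...--> / <!--#include ...-->
with open(DOC_ROOT + PATH, "rb") as f:
    raw = f.read()

# Same page via the running server: includes expanded, scripts executed
with urllib.request.urlopen("http://localhost" + PATH) as resp:
    rendered = resp.read()

print(b"<!--#" in raw, b"<!--#" in rendered)   # typically: True False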

I think that the off-line packaging of indexing data has to be
implemented in two parts:

- One part, which has to be integrated into the WWW server and
knows about its configuration, and which can speak HTTP over a
named pipe, unix domain socket or whatever is suitable for
the purpose in a particular operating environment.

- Another part, which should be a client to the former
and able to make some sort of compressed archive from what it
gets. It should use this "off-line" HTTP to harvest pages
locally, following extracted links. Obviously, this should
still be subject to the constraints given in /robots.txt
(presumably with a specific offline user agent). It should
also honour the <meta name="ROBOTS" ...> stuff. A rough
sketch of this client side follows below.
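Here is a minimal sketch of that second, client-side part: it speaks plain
HTTP/1.0 over a unix domain socket exposed by the server-integrated part,
honours /robots.txt for a dedicated offline user agent, and packs the
harvested bodies into a gzipped tar archive. The socket path, agent name and
seed list are assumptions for illustration; link extraction and
<meta name="ROBOTS"> handling are left out for brevity.

import io
import socket
import tarfile
import urllib.robotparser

SOCKET_PATH = "/var/run/httpd-offline.sock"   # assumed "off-line HTTP" endpoint
USER_AGENT = "offline-packager"               # assumed dedicated user agent
SEEDS = ["/", "/docs/index.html"]             # assumed starting points

def local_get(path):
    """Fetch one path over the unix domain socket, return the response body."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(SOCKET_PATH)
    s.sendall(("GET %s HTTP/1.0\r\nUser-Agent: %s\r\n\r\n"
               % (path, USER_AGENT)).encode("ascii"))
    data = b""
    while True:
        chunk = s.recv(8192)
        if not chunk:
            break
        data += chunk
    s.close()
    return data.split(b"\r\n\r\n", 1)[1]      # strip the HTTP headers

# Respect the same constraints an on-line robot would.
rp = urllib.robotparser.RobotFileParser()
rp.parse(local_get("/robots.txt").decode("latin-1").splitlines())

with tarfile.open("site-index.tar.gz", "w:gz") as tar:
    for path in SEEDS:
        if not rp.can_fetch(USER_AGENT, path):
            continue
        body = local_get(path)
        info = tarfile.TarInfo(name=path.lstrip("/") or "index.html")
        info.size = len(body)
        tar.addfile(info, io.BytesIO(body))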

One might also think of delivering things in some kind of standardized
record format (SOIF comes to mind). This would imply that the
latter package would have to perform some kind of standardized
indexing. If that is the case, then the package has to be
metadata-aware and, for instance, support embedded Dublin Core records.

Yours

Sigfrid
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html