Re: implementation of HEAD response with meta info

G. Edward Johnson (lorax@speckle.ncsl.nist.gov)
Fri, 7 Jun 1996 13:50:15 -0400 (EDT)


On Fri, 7 Jun 1996, Davide Musella wrote:

> Hello!
> I've written a patch for the Apache server so that it inserts the meta info
> contained in an HTML file into the response to a HEAD request (and only in
> this kind of response).
> There is a web page where I explain in more detail how this server works:
> http://jargo.itim.mi.cnr.it/Robot/
> It does not parse the HTML files at request time; instead, the system is
> based on a submission step for the HTML docs.
> All this is only a test implementation of these features.
> If someone wants to test this server with their robot and then tell me what
> they think about it, I'll be really glad!! :)
> Suggestions are welcome!

Davide,

I read through your paper
<URL:http://jargo.itim.mi.cnr.it/documentazione/articol_INET96.html>
and was wondering what you thought of an additional extension. In your
paper you envision a process like this (somewhat abbreviated):

Web Server:
1) New document is submitted.

2) Web server adds the document to SFT.txt, RSA.txt, and a third unnamed
file containing the Document Meta Information (let's call it DMI).

3) When a HEAD request comes in, the server pulls the meta info from the DMI
file, adds in the regular headers, and sends it out (a rough sketch follows).
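
Just to make step 3 concrete, the lookup might look like this in Python.
The DMI file layout (tab-separated path, name, content lines, one per META
tag) and every name below are my own invention, not something from your
paper:

    # Hypothetical sketch: answer a HEAD request by merging stored META
    # name/content pairs into the normal response headers.
    import http.server

    DMI_FILE = "DMI"   # assumed location of the per-document meta info

    def load_dmi(path):
        """Return the stored META name/content pairs for one document."""
        pairs = []
        with open(DMI_FILE) as f:
            for line in f:
                doc, name, content = line.rstrip("\n").split("\t", 2)
                if doc == path:
                    pairs.append((name, content))
        return pairs

    class MetaHeadHandler(http.server.SimpleHTTPRequestHandler):
        def do_HEAD(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            for name, content in load_dmi(self.path):
                self.send_header(name, content)   # e.g. Keywords, Author
            self.end_headers()

    if __name__ == "__main__":
        http.server.HTTPServer(("", 8080), MetaHeadHandler).serve_forever()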

Robot:

1) Robot requests SFT.txt to get the URLs of all the files on the server

2) Robot requests RSA.txt to get all external links in files on the server

3) Robot issues a HEAD request for each file individually.
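
The robot side could be sketched like this (Python again; the host name and
the idea that the meta info simply comes back as extra response headers are
assumptions on my part):

    # Rough sketch of the robot loop, assuming the server publishes
    # /SFT.txt (one local URL per line) and /RSA.txt (external links).
    import urllib.request

    BASE = "http://www.example.com"          # hypothetical server

    def fetch_lines(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode().splitlines()

    doc_urls = fetch_lines(BASE + "/SFT.txt")    # step 1: every local URL
    ext_links = fetch_lines(BASE + "/RSA.txt")   # step 2: every external link

    for url in doc_urls:                         # step 3: one HEAD per file
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            meta = dict(resp.getheaders())       # meta info arrives as headers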

It seems that if you are already defining two files, you could just as
easily define three, and add a DMI.txt that contained all meta
information for each document on the system.
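
For concreteness, a DMI.txt record might look something like the following;
the field names and the sample document are pure invention on my part, just
to show the shape of the thing:

    URL: /reports/annual95.html
    Last-Modified: Tue, 04 Jun 1996 09:12:44 GMT
    Content-Length: 18422
    Meta: keywords=annual report, budget
    Meta: author=webmaster

    (...one such record per registered document...)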

What you would gain.
1) With requests for just three documents, a robot could index your entire site

2) Lower server load. You wouldn't have to reparse the DMI to generate
information for each HEAD request, and there would be many fewer requests
from robots.

What you would lose.
1) the robot probably wouldn't get the last-modified date and content length

But this may not be a liability. Chances are, the robot doesn't care
about the content length, and many robots index on a fixed schedule that
doesn't relate to the last modified time.

Even so, your process requires registration of documents (you have it as a
manual process, but I could easily see a web server automatically registering
documents every night). So the web server already knows the content length
and last modified time, and both could be included in DMI.txt (a sketch of
such a nightly job follows).
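
A nightly job along these lines would be enough (a sketch, with the document
root and file layout assumed, and with the META tags from the registration
step left out for brevity):

    # Hypothetical nightly job: walk the document tree and rebuild DMI.txt
    # with the size and modification time the server already knows.
    import os
    import email.utils

    DOC_ROOT = "/usr/local/etc/httpd/htdocs"   # assumed DocumentRoot

    with open(os.path.join(DOC_ROOT, "DMI.txt"), "w") as out:
        for dirpath, _dirs, files in os.walk(DOC_ROOT):
            for name in files:
                if not name.endswith(".html"):
                    continue
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                out.write("URL: /%s\n" % os.path.relpath(path, DOC_ROOT))
                out.write("Last-Modified: %s\n" %
                          email.utils.formatdate(st.st_mtime, usegmt=True))
                out.write("Content-Length: %d\n\n" % st.st_size)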

2) the robot would have to get the whole file, not just the parts that
have changed.

If the robot is updating the entire site, then it would HEAD each
document, so it would be getting all the DMI information anyway, just in
many separate requests. If it just wanted to update a few files, it
could issue a HEAD request for them, and the process would occur exactly
as you described it.

3) There might be files in DMI.txt that you don't want indexed.

If the robot heeds the robots.txt file, it would hopefully check each
entry in SFT.txt and not index those it wasn't supposed to. However, if
it ignored robots.txt then it could still index files you don't want it
to, much like the situation today.
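
For a well-behaved robot, that check is cheap; something along these lines
(the host name and user-agent string are placeholders):

    # Sketch: filter the SFT.txt entries through robots.txt before
    # indexing anything out of DMI.txt.
    import urllib.request
    import urllib.robotparser

    BASE = "http://www.example.com"    # hypothetical server
    AGENT = "ExampleRobot"             # hypothetical user-agent string

    rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    rp.read()

    with urllib.request.urlopen(BASE + "/SFT.txt") as resp:
        urls = resp.read().decode().splitlines()

    allowed = [u for u in urls if rp.can_fetch(AGENT, u)]
    # only the 'allowed' URLs get looked up in DMI.txt and indexed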

Is there anything I missed that would make this unacceptable, or does
this seem like something that could be done?

Edward.

__ lorax@nist.gov | It is unlawful to use this document in a
/ ` / / | manner inconsistent with its packaging
/-- __/ , , , __. __ __/ +------------------------------------------
(___, (_/_(_(_/_(_/|_/ (_(_/_ | Do not taunt Happy Fun Ball