Re: HEAD request [was Re: Server name in /robots.txt]

Davide Musella (davide@jargo.itim.mi.cnr.it)
Mon, 29 Jan 1996 18:21:28 -0100


Martijn Koster wrote:
> That is better, and my non-existent ideal server would do it only at
> document submission time. However, it is server-side parsing, which
> is unfortunate, and it still requires server changes for the majority
> of deployed servers. I think that expecting the entire server codebase to
> change for a few user-agents (say robots) is unrealistic.

Yes, that's also my idea of the ideal server, but why unrealistic?
Robot activity is at the base of the Internet. If we want to keep up with the
Internet's growth, we must study new cataloguing techniques, because the current
methods aren't sufficient. So small changes that make the net better aren't so
unrealistic. Besides, your suggestions also need some changes to the servers.

> >Yes, but if they can work only with the data content in an HTTP header,
> >why request the whole document...
> >You can save 90% of the retrieval time, and the load on the net will be a
> >bit lower.
> Similarly you can do a full GET, and stop retrieving after reading the
> META Tags in the HEAD. This would save almost as many bytes.
> The main question is though, will the agent/browser do just a HEAD and
> be satisfied with that, or will it do a full GET anyway? If it does a full
> GET, as I believe it will (see below), then you're simply duplicating
> information and wasting bytes.
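
As a rough illustration of the partial-GET idea quoted above, here is a
minimal Python sketch. It assumes the response body arrives as a sequence of
byte chunks; the names HeadMetaReader and read_metas are invented for this
example, and the network side is left out entirely.

```python
from html.parser import HTMLParser


class HeadMetaReader(HTMLParser):
    """Collect META attributes and note when the document head ends."""

    def __init__(self):
        super().__init__()
        self.metas = []         # one dict of attributes per META element
        self.head_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and not self.head_done:
            self.metas.append(dict(attrs))
        elif tag == "body":
            self.head_done = True   # a BODY start tag implies the head is over

    def handle_endtag(self, tag):
        if tag == "head":
            self.head_done = True


def read_metas(chunks):
    """Feed chunks until the head is finished; return (metas, bytes_consumed)."""
    reader = HeadMetaReader()
    consumed = 0
    for chunk in chunks:
        consumed += len(chunk)
        reader.feed(chunk.decode("latin-1"))
        if reader.head_done:
            break               # a real client would close the connection here
    return reader.metas, consumed
```

The saving depends on how early the head ends relative to the chunk size, so
this only shows the mechanism, not the 90% figure.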

It's the price to pay. We are trying to lower the entropy of the HTTP answer to
a robot's request, so the entropy must grow somewhere; we must decide where.
Different methods only move the place where that happens.
The problem is: how much can we spend (in CPU, disk, memory and network load)
to implement some new, essential features for web cataloguing?

> But WebCrawler also needs content: for a full-text index, to find new
> links, for analysis etc. So we'll continue GETting.
Yes, this is a problem. The solution could be to make another table.
I'm gathering some statistics to calculate the size of this and of other
possible tables. I'll post the results.

> Have you actually got any indication that anyone on the client
> side wants HTTP-EQUIV, thinks it's better than the alternatives,
> and wants to implement it?

I've received many responses, but nobody told me anything about this method.
So nobody has said to me "I don't agree with the http-equiv method" (except you),
but I also haven't received any positive response.

>-- Martijn
>
> HTTP-EQUIV Considered harmful

I'm glad to read this. I hope we can find the best way to resolve this
problem.

> 2. HTTP-EQUIV considered harmful
>
> 2.1 It is not backwards compatible
>
> Disallowing the combined use of the HTTP-EQUIV and NAME attributes
> would make some previously conforming HTML documents non-conforming.
> This is undesirable.
>
> For example, the instance:
>
> <META HTTP-EQUIV="..." NAME="...">
>
> is legal in RFC 1866, but does not conform to the proposed extension.

Yes, that's true. I had not thought about this problem, but is it a big problem?
The idea is that this META info MUST be used, and that the HTML author is normally
not a computer-oriented :) person, so to make this tag easier to use
I've defined a linear syntax. It could add some redundancy to the code, nothing
more. If you think there are too many documents with the META tag inside (?)
and with both the HTTP-EQUIV and NAME attributes in the same tag (?), I'll have
no problem cutting that sentence.
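
One way to find out how common such documents are would be to scan a sample of
pages for META elements carrying both attributes. The following is only a
sketch under that assumption; CombinedMetaFinder and find_combined_metas are
invented names, and fetching the sample pages is not shown.

```python
from html.parser import HTMLParser


class CombinedMetaFinder(HTMLParser):
    """Record META elements that carry both HTTP-EQUIV and NAME."""

    def __init__(self):
        super().__init__()
        self.combined = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        names = {name for name, _ in attrs}
        if tag == "meta" and {"http-equiv", "name"} <= names:
            self.combined.append(dict(attrs))


def find_combined_metas(html_text):
    """Return the attribute dicts of all META elements using both attributes."""
    finder = CombinedMetaFinder()
    finder.feed(html_text)
    finder.close()
    return finder.combined
```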

> 2.2 It prevents future additions
>
> The imposition of a syntax and semantics on all CONTENT attribute
> values precludes the definition of future conflicting value syntaxes.
> This severely reduces the extensibility of the <META> element.
>
> For example, the instances:
>
> <META NAME="cost" CONTENT="10 dollars">
> <META NAME="bestquote" CONTENT="Et tu, Brute">
>
> would have the semantics of "10 AND dollars", and "Et AND tu OR Brute".
>
> It should also be noted that this constraint is in conflict with
> the proposed extension itself, in that it prescribes HTTP
> conforming values for HTTP-EQUIV attributes named after HTTP headers,
> which do not use the AND/OR logic.
You know I've only formalized the use of CONTENT. I haven't changed anything.
If you found CONTENT="Et tu, Brute", you would have read it as "Et AND tu OR Brute"
even before my draft.
In fact you suggested that I change that part to:
"Keyword phrases are separated by commas?"
but I didn't think it was any clearer than my definition, so I didn't.
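
The reading discussed here (spaces as AND, commas as OR) can be made concrete
with a small Python sketch. It is only an illustration of the two quoted
examples, not the draft's grammar; content_to_expression is an invented name.

```python
def content_to_expression(content):
    """Render a CONTENT value with spaces read as AND and commas as OR."""
    # Split on commas first (OR), then on whitespace inside each phrase (AND).
    phrases = [phrase.split() for phrase in content.split(",")]
    return " OR ".join(" AND ".join(words) for words in phrases if words)
```

Applied to the two quoted CONTENT values, it reproduces the expressions given
in the objection.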

> 2.3 The inclusion of HTTP-specific instructions in HTML is counter to
> the protocol-independent nature of HTML.
>
> The inclusion of HTTP-specific instructions goes counter to this
> clean separation, and this negatively affects both the meta
> information and the HTML document. If a browser only supports META
> HTTP-EQUIV it will not be able to act on this information when served
> via a protocol other than HTTP; so the meta data goes to waste, and
> the space is wasted in the HTML.

This was also the case before my draft.

> 2.4 It opens up name-space conflicts in HTTP.
>
> There is a possible conflict between HTTP-EQUIV attribute values and
> HTTP header values, as the META and HTTP definitions of syntax and
> semantics may differ. This complicates the future extension of both
> the META and HTTP work.
> Even if the syntax is correct there may be semantic problems,
> which may confuse, or might be used for spoofing.

But I've written:

Do not name an HTTP-EQUIV attribute the same as a response header
that should typically only be generated by the HTTP server. Some
inappropriate names are "Server", "Date", and "Last-Modified".
Whether a name is inappropriate depends on the particular server
implementation. It is recommended that servers ignore any META
elements that specify HTTP equivalents (case insensitively) to their
own reserved response headers.
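
A server following this recommendation might filter the META-derived headers
roughly as below. This is a sketch only: the reserved set holds just the three
names listed above, and headers_from_metas is an invented helper that assumes
the (name, content) pairs have already been extracted from the document.

```python
# Names taken from the paragraph above; a real server would use its own list.
RESERVED = {"server", "date", "last-modified"}


def headers_from_metas(metas, reserved=RESERVED):
    """Keep (name, content) pairs whose name is not a reserved header.

    The comparison is case-insensitive, as the draft text recommends.
    """
    return [(name, content) for name, content in metas
            if name.lower() not in reserved]
```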

> There is even already some lack of clarity in the values proposed now.
> For example, the relationship, if any, between HTTP-EQUIV="expire"
> and the HTTP header "Expires", or HTTP-EQUIV="Timestamp" and the
> HTTP header "Last-modified" etc. is unclear.
A server can decide how long the expiry time is, but then it is the same for
all the docs it serves. A school timetable could have an expiry time
of a year, while an internet-draft has an expiry time of six months. How can
you tell the server "hey, my docs will expire in a year" rather than
"in six months"?
I think the difference between Timestamp and Last-modified is clear:
we are speaking about documents, not files. The file is only the box that
contains the document, nothing more.

> 2.5 In the common case this information is an unnecessary duplication.
>
> The most common HTTP methods (GET and POST) result in the HTML
> content being transmitted. In this case, the information specified
> in the HTTP-EQUIV is sent twice: once as HTTP header, and once in
> the document content. This is an unnecessary waste of bandwidth.

That info must be included only in the answer to a HEAD request; this is
specified in the draft.
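
In server terms the rule could look like the sketch below, which assumes the
META-derived headers are already available as a list; response_headers is an
invented name and does not come from the draft.

```python
def response_headers(method, base_headers, meta_headers):
    """Add META-derived headers only when answering a HEAD request."""
    headers = list(base_headers)      # copy, so the caller's list is untouched
    if method.upper() == "HEAD":
        headers.extend(meta_headers)  # GET/POST carry the info in the body
    return headers
```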

> Limiting the generation of HTTP headers for HTTP-EQUIV attributes
> to HEAD requests would alleviate this duplication, but this may
> mean the construct is used too little to make it worthwhile to
> standardise and implement.

Oops..

>
> The use of meta data is especially important for the special
> category of User-agents known as robots [ROBOT], and they could be
> conceivably modified to do HEAD requests. However, this is unlikely
> to happen as robots generally need the entire content: they need to
> parse content to find new URLs, they often use full-text indexing
> technology which works best on complete content, and they may wish
> to do further analysis on the content to assess desirability or
> statistical properties.

OK, but I'm talking only about cataloguing, because I guess most
robots want to index the web. If a robot wants the whole doc
to do some analysis, it probably doesn't need the meta info either.

> 2.6 It requires server-side parsing.
>
> HTTP servers need to parse the HTML document in order to generate
> headers for HTTP-EQUIV attributes. This is undesirable for a number
> of reasons:
>
> - implementing even a partial HTML parser correctly is considerable
> effort.
> - it means servers may need to be modified as the HTML standard
> develops.
> - the parsing consumes additional CPU and memory resources.
>
> The client is the one using and applying the META data; it should
> be given most of the flexibility and burden.
>
> 2.7 It does not allow for rich meta data formats.
>
> The data transmitted in the HTTP header has to conform to strict
> syntax rules. At the very least they may not contain a CR-LF
> followed by a non-space or another two CR-LF pair. The proposal
> provides no encoding mechanism, so these restrictions must be present
> also in the CONTENT provided with an HTTP-EQUIV attribute.
> This limits the power of expression of the meta data.

Yes, but you have to remember that this meta info is extracted by the author
himself. To be sure that the info you receive is accurate, you must be sure
that the author has fully understood how to use the META tag.
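
The syntax restriction named in 2.7 can at least be checked mechanically
before a CONTENT value is copied into a header. The sketch below tests only
that one restriction (a CR-LF must be followed by a space or tab, i.e. a
folded continuation line); is_header_safe is an invented name, and bare
line feeds or other control characters are not examined.

```python
import re

# A CR-LF is tolerable only when the next character is a space or tab
# (header line folding); anything else would end the header field.
_UNSAFE_BREAK = re.compile(r"\r\n(?![ \t])")


def is_header_safe(content):
    """Return True if no CR-LF in content would terminate the header field."""
    return _UNSAFE_BREAK.search(content) is None
```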

> 2.8 It does not allow for meta data content negotiation.
>
> The CONTENT values of HTTP-EQUIV attributes can not be negotiated.
> This means one cannot specify a preference to receive meta data
> in HTML, URC, or IAFA format. It also means the language of the
> meta data cannot be selected.
>
> This limits the power of expression of the meta data.
>

>
> 3. Alternatives for meta data in HTML
>
> This section aims to show that alternative solutions exist that
> do not share the same, or as many, problems.
>
> It is not meant to be a complete overview of alternatives,
> nor a complete analysis of each alternative, let alone
> a full solution specification.
>
> 3.1 Using NAME
[...]
> 3.2 A META HTTP method
[...]
> 3.3 Using Accept headers for meta data
[...]
> 3.4 Using the <LINK> element
[...]

A new method with new problems.
How do you solve the problem of retrieving the external/internal links contained
in an HTML doc?
And who must build the external meta-file? Do you think an author can build it?

bye
Davide
------------
Davide Musella
National Research Council, Milan, ITALY
tel. +39.(0)2.70643271
e-mail: davide@jargo.itim.mi.cnr.it