Re: HEAD request [was Re: Server name in /robots.txt]

Martijn Koster (m.koster@webcrawler.com)
Fri, 26 Jan 1996 18:20:24 -0700


Hi all,

Sorry for the delay, but to prevent this discussion going around in circles
I wrote up all arguments and solutions I could think of in internet-draft
form, and appended it to this message. Please have a look, it is directly
relevant to this discussion, and I'd love comments...

At 7:26 PM 1/23/96, Davide Musella (CNR wrote in reply to me:

>> The server is supposed to parse the document, and slam the value into
>> an HTTP header. This is of course a waste of server CPU and bandwidth
>> for the majority of cases, and opens a whole can of worms with the
>> semantics of HTTP header namespace collisions.
>
>It isn't the only way to handle the META info, The WN server does it using
>a table, so they parse the document only once a day.
>Ok, it isn't the best way, but there are many ways to resolve it.

That is better, and my non-existent ideal server would do it only at
document submission time. However, it is server-side parsing, which
is unfortunate, and it still requires server changes for the majority
of deployed servers. I think that expecting the entire server codebase to
change for a few user-agents (say robots) is unrealistic.

>> This makes far more sense -- let user-agents decide what they want to do
>> with the data;
>
>Yes, but if they can work only with the data content in an HTTP header,
>why request the whole document...
>You can save the 90% of retrieve time, and the load of the net will be a
>bit lower.

Similarly you can do a full GET, and stop retrieving after reading the
META Tags in the HEAD. This would save almost as many bytes.

The main question is though, will the agent/browser do just a HEAD and
be satisfied with that, or will it do a full GET anyway? If it does a full
GET, as I believe it will (see below), then you're simply duplicating
information and wasting bytes.

>> So the idea is that you can do both HTTP-EQUIV=foo and NAME=bar in the
>> same META tag. The last draft I saw on the subject had HTTP-EQUIV
>> as the main thing, with NAME being optional. I think it makes far
>> more sense to have NAME, and abolish HTTP-EQUIV, or at least make
>> it a secondary choice.
>> In fact it'd be good if robots started to promote this. I'd add it
>> to WebCrawler if I wasn't buried in other work...
>
>But, if the webCrawler can index a doc by the content of the META NAME tag
>it can also use the META HTTP-EQUIV tag so it can use an HEAD request
>have the indexing info without parse the document and be sure to have
>the best indexing info about that doc, 'cause the author has indexed it
> for you.

But WebCrawler also needs content: for a full-text index, to find new
links, for analysis etc. So we'll continue GETting.

But say we didn't need the whole content; if we'd do a HEAD, chances
are the server doesn't support HTTP-EQUIV, and we have to follow with a
full GET anyway. This means time, bandwidth, and the agents' resources
are wasted. So in practice we'd do a GET first time, and parse it
out of the HTML document.

Have you actually got any indication that anyone on the client
side wants HTTP-EQUIV, thinks it's better than the alternatives,
and wants to implement it?

>I've made some alterations to that draft, to be clearer and more exhaustive.

OK. I believe most of the comments I made below still hold.

Regards,

-- Martijn

--------

Martijn Koster
January 1996

HTTP-EQUIV Considered harmful

Status of this Memo

This document is a working document. The latest version may be
found on <URL:http://info.webcrawler.com/mak/projects/meta/
equiv-harmful.html>

This is a working document only; it should neither be cited
nor quoted in any formal document.

Distribution of this document is unlimited.

Please send comments to the author.

Abstract

The use of the HTML META element with HTTP-EQUIV attribute
for generalised meta data should be discouraged.

Table of Contents

1. Introduction
1.1. The definition of the META element
1.2. The definition of names
1.3. Overview of objections
2. HTTP-EQUIV considered harmful
2.1. It is not backwards compatible
2.2. It prevents future additions
2.3. The inclusion of HTTP-specific instructions in HTML is counter
to the protocol-independent nature of HTML.
2.4. It opens up name-space conflicts in HTTP.
2.5. In the common case this information is an unnecessary duplication.
2.6. It requires server-side parsing.
2.7. It does not allow for rich meta data formats.
2.8. It does not allow for meta data content negotiation.
3. Alternatives for meta data in HTML
3.1. Using NAME
3.2. A META HTTP method
3.3. Using Accept headers for meta data
3.4. Using the <LINK> element
3.5. Using content-selection
4. Conclusion and Recommendations
5. Security Considerations
6. Author's Address
7. References

1. Introduction

This introduction explains the current specification of the
META element in [HTML], and the extension of the META element as
proposed in [META]. Section 2 will further discuss these issues.

1.1 The definition of the META element

The <META> element is defined in RFC 1866 [HTML] as follows:

The <META> element is an extensible container for use in identifying
specialized document meta-information. Meta-information has two main
functions:

* to provide a means to discover that the data set exists
and how it might be obtained or accessed; and

* to document the content, quality, and features of a data
set, indicating its fitness for use.

Each <META> element specifies a name/value pair. If multiple META
elements are provided with the same name, their combined contents--
concatenated as a comma-separated list--is the value associated with
that name.

...

HTTP servers may read the content of the document <HEAD> to generate
header fields corresponding to any elements defining a value for the
attribute HTTP-EQUIV.

The set of names, and the syntax and semantics of associated values, is
not defined in RFC 1866, and this may partly be the reason that
few, if any, WWW User-agents act on this information. Because of this
lack of demand, few servers implement the HTTP header generation for
<META> elements with an HTTP-EQUIV attribute.

1.2 The definition of names

Current work in progress [META] aims to address this issue by
specifying semantics and some syntax for the following set of names:

keywords, author, timestamp, expire, language, abstract,
organization, revision

It furthermore modifies the definition in RFC 1866 by specifying that:

The HTTP-EQUIV and the NAME attributes are mutually exclusives.

and that the CONTENT attribute value has to conform to a specific
syntax: The SPACE (ASCII[32]) character specifies boolean AND, and
the COMMA (ASCII[44]) specifies boolean OR.

Finally it encourages the use of the "keywords" name/value pair to
aid document cataloguing.

1.3 Overview of objections

This concept of the <META> element with the HTTP-EQUIV attribute has
several drawbacks:
- it is not backwards compatible
- it prevents future additions
- the inclusion of HTTP-specific instructions in HTML is counter to
the protocol-independent nature of HTML.
- it opens up name-space conflicts in HTTP.
- in the common case this information is an unnecessary duplication.
- it requires server-side parsing.
- it does not allow for rich meta data formats.
- it does not allow for meta data content negotiation.

These drawbacks are further explained in section 2. Alternatives are
suggested in section 3. Recommendations are presented in section 4.

2. HTTP-EQUIV considered harmful

2.1 It is not backwards compatible

Disallowing the combined use of the HTTP-EQUIV and NAME attributes
would make some previously conforming HTML documents non-conforming.
This is undesirable.

For example, the instance:

<META HTTP-EQUIV="..." NAME="...">

is legal in RFC 1866, but does not conform to the proposed extension.

2.2 It prevents future additions

The imposition of a syntax and semantics on all CONTENT attribute
values precludes the definition of future conflicting value syntaxes.
This severely reduces the extensibility of the <META> element.

For example, the instances:

<META NAME="cost" CONTENT="10 dollars">
<META NAME="bestquote" CONTENT="Et tu, Brute">

would have the semantics of "10 AND dollars", and "Et AND tu OR Brute".

It should also be noted that this constraint is in conflict with
the proposed extension itself, in that it prescribes HTTP
conforming values for HTTP-EQUIV attributes named after HTTP headers,
which do not use the AND/OR logic.
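
The AND/OR reading of the two example instances above can be sketched
in a few lines of Python. This is only an illustration: the function
names are mine, and grouping AND more tightly than OR is an assumption,
as the proposal quoted here does not spell out precedence.

```python
def parse_content(value):
    """Read a CONTENT value under the proposed syntax.

    COMMA separates OR alternatives; SPACE separates AND terms.
    Returns a list of OR alternatives, each a list of AND terms.
    Assumption: AND groups more tightly than OR.
    """
    alternatives = []
    for part in value.split(","):
        terms = part.split()  # whitespace-separated AND terms
        if terms:
            alternatives.append(terms)
    return alternatives


def describe(value):
    """Render the parsed value as an explicit boolean expression."""
    return " OR ".join(" AND ".join(terms) for terms in parse_content(value))


print(describe("10 dollars"))    # 10 AND dollars
print(describe("Et tu, Brute"))  # Et AND tu OR Brute
```

Neither result is what the author of the document meant, which is the
point: a price and a quotation are not boolean expressions.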

2.3 The inclusion of HTTP-specific instructions in HTML is counter to
the protocol-independent nature of HTML.

The WWW's success is in no small part due to the protocol-independent
nature of HTML, allowing it to be served from FTP servers, Gopher
servers, and directly from file systems. Similarly, the
content-independent nature of HTTP has advantages.

The inclusion of HTTP-specific instructions goes counter to this
clean separation, and this negatively affects both the meta
information and the HTML document. If a browser only supports META
HTTP-EQUIV it will not be able to act on this information when served
via a protocol other than HTTP; so the meta data goes to waste, and
the space is wasted in the HTML.

2.4 It opens up name-space conflicts in HTTP.

There is a possible conflict between HTTP-EQUIV attribute values and
HTTP header values, as the META and HTTP definitions of syntax and
semantics may differ. This complicates the future extension of both
the META and HTTP work.

More importantly, it means that user ignorance can easily result in
inadvertently non-conforming HTTP protocol: if a user chooses an
HTTP-EQUIV header which is defined in HTTP but doesn't use the
correct syntax and semantics, the server ends up sending bad
protocol unless it specifically checks the syntax.

Even if the syntax is correct there may be semantic problems,
which may confuse, or might be used for spoofing.

There is already some ambiguity in the values proposed now.
For example, the relationship, if any, between HTTP-EQUIV="expire"
and the HTTP header "Expires", or HTTP-EQUIV="Timestamp" and the
HTTP header "Last-Modified" etc. is unclear.

2.5 In the common case this information is an unnecessary duplication.

The most common HTTP methods (GET and POST) result in the HTML
content being transmitted. In this case, the information specified
in the HTTP-EQUIV is sent twice: once as HTTP header, and once in
the document content. This is an unnecessary waste of bandwidth.

Limiting the generation of HTTP headers for HTTP-EQUIV attributes
to HEAD requests would alleviate this duplication, but this may
mean the construct is used too little to make it worthwhile to
standardise and implement.

The use of meta data is especially important for the special
category of User-agents known as robots [ROBOT], and they could
conceivably be modified to do HEAD requests. However, this is unlikely
to happen as robots generally need the entire content: they need to
parse content to find new URLs, they often use full-text indexing
technology which works best on complete content, and they may wish
to do further analysis on the content to assess desirability or
statistical properties.

2.6 It requires server-side parsing.

HTTP servers need to parse the HTML document in order to generate
headers for HTTP-EQUIV attributes. This is undesirable for a number
of reasons:

- implementing even a partial HTML parser correctly takes considerable
effort.
- it means servers may need to be modified as the HTML standard
develops.
- the parsing consumes additional CPU and memory resources.

The client is the one using and applying the META data; it should
be given most of the flexibility and burden.
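
To give an idea of the work involved, here is a minimal Python sketch
of what a server would have to do for each HTML document before it
can send the headers. It leans on the standard library's html.parser
module; the class and function names are hypothetical, and it assumes
that either </HEAD> or <BODY> marks the end of the HEAD.

```python
from html.parser import HTMLParser


class HttpEquivCollector(HTMLParser):
    """Collect (name, value) pairs from META HTTP-EQUIV elements in the HEAD."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.done = False  # set once the HEAD has ended

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "body":
            self.done = True  # </HEAD> may be omitted; BODY also ends the HEAD
        elif tag == "meta":
            attributes = dict(attrs)  # attribute names arrive lowercased
            name = attributes.get("http-equiv")
            content = attributes.get("content")
            if name and content is not None:
                self.headers.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.done = True


def headers_from_document(html_text):
    """Return the header fields a server would generate for this document."""
    collector = HttpEquivCollector()
    collector.feed(html_text)
    return collector.headers


sample = """<HTML><HEAD>
<TITLE>Example</TITLE>
<META HTTP-EQUIV="Expires" CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
<META NAME="keywords" CONTENT="robots, meta">
</HEAD><BODY>
<META HTTP-EQUIV="Ignored" CONTENT="outside the HEAD">
</BODY></HTML>"""

for name, value in headers_from_document(sample):
    print(f"{name}: {value}")  # Expires: Tue, 04 Dec 1993 21:29:02 GMT
```

Even this sketch copies the values into the header unchecked, and a
real server would have to do this on every request or maintain a cache
of the results.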

2.7 It does not allow for rich meta data formats.

The data transmitted in the HTTP header has to conform to strict
syntax rules. At the very least, a value may not contain a CR-LF
followed by a non-space character, nor two consecutive CR-LF pairs.
The proposal provides no encoding mechanism, so these restrictions
must also apply to the CONTENT provided with an HTTP-EQUIV attribute.

In addition the values are restricted by the DTD.

This limits the power of expression of the meta data.
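
As a rough illustration of the header syntax restriction described in
this section, the following Python sketch checks whether a CONTENT
value could be copied into a single header field. The function name is
hypothetical, and treating TAB like SPACE after a CR-LF is my
assumption, based on HTTP's line folding. Bare CR or LF characters are
not examined.

```python
import re

# A CR-LF that is not followed by SPACE or TAB starts a new header line;
# this also catches two consecutive CR-LF pairs (which end the header
# block) and a CR-LF at the very end of the value.
_BREAKS_HEADER = re.compile(r"\r\n(?![ \t])")


def breaks_header_syntax(value):
    """Return True if value cannot be sent as one (possibly folded) header."""
    return _BREAKS_HEADER.search(value) is not None


print(breaks_header_syntax("robots, meta data"))          # False
print(breaks_header_syntax("first line\r\n continued"))   # False
print(breaks_header_syntax("x\r\nSet-Cookie: spoofed=1")) # True
print(breaks_header_syntax("x\r\n\r\n<HTML>fake body"))   # True
```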

2.8 It does not allow for meta data content negotiation.

The CONTENT values of HTTP-EQUIV attributes cannot be negotiated.
This means one cannot specify a preference to receive meta data
in HTML, URC, or IAFA format. It also means the language of the
meta data cannot be selected.

This limits the power of expression of the meta data.

3. Alternatives for meta data in HTML

This section aims to show that alternative solutions exist that
do not share the same, or as many, problems.

It is not meant to be a complete overview of alternatives,
nor a complete analysis of each alternative, let alone
a full solution specification.

3.1 Using NAME

Rather than concentrating on the HTTP-EQUIV attribute as the main use of
the <META> tag, one can concentrate on using the NAME attribute.
This would remove all HTTP related problems, and turn it into a
meta data construct that is restrictive, but at least simple and
safe. It also requires no server-side modification.

This would promote the use of NAME as a way of associating
general meta data, and leaves the HTTP-EQUIV as a separate issue.
HTTP-EQUIV could either be removed altogether, or restricted
to standardised HTTP headers.

This would require a rewording of the relevant section in RFC 1866,
deprecating the use of HTTP-EQUIV for non-HTTP meta data.
For backwards compatibility HTTP-EQUIV would be allowed, and
User-agents would be encouraged to substitute NAME for HTTP-EQUIV
where the NAME is missing or the HTTP-EQUIV value specifies
a non-HTTP attribute.

This still has the problem of severely limiting the power of
expression of the meta data.
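
A client-side reading of <META> along these lines might look like the
Python sketch below. It covers only the simplest case of the
substitution described above (HTTP-EQUIV is used when NAME is missing),
and joins repeated names into a comma-separated list as RFC 1866
describes. The class and function names are hypothetical, and folding
names to lower case is my assumption.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect META name/value pairs on the client side."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Prefer NAME; fall back to HTTP-EQUIV when NAME is missing.
        name = attributes.get("name") or attributes.get("http-equiv")
        content = attributes.get("content")
        if not name or content is None:
            return
        key = name.lower()  # assumption: names compare case-insensitively
        if key in self.meta:
            # Repeated names are combined into a comma-separated list.
            self.meta[key] += ", " + content
        else:
            self.meta[key] = content


def collect_meta(html_text):
    """Return a dict of meta data found in the document."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta


sample = """<HEAD>
<META NAME="keywords" CONTENT="robots">
<META HTTP-EQUIV="keywords" CONTENT="spiders">
<META NAME="author" CONTENT="A. N. Author">
</HEAD>"""

print(collect_meta(sample))
# {'keywords': 'robots, spiders', 'author': 'A. N. Author'}
```

Note that all of this happens in the client, on a document it has
already retrieved; the server is not involved.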

3.2 A META HTTP method

A new, to-be-defined HTTP method META could be used to request
meta data associated with a URL. This would return separate
content, and behave like a normal GET.

This would give the meta data complete expressive power, as any
kind of content can be returned. In fact, content can be negotiated,
for format, language, compression etc.

This would also work for non-HTML documents.

This then requires a meta data content type to be defined,
but this could be as simple as text/plain, or made as complex
as desired.

This still has the problem of requiring server modification to
link a URL with its meta data. At least the modification is
restricted to a new method, and no changes or pre-parsing
are required.

It would also limit the accessibility of the meta data to the
use of the HTTP protocol.
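
From the client's point of view such a request would differ from a GET
only in the method name. A small Python sketch using the standard
urllib.request module follows; the META method is the hypothetical one
from this section, no server is known to implement it, and the URL is
a placeholder, so the request is only built, not sent.

```python
import urllib.request


def build_meta_request(url):
    """Build (but do not send) a request using the hypothetical META method."""
    return urllib.request.Request(url, method="META")


request = build_meta_request("http://www.example.com/document.html")
print(request.get_method(), request.full_url)
# META http://www.example.com/document.html
```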

3.3 Using Accept headers for meta data

Rather than using a new method as suggested in 3.2, Accept headers
can be used in conjunction with a standard GET to request the meta
data.

All the advantages of 3.2 are inherited, but this requires no server
modification beyond standard content negotiation.

This would still limit the accessibility of the meta data to the
use of the HTTP protocol, but as it is an HTTP construct for an
HTTP solution this is acceptable.
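
As a sketch of what a client would send, the Python fragment below
builds a standard GET that asks for meta data through Accept headers.
The media type is a made-up placeholder, since no meta data content
type is defined here, and the URL is likewise only an example. The
request is built but not sent.

```python
import urllib.request

# Hypothetical media type for meta data; this document does not name one.
META_TYPE = "application/x-meta-data"


def build_metadata_request(url, language=None):
    """Build a GET request that negotiates for meta data about url."""
    headers = {"Accept": META_TYPE}
    if language:
        headers["Accept-Language"] = language
    return urllib.request.Request(url, headers=headers)


request = build_metadata_request("http://www.example.com/document.html", "en")
print(request.get_method())                   # GET
print(request.get_header("Accept"))           # application/x-meta-data
# urllib stores header names capitalized as "Accept-language".
print(request.get_header("Accept-language"))  # en
```

A server that already does content negotiation would only need a
variant of the right type to be available for the URL.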

3.4 Using the <LINK> element

A different proposal [RELREV] seeks to standardise values of the HTML
<LINK> element, which expresses relations between documents.
One such relation is "META", indicating that one document contains
meta data for another.

This construct would again alleviate all HTTP problems, and allow
complete expressive power for meta data.

This does not require modifications to servers, nor does it rely on a
correct implementation of content negotiation, and it is applicable
to protocols other than HTTP.

This doesn't provide a solution for non-HTML data, but as it is an
HTML construct this is acceptable.

A perceived disadvantage is that this means the entire document needs
to be transmitted to find the META data, but this is not true; a client
can stop receiving the transmission as soon as the relevant LINK
element has been found.

A disadvantage is that a second request is required. I believe this is
not a severe problem as retrieving the meta data is not the common
browsing case, and is offset by the advantages.
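
The early stop mentioned above can be sketched in Python as follows.
The parser is fed the document a chunk at a time and reading stops as
soon as a LINK element with the relation "META" is seen. The class and
function names and the HREF value are hypothetical, and the list of
strings stands in for data arriving from a network connection.

```python
from html.parser import HTMLParser


class MetaLinkFinder(HTMLParser):
    """Find the HREF of the first <LINK REL="META" ...> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.href is None:
            attributes = dict(attrs)
            relations = (attributes.get("rel") or "").lower().split()
            if "meta" in relations:
                self.href = attributes.get("href")


def find_meta_link(chunks):
    """Feed chunks until the link is found; return (href, chunks_read)."""
    finder = MetaLinkFinder()
    chunks_read = 0
    for chunk in chunks:
        finder.feed(chunk)
        chunks_read += 1
        if finder.href is not None:
            break  # stop reading; the rest of the document is not needed
    return finder.href, chunks_read


document = [
    '<HTML><HEAD><TITLE>Example</TITLE>\n',
    '<LINK REL="META" HREF="example.meta">\n',
    '</HEAD><BODY>\n',
    '<P>A long body that never needs to be read...</P>\n',
    '</BODY></HTML>\n',
]

print(find_meta_link(document))  # ('example.meta', 2)
```

The second request would then fetch the HREF that was found, resolved
against the URL of the document.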

3.5 Using content-selection

One could see meta data specification as a specific instance of
content-selection, and provide a PICS [PICS] based solution.
This would limit the expressive power of the meta data,
and lock it into a whole new set of problems. This may be an area
for research, but appears to be an unsatisfactory solution.

4. Conclusion and Recommendations

We have seen that the use of HTTP-EQUIV has several drawbacks
ranging from being limited as meta data, to potential conflicts with
HTTP, a separate protocol.

We have also seen that several alternatives exist which do not share
these drawbacks.

My recommendation is:

- to deprecate the use of <META> HTTP-EQUIV for general meta data,
and to specify that it only be used, with extreme caution, for existing
HTTP header values. This could be a cheap way of providing
configuration information to a server, or directly to a client
if some protocol does not support the construct.

- to specify that <META> NAME can be used for general meta data,
but not promote its use beyond the provision of free-text keywords
in the language of the document.

- to urge that future specifications of values for NAME specify a
clear semantics and syntax for CONTENT on a per-NAME basis.

- to promote the use of the <LINK> tag to specify meta information.

- to define a simple content-type for meta data, such as the IAFA
templates, and encourage research on more advanced formats

- to see what comes out of URC research, and how this fits with the
above.

5. Security Considerations

There are several security implications in the use of meta data.

On an abstract level it is always difficult to guarantee that the meta
data is applicable to the document. It is therefore important to
present the user with easy means to find out how and where meta data
was obtained.

The use of HTTP-EQUIV opens up potential problems if data in the HTML
document is passed into protocol header fields unchecked.
Syntactically incorrect values may result in invalid protocol being
transmitted. Syntactically correct values may give opportunity to
spoofing of certain fields. This document argues against the use of
HTTP-EQUIV.

6. Author's Address

Martijn Koster,
Software engineer at the WebCrawler group of America OnLine.
Email address: m.koster@webcrawler.com

See <URL:http://info.webcrawler.com/mak/mak.html> for more information

7. References

[HTML] RFC 1866, Hypertext Markup Language - 2.0, T. Berners-Lee &
D. Connolly, November 1995. <URL:ftp://ds.internic.net/rfcs
rfc1866.txt>

[META] The META Tag of HTML, Davide Musella, January 1996.
<URL:http://info.webcrawler.com/mailing-lists/robots/0324.html>
<URL:ftp://ds.internic.net/internet-drafts/
draft-musella-html-metatag-02.txt>

[ROBOT] World-Wide Web Robots, Wanderers and Spiders, Martijn Koster,
<URL:http://info.webcrawler.com/mak/projects/robots/robots.html>

[RELREV] Hypertext links in HTML, M. Maloney & L. Quin
<URL:ftp://ds.internic.net/internet-drafts/
draft-ietf-html-relrev-00.txt>

[HTTP] The HyperText Transfer Protocol
<URL:http://www.w3.org/pub/WWW/Protocols/>

[PICS] The Platform for Internet Content Selection
<URL:http://www.w3.org/pub/WWW/PICS/>