Re: More Robot Talk

bbh@xenodata.com
Fri, 17 Jan 1997 13:29:37 -0600


Is anyone storing the MD5 of the entire contents of a document? This would provide an effectively
unique, easily storable content index. It would not necessarily do anything about the fuzzy URL
problem, but Alta-Vista and others could use a content index to automatically stop returning so
many duplicate documents. Apache docs and RFCs are examples of things people love to duplicate.
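
A rough sketch of the idea in Python, under stated assumptions: the names
content_digest and drop_duplicates, and the (url, body) pair format, are
made up for illustration and do not describe any existing robot. A digest
of this kind only catches byte-identical copies; two mirrors that differ
by a single character of markup would get different digests.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return the hex MD5 digest of a document's raw bytes."""
    return hashlib.md5(body).hexdigest()


def drop_duplicates(pages):
    """Keep the first URL seen for each distinct document body.

    `pages` is an iterable of (url, body_bytes) pairs. Both the name and
    the input shape are assumptions made for this sketch.
    """
    first_url_by_digest = {}
    for url, body in pages:
        # setdefault stores the URL only if this digest is new.
        first_url_by_digest.setdefault(content_digest(body), url)
    return list(first_url_by_digest.values())
```

The digest is 16 bytes (32 hex characters) per document, so it is cheap to
store next to each URL in an index.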

Bryan Hackney
bbh@xenodata.com

----- Begin Included Message -----

From: Jaakko Hyvätti <Jaakko.Hyvatti@iki.fi>
Subject: Re: More Robot Talk

Nice to have something of a technical nature here for a change.

The heuristics for URL canonicalization presented here do not yet take
into account the new HTTP/1.1 Host: request header. It complicates
matters further once servers actually have multiple virtual hosts
on a single IP address, differentiated by Host: headers or absolute
URIs in the request line (see RFC 2068).
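
To illustrate the point, here is a minimal Python sketch, not taken from
any real robot. The function names build_request and page_key and the
host names www.example.com and docs.example.com are hypothetical. Two
requests sent to the same IP address can differ only in the Host: header
and still name different documents, so a canonical key has to keep the
host name rather than collapse it to an address.

```python
def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request carrying a Host: header."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def page_key(host: str, path: str) -> tuple:
    """Key a page by lower-cased host name and path, not by IP address."""
    return (host.lower(), path)


# Hypothetical virtual hosts that could share one IP address.
req_a = build_request("www.example.com", "/index.html")
req_b = build_request("docs.example.com", "/index.html")
```

Here req_a and req_b would go to the same address and port, yet
page_key gives them different keys because the host names differ.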

...

----- End Included Message -----

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html