Identifying identical documents

Daniel T. Martin (MARTIND@carleton.edu)
Wed, 03 Jul 1996 14:58:28 -0500 (CDT)


In case it helps to explain things, all of these questions come out of
considering how best to handle webservers that, because of the operating system
or other constraints, treat URLs case-INsensitively.

I've been looking at various items on the list, and something has me wondering: several
times, in response to suggestions that would make it possible to determine whether, say,
/foo/bar.html and /goo/../foo/bar.html are the same document, people have responded
"use checksums".

Well, I'm curious what checksum method people are actually using - some CRC variant,
a simple sum modulo 2^n, etc. What have robot designers found sufficient for comparing
two documents to determine whether URL B is in fact simply a re-occurrence of URL A?
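
(For concreteness, here is a minimal sketch of the "use checksums" idea in modern
Python; the URLs, the helper name, and the choice of MD5 are all just placeholders,
and any CRC or stronger digest would slot in the same way.)

    import hashlib
    import urllib.request

    def content_fingerprint(url):
        # Fetch the URL and digest the body.  MD5 is just one choice;
        # a CRC (zlib.crc32) or any other digest would work the same way.
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        return hashlib.md5(body).hexdigest()

    # Hypothetical usage: both spellings of the same path should collapse.
    a = content_fingerprint("http://example.com/foo/bar.html")
    b = content_fingerprint("http://example.com/goo/../foo/bar.html")
    print("same document" if a == b else "different documents")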

Related to this, here is a suggestion for an extension to robots.txt (as if we don't
have enough already):

Pragma: case-insensitive # 'cause I happen to LIKE VMS, that's why!

which would indicate that, on this server, URLs are not to be considered distinct if
they differ only in case. Of course, robots already expecting the occasional
case-insensitive server and using good checksums won't get much new information out
of this, but it could potentially speed things up on the robot's end.
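
To make the idea concrete, a robot honoring such a pragma could canonicalize paths on
those hosts before adding URLs to its seen-list. A minimal Python sketch, assuming the
pragma has already been parsed out of robots.txt (the function name, set name, and
host are made up for illustration):

    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url, case_insensitive_hosts):
        # Lowercase the path on hosts that declared the pragma, so
        # /Foo/BAR.html and /foo/bar.html collapse to one entry in the
        # robot's table of already-seen URLs.
        parts = urlsplit(url)
        if parts.hostname in case_insensitive_hosts:
            parts = parts._replace(path=parts.path.lower())
        return urlunsplit(parts)

    # Hypothetical usage with a made-up host name:
    hosts = {"vms.example.com"}
    print(canonical_url("http://vms.example.com/Foo/BAR.html", hosts))
    # -> http://vms.example.com/foo/bar.html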

Speaking of robots.txt, just what are all of the extensions currently on the
table? I've gotten lost looking through the archives - the only ones that seem
to have any support (i.e. more than one post) are:
- an Allow header (just to be more positive)
- sh-file-globbing-style pattern matching (i.e. /~*/foo.html matches /~cow/foo.html
  and /~monster/foo.html but not /~ftp/bar.html) - see the sketch after this list
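
As for the globbing extension, the matching itself is easy to sketch with sh-style
pattern rules; the pattern and paths below are just the example from the list above,
and fnmatchcase is only one possible way to implement it:

    from fnmatch import fnmatchcase

    pattern = "/~*/foo.html"
    for path in ("/~cow/foo.html", "/~monster/foo.html", "/~ftp/bar.html"):
        # Caveat: fnmatch's '*' also crosses '/', so /~a/b/foo.html would
        # match too; a real robots.txt parser might want to forbid that.
        verdict = "matches" if fnmatchcase(path, pattern) else "does not match"
        print(path, verdict)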

Are there any other extensions that are being seriously considered?

-=-=-=-
DaNiEl MaRtIn *your .sig here*