Re: Server name in /robots.txt

Mordechai T. Abzug (mabzug1@gl.umbc.edu)
Sat, 20 Jan 1996 21:00:16 -0500 (EST)


"DE" == David Eagles spake thusly:
DE> A good solution would be the inclusion of a new optional html tag:
DE> <host="www.anchovies.com">

"SE" == Someone Else space thusly:
SE> True, but if we're in the process of adding new tags, etc, wouldn't it be >better to add an HTTP response field generated by the server as in

"TB" == Tim Bray spake thusly:
TB> Nope. This can't be done automatically. I may not want to tell the
TB> world that the preferred name for this happens to be the one that
TB> this particular server is operating under at this particular moment.
TB> All the server can know is what it's known as at the moment. We need
TB> more indirection.
TB>
TB> Also, I not only want to solve the hostname problem, I want to solve
TB> the /a/./b/./c/ and /a/x/../b/y/../c/ and hard-link and symlink file
TB> problems. To do this, a *human* (or a document management system)
TB> needs to assert what the canonical name for something is.

We've got 5% of *webmasters* using /robots.txt. Does anyone think we'll be
able to convince more than a minute fraction of *HTML writers* to conform to
some obscure standard? Remember, you don't need to know anything to become an
HTML author. And don't forget the legacy problem: there already are millions
of documents in existence. Would *you* like to go ahead and modify each of the
docs you've written to conform to any new standard?

Proposing a new standard for *servers* is even worse. If I've got a server
up, running, and configured the way I like, you'll need a shotgun to convince
me to go through the hassle of downloading, compiling, testing, and
configuring some new version unless it comes with *major* benefits.

If this really is a problem for your robot, I'd suggest that you solve it
yourself. One suggestion: use some sort of document comparison algorithm. If
you only wish to avoid perfect duplication (i.e., symlinks, hard links, etc.),
I'd suggest using MD5 (don't use checksums; they're too unreliable for
something the size of the web) to generate digests, and use the digests as the
keys of an associative array, with the URL as the value. Every time you
download a document, take its digest and check whether you've already seen
that digest. Note that for a corpus of *this* size, even MD5 might not be
perfectly reliable (i.e., as in 23 people for a 50/50 chance of a shared
birthday), so once you have a first match, you might want to use some longer
comparison (perhaps download the original document again?) to confirm that the
two really are the same thing.
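A minimal sketch of that scheme might look like the following (Python; the
names are mine, and fetch_document() is a hypothetical routine assumed to
return the raw bytes of a previously seen URL):

    import hashlib

    # MD5 hex digest -> URL of the first document seen with that digest
    seen = {}

    def check_duplicate(url, body):
        """Return the URL of an earlier identical document, or None.

        body is the raw bytes of the document just downloaded.
        """
        digest = hashlib.md5(body).hexdigest()
        if digest in seen:
            # On a corpus this large, don't trust the digest alone:
            # confirm with a full byte-for-byte comparison.
            original = fetch_document(seen[digest])  # hypothetical re-fetch
            if original == body:
                return seen[digest]
            return None  # rare collision; treat as a distinct document
        seen[digest] = url
        return None

The point is just that the digest does the cheap first-pass lookup, and the
longer comparison only happens on the (rare) digest match.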

-- 
			  Mordechai T. Abzug
http://umbc.edu/~mabzug1   mabzug1@umbc.edu     finger -l mabzug1@gl.umbc.edu
Set laser printers to "stun".