Re: Possible robots.txt addition

Martijn Koster (m.koster@webcrawler.com)
Wed, 6 Nov 1996 19:39:50 -0800

Next message: P. Senthil: "Re: technical descripton"
Previous message: Brian Clark: "Re: Domains and HTTP_HOST"
Maybe in reply to: Ian Graham: "Possible robots.txt addition"

At 12:34 PM 11/6/96, Ian Graham wrote:
>> In message <Pine.SUN.3.95.961105204518.18054D-100000@airedale.cisco.com>,
>> Issac Roth <iroth@cisco.com> writes
>>
>> <Meta http-equiv="URI" contents="http://the.proper/location.html">
>
>This is fine, but means that you have to unnecessarily modify all the
>HTML documents, and also does not work for non-html based data (pdf, ps,
>images, etc.)

The http-equiv is only one way to set this HTTP header; you should be
able to configure your server to include such a header on every outgoing
resource. Pretty trivial to code in Apache, and you can even check the Host
header beforehand to see if you need to.

After all, it is your server that wants to do something out of the
ordinary, not a robot. So let the server do it.
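
A minimal sketch of that server-side idea, written as a Python WSGI
wrapper rather than an Apache module. The host name "the.proper" is taken
from the quoted example; the function names, the bare-URL header value,
and the choice to send the header when no Host header is present are all
assumptions for illustration, not a description of any real server.

```python
# Hypothetical preferred host name, borrowed from the quoted example.
PREFERRED_HOST = "the.proper"


def extra_headers(request_host, path, preferred_host=PREFERRED_HOST):
    """Return the extra response headers for one request.

    The Host header is checked first, so the header is only added
    when the request came in under a non-preferred name.
    """
    # Drop an optional port and normalise case before comparing.
    host = (request_host or "").split(":", 1)[0].lower()
    if host == preferred_host.lower():
        return []
    # Assumption: the header carries the bare URL, like the META example.
    return [("URI", "http://%s%s" % (preferred_host, path))]


def add_location_hint(app, preferred_host=PREFERRED_HOST):
    """Wrap a WSGI app so every outgoing resource gets the header."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            extra = extra_headers(environ.get("HTTP_HOST"),
                                  environ.get("PATH_INFO", "/"),
                                  preferred_host)
            return start_response(status, list(headers) + extra, exc_info)
        return app(environ, patched_start)
    return wrapped
```

Because the wrapper never looks at the body, it applies equally to HTML,
PDF, PostScript and images.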

>It also does not indicate why the redirection occurs --
>there is nothing to say that the original domain *name* is invalid, only
>that the particular URL is moved.

If you're really bothered about that, why not argue for a change to HTTP
to have a server send a Preferred-Host header in the response? That could
give the required semantic information.
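
To make that concrete, here is a rough sketch of how a robot might use
such a header. Preferred-Host does not exist in HTTP; both the header and
the function name below are hypothetical, and the sketch assumes the
header value is a plain host name.

```python
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url, response_headers):
    """Rewrite url to use the host named in a Preferred-Host header.

    response_headers is a list of (name, value) pairs. If the
    (hypothetical) header is absent, the URL is returned unchanged.
    """
    preferred = None
    for name, value in response_headers:
        # HTTP header names are case-insensitive.
        if name.lower() == "preferred-host":
            preferred = value.strip()
    if not preferred:
        return url
    parts = urlsplit(url)
    # Only the host part changes; path, query and fragment are kept.
    return urlunsplit((parts.scheme, preferred, parts.path,
                       parts.query, parts.fragment))
```

Unlike a redirect, this says something about the host name itself, so a
robot could apply it to every URL it holds for that server.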

>> The next step I did was run a CGI script on the old machine which
>> reported a 302 to any known robots and a redirect to others.
>>
>> I still had to email the maintainers of the search engines to drop the
>> old pages. I did have my mail address in the header of each page so that
>> the maintainers could mail and confirm that the 'delete' request was not
>> spoofed email... It still took 6 to 9 months to effectively complete the
>> move.
>
>Which are among many of the reasons I am making this proposal.

Just because things don't work the way you want, you can't use that
to justify a demand for new techniques and expect _them_ to be
implemented properly. Fix the original problem!

You want search engines to wake up to the fact a URL has moved, or is
deleted. So demand that when you submit a URL that results in a 404, the
search engine removes it from its database. Demand that they
do the right thing with Temporary Redirects or Permanent Redirects.
Demand that they pay attention to Expires.
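
As a rough illustration of "the right thing", the sketch below maps a
fetch result to an index action. The action names and the function are
made up for this example, and the policy is one plausible reading of the
demands above, not how any particular search engine behaves.

```python
from email.utils import parsedate_to_datetime


def index_action(status, headers):
    """Decide what an index should do with a URL after refetching it.

    headers is a dict with lowercase header names.
    Returns an (action, detail) pair.
    """
    if status == 404:
        # The resource is gone: drop it from the database.
        return ("remove", None)
    if status == 301:
        # Permanent redirect: index the new location instead.
        return ("replace", headers.get("location"))
    if status == 302:
        # Temporary redirect: keep the original URL in the index.
        return ("keep", None)
    expires = headers.get("expires")
    if status == 200 and expires:
        try:
            # Refetch once the stated expiry time has passed.
            return ("recheck", parsedate_to_datetime(expires))
        except (TypeError, ValueError):
            pass  # unparseable date: fall through to the default
    return ("keep", None)
```

All of this uses status codes and headers that HTTP already defines; no
new mechanism is needed.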

If they don't give in to those demands, then they definitely won't
go and implement a hack specifically for you...

The thing is, you have one plan to reorganise the URL space you maintain,
others may have other plans. Maybe I'm renaming all *.htm to *.html.
Or maybe I'm downcasing all path components. Or maybe I want to insert
a new level in part of my path space. None of these is less reasonable
than your plan. Should we come up with mechanisms to do all these things?
I think not.

As you yourself said, this problem is not limited to robots, so /robots.txt
is really the wrong place for it. And it appears to be confusing enough.

Later you write:

>But, redirection can only redirect a
>single URL, not an entire server's worth of content.

It is true that the Web hasn't really got a model for applying a semantic
to a certain subset of a server's URL space, which would help this case
and others. But I don't think that's something we can easily fix...

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html
