Re: Possible robots.txt addition

Ian Graham (ianweb@smaug.java.utoronto.ca)
Thu, 7 Nov 1996 11:34:34 -0500 (EST)


Yes, URI headers could be added with a server module (or perhaps
Content-Location:, which is apparently the HTTP/1.1 reincarnation
of URI), based on appropriate configuration information (in
httpd.conf, for example). However, my hope here was to implement
something simple that would not mean modifying servers. A lot of
people don't have that skill, or are running a server for which
such changes are not so easy.
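
For those who do want the server-side route, the idea is roughly this
(an untested Python sketch rather than a real server module; the
canonical hostname and port are invented for illustration):

  # Sketch: attach a Content-Location header to every response,
  # pointing at the canonical copy of each resource.
  from http.server import HTTPServer, SimpleHTTPRequestHandler

  CANONICAL = "http://www.newhost.example"   # invented hostname

  class CanonicalHandler(SimpleHTTPRequestHandler):
      def end_headers(self):
          # Advertise the preferred URL for the resource being served.
          self.send_header("Content-Location", CANONICAL + self.path)
          super().end_headers()

  if __name__ == "__main__":
      HTTPServer(("", 8000), CanonicalHandler).serve_forever()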

I too believe that some of this information should be expressed
within HTTP. However, at present it is not. My proposal assumed
that something like this would be of immediate use to enough
people that it would seem a useful robots file addition. That does
not appear to be the case... ;-).

As you say below, it would be nice to be able to ask generic questions
about a server's URL space, as opposed to just receiving information
about a specific URL. This would require some sort of database
interface to the server space.... which is an interesting idea, but
far more than I wanted to get into right now.
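
To make that concrete, I imagine something like the following, though
I stress that the query format and mapping data here are pure
invention on my part (a hypothetical CGI, sketched in Python):

  # Answer questions about the server's URL space, e.g.
  #   GET /cgi-bin/urlspace?prefix=/old/
  import os
  from urllib.parse import parse_qs

  MAPPINGS = {"/old/": "http://www.newhost.example/new/"}  # made-up data

  query = parse_qs(os.environ.get("QUERY_STRING", ""))
  prefix = query.get("prefix", ["/"])[0]

  print("Content-Type: text/plain")
  print()
  for old, new in MAPPINGS.items():
      if old.startswith(prefix):
          print(old, "->", new)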

Ian

> The http-equiv is only one way to set this HTTP header; you should be
> able to configure your server to include such a header on every outgoing
> resource. Pretty trivial to code in Apache, and you can even check the Host
> header beforehand to see if you need to.
>
> After all, it is your server that wants to do something out of the
> ordinary, not a robot. So let the server do it.
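
The logic is indeed simple; a rough sketch (Python standing in for an
actual Apache module, hostnames invented):

  # Emit the extra header only when the request arrived under a
  # non-preferred hostname.
  PREFERRED_HOST = "www.newhost.example"

  def canonical_header(request_host, path):
      """Return an extra (name, value) response header, or None."""
      if request_host.lower() != PREFERRED_HOST:
          return ("Content-Location",
                  "http://%s%s" % (PREFERRED_HOST, path))
      return None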
>
> >It also does not indicate why the redirection occurs --
> >there is nothing to say that the original domain *name* is invalid, only
> >that the particular URL is moved.
>
> If you're really bothered about that, why not argue for a change to HTTP
> to have a server send a Preferred-Host header in the response? That could
> give the required semantic information.
>
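
For the record, such a response might look like this (Preferred-Host
being, as you say, a hypothetical addition to HTTP; the hostname is
invented):

  HTTP/1.1 200 OK
  Content-Type: text/html
  Preferred-Host: www.newhost.example
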
> >> The next step I did was run a CGI script on the old machine which
> >> reported a 302 to any known robots and a redirect to others.
> >>
> >> I still had to email the maintainers of the search engines to drop the
> >> old pages. I did have my mail address in the header of each page so that
> >> the maintainers could mail and confirm that the 'delete' request was not
> >> spoofed email... It still took 6 to 9 months to effectively complete the
> >> move.
> >
> >Which are among many of the reasons I am making this proposal.
>
> Just because things don't work the way you want, you can't use that
> to justify demanding new techniques and expect _them_ to be
> implemented properly. Fix the original problem!
>
> You want search engines to wake up to the fact that a URL has moved, or
> has been deleted. So demand that when you submit a URL that results in
> a 404, it gets removed from their database. Demand that they do the
> right thing with Temporary Redirects or Permanent Redirects. Demand
> that they pay attention to Expires.
>
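
The behaviour being demanded here is easy enough to state. A rough
Python sketch (the index object and its methods are imaginary):

  # What "doing the right thing" on a recheck might look like.
  from http.client import HTTPConnection
  from urllib.parse import urlsplit

  def recheck(index, url):
      parts = urlsplit(url)
      conn = HTTPConnection(parts.netloc)
      conn.request("HEAD", parts.path or "/")
      resp = conn.getresponse()
      if resp.status == 404:          # gone: drop it from the database
          index.remove(url)
      elif resp.status == 301:        # moved for good: re-index under new URL
          index.rename(url, resp.getheader("Location"))
      elif resp.status == 302:        # temporary: keep the original URL
          pass
      # Honouring Expires would mean scheduling the next recheck from
      # resp.getheader("Expires"); omitted here.
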
> If they don't give in to those demands, then they definitely won't
> go and implement a hack specifically for you...
>
> The thing is, you have one plan to reorganise the URL space you maintain;
> others may have other plans. Maybe I'm renaming all *.htm to *.html.
> Or maybe I'm downcasing all path components. Or maybe I want to insert
> a new level in part of my path space. None of these is less reasonable
> than your plan. Should we come up with mechanisms to do all these things?
> I think not.
>
> As you yourself said, this problem is not limited to robots, so /robots.txt
> is really the wrong place for it. And it appears to be confusing enough.
>
> Later you write:
>
> >But, redirection can only redirect a
> >single URL, not an entire server's worth of content.
>
> It is true that the Web hasn't really got a model for applying a semantic
> to a certain subset of a server's URL space, which would help this case
> and others. But I don't think that's something we can easily fix...
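
True; the closest one can get today is a catch-all script on the old
server that maps whole subtrees, something like this (a sketch only;
the hostname and CGI setup are invented):

  # Redirect an entire subtree, not just one URL, from the old server.
  import os

  NEW_ROOT = "http://www.newhost.example"
  path = os.environ.get("PATH_INFO", "/")  # part of the URL after the script

  print("Status: 301 Moved Permanently")
  print("Location: %s%s" % (NEW_ROOT, path))
  print()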
>
>
> -- Martijn
>
> Email: m.koster@webcrawler.com
> WWW: http://info.webcrawler.com/mak/mak.html