Re: Who are you robots.txt? was Re: Servers vs Agents

david jost (david_jost_at_HBSD-AMG@ccgate.sysdev.telerate.com)
Mon, 02 Dec 96 09:07:17 est


Where can I get this file?

DJ

______________________________ Reply Separator _________________________________
Subject: Who are you robots.txt? was Re: Servers vs Agents
Author: "John D. Pritchard" <jdp@cs.columbia.edu> at SMTPGateway
Date: 11/29/96 11:54 AM


> What I am proposing is that we re-evaluate the reasoning behind
> robots.txt. The proposals I have seen in this list seem to rely on the
> assumption that robots.txt is enforcible when it is quite clearly not.



It's important and useful

"/robots.txt" serves a very important and useful purpose for those who
elect to use it. that is its mission and its goal, imho. what we can do
with a passive instrument is very different from what we can do without it.
nothing could replace it *and* be simpler or more effective. at least i
can't think of anything that could.

in terms of URI space, the "/robots.txt" concept is very durable. to query
a remote server about its URI space you need a URL, available on every
server, that will respond to "interrogation". in the future, maybe a
content-typed request header would redirect this URI to a signed servlet or
applet for download and remote "query".
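a small sketch of why the fixed URL is so durable: whatever page a robot is
holding, the query point can be derived from the host alone. (python, with a
made-up helper name `robots_url`:)

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # the query point depends only on scheme and host --
    # the page's path, query and fragment are discarded
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://info.webcrawler.com/mak/projects/robots/robots.html"))
# http://info.webcrawler.com/robots.txt
```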

in terms of instrumentation, "/robots.txt" serves essential information
when it provides a "Disallow". this is what we *need* out of the standard
first. additional syntax makes life easier.
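for illustration, a minimal "Disallow"-only record and how a compliant robot
might honor it, here using python's standard robots.txt parser (the record
below is a made-up example, not any particular server's policy):

```python
from urllib.robotparser import RobotFileParser

RECORD = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(RECORD.splitlines())

# a compliant robot checks before every fetch
print(rp.can_fetch("AnyBot", "http://example.com/private/notes.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))          # True
```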


It's not a security feature of wwweb server software

in a certain sense, "/robots.txt" merely publicizes the access policy of
the server. it is free information that allows robots to navigate a URI
space without encountering 4xx/5xx replies.

securing robots can only be done from the wwweb server software and
configuration. it is the wwweb server, or HTTP semantics, that must repel
denial of service attacks on port 80. it is the wwweb server that must
enforce policies, however they are publicized, concerning USER-AGENTs and
their roaming about URI space. certainly this enforcement is not for the
HTTP spec, as it is "implementation details". HTTP provides the error codes
you can return while enforcing whatever access policies you may have.

some have suggested, at great length, that restricting access to every URL
would accomplish something for servers. of course, this does not solve the
problem of denial of service through repetitive port requests, which is the
primary vexing problem today. this problem can only be addressed by wwweb
server software that checks the return path of a socket before accepting
it, and compares the socket's source end with a table of recently denied
requests (avoiding, again, "implementation details" of the "table"). this
process is secured (at the host, ie, no firewall) against spoofing with tcp
wrappers around the wwweb server, as used around sendmail.
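the "table of recently denied requests" can be sketched as follows, under
stated assumptions (a fixed time window, one entry per source address; the
name `DenyTable` and the window length are made up here):

```python
import time

class DenyTable:
    """remember sources whose requests were recently denied and
    refuse repeats within a time window -- the 'table' the text
    deliberately leaves as an implementation detail."""

    def __init__(self, window=60.0):
        self.window = window   # seconds a denial stays "recent"
        self.denied = {}       # source address -> time of last denial

    def deny(self, source, now=None):
        self.denied[source] = time.time() if now is None else now

    def should_refuse(self, source, now=None):
        now = time.time() if now is None else now
        last = self.denied.get(source)
        if last is None:
            return False
        if now - last > self.window:
            del self.denied[source]   # entry has aged out
            return False
        return True                   # repeat from a recently denied source

table = DenyTable(window=60.0)
table.deny("10.0.0.9", now=100.0)
print(table.should_refuse("10.0.0.9", now=130.0))  # True: still in window
print(table.should_refuse("10.0.0.9", now=200.0))  # False: denial aged out
```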


Alternatives obviate robots.txt?

in the Java(tm:sun.com) Servlet environment, a robot could download a
signed servlet, configure it, serialize it, and upload it for processing.
while some might think this environment can do without "/robots.txt", it
cannot. an implicit URI policy within the servlet is only a private policy,
and a wwweb server should not make its access policies more private than
other publicly available information, like publicly available links into
the server's URI space. while there are public links there should be a
public policy statement concerning their automated use.


Classification model

the strictly essential part of "/robots.txt" is URI space classification.
this is provided in the "Disallow" fields of the existing version, and also
in the "Allow" fields of the pending proposal. the classification of robots
currently available via the "User-Agent" field will undoubtedly evolve over
time to provide more flexibility, via more complexity, in the definition of
"/robots.txt". certainly this is an important and useful area for
development.
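a sketch of how the combined "Allow"/"Disallow" classification in the
pending proposal might be evaluated, assuming the draft's first-match
semantics (rules checked in the order they occur; no match means allowed --
the function name and the record are illustrative, not from the proposal
text):

```python
def allowed(rules, path):
    """rules: ordered (directive, prefix) pairs from one User-agent record.
    the first prefix that matches the path decides; no match means allowed.
    (first-match semantics assumed, per the pending draft.)"""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True

# a record mixing Allow and Disallow, most specific rule listed first
record = [
    ("allow",    "/archive/public/"),
    ("disallow", "/archive/"),
]
print(allowed(record, "/archive/public/1996.html"))  # True
print(allowed(record, "/archive/raw/1996.html"))     # False
print(allowed(record, "/index.html"))                # True
```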


Conclusion

we should not, and really cannot, abandon "/robots.txt" merely because it
is underutilized. we should, as martijn has, move it from the fringe of the
"de facto" universe into the mainstream of Drafts and RFCs.

in the future, *more complex* and increasingly useful extensions to the
original concept will be available. for the moment, however, it is
inappropriate to complicate an underutilized system while the base system
provides what we need.




-john



You can lead a horse to water, but you can't make him drink.


_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html
