the POST myth... a web admin's opinions..

Rob Hartill (robh@imdb.com)
Thu, 11 Jul 1996 22:13:27 -0600 (MDT)


A few people have suggested that sites should use POST to protect
themselves from unwanted attention from robots. Could those people take
a few minutes to surf around "http://us.imdb.com/" and then come
back here and admit that POST is NOT A SOLUTION?

At this site you'll find a giant database with millions of distinct
URLs which cannot be hidden behind POST.. Practically every URL on that
site uses a script, whether the URL contains a "?" or not.

Now you might think that a solution is to make the entry point a POST
"protected" page. If you think that's reasonable, what do we do about
the tens of thousands of links (estimated at >60,000) to this site
that appear on other pages around the world and would allow a robot to
enter via a 'back door'?

Should the fundamental use of the web (hyperlinking)
be sacrificed so that third parties can run their robots without fear
of hurting anyone? Why do site providers have to bend over backwards
to accommodate robots? And moreover, why should they bend over backwards
to accommodate robots run by irresponsible people when in all likelihood
the robot owners' use of the collected data will bring little or no
benefit to the site being 'indexed'?

The people running the popular search engines seem to know what they
are doing (I've had requests for exceptions to be made to allow robots
a quick controlled scan of protected areas, and I've been happy to cooperate).
These controlled and well-run robots are useful, and most sites with a
robots.txt do go to the effort of allowing access to *relevant* areas of the
site.
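
For illustration, a minimal robots.txt along those lines might look
something like this (the paths and the robot name are made up, not our
real layout):

    # Made-up example, not IMDb's actual robots.txt.
    # Let one well-behaved indexer in everywhere except the scripts.
    User-agent: FriendlyIndexer
    Disallow: /cgi-bin/

    # Everyone else: stay out of the scripts and the bulk data.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /data/

Anything not listed under a Disallow is fair game, so the *relevant*
static areas stay open to everyone.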

It's in everyone's interest to be linked correctly from other sites;
that's the beauty and purpose of the web. Unfortunately, a small minority
of robot owners either choose to ignore robots.txt (often with unreasonable
excuses) or are unaware of it. The latter category of robot
owner is still irresponsible, because the guidelines are trivially easy
to find for anyone who does some simple research before starting.

I will apologise for any strong language that I've used so far on this
list out of pure frustration with the constant barrage of robotic attacks
that I've witnessed over the last 3 years, if the robot owners out there
are willing to accept that the content of this mail is reasonable and that
the ONLY current cure for the problems of rogue robots is to have all
robots respect robots.txt UNLESS THEY HAVE BEEN GIVEN PERMISSION not to.
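
Respecting robots.txt costs almost nothing. As a rough sketch (in
today's Python; "ExampleBot" and the example URLs are made up), this is
all a robot has to do before fetching anything:

    from urllib import robotparser

    # "ExampleBot" is a made-up robot name; substitute your own.
    rp = robotparser.RobotFileParser()
    rp.set_url("http://us.imdb.com/robots.txt")
    rp.read()   # fetch and parse the site's robots.txt once

    url = "http://us.imdb.com/some/page"
    if rp.can_fetch("ExampleBot", url):
        print("OK to fetch", url)   # the site hasn't objected
    else:
        print("robots.txt says keep out; honour it")

One fetch of robots.txt per site and one check per URL. That's the
whole cost of being a responsible robot owner.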

For every 'reasonable' reason to ignore robots.txt that people on this
list can come up with, there are probably several counterexamples that
would illustrate a potential problem.

The fundamental problem with ignoring robots.txt is that the robot cannot
read the results of the request and make the reasonable adjustments that a
human would make if the results were unexpected.

People use robots.txt for various reasons. It's not just a protector
for scripts; it also keeps robots from indexing information that isn't
suitable for indexing (will the POST supporters suggest the information
be displayed as a GIF to protect it?). These are just two reasons; there
are probably a dozen more examples that none of us on this list have
even thought about.

That's "robots.txt" dealt with. Just as important are the guidelines
for reasonable robot behaviour. If robot owners choose to ignore the
guidelines (which are just common sense after all) then they can cause
as must damage, e.g a robot that hits a site too fast can cause a denial
of service.. a robot that doesn't care that it's downloading foo.tar.gz
might cost someone a few $$s in ISP fees. (BTW, I've seen a
big search engine offering .PS files as search results.. they've weeded
out ".ps" but aren't set up to check case-insensitively). A robot
that downloads the same file every 30s all day (stupid?, it happens)
will skew someone's site statistics.
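
None of this is hard to avoid. Here's a rough sketch (again in today's
Python; the delay and the extension list are my own illustrative
choices, not any official standard) of the kind of checks I mean:

    import time

    MIN_DELAY = 60.0   # seconds between requests to any one host
    SKIP_EXTENSIONS = (".tar.gz", ".tgz", ".zip", ".ps", ".gif")

    last_hit = {}   # host -> time of the last request we made to it

    def polite_to_fetch(host, path):
        """True only if we're being slow enough and the target
        isn't a bulky binary (checked case-insensitively)."""
        if path.lower().endswith(SKIP_EXTENSIONS):
            return False    # don't cost anyone ISP $$s
        if time.time() - last_hit.get(host, 0.0) < MIN_DELAY:
            return False    # too fast looks like a denial of service
        last_hit[host] = time.time()
        return True

Note the .lower() before the extension check; that's exactly what would
have caught the ".PS" files that the search engine above let through.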

The people who set up the guidelines and devised robots.txt recognised
the potential problems long before many on this list had heard of
the www, yet some people think they understand the problems of a site
admin better than someone who's done the job 18 hours a day for years. You
really have to witness some of these dumb robots in action to appreciate
the problem. Ever had someone's modem or fax phone you every few minutes?
If you have, then you'll appreciate the frustration of the inconvenience
and the inability to do anything about it... you end up taking the
phone off the hook.. Being forced to take down your server to adjust
the configuration is a real drag, and for some of us it eats into
profits and valuable work time.

Few people who have been hit by more than one dumb robot will
respond in a friendly way to the robot owners. Robot owners should
cut us a little slack. Being a victim on a regular basis builds up
a lot of anger. Screaming at offenders is often the only source of
relief.

rob

-- 
Rob Hartill (robh@imdb.com)
The Internet Movie Database (IMDb)  http://www.imdb.com/
           ...more movie info than you can poke a stick at.