Re: the POST myth... a web admin's opinions..

Rob Hartill (robh@imdb.com)
Fri, 12 Jul 1996 14:11:53 -0500 (CDT)


>On Thu, 11 Jul 1996, Rob Hartill wrote:
>
>>
>> A few people have suggested that sites should use POST to protect
>> themselves from unwanted attention from robots. Could those people take
>> a few minutes to surf around "http://us.imdb.com/" and then come
>> back here and admit that POST is NOT A SOLUTION.
>
>Nope. Because you are wrong. I looked. POST could be used on imdb.com.

No, it cannot. I've been running the site for three years, and you think
you know it better than I do after a few minutes?
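
For anyone tempted by the myth: any client that can send a GET can send
a POST just as easily, so hiding links behind a form buys nothing against
a robot that wants in. A rough sketch - the endpoint and parameter name
below are made up, not our real interface:

    import urllib.parse
    import urllib.request

    # A robot submits a POST form as easily as it follows a GET link.
    # "/search" and "title" are hypothetical names for illustration.
    data = urllib.parse.urlencode({"title": "Brazil"}).encode("ascii")
    req = urllib.request.Request("http://us.imdb.com/search", data=data)  # data= means POST
    with urllib.request.urlopen(req) as resp:
        page = resp.read()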

>A link:imdb.com search on Alta-Vista returned 7,000 matches (600 or so of
>these were internal to imdb.com - link:www.moviedatabase.com gave about
>173 matches total).

a) Alta-Vista's link counting system is a random number generator. Sit
on the reload button for a while and you get results that show,
say, 8,000 one minute and 40,000 the next. Alta-Vista is not a
reliable source of statistics.

b) AFAIK, Alta-Vista counts documents containing links, not the links
themselves. The IMDb is the kind of resource that picks up many
links from within a single document.

c) You didn't count all the URLs we use now, or have used in the
past, which automatically bounce to new URLs.

Anyway, this is pointless. If you don't believe my stats, I don't care,
but don't try to produce bogus ones of your own to disprove mine.

>Before you say something stupid like 'but that would take lots of
>storage', the answer to that is 'so?'. Storage is dirt cheap. I
>am looking at an ad right now that offers a fast 2.9 Gig SCSI-II drive
>for $339. Your database reports aren't *that* big and they
>don't vary that much from run to run. And your system performance
>would improve to boot. Running CGI unnecessarily is evil.

So your brief visit to my site makes you an expert on the internal
workings of the database. Your assumptions are wrong. This takes us
back to a point I have to keep making: people should not assume that
their generic ideas or assumptions apply to all sites.
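
For readers keeping score, the suggestion amounts to rendering every
report to a static file at update time, along the lines of the sketch
below (all names invented). It assumes the reports are a fixed set of
pages, which is precisely the assumption that's wrong here.

    import pathlib

    # Hypothetical sketch of the "pre-generate static pages" idea:
    # write each report to an HTML file once per update run, so the
    # web server serves plain files instead of running CGI.

    def render_report(title):
        # Stand-in for a real report generator.
        return "<html><body><h1>%s</h1></body></html>" % title

    out = pathlib.Path("static/titles")
    out.mkdir(parents=True, exist_ok=True)

    for title in ("Brazil (1985)", "Twelve Monkeys (1995)"):
        name = title.lower().replace(" ", "-").replace("(", "").replace(")", "")
        (out / (name + ".html")).write_text(render_report(title))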

>As for the legacy problem - it would be quite easy to write a special
>purpose script to handle the 2000 or so direct links in existence. Since
>they are not going to change - a simple hashed lookup table kicking to the
>final *static* URL would work. Load on your system - minuscule compared
>with actually doing the searches. Programming effort - minimal.

Yawn. I know my system inside out, so please don't presume to know what's
easy or practical based on so little understanding of how it works.
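
For reference, the "simple hashed lookup" being proposed is roughly the
CGI sketch below - every path and table entry in it is invented. It is
simple in isolation; its simplicity was never the issue.

    #!/usr/bin/env python3
    # Hypothetical sketch of the proposed legacy-URL redirect: one
    # dictionary lookup from each old direct link to a static page.
    # REQUEST_URI is the Apache-style CGI variable for the request path.
    import os

    LEGACY = {
        # invented example entries, not real IMDb URLs
        "/M/title-exact?Brazil+(1985)": "/static/titles/brazil-1985.html",
        "/M/person-exact?Gilliam": "/static/names/terry-gilliam.html",
    }

    target = LEGACY.get(os.environ.get("REQUEST_URI", ""))
    if target is not None:
        print("Status: 301 Moved Permanently")
        print("Location: " + target)
        print()
    else:
        print("Status: 404 Not Found")
        print("Content-Type: text/plain")
        print()
        print("unknown legacy URL")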

>Because of a little problem known as 'resource discovery'. It is *USELESS*
>to have the biggest baddest database in the world - if no one can find it.

We are indexed by friendly robots, and we are listed in countless review
sites. We get trashed by the occasional dumb robot. We don't try to hide.

>I operate several hugely successful websites - including one that robots
>would be more than unwise to attempt to index (not millions of potential
>URLs - an *infinite* number. The *ENTIRE SITE* is a CGI script. It gets
>about 80,000-100,000 hits a day and runs on a Linux box that it shares with
>another medium-high volume site). I don't worry a lot about robots beyond
>putting out a robots.txt advisory and taking the basic precautions. If
>they ignore it and are stupid enough to try and index my sites
>depth-first, they get meg after meg of useless data.

If it doesn't bother you, then fine, but when my users are denied access
because of a dumb robot, I do mind. Just because you don't mind doesn't
mean nobody else should.
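
For anyone unfamiliar with it, the robots.txt advisory he mentions is
just a plain text file served from the site root. A minimal one that
asks all robots to stay out of the script area looks like this (the
path is only an example):

    # /robots.txt - advisory only; well-behaved robots honour it
    User-agent: *
    Disallow: /cgi-bin/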

>And the problem with imdb.com and hyperlinks is not the search engines -
>but your interface to your database.

This'll be good. Please elaborate. The interface is a showcase for
hyperlinking.

>I could make it bookmarkable without
>much effort at all

Sure you could; apparently you can do site analysis without even seeing
what goes on behind the scenes. Your services must be in real demand.

>> For every 'reasonable' reason to ignore robots.txt that people on this
>> list can come up with, there's probably several counterexamples that
>> would illustrate a potential problem.
>
>Good link validating robots like MOMSpider that check only one link at a
>time with substantial pauses between successive hits on the same server.

We're not talking about good robots. By definition they don't cause
trouble. Their services are welcomed.
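
For the record, that polite behaviour is nothing exotic: fetch one link
at a time and pause between successive hits on the same host. A sketch,
with an illustrative delay value:

    import time
    import urllib.request

    DELAY = 60  # seconds between hits on the same server; illustrative only

    def validate(urls):
        # Check links one at a time, pausing between requests so the
        # target server never sees a burst of traffic.
        for url in urls:
            try:
                with urllib.request.urlopen(url) as resp:
                    print(url, resp.status)
            except Exception as err:
                print(url, "FAILED:", err)
            time.sleep(DELAY)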

rob

-- 
Rob Hartill (robh@imdb.com)
The Internet Movie Database (IMDb)  http://www.imdb.com/
           ...more movie info than you can poke a stick at.