>
> A few people have suggested that sites should use POST to protect
> sites from unwanted attention from robots. Could those people take
> a few minutes to surf around "http://us.imdb.com/" and then come
> back here and admit that POST is NOT A SOLUTION.
Nope. Because you are wrong. I looked. POST could be used on imdb.com.
> At this site you'll find a giant database with millions of distinct
> URLs which cannot be hidden behind POST. Practically every URL on that
> site uses a script, whether the URL contains a "?" or not.
>
> Now you might think that a solution is to make the entry point a POST
> "protected" page. If you think that's reasonable, what do we do about
> the tens of thousands (estimated at >60,000) links to that site
> that appear on other pages around the world that'd allow a robot to
> enter via a 'back door'?
A link:imdb.com search on Alta-Vista returned 7,000 matches (600 or so of
these were internal to imdb.com - link:www.moviedatabase.com gave about
173 matches total). Your German mirror has about 2,000 matches and the
Australian mirror 800. Searching for the disclaimer you ask to be placed
on people's pages linking directly gave only 35 matches - a quick search for
link:imdb.com/M (which appears to be the base URL for all your search
engine script requests) gave about 2,000 potential 'backdoor entries' -
substantial, but a long way from ">60,000". Are you doing any referer_log
analysis? It could give you pretty exact answers rather than "estimated
at".
Most of your database could be converted to a static tree for bookmarking
purposes. Rather than have your scripts return the contents directly, have
them issue a 'Location:' redirect to a static HTML build of your movie, actor, etc.
summaries. If you want to see an example of 'dynamic database/static HTML'
design - look at http://www.netimages.com/classifieds/. The data changes
many times a day (updated by online forms) - but the pages are all
completely static HTML that robots can browse to their hearts' delight
without adding noticeably to our load. And those pages can be safely
bookmarked (which relates to your paragraph below about hyperlinking). Adding a
search engine wouldn't cause any trouble at all.
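By 'static HTML build' I mean nothing fancier than a pass like this over the
database whenever it changes (a sketch only - the record fields, paths and
file names below are invented; I have no idea how your reports are actually
structured):

    # Sketch of a static-build pass: write one plain HTML file per record
    # so browsers, bookmarks and robots all hit static pages instead of a
    # CGI script. Record layout, paths and file names are invented.
    import os

    def build_static_tree(records, root="title"):
        os.makedirs(root, exist_ok=True)
        for rec in records:
            path = os.path.join(root, rec["id"] + ".html")
            with open(path, "w") as page:
                page.write("<html><head><title>%s</title></head><body>\n" % rec["title"])
                page.write("<h1>%s (%s)</h1>\n" % (rec["title"], rec["year"]))
                page.write("<p>%s</p>\n" % rec["summary"])
                page.write("</body></html>\n")

    build_static_tree([
        {"id": "0001", "title": "Example Movie", "year": "1995",
         "summary": "Placeholder summary text."},
    ])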
Before you say something stupid like 'but that would take lots of
storage', the answer to that is 'so?'. Storage is dirt cheap. I
am looking at an ad right now that offers a fast 2.9 Gig SCSI-II drive
for $339. Your database reports aren't *that* big and they
don't vary that much from run to run. And your system performance
would improve to boot. Running CGI unnecessarily is evil.
As for the legacy problem - it would be quite easy to write a special-purpose
script to handle the 2,000 or so direct links in existence. Since
they are not going to change - a simple hashed lookup table kicking to the
final *static* URL would work. Load on your system - minuscule compared
with actually doing the searches. Programming effort - minimal.
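Something on the order of this would do it (a sketch only - the table entries
and static paths are made up, and the real table would be generated once from
your logs and then left alone):

    #!/usr/bin/env python3
    # Sketch of the legacy redirector: a tiny CGI that maps the old
    # direct-link URLs onto their final static pages and kicks the client
    # there with a Location: header. Table entries and paths are invented.
    import os
    import sys

    LEGACY = {
        "/title-exact?Casablanca":        "/title/0001.html",
        "/person-exact?Bogart,+Humphrey": "/name/0002.html",
    }

    def main():
        # Standard CGI variables: the path after the script plus the query.
        key = os.environ.get("PATH_INFO", "")
        query = os.environ.get("QUERY_STRING", "")
        if query:
            key += "?" + query
        target = LEGACY.get(key)
        if target:
            sys.stdout.write("Status: 301 Moved Permanently\r\n")
            sys.stdout.write("Location: %s\r\n\r\n" % target)
        else:
            sys.stdout.write("Status: 404 Not Found\r\n")
            sys.stdout.write("Content-Type: text/plain\r\n\r\n")
            sys.stdout.write("No such page.\n")

    if __name__ == "__main__":
        main()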
> Should the fundamental use of the web (hyperlinking)
> be sacrificed so that third parties can run their robots without fear
> of hurting anyone? Why do the site providers have to bend over backwards
> to accommodate robots? And moreover, why should they bend over backwards
> to accommodate the robots run by irresponsible people when in all likelihood
> the robot owners' use of the collected data will bring little or no
> benefits to the site being 'indexed'.
Because of a little problem known as 'resource discovery'. It is *USELESS*
to have the biggest, baddest database in the world - if no one can find it.
I operate several hugely successful websites - including one that robots
would be more than unwise to attempt to index (not millions of potential
URLs - an *infinite* number. The *ENTIRE SITE* is a CGI script. It gets
about 80,000-100,000 hits a day and runs on a Linux box that it shares with
another medium-high volume site). I don't worry a lot about robots beyond
putting out a robots.txt advisory and taking the basic precautions. If
they ignore it and are stupid enough to try and index my sites
depth-first, they get meg after meg of useless data.
Not because I am malicious - because they are indexing things that are not
suitable for indexing. It has happened on my sites before - I shrug and
get on with life. Robot owners don't WANT meg after meg of useless data -
the problem is self-correcting. Either the owners will learn about obeying
the robots.txt file - or they will exclude my site from their future runs.
Either is fine and I don't worry about it. If they were clueless beyond
redemption I would block their IP address. Hasn't happened yet.
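The robots.txt 'advisory' mentioned above is nothing more exotic than a couple
of lines like these (the disallowed path is only an example of walling off a
script tree, not my actual file):

    User-agent: *
    Disallow: /cgi-bin/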
And the problem with imdb.com and hyperlinks is not the search engines -
but your interface to your database. I could make it bookmarkable without
much effort at all - while *still* using the POST method to search it.
Ironically - you already *almost* do this with your cache.
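In the same vein as the legacy redirector above, a sketch of a POST search
that stays bookmarkable (the field names, paths and lookup are invented; it
only shows the redirect trick, not your actual interface):

    #!/usr/bin/env python3
    # Sketch: the search form still POSTs, but instead of printing the
    # record the script redirects to the record's stable, bookmarkable
    # static URL. Field names, paths and the lookup are invented.
    import sys
    import urllib.parse

    def find_record_id(title):
        # Stand-in for the real database search.
        return "0001" if title else None

    def main():
        fields = urllib.parse.parse_qs(sys.stdin.read())
        title = (fields.get("title") or [""])[0]
        record_id = find_record_id(title)
        if record_id:
            sys.stdout.write("Status: 302 Found\r\n")
            sys.stdout.write("Location: /title/%s.html\r\n\r\n" % record_id)
        else:
            sys.stdout.write("Content-Type: text/plain\r\n\r\n")
            sys.stdout.write("No match.\n")

    if __name__ == "__main__":
        main()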
> The people running the popular search engines seem to know what they
> are doing (I've had requests for exceptions to be made to allow robots
> a quick controlled scan of protected areas, and I've been happy to cooperate).
> These controlled and well run robots are useful and most sites with a
> robots.txt do go to the effort of allowing access to *relevant* areas of the
> site.
>
> It's in everyone's interest to be linked correctly from other sites, and
> that's the beauty and purpose of the web. Unfortunately, a small minority
> of robot owners either choose to ignore robots.txt (often using unreasonable
> excuses) or they are unaware of robots.txt. The latter category of robot
> owner is still irresponsible because they've failed to find the guidelines
> despite them being trivially easy to find if the person were to do some
> simple research before starting.
>
> I will apologise for any strong language that I've used so far on this
> list out of pure frustration with a constant barrage of robotic attacks
> that I've witnessed over the last 3 years, if the robot owners out there
> are willing to accept that the content of this mail is reasonable and that
> the ONLY current cure for the problems of rogue robots is to have all
> robots respect robots.txt UNLESS THEY HAVE BEEN GIVEN PERMISSION not to.
>
> For every 'reasonable' reason to ignore robots.txt that people on this
> list can come up with, there are probably several counterexamples that
> would illustrate a potential problem.
Good link-validating robots like MOMSpider check only one link at a
time, with substantial pauses between successive hits on the same server.
They don't threaten a server any more than a browser does. They don't
climb your tree - but they are very likely to need to hit pages your
robots.txt forbids access to. And it is safe for them to do so since they
do not attempt to traverse your tree. The *worst* that can happen is
someone has a *lot* of permanent links to you and the validation robot
hits you fast enough to increase your server load unreasonably. If they
did what many browsers do and claimed they were 'Mozilla' - you would
never know they had been there.
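To make the distinction concrete - roughly all a well-behaved link checker
has to do is something like this (a sketch; the URLs, User-Agent string and
pause length are placeholders, not MOMSpider's actual behaviour):

    #!/usr/bin/env python3
    # Sketch of a polite link checker: one request per link, an honest
    # User-Agent, and a pause before hitting the same host again.
    # The URLs, agent string and delay are placeholders.
    import time
    import urllib.parse
    import urllib.request

    LINKS = ["http://us.imdb.com/", "http://www.netimages.com/classifieds/"]
    PAUSE = 60.0              # seconds between hits on the same server
    last_hit = {}

    for url in LINKS:
        host = urllib.parse.urlparse(url).netloc
        wait = PAUSE - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        request = urllib.request.Request(
            url, method="HEAD",
            headers={"User-Agent": "ExampleLinkChecker/0.1"})
        try:
            result = urllib.request.urlopen(request, timeout=30).status
        except Exception as err:
            result = err
        last_hit[host] = time.time()
        print(url, result)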
-- Benjamin Franz