RE: alta vista and virtualvin.com

Bakin, David (dbakin@sanbruno.powersoft.com)
Mon, 03 Jun 96 20:11:00 PDT


I think this has been suggested before, but I haven't seen it in this
thread: the easy place to start would be to contact the principal server
vendors and explain to them why it would be great if they a) included a
sample robots.txt in each and every sample site installed from their
distribution and b) included the whys and wherefores of robots.txt in
their manuals and help. In fact, some of the major distributions may
already be doing this -- does anyone know? -- Dave
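
For reference, a minimal robots.txt of the kind a vendor could ship
with a sample site might look like this (the paths are only
placeholders; each site would list its own script and output
directories):

    # Keep robots out of CGI scripts and their output.
    # Adjust these paths to match the installed sample site.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/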

----------
From: owner-robots[SMTP:owner-robots@webcrawler.com]
Sent: Sunday, June 02, 1996 12:11 AM
To: robots
Subject: RE: alta vista and virtualvin.com

I am someone who runs one of these problem sites. When I realized
the mistake I had made in not setting up a proper robots.txt file (after
I had caused problems, unfortunately), I was happy to put all my
offending scripts in a separate directory and exclude them. I don't see
why anyone aware of the problem would ever want any output from these
types of scripts indexed. Wouldn't there always be a top-level HTML
page that would be enough of a reference? Perhaps you could add an
education step when putting such sites on your s---list, and send the
site an automated note pointing out the problem?

Also, the importance of dealing with this could perhaps be made more
prominent in the caching-related sections of generic CGI FAQs, with the
potential problems to the site spelled out (I experienced some data
corruption myself).

Respectfully,

-Ann Cantelow

-------------------The Interactive Poetry Pages----------------------
Collaborative poetry in real time- across the net.
http://www.csd.net/~cantelow/poem_welcome.html
---------------------------------------------------------------------

---------------------------------------
On Sat, 1 Jun 1996, Louis Monier wrote:

> This is an old thread, but I was out of town, then busy.
>
> If one thing about this whole robot field worries me, it is the
> existence of sites like this one. If you think about it, this scheme is
> bad for everyone:
> 1. the robot, which can get trapped and visit the same pages (or worse,
> slightly different versions of the same pages) over and over.
> 2. the site, whose access stats and visitor database are all screwed up.
> 3. the users of the index, who inherit a large number of bogus URLs and
> further contribute to (2) by inheriting one of the robot's IDs.
>
> Need I say more? I think this scheme is detestable. Cookies may be the
> way to go, and if one does not want to rely on them, at least use a
> decent syntax so that robots can guess the trick, say by making it
> obvious that a script is being invoked with arguments. Having one common
> encoding (a 10-digit number as the first path element) would be good, but
> it's too late. Another idea would be for these sites to recognize
> robots somehow, and only generate "clean" URLs, so robots would take
> only one trip through the site. But again, that's a lot of people to
> convince.
>
> So in the meantime, we use a semi-automatic solution: such sites are
> suspected, manually confirmed, and added to a s---list so that only
> their top-level page is indexed. I suspect that people trying to run
> fast robots right now, and who have not yet found out about this
> phenomenon, are simply accumulating junk from these sites. Ah ah!
>
> Seriously, this is a big problem. My friends at w3 tell me not to worry
> because cookies will eventually eradicate such schemes, but in the
> meantime this is a real problem. Any thoughts?
>
>
> --Louis
>
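
As a rough sketch of the URL-guessing idea in the quoted message (not a
description of any existing robot's behavior), a crawler could treat a
first path element that is just a long run of digits as a likely session
ID and collapse such URLs to one canonical form before indexing. The
digit-run rule below is only an assumed convention, following the
"10-digit number as first path element" example above; sketched here in
Python:

    # Sketch: collapse URLs whose first path element looks like a session ID.
    import re
    from urllib.parse import urlsplit

    SESSION_ID = re.compile(r"^\d{10,}$")   # e.g. /1234567890/catalog.html

    def strip_session_id(url):
        """Drop a session-ID-like first path element so duplicate pages
        map back to one canonical URL before indexing."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        if segments and SESSION_ID.match(segments[0]):
            new_path = "/" + "/".join(segments[1:])
            return parts._replace(path=new_path or "/").geturl()
        return url

    # Both of these collapse to http://example.com/catalog.html
    print(strip_session_id("http://example.com/0123456789/catalog.html"))
    print(strip_session_id("http://example.com/catalog.html"))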