RE: alta vista and virtualvin.com

Louis Monier (monier@pa.dec.com)
Sun, 2 Jun 1996 10:50:55 -0700


Ann, I wish all people running such sites were as reasonable as you
are. I am indeed thinking of trying to educate them about the problem.
So many things to do...

--Louis

>----------
>From: Ann Cantelow[SMTP:cantelow@athena.csdco.com]
>Sent: Saturday, June 01, 1996 11:11 PM
>To: robots@webcrawler.com
>Subject: RE: alta vista and virtualvin.com
>
>
>I am someone who runs one of these problem sites. When I realized the
>mistake I had made in not setting up a proper robots.txt file
>(unfortunately, after I had already caused problems), I was happy to put
>all my offending scripts in a separate directory and exclude them. I
>don't see why anyone aware of the problem would ever want any output
>from these types of scripts indexed. Wouldn't there always be a
>top-level HTML page that would be enough of a reference? Perhaps you
>could add an education step when putting such sites on your s---list,
>and send the site an automated note pointing out the problem?
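>
>(For anyone wondering what the fix looks like concretely: once the
>offending scripts live under one directory, the robots.txt is only a
>couple of lines. The directory name here is just an example, not my
>actual layout:
>
>  User-agent: *
>  Disallow: /cgi-bin/poem/
>
>Any robot that honors the exclusion standard will then skip everything
>under that path.)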
>
>Also, perhaps the importance of dealing with this could be made more
>prominent in the caching-related sections of generic CGI FAQs, with the
>potential problems to the site explained (I experienced some data
>corruption myself).
>
> Respectfully,
>
> -Ann Cantelow
>
>
>-------------------The Interactive Poetry Pages----------------------
>Collaborative poetry in real time- across the net.
> http://www.csd.net/~cantelow/poem_welcome.html
>---------------------------------------------------------------------
>
>---------------------------------------
>On Sat, 1 Jun 1996, Louis Monier wrote:
>
>> This is an old thread, but I was out of town, then busy.
>>
>> If one thing about this whole robot field worries me, it is the
>> existence of sites like this one. If you think about it, this scheme is
>> bad for everyone:
>> 1. the robot, which can get trapped and visit the same pages (or worse,
>> slightly different versions of the same pages) over and over.
>> 2. the site, whose access stats and visitor database are all screwed up.
>> 3. the users of the index, who inherit a large number of bogus URLs, and
>> further contribute to (2) by inheriting one of the robot's IDs.
>>
>> Need I say more? I think this scheme is detestable. Cookies may be the
>> way to go, and if one does not want to rely on them, at least use a
>> decent syntax so that robots can guess the trick, say by making it
>> obvious that a script is being invoked with arguments. Having one common
>> encoding (a 10-digit number as first path element) would be good, but
>> it's too late. Another idea would be for these sites to recognize
>> robots somehow, and only generate "clean" URLs, so robots would take
>> only one trip through the site. But again, that's a lot of people to
>> convince.
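>>
>> To make the "guess the trick" idea concrete: a robot could flag URLs
>> whose first path element looks like a session ID, or which are
>> obviously script invocations. A rough sketch in Python (a strawman
>> heuristic of my own, not something any real robot runs):
>>
>>   import re
>>   from urllib.parse import urlsplit
>>
>>   # A first path element that is a long run of digits (say, the
>>   # 10-digit visitor ID mentioned above) is a strong hint.
>>   SESSION_ID = re.compile(r"^[0-9]{8,}$")
>>
>>   def looks_session_encoded(url):
>>       path = urlsplit(url).path
>>       parts = [p for p in path.split("/") if p]
>>       if parts and SESSION_ID.match(parts[0]):
>>           return True
>>       # explicit script-with-arguments syntax is easy to spot
>>       return "?" in url or "/cgi-bin/" in path
>>
>> A robot that sees this come back true could strip the suspect element,
>> or just fetch the top-level page and move on.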
>>
>> So in the meantime, we use a semi-automatic solution: such sites are
>> suspected, manually confirmed, and added to a s---list so that only
>> their top-level page is indexed. I suspect that people who are trying to
>> run fast robots right now and have not yet found out about this
>> phenomenon are simply accumulating junk from these sites. Ha ha!
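>>
>> (The check itself is trivial once a site is confirmed; roughly, in the
>> same Python-sketch spirit, with SUSPECT_HOSTS standing in for the
>> manually confirmed list:
>>
>>   from urllib.parse import urlsplit
>>
>>   SUSPECT_HOSTS = {"virtualvin.com"}  # example entry only
>>
>>   def allowed_to_index(url):
>>       parts = urlsplit(url)
>>       if parts.hostname in SUSPECT_HOSTS:
>>           # suspected sites: index the top-level page only
>>           return parts.path in ("", "/")
>>       return True
>>
>> The hard part is not the check, it is confirming the sites by hand.)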
>>
>> Seriously, this is a big problem. My friends at w3 tell me not to worry
>> because cookies will eventually eradicate such schemes, but in the
>> meantime it is very much with us. Any thoughts?
>>
>>
>> --Louis
>>
>
>