RE: alta vista and virtualvin.com

Louis Monier (monier@pa.dec.com)
Sat, 1 Jun 1996 20:21:13 -0700


This is an old thread, but I was out of town, then busy.

If one thing about this whole robot field worries me, it is the
existence of sites like this one. If you think about it, this scheme is
bad for everyone:
1. the robot, which can get trapped and visit the same pages (or worse,
slightly different versions of the same pages) over and over.
2. the site, whose access stats and visitor database are all screwed up.
3. the users of the index, who inherit a large number of bogus URLs, and
further contribute to (2) by inheriting one of the robot's IDs.

Need I say more? I think this scheme is detestable. Cookies may be the
way to go, and if one does not want to rely on them, at least use a
decent syntax so that robots can guess the trick, say by making it
obvious that a script is being invoked with arguments. Having one common
encoding (a 10-digit number as first path element) would be good, but
it's too late. Another idea would be for these sites to recognize
robots somehow, and only generate "clean" URLs, so robots would take
only one trip through the site. But again, that's a lot of people to
convince.
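
To make the last idea concrete, here is a minimal sketch of what such a
site could do when it builds its internal links. Everything in it is an
assumption made for illustration: the function names, the list of
User-Agent hints, and the existence of an ID-free "/vvdata/<page>" path
are hypothetical, and real robot detection would need to be more careful.

```python
# Hypothetical substrings that suggest a robot's User-Agent (assumption).
ROBOT_UA_HINTS = ("robot", "crawler", "spider")


def is_robot(user_agent: str) -> bool:
    """Guess whether a request comes from a robot, by User-Agent only."""
    ua = user_agent.lower()
    return any(hint in ua for hint in ROBOT_UA_HINTS)


def internal_link(page: str, visitor_id: str, user_agent: str) -> str:
    """Build a site-internal link.

    Robots get a "clean" URL without a visitor ID, so every crawl sees
    the same URL for the same page. Other visitors get the per-visitor
    ID embedded as a path element, as the site does today.
    """
    if is_robot(user_agent):
        return f"/vvdata/{page}"
    return f"/vvdata/{visitor_id}/{page}"
```

With this, a browser would be handed /vvdata/026684189/whatsnew.htm while
a robot would only ever see /vvdata/whatsnew.htm.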

So in the meantime, we use a semi-automatic solution: such sites are
suspected, manually confirmed, and added to a s---list so that only
their top-level page is indexed. I suspect that people trying to run
fast robots right now, and who have not yet found out about this
phenomenon, are simply accumulating junk from these sites. Ah ah!
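
The semi-automatic procedure can be pictured roughly as follows. This is
only a sketch under stated assumptions, not the code Alta Vista runs: the
9-to-10-digit pattern, the function names, and the two-variant threshold
are all made up for illustration, and the "confirmed" set stands in for
the manual review step.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: a per-visitor ID is a path element made of 9 or 10 digits.
ID_SEGMENT = re.compile(r"\d{9,10}")


def mask_ids(url):
    """Return (host, path with ID-like elements replaced by '*', IDs found)."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    ids = tuple(s for s in segments if ID_SEGMENT.fullmatch(s))
    masked = "/".join("*" if ID_SEGMENT.fullmatch(s) else s for s in segments)
    return parts.netloc.lower(), masked, ids


def suspect_hosts(urls, min_variants=2):
    """Flag hosts where the same masked path was seen with different IDs."""
    seen = defaultdict(set)
    for url in urls:
        host, masked, ids = mask_ids(url)
        if ids:
            seen[(host, masked)].add(ids)
    return {host for (host, _), variants in seen.items()
            if len(variants) >= min_variants}


def may_index(url, confirmed_hosts):
    """For manually confirmed hosts, allow only the top-level page."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in confirmed_hosts:
        return True
    return parts.path in ("", "/")
```

Hosts returned by suspect_hosts() are only candidates. A person still
has to look at them before they go into the set passed to may_index().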

Seriously, this is a big problem. My friends at w3 tell me not to worry
because cookies will eventually eradicate such schemes, but in the
meantime this is a real problem. Any thoughts?

--Louis

>----------
>From: chris cobb[SMTP:c-cobb@ix.netcom.com]
>Sent: Sunday, May 05, 1996 9:56 PM
>To: 'robots@webcrawler.com'
>Subject: alta vista and virtualvin.com
>
>This is a question I hope Louis will be able to answer, but the topic
>should be of interest to others.
> ---
>
>Virtual Vineyards (VV) (www.virtualvin.com) is an example of a
>site that customizes all internal links for each visitor by assigning
>a 9-digit number to each new home page viewer.
>
>When a person retrieves the home page for this site, each link
>on the home page (which points to another part of virtual vineyards)
>contains this newly created number. For example, the "what's new"
>link from the home page for me might be:
>
> www.virtualvin.com/vvdata/026684189/whatsnew.htm
>
>and you may have
>
> www.virtualvin.com/vvdata/552378463/whatsnew.htm
>
>Both of us see the same "what's new" page but the server is
>keeping track of us.
>
>On the "what's new" page, each of us might have
>a link back to the home page, but mine would continue to have
>my ID and you yours. Neither of us would get a new ID unless
>we reloaded the home page.
>
>VV appears to use a custom server which strips this number
>from each request and uses it to record the progress of each
>user through the site. The site does allow purchases of items
>with a "basket" metaphor - apparently using these IDs instead
>of a cookie to identify incoming requests.
>
>The problem arises when you consider that a webcrawler would
>encounter unusual problems when cataloging a site of this nature.
>Each referencing link (other than the home page)
>that the crawler recorded would contain an
>ID. As numerous, non-related people visited the site by way of
>the crawler's index, the host site would become confused and the
>entire tracking and shopping mechanism would break down. If two people
>searched for the same chardonnay at Alta Vista, visited
>the exact same page using the query results and
>clicked 'purchase', the site would see two purchase
>requests from what looked to be the same person. The IDs would
>remain the same as long as these users roamed the site -
>even if they went back to the home page of VV while doing
>so. Even more distressing, this ID that the crawler recorded
>would progressively become days, weeks, and possibly months
>old but still remain used on a frequent basis by different people.
>
>To examine this in practice, I searched for a low level (not home) page
>from VV on Alta Vista. I wanted to see if Alta Vista did indeed record a
>user ID when indexing these pages.
>
>Alta Vista did not return a match to the low level page query, but
>did contain an index of the VV home page.
>
>I initially thought that a 'robots.txt' file was having an effect -
>perhaps Alta Vista was not visiting any lower pages because the
>creator of the VV site realized the potential problems of their
>approach and attempted to guard against it by limiting crawler
>access. I did not, however, find a robots file.
>
>My questions:
>- Why doesn't Alta Vista index the lower levels of this site? Even
>though there is not a robots.txt file, it seems that Alta Vista is
>aware of what is going on and manages to avoid the problem. How is
>this done?
>- What comments do others have about indexing sites of this nature?
>
>Chris Cobb
>
>