RE: alta vista and virtualvin.com

Louis Monier (monier@pa.dec.com)
Sat, 1 Jun 1996 20:21:17 -0700


I must add that Benjamin's site makes a lot of sense, so he should not
take my previous half-flame personally. His design is safe from several
points of view, the main one being a robots.txt file...

--Louis

>----------
>From: Benjamin Franz[SMTP:snowhare@netimages.com]
>Sent: Monday, May 06, 1996 6:53 AM
>To: 'robots@webcrawler.com'
>Subject: Re: alta vista and virtualvin.com
>
>On Mon, 6 May 1996, chris cobb wrote:
>
>> This is a question I hope Louis will be able to answer, but the topic
>> should be of interest to others.
>> ---
>>
>> Virtual Vineyards (VV) (www.virtualvin.com) is an example of a
>> site that customizes all internal links for each visitor by assigning
>> a 9-digit number to each new home page visitor.
>>
>> When a person retrieves the home page for this site, each link
>> on the home page (which points to another part of virtual vineyards)
>> contains this newly created number. For example, the "what's new"
>> link from the home page for me might be:
>
>[...]
>
>> so. Even more distressing, this ID that the crawler recorded
>> would progressively become days, weeks, and possibly months
>> old but still remain in frequent use by different people.
>>
>> To examine this in practice, I searched for a low-level (non-home)
>> page from VV on Alta Vista. I wanted to see if Alta Vista did indeed
>> record a user ID when indexing these pages.
>>
>> Alta Vista did not return a match for the low-level page query, but
>> it did have the VV home page indexed.
>>
>> I initially thought that a 'robots.txt' file was having an effect -
>> perhaps Alta Vista was not visiting any lower pages because the
>> creator of the VV site had realized the potential problems of this
>> approach and attempted to guard against them by limiting crawler
>> access. I did not, however, find a robots.txt file.
>>
>> My questions:
>> - Why doesn't Alta Vista index the lower levels of this site? Even
>> though there is no robots.txt file, it seems that Alta Vista is aware
>> of what is going on and manages to avoid the problem. How is this done?
>> - What comments do others have about indexing sites of this nature?
>
>I haven't checked that site in particular, but as the author of a
>similar site (http://www.psiloveyou.com/) I can provide some insight
>from the POV of a site designer. When I designed the site, I was *very*
>concerned not about robots, but about proxying caches such as AOL's. As
>a result, the entire site is a CGI script with several anti-caching
>features built in. The first of those features is a 'unique URL
>segment' like the one you noted at VV. In my case, I put it in as a CGI
>parameter rather than as a portion of the URL path. AFAIK, the current
>generation of search engines do not save the '?....' sections of URLs
>(at least I cannot recall having seen a search engine return a URL with
>a '?....' segment) - so this helps prevent the 'state re-use' problem
>with search engines. It is not perfect, as people can (and do) still
>save the URL manually and make links to it, but it helps considerably
>by keeping the automated tools from doing so.
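>
>A rough sketch of the idea, in Python purely for illustration (the
>function names and the 'sid' parameter are invented, not the actual
>code):
>
>    import secrets
>    from urllib.parse import urlencode
>
>    def new_session_id() -> str:
>        """Mint a unique, hard-to-guess ID for a first-time visitor."""
>        return secrets.token_hex(8)
>
>    def tag_link(path: str, sid: str) -> str:
>        """Carry the visitor's ID as a query parameter; engines that
>        drop '?....' segments will not record it."""
>        return f"{path}?{urlencode({'sid': sid})}"
>
>    sid = new_session_id()
>    print(tag_link("/whats_new", sid))  # e.g. /whats_new?sid=3f9c2a1b...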
>
>Next, I 'pre-expire' most of the pages with 'Pragma: no-cache' and an
>'Expires:' header set to a date far in the past. This is designed to
>cause both browsers and caching proxies (such as AOL's) to NOT reuse a
>page, but always to ask the server for the page again if they want it.
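>
>In CGI terms the header block looks something like this (a sketch, not
>the exact headers my site sends; 'Cache-Control' is the HTTP/1.1
>counterpart):
>
>    import sys
>
>    def emit_uncacheable_page(body: str) -> None:
>        headers = [
>            "Content-Type: text/html",
>            "Expires: Thu, 01 Jan 1970 00:00:00 GMT",  # already expired
>            "Pragma: no-cache",                        # HTTP/1.0 caches
>            "Cache-Control: no-cache",                 # HTTP/1.1 caches
>        ]
>        sys.stdout.write("\r\n".join(headers) + "\r\n\r\n" + body)
>
>    emit_uncacheable_page("<html><body>Wine list</body></html>")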
>
>Third, inside the shopping basket area I use POST operations as much as
>possible to discourage browsers and caches from attempting to save the
>results of page requests. Caching POST results is an explicit no-no in
>the HTTP standard - but some browsers do so anyway. Sigh.
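>
>Concretely, basket actions are submitted from forms rather than plain
>links - roughly like this (the script path and field names are invented
>for illustration):
>
>    def basket_form(item_id: str) -> str:
>        """Emit an add-to-basket form that submits via POST, whose
>        response the HTTP spec says must not be cached."""
>        return (
>            '<form method="POST" action="/cgi-bin/basket.cgi">\n'
>            f'  <input type="hidden" name="item" value="{item_id}">\n'
>            '  <input type="submit" value="Add to basket">\n'
>            '</form>'
>        )
>
>    print(basket_form("pinot-noir-94"))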
>
>Fourth, each time a page where the current state is absolutely critical
>is accessed, it is given a one-time-only 'de-caching' path extension as
>the very last path element. This does not interfere with the server's
>identification of pages; it is simply passed to the scripts as extra
>path information. It is intended explicitly to break caching mechanisms
>by rendering every URL served unique. This catches some browsers and
>proxies that refuse to honor the cache-control headers (NCSA Mosaic in
>particular is a real offender, using its local cache to *ALWAYS* serve
>pages, even POST results).
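>
>The trick amounts to something like this (a sketch; the real extension
>format differs):
>
>    import secrets
>    import time
>
>    def decache(url: str) -> str:
>        """Append a throwaway, never-repeated path element. The script
>        named earlier in the path still handles the request; the extra
>        element arrives as PATH_INFO and is ignored, but it makes every
>        URL served unique so no cache can match it."""
>        nonce = f"{int(time.time())}-{secrets.token_hex(4)}"
>        return f"{url.rstrip('/')}/{nonce}"
>
>    print(decache("/cgi-bin/basket.cgi/checkout"))
>    # e.g. /cgi-bin/basket.cgi/checkout/833674877-9a41c2de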
>
>Lastly, I *DO* have a robots.txt file to keep the lower levels of the
>site from being indexed. Because of the unique path sections, the site
>is potentially a 'black hole' (an infinite URL space). I have had one
>robot fail to honor the robots.txt file and try to do a depth-first
>search. As expected, it got caught in a loop and made several thousand
>requests trying to read the 'whole site'.
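>
>For reference, a robots.txt along these lines is all it takes to keep a
>compliant robot out of the infinite URL space (the path shown is only
>an example; the real file names whatever prefix the CGI lives under):
>
>    User-agent: *
>    Disallow: /cgi-bin/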
>
>So, basically, design for a site that needs per-user state and has to
>co-exist with a WWW full of search engines and broken browsers is an
>exercise in 'defense in depth'.
>
>I chose NOT to use cookies when designing this site (last June) because
>Netscape was the only browser supporting them at the time. If I were
>designing the site today, I would use 'dual' code that detects whether
>a browser supports cookies and, if it does, uses them to push state
>onto the browser - more transparently and with less server intervention
>(less active CGI). Something along these lines: check for a cookie or a
>PATH id; if neither exists, request a cookie and assign a PATH id. If
>there *is* a cookie, ignore the PATH id and manage state using the
>cookie. If there is no cookie but there *is* a current PATH id, manage
>state on the server. If the PATH id is old (more than a day, say) and
>there is no cookie, throw it out and start over with a new id and a new
>cookie request.
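>
>As a rough sketch of that decision logic (the function and variable
>names are invented; the one-day cutoff is the one mentioned above):
>
>    import secrets
>    import time
>
>    MAX_PATH_ID_AGE = 24 * 60 * 60  # one day, in seconds
>
>    def resolve_session(cookie_id, path_id, path_id_age):
>        """Return (session_id, request_cookie, server_side_state)."""
>        if cookie_id:
>            # A cookie wins: ignore any PATH id, keep state via the cookie.
>            return cookie_id, False, False
>        if path_id and path_id_age is not None \
>                and path_id_age < MAX_PATH_ID_AGE:
>            # No cookie but a current PATH id: manage state on the server.
>            return path_id, False, True
>        # Neither, or the PATH id is stale: mint a new id, ask for a
>        # cookie, and fall back to a fresh PATH id in the meantime.
>        return secrets.token_hex(8), True, True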
>
>Oh, on a general note, I know that many search engines will explicitly
>*not* follow links from pages that end in '.cgi', or that have a path
>segment containing '.cgi/', as their *own* self-defense against 'black
>holes' caused by sites that do not have (but SHOULD have) robots.txt
>files. This is not applicable to VV - but it does apply to many sites.
>Checking VV - they are depending on 'pre-expiring' pages by sending an
>expiration date of December 31, 1969. The search engines are likely
>discarding the pages as being 'pre-expired'.
>
>--
>Benjamin Franz
>
>