Re: alta vista and virtualvin.com

Benjamin Franz (snowhare@netimages.com)
Mon, 6 May 1996 06:53:11 -0700 (PDT)


On Mon, 6 May 1996, chris cobb wrote:

> This is a question I hope Louis will be able to answer, but the topic should
> be of interest to others.
> ---
>
> Virtual Vineyards (VV) (www.virtualvin.com) is an example of a
> site that customizes all internal links for each visitor by assigning
> a 9 digit number to each new home page viewer.
>
> When a person retrieves the home page for this site, each link
> on the home page (which points to another part of virtual vineyards)
> contains this newly created number. For example, the "what's new"
> link from the home page for me might be:

[...]

> so. Even more distressing, this ID that the crawler recorded
> would progressively become days, weeks, and possibly months
> old but still remain used on a frequent basis by different people.
>
> To examine this in practice, I searched for a low level (not home) page from
> VV on Alta Vista. I wanted to see if Alta Vista did indeed record a
> user ID when indexing these pages.
>
> Alta Vista did not return a match to the low level page query, but
> did contain an index of the VV home page.
>
> I initially thought that a 'robots.txt' file was having an effect - perhaps
> Alta Vista was not visiting any lower pages because the
> creator of the VV site realized the potential problems of their
> approach and attempted to guard against it by limiting crawler
> access. I did not, however, find a robots file.
>
> My questions:
> - Why doesn't Alta Vista index the lower levels of this site? Even
> though there is no robots.txt file, it seems that Alta Vista is aware
> of what is going on and manages to avoid the problem. How is this done?
> - What comments do others have about indexing sites of this nature?

I haven't checked that site in particular, but as the author of a
similar site (http://www.psiloveyou.com/) I can provide you some
insight from the POV of a site designer. When I designed the site, I
was *very* concerned not about robots, but about proxy caches such as
AOL's. As a result, the entire site is a CGI script with several
anti-caching features built in. The first of those features is a
'unique URL segment' like the one you noted at VV. In my case, I put
it in as a CGI parameter rather than as a portion of the URL path.
AFAIK, the current generation of search engines does not save the
'?....' sections of URLs (at least I cannot recall ever seeing a
search engine return a URL with a '?....' segment), so this helps
prevent the 'state re-use' problem with search engines. It is not
perfect, since people can (and do) still save the URL manually and
make links, but it helps considerably by keeping the automated tools
from doing so.
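
Roughly, the idea looks like this (a sketch only -- the parameter name
and the script are invented for illustration, not the actual code from
either site):

    #!/usr/bin/env python
    # Sketch: carry a per-visitor id as a '?...' CGI parameter so that
    # search engines (which generally skip query strings) never index it.
    import os, random

    def get_or_make_session_id():
        # Reuse the id handed back in the query string, if any;
        # otherwise mint a fresh 9-digit one for this visitor.
        query = os.environ.get('QUERY_STRING', '')
        for piece in query.split('&'):
            if piece.startswith('session='):
                return piece[len('session='):]
        return '%09d' % random.randint(0, 999999999)

    sid = get_or_make_session_id()
    print('Content-type: text/html')
    print('')
    # Every internal link carries the id as a query parameter.
    print('<A HREF="/cgi-bin/catalog.cgi?session=%s">What\'s new</A>' % sid)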

Next, I 'pre-expire' most of the pages with a no-cache directive and
an 'Expires:' header set to a date long in the past. This is designed
to make both browsers and caching proxies (such as AOL's) always ask
the server for the page again rather than reuse a cached copy.
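
As a rough illustration (the header names and the date here are my
best guess at the usual HTTP/1.0 way of doing this, not a copy of my
actual script):

    # CGI fragment: pre-expire the page so browsers and proxies
    # re-request it instead of serving a cached copy.
    print('Content-type: text/html')
    print('Pragma: no-cache')                        # HTTP/1.0 "do not cache" hint
    print('Expires: Thu, 01 Jan 1970 00:00:00 GMT')  # a date safely in the past
    print('')                                        # blank line ends the headers
    print('<HTML>...page body...</HTML>')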

Third, inside the shopping basket area I use POST operations as much as
possible to discourage browsers and caches from attempting to save the
results of page requests. Caching POST results is an explicit no-no in
the HTTP standard - but some browsers do so anyway. Sigh.
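
For instance, the basket pages emit their forms roughly like this (a
sketch; the field names are invented):

    sid = '000123456'   # the visitor's id, obtained as sketched earlier
    # METHOD="POST" keeps the submitted state out of the URL, and the
    # HTTP standard says the response to a POST must not be cached.
    print('<FORM METHOD="POST" ACTION="/cgi-bin/basket.cgi">')
    print('<INPUT TYPE="hidden" NAME="session" VALUE="%s">' % sid)
    print('<INPUT TYPE="submit" VALUE="Add to basket">')
    print('</FORM>')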

Fourth, each time a page where the current state is absolutely
critical is accessed, it is given a one-time-only 'de-caching' path
extension as the very last path element. This does not interfere with
the server's identification of pages; it is simply passed to the
scripts as extra path information. It is intended explicitly to break
caching mechanisms by rendering every URL served unique. This catches
some browsers and proxies that refuse to honor the cache-controlling
headers (NCSA Mosaic in particular is a real offender, *ALWAYS*
serving pages from its local cache, even POST results).
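
A sketch of what I mean (the naming is illustrative, not my production
code):

    import os, random, time

    def decached_url(script, session_id):
        # Tack a throwaway element onto the end of the path so that no
        # two URLs served are ever identical; caches then have nothing
        # to reuse.
        nonce = '%d-%05d' % (int(time.time()), random.randint(0, 99999))
        return '%s/%s/%s' % (script, session_id, nonce)

    # The server still maps the request to the same script; the extra
    # elements simply arrive as PATH_INFO, where the script pulls out
    # the session id and ignores the trailing nonce.
    parts = os.environ.get('PATH_INFO', '/').split('/')
    session_id = parts[1] if len(parts) > 1 else ''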

Lastly, I *DO* have a robots.txt file to keep the lower levels of the
site from being indexed. Because of the unique path sections, the site
is potentially a 'black hole' (an infinite URL space). I have had one
robot fail to honor the robots.txt file and try to do a depth-first
search. As expected, it got caught in the loop and made several
thousand requests trying to read the 'whole site'.
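
The file itself is only a couple of lines; something along these lines
(the path is illustrative) is enough to keep well-behaved robots out of
the script-generated URL space:

    User-agent: *
    Disallow: /cgi-bin/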

So, basically, designing a site that needs user-unique state and needs
to co-exist with a WWW full of search engines and broken browsers is
an exercise in 'defense in depth'.

I chose NOT to use cookies when designing this site (last June)
because Netscape was the only browser supporting them at that time. If
I were designing the site today, I would use 'dual' code: detect
whether a browser supports cookies and, if it does, use them to push
state onto the browser and cut down the work the server has to do
(less active CGI). Something along these lines: check for a cookie or
a PATH id; if neither exists, request a cookie and assign a PATH id.
If there *is* a cookie, ignore the PATH id and manage state using the
cookie. If there is no cookie but there *is* a current PATH id, manage
state on the server. If there is no cookie and the PATH id is old
(more than a day, say), throw it out and start over with a new id and
a new cookie request.
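
A sketch of that decision, under my own naming and with a one-day
cutoff picked purely for illustration (not code from any real site):

    ONE_DAY = 24 * 60 * 60   # cutoff for treating a PATH id as stale

    def pick_state_mechanism(cookie_id, path_id, path_id_age):
        if cookie_id:
            # Browser supports cookies: let it carry the state and
            # ignore any PATH id that may also be present.
            return ('cookie', cookie_id)
        if path_id and path_id_age < ONE_DAY:
            # No cookie, but a reasonably fresh PATH id: keep managing
            # state on the server, keyed by that id.
            return ('server', path_id)
        # Neither, or the PATH id is stale: mint a new id, hand out a
        # fresh PATH id, and request a cookie at the same time.
        return ('new', None)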

Oh, on a general note, I know that many search engines will explicitly
*not* follow links from pages that end in '.cgi' or have a path
segment containing '.cgi/', for their *own* self-defense against
'black holes' caused by sites that do not have (but SHOULD have)
robots.txt files. This is not applicable to VV, but it does apply to
many sites. Checking VV: they are depending on 'pre-expiring' pages by
sending an expiration date of December 31, 1969. The search engines
are likely discarding the pages as being 'pre-expired'.

--
Benjamin Franz