in-document directive to discourage indexing?

Denis McKeon (dmckeon@swcp.com)
Tue, 18 Jun 1996 21:20:18 -0600


Q: Is there a way to discourage a search engine or web robot from
indexing a particular HTML document *within the document itself*?

Using robots.txt would do the job, and it is convenient for someone
who maintains both a set of pages and the web site that hosts them.
But for those of us using Web presence providers (aka ISPs/IAPs) to
host our pages, it means the system administrator would have to edit
(or auto-append to) robots.txt every time any user asked for a
change. Doing that manually might become tiresome for the sys-admin
or expensive for the user, depending on the arrangement. The
proposed robots.txt specification states:

>A possible drawback of this single-file approach is that only a server
>administrator can maintain such a list, not the individual document
>maintainers on the server. This can be resolved by a local process to
>construct the single file from a number of others, but if, or how,
>this is done is outside of the scope of this document.
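
For reference, a robots.txt assembled by such a local process might
look like this - the user names and paths are invented purely for
illustration:

    # robots.txt - built nightly from per-user fragments
    User-agent: *
    Disallow: /~alice/drafts/
    Disallow: /~bob/old-pages/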

Is there any proposed extension to the HTML spec (or another spec)
that would allow embedding a "don't index this page" directive in
the page itself?
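
For instance, one could imagine a META element along these lines -
the name and value are only my guess at a plausible form, not
anything I have seen specified:

    <META NAME="ROBOTS" CONTENT="NOINDEX">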

That would allow the site admin to run a script every night to
update robots.txt, and would allow a robot to discard a page
retrieved in the window between the addition of the directive and
the next robots.txt update.

Is there any sort of robots.txt auto-updating script available?
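
If not, here is a sketch of one, in Python - assuming, purely for
illustration, that each user keeps their "Disallow:" lines in a
~user/public_html/robots.part file (the fragment name and all the
paths are made up; adjust for the local layout):

    #!/usr/bin/env python
    # Sketch: rebuild the server-wide robots.txt from per-user
    # fragment files, e.g. from a nightly cron job.
    import glob

    fragments = []
    for part in sorted(glob.glob("/home/*/public_html/robots.part")):
        with open(part) as f:
            # keep only Disallow lines, so one user's fragment
            # cannot alter records affecting anyone else's pages
            fragments.extend(l for l in f if l.startswith("Disallow:"))

    with open("/usr/local/www/robots.txt", "w") as out:
        out.write("User-agent: *\n")
        out.writelines(fragments)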

Using access control is another possibility - a .htaccess file or
equivalent - but is that a reasonable approach? (I doubt it; it
seems you'd have to list every robot that might visit your site.)
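
For what it's worth, under Apache that enumeration might look like
the following (mod_setenvif plus mod_access; the robot names are
invented) - and the list is never complete, which is why I doubt it:

    # .htaccess sketch - turn away named robots by User-Agent
    BrowserMatch "SomeRobot"   is_robot
    BrowserMatch "OtherSpider" is_robot
    <Limit GET>
    order allow,deny
    allow from all
    deny from env=is_robot
    </Limit>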

On a related note, both DejaNews and Alta Vista now allow use of a header:

X-No-Archive: yes

in news articles. I am hoping to find a similar user-controllable
method for HTML documents. (BTW, if anyone knows why they didn't
use "X-Archive: no" - please share the reason.)
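
Failing a new HTTP header, HTML's existing HTTP-EQUIV mechanism for
META elements suggests one form such a user-controllable flag could
take - pure speculation on my part, and I know of no engine that
honors it:

    <META HTTP-EQUIV="X-No-Archive" CONTENT="yes">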

A meta-topic - why would one want to discourage indexing of a page?

to avoid having parallel versions of a document indexed.
(graphics-heavy & text-only HTML, HTML & plain text,
current & previous, test/draft/beta & release)

to avoid having temporary documents indexed.
(page X moved to location Y)

to encourage users to enter a page hierarchy through a top page -
although that suggests a different directive:
"index this page, but point users at the top page first"

I don't mean to re-visit the "private pages" vs. "more indexing"
issue with this question, so let's assume that the page or site
maintainer has good reasons for encouraging robots to index some
parts of their site, and for discouraging them from indexing other
parts of it.

thanks in advance,

-- 
Denis McKeon 
dmckeon@swcp.com