Re: The Internet Archive robot

Z Smith (zsmith@archive.org)
Thu, 03 Oct 1996 15:28:17 -0700


At 09:08 PM 10/3/96 +0100, Rob Hartill wrote:

>Just a quick check. Will you be using robots.txt to guide the robot
>away from graphics or just html ?

As our original posting (5-Sep-96) noted, we are obeying the Standard for
Robot Exclusion, which includes obeying robots.txt.
>
>I've always had my graphics blocked from robots via robots.txt. Will
>your robot respect that or will it be allowed to grab any inline images
>on a robot-allowed html page ?

Again, robots.txt lets you list directories that robots should not fetch
(such as, say, /images), and our robot will stay out of them.
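
For anyone who wants to check what such a rule does before relying on it,
here is a minimal sketch using Python's standard urllib.robotparser module.
The robots.txt contents, the example.com URLs, and the helper name allowed()
are all made up for illustration; only the "ia_archiver" agent name comes
from our posting. It shows how a rule-following robot would read the file,
not how our robot is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep ia_archiver out of /images/,
# and leave everything open to all other robots.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /images/

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/images/logo.gif"):
        print(url, allowed("ia_archiver", url))
```

With this file the HTML page is allowed and the image under /images/ is not,
which is the "directories approach" described above.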

We are also obeying the NOINDEX and NOFOLLOW values of the robots META tag
if we find them inside a given HTML page. So if you want your text to be
indexed but your images not to be retrieved by the archive, you can use the
directories approach or use NOFOLLOW (so the links to the in-line images
referenced by an HTML page won't be followed). (more info at
http://info.webcrawler.com/mak/projects/robots/faq.html#noindex)
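
As a rough illustration of the META tag approach, the sketch below uses
Python's standard html.parser module to pull the robots directives out of a
page. The sample page, the class name RobotsMetaParser, and the helper
robots_directives() are hypothetical; this is one plausible way a robot
could read the tag, not a description of our own code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when written without "=value".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().upper()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of upper-cased robots directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page: index the text, but do not follow its links.
PAGE = (
    '<html><head><meta name="robots" content="index, nofollow"></head>'
    '<body><img src="/images/logo.gif"></body></html>'
)

if __name__ == "__main__":
    found = robots_directives(PAGE)
    print("NOFOLLOW" in found, "NOINDEX" in found)
```

A robot that sees NOFOLLOW on this page would keep the text but leave the
links it references alone.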

We've been giving some thought to the idea of Yet Another Tag for allowing
people to say "you can index this page but I don't want it archived" or
maybe "it's OK to archive this but I want to give permission before that
archive is accessible" or even "it's OK to archive this but it's sealed for
5 years". It has the potential to be a big messy question so we're thinking
about it for a while more before proposing something back to the community.

>
>btw, what USER-AGENT will you use to identify the robot to the servers
>being visited ?
>
As we mentioned in our original posting of 5-Sep-96 to this list, the
User-agent is "ia_archiver". It's been a month since that posting, and
since some of the questions we're seeing could be answered by a careful
reading of it, I'll excerpt it here:

---------------------------------------------------------------------
>Return-Path: <burner@archive.org>
>X-Sender: burner@archive.org
>Date: Thu, 05 Sep 1996 20:46:38 -0700
>To: robots@webcrawler.com
>From: Mike Burner <burner@archive.org>
>Subject: The Internet Archive robot
>
>Hello World,
>
>The Internet Archive robot, which will identify itself as "ia_archiver" in
>the "User-Agent:" HTTP header field, will begin archiving the Web over the
>next few days.
>
>In the short term, the archiver will focus only on images; the lists of
>these have been derived from our existing HTML feeds, kindly donated to us
>by several of the commercial search engines.
>
>The archiver will obey the Standard for Robot Exclusion, and will take pains
>to tread softly on the surface of the Net. If anyone feels we have failed
>in either case, please let us know so that we can rectify the problem as
>quickly as possible.
>
>Internet Archive is gathering, storing, and providing access to public
>materials on the Internet such as the World Wide Web, Netnews, and
>downloadable software. The collection, reaching ten terabytes, will provide
>historians, researchers, scholars, and others access to this vast collection
>of data, and ensure the longevity of the information.
>
>For more information, including how you can help the Archive, please visit
>our web site (http://www.archive.org).
>
>Mike Burner
>
----------------------------------------------------------------------

Hope this helps clarify things.

Z