Does this count as a robot?

Thomas Stets (stets@stets.bb.bawue.de)
Sun, 7 Jan 96 2:50:47 CET


I am currently writing (actually it's already written and I'm testing)
a program to copy a subtree of a server to my machine.

The reason behind this is that every so often I find a web site with
some interesting information, but I don't have the time (or money -
I have to pay for my connection) to study it all. From the first days
of accessing the web I wished I could copy pages or complete subtrees
to my computer, graphics and all.

Well, now I can. :-)

OTOH, I don't want to upset anyone with my program. Any comments are
appreciated.

Here is the basic functionality:

- The program starts at a given URL and follows all links that
are in the same directory or below. (Starting with http://x/a/b/c/...
it would follow /a/b/c/d/... but not /a/b/e/...) Inline IMG graphics
are the one exception to this rule. (A rough sketch of this scope
check follows the list.)
- It will, optionally, follow links to other servers one level deep.
- No links with .../cgi-bin/... or ?parameters are followed.
- Only http: links are followed.
- No document is requested twice (to prevent loops).
- It will identify itself with User-agent: and From: headers.
- It will use HEAD requests when refreshing pages.
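
In case it makes the rules above clearer, here is a rough sketch of
the scope check and the "already visited" list, written in Python
syntax purely as pseudocode (the real program is not written in
Python, and the helper names are made up):

    import urllib.parse

    visited = set()   # no document is requested twice, to prevent loops

    def in_scope(start_url, candidate_url):
        start = urllib.parse.urlsplit(start_url)
        cand = urllib.parse.urlsplit(candidate_url)
        # Only http: links are followed.
        if cand.scheme != "http":
            return False
        # Stay on the same server; following links to other servers
        # one level deep is a separate option and is not shown here.
        if cand.netloc != start.netloc:
            return False
        # No links with .../cgi-bin/... or ?parameters.
        if "/cgi-bin/" in cand.path or cand.query:
            return False
        # Same directory or below: starting from /a/b/c/... this
        # accepts /a/b/c/d/... but rejects /a/b/e/...
        start_dir = start.path.rsplit("/", 1)[0] + "/"
        return cand.path.startswith(start_dir)

    def should_fetch(start_url, candidate_url):
        if candidate_url in visited or not in_scope(start_url, candidate_url):
            return False
        visited.add(candidate_url)
        return True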

The program was started primarily for my own use, but I might release
it as shareware (when I'm sure it's well-behaved).

Since it is intended for the consumer market (it is written for OS/2),
the users of this program will generally be connected by modem (in my
case currently at 14,400 bps), which helps keep the bandwidth used down.

What I'd like to know:

- Should this program use /robots.txt? Is it the type of program
that robots.txt is supposed to control? It is basically a web browser;
the retrieved pages will just be read offline.
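
If the answer is yes, my understanding is that honoring /robots.txt
would look roughly like the sketch below (Python again, only as an
illustration; "MyCopier" as the User-agent name is just a placeholder):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://x/robots.txt")   # "x" stands for the target server
    rp.read()                           # fetch and parse the exclusion file
    if rp.can_fetch("MyCopier", "http://x/a/b/c/page.html"):
        pass   # request the page as usual
    else:
        pass   # skip the page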

- How fast should I make my requests? Since this is not a robot in
the sense that it visits many different hosts, and since it is
not intended to traverse the whole server (after all, I have to
store all the data on my PC and I have to pay for the connection),
I'd rather not wait too long between requests.

My idea is to read single pages in a similar way to how the IBM
WebExplorer does it: read the main document and get all the embedded
graphics as fast as possible, then wait some time (a few seconds)
before making the next request.
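
As a sketch (same disclaimers as above; the delay value and the header
contents are just example figures), the pacing would be something like:

    import time
    import urllib.request

    DELAY_SECONDS = 5   # "some seconds" between page requests

    def fetch(url):
        # The program identifies itself with User-agent: and From:;
        # both values here are placeholders.
        req = urllib.request.Request(url, headers={
            "User-Agent": "MyCopier/0.1",
            "From": "stets@stets.bb.bawue.de",
        })
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def fetch_page(page_url, image_urls):
        page = fetch(page_url)
        # Embedded graphics are fetched back to back, as a browser would.
        images = [fetch(u) for u in image_urls]
        # Then pause before requesting the next page.
        time.sleep(DELAY_SECONDS)
        return page, images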

- What is the general feeling towards copying web pages for
non-commercial use?

TIA Thomas Stets

-- 
-----------------------------------------------------------------------------
Thomas Stets                              ! Words shrink things that were
Holzgerlingen, Germany                    ! limitless when they were in your
                                          ! head to no more than living size
stets@stets.bb.bawue.de                   ! when they're brought out.
CIS: 100265,2101                          ! [Stephen King]
-----------------------------------------------------------------------------