The reason behind this is that every so often I find a web site with
some interesting information, but I don't have the time (or money -
I have to pay for my connection) to study it all. From the first days
of accessing the web I wished I could copy pages or complete subtrees
to my computer, graphics and all.
Well, now I can. :-)
OTOH, I don't want to upset anyone with my program. Any comments are
appreciated.
Here is the basic functionality:
- The program starts at a given URL and follows all links that
  are in the same directory or below. (Starting with http://x/a/b/c/...
  it would follow /a/b/c/d/... but not /a/b/e/...) Embedded IMG
  graphics are an exception and are fetched even outside this scope.
  (A rough sketch of the filtering logic follows this list.)
- It will, optionally, follow links to other servers one level deep.
- No links with .../cgi-bin/... or ?parameters are followed.
- Only http: links are followed.
- No document is requested twice (to prevent loops).
- It will identify itself with User-agent: and From: headers.
- It will use HEAD requests when refreshing pages.
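
To make the scoping rules above concrete, here is a rough sketch of the
link filter in Python (purely for illustration; the names are made up, it
leaves out the IMG and one-level-deep exceptions, and it is not the actual
OS/2 code):

from urllib.parse import urlsplit

def should_follow(link, start_url, visited):
    start = urlsplit(start_url)
    target = urlsplit(link)
    # Only http: links are followed.
    if target.scheme != "http":
        return False
    # No links with .../cgi-bin/... or ?parameters are followed.
    if "/cgi-bin/" in target.path or target.query:
        return False
    # No document is requested twice (prevents loops).
    if link in visited:
        return False
    # Stay on the same server ...
    if target.netloc != start.netloc:
        return False
    # ... and in the same directory as the start URL, or below it.
    start_dir = start.path.rsplit("/", 1)[0] + "/"
    return target.path.startswith(start_dir)

So should_follow("http://x/a/b/c/d.html", "http://x/a/b/c/index.html", set())
would be accepted, while anything under /a/b/e/ or with ?parameters is
rejected.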
The program was started primarily for my own use, but I might release
it as shareware (when I'm sure it's well-behaved).
Since it is intended for the consumer market (it is written for OS/2),
the users of this program will generally be connected by modem (in my
case currently at 14,400 bps), which helps keep the bandwidth used down.
What I'd like to know:
- Should this program use /robots.txt? Is it the type of program
  that robots.txt is supposed to control? It is basically a web browser;
  the retrieved pages will just be read offline.
- How fast should I make my requests? Since this is not a robot in
the sense that it visits many different hosts, and since it is
not intended to traverse the whole server (after all, I have to
store all the data on my PC and I have to pay for the connection),
I'd rather not wait too long between requests.
  My idea is to read single pages in a similar way to how the IBM
  WebExplorer does it: read the main document and get all the embedded
  graphics as fast as possible, then wait some time (a few seconds) before
  making the next request. (See the sketch after this list.)
- What is the general feeling about copying web pages for
  non-commercial use?
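
To illustrate the pacing I have in mind, here is a small Python sketch
(fetch and find_image_urls are hypothetical placeholders for the actual
download and HTML-parsing routines, and the 5-second pause is just a guess
at "some seconds"):

import time

PAUSE_SECONDS = 5  # "some seconds" - the exact delay is the open question

def fetch_page_with_inlines(url, fetch, find_image_urls):
    # Get the main document and its embedded graphics back-to-back,
    # much like a browser rendering a single page ...
    html = fetch(url)
    for img_url in find_image_urls(html):
        fetch(img_url)
    # ... then leave the server alone for a while before the next page.
    time.sleep(PAUSE_SECONDS)
    return html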
TIA Thomas Stets
--
-----------------------------------------------------------------------------
Thomas Stets                    ! Words shrink things that were
Holzgerlingen, Germany          ! limitless when they were in your
                                ! head to no more than living size
stets@stets.bb.bawue.de         ! when they're brought out.
CIS: 100265,2101                !                    [Stephen King]
-----------------------------------------------------------------------------