Freely available robot code in C?

ecarp@tssun5.dsccc.com
Tue, 12 Dec 1995 22:25:31 -0600


DSC Communications is a multinational company located in Plano, TX with
offices all over the world. We have lots of technically savvy people in
the company, but not a lot of information on what different divisions are
doing within the company, especially with regard to web activities.

Since the division that I work for is Information Services, we feel that we
would like to get a handle on who is running web servers in the company,
what they have on them, and what they are being used for. The idea is to
eliminate duplication of effort (two or more departments put up servers, each
with similar information), and provide consistent information to our internal
departments.

I myself have been running a server (both internally and externally) for over
a year, and have many years of CS experience, so I feel that collecting
information on who is doing what wouldn't be an overwhelming task. We think
the best way to collect the information would be either to write some sort of
web collection program from scratch or to obtain a freely available one from
the net and modify it for our needs.

I have read the proposed FAQ and all of the etiquette documents, and the plan
of attack is to write or obtain a robot that would scan HTML text only,
signaling the server that we can handle only text (avoiding the overhead of
having to download images only to discard them), then build an Oracle database
composed of URLs and text which could be searchable via an SQL query.
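
To make the "text only" part concrete, here is a rough sketch in C of the kind
of request I have in mind. It is only an illustration under assumptions: the
function names build_request and is_html_type, the User-Agent string, and the
use of a Host header are all made up for this example, and snprintf may not
exist on older compilers. A server is free to ignore the Accept header, so the
robot should probably still check the Content-Type of each reply before
storing anything.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Format an HTTP/1.0 GET request that lists text/html as the only
 * acceptable type. Returns the request length, or -1 if it does not
 * fit in buf. The User-Agent value is a placeholder.
 */
int build_request(char *buf, size_t buflen, const char *host, const char *path)
{
    int n = snprintf(buf, buflen,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "User-Agent: dsc-robot/0.1\r\n"
                     "Accept: text/html\r\n"
                     "\r\n",
                     path, host);

    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return n;
}

/*
 * Return 1 if a Content-Type value starts with "text/html"
 * (ignoring case), else 0. Parameters such as "; charset=..."
 * after the type are accepted.
 */
int is_html_type(const char *value)
{
    const char *want = "text/html";
    size_t i;

    for (i = 0; want[i] != '\0'; i++) {
        if (tolower((unsigned char)value[i]) != want[i])
            return 0;
    }
    return 1;
}
```

The caller would write the buffer to a connected socket, read the reply
headers, and skip the body whenever is_html_type() returns 0 for the
Content-Type value.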

Comments or sample source code on doing such a task, or pointers to freely
available code, would be greatly appreciated. If no such code is available,
pointers on writing such a beast would also be appreciated.

One more note: in case I haven't made it clear already, the robot would, under no
circumstances, be allowed to search outside the DSC domain, and we have no
direct access to the outside world except through our firewall (which will only
filter selected packets from selected sites, and the internal web server isn't
on the list). This is intended to be an 'internal use only' project, and so
would not be used to generate revenue, nor would it be allowed to roam the
net at large.
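
One way the "stay inside the domain" rule might be enforced in the robot
itself, rather than relying on the firewall alone, is to test every host name
before fetching from it. The sketch below is only an illustration: the
function name host_in_domain is made up, and using "dsccc.com" as the domain
is an assumption based on my mail address.

```c
#include <ctype.h>
#include <string.h>

/*
 * Return 1 if host equals domain or ends in "." followed by domain
 * (ignoring case), else 0.
 */
int host_in_domain(const char *host, const char *domain)
{
    size_t hlen = strlen(host);
    size_t dlen = strlen(domain);
    size_t i;

    if (dlen == 0 || hlen < dlen)
        return 0;

    /* Compare the tail of host with domain, ignoring case. */
    for (i = 0; i < dlen; i++) {
        unsigned char a = (unsigned char)host[hlen - dlen + i];
        unsigned char b = (unsigned char)domain[i];

        if (tolower(a) != tolower(b))
            return 0;
    }

    /* Exact match, or a dot before the tail so "notdsccc.com" fails. */
    return hlen == dlen || host[hlen - dlen - 1] == '.';
}
```

The robot would call host_in_domain(host, "dsccc.com") on the host part of
each URL it finds and simply drop any link for which the result is 0.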

The other restriction on the robot is that it must be written in C. ANSI C
is not a requirement.

Any help or comments would be greatly appreciated. Thanks in advance...

--
Ed Carp, Senior Operations Analyst, DSC Communications

Please note that I do not speak for DSC Communications, nor are any statements made herein meant to be taken as a position, official or otherwise, of DSC Communications.