Re: Analysing the Web (was Re: Info on large scale spidering?)

Patrick Berchtold (berchtold@www.student.isoe.ch)
Fri, 31 Jan 1997 12:07:46 +0100


On Tue, 21 Jan 1997, Greg Fenton wrote:
>
> .....
>
> Like I said, crawling is the easy part. The average size of a text
> page that I calculated on a base 1.5 million pages is 7.5kb
> [Can anyone confirm/deny?].
>
> .....
>

I thought a lot about that over the past couple of days. I looked for some
information on it and finally found that there is no comprehensive
reference on this subject.

Anyone who wants to work on large scale spidering should know more about the
Web. Some kind of statistics would be really helpful, and what would be
easier than using a robot itself to gather this information?

I think of a robot that

1) starts at a given URL
2) downloads HTML page
3) parses the retrieved file
4) collects information from HTTP header and parsed contents
5) collects information about all linked resources that are not text/html
or wwwserver/html-ssi
6) writes the collected data to some kind of database
7) follows one or more links to a resource of type text/html or
wwwserver/html-ssi
8) continues with 2)
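
To make that loop concrete, here is a minimal sketch in Python (standard
library only). The breadth-first queue, the page limit and the latin-1
decoding are my own illustrative choices, not part of the proposal; the
statistics collection of steps 4)-6) is only a placeholder, and robots.txt
handling, politeness delays and the HEAD-only treatment of non-HTML
resources from 5) are left out here.

import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=100):          # 1) starts at a given URL
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)                    # 7) follows one of the links
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as resp:   # 2) full GET download
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                body = resp.read().decode("latin-1", "replace")
        except OSError:
            continue
        parser = LinkParser()                 # 3) parses the retrieved file
        parser.feed(body)
        # 4)-6) collect statistics and write them to the database here
        queue.extend(urljoin(url, link) for link in parser.links)  # 8) continue with 2)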

Notes about 2)
These pages must be completely downloaded using GET, see also 5)

Notes about 3)
A fast parsing algorithm is necessary, ideally one that parses a file in
less time than it takes to download it.

The following things could be of interest:
- size *
- number of links
- number of links to other sites
- size of indexable text (tags stripped)
- number of inline images
- date/time of last modification *
- META information
- ...

* Size and date/time of the file can be taken from the HTTP header fields.
See also 4)

Thus the parser must be able to
- find links and determine which type of link it has found (href, image, etc.)
- measure the size of the document with all tags stripped
- find and interpret META information
- ...?
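
A sketch of such a parser, again with Python's html.parser. What counts as
"indexable text" and as a link to another site is my own interpretation
(text inside <script> or <style> is not filtered out here, for example):

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class StatsParser(HTMLParser):
    """Collects the per-page figures listed above."""
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.host = urlparse(base_url).netloc
        self.links = self.offsite_links = self.images = 0
        self.text_size = 0          # size with all tags stripped, in bytes
        self.meta = {}              # META name -> content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links += 1
            target = urlparse(urljoin(self.base, attrs["href"]))
            if target.netloc and target.netloc != self.host:
                self.offsite_links += 1
        elif tag == "img":
            self.images += 1
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_data(self, data):
        self.text_size += len(data.encode("utf-8"))

Feeding the downloaded page into StatsParser(url).feed(body) would then
provide the counters needed in 4).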

Notes about 4)
Trivial task: counts links found by the parser, looks at HTTP header
fields, etc.

Notes about 5)
Trivial as well. The following info could be of interest:
- size
- date/time of last modification
- type of resource
- ...?

The point is: the robot does not have to parse resources that are not HTML
text. All the interesting information is available in the HTTP header and
can thus be obtained by sending HEAD requests. This will save a lot of
network bandwidth.
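
A sketch of that, again with urllib. Content-Type, Content-Length and
Last-Modified are the standard header fields; the function name is my own:

import urllib.request

def head_info(url):
    """Fetch only the HTTP header of a resource (no body is transferred)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return {
            "type": resp.headers.get("Content-Type"),
            "size": resp.headers.get("Content-Length"),
            "last_modified": resp.headers.get("Last-Modified"),
        }

# e.g. head_info("http://example.org/logo.gif") returns the three fields,
# with None for any header the server did not send.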

Notes about 6)
This process would of course collect tons of data, but not everything has
to be stored. I think of a database containing:
- Every URL visited, with date/time and error code
- Counters for file size, # of links, ... (most of them could be
cumulative)
- ...

It should contain enough to calculate things like
- how many pages contain links to other sites (in percent)
- avg # of links
but also
- # of pages that do not contain any links (a dead end for link-following
robots)
- # of resources that were not available (reason?)
- and more (perhaps: how many pages are framesets)
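
One possible shape for such a database, sketched with sqlite3 from Python's
standard library. The table layout, the column names and the example
queries are my own assumptions about what would be "enough" here:

import sqlite3

conn = sqlite3.connect("webstats.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS pages (
    url           TEXT PRIMARY KEY,
    visited_at    TEXT,          -- date/time of the visit
    status        INTEGER,       -- HTTP status / error code
    size          INTEGER,       -- size from the Content-Length header
    text_size     INTEGER,       -- size with all tags stripped
    links         INTEGER,       -- number of links
    offsite_links INTEGER,       -- number of links to other sites
    images        INTEGER        -- number of inline images
);
""")

# pages with links to other sites, in percent
pct_offsite = conn.execute(
    "SELECT 100.0 * SUM(offsite_links > 0) / COUNT(*) FROM pages").fetchone()[0]
# average number of links per page
avg_links = conn.execute("SELECT AVG(links) FROM pages").fetchone()[0]
# dead ends for link-following robots
dead_ends = conn.execute(
    "SELECT COUNT(*) FROM pages WHERE links = 0").fetchone()[0]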

All this should give us a comprehensive basis for calculations and robot
designs, and thus a chance to develop really powerful robots.

Please comment on this.

Patrick

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html