Re: More Robot Talk (was Re: email grabber)

bbh@xenodata.com
Fri, 17 Jan 1997 02:02:42 -0600


I don't know where this thread emerged from, but I welcome it.

I wrote a robot from scratch several months ago. I have not run it in at least
a couple of months. There are some very good technical issues involving
URL parsing, indexing and validation that have not been discussed much if at
all around here. At the risk of getting kicked off this list, it seems there are
too many admins and not enough tech types subscribing to this list. I
remember someone stating that the purpose of this list was for admins
to mull the ramifications of bots and /robots.txt and such. Good work has been
done, no doubt, but if that is all there is to this list, the tech types need
to move on. Assuming there is enough room for everyone, I'll proceed.

******************************************************************
My opinions on the canonicalization/normalization of URLs:

//////////////////////////////////////////////////////////////////
Protocol and hostname should always be mapped to lowercase.
TCP port should always be made explicit, even when it is the default
(:80 for http).
Trailing slash should always be preserved.
domainname.com should be assumed to be the same host as
www.domainname.com, ftp.domainname.com, mail.domainname.com, etc.
until discovered otherwise.
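
To make the rules above concrete, here is a rough sketch in Python. It is
only an illustration, not the code of the robot described in this message.
The names canonicalize and provisional_host_key, the DEFAULT_PORTS table and
the ALIAS_LABELS set are all hypothetical. The sketch drops any user:password
part and ignores IPv6 literals.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports, for illustration only; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

# Assumed leading labels treated as aliases of the bare domain
# until a lookup or a fetch shows otherwise.
ALIAS_LABELS = {"www", "ftp", "mail"}


def canonicalize(url):
    """Lowercase scheme and host, make the port explicit, keep the path as-is."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    netloc = f"{host}:{port}" if port is not None else host
    # Path, query and fragment are passed through untouched, so the case of
    # the path and any trailing slash are preserved.
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


def provisional_host_key(host):
    """Strip one leading alias label (www, ftp, mail) from a hostname."""
    host = host.lower()
    label, _, rest = host.partition(".")
    # Only strip when something that still looks like a domain remains.
    if label in ALIAS_LABELS and "." in rest:
        return rest
    return host
```

The host key is meant only as a first guess for grouping hosts. It should be
dropped for any host that turns out to serve different content.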

//////////////////////////////////////////////////////////////////
Additional work for ambitious bots (or secondary support processes):

Look up IP addresses for host.
(Note: your DNS client or cache may not deliver all addresses (A records).
I can demonstrate if anyone cares.)
Reverse lookup names for IP address.
(Note: see above)
Correlate/Strike/Reduce host database with new intelligence.

Note: I do not do this directly now, but it is "easy", in the unix programmer sense.
Doing this properly would probably reduce bot traffic by FILL_IN_THE_BLANK.
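
A minimal sketch of the lookup and correlate steps above, in Python, using
only the standard socket module. The function names and the injectable
resolve parameter are hypothetical. As noted above, the local resolver may
not return every A record, so the results are a hint and not proof. A shared
address also does not guarantee that two names serve the same content.

```python
import socket
from collections import defaultdict


def forward_addresses(host):
    """Return the IPv4 addresses the local resolver reports for host."""
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(host)
    except OSError:  # covers socket.gaierror and socket.herror
        return set()
    return set(addresses)


def reverse_names(address):
    """Return the lowercased names the resolver reports for an IP address."""
    try:
        name, aliases, _addresses = socket.gethostbyaddr(address)
    except OSError:
        return set()
    return {n.lower() for n in [name, *aliases]}


def group_hosts_by_address(hosts, resolve=forward_addresses):
    """Map each IP address to the set of known hostnames resolving to it."""
    by_address = defaultdict(set)
    for host in hosts:
        for address in resolve(host):
            by_address[address].add(host.lower())
    return dict(by_address)
```

Any address that maps to more than one name is a candidate for merging
entries in the host database, subject to checking the content.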

//////////////////////////////////////////////////////////////////
Storing of URLs:

I run an md5 of the URL string, following at least some of the above
steps. Example:

md5("http://uppercase.com:80/LOWERCASE.HTML") = c294c56dd5f161056117eb7503c4cb30

I store the md5 string with NDBM (the NDBM with Solaris 2.4). With approx 200K
records the lookups are snappy. I have experienced NDBM data file corruption.
I do not want to use a full-blown RDBMS. I have not used GNU DBM. mSQL is slow.
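
For illustration, a sketch of the same idea in Python, with the standard dbm
module standing in for NDBM (an assumption; dbm picks whichever backend is
available on the platform). The names url_key and mark_seen are hypothetical.
The URL is assumed to be canonicalized before it is hashed.

```python
import dbm
import hashlib


def url_key(canonical_url):
    """Return the md5 hex digest of the URL string as an ASCII bytes key."""
    digest = hashlib.md5(canonical_url.encode("utf-8")).hexdigest()
    return digest.encode("ascii")


def mark_seen(db, canonical_url):
    """Record the URL; return True if it was new, False if already stored."""
    key = url_key(canonical_url)
    if key in db:
        return False
    db[key] = b"1"
    return True
```

Usage would be along the lines of opening the file with
dbm.open("seen_urls", "c") and calling mark_seen(db, url) for each URL pulled
from a page. Only URLs for which it returns True would be queued for fetching.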

//////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////

The previous is something like a stream of consciousness dump, but I hope
it helps prime more technical discussion, or helps me decide to jump
out of this list.

//////////////////////////////////////////////////////////////////
Bryan Hackney
bbh@xenodata.com

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html