More Robot Talk (was Re: email grabber)

Captain Napalm (spc@armigeron.com)
Thu, 16 Jan 1997 22:26:06 -0500 (EST)


It was thus said that the Great Andy Rollins once stated:
>
> Now how about more robot talk?
>
Okay. I've just been given the task of writing a robot for (yet another)
search engine. This part (the robot) interests me, so I'll probably end up
rolling my own, since I need to meet some design criteria that I'm not sure
current robots handle (did I mention I'm interested in this part?). I'll
get to those in a second.

What I want to know is: are there any search engines freely available? I
tried finding Harvest, but the site with the software seems to be down.
Database design is one of my weaker points (okay, so I failed that
particular class, twice), so I'm less than thrilled about this part of the
project.

Now, back to things robotic.

One of the things I'm concerned about is the canonical form of URLs. A
while back I wrote a small text file about this. I'd appreciate any
comments on it.

-spc (Without further ado ... )

The following URLs all map to the same file:

http://www.mgal.com/
http://www.mgal.com/index.html
http://WWW.Mgal.Com:80/
http://Www.Mgal.COM./index.html
http://www.MGAL.com.:80/
http://mgal.com/
http://mgal.com
http://mgal.com:80/
http://mgal.com.:80/index.html
http://WWW.MGAL.COM.:80/index.html#tag1

None, except for the first two, are in what I would call a canonical form
(given the port address). An HTTP URL can be broken down into the
following format:

[http:[//<domain>[:<port>]]][<pathorfile>[#<tag>]]

(Note: This covers both absolute and relative URLs. At least one of the two
outer bracketed parts is required, and in the case of relative URLs the
protocol is implied.)
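
As a rough illustration of that breakdown, here is a minimal Python sketch
that splits a URL into the parts named in the template. It leans on the
standard library's urllib.parse.urlsplit; the function name and the
dictionary keys are made up for this example and simply mirror the
template.

```python
from urllib.parse import urlsplit

def split_http_url(url):
    """Break a URL into the parts named in the template above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,     # "http", or "" for a relative URL
        "domain": parts.hostname,   # urlsplit lowercases it; None if absent
        "port": parts.port,         # int, or None when no port is written
        "pathorfile": parts.path,   # "" when nothing follows the domain
        "tag": parts.fragment,      # text after '#', or ""
    }

print(split_http_url("http://WWW.MGAL.COM.:80/index.html#tag1"))
```

Note that urlsplit already lowercases the host name but keeps a trailing
period, so it does only part of the canonicalization discussed here.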

The domain is the name or address to connect to. Either a name or an IP
address is valid for the domain part. If a name is given, it has to be
resolved to an IP address using DNS. DNS specifies that domain names are
case insensitive. Therefore, to canonicalize a domain name, convert it to a
consistent case. I choose to convert it to lowercase.

Also, an FQDN ends with a period but, by convention, the period is rarely
written and would probably serve more to confuse the neophyte, so it is
wise to drop any trailing period. So, we can convert the above to:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com:80/
http://www.mgal.com/index.html
http://www.mgal.com:80/
http://mgal.com/
http://mgal.com
http://mgal.com:80/
http://mgal.com:80/index.html
http://www.mgal.com:80/index.html#tag1

Which now eliminates two URLs:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com:80/
http://mgal.com/
http://mgal.com
http://mgal.com:80/
http://mgal.com:80/index.html
http://www.mgal.com:80/index.html#tag1
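
The host step above (lowercase the name, drop the trailing period) might be
sketched in Python like this. The function name normalize_host is invented
for the example, and the sketch assumes absolute URLs without a
user:password part, which it would discard.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_host(url):
    """Lowercase the host name and drop the trailing period of an FQDN."""
    parts = urlsplit(url)
    host = parts.hostname or ""        # urlsplit returns it in lowercase
    if host.endswith("."):
        host = host[:-1]
    # Keep the port exactly as written; it is dealt with separately.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

urls = [
    "http://www.mgal.com/",
    "http://www.mgal.com/index.html",
    "http://WWW.Mgal.Com:80/",
    "http://Www.Mgal.COM./index.html",
    "http://www.MGAL.com.:80/",
    "http://mgal.com/",
    "http://mgal.com",
    "http://mgal.com:80/",
    "http://mgal.com.:80/index.html",
    "http://WWW.MGAL.COM.:80/index.html#tag1",
]
unique = set(normalize_host(u) for u in urls)
print(len(unique))   # 8, matching the list above
```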

Now, the standard port for HTTP is TCP port 80. The only time you should
need to specify the port is when the HTTP server is running on a
non-standard port. In this case, since the given port is 80, it can be
eliminated as well:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com/
http://mgal.com/
http://mgal.com
http://mgal.com/
http://mgal.com/index.html
http://www.mgal.com/index.html#tag1

Which eliminates two more URLs:

http://www.mgal.com/
http://www.mgal.com/index.html
http://mgal.com/
http://mgal.com
http://mgal.com/index.html
http://www.mgal.com/index.html#tag1
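
Dropping the default port can be sketched along the same lines. The
DEFAULT_PORTS table and the function name are assumptions for this example;
only the HTTP default of 80 comes from the text above.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80}   # only the HTTP default is discussed here

def drop_default_port(url):
    """Remove ':<port>' when it is the standard port for the scheme."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # A port was written, so the network location ends in ':<digits>'.
        parts = parts._replace(netloc=parts.netloc.rsplit(":", 1)[0])
    return urlunsplit(parts)

print(drop_default_port("http://mgal.com:80/index.html"))
print(drop_default_port("http://mgal.com:8080/"))   # non-standard, kept
```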

Now, we are left with the following for the domain portions:

www.mgal.com
mgal.com

This is a difficult problem. One way would be to not allow a bare domain
name (mgal.com in this case) to be used, but to require an actual host (or
computer) to be specified. A DNS query for the IP address of mgal.com
returns 206.217.30.1, and a reverse query on that address returns
www.mgal.com.
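
A forward lookup followed by a reverse lookup could be sketched as below.
The function name is made up, and the two lookup functions are passed in as
parameters (defaulting to the standard socket calls) so the logic can be
tried without touching the network. Results from a live DNS today will of
course differ from the addresses quoted in this message.

```python
import socket

def reverse_name(host,
                 forward=socket.gethostbyname,
                 reverse=socket.gethostbyaddr):
    """Resolve host to an address, then map the address back to a name.

    Returns the name from the reverse lookup, or None if either lookup
    fails.
    """
    try:
        address = forward(host)
        name, _aliases, _addresses = reverse(address)
    except OSError:        # socket.gaierror and socket.herror are OSErrors
        return None
    return name.lower().rstrip(".")

# Stand-in lookups that replay the answers quoted in the text.
fake_forward = lambda host: "206.217.30.1"
fake_reverse = lambda address: ("www.mgal.com", [], [address])
print(reverse_name("mgal.com", fake_forward, fake_reverse))
```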

Now we are certain that both www.mgal.com and mgal.com are the same
host. But that may not always be the case. Take the following:

http://attache.armigeron.com/

A DNS query will return 204.29.162.30 and a reverse lookup will return
attache.armigeron.com. But the preferred address is www.armigeron.com. A
general heuristic might be to look up www.<domain> and, if no results come
back, take the given domain as is; otherwise use www.<domain>.

The minimum domain name consists of two sections separated by periods.
Anything longer usually consists of hosts, subdomains and domains. For
example:

http://sunrise.cse.fau.edu/
http://www.cse.fau.edu/

both refer to the same machine, but the second is probably the preferred
URL. We don't want to simply look up:

www.fau.edu

As that is another server in the domain. But, each host portion has to
map to a single machine, so stripping off the first portion:

cse.fau.edu

gives us a useful domain to apply the www part to:

www.cse.fau.edu

which returns a valid IP address.

In the following case:

http://snafu.lab34.r-n-d.bigcompany.com/
http://www.bigcompany.com/

both name the same machine. So, in this case, we can apply the rule
repeatedly, removing more and more of the domain and trying:

www.lab34.r-n-d.bigcompany.com
www.r-n-d.bigcompany.com
www.bigcompany.com

until one returns a valid response (or until the IP addresses match).
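
The stripping rule can be sketched as a small candidate generator. Both
function names are invented for this example, and "resolves" is a
caller-supplied test (for instance a wrapper around a DNS lookup), not
something defined in this message. The sketch takes the simpler of the two
stopping conditions, the first candidate that resolves; comparing IP
addresses as well would be the stricter variant.

```python
def www_candidates(host):
    """List the www.<domain> names to try, longest domain first."""
    labels = host.split(".")
    if labels[0] == "www":
        return [host]
    if len(labels) == 2:               # bare domain such as mgal.com
        return ["www." + host]
    # Strip leading labels one at a time, but always keep at least two,
    # since the minimum domain name has two sections.
    return ["www." + ".".join(labels[i:])
            for i in range(1, len(labels) - 1)]

def preferred_host(host, resolves):
    """Return the first www candidate that resolves, else the host as is."""
    for candidate in www_candidates(host):
        if resolves(candidate):
            return candidate
    return host

print(www_candidates("snafu.lab34.r-n-d.bigcompany.com"))
```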

So, applying this heuristic gives us:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com/
http://www.mgal.com
http://www.mgal.com/index.html
http://www.mgal.com/index.html#tag1

Of which, two can be eliminated:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com
http://www.mgal.com/index.html#tag1

And that takes care of the host portion of the URL.

The file portion of the URL is case sensitive, so translation to a
common case cannot be done. But a reference to a file has to start with a
'/', so where no file is specified an explicit '/' must be given. We add
it if missing:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com/
http://www.mgal.com/index.html#tag1

Which in this case, eliminates only one URL:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com/index.html#tag1
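
Adding the missing '/' is a one-line change once the URL is split. A
minimal sketch, with a made-up function name, assuming absolute URLs:

```python
from urllib.parse import urlsplit, urlunsplit

def ensure_root_path(url):
    """Give a URL with a host but no file portion an explicit '/'."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)

print(ensure_root_path("http://www.mgal.com"))
```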

For cataloging purposes, any tags defined within the document should be
ignored, so they should be dropped:

http://www.mgal.com/
http://www.mgal.com/index.html
http://www.mgal.com/index.html

Which eliminates one more:

http://www.mgal.com/
http://www.mgal.com/index.html

Which gives us the two canonical forms of the given URL.
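
Putting the steps together, here is one possible end-to-end sketch for
absolute http URLs, including the tag removal. The function name and the
preferred_host parameter are assumptions for this example; the small alias
table stands in for the DNS heuristic so that the example runs without a
network.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url, preferred_host=lambda host: host):
    """Apply the steps above to an absolute http URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").rstrip(".")    # lowercase, no final '.'
    host = preferred_host(host)                  # e.g. the www heuristic
    port = parts.port
    netloc = host if port in (None, 80) else f"{host}:{port}"
    path = parts.path or "/"                     # explicit '/' if missing
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

# Stand-in for the DNS lookups: a fixed table of preferred names.
alias = {"mgal.com": "www.mgal.com"}

urls = [
    "http://www.mgal.com/",
    "http://www.mgal.com/index.html",
    "http://WWW.Mgal.Com:80/",
    "http://Www.Mgal.COM./index.html",
    "http://www.MGAL.com.:80/",
    "http://mgal.com/",
    "http://mgal.com",
    "http://mgal.com:80/",
    "http://mgal.com.:80/index.html",
    "http://WWW.MGAL.COM.:80/index.html#tag1",
]
canonical = sorted(set(canonicalize(u, lambda h: alias.get(h, h))
                       for u in urls))
print(canonical)
# ['http://www.mgal.com/', 'http://www.mgal.com/index.html']
```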

Unfortunately, it cannot be assumed that a URL ending in a '/' and one
ending in 'index.html' (all else being the same) refer to the same
document, unless both documents are retrieved and compared. It can't even
be assumed that a URL ending in a name without an extension is a
directory!
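
One way to do that comparison, once the documents have been retrieved, is
to group URLs by a digest of the body. This is only a sketch: the choice of
SHA-256, the function name, and the sample bodies are assumptions, and the
fetching itself is left out.

```python
import hashlib

def group_by_content(pages):
    """Group URLs whose retrieved bodies are byte-for-byte identical.

    pages is an iterable of (url, body_bytes) pairs that the robot has
    already fetched. Returns a dict mapping digest to a list of URLs.
    """
    groups = {}
    for url, body in pages:
        digest = hashlib.sha256(body).hexdigest()
        groups.setdefault(digest, []).append(url)
    return groups

# Made-up bodies, purely for illustration.
pages = [
    ("http://www.mgal.com/", b"<html>home</html>"),
    ("http://www.mgal.com/index.html", b"<html>home</html>"),
    ("http://www.mgal.com/other.html", b"<html>other</html>"),
]
for same in group_by_content(pages).values():
    print(same)
```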

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html