Re: More Robot Talk (was Re: email grabber)

Captain Napalm (spc@armigeron.com)
Fri, 17 Jan 1997 03:19:19 -0500 (EST)


It was thus said that the Great Martin Kiff once stated:
>
> > Now, back to things robotic.
>
> If you are casting round for more 'corroborative evidence' about the
> canonical URL of a page, the Base Href (if it is included) should give
> a clue, and also - although vanishingly few pages have it - the URI
> meta tag.

Ah, thanks. Now to go look up what these are 8-)

> Also in the case where you have a
> http://www.some.where/dir/page.html
> URL you could try asking for the truncated
> http://www.some.where/dir
> page and see where the server redirects you to. Again it can only be
> corroborative evidence...

I just tried our main server (Apache 1.0.5 [1]) with the following URL:

http://www.armigeron.com/people/spc

And got a redirect back to

http://www.armigeron.com/people/spc/

I then tried:

http://attache.armigeron.com/people/spc

And got a redirect back to

http://www.armigeron.com/people/spc/

Not too bad. Won't really help for stuff like:

http://www.armigeron.com/people/spc/
http://www.armigeron.com/people/spc/index.html

But at least it might lead me to the actual correct host to use.
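The redirect trick above can be sketched in a few lines (modern Python, purely for illustration): request the directory path *without* its trailing slash, and if the server answers 301/302, the host in its Location header is a hint at the canonical name. The network call is left out so the decision logic stays testable; the hostnames are the ones from the examples above.

```python
# Decide what a slash-redirect tells us about the canonical host.
# Inputs are the host we asked, the response status, and the Location
# header the server sent back (if any).
from urllib.parse import urlsplit

def canonical_host_hint(requested_host, status, location):
    """Return the host the server itself redirected to, or None if
    the response was not a redirect."""
    if status not in (301, 302) or not location:
        return None
    return urlsplit(location).hostname

# The attache -> www case from the message body:
hint = canonical_host_hint("attache.armigeron.com", 301,
                           "http://www.armigeron.com/people/spc/")
print(hint)  # www.armigeron.com
```

As the original poster says, this is only corroborative: it settles the host and the trailing slash, but not the `/dir/` versus `/dir/index.html` ambiguity.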

> Add to your list the IP address as well - I've had engines index part of
> my site under IP address as someone used the IP to link to it instead of
> the FQDN.

I'd probably do a reverse name lookup on the IP address, then go from
there. If the name doesn't resolve, then discard the URL.
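A minimal sketch of that reverse-lookup check (an illustration, not the poster's actual code): map the IP back to a name with a PTR query and discard the URL when nothing resolves. `gethostbyaddr` raises on failure, hence the try/except.

```python
# Reverse-resolve an IP; return the PTR name, or None if it does not
# resolve (the "discard the URL" case described above).
import socket

def host_for_ip(ip):
    try:
        name, _aliases, _addrs = socket.gethostbyaddr(ip)
        return name
    except OSError:        # covers herror/gaierror: no PTR record
        return None
```

One caveat the message doesn't mention: a PTR record can name a host other than the one the site is linked under, so the result is, again, only corroborative evidence.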

> Watch out for AOL who seems to have case independent names in the 'file'
> part.
>
Now this I didn't know (now why did they go and do a stupid thing like
that?). I was planning on keeping individual server records (for things
like robots.txt, last time visited, etc) so I guess it shouldn't be too hard
to figure out if a web server has case insensitive filenames or not.
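One way to probe that per-server (my own suggestion, not something from the thread): fetch the same path twice, once as-is and once upper-cased, and see whether both spellings succeed. The comparison is split out as a pure function; the HTTP part uses Python's `http.client` just for illustration.

```python
# Crude per-server probe for case-insensitive filenames: if both
# spellings of the same path return 200, the server probably folds case.
import http.client

def looks_case_insensitive(status_as_is, status_upper):
    return status_as_is == 200 and status_upper == 200

def probe(host, path):
    """HEAD the path as-is and fully upper-cased, compare statuses."""
    statuses = []
    for p in (path, path.upper()):
        conn = http.client.HTTPConnection(host)
        conn.request("HEAD", p)
        statuses.append(conn.getresponse().status)
        conn.close()
    return looks_case_insensitive(*statuses)
```

The result would go into the per-server record alongside robots.txt and the last-visited time.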

Another idea I had (and it was just an idea) was to have my robot send a
message to the web master at a site (if an address can be found - postmaster
as a last resort) whenever the robots.txt file was not found (404), giving
the location of the robots.txt specification, plus a small file they can use:

-----------
# for more info, check out http://www...
User-agent: *
Disallow:
-----------

But ONCE (or maybe, once every six months or so). If robots.txt is found,
then no message is ever sent.
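That once-per-six-months rule fits naturally into the per-server records mentioned earlier. A sketch of the decision (field names are invented for the illustration):

```python
# Decide whether to mail a webmaster about a missing robots.txt,
# given the status of the robots.txt fetch and when (if ever) we last
# sent this server the note.
import time

SIX_MONTHS = 182 * 24 * 3600  # roughly six months, in seconds

def should_notify(robots_status, last_notified, now=None):
    """Mail only when robots.txt came back 404 and we have not written
    to this server within the last six months."""
    if robots_status != 404:
        return False               # robots.txt exists: never mail
    now = now if now is not None else time.time()
    return last_notified is None or now - last_notified >= SIX_MONTHS
```

After sending, the robot would record the timestamp in the server's record so repeat visits stay silent.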

-spc (At least then, more web masters would hear of the concept ... )

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html