> Also in the case where you have a
> http://www.some.where/dir/page.html
> URL you could try asking for the truncated
> http://www.some.where/dir
> page and see where the server redirects you to. Again it can only be
> corroborative evidence...
I just tried our main server (Apache 1.0.5 [1]) with the following URL:
http://www.armigeron.com/people/spc
And got a redirect back for
http://www.armigeron.com/people/spc/
I then tried:
http://attache.armigeron.com/people/spc
And got a redirect back for
http://www.armigeron.com/people/spc/
Not too bad. It won't really help for equivalent pairs like:
http://www.armigeron.com/people/spc/
http://www.armigeron.com/people/spc/index.html
But at least it might lead me to the correct host to use.
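For what it's worth, the probe is only a few lines; here's a rough
Python sketch (everything in it is illustrative, and the hostnames are
just the examples above):
-----------
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Refuse to follow redirects so we can read the Location
    # header ourselves instead of landing on the final page.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def probe_canonical(url):
    """Request a truncated directory URL; return the Location the
    server redirects to, or None if it serves the page directly."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        opener.open(urllib.request.Request(url, method="HEAD"))
        return None                      # 200: no redirect issued
    except urllib.error.HTTPError as err:
        if err.code in (301, 302):
            return err.headers.get("Location")
        return None

# probe_canonical("http://attache.armigeron.com/people/spc")
# -> "http://www.armigeron.com/people/spc/"
-----------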
> Add to your list the IP address as well - I've had engines index part of
> my site under IP address as someone used the IP to link to it instead of
> the FQDN.
I'd probably do a reverse name lookup on the IP address, then go from
there. If the name doesn't resolve, then discard the URL.
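In Python terms the check could be as small as this sketch
(socket.gethostbyaddr is the standard reverse-lookup call):
-----------
import socket

def host_for_ip(ip):
    """Reverse-resolve an IP address; return the primary hostname,
    or None if it doesn't resolve (discard the URL in that case)."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None
-----------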
> Watch out for AOL who seems to have case independent names in the 'file'
> part.
>
Now this I didn't know (why did they go and do a stupid thing like
that?). I was planning on keeping individual server records (for things
like robots.txt, last time visited, etc.), so it shouldn't be too hard
to figure out whether a web server has case-insensitive filenames.
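One cheap heuristic to feed that server record, again just a sketch:
re-request a URL you already know is good with the case of its path
flipped, and see if the server still answers. (A server could
legitimately have files under both spellings, so treat a hit as a
hint, not proof.)
-----------
import urllib.error
import urllib.parse
import urllib.request

def seems_case_insensitive(known_good_url):
    """Flip the case of the path of a URL known to exist; a 200 on
    the flipped spelling suggests the server treats filenames
    case-insensitively."""
    parts = urllib.parse.urlsplit(known_good_url)
    flipped = parts._replace(path=parts.path.swapcase())
    if flipped.path == parts.path:   # no letters in the path to flip
        return False
    try:
        with urllib.request.urlopen(urllib.parse.urlunsplit(flipped)):
            return True              # served anyway: likely insensitive
    except urllib.error.HTTPError:
        return False                 # 404 etc.: case matters here
-----------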
Another idea I had (and it was just an idea): if a robots.txt file was
not found (404), have my robot send a message to the webmaster at the
site (if an address can be found, with postmaster as a last resort)
giving the location of the robots.txt specification, plus a small file
they can use:
-----------
# for more info, check out http://www...
User-agent: *
Disallow:
-----------
But only ONCE (or maybe once every six months or so). If robots.txt is
found, no message is ever sent.
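The once-per-six-months rule is easy to hang off the per-server record;
a minimal sketch, assuming the record is a dict with hypothetical
'has_robots_txt' and 'last_notified' fields:
-----------
import time

SIX_MONTHS = 182 * 24 * 60 * 60      # roughly six months, in seconds

def should_notify(record, now=None):
    """Return True if the webmaster message should be sent, and stamp
    the record so it won't be sent again for six months."""
    now = time.time() if now is None else now
    if record.get("has_robots_txt"):
        return False                  # robots.txt found: never send
    last = record.get("last_notified")
    if last is not None and now - last < SIX_MONTHS:
        return False                  # already nagged recently
    record["last_notified"] = now
    return True
-----------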
-spc (At least then, more web masters would hear of the concept ... )