> From: Aaron Nabil[SMTP:nabil@teleport.com]
> Subject: What to rate limit/lock on, name or IP address?
>
> It would be undesirable to lock against both the IP address and
> the site name. So I won't.
>
> Should I switch from IP address to site name, or leave it as it is?
>
>
> This is the heart of the site-lock problem for most multi-site crawlers. And
> from the study I did, there were just as many problems locking against IP
> addresses as against hostnames. There may be multiple IP addresses
> per hostname, multiple hostnames per IP address, or both.
>
> I didn't come to a proper conclusion, but I ended up sticking with IP address
> locking. The proper thing would be to uniquely identify each "site". This
> would involve a lot of work, playing around with forward and reverse name
> lookups and a fair amount of by-hand editing.
>
> I don't think that there is a simple way to do it otherwise (suggestions?),
> though this is one of the extensions to the REP that I had proposed a while
> back.
>
> gregf.
>
Hi Greg ...
Your points are quite sound, and the IP Address + Port
would be the most reliable unique identifier for a site.
(And since the robots.txt cache should be very small, the
transient nature of the IP Address should be negligible.)
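For what it's worth, the many-to-many mapping Greg describes is
easy to see with a pair of lookups.  Here's a rough sketch (Python
purely for illustration; the hostname is made up):

    import socket

    def forward_lookup(hostname):
        # One hostname may resolve to several IP addresses.
        name, aliases, addresses = socket.gethostbyname_ex(hostname)
        return addresses

    def reverse_lookup(address):
        # One IP address may map back to a different canonical name
        # (plus aliases), or to no name at all.
        try:
            name, aliases, addresses = socket.gethostbyaddr(address)
            return [name] + aliases
        except socket.herror:
            return []

    for addr in forward_lookup("www.example.com"):
        print(addr, "->", reverse_lookup(addr))

The forward and reverse results rarely line up one-to-one, which is
exactly why picking a single lock key is messy.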
However, [IMO] the 'site' locked on should correspond identically
to the site portion of the URL to be tested [for Disallow]
and subsequently requested [if not disallowed].
Thus, if requesting an URL of the form:
    http://aa.bb.cc.dd:port/...
the crawler should test
    http://aa.bb.cc.dd:port/robots.txt
and, for an URL of the form:
    http://site.NameOrAlias:port/...
it should test
    http://site.NameOrAlias:port/robots.txt
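A rough sketch of what I mean (again Python purely for illustration;
the helper names are mine):

    from urllib.parse import urlsplit, urlunsplit

    def site_key(url):
        # Lock/rate-limit on exactly the host:port that appears in the
        # URL, dotted-quad or name/alias alike -- no DNS normalization.
        return urlsplit(url).netloc

    def robots_url(url):
        # Test robots.txt against that very same site portion.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    print(site_key("http://10.1.2.3:8080/docs/index.html"))
    # 10.1.2.3:8080
    print(robots_url("http://site.NameOrAlias:8080/docs/index.html"))
    # http://site.NameOrAlias:8080/robots.txt

That way the crawler never has to guess whether two spellings of a
site are "the same" -- whatever appears in the URL is the unit it
locks on and the unit it fetches robots.txt for.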
- just my $0.02
Cheers & Good Luck!
******************************************************
Brent Boghosian | Email: BrentB@OpenText.com
|
Open Text Corp. | Phone: (519)888-7111 Ext.279
180 Columbia St. West | FAX: (519)888-0677
Waterloo, ON N2L 3L3 | http://www.opentext.com
******************************************************