after a long time lycos's spider visited our department's web site.
it seems to have swapped the contents of the
HTTP_USER_AGENT and HTTP_FROM headers.
the logs showed this for lycos's spider:
HTTP_USER_AGENT=spider@lycos.com
REMOTE_ADDR=207.77.91.185
REMOTE_HOST=207.77.91.185
HTTP_FROM=Lycos_Spider_(T-Rex)/1.0
while inktomi sets the headers properly:
HTTP_USER_AGENT=Slurp/2.0 (slurp@inktomi.com, http://www.inktomi.com/slurp.html)
REMOTE_ADDR=209.1.12.110
REMOTE_HOST=j13.inktomi.com
HTTP_FROM=slurp@inktomi.com
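a rough way to flag this kind of swap in a log is sketched below. it assumes
that From should hold a bare e-mail address and User-Agent should not; the
pattern and the function names (looks_swapped, normalise) are made up for
illustration and are not part of any server or log tool.

```python
import re

# Rough pattern for a bare e-mail address. This is an assumption made for
# the sketch, not a complete RFC 822 address parser.
EMAIL_RE = re.compile(r"^[^@\s()<>,]+@[^@\s()<>,]+\.[^@\s()<>,]+$")


def looks_swapped(user_agent, from_addr):
    """Return True when User-Agent is a bare e-mail address and From is not."""
    ua = user_agent.strip()
    frm = from_addr.strip()
    return bool(EMAIL_RE.match(ua)) and not EMAIL_RE.match(frm)


def normalise(user_agent, from_addr):
    """Return (user_agent, from_addr), swapped back if they look exchanged."""
    if looks_swapped(user_agent, from_addr):
        return from_addr, user_agent
    return user_agent, from_addr


if __name__ == "__main__":
    # Values copied from the two log samples above.
    print(looks_swapped("spider@lycos.com", "Lycos_Spider_(T-Rex)/1.0"))
    print(looks_swapped(
        "Slurp/2.0 (slurp@inktomi.com, http://www.inktomi.com/slurp.html)",
        "slurp@inktomi.com"))
```

empty headers, like the 1996 lycos entries, are not flagged, since there is
nothing to swap.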
my questions are:
a. is this an oversight by the lycos spider maintainers ?
b. is this something that should be reported to lycos?
c. does swapping the headers have any intended or unintended effects?
for instance: the web logs just show up wrong for lycos.
did lycos maintainers intend it that way?
anybody from lycos on the list?
d. why can't they register the hosts from which they run T-Rex
in the DNS?
last year the requests came from qe20.lycos.com, but they never
set the HTTP_FROM and HTTP_USER_AGENT fields.
sample from log:
[Wed Apr 10 20:29:30 1996]
HTTP_USER_AGENT=
REMOTE_ADDR=206.101.96.161
REMOTE_HOST=qe20.lycos.com
HTTP_FROM=
and now they swap the two fields.
e. why does it have to access robots.txt thrice in a minute? (seems unnecessary)
our web site is not that dynamic. i don't think any reasonable web site
is going to change its robots.txt every 20 seconds.
f. having accessed the robots.txt file, why not go on and access other pages too?
it did not fetch a single page.
it is technically possible that the robots.txt accesses come from one machine and
the crawl comes from another, but the logs did not show anything of that sort.
oh dear, what can the matter be?
oh dear, what can the matter be?
t-rex's so long at the fair.
-dt
dinesh
student,computer science and automation,
iisc, india
cc: dr.manohar,dr.vinay,mr.mani
access times of robots.txt:
[Wed Dec 25 01:34:45 1996]
[Wed Feb 12 08:53:13 1997]
[Wed Feb 12 08:53:54 1997]
[Wed Feb 12 08:54:36 1997]
[Wed Feb 12 09:12:06 1997]
[Wed Feb 12 09:12:30 1997]
[Wed Feb 12 09:13:14 1997]
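the gaps between the Feb 12 fetches can be worked out with a small script.
this is only a sketch: it assumes the timestamps are in exactly the bracketed
format shown above and that day and month names are parsed in an english
locale. the names STAMP_FORMAT, parse_stamp and gaps_in_seconds are made up
for illustration.

```python
from datetime import datetime

# Format of the bracketed timestamps in the log excerpt above (assumed fixed).
STAMP_FORMAT = "[%a %b %d %H:%M:%S %Y]"


def parse_stamp(line):
    """Parse one bracketed log timestamp into a datetime."""
    return datetime.strptime(line.strip(), STAMP_FORMAT)


def gaps_in_seconds(lines):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    times = [parse_stamp(line) for line in lines]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


# The six Feb 12 accesses, copied from the list above.
FEB_12 = [
    "[Wed Feb 12 08:53:13 1997]",
    "[Wed Feb 12 08:53:54 1997]",
    "[Wed Feb 12 08:54:36 1997]",
    "[Wed Feb 12 09:12:06 1997]",
    "[Wed Feb 12 09:12:30 1997]",
    "[Wed Feb 12 09:13:14 1997]",
]

if __name__ == "__main__":
    print(gaps_in_seconds(FEB_12))  # [41, 42, 1050, 24, 44]
```

so the fetches come in two bursts of three, with gaps of 24 to 44 seconds
inside each burst and about 17 minutes between the bursts.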
access patterns:
it did not access anything apart from the robots.txt file
not even the homepage.
this has to be explained.
--
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail to
robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html