Re: infoseeks robot is dumb

Otis Gospodnetic (otisg@panther.middlebury.edu)
Fri, 15 Nov 1996 22:12:02 -0500 (EST)


> > Should a well behaved robot do this:
> >
> > backdraft-bbn.infoseek.com - - [04/Nov/1996:01:27:35 -0500] "GET /robots.txt
> > HTTP/1.0" 404 207
> > backdraft-bbn.infoseek.com - - [04/Nov/1996:01:27:40 -0500] "GET
> > /list_archives/webph/0066.html HTTP/1.0" 200 2642
> > backdraft-bbn.infoseek.com - - [04/Nov/1996:01:28:04 -0500] "GET
> > /list_archives/webph/0069.html HTTP/1.0" 200 3181
> >
>
>
> First of all, note that your site doesn't have 'robots.txt' :-)

yup, but that shouldn't make a difference here.

> In my opinion....I don't think infoseek's robot is dumb.....I just looked
> at my access_log and saw the same from other search engines such as
> opentext, lycos, atext...etc....

hmm, I didn't:

crawl3.atext.com - - [04/Nov/1996:11:59:14 -0500] "GET /robots.txt HTTP/1.0"
407
crawl3.atext.com - - [04/Nov/1996:12:02:19 -0500] "GET /~sports/facilities/inde1
crawl3.atext.com - - [04/Nov/1996:12:07:13 -0500] "GET /~heikkone/search.html
H2
crawl3.atext.com - - [04/Nov/1996:12:09:31 -0500] "GET /~psych/ HTTP/1.0" 200
49
crawl3.atext.com - - [04/Nov/1996:12:27:20 -0500] "GET /~jinglis/travels/europe5
crawl3.atext.com - - [04/Nov/1996:12:27:55 -0500] "GET /~ru351/discussion.html
7
crawl3.atext.com - - [04/Nov/1996:12:30:28 -0500] "GET /~lien/ HTTP/1.0" 200
202
crawl3.atext.com - - [04/Nov/1996:12:32:10 -0500] "GET /~publish/catalog/studen9

demonet.opentext.com - - [05/Nov/1996:03:41:01 -0500] "GET /robots.txt
HTTP/1.07
demonet.opentext.com - - [05/Nov/1996:03:45:34 -0500] "GET
/~dickerso/research/7
demonet.opentext.com - - [14/Nov/1996:07:40:52 -0500] "GET /robots.txt
HTTP/1.07
demonet.opentext.com - - [15/Nov/1996:03:46:11 -0500] "GET /robots.txt
HTTP/1.07

this one obviously tried at very different times....so it's excused...

> if a document has a link to one document
> in your server then infoseek will have to try to get robots.txt. If there
> is another link from another document then I would say it is natural
> that it will try to get robots.txt again.

not if it remembers that it already tried to retrieve it. why make the same
mistake twice?
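
A rough sketch of what "remembering" could look like, in Python. This is only
an illustration, not how infoseek or any other robot actually works. The
RobotsCache class, the fetch callable (host -> (status, body)) and the one-day
expiry are all assumptions of mine; the expiry is there so that a retry days
later, like the opentext one above, is still allowed.

```python
import time
from urllib.parse import urlsplit


class RobotsCache:
    """Remember the outcome of each host's /robots.txt fetch, 404s included."""

    def __init__(self, fetch, ttl_seconds=24 * 3600, clock=time.time):
        # fetch is a hypothetical callable: host -> (status_code, body_text)
        self._fetch = fetch
        self._ttl = ttl_seconds      # assumed expiry; pick whatever suits you
        self._clock = clock
        self._entries = {}           # host -> (fetched_at, status, body)

    def lookup(self, url):
        """Return (status, body) for the robots.txt of the host serving url."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        entry = self._entries.get(host)
        # Ask the server only if we never asked, or the answer has expired.
        if entry is None or now - entry[0] >= self._ttl:
            status, body = self._fetch(host)
            entry = (now, status, body)
            self._entries[host] = entry
        return entry[1], entry[2]
```

With something like this, the robot calls lookup() before every page request,
and a 404 for /robots.txt is stored like any other answer, so two links into
the same server cost one /robots.txt request instead of two.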

> The question is: if a search engine or a robot doesn't find 'robots.txt'
> (return code from HTTP is 404), should it try to request 'robots.txt'
> again? Certainly not, but this might make the life of a robot writer
> harder!

not that hard :) and asking for it again and again is hard on the web server,
which is worse.

Otis

-- 
eZines Database		 - 	<URL:http://www.dominis.com/Zines/>
eBooks Dominis Bookstore -	<URL:http://www.booksite.com/cgi-bin/zines>
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html