It's not only robots we have to worry about ...

Captain Napalm (spc@armigeron.com)
Thu, 26 Dec 1996 03:22:27 -0500 (EST)


With some of the recent talk about broken robots, I thought I'd share
some data I've collected that shows that robots aren't the only things
that are broken.

In trying to track down problems and bugs, I've enabled extensive logging
in the meta-search engine at Cyber411. Frankly, I'm amazed at what I'm
getting.

The (official) front end (I have found others through this logging) is
located at 'http://www.cyber411.com/search/' and the output comes from
'http://www.cyber411.com/search/nph-bsframe.cgi' [1]. Here's some select
output from the latest version (with comments added): [2]

Dec 20 11:36:32 5Q:silly 1.0.6C[10891]: badrequest - [HEAD]
Dec 20 11:36:32 5Q:silly 1.0.6C[10891]: badrequest - from 38.11.233.106
Dec 20 11:36:32 5Q:silly 1.0.6C[10891]: badrequest - refered from (unknown)
Dec 20 11:36:32 5Q:silly 1.0.6C[10891]: badrequest - user-agent Mozilla/3.0 (Win 95; U)
                -------- ------ ----- ----------
                |        |      |       |
                |        |      |       +----- error message logged
                |        |      +------------- process ID
                |        +-------------------- version of nph-bsframe.cgi
                +----------------------------- machine name

It never occurred to me to check for a HEAD request in a CGI
program. The main problem is that the size of the resulting
document changes depending upon the search criteria. I suppose
I should check up on what to return for a HEAD request.

But I'm surprised that Netscape sends this out. Mostly, I get:

Dec 20 23:39:32 5Q:silly 1.0.6C[17426]: badrequest - [HEAD]
Dec 20 23:39:32 5Q:silly 1.0.6C[17426]: badrequest - from ax1.healey.com.au
Dec 20 23:39:32 5Q:silly 1.0.6C[17426]: badrequest - refered from (unknown)
Dec 20 23:39:32 5Q:silly 1.0.6C[17426]: badrequest - user-agent Mozilla/3.0 (Win 95; I) via Squid Cache version 1.0.12

This poor fellow attempted to use the search engine for about five
minutes straight (maybe two dozen attempts). Most of the HEAD
requests seem to be coming from proxy/cache servers. At first
I thought it was naive (read: stupid) robots, so I added a robots.txt
file to the site, but I'm still getting HEAD requests. Ah well ...
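
For what it's worth, here's a minimal sketch (in C, and not the actual
nph-bsframe.cgi source) of one way a CGI could answer a HEAD request:
check REQUEST_METHOD and emit headers only, skipping the search
entirely. Since this is an nph- style script, the status line comes from
the program itself, and Content-Length is simply left out because the
body size depends on the query.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *method = getenv("REQUEST_METHOD");

    if (method != NULL && strcmp(method, "HEAD") == 0)
    {
        /* Headers only; no body follows a response to HEAD.  No
         * Content-Length, since the body size depends on the query. */
        printf("HTTP/1.0 200 OK\r\n");
        printf("Content-Type: text/html\r\n\r\n");
        return 0;
    }

    /* ... normal GET/POST handling (run the search, emit results) ... */
    printf("HTTP/1.0 200 OK\r\n");
    printf("Content-Type: text/html\r\n\r\n");
    printf("<html><body>search results would go here</body></html>\n");
    return 0;
}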

Dec 21 07:40:12 5Q:silly 1.0.6C[19751]: contenttype(003) - from leo.xnet.com
Dec 21 07:40:12 5Q:silly 1.0.6C[19751]: contenttype(003) - refered from http://www.cyber411.com/search/
Dec 21 07:40:12 5Q:silly 1.0.6C[19751]: contenttype(003) - user-agent: Mozilla/2.0 (compatible; MSIE 3.0b1; Mac_PowerPC)
Dec 21 07:40:12 5Q:silly 1.0.6C[19751]: contenttype(003) - data: [text/html]

Seems like Microsoft doesn't have its act together. I was under the
(I suppose) mistaken impression that POSTs required a content-type
of 'application/x-www-form-urlencoded'. Actually, I take that
back. There is a second method ('multipart/form-data'), but it
requires actually parsing MIME-based data. Ick. Most of the
contenttype errors come from MSIE. Figures.
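
If I end up tolerating this, the workaround would look something like the
sketch below (C, not the real nph-bsframe.cgi code, and the helper name
is made up): log the bogus Content-Type, but try the urlencoded parse on
the POST body anyway.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <syslog.h>

static char *read_post_body(void)
{
    const char *ct = getenv("CONTENT_TYPE");
    const char *cl = getenv("CONTENT_LENGTH");
    long        len;
    char       *body;

    if (cl == NULL || (len = atol(cl)) <= 0)
        return NULL;

    if (ct == NULL
        || strncmp(ct, "application/x-www-form-urlencoded",
                   strlen("application/x-www-form-urlencoded")) != 0)
    {
        /* Same information the current logging records. */
        syslog(LOG_WARNING, "contenttype - data: [%s]", ct ? ct : "(none)");
    }

    body = malloc((size_t)len + 1);
    if (body == NULL)
        return NULL;

    if (fread(body, 1, (size_t)len, stdin) != (size_t)len)
    {
        free(body);
        return NULL;
    }
    body[len] = '\0';

    return body;   /* caller still parses this as name=value&name=value */
}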

Dec 23 07:03:33 5Q:silly 1.0.6C[10706]: badquerytype - [Hyper ]
Dec 23 07:03:33 5Q:silly 1.0.6C[10706]: badquerytype - from villella.rnd.aetc.af.mil
Dec 23 07:03:33 5Q:silly 1.0.6C[10706]: badquerytype - refered from http://www.cyber411.com/search/
Dec 23 07:03:33 5Q:silly 1.0.6C[10706]: badquerytype - user-agent Mozilla/1.2 (compatible; PCN-The PointCast Network 1.2/win16/1)

Another strange browser. Most browsers seem to truncate trailing
white space (my program is expecting 'Hyper', not 'Hyper ')
but not this one. This is fairly easy to fix on my side, though.
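
Something along these lines (a sketch, not the exact code), applied to
the decoded form value before comparing it against the known query types:

#include <ctype.h>
#include <string.h>

/* Trim trailing blanks/tabs/newlines from a decoded form value, so
 * that "Hyper " compares equal to "Hyper". */
static void trim_trailing_space(char *s)
{
    size_t len = strlen(s);

    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';
}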

Sigh. I wonder how many CGI programmers are aware of these types of
problems?

-spc (I won't even mention the non-official front ends to the
search engine that aren't even correctly set up ... )

[1] The name has almost no relation to the project anymore, except
for the 'nph-' and '.cgi' parts. Technically, it stands for:

Non-Parse-Headers-BrainStormFrameVersion.CommonGatewayInterface

Only it's no longer called Brainstorm (that was the original name)
nor is there a frames version anymore. Ahh, the price of
progress 8-)

[2] Logging is done via syslogd. The name of the machine is silly [3]
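
Roughly along the lines of the sketch below; the ident string and
facility here are guesses on my part, not necessarily the real setup.

#include <syslog.h>

/* Rough sketch of how lines like the ones above could be produced
 * with the standard syslog(3) calls. */
void log_bad_request(const char *from, const char *referer, const char *agent)
{
    openlog("1.0.6C", LOG_PID, LOG_LOCAL0);
    syslog(LOG_WARNING, "badrequest - [HEAD]");
    syslog(LOG_WARNING, "badrequest - from %s", from);
    syslog(LOG_WARNING, "badrequest - refered from %s", referer);
    syslog(LOG_WARNING, "badrequest - user-agent %s", agent);
    closelog();
}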

[3] silly being an SGI.

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html