servers that don't return a 404 for "not found"

Aaron Nabil (nabil@teleport.com)
Mon, 14 Oct 1996 03:46:06 -0700 (PDT)


There seems to be a proliferation of "clever" web administrators who
replace their system 404 not found page with something that doesn't
actually return a 404 code. Try an AltaVista search for title:"not found"
and then go look at some of those pages.

Some of the best I've seen have a title of "404 - Not Found" and return
a 200. :)

Often these are dynamic documents that say something like "the document
you were looking for, http://www.bofh-r-us.com/page/, wasn't found", so
every bogus URL ends up with its own unique index entry. Doh!

The only solutions I see available to me are...

1. For each of the 200,000 hosts in my database, try sending them a
   bogus URL, and if they don't return a 404, label them as "unindexable"
   and stick them in my "site stop list" file. You know, the one that
   contains xxx.lanl.gov. :)

2. Look for "not found" in the title of the returned document (this has
   some false positives, but seems to work and only misses a few).

3. Or, if I wanted to index those non-conforming sites, send out a couple
   of dummy probes that will elicit the bogus returns, and then see if I
   can find a "fingerprint" that will serve as an indicator of when I get
   one of those pages. I could even stick that in a database to avoid
   having to check it more than once.
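
Here is a rough Python sketch of the bogus-URL probe and the fingerprint
idea. Every name in it (bogus_url, fingerprint, probe_host) is made up for
illustration, and fetch is assumed to be whatever your robot already uses
to request a URL, returning a (status, body) pair. It also assumes the
error page echoes the requested URL verbatim, which will not always hold.

```python
import hashlib
import uuid
from urllib.parse import urljoin


def bogus_url(base):
    """Build a URL on the same host that almost certainly does not exist."""
    return urljoin(base, "/no-such-page-" + uuid.uuid4().hex + ".html")


def fingerprint(body, requested_url):
    """Hash the body with the echoed URL removed and whitespace collapsed,
    so error pages that quote different URLs still compare equal."""
    stripped = body.replace(requested_url, "")
    normalized = " ".join(stripped.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def probe_host(base, fetch, probes=2):
    """Send a few bogus URLs to one host.

    Returns (conforming, fp). conforming is True if a probe got a real 404.
    fp is the shared fingerprint of the fake "not found" pages, or None if
    the host conforms or the probes did not agree on one fingerprint.
    """
    prints = set()
    for _ in range(probes):
        url = bogus_url(base)
        status, body = fetch(url)
        if status == 404:
            return True, None
        prints.add(fingerprint(body, url))
    fp = prints.pop() if len(prints) == 1 else None
    return False, fp
```

A host that comes back as (False, None) is a stop-list candidate. One that
comes back with a fingerprint could be stored in the database and compared
against fingerprint(body, url) for each page fetched from it later.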

I'm inclined to use #1, and then publish a list of which sites were
"unindexable". Maybe I'll use #2 as a backup to catch those that slip
through.
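
Combining the two at crawl time might look something like the following
sketch. The function names and the stop_list set of hostnames are
assumptions for illustration, and the title regex is a crude stand-in for
a real HTML parser.

```python
import re
from urllib.parse import urlsplit

# Crude title extraction; good enough for a heuristic, not a real parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_says_not_found(body):
    """Backup check: does the document title contain "not found"?"""
    match = TITLE_RE.search(body)
    return bool(match and "not found" in match.group(1).lower())


def should_index(url, status, body, stop_list):
    """Decide whether a fetched page belongs in the index."""
    host = urlsplit(url).hostname
    if host in stop_list:
        # Host failed the bogus-URL probe earlier; skip it entirely.
        return False
    if status != 200:
        return False
    # Catch fake "not found" pages on hosts that slipped past the probe.
    return not title_says_not_found(body)
```

The title check will also reject a few legitimate pages that happen to
have "not found" in their title, which is the false-positive cost already
mentioned above.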

-- 
Aaron Nabil
nabil@teleport.com