Some of the best I've seen have a title of "404 - Not Found" and return
a 200. :)
Often these are dynamic documents that say something like "the document
you were looking for, http://www.bofh-r-us.com/page/, wasn't found." thus
giving them all unique index entries. Doh!
The only solutions I see available to me are...
1. For each of the 200,000 hosts in my database, try sending them a
   bogus URL, and if they don't return a 404, label them as "unindexable"
   and stick them in my "site stop list" file. You know, the one that
   contains xxx.lanl.gov. :)
2. Look for "not found" in the title of the returned document (this has
   some false positives, but seems to work and only misses a few).
3. Or, if I wanted to index those non-conforming sites, send out a couple
   of dummy probes that will elicit the bogus returns, and then see if I
   can find a "fingerprint" that will serve as an indicator of when I get
   one of those pages. I could even stick that in a database to avoid
   having to check it more than once.
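
For illustration only, here is a rough Python sketch of the three
approaches above. None of it comes from a real robot: the function
names, the random-path probe, and the "strip the echoed URL, then hash"
fingerprint are all assumptions of mine, and the fetcher is passed in
as a parameter so the logic can be tried without touching the network.

```python
import hashlib
import re
import uuid
from typing import Callable, Optional, Tuple

# Hypothetical fetcher type: takes a URL, returns (status_code, body_text).
Fetcher = Callable[[str], Tuple[int, str]]

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def bogus_url(host: str) -> str:
    """Build a URL that almost certainly does not exist on the host."""
    return f"http://{host}/{uuid.uuid4().hex}.html"


def is_unindexable(host: str, fetch: Fetcher) -> bool:
    """First approach: the host answers a bogus URL with anything but 404."""
    status, _ = fetch(bogus_url(host))
    return status != 404


def title_says_not_found(body: str) -> bool:
    """Second approach: "not found" appears in the document's <title>."""
    match = TITLE_RE.search(body)
    return bool(match) and "not found" in match.group(1).lower()


def fingerprint(body: str, url: str) -> str:
    """Hash the body with the echoed URL removed and whitespace collapsed.

    Removing the URL is an assumption: it is what makes otherwise
    identical dynamic error pages compare equal.
    """
    normalized = " ".join(body.replace(url, "").split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def learn_fingerprint(host: str, fetch: Fetcher, probes: int = 2) -> Optional[str]:
    """Third approach: return a fingerprint if every probe agrees, else None."""
    prints = set()
    for _ in range(probes):
        url = bogus_url(host)
        _, body = fetch(url)
        prints.add(fingerprint(body, url))
    return prints.pop() if len(prints) == 1 else None
```

A real fetcher could wrap urllib.request.urlopen; it would have to
catch HTTPError to see the 404 at all, and deal with timeouts and
redirects, which this sketch ignores. The fingerprint returned by
learn_fingerprint is the thing that could be stored per host and
compared against fingerprint(body, url) for each page fetched later.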
I'm inclined to use #1, and then publish a list of which sites were
"unindexable". Maybe I'll use #2 as a backup to catch those that slip
through.
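
Combining the two might look roughly like the following. Again this is
only a sketch under my own assumptions: should_index and its parameters
are made-up names, the stop list is taken to be a plain set of
hostnames, and treating every non-200 status as "skip" is a
simplification.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def should_index(host: str, status: int, body: str, stop_list: set) -> bool:
    """Stop list first, then the "not found" title check as a backup."""
    if host in stop_list:
        # Host was already labeled "unindexable" by the bogus-URL probe.
        return False
    if status != 200:
        # Simplification: only index plain successful responses.
        return False
    match = TITLE_RE.search(body)
    if match and "not found" in match.group(1).lower():
        # Catches error pages from hosts that slipped past the probe.
        return False
    return True
```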
-- Aaron Nabil nabil@teleport.com