Responsible behavior, Robots vs. humans, URL botany...

Skip Montanaro (skip@automatrix.com)
Wed, 10 Jan 1996 14:16:15 -0500


Robert Raisch writes:

... indexing Internet accessible resources, appears to extract data from
a host by connecting to EVERY tcp port on the machine.

I suspect it's a case of some fool deciding that the ends justify the means.
On the Lycos search page they proudly announce:

Lycos indexes 91% of the web!

Select that link and you get:

Lycos has indexed over 10.75 million pages throughout the world....

What could the Alta Vista folks do to top that? How about:

You have access to all 8 billion words found in over 16 million Web pages.

One way to get to stuff the Lycos folks couldn't find was to be a little
more rapacious (ooh, I like that word - makes me think of Jurassic Park...).

<digression>

Not to let the AV folks be the only ones getting jabbed, I'll take advantage
of the opportunity to jab Lycos a little. They have a small table on their
91% page:

Lycos       91%   10.75 Million
Open Text   12%    0.80 Million
Infoseek     6%    0.40 Million
Yahoo       <1%    0.05 Million

It is obviously a case of apples and oranges to compare Lycos with Yahoo (I
can't comment on the others, although I believe they use robots as well),
since Yahoo is a reasonably well-organized human-built index. I tend to be
able to find things in Yahoo. Lycos, for all the scoring, abstracts,
searching options, yadda, yadda, yadda, is still a robot-generated index
with all the problems for us mere humans that implies. We tend to like
things a bit more structured. I don't normally find poring over a robot's
search engine output all that fruitful. I still can't seem to write queries
to any of the search engines that provide all that great a "usefulness
quotient", even with a degree in Computer Science.

If most of what's out there is crap (for the sake of argument, let's just
pick a number out of thin air, say, 91%... :-), users of Lycos and the other
robot indexes are bound to need real big shovels. On the other hand,
presumably the Yahoo folks or the submitters of URLs to Yahoo at least sniff
the URLs before deciding whether to add them to the database. In addition,
Yahoo tends to index the trunks of URL trees (which I find more useful), not
every friggin' leaf and branch.

Hypothetical conversation between two botanists on a field trip:

    Ooh, Bob! Look at this oak leaf! It sure is a whole lot different
    than the one we found on that other tree! Let's remember where we
    found it! Put that other one back...

Has anyone considered adding an option to the various robot search engines
that would restrict the depth of URLs returned to a query or at least use
the number of components in a URL's path to help score the page?
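
A minimal sketch of what such an option might look like, assuming Python
and its standard urllib.parse module. The function names, the max_depth
cutoff and the penalty formula are all hypothetical, made up here for
illustration; none of the search engines mentioned is known to work this
way.

```python
from urllib.parse import urlsplit


def path_depth(url):
    """Count the non-empty components in a URL's path."""
    path = urlsplit(url).path
    return len([part for part in path.split("/") if part])


def restrict_depth(urls, max_depth):
    """Keep only URLs whose path has at most max_depth components."""
    return [u for u in urls if path_depth(u) <= max_depth]


def depth_adjusted_score(base_score, url, penalty=0.5):
    """Scale a relevance score down as the path gets deeper.

    The divisor 1 + penalty * depth is an arbitrary choice for this
    sketch; a trunk URL (depth 0) keeps its score unchanged.
    """
    return base_score / (1.0 + penalty * path_depth(url))
```

With this counting rule, http://www.calendar.com/concerts/ has a depth of
1 and a bare host URL has a depth of 0, so trunks sort ahead of leaves
when the base scores are equal.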

</digression>

Sorry for the digression. I'm done venting. Please return to work now.

Skip Montanaro | Looking for a place to promote your music venue, new CD
skip@calendar.com | or next concert tour? Place a focused banner ad in
(518)372-5583 | Musi-Cal! http://www.calendar.com/concerts/