Re: Proposed URLs that robots should search

Nick Arnett (narnett@Verity.COM)
Mon, 23 Oct 1995 15:26:16 -0700


At 1:51 PM 10/23/95, Andrew Daviel wrote:
>With my other hat on (admin@vancouver-webpages.com), I'm
>trying to build a database of URLs and other information for businesses
>on the Net.

I can't quite contain the urge to say, "Isn't everyone?"

>Some database registration robots (I believe) search submitted URLs for
>keywords, doing some natural language processing to discard modifiers and
>prepositions. However, the trend to graphics-dominated homepages makes
>such efforts of dubious utility.

I wouldn't be so quick to jump to that conclusion. I have seen few, if
any, business sites that don't offer text-only versions of their key pages.
Also, I'm utterly certain that a good relevancy-ranking engine will do a
better job at assigning categories than will an uncontrolled set of people,
especially when those people are out to maximize hits, rather than to
maximize relevancy.

Having said all of that, I'd like to agree that we need some additional
information for robots. Could we start simply by having a standard way to
set forth the name of the site? An icon for the site would be really nice.
It's very frustrating to build a search results list and have no
definitive way of describing the site on which the documents reside! Next,
I'd like to have the means to name groups of documents (press releases and
product descriptions, as examples of typical business groupings). We guess
at these from directory names, but that's very haphazard. The secondary
naming problem is more difficult because there are many-to-many
relationships involved.
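
To make the first request concrete: the site name (and icon) could be as
simple as a small key/value file at a well-known path. Here's a rough
sketch, in Python, of what a robot might do with such a file -- the path
/sitename.txt and the Name:/Icon: fields are purely hypothetical, just to
illustrate how little machinery it would take:

    # Illustrative only: /sitename.txt and its Name:/Icon: fields are
    # hypothetical, not part of any existing convention.
    from urllib.request import urlopen
    from urllib.parse import urljoin

    def fetch_site_name(base_url):
        """Fetch a hypothetical /sitename.txt and return its Name/Icon fields."""
        info = {}
        with urlopen(urljoin(base_url, "/sitename.txt")) as resp:
            for line in resp.read().decode("latin-1", errors="replace").splitlines():
                if ":" in line:
                    key, value = line.split(":", 1)
                    info[key.strip().lower()] = value.strip()
        return info.get("name"), info.get("icon")

    # e.g. name, icon = fetch_site_name("http://www.example.com")

A robot building a results list could then label every hit with the site's
declared name instead of guessing from the hostname.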

>In the spirit of /robots.txt, I would like to propose a set of files that
>robots would be encouraged to visit:
>
>/robots.htm - an HTML list of links that robots are encouraged to traverse

What does "encouraged" mean? How is it different from (not (robots.txt))?
Why HTML?
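
Part of my hesitation about HTML: a robot that already reads the plain-text
/robots.txt would now need an HTML parser just to get a list of URLs.
Roughly, in Python, and assuming the proposed /robots.htm is an ordinary
page of <a href> links (an assumption, since the proposal doesn't say):

    # Sketch of what a robot would do with the proposed /robots.htm,
    # assuming it's a plain HTML page whose <a href> links are the URLs
    # the site "encourages" robots to traverse.
    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def encouraged_urls(base_url):
        with urlopen(urljoin(base_url, "/robots.htm")) as resp:
            parser = LinkCollector()
            parser.feed(resp.read().decode("latin-1", errors="replace"))
        return [urljoin(base_url, href) for href in parser.links]

A plain list of URLs, one per line, would do the same job with none of the
parsing.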

>/descript.txt - a text file describing what the site (or directory) is
> all about

Agreed.

>/keywords.txt - a text file with comma-delimited keywords relevant to the
> site (or directory)

Disagree strongly. This opens a giant can of worms. Keywords are never
enough on their own; they're often confusing and difficult to maintain.

>/linecard.txt - for commercial sites, a text file with comma-delimited
> line items (brands) manufactured or stocked

This will drown in details.

>/sitedata.txt - a text file similar to the InterNIC submissions forms,
> with publicly-available site data such as
>
>Organization: organisation name
>Type: commercial/non-profit/educational etc.
>Admin: email of administration
>Webmaster: email of Web administration
>Postal: postal address
>ZIP: ZIP/postcode
>Country:
>Position: Lat/Long
>etc.

Yes to some of this, at least. But there's an assumption of a one-to-one
relationship between the server and these fields, and often there isn't
one; no scheme that fails to deal with that is going to succeed.

I'm ready to adapt one of my prototype robots to parse this data for our
engine, so here's one hand up for "Yes, I'll implement it." I'm just doing
research, but my research does end up in front of our engineers at some
point.
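
For what it's worth, /sitedata.txt is the easy part to prototype. A rough
sketch of the sort of parser I have in mind, assuming the colon-delimited
fields proposed above, and keeping repeated fields as lists so a server
that hosts several organizations isn't forced into a one-to-one mapping:

    # Rough sketch of a /sitedata.txt reader. Field names follow the
    # proposal above; repeated fields are kept as lists so the
    # one-server/many-organizations case isn't lost.
    from urllib.request import urlopen
    from urllib.parse import urljoin

    def read_sitedata(base_url):
        fields = {}
        with urlopen(urljoin(base_url, "/sitedata.txt")) as resp:
            for line in resp.read().decode("latin-1", errors="replace").splitlines():
                line = line.strip()
                if not line or ":" not in line:
                    continue
                key, value = line.split(":", 1)
                fields.setdefault(key.strip().lower(), []).append(value.strip())
        return fields

    # e.g. read_sitedata("http://www.vancouver-webpages.com")
    # -> {'organization': [...], 'type': [...], 'admin': [...], ...}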

By the way, today, Verity announced that NetManage and Purveyor have signed
up to use our search engine. They join Netscape, Quarterdeck and a few
others.

Nick

P.S. I've replied to the new list server address at webcrawler.com, rather
than the Nexor address.