Re: Proposed URLs that robots should search

Martijn Koster (mak@beach.webcrawler.com)
Mon, 23 Oct 1995 16:31:17 -0700


In message <acb1c3b800021004c4d1@[192.187.143.12]>, Nick Arnett writes:

> Also, I'm utterly certain that a good relevancy-ranking engine will do a
> better job at assigning categories than will an uncontrolled set of people,
> especially when those people are out to maximize hits, rather than to
> maximize relevancy.

Yeah, isn't that fun... :-/ Maybe we should have a shared spammer
blacklist :-)

> [want the name of the site]
> [groups of documents]

> >In the spirit of /robots.txt, I would like to propose a set of files that
> >robots would be encouraged to visit:
> >
> >/robots.htm - an HTML list of links that robots are encouraged to traverse
>
> What does "encouraged" mean? How is it different from (not (robots.txt))?

Because a robot may not want to traverse the whole site, and would
prefer to get "sensible" pages.

> Why HTML?

Yeah, bad news.

> [/keywords]
> Disagree greatly. This opens a giant can of worms. Keywords are never
> enough, often confusing and difficult to maintain.

Hmmm... yes, but it's not necessarily worse than straight HTML text,
which is the alternative.

> >/linecard.txt - for commercial sites, a text file with comma-delimited
> > line items (brands) manufactured or stocked
>
> This will drown in details.

Yup.

> >/sitedata.txt - a text file similar to the InterNIC submissions forms,
> > with publicly-available site data such as
> >
> Yes to some of this at least. But there's an assumption that there's a
> one-to-one relationship between the server and these field data. Often
> there isn't, and no scheme that fails to deal with that is going to succeed.

Well, I hate to repeat myself, but ALIWEB's /site.idx will give you all of
the above (OK, not the icon, but you could add that). It doesn't seem
to scale too well to large sites that want to describe every single page
or resource on their server, but that's not the goal here...
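
(For anyone who hasn't seen one: a site.idx is just a plain-text file of
IAFA-style templates, one per resource, separated by blank lines. An entry
looks something like this -- the exact fields vary, and the values below are
purely invented:

    Template-Type: DOCUMENT
    Title:         Example Widget Co. product catalogue
    URI:           /catalogue/index.html
    Description:   Overview of the widgets we manufacture and stock.
    Keywords:      widgets, catalogue, manufacturing

so titles, descriptions, keywords and so on are all in there.)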

Note also that nobody is stopping you from pulling just the URLs out of a
site.idx and doing your standard robot summarising on those...
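
If you want a concrete picture, something along these lines would do it
(a rough sketch in Python, purely illustrative; the site URL and the
summarise() step are placeholders for whatever your robot already does):

    from urllib.parse import urljoin
    from urllib.request import urlopen

    def uris_from_site_idx(text):
        # A site.idx entry is "Field: value" lines; pull out every URI: field.
        for line in text.splitlines():
            if line.lower().startswith("uri:"):
                yield line.split(":", 1)[1].strip()

    def summarise(url):
        # Placeholder for the robot's usual fetch-and-summarise step.
        print("would fetch and summarise", url)

    base = "http://www.example.com/"      # hypothetical site
    with urlopen(base + "site.idx") as resp:
        idx = resp.read().decode("latin-1", errors="replace")

    for uri in uris_from_site_idx(idx):
        summarise(urljoin(base, uri))     # URIs may be relative to the site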

-- Martijn
__________
Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html