Re: Inter-robot Communications - Part II

Martijn Koster (m.koster@webcrawler.com)
Fri, 29 Dec 1995 08:26:27 -0700


At 4:42 AM 12/29/95, David Eagles wrote:

> Well, I never expected to receive such a favourable response
> about a standard port/protocol for communication between robots.

Cool. Many robot authors among the respondents?

>The key features I have thought of so far are listed below,
> so you can comment on these also (i.e. tell me if I'm being too
> stupid/ambitious/etc.)
>
>1. Dedicated port approved as an Internet standard port number.
> (What does this require?)

Not sure, but there's no point until there is an RFC which specifies
what the protocol on that port is doing, so concentrate on that first.

>2. Protocol (similar to FTP I think) which allows remote agents
> to exchange URLs, perform searches and get the results in a standard
> format, database mirroring(?), etc.

Why on earth like FTP? FTP is reasonably complex and inefficient.
If we're talking about the web, use HTTP! We know how (and that) it
works, we have many implementations, and it's reasonably OK.
It's at least as efficient as FTP for this kind of thing,
and when HTTP/NG comes along you can just plug that in.

This allows you to concentrate on just the data format:
simply invent a new Media type, say text/foo.
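
Purely as a sketch (the field names below are made up, and "text/foo"
is just the placeholder above), a robot-to-robot exchange could then be
a plain HTTP POST:

  POST /robot-exchange HTTP/1.0
  Content-Type: text/foo
  Content-Length: 94

  URL: http://www.example.com/some/page.html
  Title: Some Example Page
  Keywords: example, robots

The receiving robot answers with an ordinary HTTP status code, and the
whole thing goes through existing servers, proxies and client libraries
unchanged.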

> The idea behind this is that if Robot A finds a URL handled by another
> remote Robot (such as by domain name, keywords(?), etc), then it can
> inform the remote robot of its existence.

This would be easily deployable if you use HTTP: POST a form or PUT a
file using a client library such as libwww-perl, and handle it in a CGI
script. Hey, we'll just link it to our submit form :-)
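
Just to show the shape of that (in Python rather than libwww-perl, and
with the host name and CGI path invented), the sending side is only a
few lines:

  import http.client, urllib.parse

  # Hypothetical submission endpoint on the remote robot's server.
  body = urllib.parse.urlencode({"url": "http://www.example.com/some/page.html"})
  headers = {"Content-Type": "application/x-www-form-urlencoded"}

  conn = http.client.HTTPConnection("robot.example.org")
  conn.request("POST", "/cgi-bin/submit-url", body, headers)
  response = conn.getresponse()
  print(response.status, response.reason)   # e.g. 200 OK if accepted

The CGI script on the other end just reads the form field and queues
the URL for its own robot to visit.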

This has actually been discussed before; we never got the time to make
it go anywhere...

> Similarly, if a user wants to search for something which happens to be
> handled by the remote server, a standard data format will be returned
> which can then be presented in any format.

Distributed interactive searching is more complicated than that though...
what do you do when there are three thousand of these servers around?
It is also complicated because these days you don't want all results;
there are too many of them. But massaging the results with whatever
selection and relevance feedback you use is very robot-specific, because
everyone uses different kinds of search engines. This sounds to me
like a problem for which there is no good and easy answer.

However, it'd be nice to come up with a way to efficiently hoover
URLs out of other robots; this could be done with a mechanism such
as you describe. What we can learn from Harvest is that these
issues can be separated into different processes, making it all
a bit more flexible and clear.
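
A pull-style version of that (again just a sketch; the /urls query and
its date parameter are invented) would let one robot periodically hoover
another's recent finds:

  import http.client

  # Hypothetical endpoint: ask a remote robot for URLs found since a date.
  conn = http.client.HTTPConnection("robot.example.org")
  conn.request("GET", "/urls?since=1995-12-01")
  response = conn.getresponse()
  for line in response.read().decode("latin-1").splitlines():
      print(line)   # say, one URL per line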

>3. A method of correlating Robots with specialties (what the robot is for).
> An approach similar to DNS may come in handy here -
> limited functionality could be obtained by using a "hosts" type file
> (called "robots" ?), while large scale, transparent functionality would
> probably require a centralised site which would maintain a list of all
> known robots and their specialties. Remote robots would download the
> list (or search parts of it) as required. This could probably be
> another protocol command on the port above.

The words "scalable" and "centralised site" don't mix :-)

Hmmm... "what the robot is for" is probably very difficult to
express. Meta information categorization is always a nightmare. What
classification scheme do you use?
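
For the limited "hosts"-type version, a flat "robots" file might be
enough to start with (everything in this sketch is invented: the names,
the port, and the specialty keywords):

  # robot-name    contact-host:port          specialties
  robot-a         robot-a.example.org:9999   physics, astronomy
  robot-b         robot-b.example.com:9999   australian-sites
  robot-c         robot-c.example.net:9999   general

Anything fancier than that runs straight into the classification
problem above.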

>4. A standard set of data, plus some way to extend it for
> implementation-specific users. I use the following fields in FunelWeb:
> URL
> Title (from <TITLE>)
> Headings (from <Hx>)
> Link Descriptions (from <A HREF="">...</A>)
> Keywords (from user entry)
> Body Text (from all other non-HTML text)
> Document Size (from Content-Length: server field)
> Last-Modified Date (from Last-Modified: server field)
> Time-To-Live (server dependent)

Hmm, this is where it gets tricky. URL, Title, and Keywords are obvious.

Content-Length and Last-Modified Date sound good, but do under-represent
the HTTP server response; what about Content-Language and other variants?

Headings, link descriptions, and body text: hmmm. Which headings, and
how are they ordered? Same for links. What is "body text" in the
presence of HTML tables etc.? What about losing info from in-line
images and HEAD elements? What about frames?
This is a slippery slope; why not simply send the entire document
content, compressed? It's just as efficient, and gives complete freedom.

Also check out the URC work; sounds like some potential overlap here.
(Damn, I'm starting to sound like Dan Connoly :-)

> This also highlights one MAJOR consideration - These fields are generally
> only useful to HTML robots. Something needs to be considered to handle
> any input format, including FTP, WAIS and GOPHER.

Even if you ignore that for now you'd be scoring...

I'll share a different idea I had about this stuff (or do I need
to patent it first these days?). If in the distributed gathering part
we start sending URLs, HTTP response headers, and complete content,
doesn't the term "caching proxy" spring to mind? I need to think more
about this, but it sounds to me like if we had an efficient way of
updating and pre-loading distributed caches using a between-cache
protocol, we'd be killing two birds with one stone: better cache hit
rates than the current 30%, and complete freedom to do whatever
you want with the content for robot purposes...
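
To make that concrete (a sketch only; the peer host, path, and the
X-Original-URL header are all invented), a between-cache push of one
document could be a single PUT carrying the original response headers
plus the compressed body:

  import gzip, http.client

  url = "http://www.example.com/some/page.html"
  original_headers = {"Content-Type": "text/html",
                      "Last-Modified": "Mon, 25 Dec 1995 10:00:00 GMT"}
  body = gzip.compress(b"<HTML>...the full document...</HTML>")

  headers = dict(original_headers)
  headers["Content-Encoding"] = "gzip"
  headers["X-Original-URL"] = url      # invented header, just for the sketch

  conn = http.client.HTTPConnection("cache-peer.example.org")
  conn.request("PUT", "/preload", body, headers)
  print(conn.getresponse().status)

The receiving cache stores it as if it had fetched the URL itself, and a
robot sitting next to that cache gets the full content for free.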

Happy New Year,

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html