Re: Inktomi & large scale spidering

Erik Selberg (selberg@cs.washington.edu)
26 Jan 1997 12:28:58 -0800


Nick Arnett <narnett@Verity.COM> writes:

> Netscape (and a number of other) servers come with our search engine built
> in. The indexes could easily be in a publicly accessible location under a
> well-known name. The obstacle is that index formats are proprietary because
> they contain structures that give the engine a lot of its speed and
> accuracy. Thus, any attempt to have an open standard would have to allow
> for multiple engines... and with index disk space overhead of 30-60 percent
> (or more, with some older technology), webmasters are unlikely to be eager
> to support more than one.

Yup; and Microsoft's server has MS's indexer built in, which again only
works with MS stuff.

The FIND group has been working on a protocol by which base index
servers (e.g. any ol' web server) can ship their indexes up to index
servers (servers which contain pointers to base indexes). Actually,
the MS folks just gave a talk here at UW on Friday outlining what they
are working on WRT the FIND group's Common Indexing Protocol. The one
thing they didn't have an answer to is:
who's on the top? With the current scheme of things, if base servers
are shipping around indexes in a set format (be it Verity, MS, or
whatever) then all the "tops" are going to look a lot alike, and
there will be no advantage to coming up with cool IR technology,
because you're already stuck with this format. Another option is that
webmasters just ship a big tarball of raw text; it would seem to me
that would be the best option for all parties. But who knows if anyone
will implement it!
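
To make the base-server / index-server split concrete, here's a rough
sketch in Python. This is NOT the FIND group's protocol or anybody's
real index format; the class names (BaseServer, IndexServer), the
methods, and the "summary is just a set of terms" idea are all made up
for illustration. The only point is that the upper server holds
pointers to base servers rather than the documents themselves.

```python
from collections import defaultdict


class BaseServer:
    """Hypothetical base index server: a web server that indexes its own pages."""

    def __init__(self, host, pages):
        self.host = host
        self.pages = pages  # maps URL path -> raw page text

    def summary(self):
        """Return the set of lowercased terms found on this server."""
        terms = set()
        for text in self.pages.values():
            terms.update(text.lower().split())
        return terms

    def search(self, term):
        """Return the sorted paths of local pages containing the term."""
        term = term.lower()
        return sorted(
            path for path, text in self.pages.items()
            if term in text.lower().split()
        )


class IndexServer:
    """Hypothetical upper-level server: stores pointers to base servers only."""

    def __init__(self):
        self.pointers = defaultdict(set)  # term -> set of base server hosts

    def receive(self, base):
        """Accept a summary shipped up from a base server."""
        for term in base.summary():
            self.pointers[term].add(base.host)

    def route(self, term):
        """Return the sorted hosts whose summaries mention the term."""
        return sorted(self.pointers.get(term.lower(), set()))


a = BaseServer("a.example", {"/ir.html": "cool IR technology",
                             "/cats.html": "pictures of cats"})
b = BaseServer("b.example", {"/spiders.html": "large scale spidering technology"})

top = IndexServer()
top.receive(a)
top.receive(b)

print(top.route("technology"))  # hosts to forward the query to
print(a.search("technology"))   # the base server answers with actual pages
```

Note that once every base server ships the same kind of summary, every
"top" built this way ends up holding the same pointers, which is
exactly the worry above.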

-Erik

-- 
				Erik Selberg
"I get by with a little help	selberg@cs.washington.edu
 from my friends."		http://www.cs.washington.edu/homes/selberg
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html