Re: Inktomi & large scale spidering

andy@andy.net
Sun, 26 Jan 1997 00:52:09 +0000


> they claim they have the technology that uses multithreading and parallel
> processing that allows them to index 10M documents per day.
> If that is really so then they can kick everybody's butt in about a week (what
> are they waiting for ?). Don't know....

I assume that is a theoretical limit. The Hotbot/Inktomi system uses
clustered machines to distribute the load, so in theory you can
index a ton of documents, but bandwidth and intracluster
communication limits will become serious barriers.
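
As a rough sanity check on why bandwidth becomes the barrier, here is a
back-of-the-envelope sketch. The 10M documents/day figure is the one
quoted above; the average document size of 10,000 bytes is purely an
assumption for illustration, and the function name is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_requirements(docs_per_day, avg_doc_bytes):
    """Return (documents per second, megabits per second) for a sustained crawl."""
    docs_per_sec = docs_per_day / SECONDS_PER_DAY
    mbit_per_sec = docs_per_sec * avg_doc_bytes * 8 / 1_000_000
    return docs_per_sec, mbit_per_sec


# 10M documents/day comes from the quoted claim; 10,000 bytes per document
# is an assumed average, not a measured one.
docs_per_sec, mbit_per_sec = crawl_requirements(10_000_000, 10_000)
print(f"{docs_per_sec:.0f} docs/s, {mbit_per_sec:.1f} Mbit/s")
```

Under that assumption the crawl alone needs on the order of 9 Mbit/s
sustained around the clock, before counting any traffic between the
machines in the cluster.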

> All this seems to me like a very brute-force method instead of a well
> thought-of(English?), elegant method that doesn't just try to be very fast,
> but also very intelligent about what to index, what to ignore, and so on.

Any better ideas?

> I have a feeling that if somebody could persuade all Webmasters out there to
> publish and make available information about their site(s) for robots, that
> person could have a superior spider in no time (not that others couldn't do
> the same, but they would not be the first one to use the new approach).

I think we can all agree the ALIWEB model was not the greatest idea.
If webmasters have to create the index manually, it will not gain
widespread support. If somebody were to write a nice clean Java
app that built the index and took virtually no time to set up,
perhaps you could get some support. However, there were a number
of great Perl scripts for building ALIWEB site indices, and very few
sites used them. Would an applet cut the configuration and hassle
enough that people might actually use it? Nobody knows. I
guess there is only one way to find out, but I'm not going to rush
out and write it.
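
For a sense of how little such an index builder has to do, here is a
minimal sketch. It is not one of the existing scripts, and the
Title/URI record layout is an illustrative assumption rather than a
verified copy of the format ALIWEB expects; the function name is
hypothetical too.

```python
import os
import re

# Grab whatever sits between <title> and </title>, ignoring case.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def build_site_index(doc_root):
    """Walk doc_root and return one text record per HTML file found."""
    records = []
    for dirpath, dirnames, filenames in os.walk(doc_root):
        dirnames.sort()  # keep the traversal order deterministic
        for name in sorted(filenames):
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                match = TITLE_RE.search(fh.read())
            # Fall back to the file name when a page has no <title>.
            title = " ".join(match.group(1).split()) if match else name
            rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
            # Assumed record layout: a Title line and a site-relative URI line.
            records.append(f"Title: {title}\nURI: /{rel}\n")
    return "\n".join(records)
```

Pointing it at a document root, e.g. build_site_index("htdocs") with a
directory name of your own, returns the whole index as one string. The
code is the easy part; getting webmasters to install and rerun it is
the real obstacle.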


> Oh, one more thing.
> Hotbot - where does their robot come from(hostname) ?
> Is it inktomi.com ?

Hotbot uses Inktomi's technology, so probably yes.

> Also, atext.com is Excite, right ? Why atext.com ? Why not excite.com ?

The Excite gang started out as the company Architext.

--
Andy Warner
andy@andy.net
http://www.andy.net/
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html