Re: Washington again !!!

Erik Selberg (selberg@cs.washington.edu)
19 Nov 1996 17:20:03 -0800


You're right, Rob; I do feel differently about your apology after
having read this. But back to the matter at hand:

Here's the scoop, folks:

The MetaCrawler is a parallel web-search service that queries major
search services (such as Lycos and WebCrawler) and collates the
results. It also has a "verification" feature that has the MetaCrawler
download the pages returned by the search services and check them for
availability and quality. The MetaCrawler has not conformed to the
robots.txt standard, as we relied on the search services we accessed
to conform (which they all claim they do). There was also minimal
support to prevent it from rapid-firing a site; however, due in part
to low usage and well-distributed queries, there were no major
incidents until this past week, and not a single "cease and desist"
or similar request came to my attention in over a year of running the
service. This is not an excuse for omitting stronger protections, just
our rationale for why they were not highly prioritized.
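
(For the curious, the query-and-collate core amounts to something like
the Python sketch below. This is purely illustrative, not the
MetaCrawler source; the engine URLs, query parameters, and the trivial
collation step are all stand-ins.)

    import urllib.request
    from urllib.parse import quote_plus
    from concurrent.futures import ThreadPoolExecutor

    # Illustrative endpoints only; the real services take different
    # parameters and return pages that would need real parsing.
    ENGINES = {
        "lycos": "http://www.lycos.com/cgi-bin/pursuit?query={q}",
        "webcrawler": "http://www.webcrawler.com/cgi-bin/WebQuery?searchText={q}",
    }

    def query_engine(name, template, q):
        # Fetch one engine's result page; a failure just drops that engine.
        try:
            url = template.format(q=quote_plus(q))
            with urllib.request.urlopen(url, timeout=10) as resp:
                return name, resp.read()
        except OSError:
            return name, None

    def metasearch(q):
        # Hit every engine in parallel, then collate whatever came back.
        with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
            hits = pool.map(lambda kv: query_engine(kv[0], kv[1], q),
                            ENGINES.items())
        # The real service merges and de-duplicates ranked hit lists;
        # a dict keyed by engine stands in for that here.
        return {name: page for name, page in hits if page is not None}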

In May, the MetaCrawler split into two distinct entities: the UW
research version and a new commercial MetaCrawler run at NETbot. This
was done in large part because I was the only one both programming the
MetaCrawler code and maintaining the system. As many of you can
imagine, MetaCrawler code development slowed to a crawl as I spent all
of my time maintaining the service.

Since August, NETbot has been in charge of the production
MetaCrawler. The NETbot site (www.metacrawler.com and hosts maverick,
goose, hollywood, and wolfman) is in the final stages of separating
itself from the UW. This includes renaming maverick.cs.washington.edu
to maverick.netbot.com (and likewise the other hosts) and putting up
their fancy new UI and version.

I am NOT a part of NETbot. I worked there this summer setting up the
machines and doing other groundwork. The code has been almost
completely rewritten, NOT BY ME, and has recently been "blessed" as
the production version. Unfortunately, there appear to be some bugs
in the software that make MetaCrawler prone to rapid-firing sites
(e.g. LANL and IMDB). Therefore, the server
machines have been turned off while the NETbot developers figure out
what's going on.

There is no excuse for the recent behavior of MetaCrawler. As far as
I'm aware, the MetaCrawler will remain down until the verification
feature is disabled, and that feature will not be re-enabled until the
MetaCrawler both supports robots.txt and has strong protections
against rapid-firing.
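
(A minimal sketch of those two protections, again in Python and again
purely illustrative; the agent name, the 5-second delay, and the
helper names are my assumptions, not NETbot's actual fix.)

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    ROBOTS = {}        # host -> parsed robots.txt
    LAST_HIT = {}      # host -> time of our last request there
    MIN_DELAY = 5.0    # seconds between hits on one host (assumed policy)

    def allowed(url, agent="MetaCrawler"):
        # Consult (and cache) the target host's robots.txt before fetching.
        host = urlparse(url).netloc
        if host not in ROBOTS:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            try:
                rp.read()
            except OSError:
                pass  # unreachable robots.txt: can_fetch() answers False
            ROBOTS[host] = rp
        return ROBOTS[host].can_fetch(agent, url)

    def throttle(url):
        # Sleep long enough that the same host is never rapid-fired.
        host = urlparse(url).netloc
        wait = LAST_HIT.get(host, 0.0) + MIN_DELAY - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        LAST_HIT[host] = time.monotonic()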

For further information, I urge you to contact NETbot directly:

hosting@netbot.com "Hosting Admin"
webmaster@netbot.com "MC WebMaster"

Again, I am NOT responsible for the administration, day-to-day running,
or development of the current production MetaCrawler. If you have
problems with the UW research prototype (hosts vorlon, minbar, draz,
or zhadum.cs.washington.edu), then I am the guy; otherwise, the NETbot
folks are in control.

Regards,
-Erik

> The idiots at Washington are at it again.
>
> My logfiles for yesterday contain thousands of rejected requests for
> the worst piece of Net software in existence.
>
> "MetaCrawler"
>
> Despite months of problems and one lame excuse after another Erik Selberg
> continues to stick his head in the sand and his arse in everyone's face.
>
> MetaCrawler must die. It's a net nuisance.

-- 
				Erik Selberg
"I get by with a little help	selberg@cs.washington.edu
 from my friends."		http://www.cs.washington.edu/homes/selberg
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html