I just started from a few well-known sources, like the NCSA archives, and the
Web is sufficiently connected to do the rest. I did not have the guts to give
it a single URL, but I bet that it would take quite a bit of work to find a URL
that would not connect to the whole Web. Think about how many pages mention
Yahoo, for example, and how quickly the search branches out after that.
And then, of course, I use any URL people contribute.
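
To illustrate the idea of starting from a few seeds and letting the links do
the rest, here is a minimal sketch of a breadth-first walk over a link graph.
It is not the Alta Vista robot: the function name, the in-memory graph, and the
example.* URLs are all made up for illustration, and a real robot would fetch
and parse pages instead of reading a dictionary.

```python
from collections import deque

def reachable(seeds, links):
    """Return every URL reachable from the seeds by following links.

    `links` maps a URL to the list of URLs it points to (a toy stand-in
    for fetching a page and extracting its anchors).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in seen:      # visit each page only once
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical toy graph; none of these URLs are real.
TOY_LINKS = {
    "seed.example/": ["hub.example/", "a.example/"],
    "a.example/": ["hub.example/"],
    "hub.example/": ["b.example/", "c.example/"],
    "b.example/": [],
    "island.example/": ["seed.example/"],  # links in, but nothing links to it
}

pages = reachable(["seed.example/"], TOY_LINKS)
```

In this toy graph one seed reaches everything except island.example/, which
nothing links to. Pages like that are the ones only a contributed URL would
bring in.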
-----------------
Since I'm here, and in the interest of saving bandwidth, I want to respond to
Skip, who was missing one important point and calling me a fool (;-)).
Alta Vista uses a fast robot. I ran this robot for a week and got 16M pages.
If I had run it for two weeks I would no doubt have 25-30M pages today. Once I
restart the robot the index will contain more pages, unless it finds a lot of
sites with a better /robots.txt, in which case it will delete the newly
excluded pages and I may report a smaller index for a while, which would be
fine with me.
Notice that I said a "better" robots.txt, because I would actually enjoy seeing
every webmaster put up a good file and save everyone the trouble of fetching,
indexing, and reading stuff that was never intended to be indexed. Every chance
I get to educate another person, especially a reporter, about the Robots Exclusion
Standard, I do it, because it's our only chance so far to improve the quality
of what ends up in Web indexes. And of course if webmasters used password
protection on ports that are not intended for public usage it would make life
somewhat easier: I have answered enough "you have violated my secret test site"
messages.
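
For anyone who has not seen one, here is a small sketch of what a /robots.txt
looks like and how a robot might honor it, using Python's standard
urllib.robotparser module. The file contents, the "ExampleBot" agent name, and
the www.example.com URLs are assumptions for illustration only; this is not how
the Alta Vista robot is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt: keep every robot out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/
Disallow: /cgi-bin/
"""

def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(agent, url)

public_ok = allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html")
test_ok = allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/test/secret.html")
```

With this file the public page may be fetched and the page under /test/ may
not. Keep in mind that the file is only a request to well-behaved robots;
anything that is really private still needs password protection.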
My point is that I don't want to maximize the number of pages I report at all
costs: I am interested in finding out how large the Web is, and giving everyone
access to its complete index. And while doing this I want to report facts,
not engage in a p...ing contest with some outfit that has reportedly indexed
91% of an absolutely unknown and moving figure. Alta Vista is a research
project with no place for this kind of creative arithmetic.
--Louis