Re: Info on large scale spidering?

Otis Gospodnetic (otisg@panther.middlebury.edu)
Thu, 23 Jan 1997 15:01:23 -0500 (EST)


Sorry if I'm beating a dead horse, but here are a few more thoughts...

> I'm a grad student looking at research topics. One of the things I am
> interested in is how we will search the web in the future, given its
> enormous growth. One of the ideas I have been considering is that of
> "Distributed Information Retrieval", where instead of a large
> centralised index you have lots of smaller ones distributed around the
> net.
>
> If I am going to talk about a new way of indexing the web, I need to
> understand the old way though (even though there isn't much info
> available).

Even now those general search engines are scary, returning tons of entries
that are often inaccurate, often out of date, etc.

The way robots work now is: given a starting URL, find all links in it, index
the page if it has changed since the last visit, then go visit all the links
found in that page and do the same with them.
The bad thing about this approach, in my opinion, is that robots are blindly
issuing those conditional GET/HEAD requests, and I assume most pages end up
being unchanged since the last visit.
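
Purely for illustration, here is a minimal sketch of that conditional-GET
cycle in (modern) Python; the function name and the way the last-modified
date is stored are my own inventions, not anything the engines actually use:

    import urllib.error
    import urllib.request

    def fetch_if_modified(url, last_modified=None):
        # Ask the server to send the page only if it has changed since
        # our last visit; last_modified is the HTTP date recorded then.
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read(), resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as e:
            if e.code == 304:   # 304 Not Modified: the common, wasted case
                return None, last_modified
            raise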
A better way would be if servers either notified search engines about pages
that have been added, modified, or deleted, or if they (the servers) kept info
about those pages in some standard place, so that a robot could visit a server
once (the first time), get the URLs of all pages that have been
added/deleted/modified, and index them.
I don't know; it sounds simple, but it could have some pitfalls, and maybe
that's why it's not implemented.
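
The robot's side of that "standard place" idea might look something like this
sketch; the siteindex.txt name, the "since" parameter, and the one-letter
A/M/D codes are all hypothetical, just to make the idea concrete:

    import urllib.request

    # Hypothetical manifest, one line per page changed since `since`:
    #   A http://host/new-page.html        (added)
    #   M http://host/changed-page.html    (modified)
    #   D http://host/old-page.html        (deleted)
    def read_change_list(host, since):
        url = "http://%s/siteindex.txt?since=%s" % (host, since)
        with urllib.request.urlopen(url) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                if not line.strip():
                    continue            # skip blank lines
                op, page_url = line.split(None, 1)
                yield op, page_url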
In any case, distribution of data and work is, in my opinion, the only way to
go.
Maybe that's because I studied Distributed Database Systems for my Thesis ;)

> When people mention politeness do they mean the amount of time between
> individual hits on a site? If so then is the answer to do a lot of site
> parallelism? i.e. if your spiders are considered rude to hit a site 100
> times over a time period, then why not hit 100 sites once each and
> repeat.
>
> Or is politeness the number of hits on a site in a day or in a week. If
> it is the number of hits per week that determines your politeness, then
> an index which traverses your whole site every week can't possibly be
> considered polite! (if your site is big enough that is)
>
> Still, if you decide on some maximum "polite" spidering hit rate, then
> you may hit some sites weekly and other sites monthly. What doesn't
> make sense is saying, "therefore we will rebuild our index every
> month". If you can reflect new developments on www.netscape.com a week
> later rather than a month later then why wouldn't you want to?

The distributed approach I described roughly above would, I think, avoid this
problem... no?
I think siteindex.txt or something like that has been discussed here before;
look at the list archives.
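
For what it's worth, the "hit 100 sites once each" idea from the quoted text
is easy to sketch; assuming a flat list of pending URLs, a crawler could
interleave them by host like this (illustrative Python, not anyone's actual
scheduler):

    import collections

    def interleave_by_host(urls):
        # Instead of hitting one site 100 times in a row, hit many sites
        # once each and repeat: group URLs by host, then yield them
        # round-robin, one URL per host per pass.
        queues = collections.defaultdict(collections.deque)
        for u in urls:
            queues[u.split("/")[2]].append(u)   # crude host extraction
        while queues:
            for host in list(queues):
                yield queues[host].popleft()
                if not queues[host]:
                    del queues[host]

A real robot would still pause between passes, but the interleaving alone
spreads the load across sites instead of hammering one.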

> > Crawling is the easy part. Then there is the indexing of the data to
> > make it searchable. This takes time...a lot of time...and disk space.
>
> Yeah, you need the hardware to crunch data as fast as it comes in. I
> guess it helps if you manufacture high end machines (i.e. DEC). Even
> more expensive otherwise.

I work with some people from DEC who worked on the AltaVista project before.
Just to pay the AltaVista team, DEC spent $100M,
not to mention disk space, processor power, the T?? line, etc.
No wonder they started putting advertisements on AltaVista pages; until now
they were losing money at the rate of a couple of million dollars per
quarter, he he.

> > Another reason is the corporate aspect. These search engines offer a
> > free service to the public. If this is not the company's primary
> > business, then keeping the service in top shape is not a primary
> > objective of the company.
>
> This is a very convincing reason! They could spider faster if they
> wanted but they just don't care. A lot of these problems could be
> solved with clever programming and lots of hardware bucks, but the big
> boys don't care enough about index updates to put in the necessary cash.
>
> And why should they? It's a free service and even if advertising was
> covering costs, why would they be willing to put up extra $$$ just to
> run their spider n times faster?

Uh, I don't agree with these statements.
Only DEC was offering this service as a thing on the side.
Other companies (Infoseek, Lycos, Excite, etc.) are built on top of their
search engines: no search engine, no company.
They have to improve their spiders and search engines every day if they want
to keep up and keep making money.
Your spider is too slow? You're out of business, because other search engines,
your competition, will kill you.

> The reason you keep the actual pages on disk is so that when indexing,
> if a page hasn't changed since you last hit it you can read it straight
> off disk. Right? So spiders are always sending conditional GETs, and if
> there's no change then going to their local copy of the page.

Note again that the distributed model would not require any conditional GETs:
you request a page only if it is new or has been modified, and if it has been
deleted you remove it from the database.
Everything else means the page is the same as it was the last time you visited
it.
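
Continuing the earlier siteindex.txt sketch (same hypothetical A/M/D codes),
the indexing loop then reduces to a simple dispatch, with no conditional
requests at all:

    def apply_changes(index, changes, fetch):
        # changes yields (op, url) pairs, e.g. from read_change_list();
        # fetch() retrieves a page unconditionally. Anything not listed
        # is assumed unchanged, so no conditional GET/HEAD is needed.
        for op, url in changes:
            if op in ("A", "M"):      # added or modified: (re)index it
                index[url] = fetch(url)
            elif op == "D":           # deleted: drop it from the index
                index.pop(url, None)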

I could be totally off on this one, but this is just brainstorming...

Otis
==========================================================================
POPULUS People Locator - The Intelligent White Pages - http://POPULUS.net/
==========================================================================

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html