Duplicate docs (was avoiding infinite regress...)

Nick Arnett (narnett@Verity.COM)
Tue, 9 Jan 1996 07:34:35 -0800


>I'm surprised that no spider seems to use the page content to guess whether or
>not two document trees are equal. For example, one heuristic would be to keep
>a checksum for every visited page, and to decide that two subtrees are probably
>equal if their root nodes and children have identical checksums.
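
Here's a minimal sketch of that heuristic in Python, for concreteness.
The fetch and child_links callables are hypothetical stand-ins for
whatever plumbing the spider already has, and any stable digest would
do in place of MD5:

import hashlib

def page_checksum(content: bytes) -> str:
    # Hash the raw page body; any stable digest works.
    return hashlib.md5(content).hexdigest()

def subtree_signature(url, fetch, child_links):
    # Signature = the root page's checksum plus the sorted checksums
    # of its immediate children. Two subtrees with the same signature
    # are *probably* mirrors (or sym links) of each other.
    root = page_checksum(fetch(url))
    kids = sorted(page_checksum(fetch(u)) for u in child_links(url))
    return (root, tuple(kids))

seen = {}  # signature -> first URL crawled with that signature

def should_descend(url, fetch, child_links):
    # Skip any subtree whose signature we've already crawled.
    sig = subtree_signature(url, fetch, child_links)
    if sig in seen:
        return False
    seen[sig] = url
    return True

Note that the children get fetched once either way; the real saving is
in not descending below them.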

We've had requests for that behavior, not only due to sym links, but also
because there are many copies of the same document within an enterprise
network, and even more so when you're indexing large parts of the Internet.
(Imagine how many copies of FAQs are out there, for example.)

I think there are two main reasons it hasn't happened yet. One is just
that it hasn't risen high enough in the priority list, at least for those
of us who have commercial spider tools. For the most part, people are
still happy just to get a spider *working* in a convenient, maintainable
manner. Thus, most haven't even realized that sym links and duplicates are
an issue.

Second, the problem of duplicates is a slippery slope. It's probably not
hard to catch 80 or 90 percent of them, but catching the rest, which
aren't *exact* duplicates, will take something quite clever, since brute
force will probably be slow, at best.
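
For illustration only (this is purely my own sketch, not something any
spider does today): one way past exact checksums is to hash overlapping
runs of words and compare the overlap of those hash sets, so pages that
differ only in a date stamp or boilerplate still score as near
duplicates:

import hashlib
import re

def shingles(text: str, k: int = 8) -> set:
    # Normalize case and whitespace, then hash every run of k words.
    words = re.findall(r"\w+", text.lower())
    return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(len(words) - k + 1, 1))}

def similarity(a: str, b: str) -> float:
    # Set overlap of the shingle sets: 1.0 means identical content,
    # and values near 1.0 flag near duplicates despite small edits.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

A spider could treat anything above, say, 0.9 as a duplicate. The catch
is cost: comparing every pair of pages is quadratic, which is exactly
where the cleverness has to come in.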

Nick