Re: Identifying identical documents

Jaakko Hyvatti (Jaakko.Hyvatti@Elma.FI)
Thu, 4 Jul 1996 02:26:51 +0300 (EET DST)


(Please post with lines less than 80 columns wide.)

> Well I'm curious as to what checksum method people are using - some CRC
> variant, a simple sum modulo 2^n, etc. What have robot designers found
> sufficient in comparing two documents to determine if URL B is in fact
> simply a re-occurrence of URL A?

Why bother thinking about it? Just bundle together the things you get
while parsing the document (this is what I get):

32 bit CRC
Document length
Number of lines
Number of words
Number of unique words
Modified date
Expiration date
First n bytes of document title

...and use the bundle as the key for a hash table. A few bytes of
overkill do not hurt you :-) This is how I would do it. (I talked about
it here, but have not bothered to do it.)
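
A minimal sketch of such a bundled key, in Python. It is only an
illustration under stated assumptions, not code from any robot: the names
Fingerprint, fingerprint, seen and first_seen_url are made up here, words
are counted by splitting on whitespace, and n defaults to 16.

```python
import zlib
from typing import NamedTuple, Optional


class Fingerprint(NamedTuple):
    """Hypothetical bundle of cheap per-document statistics."""
    crc32: int
    length: int
    lines: int
    words: int
    unique_words: int
    modified: Optional[str]
    expires: Optional[str]
    title_prefix: bytes


def fingerprint(body: bytes, title: bytes = b"",
                modified: Optional[str] = None,
                expires: Optional[str] = None,
                n: int = 16) -> Fingerprint:
    # Assumption: a "word" is any whitespace-separated run of bytes.
    words = body.split()
    return Fingerprint(
        crc32=zlib.crc32(body) & 0xFFFFFFFF,  # 32-bit CRC of the body
        length=len(body),
        lines=len(body.splitlines()),
        words=len(words),
        unique_words=len(set(words)),
        modified=modified,
        expires=expires,
        title_prefix=title[:n],               # first n bytes of the title
    )


# The tuple is hashable, so it can be a dict (hash table) key directly.
seen = {}


def first_seen_url(url: str, fp: Fingerprint) -> str:
    """Return the URL first recorded for this fingerprint."""
    return seen.setdefault(fp, url)
```

If first_seen_url returns a URL other than the one passed in, the new
document is most likely a re-occurrence of the earlier one.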

I think this has to be done on a per-host basis: identical documents on
different servers are still worth indexing.
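
One way to get the per-host behaviour is to make the host name part of
the key. The sketch below assumes Python; seen_per_host and is_duplicate
are hypothetical names, and a CRC plus length stands in for whatever
document key is actually used.

```python
import zlib
from urllib.parse import urlsplit

# Maps (host, crc32, length) to the first URL seen with that content.
seen_per_host = {}


def is_duplicate(url: str, body: bytes) -> bool:
    """True if the same content was already seen on the same host."""
    host = urlsplit(url).hostname or ""
    key = (host, zlib.crc32(body) & 0xFFFFFFFF, len(body))
    if key in seen_per_host:
        return True
    seen_per_host[key] = url
    return False
```

With this key, the same bytes fetched from two different servers are
recorded as two separate documents, while a second URL on the same server
is flagged as a repeat.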

If you want to spare 10 lines of code, just use the CRC. This really is
not a big problem; I should not post this message. :-)