Re: Extracting info from SIG forum archives

Denis McKeon (dmckeon@swcp.com)
Sun, 8 Sep 1996 06:16:36 -0600


In <842155310.10282.0@genps.demon.co.uk>,
Peter Small <peter@genps.demon.co.uk> wrote:
>What techniques, strategies, programs etc would anybody suggest for
>extracting specific information from a list serve discussion list archive?
>
>For example, DIRECT-L (the discussion list for users of Macromedia's
>multimedia authoring package DIRECTOR). Posts mount up at the rate of 80 -
>100 per day so we're talking about a current archive of 50,000 posts
>averaging about 1k per post.

The hypermail package is one approach:

http://www.eit.com/software/hypermail/hypermail.html

It takes mailbox files and turns them into hypertext archives,
with index files by Subject:, author (From:), Date:, and thread.
For the thread index it seems to use In-Reply-To: and References:
where they are available, but nothing from Subject: or Date:.
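
To make the header-based threading idea concrete, here is a minimal
sketch in Python. It is not hypermail's actual code, just one way to
link each post to its parent through In-Reply-To: and References:.
The names parent_id and build_threads are made up for illustration.

```python
import email
import re
from collections import defaultdict

# Matches a message-id token such as <842155310.10282.0@genps.demon.co.uk>
MSGID = re.compile(r"<[^<>\s]+>")

def parent_id(msg):
    """Return the Message-ID this post replies to, or None."""
    # Prefer In-Reply-To:, which names the parent directly.
    in_reply_to = MSGID.findall(str(msg.get("In-Reply-To", "")))
    if in_reply_to:
        return in_reply_to[0]
    # Otherwise fall back to the last entry in References:,
    # which by convention is the most recent ancestor.
    references = MSGID.findall(str(msg.get("References", "")))
    return references[-1] if references else None

def build_threads(messages):
    """Return (roots, children): thread starters and a parent -> replies map."""
    known = {m["Message-ID"].strip() for m in messages if m["Message-ID"]}
    roots = []
    children = defaultdict(list)
    for m in messages:
        mid = (m["Message-ID"] or "").strip()
        parent = parent_id(m)
        if parent in known:
            children[parent].append(mid)
        else:
            # No usable parent in this archive: treat as a thread root.
            roots.append(mid)
    return roots, dict(children)

if __name__ == "__main__":
    # Tiny made-up sample; a real archive could be read with mailbox.mbox().
    raw = [
        "Message-ID: <a@x>\nSubject: q\n\nbody\n",
        "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nSubject: Re: q\n\nbody\n",
    ]
    msgs = [email.message_from_string(r) for r in raw]
    print(build_threads(msgs))
```

Posts whose parent is missing from the archive simply become roots
here; a Subject:/Date: fallback would have to be added separately.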

I feel hypermail is a good start, but it lacks some features:
preserving all headers, a more effective threading algorithm,
more tunable hyperlinking, and some capability for links
or indexes by keystrings. Also, as the sheer volume of posts
grows, a brute-force hyper-index becomes more unwieldy.
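
As a rough illustration of what I mean by an index by keystrings,
here is a small Python sketch. It is not something hypermail provides;
the function name build_keystring_index, the post ids, and the sample
keystrings are all assumptions made up for the example.

```python
from collections import defaultdict

def build_keystring_index(posts, keystrings):
    """Map each keystring to the sorted ids of posts whose text contains it.

    posts      -- dict of post id -> post text (hypothetical layout)
    keystrings -- iterable of strings to look for, matched case-insensitively
    """
    index = defaultdict(list)
    for post_id, text in posts.items():
        lowered = text.lower()
        for key in keystrings:
            if key.lower() in lowered:
                index[key].append(post_id)
    # Keystrings with no matching posts are left out of the result.
    return {key: sorted(ids) for key, ids in index.items()}

if __name__ == "__main__":
    sample = {
        1: "How do I use Lingo sprites?",
        2: "Shockwave plugin crash",
        3: "lingo script for Shockwave",
    }
    print(build_keystring_index(sample, ["Lingo", "Shockwave"]))
```

This is plain substring matching over every post, so it is itself
brute force; for a very large archive a proper full-text indexer
would probably be the better tool.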

Still, hypermail could be useful for the situation of
"questions asked more than rarely, but less than frequently,"
that can often arise on a list as new members join over time.

An example of hypermail output for a lower-volume list (~15MB over 8 months) is at:

http://mat.gsia.cmu.edu/POB/

-- 
Denis McKeon 
dmckeon@swcp.com