RLG DigiNews, ISSN 1093-5371


Feature Article

Digital Imaging and Preservation Microfilm: The Future of the Hybrid Approach for the Preservation of Brittle Books

Stephen Chapman, Harvard University, stephen_chapman@harvard.edu
Paul Conway, Yale University, paul.conway@yale.edu
Anne R. Kenney, Cornell University, ark3@cornell.edu

The Council on Library and Information Resources (CLIR) has issued a working paper on the dual use of microfilm for preservation and digital imaging for enhanced access in the context of the U.S. brittle books program. It builds on work that has already been accomplished, principally through projects conducted at Cornell University, to create preservation-quality Computer Output Microfilm (COM) from digital images, and at Yale University, to produce digital imagery from extant microfilm. First, the paper offers general recommendations on the creation of microfilm, COM, and digital images from the perspectives of quality, cost, and technology. Second, it provides a decision tree to assist institutions in determining the circumstances under which one would scan first or film first. Third, it presents recommendations regarding the development of metadata elements associated with digitized page images. Finally, it suggests areas for further discussion, research, and development to expedite the hybrid strategy. The full working paper will be available in mid-February on the Council on Library and Information Resources Web site.

This working paper is written as the National Endowment for the Humanities (NEH) reaches the halfway mark in the 20-year brittle books program launched in 1989 at the request of Congress. For the past decade, the brittle books program has focused on the use of microfilm to preserve and make accessible the information contained in books and journals threatened by acidic paper. This same time period has also witnessed intensive investigation into the use of digital imaging technology to reformat library and archival materials. Despite predictions that microfilm could be replaced by digital imaging, many have come to appreciate that digitization may increase access to materials but it does not guarantee their continued preservation. This working paper examines the circumstances under which a hybrid program, combining microfilm and digital imaging, can best exist. It begins with a set of key assumptions:

The research projects at Yale and Cornell addressed digital image conversion of text-based materials and the production of archival-quality microfilm. As the two projects revealed, the relationship of film to digital lies in aligning quality, cost, and access. These in turn are affected by the characteristics of the source material being converted; the capabilities of the technology used to accomplish the digital conversion; and the purposes or uses to which the digital end product will be put. For these reasons, the working paper is divided into four sections, covering the following issues:

  1. Characteristics of microfilm as a source for digital conversion
  2. Characteristics of microfilm as an end-product of digital conversion
  3. Choice of a digital conversion path (film-first or scan-first)
  4. Development of metadata elements associated with digital page images

The Characteristics of Microfilm as a Source for Digital Conversion
The findings of Yale's Project Open Book suggest that modest modifications to the Research Libraries Group (RLG) microfilming guidelines may yield preservation microfilm that produces better quality digital image products, but that the costs incurred in creating such film may not be recouped through reduced digital conversion costs. Detailed analyses from Yale indicate that the characteristics of the film had little or no impact on conversion costs in Project Open Book. Modest changes to the RLG microfilming guidelines, however, could lead to improved quality and perhaps more cost-effective film scanning. Whether the additional costs associated with making improvements at the point of microfilming can be offset by lower scanning costs should be examined. Within the scope of its current NEH project to microfilm collections in the history of science, Harvard University is testing this premise. A report detailing project findings, including costs, will be available in the summer of 1999.

The CLIR working paper presents recommendations for the creation of new microfilm in order to produce better quality digital image products. With few exceptions, the recommendations do not challenge the primacy of international standards governing the creation of preservation microfilm; rather, they suggest minor enhancements to, or the tightening of, such standards (see Table 1 below).

Ultimately, future developments in digital technology - such as affordable grayscale scanning capabilities, software-assisted processing tools, the development of continuous scan techniques, and blipping - may offer far greater promise to increase quality and reduce cost than any specific modifications in the creation of preservation microfilm. Close cooperation between the imaging technology community and imaging product developers in cultural institutions is needed to advance the capabilities and efficiency of the technology of scanning.

The Characteristics of Microfilm as an End-Product of Digital Conversion
The Cornell project showed that computer output microfilm created from 600 dpi, 1-bit images scanned from brittle books can meet or exceed national microfilm standards for image quality and permanence. Achieving acceptable levels of image quality rested on the two-step process of converting original materials to COM:

Procedures for production and inspection of COM will differ from those appropriate to conventional microfilm. Significant changes in film creation and quality control are introduced in COM recording, as images are generated digitally, not photographically. Decisions affecting image quality, such as resolution and density, are made upstream, at the point of scanning, rather than at the point of filming. Table 1 includes recommendations for the creation of COM, and the full report provides recommended changes to the film inspection process.

Table 1. Recommendations for Microfilm and COM Production in a Hybrid Approach

Recommendation | Creating conventional film | Creating COM
Film stock | Investigate use of 16mm | Investigate use of 16mm
Polarity | Scan duplicate negative microfilm | Produce COM in negative polarity
Density | Dmax of 0.90-1.30 acceptable for bitonal scanning | Current standards apply
Reduction ratio/image orientation | Orient material to obtain lowest ratio; current film scanning favors cine position | Utilize variable reduction ratios and orient material on film to obtain lowest ratio
Placement | Minimize or eliminate "centerline weaving" | Placement highly consistent with COM
Skew | No greater than 2 degrees from parallel | Limit skew in scanning to 2 degrees from parallel
Splices | No splices within a volume | N/A
Duplicate images | Select most appropriate for retention in digital file | N/A
Documentation | Record reduction ratio, original's dimensions, and document structural metadata | Record reduction ratio, original's dimensions, resolution, bit depth, enhancements, file format, compression, pixel dimensions, recording space on film (e.g., 15mm), and document structural metadata
Technical targets | RIT Alphanumeric Test Chart, Kodak Gray Scale, possibly MTF target | RIT Alphanumeric Test Chart

The Choice of a Digital Conversion Path (Film-First or Scan-First)
The working paper examines various paths in the process of creating both digital images for access and microfilm for preservation. It describes some of the circumstances that may lead to a film-first or scan-first decision. The two steps may be coupled in a single workflow (as in the Cornell project), or they may be separated by several years (as in the Yale project). The impact on preservation and cost of separating the two processes needs to be more fully explored.

In some cases, the choice of how to begin will be technical. For instance, if both the original and a microfilm version exist, but the brittle paper has deteriorated to such an advanced state that it can no longer be handled, microfilm is the only viable source for scanning. In other cases, the circumstances are resource or policy related: funding is available only to create a single format (whether microfilm or digital images) or institutional policies regarding disposition and handling preclude some reformatting options, such as flatbed scanning.

In making recommendations for choosing a digital conversion path, the authors distinguish "preservation quality" from "access quality" when describing the digital image masters produced in the scan-first and film-first approaches:

The full report contains a decision tree that flows from these definitions. Deciding where to begin a hybrid project requires consideration of issues associated with the source materials, with the assumed capabilities and costs of the technology, and with local policies regarding disposition of originals. Each is important, but the decision tree begins from a starting point associated with image quality in the digital master. The tree suggests a decision-making process for two main strategies:

Based on each of these strategies, the decision tree offers a means for assessing some of the circumstances governing whether to scan first or film first. It will be important to revisit these strategies as technology evolves.

Development of Metadata Elements for Digital Image Materials
The report also examines requirements for metadata to accompany the digital image files in order to create a usable digital object. A digital object will consist of:

  1. Digital masters (scanned page-images; each with a unique file name)
  2. Associated administrative metadata
  3. Associated structural metadata

Administrative metadata refer to the descriptive elements that reside within or outside a digital object to ensure that it will be managed over time; structural metadata refer to the elements within a digital object that facilitate navigation.

Table 2 presents a list of 11 recommended administrative metadata elements to document the following attributes of a given digital object:

Table 2. Proposed Administrative Metadata Elements

  1. A technical target that documents the capabilities of the scanner that was used for bitonal scanning; the RIT Alphanumeric Test Object is recommended
  2. For digital preservation masters, bibliographic targets for COM output
  3. Name of project
  4. Name of funding agency(ies)
  5. Unique identifier for the object
  6. Designation of object as "digital preservation master" or "digital access master" must be recorded in bibliographic record, according to procedures routinely followed to designate ownership and location of master negatives
  7. Owning institution
  8. Copyright statement (including note of any use restrictions)
  9. Date object was created (i.e., scanning date)
  10. Scanning resolution, bit depth, file format and version, and compression
  11. Change history of object: current version (edition) of object, with dates of migration, and notation of which features in #10 were changed

The creation of structural metadata is central to the digitization of 19th-century materials. The working paper recommends that, at a minimum, structural metadata convey a brittle book's pagination and "feature codes" (e.g., title page, table of contents). As we gain a greater understanding of users' expectations of digitized books and serials, the list of feature codes will likely evolve. Table 3 lists a recommended minimum set of mandatory structural metadata elements.

Table 3. Proposed Structural Metadata Elements

1. Correct page number associated with each digital image

  • except in cases of printer's errors, page numbers must be transcribed exactly as they are printed (e.g., Roman or Arabic, upper or lower case)

2. Internal navigation/structural points when present in the original

  • for books, minimum elements: blank, title page, table of contents, index
  • for journals, minimum elements: blank, title page or cover, table of contents, index at the issue level when present; at the volume level when not

There are cost implications for specifying how many features must be encoded. If the hybrid approach is to be generalized, this question deserves broader discussion, where a final specification must balance cost and functionality.
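
As a concrete illustration of how these recommendations fit together, the minimal Python sketch below bundles a set of page-image masters with the administrative elements of Table 2 and the structural elements of Table 3. The element names, file names, and values are invented for the example and are not part of the working paper's specification.

    # Minimal sketch: a "digital object" as described above (masters plus
    # administrative and structural metadata). All names and values are invented.
    digital_object = {
        "administrative": {
            "technical_target": "RIT Alphanumeric Test Object",
            "project": "Example Brittle Books Project",          # hypothetical
            "funding_agency": "NEH",
            "identifier": "example-0001",                         # hypothetical
            "designation": "digital preservation master",
            "owning_institution": "Example University Library",   # hypothetical
            "copyright": "Public domain; no use restrictions",
            "date_created": "1999-02-01",
            "scanning": {"resolution_dpi": 600, "bit_depth": 1,
                         "file_format": "TIFF 6.0", "compression": "ITU-T Group 4"},
            "change_history": [],
        },
        "structural": [
            # one entry per digital master: file name, printed page number, feature code
            {"file": "00000001.tif", "page": None, "feature": "title page"},
            {"file": "00000002.tif", "page": "ii", "feature": "table of contents"},
            {"file": "00000003.tif", "page": "1",  "feature": None},
        ],
    }

    # A delivery system could use the structural entries to jump to a feature:
    toc = next(entry for entry in digital_object["structural"]
               if entry["feature"] == "table of contents")
    print(toc["file"])   # 00000002.tif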

Conclusion
This working paper offers some definitive recommendations and points to areas where additional information is needed. These include changes to microfilm production to facilitate digital conversion, guidelines for the creation and inspection of COM, quality and cost considerations, metadata requirements, technology forecasting, and the role of film in digital preservation initiatives. In most cases, additional information can be obtained by holding a series of meetings with representatives from institutions that have undertaken hybrid projects, imaging service providers, key industry and technology developers, funding agencies, and preservation and cultural organizations. The report concludes with a recommendation that a series of such meetings be held over the course of the next six months. The working paper can then be finalized, and the key findings disseminated broadly to the preservation community both within the United States and around the world.

Footnote
1. See for instance, Nancy Elkington, editor, RLG Preservation Microfilming Handbook, Mountain View, CA: The Research Libraries Group, Inc., 1992; ANSI/AIIM MS23-1998, Practice for Operational Procedures/Inspection and Quality Control of First-Generation, Silver Microfilm and Documents, Silver Spring, MD: Association for Information and Image Management, 1998.


Digitization Efforts at the Center for Retrospective Digitization, Göttingen University Library

Norbert Lossau
Head of the Center for Digitization
Lossau@mail.sub.uni-goettingen.de
and
Frank Klaproth
Technical Director
Klaproth@mail.sub.uni-goettingen.de

In the beginning of 1997, the German Research Foundation (Deutsche Forschungsgemeinschaft) launched a new funding program for retrospective digitization of library materials as part of an initiative to build a German Digital Research Library. Two centers for retrospective digitization were established that year: one at the Bavarian State Library in Munich and one at the University of Göttingen (Göttinger Digitalisierungs Zentrum, GDZ). The GDZ was also charged with coordinating national efforts towards standardization and best practices for digitization, and is engaged in evaluating tools and techniques for image capture and text conversion, bibliographic description, document management, and the provision of remote access.

The GDZ is currently involved in the following projects:

The Conversion Process
The conversion process performed by the GDZ follows the guidelines and recommendations of the technical working group as outlined in its final report.

Image Capture
Text-based material is scanned from 35mm microfilm and from the originals. Microfilm scanning is outsourced, and images are created offshore in 600 dpi, 1-bit mode. Paper scanning is undertaken in-house at 400 dpi, 1-bit, using two face-up scanners (Zeutschel Omniscan 3000 and Minolta PS 3000). Plans are underway to upgrade to the Minolta PS 7000, which is capable of producing 600 dpi images. The scanning systems run SRZ ProScan Book, scanning software developed for the GDZ by the Satz-Rechen-Zentrum company in Berlin to meet special production requirements for older books (e.g., TIFF-header editing, a production-control window with a tree view of scanned pages, and masking and cropping of pages during scanning).

Image capture of older books often requires some kind of enhancement after scanning. For economic reasons, this post processing should be done in batch mode wherever possible. The GDZ uses a program called PixEdit, which allows for a semi-automatic enhancement of text pages, including despeckling, de-skewing, and filling of empty pixels.

Within the next month, a new scanning device for capturing color images will be added to the GDZ's equipment. The digital camera back "Picture Gate 8000," manufactured by Anagramm, has an 8,000 x 9,700 pixel array and will be used for face-up scanning of valuable library resources, including the Gutenberg Bible. Grayscale scanning of illustrations (e.g., from the travel account books) will also be performed with this device. A special moveable cradle (shown below), designed for Graz University Library, will be used to ensure "contact-free scanning" for rare books. The cradle uses a sensitive low-pressure system to hold down pages.

A Special Moveable Scanning Cradle

The scanning process produces a high quality digital master in TIFF format. Derivatives (GIF, JPEG, PNG) will be created on-the-fly for online delivery. A PDF version will be available for downloading or for offline delivery on CD-R. In April 1999, 1,500 volumes of North American travel literature and 100 volumes of mathematics will be made publicly available. The TIFF header of the digital master contains both technical and descriptive information (a listing and explanation of the elements used is located on the GDZ's Web site). The digital masters are stored offline on CD-R, an ISO-standard storage medium. As of this writing, approximately 600,000 images have been created.
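
The specific header elements the GDZ records are documented on its Web site; purely as a rough Python sketch of the general technique, the example below writes a free-text description into the standard TIFF ImageDescription tag (270) with the Pillow library and reads it back. The file names and description text are invented for the illustration.

    # Rough sketch of embedding descriptive information in a TIFF master's header
    # via the standard ImageDescription tag (270). File names and the description
    # text are invented; the GDZ's actual header layout may differ.
    from PIL import Image

    description = "Example work; scanned 1999; 600 dpi; 1-bit"   # hypothetical content

    img = Image.open("master.tif")                                # hypothetical file
    img.save("master_with_header.tif",
             tiffinfo={270: description},                         # 270 = ImageDescription
             dpi=(600, 600))

    # Reading the tag back from the stored master:
    stored = Image.open("master_with_header.tif")
    print(stored.tag_v2.get(270))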

Creation of Searchable Full Text
The GDZ is interested in creating searchable full text to expedite access. The goal is to enable researchers to perform searches on the full text in the background and to present the highlighted results in the digital image version on screen. Older materials, particularly those printed in Fraktur (Gothic type), pose major challenges for text conversion via optical character recognition (OCR) programs. The GDZ evaluated both standard and sophisticated trainable programs (e.g., Prime Recognition, ProLector, Optopus, FineReader) and concluded that they performed poorly on Gothic texts. The Russian program FineReader (version 4.0) offered the best results for non-Gothic texts, including a low failure rate on older books and on poorer-quality digital images. The GDZ is currently investigating other solutions for OCR processing of Gothic texts in cooperation with a company in Potsdam, Germany, a spin-off from the Lehrstuhl für Numerische Mathematik der Universität Potsdam.

Document Access

Navigation
Until the issue of Gothic text processing can be resolved, the GDZ will provide access to digitized materials by structuring the navigational tools provided by the original itself (e.g., tables of contents, indexes, lists of illustrations). These will be hyperlinked to the referenced portions of the documents (e.g., chapters, page images). Navigation of page images requires a detailed description of the structure and pagination of the original.

The pagination of a printed book often does not coincide with the "pagination" of the digitized version, which is represented by a series of individual image files. In printed books, many pages are excluded from the formal pagination, such as the title page, a plate, or the last page in a chapter, but these must be accounted for in the digitized version. The solution is to create a concordance between the two numbering schemes and to accommodate other pagination anomalies (e.g., Roman numerals in the front matter, Arabic in the body, italic in the back matter). The GDZ has adapted the page-sequencing principles of the Ebind project at the University of California, Berkeley, and extended them to include references for non-paginated pages.
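
A minimal Python sketch of the concordance idea (not the GDZ's implementation) might look like the following; the file names and page labels are invented, and unpaginated pages receive a bracketed label so that every image remains addressable.

    # Concordance between the image-file sequence and the printed pagination.
    # File names and labels are invented for the example.
    concordance = [
        ("00000001.tif", "[title page]"),
        ("00000002.tif", "[blank]"),
        ("00000003.tif", "iii"),    # Roman numerals in the front matter
        ("00000004.tif", "iv"),
        ("00000005.tif", "1"),      # Arabic numerals in the body
        ("00000006.tif", "2"),
    ]

    # Look up the image that carries printed page "1":
    by_page = {page: image for image, page in concordance}
    print(by_page["1"])             # 00000005.tif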

Access via the OPAC and the Web
Following the guidelines of the DFG, digitized documents in Germany are recorded in the online library network catalogues (Verbundkataloge) to ensure direct access to full text resources via the OPAC. Göttingen University Library has already created electronic records for the original volumes in the PICA-GBV network (Gemeinsamer Bibliotheksverbund). The Library treats the digitized versions as reproductions, similar to microforms, and creates separate records. The bibliographic description in the GBV conforms to the German exchange format for libraries (MAB-2). Special categories for online resources, such as date of digitization, creator, copyright, URL, are added. The availability of the PICA-GBV Online library network catalog on the Web allows users to start a search in the online catalog, and go directly from the listed hits to the electronic version.

Document Management System (DMS): AGORA
The use of a DMS is a key component in the development of a German Digital Library. Navigating digitized documents requires the use of middleware to allow the administration and correlation of metadata and images into sophisticated hierarchical structures. These structures must reflect the nature of collected works (e.g., serials with volumes and issues, multi-volume works, enclosed entities in other works) as well as the content of individual volumes (for instance, older legal literature can contain up to 8 hierarchies for chapters).
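
A minimal Python sketch of such a hierarchy (journal, volume, issue, article, page images) is shown below; the titles, depth, and field names are invented and do not represent AGORA's actual data model.

    # Invented example of a nested document structure of the kind a DMS must manage.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        label: str                                           # e.g., "Volume 1" or "Chapter 3"
        images: List[str] = field(default_factory=list)      # page-image file names
        children: List["Node"] = field(default_factory=list)

    journal = Node("Example Journal", children=[
        Node("Volume 1 (1850)", children=[
            Node("Issue 1", children=[
                Node("Table of contents", images=["00000001.tif"]),
                Node("Article: On an Example Topic",
                     images=["00000002.tif", "00000003.tif"]),
            ]),
        ]),
    ])

    def depth(node: Node) -> int:
        """Number of hierarchy levels beneath and including this node."""
        return 1 + max((depth(child) for child in node.children), default=0)

    print(depth(journal))   # 4 levels in this invented example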

Funded by the DFG, the GDZ is developing a system based on current international standards, called AGORA. Initial plans to build the DMS on Saros Mezzanine, a traditional DMS from FileNet, have been revised in light of the complex requirements of library materials and the demand for open and platform-independent metadata and document structures. The main functionality of the system is now based on new modules, unified in the system named AGORA and developed by the Satz-Rechen-Zentrum company in Berlin in cooperation with the GDZ. AGORA is intended as a scalable, object-oriented model designed to:

Documents will be administered in a relational database (DB2 is proposed for the GDZ), designed by the GDZ to accommodate the complexity of library materials. The model will provide for extensibility to other commercially available relational databases as well.

An administrative tool enables the following management functions:

  1. Importing scanned images
  2. Importing and exporting structured metadata in RDF/XML format
  3. Batch conversion of images on the fly for Web access
  4. Generation of Web pages (with HTML templates and Java servlets)

The prototype for the system will be released at the end of March 1999. Plans are underway to add further functionality to AGORA in the next version, which will integrate a full-text search engine (Verity) and an "electronic book trolley" for selecting books, or even parts of them, for downloading, printing, and accounting. (1) Users will have the option to view the document on screen via a standard Web browser (as GIF, JPEG, or PNG images), download a PDF file for printing, or order a high-quality (600 dpi) printout from the library. Additionally, a PDF version, together with the free Acrobat Reader, will be offered on CD-R. These output versions will be generated as derivatives from the digital master. The creation of PDF files with bookmarks for the table of contents is planned for the first release at the end of March 1999. It is also anticipated that about 400,000 images from the North American travel literature and 250,000 images from mathematics, along with their accompanying structural and bibliographic metadata, will be imported into the system and made available in spring 1999.

In the near future, integration of digitized material from other German projects is planned. The Max Planck Institute for European Law in Frankfurt is prepared to offer its material via the AGORA system. Staff at the GDZ provide technical advice to other digitization projects in Germany and throughout Europe via such efforts as the European Periodicals Project. The first step in the "Distributed Digital Research Library" in Germany, proposed by the DFG, will be taken in the next two years. The use of common metadata formats (e.g., RDF/XML) to build a network of local digital collections will be a main topic in this process.

Footnote
1. A description of the system can be found in: Klaproth, F., Lossau, N. "The Document Management System Saros Mezzanine and the New Product AGORA as Key Component in a Digital Library Architecture at Göttingen University Library", Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science 1513, 1998.

Technical Feature

Lossy or Lossless? File Compression Strategies Discussion at ALA

Robin Dale, RLG Member Programs and Initiatives
Robin_Dale@notes.rlg.org

The American Library Association midwinter meeting in Philadelphia was the site of a discussion about lossy vs. lossless file compression techniques. Sponsored by the Association for Library Collections and Technical Services/Preservation and Reformatting Section, the two-hour meeting brought together six practitioners to discuss the merits and uses of different compression techniques. Walter Cybulski (Head, Quality Assurance Unit, National Library of Medicine) provided opening remarks, and Oya Y. Rieger (Digital Projects Librarian, Cornell University) served as moderator for the event and provided an introduction to the topic. She briefly reviewed factors involved in selecting file formats and compression techniques, and then turned to the panel members for their opening statements on the use of lossy vs. lossless compression. Until recently, the common belief has been that lossy compression is unacceptable for the types of files created during digital reformatting projects. With a body of research and development projects now complete, production-level imaging underway at some institutions, and new file formats and compression techniques recently introduced, the question for the speakers was whether this belief is still valid.



Carl Fleischhauer
Technical Coordinator, National Digital Library Program, Library of Congress
cfle@loc.gov

Fleischhauer opened with an overview of decisions related to file types and compression that had been made for the American Memory and National Digital Library projects during the past decade. The American Memory project team had used their best judgement to make dozens of decisions about quality, taking into account all of the tradeoffs they could identify. For example, uncompressed files were made and retained for pictorial materials where any compression artifact damages the reproduction. Uncompressed, high resolution files were also produced for certain "high stakes" documents where the importance of the document and a desire to minimize future handling of the original served as overriding motivations. In contrast, Fleischhauer referred to the Manuscript Digitization Demonstration Project that captured ten thousand pages of modern typescripts as grayscale images, reporting that the project steering committee had placed a higher value on legibility than on museum-quality rendering, and readily accepted "mild" use of the lossy JPEG compression algorithm.

Other projects were influenced by considerations unrelated to digitization. For example, the books scanned for American Memory have significance as artifacts, and the scanning approach is intended to prevent damage even if that means less-than-perfect images. The pictorial illustrations in these books are scanned in grayscale or color - as circumstances permit - but given the inherent quality, there is no compelling reason not to apply a modest level of lossy JPEG compression to the resulting files. In contrast, lossless compression is applied to the images of the typographic pages from these bound volumes. But since the pages are scanned lying in a curved plane at a resolution of 300 dpi, this end result can be viewed as a "loss" when compared to the 600 dpi capture of a page lying flat, possible only with a disbound volume. And, Fleischhauer concluded, some might even argue that binarization itself represents loss, when the 8-bit grayscale image that exists within the scanner is reduced to the 1-bit bitonal image that is saved to disk.
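
The binarization Fleischhauer mentions can be illustrated with a simple threshold. The Python sketch below uses the Pillow library and an arbitrary threshold of 128; the file names are invented, and the point is only to show where the grayscale information is discarded.

    # Illustration of binarization: an 8-bit grayscale page image is reduced to a
    # 1-bit bitonal image by thresholding. File names and the threshold are arbitrary.
    from PIL import Image

    gray = Image.open("page.tif").convert("L")               # 8 bits per pixel
    bitonal = gray.point(lambda p: 255 if p >= 128 else 0).convert("1")
    bitonal.save("page_bitonal.tif", compression="group4")   # 1 bit per pixel

    # Every pixel is now black or white; the intermediate gray values that existed
    # inside the scanner are discarded, which is the "loss" referred to above.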



Louis Sharpe, III,
President, Picture Elements, Inc.
lsharpe@picturel.com

Sharpe advocated the use of visually lossless (but lossy) compression with certain types of originals. Because "practical people make practical decisions," Sharpe believes the use of visually lossless compression in association with digital images of text-based originals should be considered no worse than reformatting those same originals with high-contrast black and white microfilm. His position differs for originals such as photographs and paintings; he does not advocate the use of lossy compression on pictorial materials. However, because the world already accepts microfilm as a (lossy) medium for bitonal textual materials, Sharpe believes it is difficult to justify creating and storing huge, tonal, uncompressed files for the same material. Time, conversion costs, staffing, etc., are all factors in the decision-making process for digital reformatting. He called attention to a handout that framed his position as a debate proposition:

1. Resolved,
that visually lossless (yet lossy) compression of tonal images of illustrated book pages can be used to create high-quality digital masters if all of the following conditions are met:

2. Further resolved,
that such images are of sufficient quality to serve as preservation images for books which are:

3. Further resolved,
that such images are of comparable or superior quality to accepted preservation approaches such as microfilm.

4. Further resolved,
that cost matters in digital library image conversion projects, even though it is other people's money.

To illustrate the plausibility of his position, he provided a closing scenario: compare two files of the same size - an uncompressed file to a file that had its spatial resolution doubled on both axes and was then compressed using JPEG. Visually, he stated, most people would choose the compressed file because with doubled spatial resolution, it was inherently richer than the first. With a final nod to the improvements proposed for JPEG 2000, Sharpe argued that at a minimum, the library and archival community should not close the door on the use of visually lossless compression.
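
The arithmetic behind Sharpe's closing scenario is worth spelling out: doubling the spatial resolution on both axes quadruples the pixel count, so a roughly 4:1 compression ratio returns the richer file to the size of the uncompressed one. The figures in the short Python sketch below are invented solely to illustrate that relationship.

    # Invented figures illustrating the same-file-size comparison.
    width, height, bytes_per_pixel = 2400, 3000, 1            # hypothetical 8-bit grayscale page

    uncompressed = width * height * bytes_per_pixel           # 7,200,000 bytes
    doubled = (2 * width) * (2 * height) * bytes_per_pixel    # four times the data

    print(doubled / uncompressed)   # 4.0 -- a ~4:1 compression ratio keeps the file
                                    # the same size while doubling resolution on each axis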



Howard Besser
Associate Professor, School of Information Management & Systems, University of California, Berkeley
howard@sims.berkeley.edu

Besser outlined his position by focusing on the decision-making process in imaging projects and the expected longevity of digital information. The decision-making process for digitization necessitates a set of tradeoffs, not just those technical in nature. To make correct decisions, the organization's mission, the anticipated user and uses, and the type of original material all must be considered. Who is the user? What uses will users make of digital files? Will they need to see the texture of the paper? Will the text suffice? Will these users have the same expectations as those in the future? These questions are all part of the formula, which, when combined with organizational constraints, leads to tradeoffs from the beginning of the decision-making process.

The problem of digital longevity, Besser asserted, is complicated by a series of related problems, including viewing, scrambling, interrelation, custodial aspects, and translation. The viewing problem stems from the infrastructure required to support viewing: strong reliance on complex machines is new (as compared with human-eye-readable forms such as microforms). Digital reformatting requires encoding, application software, operating systems, hardware, device drivers, and possibly a network to display the reformatted image. The danger, he asserts, is that if one crucial bit of the image file is lost or damaged, the entire file may become unusable. The scrambling problem is somewhat different. According to Besser, to solve short-term problems we have engaged in processes that may result in long-term peril, such as compression and container architectures. Using the current Y2K problem as an example, Besser concluded that when he thinks of the long term, "fixes" such as compression give him cause to worry.



Peter Hirtle,
Assistant Director, Cornell Institute for Digital Collections, Cornell University
pbh6@cornell.edu

Hirtle's position on the use of lossy compression is somewhat cautious. Based on his experience with image file formats and compression, his position leans towards that of Howard Besser, but admittedly wavers. According to Hirtle, the debate is only relevant to the concept of the master file; access copies should be made using whatever compression scheme meets the needs of users. Three elements are essential for a master digital image file: the file captured should be rich in information; it should allow for multiple uses over time; and it should remain accessible for a long period of time. On the first point, richness of capture, Hirtle agreed with points made by both Besser and Sharpe. He cited Sharpe's example of two same-size files, one of which is uncompressed but hence at a lower resolution than the compressed file. The compressed file, when uncompressed, may actually yield an image that is more information-rich and visually preferable. Hirtle was less certain that a compressed file would remain as usable over time. It may not be possible, for example, to migrate compressed files directly to a new file format. On the third point, Hirtle was suspicious of lossy compression and of the ability of lossy-compressed files to remain available over time. Migration from one lossy compression scheme to another may cause irretrievable information loss through the chain of events involved (compress --> uncompress --> save as new file --> compress with new scheme). Furthermore, if the error-correction techniques that occur "behind the scenes" in physical media such as CD-ROM fail, a compressed image may become useless. The safest course to ensure long-term accessibility would be to avoid compression schemes altogether.



Joy Paulson
Head, Reformatting and Replacement Service, Preservation Division, University of Michigan
jpaulson@umich.edu

Paulson spoke from the experience of working in a large digital library program involved in many digitization projects, including bitonal textual materials, color papyri, e-texts, etc. Her position, in favor of lossless compression techniques, is based on the decision-making processes established for Michigan's Making of America project. Because Michigan does not create and store derivatives as part of the production process, the relatively low compression levels offered by lossless compression are not judged to be problematic. (Michigan employs an on-the-fly image conversion program created at the institution, called TIF2GIF, to convert the large bitonal TIFF images into smaller, grayscale GIF images for Web transmission and viewing.)

According to Paulson, the ITU-T Group IV (lossless) compressed TIFF images serve multiple purposes. In their current form, they can be printed and bound to create a new hardcopy, or converted via TIF2GIF for image delivery. They also serve as Michigan's "hedge against the future." Instead of adapting the files to fit current limitations associated with bandwidth and viewing devices, the TIFFs remain information-rich master files that can be migrated when new, and perhaps better, file formats or compression schemes become available. This approach has proven very workable for Michigan, and there are no plans to change it in the near future.
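
TIF2GIF itself is Michigan's in-house tool; purely as a Python sketch of the general technique, the example below reduces a large bitonal TIFF to a screen-sized grayscale GIF with the Pillow library. The file names and the 1/4 scale factor are invented; the antialiased reduction is what introduces the intermediate gray levels that keep small text legible on screen.

    # Sketch of an on-the-fly delivery conversion in the spirit of TIF2GIF (not the
    # Michigan code): a bitonal TIFF master becomes a smaller grayscale GIF.
    from PIL import Image

    master = Image.open("master_0001.tif").convert("L")           # bitonal -> 8-bit grayscale
    w, h = master.size
    derivative = master.resize((w // 4, h // 4), Image.LANCZOS)   # antialiased reduction
    derivative.save("view_0001.gif")                              # small grayscale GIF for the Web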



Steven Puglia,
Preservation and Imaging Specialist, National Archives and Records Administration
steven.puglia@arch2.nara.gov

Puglia opened his comments by agreeing with all of the previous speakers. For Puglia, a position on lossy vs. lossless compression is not a clear-cut decision, but a matter of weighing the pros and cons. Compression decisions must be combined and weighed with other decisions in the imaging chain, many of which may lead to "loss." Microfilming and traditional analog processes are inherently lossy. The important point is to minimize loss in reformatting. Used as a set of tools in the decision-making process, compression may be an effective answer to the needs of an institution or its users.

Arguing against compression of master files, he asserted that even with larger images, one cannot get the full benefit of compression. The full benefit comes from very high compression ratios, something not to be considered for master files due to the loss of quality. He also questioned whether compressing and decompressing high-quality, high-resolution images is worth the time it takes. The time required to compress and decompress is much longer at higher quality settings (lower compression ratios), which are appropriate for master image files; this affects processing time and users, and again minimizes the potential benefit of using compression. Higher compression ratios (lower image quality) are more appropriate for lower-resolution access images, where users can derive greater benefit from the faster compression/decompression times and the loss of image quality is of less concern.

In describing the decisions made at the National Archives and Records Administration, Puglia explained that master files are stored offline as uncompressed files. The access copies are available online and are compressed. Agreeing with Lou Sharpe's earlier argument, Puglia believes that high quality capture and high quality JPEG or LZW compression may produce better files than a mid-quality TIFF image. But for now, he recommended against considering compression "the magic bullet." With lossy compression, one more technical detail must be taken into consideration.



Question & Answer Session

The statements of the panelists were followed by questions from the debate moderator and audience members. The following is a sample of the questions and the points panelists made in responding to them.

1. Have institutions received complaints about the effects of compression?

2. Why would one choose lossy over lossless?

3. What would you forecast for ten years from now?

4. What new file formats are rising to the forefront or coming soon?

If you wish to contribute your own thoughts to this discussion, please send them to preservation@cornell.edu.

Highlighted Web Sites

Open Information Interchange (OII): Archiving Standards

The objective of the European Commission's OII service is to facilitate the exchange of information among standards and specification developers, product and service providers, and end-users of these products and services. The Commission's Archiving Standards Web site provides an overview of the existing and emerging standards and industry specifications related to the archiving and management of information. For example, it covers standards and specifications associated with electronic imaging in the context of the related areas of document management and archiving.

Calendar of Events

The Challenge of Image Retrieval CIR'99:
Second UK Conference on Image Retrieval

February 25-26, 1999

The conference will bring together researchers and practitioners in the area of image retrieval. There will be an exchange of information and discussion on the significance of developments in related disciplines. The conference will be held in Newcastle upon Tyne, Great Britain.

New Challenges for Scholarly Communication in the Digital Era:
Changing Roles and Expectations in the Academic Community
March 26-27, 1999

To be held in Washington, D.C., this conference is sponsored by organizations representing faculty, publishers, librarians, and learned societies. It will explore the nature and scope of the challenges for scholarly communication in the digital era, and will seek to define new roles that build on the strengths and needs of all sectors. Conference topics include: Getting Ahead in the Digital World, Distance Education, the Economics of Scholarly Communication, and What Does it Mean to Publish?

Call for Papers: DL'99: Third European Conference on Research and Advanced Technology for Digital Libraries
Deadline: April 1, 1999
The Third ECDL will take place in Paris, France, September 22-24, 1999 at the Bibliotheque Nationale de France (BNF). The conference's main objective is to bring together researchers from multiple disciplines to present their work on enabling technologies for digital libraries. The conference also provides an opportunity for scientists to develop a research community in Europe focusing on digital library development.

AIIM Show
April 12-15, 1999
The Association for Information and Image Management (AIIM) is an international organization for the information management community. The annual AIIM show, held this year in Atlanta, Georgia, is one of the leading events in the imaging industry.

Northeast Document Conservation Center Spring Workshops
NEDCC has announced their next workshop series, which includes School for Scanning and Preservation Options in a Digital World: To Film Or To Scan. These workshops will be of interest to anyone working on digital imaging projects. For further information contact: Gay Tracy, tracy@nedcc.org.

Ninth DELOS Workshop on Digital Libraries for Distance Learning
April 15-17, 1999
The Ninth Workshop of the DELOS Working Group will focus on issues related to Digital Libraries for Distance Learning. It will be held at the Hotel Continental, Brno, Czech Republic. For further information contact: Pavel Zezula, zezula@cis.vutbr.cz, or Pasquale Savino, savino@iei.pi.cnr.it.

IEEE ADL'99: Advances in Digital Libraries Conference
May 19-21, 1999
To be held in Baltimore, Maryland, the goal of this conference is to share and disseminate information about important current issues concerning digital library research and technology.

Third International Conference on Conceptions of Library and Information Science
(CoLIS 3): Digital Libraries: Interdisciplinary Concepts, Challenges and Opportunities
May 23 - 26, 1999
To be held in Dubrovnik, Croatia, the conference will focus on digital libraries, users, economics, and cooperation.

Computer Policy and Law Seminar
July 13 - 16, 1999
To be held at Cornell University in Ithaca, New York, this seminar will provide the opportunity to obtain skills and knowledge needed to improve collaboration between technology specialists and legal counsel. It is aimed particularly at university attorneys, judicial officers, technology administrators, risk managers, Webmasters, publications directors, and public administrators.

Announcements

Worldwide Survey of Digitised Collections in Major Cultural Institutions
The IFLA Core Programmes for Preservation and Conservation (PAC) and Universal Availability of Publications (UAP) are working together, on behalf of UNESCO, to undertake a survey of digitisation programmes in major cultural institutions, in order to establish a "virtual library" of digitised collections worldwide. The directory of digitised documents will take the form of a freely accessible database on the UNESCO Website. Information is also being collected on the preservation issues surrounding the digitisation of materials. For further information contact: Richard Ebdon, richard.ebdon@bl.uk.

The European Commission on Preservation and Access (ECPA) Digitization Survey
A second survey is being sponsored by the ECPA on the conservation and digitization of European photographic collections. ECPA is asking institutions to fill out a survey form; respondents will receive a copy of the final report.

Digital Image Distribution Cost Study Now Available
The Mellon Foundation funded a cost assessment of the Museum Educational Site Licensing (MESL) project, which has just been released. Authored by Howard Besser and Robert Yamashita, it is entitled The Cost of Digital Image Distribution: The Social and Economic Implications of the Production, Distribution, and Usage of Image Data.

Museum Educational Site Licensing (MESL) Project Reports Now Available Online
The MESL project has just released the online versions of Delivering Digital Images: Cultural Heritage Resources for Education and Images Online: Perspectives on the Museum Educational Site Licensing Project.

New LITA Publications On Digital Imaging And Metadata
The Library and Information Technology Association (LITA), a division of the American Library Association, has published two new books in the LITA Guides Series: Digital Images of Photographs: A Practical Approach to Workflow Design and Project Management, by Lisa Macklin and Sarah Lockmiller; and Getting Mileage Out of Metadata: Applications for the Library, by Jean Hudgins, Grace Agnew, and Elizabeth Brown.

California Digital Library Opens
The California Digital Library (CDL) has opened its public "digital doors" by making available an integrated Web gateway to digital collections, services, and tools. The CDL focuses on selecting, building, managing, preserving, and providing access to shared collections of high-quality digital materials for the university and its partners.

President Clinton Requests $188,500,000 for America's Museums and Libraries:
Request Includes $10 Million for New Digitization Initiative

The President's Fiscal Year 2000 budget released to the United States Congress requests $188,500,000 for the Institute of Museum and Library Services (IMLS), and includes $10 million for a new leadership initiative to support the creation of a National Digital Library for Education, part of the President's Educational Technology Initiative.

NARA Proposes Funding for Digital Records Projects
The National Archives and Records Administration has proposed adding $400,000 to its budget to begin developing a system for digital records projects, including archiving federal agencies' email. The request also would be used to build systems for preserving digital image files, and to devise a prototype for a system that researchers can use to access electronic records that NARA holds.

FAQs

Question:
What is the status of the proposal to revise the 007 field to include digital material? How in particular will it apply to digital images?

Answer:
We asked Diane Hillmann (dih1@cornell.edu), Head, Technical Services Support Unit, Cornell University Library, to respond to this question. She is a member of the Machine-Readable Bibliographic Information Committee (MARBI), which is an interdivisional committee of the American Library Association (ALA) that deals primarily with the development and maintenance of the USMARC formats. Here is her account of the status of the 007 field.

The proposal for a new MARC 007 field for digital preservation and reformatting was approved in Philadelphia during MARBI's midwinter ALA meeting. Ordinarily, changes in USMARC (now MARC 21) take six months or so to be implemented in vendor applications and bibliographic utilities. The Library of Congress's implementation of its new integrated library system may delay the process somewhat this year, though it is likely that changes made by MARBI this session will find their way into the format by the end of 1999. [Clarification note added on 22 February 1999: the early text erroneously referred to the implementation of a new interlibrary loan system; it is an integrated library system.]

The MARBI Proposal 99-01 to enhance the MARC 007 field for computer files to allow recording of preservation and reformatting information originated last year as Discussion Paper 110 (DP 110). The enhancements, first proposed by the RLG Working Group on Preservation and Reformatting Information, involved a separate 007 for preservation/reformatting information, in addition to a general 007 field for digital files. (See also report in RLG News section.) However, this approach was modified for MARBI, as similar 007s (such as the ones for microforms and moving images) are presented in one field. Most of those expressing opinions recommended that the proposal be more generalized to apply to a broader array of digital files than just digital images, with less specific coding. Other recommendations sought to bring the coding conventions more in line with other 007s, with ample room for growth and clear enough instructions for use even by the non-expert.

The approved proposal allows for an 007 field of either 8 or 14 bytes, depending on whether information on long-term preservation is needed. The information recorded in the new set of 007 values will accommodate better retrieval and management of digitally reformatted materials, and will help guide decisions to digitize materials for preservation purposes. The following section highlights the enhancements to the 007 field that have significance for the digital imaging community:

 

007/00 Category of material

c computer file (a change in the definition will make it clearer that this code is applicable for digitally reformatted files as well as all kinds of other computer files)

007/01 Specific material designation

Most of the currently available codes for this byte reflect specific computer file formats (e.g., "j" for magnetic disks, "o" for optical discs, and "r" for remote files). A new code "d" for "dynamic" was proposed to indicate a file that might change over time as new media become available for refreshing files. Rather than coding the bibliographic record for specific media (and thus having to change the coding as storage media changed), a term was needed to indicate that the choice of media was not static, augmenting options such as "unknown" or "other" already available. However, though it was agreed that "dynamic" was a better choice than "varies" (from DP 110), there was some discomfort with the term, and "unspecified" was approved as an alternative.

007/03 Color

The proposal included an additional code "b" for black and white, and a revision to code "a" which was formerly "one color" to indicate its use for white and one other color (consistent with the use of color codes for projected and non-projected graphics).

There was some concern about how textual material should be coded, particularly as it might not be viewed in black and white - the consensus was that "n" would be specified for most textual material where color was not a consideration.

007/04 Dimensions

A revision to the code definition for "not applicable" was suggested to specify the use of that code for preservation files.

007/05 Sound

No changes needed.

007/06-08 Image bit depth

The original proposal of two bytes was expanded to three, to indicate the exact bit depth of the scanned images comprising the file, or a three-character code indicating that the information cannot be recorded. Questions were raised about this section; for instance, does "not applicable" mean "no images"? If so, it was suggested that this use be clearly specified. For files that contain more than one image type, no attempt has been made to code more than one; instead, the code "mmm" for "multiple" would be used.

007/09 File formats

A similar strategy was chosen for coding file formats: the consensus was that it was impractical to code for specific file formats, so there are instead three choices: one file format, multiple file formats, or unknown.

007/10 Quality assurance target(s)

The decision to record for the presence or absence of targets (no attempt is made to be more specific about the kinds of targets) reflects a divergence of practice from microforms, where standards are well established, and no such coding is considered necessary. The presence of targets suggests that serious attempts were made to assure quality capture of the original information.

007/11 Antecedent/Source

This byte, reflecting the source of the digital reproduction, allows some assessment of quality and potential use. Coding can reflect original sources, microforms, other computer files, other intermediates, and mixed sources. One issue that arose in discussion revealed the complexity of the definition of "original": if a microform was created from computer files (as in the case of COM, Computer Output Microform), does it then qualify as an original for coding purposes? It was concluded that it might not, and that a specific disqualification for microform might be defined for code "a," which should differentiate a reproduction from an original source document in such cases.

007/12 Level of compression

Here the codes reflect the ability of the file to be delivered with fidelity to the original, using codes to indicate whether or not the file is compressed, and whether the compression technique is of a type defined as "lossy" or "lossless." Discussion on this byte centered around the question of whether the code "not applicable" implied something different than "uncompressed" or "unknown." As it was determined that it did not, it will be eliminated from the final version.

007/13 Reformatting quality

This byte was intended to reflect the same sort of information that "generation" implies in the microform 007 field, with its distinctions between master and service copies. One major difference with digital files is the lack of currently applicable quality standards. This led to an attempt to code for intention, rather than reference to objective measures, which caused some concern amongst the MARBI members.

An example of this was the questions raised about code "p," defined as "... a file created via reformatting to help preserve the original item. The capture and storage techniques associated with preservation files ensure high-quality, long-term computer files that warrant long-term preservation." Code "r" is defined for files intended to replace the original item, should the original be lost, damaged, or destroyed. The question was raised whether this might be a moving target of sorts, in the absence of generally agreed-upon standards. Further, if intention were important, should not a date be associated with this kind of assessment? There was also some discussion of whether this intent would not be better expressed in a 583 field.
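
To summarize the byte layout discussed above, the short Python sketch below splits a 14-byte 007 (c) value into the positions named in this article. The positions follow this discussion of Proposal 99-01; the approved MARC 21 definitions may differ in detail, position 02 (not discussed above) is shown as undefined, and the sample value mixes codes named in the article with "?" placeholders where no code is given.

    # Byte positions of the enhanced 007 (c) field as described above; the final
    # MARC 21 definitions may differ. Position 02 is left as "undefined" here.
    POSITIONS = [
        ("00",    "category of material",          slice(0, 1)),   # "c" = computer file
        ("01",    "specific material designation", slice(1, 2)),
        ("02",    "undefined",                     slice(2, 3)),
        ("03",    "color",                         slice(3, 4)),
        ("04",    "dimensions",                    slice(4, 5)),
        ("05",    "sound",                         slice(5, 6)),
        ("06-08", "image bit depth",               slice(6, 9)),   # exact depth or "mmm"
        ("09",    "file formats",                  slice(9, 10)),
        ("10",    "quality assurance target(s)",   slice(10, 11)),
        ("11",    "antecedent/source",             slice(11, 12)),
        ("12",    "level of compression",          slice(12, 13)),
        ("13",    "reformatting quality",          slice(13, 14)),
    ]

    # Hypothetical value: "c" computer file, "u" unspecified medium, "b" black and
    # white, "001" 1-bit, "p" preservation intent; "?" marks positions for which
    # this article names no code.
    sample = "cu?b??001????p"

    for position, name, span in POSITIONS:
        print(f"007/{position:<5} {name:<30} {sample[span]!r}")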

RLG News

New Report on RLG Members' Digital Preservation Needs
A new survey and analysis of RLG members' current practices, needs, and plans for preserving their growing collections of digital holdings is now available on the Research Libraries Group Web site. The report is available in two versions: an HTML document for online browsing or a PDF document optimized for printing.

The study, conducted during 1998 by Margaret Hedstrom, Associate Professor at the School of Information, University of Michigan, and Sheon Montgomery, Graduate Research Assistant, is based on an extensive written survey - to which 54 members responded - plus phone interviews with over a dozen collection administrators. The result is an up-to-date, carefully interpreted picture of the current state of digital preservation and the key concerns and expectations from an international cross-section of the RLG membership.

RLG's primary purpose in commissioning the study was to assess how members are managing their digital holdings and to determine what they need to be successful. The results will assist RLG in developing the kinds of resource-sharing mechanisms, training, and services that best serve members' needs.

The published report, Digital Preservation Needs and Requirements in RLG Member Institutions, presents findings under four headings - Responsibilities, Policies and Practices, Staffing and Expertise, and Needs and Requirements. The authors provide a concise set of recommendations for RLG and for individual institutions, and also address service providers.

This survey is one of several steps RLG and its members have taken in response to the 1996 study, Preserving Digital Information: Report of the Task Force on Archiving of Digital Information, that RLG jointly sponsored with the Commission on Preservation and Access (which has since merged into the Council on Library and Information Resources).

Vote at ALA Signals Future Change in Recording Bibliographic Information for Digitally Reformatted Files
Over the past ten years, digitization has moved from research and development projects to full-scale conversion projects. The digitization of traditional library and archival materials such as books and manuscripts is being joined by the digitization of recorded sound and motion pictures, creating tens of thousands of computer files in the process. During this same time, the community has struggled to describe the resulting objects adequately, both to allow access and to record pertinent information about the creation process.

Traditionally, this kind of metadata has been recorded in bibliographic records. For materials reformatted to microforms, MARC records have been the key to sharing information about items that have been reformatted. When contributed to national bibliographic databases, the MARC records for preservation microforms help avoid duplicative efforts and provide new or enhanced access to titles or collections not previously in the database, assisting research and acquisitions. While new cataloging conventions were drafted to adequately describe the growing volume of electronic material (items "born" digital) acquired by institutions, until recently cataloging conventions did not specifically allow adequate information to be recorded for digitally reformatted materials.

For the last two years, the RLG Preservation and Reformatting Information Working Group has been working to identify the current practices used to describe actions and decisions taken in support of digitization projects; the core information required to manage and maintain access to the digital files over time; and the terminology and methods needed to record such information in a bibliographic record so that it is easily retrievable.

On 31 January 1999, at the American Library Association's Midwinter conference, a proposal from the working group was passed by the ALA MARBI committee, which advises the Library of Congress on extensions to the Machine-Readable Cataloging (MARC21) format. Proposal 99-01, Enhancement of Computer File 007 for Digital Preservation/Reformatting, requested changes to the existing MARC 007 (c) field to encode information about digitally reformatted items.

Detailed information is available in Proposal 99-01 through the MARC Standards Office at the Library of Congress. A few small modifications were made to the proposal immediately prior to approval and will be incorporated in an update to the MARC21 Format for Bibliographic Data. We anticipate that the 007(c) values will be published in the year 2000 update, with implementation by systems supporting the national databases (Library of Congress, RLG, and OCLC) sometime thereafter.
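As a rough illustration of how a local system might eventually make use of the new values, the sketch below reads a reformatting-quality code out of a hypothetical electronic resource 007 string. Because small modifications were made at approval and the update has not yet been published, the character position, fill characters, and code labels shown here are assumptions for illustration only.

```python
# Minimal sketch of how a local system might interpret the reformatting-quality
# value in an electronic resource (computer file) 007 string once the approved
# changes are published. The character position and fill characters used here
# are assumptions for illustration; consult the published MARC21 update for the
# authoritative layout and code list.

REFORMATTING_QUALITY_POS = 13  # assumed position, not authoritative

CODE_LABELS = {
    "a": "access",
    "p": "preservation",
    "r": "replacement",
    "n": "not applicable",
    "u": "unknown",
}

def reformatting_quality(field_007: str) -> str:
    """Return a label for the reformatting-quality code, if one is present."""
    if not field_007.startswith("c") or len(field_007) <= REFORMATTING_QUALITY_POS:
        return "not coded"
    return CODE_LABELS.get(field_007[REFORMATTING_QUALITY_POS], "undefined code")

# Hypothetical 007 value: "c" (electronic resource), "r" (remote), placeholder
# fill characters, then "p" (preservation) in the assumed final position.
example_007 = "cr" + "|" * 11 + "p"
print(reformatting_quality(example_007))  # -> "preservation"
```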

This proposal results from a two-year effort by working group members from the British Library, Columbia University, the European Register of Microform Masters, the Library of Congress, the National Library of Australia, the National Library of Canada, the University of Toronto, and the University of Leeds. Another RLG member advisory group, the RLIN Database Advisory Group (with representatives from Cornell University, Emory University, the Getty Information Institute, Harvard University, New York University, Princeton University, the University of Cambridge, and Yale University), also reviewed earlier drafts of the proposal. RLG applauds the efforts of these contributors. (For more information on the MARBI discussion of the 007 proposal, see the FAQs section.)

Hotlinks Included in This Issue

Feature Articles
AGORA: http://www.agora.de
Anagramm GmbH, Germany: http://www.anagramm.de
Bavarian State Library in Munich: http://193.174.98.10/scan1/MDZ/
Council on Library and Information Resources: http://www.clir.org/pubs/archives/archives.html
Ebind project: http://sunsite.berkeley.edu/Ebind/index.html
final report: http://www.sub.uni-goettingen.de/ebene_2/vdf/einstieg.htm
FineReader: http://www.mitcom.de
GDZ's Web site: http://www.sub.uni-goettingen.de/gdz/en/tech_notes/tiffheader.html
German Research Foundation (Deutsche Forschungsgemeinschaft): http://www.dfg.de/
Participation in Digitization of European PERiodicals Project (DIEPER): http://www.SUB.Uni-Goettingen.de/gdz/dieper/
PICA-GBV (Gemeinsamer Bibliotheksverbund) network: http://www.gbv.de/homepage.html
PixEdit: http://www.pixedit.com
SRZ ProScan Book: http://www.agora.de/CompAll.htm
University of Göttingen (Göttinger Digitalisierungs Zentrum, GDZ): http://www.SUB.Uni-Goettingen.de/gdz

Technical Feature
American Memory and National Digital Library: http://memory.loc.gov/ammem/ftpfiles.html
Manuscript Digitization Demonstration Project: http://memory.loc.gov/ammem/pictel/index.html

Highlighted Web Site
Open Information Interchange (OII): Archiving Standards: http://www2.echo.lu/oii/en/archives.html

Calendar of Events
AIIM Show: http://www.aiim.org/events/aiim99/
The Challenge of Image Retrieval CIR'99: Second UK Conference on Image Retrieval: http://www.unn.ac.uk/iidr/conference.html
Computer Policy and Law Seminar: http://www.sce.cornell.edu/EXEC/html/cpl.html
IEEE ADL'99: Advances in Digital Libraries Conference: http://cimic.rutgers.edu/~adl/
New Challenges for Scholarly Communication in the Digital Era: Changing Roles and Expectations in the Academic Community: http://www.arl.org/scomm/conf.html
Northeast Document Conservation Center Spring Workshops: http://www.nedcc.org/confer.htm
Third International Conference on Conceptions of Library and Information Science (CoLIS 3): Digital Libraries: Interdisciplinary Concepts, Challenges and Opportunities: http://www.colis3.hr

Announcements
California Digital Library Opens: http://www.cdlib.org
Digital Image Distribution Cost Study Now Available: http://sunsite.berkeley.edu/Imaging/Databases/1998mellon/
The European Commission on Preservation and Access (ECPA) Digitization Survey: http://www.knaw.nl/ecpa/form.htm
IMLS Website: http://www.imls.fed.us/
The Museum Educational Site Licensing Project (MESL) Project Report is Now Available: http://www.ahip.getty.edu/mesl/reports/final_reports.html
The National Archives and Records Administration Proposes Funding for Digital Records Projects: http://www.nara.gov
New LITA Publications On Digital Imaging And Metadata: http://www.lita.org/litapubs/index.html
Worldwide Survey of Digitised Collections In Major Cultural Institutions: An IFLA PAC/UAP Joint Project: http://www.ifla.org/VI/2/p1/miscel.htm

FAQs
Discussion Paper 110 (DP 110): http://www.loc.gov/marc/marbi/dp/dp110.html
MARBI Proposal 99-01: http://lcweb.loc.gov/marc/marbi/1999/99-01.html
RLG Working Group on Preservation and Reformatting Information: http://www.rlg.org/preserv/pri-intro.html

RLG News
Council on Library and Information Resources: http://www.clir.org
Digital Preservation Needs and Requirements in RLG Member Institutions: http://www.rlg.org/preserv/digpres.html
Preserving Digital Information: Report of the Task Force on Archiving of Digital Information: http://www.rlg.org/ArchTF/
Proposal 99-01: http://www.loc.gov/marc/marbi/1999/99-01.html
RLG Preservation and Reformatting Information Working Group: http://www.rlg.org/preserv/pri-intro.html

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR), it is available internationally via the RLG PRESERV Web site (http://www.rlg.org/preserv/). It will be published six times in 1999. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell at bl.jlh@rlg.org, RLG Corporate Communications, when citing RLG DigiNews.

Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Oya Y. Rieger; Production Editor, Barbara Berger; Associate Editor, Robin Dale (RLG); Technical Support, Allen Quirk.

All links in this issue were confirmed accurate as of February 12, 1999.

Please send your comments and questions to preservation@cornell.edu.

