RLG DigiNews


December 15, 1998, Volume 2, Number 6, ISSN 1093-5371

Table of Contents

 

Feature Article

Digital Archiving: Approaches for Statistical Files, Moving Images, and Audio Recordings

Introduction
Oya Y. Rieger, Co-Editor,
RLG DigiNews
oyr1@cornell.edu

One of the features that makes digital archiving an overwhelming effort is the richness of the digital terrain. What we simply refer to as digital information is indeed a deep and complex combination of materials, including text, images, sound, video, and numeric and spatial files. The reason we group this wide range of materials under the umbrella of digital information is that they are all composed of "bits and bytes." As articulated in a recent British Library Research and Innovation Centre report, the type of material to be preserved is one of the key factors governing the choice of a preservation approach.(1) Although there is significant overlap in the issues faced with all digital materials, there are also problems unique to each format. For example, a simple ASCII text file is in a standard format with limited hardware and software dependencies. On the other hand, a numeric file that includes statistical data in a predetermined format, including macros for calculations, is much more challenging as a result of several interdependencies in its technical environment.

The October 1998 issue of RLG DigiNews included an article by Margaret Hedstrom reviewing national initiatives in digital preservation. This month, we offer readers a comparative digital archiving view from a different perspective, by type of digital material. We invited representatives from three digital archives, each with a different digital material type (statistical files, moving images, and audio), to describe the preservation issues they are addressing. We also asked them to articulate the attributes of these formats that represent unique challenges to the long-term preservation of their collections.

Numeric data archivists have helped pioneer solutions to long-term preservation challenges. The first contribution is by Simon Musgrave and Bridget Winstanley from The Data Archive (Essex, UK). They describe the challenges involved in archiving numeric files, highlighting metadata and privacy issues.

Next, Thom Shepard describes the underlying philosophy of the Universal Preservation Format (UPF) initiative spearheaded by WGBH, the public broadcasting station of Boston, Massachusetts. With more than 160,000 hours of video programming, WGBH has first-hand knowledge of the threat of technological obsolescence for film and video materials. Although the UPF model is based on moving images, the goal behind the initiative is to create a prototype that can be used universally for all types of digital materials.

Svein Arne Brygfjeld and Svein Arne Solbakk contribute the third segment on a collaborative digital audio archive initiative between the National Library of Norway and the Norwegian Broadcasting Corporation. They describe the prototype created to support storage, distribution, and archiving of digital audio files.

Notes
(1) British Library Research and Innovation Centre.
A Framework of Data Types and Formats, and Issues Affecting the Long Term Preservation of Digital Material, 1997, ed. John C. Bennett, http://www.ukoln.ac.uk/services/papers/bl/jisc-npo50/bennet.html.


Archiving Statistical Data: The Data Archive at the University of Essex
Simon Musgrave
Acting Director, The Data Archive
simon@essex.ac.uk
and
Bridget Winstanley
Information Director, The Data Archive
bridget@essex.ac.uk

The Data Archive at the University of Essex houses the largest collection of accessible computer-readable data in the social sciences and humanities in the United Kingdom. It is a national resource centre, disseminating data throughout the United Kingdom, and internationally through arrangements with national archives. It is funded by the Economic and Social Research Council, the Joint Information Systems Committee (JISC) of the Higher Education Funding Councils, and the University of Essex. Founded in 1967, it now houses more than 5,000 data sets of interest to researchers in all sectors and from many different disciplines. The data housed in the Data Archive come in a wide variety of formats, including surveys, censuses, registers, and aggregate data.

Digital Archiving Challenges Specific to Numeric Files
The main reasons that archiving numeric files is particularly challenging are outlined in the sections that follow.

Numerical Data and Metadata
Numeric data are incomprehensible unless they are accompanied by documentation that explains their context. This documentation can be divided into three main parts:

Historically, data and metadata were separated into two files, with the data file carrying only limited internal metadata in the form of variable lists and labels. Occasionally there was also accompanying paper documentation with broader but vital information about the survey and analysis design. This approach required the preservation of both of these materials, representing considerable challenges in maintaining information on mixed media. In recent years, great efforts have been made to digitise the backlog of paper documentation, and to agree on a standard encoding for the metadata. The Data Documentation Initiative (DDI), led by the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan, aims to standardise an SGML- and XML-based structure of data description to allow intelligent data browsing and retrieval. This initiative also forms a standard for the dissemination and preservation of numeric data.

Format and Software Obsolescence
Data analysis software has grown powerful and complex since the days when all numerical data were collected as "flat" files to be analysed using simple tools such as SPSS, a statistical analysis software package. These files could be preserved in ASCII format, an ideal standard, without loss of meaning. In more recent years, carefully designed relational or hierarchical data files have taken their place. For preservation purposes (to avoid software dependency that may make the data unreadable in the future), data files can be written out as ASCII files with key identifiers and documentation describing their meaning. However, the structure itself is also meaningful: relational databases are designed very carefully to reflect the source data and the research needs, and are usually put together using an explicit model, so subsequent researchers need to be able to reconstruct that structure. The Archive has concentrated on preserving the structure in published formats (such as the SPSS export format) that are ASCII-based, so that they can be read into many packages. However, the SPSS export format is a proprietary standard, and the adoption of the DDI Document Type Definition (DTD) will enable these structures to become an archival as well as a dissemination standard.
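
To make this concrete, the following sketch (a simplified illustration, not the Archive's actual procedure; the file names and variables are invented) writes a handful of survey records out as a plain ASCII file together with a minimal ASCII data dictionary documenting each variable:

    import csv

    # Hypothetical survey records; each row keeps its key identifier (case_id)
    # so that relationships to other files can be reconstructed later.
    records = [
        {"case_id": "0001", "age": 34, "income_band": 3, "region": "EAST"},
        {"case_id": "0002", "age": 51, "income_band": 5, "region": "NORTH"},
    ]

    # Minimal data dictionary: one entry per variable, with a label and type.
    data_dictionary = {
        "case_id": "Unique respondent identifier (string)",
        "age": "Age of respondent in completed years (integer)",
        "income_band": "Gross household income band, 1 (lowest) to 6 (integer)",
        "region": "Region of residence at time of interview (string)",
    }

    # The data go into a plain ASCII, comma-separated file ...
    with open("survey_data.txt", "w", newline="") as data_file:
        writer = csv.DictWriter(data_file, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)

    # ... and the documentation goes alongside it, also as plain ASCII.
    with open("survey_codebook.txt", "w") as codebook:
        for name, label in data_dictionary.items():
            codebook.write(f"{name}: {label}\n")

Keeping the identifiers in the data file and the variable descriptions in an accompanying ASCII file is what allows a later researcher, or a migration process, to reconstruct the meaning of the records without the software that produced them.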

Confidentiality and Ownership
The legal context in which social science data exists varies to some extent across national boundaries. Nevertheless, the archiving of data arising from social surveys must take into consideration the issue of confidentiality and data protection relating to the individuals surveyed, and also the questions surrounding institutional ownership and responsibility for long-term archiving.

Social science microdata contains details of respondents that may be used to identify individuals unless measures are taken to prevent this. At the time of data collection, these individuals will normally have received assurances that their answers to questions about themselves, their families, incomes, and sometimes their behaviour and attitudes will not be divulged in a way that allows them to be identified. Making the data available for further research, which is the prime aim of social science data archives, poses a risk to these assurances, which in turn threatens the flow and quality of information from respondents. It is therefore unusual for archives to receive from data producers the full extent of the data collected. The data will have been anonymised by the removal of variables, which in many cases limits the extent of further research. This is inevitable for the present, but in 100 years, when the need for confidentiality has passed, the loss of these variables will mean a loss of information for posterity.

Although data resulting from censuses and surveys by national governments are now safeguarded, a large amount of research data are being lost because the issues of ownership, copyright, and future preservation are not addressed at the beginning of research projects. (1) When the ownership of the data is not clear-cut, and explicit arrangements for future preservation are not made at the beginning of the research project that generates the data, the data collectors or producers are reluctant to place them in a data archive for safekeeping. Subsequently the data often disappear through deliberate or accidental deletion when the research team that collected them is disbanded.

Potential Solutions
The key issue in the rapidly developing world of the Internet is standards. Standards are important because they enable archivists to adopt common and tested procedures to safeguard the data. They also enable users to move seamlessly and powerfully among resource collections, identifying and potentially using all the available resources. For example, the adoption of the Dublin Core for the top level of metadata facilitates the identification of a whole variety of resources. The adoption of more detailed standards, such as the DDI, then enables the resources to be stored in ways that fully preserve the structure in a non-proprietary manner. This facilitates access via sophisticated tools such as those being developed in the European Commission-funded
Networked European Social Science Tools and Resources (NESSTAR) project.
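
To illustrate how the two levels of standards differ (the values below are invented for this example and do not describe a real study), a top-level Dublin Core description reduces to a small set of element-value pairs, which is what makes it usable across very different kinds of resources, while variable-level structure is left to a richer standard such as the DDI:

    # The fifteen Dublin Core elements describe a resource only at a high level;
    # a standard such as the DDI is needed to preserve variable-level structure.
    dublin_core_record = {
        "Title": "Household Survey of Example County, 1995",
        "Creator": "Example Survey Research Unit",
        "Subject": "Households; income; housing",
        "Description": "Annual multi-purpose household survey (illustrative entry)",
        "Date": "1995",
        "Type": "Numeric data (survey microdata)",
        "Format": "ASCII data file with SPSS export file",
        "Identifier": "Study number 1234 (hypothetical)",
    }

    for element, value in dublin_core_record.items():
        print(f"{element}: {value}")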

There are no easy short cuts on the wider issues of media obsolescence. The fundamental issue is that data should be stored within long-standing organisations that understand the need to manage the data with a careful eye on the changing needs of users and the limitations of the technology. Because of rapid changes in hardware and software, the key word for statistical data is migration. The data and accompanying documentation must be migrated to later or different versions of the required hardware and software cautiously, to ensure that no loss of information occurs.

Solutions to the legal issues require informational programmes to alert funders and researchers to the value of long-term preservation. It is important to ensure that all the parties involved in the funding and practice of research are aware of the heritage that is lost when long-term preservation decisions are not made at the outset of projects. There is also a need for resources to disseminate information about preservation, and to provide the physical infrastructure for preservation within accredited organisations.

Notes
(1) The British Library and the Joint Information Systems Committee of the Higher Education Funding Councils (UK) have recently produced a programme of studies on digital archiving. Of particular interest will be British Library Research and Innovation Report 109, An Investigation into the Digital Preservation Needs of Universities and Research Funders: the Future of Unpublished Research Materials (1998), soon to be available at http://www.ukoln.ac.uk/services/elib/papers/supporting/.


Universal Preservation Format (UPF): Conceptual Framework
Thom Shepard, Project Coordinator, WGBH Educational Foundation
thom_shepard@wgbh.org

In 1977, two Voyager spacecraft left Earth on a mission to explore and send back information about our solar system and beyond. Attached to each vehicle was a gold-plated phonograph record, which contained 115 images, a collection of Earth's natural sounds, greetings in 55 languages, and samples from the likes of Bach, Beethoven, and Chuck Berry. Each record included a stylus, and inscribed on its protective aluminum jacket were visual instructions on how to play it. Carl Sagan and his team of managers and engineers intended the disks as a kind of packaged time capsule for any aliens or distant cousins who might come across them several centuries from now.

Somewhere in Space Float the Keys to Digital Preservation
Voyager's Interstellar Record inspired Dave MacCarn, Chief Technologist at the WGBH Educational Foundation, when he considered the long-term storage of digital media. As a major public broadcasting station and content producer, WGBH has developed one of the most significant media archives in the industry. This unique collection not only has obvious production and historical value, but also serves as a continuing source of revenue. Unfortunately, these very fragile tapes come in a variety of formats, many requiring antiquated machines to access their contents. Despite new media's promise of easy access and portability, digital technologies are compounding the problem of long-term storage by flooding the marketplace with new file formats and proprietary storage devices.

MacCarn examined the underlying technologies of some prominent acquisition applications and file formats, and found the basis for a "no frills but robust" storage mechanism that might be developed exclusively for digital archives. Based in part on Apple's Bento Container (a technology that allows media content to be exchanged without modification among various computer platforms), MacCarn's preservation format would identify its contents as independent of the computer applications that created them, the operating system from which those applications originated, and the physical media upon which they are stored.

Voyager's Interstellar Record has a significant advantage over current digital storage. Unlike a message in, say, Morse Code, a message inscribed on a phonograph record can be retrieved because there is an analogy between how the information is stored and fundamental principles of physics. For digital information to be readable, an intermediary interpreter is needed. MacCarn's notion of a "self-describing" digital file format is to include within its embedded metadata all the technical specifications required to access its contents. In effect, these stored algorithms would constitute a standardized blueprint for reconstructing both the data types and the physical mechanism upon which the data is recorded.
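
A minimal sketch of the self-describing idea might look like the following (an illustration only, not the actual UPF specification; the metadata fields and the simple length-prefixed layout are invented for this example):

    import json

    def wrap(payload: bytes, technical_metadata: dict) -> bytes:
        """Package a bit stream together with an embedded description of how
        to interpret it (format, sample rate, byte order, and so on)."""
        header = json.dumps(technical_metadata).encode("ascii")
        # An 8-byte length field lets a future reader find the boundary between
        # the metadata and the payload without any external documentation.
        return len(header).to_bytes(8, "big") + header + payload

    def unwrap(container: bytes):
        header_length = int.from_bytes(container[:8], "big")
        metadata = json.loads(container[8:8 + header_length])
        return metadata, container[8 + header_length:]

    # Hypothetical example: a few audio samples plus the specifications a
    # future system would need in order to reconstruct the sound.
    specs = {"media_type": "audio", "encoding": "PCM", "bits_per_sample": 16,
             "sample_rate_hz": 48000, "channels": 2, "byte_order": "little-endian"}
    container = wrap(b"\x00\x01\x02\x03", specs)
    recovered_specs, recovered_payload = unwrap(container)

The point is that everything needed to interpret the payload travels inside the same container, rather than in a separate manual or in the memory of the application that created the file.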

Both Dave MacCarn and Jeff Rothenberg of the RAND Corporation promote the concept of the digital container, which Rothenberg calls the "virtual envelope." (1) These transport mechanisms would contain the bit streams along with descriptions of content and "transformation history." MacCarn's self-describing file format contrasts with Rothenberg's emulation proposal in other specifics. The UPF does not promote hardware emulation as a solution for long-term storage of today's data types. Current applications increasingly place special demands on hardware. Computer hardware consists of many components - RAM, video and sound cards, even analog converters - as well as Web and proprietary architectures, so it is not enough to emulate a generic processor. In the past, the more successful emulators have depended upon the installation of special hardware helpers: cards containing proprietary chipsets or ROMs (Read Only Memory) that hold the original operating system. Rothenberg's solution has demonstrable merit for ancient proprietary systems, and adoption of his ideas might salvage volumes of material that might otherwise be lost. The UPF initiative, on the other hand, is concerned with storing digital materials being created today. In short, Rothenberg's solution is retroactive; the UPF solution is proactive.

Inception of the Universal Preservation Format
Awareness of the need for a universal preservation standard grew out of meetings between MacCarn and the Director of the Media Archives, Mary Ide. Concerns were expressed not only for moving image material, but also for the many other data types that the station generates: audio, text, database files, captioning and descriptive video information, and the whole gamut of original digital content produced for the World Wide Web.

Two years ago, the WGBH Educational Foundation was awarded a grant from the National Historical Publications and Records Commission of the National Archives to produce a "recommended practice." The goal of the initiative is to prototype a platform-independent Universal Preservation Format (UPF), designed specifically for digital technologies that will ensure the accessibility of a wide array of data types - especially video formats - into the indefinite future.

Goals of the Initiative
One of the goals of the project is to bring together technology manufacturers and archivists to determine a UPF that meets the needs of both non-commercial and commercial interests. The UPF concept has been presented to a number of engineering, computer, and archival groups. The
UPF Web site includes a variety of related information, including the results of a user survey and minutes from UPF Study Group sessions within the Society of Motion Picture and Television Engineers. These quarterly meetings bring together engineers and archivists to exchange ideas, voice concerns, and to help untangle the semantics that have hindered effective dialogue in the past. By concentrating on elemental concepts of how data and information about that data might be stored through time, the Universal Preservation Format initiative is working toward a self-describing mechanism that will be as durable as the Voyager's "long playing" record.

Notes
(1) For a discussion of Rothenberg's digital encapsulation concept for digital archiving, see:
Rothenberg, Jeff. "Ensuring the Longevity of Digital Documents,"
Scientific American, January 1995, vol. 272, no. 1, pp. 42-47. A report, "Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation," will be available from the Council on Library and Information Resources early in 1999.

Suggested Readings
MacCarn, Dave. "Toward a Universal Data Format for the Preservation of Media," SMPTE Journal, July 1997, vol. 106, no. 7, pp. 477-479, http://info.wgbh.org/upf/papers/SMPTE_UPF_paper.html.

Shepard, Thom and MacCarn, Dave, "The Universal Preservation Format: Background and Fundamentals," European Research Consortium for Informatics and Mathematics, Sixth DELOS Workshop, Preservation of Digital Information, June 1998, pp. 45-51.

Shepard, Thom. "Introduction to the Universal Preservation Format," Archival Outlook, July/August 1998, p. 24.

Shepard, Thom. "Universal Preservation Format Update," D-Lib Magazine, November 1997, http://www.dlib.org/dlib/november97/11contents.html.


Norwegian Digital Radio Archive Initiative
Svein Arne Brygfjeld, Adviser, National Library of Norway
svein.arne.brygfjeld@nbr.no
and
Svein Arne Solbakk, Head of IT, National Library of Norway
svein.solbakk@nbr.no

The National Library of Norway (NLN) is responsible for the long-term preservation and dissemination of a wide variety of materials, ranging from traditional textual information to digital television broadcasts. The Norwegian Broadcasting Corporation (NRK) is the main public broadcasting company in Norway, broadcasting both radio and television. The Digital Radio Archive initiative was launched to address the two organizations' common needs and their desire to establish an infrastructure capable of handling large volumes of digital audio files. In an attempt to create an open and modern digital archiving architecture, a prototype project was carried out in early 1998 to test the conceptual and technical elements of the envisioned common audio archive.

Long-Term Preservation Issues
As analog audio tapes degenerate with age, it is important to evaluate long-term preservation of audio based on digital formats. Unfortunately, long-term preservation of digital audio also has some cumbersome aspects. The most critical factors are the quality of the digital audio, the audio format, and the storage medium.

The quality has to be high enough to reproduce the characteristics of the original sound. For an old sound recording that is in poor condition, a resolution of 16 bits and a sampling frequency of 22 kHz mono may be more than acceptable, while it is questionable whether 16 bits and 48 kHz stereo is capable of reproducing the characteristics of an analog sound recording in excellent condition. One also has to consider whether the digitized sound should be "cleaned" using digital techniques (with the risk that some of the audio information might be lost during this process), or whether this processing should be postponed in case better digital noise reduction systems are developed in the future.

When choosing an audio format, one has to consider the expected lifetime of the format, and how easy it will be to migrate to a new format in the future. A widely used format supported by a variety of computer platforms should be chosen. For long-term preservation, lossy compressed formats should be avoided.

Another consideration is the limited lifetime of storage devices. Optical storage is argued to have a lifetime that exceeds the lifetime of magnetic disks and tapes. However, current optical storage options cannot compete with magnetic devices when it comes to storage capacity. In addition to the media longevity issues, one has to take into account the lifetime of the technology needed to access and retrieve the information, such as mechanical devices (e.g., robot and tape readers), interfaces between the storage device and the computer (e.g., SCSI or fibre channel), the software needed to run the mechanical device, and the support for this software in the computer's operating environment.

The Prototype: Goals and Challenges
The main challenge in building a digital audio archive is the data volume, as digital audio files are extremely large. Retrieving information from the archive is not a trivial task, and requires a strong set of metadata to mediate the searching process. In addition, since audio is a continuous information type, there is a need for technical solutions that support streaming of audio to meet access needs.
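
A back-of-envelope calculation (not a figure reported by the project) shows why volume dominates the design: uncompressed 16-bit, 48 kHz stereo audio consumes roughly 0.7 GB per hour, so tens of thousands of hours quickly add up to tens of terabytes.

    # Storage required for uncompressed PCM audio.
    bits_per_sample = 16
    sample_rate_hz = 48_000
    channels = 2

    bytes_per_hour = bits_per_sample / 8 * sample_rate_hz * channels * 3600
    print(f"{bytes_per_hour / 1e9:.2f} GB per hour")      # about 0.69 GB/hour

    for hours in (30_000, 50_000):
        print(f"{hours} hours -> {bytes_per_hour * hours / 1e12:.1f} TB")
        # roughly 21 TB and 35 TB respectively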

The objective of the prototype is to build a digital radio archive based on a selection of radio programs and their metadata from the historical archives of NRK. The prototype has the following characteristics:

Both the National Library of Norway and the Digital Radio Archive are located at Mo i Rana, just below the Arctic Circle and some 1,000 kilometers from the Norwegian Broadcasting Corporation headquarters in Oslo. One of the goals of the prototype is to overcome the impediments of remote location by making all features of the database available to NRK, NLN, and the Digital Radio Archive.

The Prototype: Technical Choices
A UNIX system (Hewlett-Packard) with two RAID disk arrays comprising approximately 500 Gb of storage is used in the prototype. Software from Xing and RealNetworks is used for audio streaming. Digitization and format conversions are performed on Silicon Graphics workstations. The systems are interconnected via 155 Mb/s ATM and 100 Mb/s Ethernet. ATM access is used for Wide Area Network communication. The metadatabase runs on a DEC Unix server. Audio files are identified using record identifiers from the metadatabase. Binding between metadata and audio is achieved on the fly during Web delivery.
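
The on-the-fly binding can be pictured as a lookup at delivery time: the Web layer takes a record identifier from the metadatabase and only then constructs the reference to the corresponding audio stream. The sketch below is purely conceptual; the identifiers, URL pattern, and host name are hypothetical and are not those of the actual prototype.

    # Hypothetical metadatabase rows, keyed by record identifier.
    metadatabase = {
        "NRK-1945-0012": {"title": "Liberation day broadcast, 8 May 1945", "duration_min": 28},
        "NRK-1960-0733": {"title": "Radio theatre: Peer Gynt", "duration_min": 95},
    }

    STREAM_SERVER = "http://audio.example.no/stream"   # hypothetical streaming host

    def deliver(record_id: str) -> dict:
        """Bind metadata and audio at delivery time: the audio URL is derived
        from the record identifier rather than being stored in the metadatabase."""
        record = metadatabase[record_id]
        return {**record, "audio_url": f"{STREAM_SERVER}/{record_id}.wav"}

    print(deliver("NRK-1945-0012"))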

The Prototype: Issues
Currently, the quality of the digitized audio is quite high, and seems to be suitable for long-term preservation. A decision was made not to perform any noise reduction during digitization for the first phase of the project. The project team believes that the chosen file format standard (Broadcast Wave Format) is also adequate for long-term preservation of the audio files. It is supported by a variety of computer platforms, and is expected to be widely used, as it has been endorsed as a standard by the European Broadcasting Union. Therefore, it is very likely that well-documented migration paths will be available to support future preservation needs.

One of the major successes of the prototype is the simple binding it enables between the existing metadatabase and the audio files. The Archive's experience has proven that having the audio on-line, even in low quality, greatly improves the value of the metadatabase. The project team was surprised to discover that easy access to the audio also improves the quality of the metadatabase itself, since users now correct its content as they listen to the audio.

Future of the Digital Radio Archive
Based on the prototype's success, NRK and NLN are ready to implement a common digital radio archive. In Phase I, the archive will have a volume of 30,000-50,000 hours of audio. It will serve not only as a historical archive and a legal deposit of radio broadcasts, but also as a system to support temporary storage for radio broadcast production. The archive will be available for research and education purposes within the limitations of copyright restrictions. Eventually, access will be broadened to include the general public. A system module to manage copyright information, including the compensation of copyright owners when audio is used, will also be implemented.

One of the interesting research questions behind the Digital Radio Archive is related to distributed processing and access. In addition to the technical links that will enable shared processing of audio files, this approach will require the investigation of standard metadatabase formats (e.g., Dublin Core) and common naming schemes for audio files (e.g., URN).

Suggested Readings
Digital Radio Perspectives for the Future, http://www.nbr.no/drlpub/digradioeng.htm

Technical Feature

Selecting a Digital Camera: the Cornell Museum Online Project
Peter B. Hirtle
Assistant Director, Cornell Institute for Digital Collections
pbh6@cornell.edu
and
Carol DeNatale
Registrar, Herbert F. Johnson Museum of Art, Cornell University
cad11@cornell.edu

Digital cameras have been attracting more and more attention as input devices for digital conversion projects. The potential advantages of a digital camera over the traditional flatbed scanner appear great. The primary advantage is the ability of a digital camera to handle materials of different sizes and shapes. With a flatbed scanner, one is limited to flat two-dimensional objects. Further, the size of the object is limited to the platen size of the scanner (unless one scans a document in sections, and then uses software to stitch the individual images together). With a digital camera, the size of the original document is immaterial. One could conceivably prepare a digital surrogate of a small photograph or a giant map with the same digital camera. A digital camera can also create digital surrogates of different kinds of objects. Flatbed scanners were designed to handle the paper found in business offices, though specialized flatbed scanners have been developed for photographs, slides, and maps. A digital camera can convert anything that can be photographed, including artifacts, artworks, and buildings.

In addition to their versatility, digital cameras are becoming an increasingly cost-effective alternative to flatbed scanning. Furthermore, the quality of digital cameras continues to improve. Early digital cameras were relatively low-resolution affairs, capable only of producing a small image suitable for use on a computer. Modern digital cameras have come to rival film cameras in the quality of their reproductions. Yet while the price of low-end digital cameras continues to drop, the difference in price (and in functionality) between recreational and professional digital cameras remains high. The question naturally arises: which camera is the right choice for a library, archives, or museum that wishes to prepare digital surrogates of its holdings?

In 1997 Cornell University set out to address that question. Thanks to a generous gift to the Herbert F. Johnson Museum from an anonymous donor, the purchase of a digital camera became possible. Other gifts from Arthur Penn, additional anonymous donors, and the Intel Corporation to the Cornell Institute for Digital Collections provided funds for the staffing of a digital photo studio and equipment to manage the project, and deliver versions of the captured images to the Web. With funding in place, a project team headed by co-project leaders Carol DeNatale of the Herbert F. Johnson Museum and Peter B. Hirtle of the Cornell Institute for Digital Collections could turn their attention to the assessment and selection of a digital camera.

There are two generic approaches to the selection of a digital camera. The first is to see how much money one has to spend, and then try to buy the best camera one can for that amount. The hope, but it is only a hope, is that the selected camera will be "good enough" for the proposed uses.

A second approach, pioneered and championed by Cornell University's Department of Preservation and Conservation, is to establish benchmarks for acceptable image capture and then assess the ability of imaging equipment to meet those benchmarks. (1) Following the benchmark approach has two major advantages. First, it ensures that the equipment one purchases really is "good enough" to meet the goals of the project. If the assessment shows that equipment that can meet the goals of the project is not available at an affordable price, the purchase can be postponed. Just as importantly, by establishing benchmarks for digital image capture first, curators can ensure that unneeded digital data is not captured. While storage costs are dropping, it is still expensive to capture, store, and manage large digital images. Digital capture tied to appropriate benchmarks can ensure that any proposed digital imaging project remains cost effective.

The Cornell project, which came to be dubbed "The Museum Online" project, therefore faced two challenges before a decision could be made on which digital camera to purchase. First, we had to determine the appropriate benchmarks for a digital camera, and then we had to test cameras against those benchmarks.

The Museum Online project chose to define benchmarks based on output. We agreed that the goal of our project would be to produce digital images that could be used for high-quality printed reproductions. We took as our goal to be able to print all the images at 9" x 12" on a 200 lpi (line pairs per inch) printer. Theoretically, a digital image of at least 49 Mb could meet this need: [(9 x 12) x (200 x 2)² x 24] / 8 bytes. (2) Our preliminary review of the market indicated that either high-end scanning-back cameras or area-array cameras were capable of creating image files of that size.
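
The arithmetic behind that figure can be checked directly (a restatement of the formula in note 2):

    # Required file size for a 9" x 12" print on a 200 lpi device,
    # sampled at 2 pixels per line pair, 24 bits per pixel.
    width_in, height_in = 9, 12
    line_pairs_per_inch = 200
    pixels_per_inch = line_pairs_per_inch * 2
    bits_per_pixel = 24

    size_bytes = width_in * height_in * pixels_per_inch ** 2 * bits_per_pixel / 8
    print(f"{size_bytes:,.0f} bytes")   # 51,840,000 bytes, roughly 49 Mb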

There were several advantages to selecting printing output as the benchmark for the project. First, meeting this benchmark would allow the Museum to use the digital images in place of photographic reproductions. Most of the reproductions requested from the Museum are intended for publication in art books; with digital images that met our benchmark, it would be possible to fulfill those orders with either the digital files themselves, or prints made from the digital files. Secondly, from the information-rich file created to meet this benchmark, it would be possible to produce derivative files suitable for delivery on the Web for the foreseeable future. While currently the size and quality of images needed for the World Wide Web is much less than is needed for printed output, the size of Web-oriented files is increasing as bandwidth and processing power on the desktop both increase. The information-rich files created for print-oriented output during the course of the project are likely to have a long and useful life on the Web as well.

In addition to image quality, the speed of image capture and processing was also important. The donors to the project had insisted on an ambitious timetable. In order to meet their goals, it would be necessary to capture 60-70 images per day. The camera for the project had to be able to meet these production goals as well. Our suspicion was that only an area-array camera would be able to meet this ambitious target, but we were willing to be convinced otherwise.

Technical Assessment
With benchmarks for image quality and throughput in hand, the project could then develop an RFP for a digital photo studio. We chose not to specify hardware or software in the RFP. Instead, we focused on our desired outcomes, and asked the vendors to propose solutions. Five responses to the RFP were received. To our surprise, all the responses called for either Phase One or Dicomed cameras.

As part of the RFP review process, we wanted to assess the image quality of both cameras. Two of the RFP finalists were invited to the Cornell campus in mid-September of 1997 to conduct a "shoot-off" of their equipment in the space that was to be used for the digital photo studio and on objects from the museum's collection. One vendor, Luna Imaging of Santa Monica, California, brought a Phase One PowerPhase scan-back camera. Luna had modified the camera software so that it was possible to shoot a file half the normal size (representing 70.3 Mb), but in half the time. The other finalist demonstrated a Dicomed Big Shot 4000 area-array camera. Each vendor shot one two-dimensional item (a Dürer woodcut) and one three-dimensional item (a Tiffany vase). A Kodak gray scale target and Macbeth color checker were included in each shot; in addition, we included Megavision tone balls with the three-dimensional object.

As might be expected, there was a large difference in file size between the two cameras.

Image               Phase One     Big Shot 4000
Dürer woodcut       70.3 Mb       48 Mb
Tiffany Vase        70.3 Mb       48 Mb

The digital files created from the "shoot-off" were then assessed by several means:

The results of the test were somewhat surprising. Our optimistic hope that we could select a camera based "on the numbers" (i.e., any that could produce a 48 Mb file) turned out to be mistaken. Both of the cameras produced files equal to or larger than we thought in theory we needed, but only one of them was acceptable to the curators.

As might be expected given its larger file size, the Phase One test of the Dürer woodcut won out over the Dicomed camera in the assessment of each camera's ability to capture a two-dimensional image containing fine detail. The Phase One image demonstrated excellent sharpness and resolution. While examination of the digital file revealed a slight shift in the gray balance and more "noise" than with the Dicomed camera, the curators greatly preferred the match prints of the Dürer made from the Phase One camera to those from the Dicomed. This is in spite of the fact that both digital files should have met our initial benchmark requirements. There are several possible reasons for this surprising discrepancy. On reflection, we probably should have accounted for possible misregistration in the scanner by increasing the dpi requirements by at least another quarter. In addition, the real-world interplay of camera, computer, network, storage, display, and printing added variables that simple resolution numbers could not address.

With the Tiffany vase, the test results were the exact opposite. In sharpness, lighting, and noise, the Dicomed outperformed the Phase One camera when capturing a three-dimensional image. This assessment of the digital file was echoed by the curators who examined the match prints; all preferred the prints produced from the Dicomed file. In this case the larger file size of the Phase One camera offered no advantages in image quality. The success of the Dicomed camera may be due to its being an area-array camera and to its shorter exposure time. It proved difficult to capture three-dimensional images effectively with the Phase One camera without additional lighting (perhaps due to the difficulty of ensuring an adequate depth of field with a linear-array camera).

In spite of the care taken to ensure that the test would be fair, certain features of the testing process may have affected the results. First, a hard drive failure on the computer operating one of the cameras during its initial test required a return visit by the vendor. In the interim, the photo set-up had to be struck. While care was taken to replicate the lighting and other variables as closely as possible, differences from one photographic session to the next may have influenced the nature of the images.

Second, the "authenticity" of the match prints is open to question. The printing houses were instructed not to perform any of the color correction or image sharpening they would normally do before submitting a match print to the museum. While they may have followed these instructions, it is also possible that they could not resist the urge to make the images look as good as their equipment could make them appear.

Third, the tests revealed (and our subsequent experience has borne out) that for good tone, color, and image quality, the nature of the lighting used is almost more important than the camera. Each vendor brought the lights they thought would be appropriate for the task, but in reality the lights may not have been optimal for the kinds of objects the vendors were asked to capture. For good imaging, the lighting and the camera both need to take into account the unique characteristics of the material selected.

Finally, the importance of the digital photographers involved in the test cannot be overestimated. For example, one of the test images was over-exposed - a direct product of operator error and not the camera. Effective digital photography is far from automatic, and requires the knowledge and experience of a trained operator. The difference in the levels of photographic skill involved in the two tests would also have affected the results.

Conclusions
What, you might well be asking, did we do when presented with results that showed each camera excelled in one particular area? Because our project was going to start with works on paper, we opted to acquire through Luna Imaging the camera that did best with two-dimensional objects, the Phase One PowerPhase. Image quality has been excellent, and we have been able to achieve our desired throughput. We are continuing to look at the issue of lighting to see if there is some way this camera could be used for the three-dimensional objects, but we may also seek funding to purchase an area-array camera.

What lessons does the Cornell experience offer to others who may wish to purchase a digital camera? They can be summarized as follows:

Notes
(1) For more information on Cornell's benchmarking approach see:
Kenney, Anne R. and Chapman, Steven, Digital Imaging for Libraries and Archives. Ithaca, NY: Cornell University Library, 1996.

(2) According to this formula, the file size is 51,840,000 bytes, which translates to roughly 49 megabytes (Mb).

Highlighted Web Site

DjVu

DjVu is a new proprietary image compression technology developed by AT&T Labs for both black-and-white and tonal images. One of the distinguishing features of DjVu is its segmentation function, which separates an image into background (e.g., paper texture, company logos) and foreground (e.g., text, line drawings) components for more efficient, higher-quality compression. Using this sophisticated lossy compression scheme, one can achieve compression ratios as high as 1000:1. This Web site provides general and technical information on DjVu, including sample images for comparison purposes and an FAQ section. The plug-in required to view DjVu images on the Web is also available on this site.

Calendar of Events

International Digital Libraries Collaborative Research Projects
Proposals Due: January 15, 1999

The National Science Foundation will fund the United States portion of collaborative digital library projects among investigators from different countries to foster long-term, sustainable relationships between US and non-US researchers and research organizations. Proposals should have the overall research goal of enabling users to access and exploit information in new ways. Research issues include information organization, forms of information distribution, scalability and security techniques for worldwide data systems, and tools to search, store, and deliver information in different media or languages. For further information contact: Stephen M. Griffin, sgriffin@nsf.gov.

Museums and the Web 1999
March 11-14, 1999
To be held in New Orleans, Louisiana, this conference will focus on Web-related issues for museums, archives, libraries, and cultural heritage institutions. Topics of interest will include museum applications of the Web, including the publication of museum content, and Web programs.

Third IEEE Metadata Conference
April 6-7, 1999
The IEEE Metadata Conference, to be held in Bethesda, Maryland, will encourage discussion of metadata-related issues. The presented papers will contribute to practical understanding, with an emphasis on best practices. There will be papers presenting visions that build on the evolution of current capabilities, as well as papers on data mining, knowledge management for the novice and the expert, and the role of metadata in exploiting digital data across heterogeneous environments.

Announcements

A Research Agenda for Digital Libraries: Summary Report of the Series of Joint NSF-EU Working Groups on Future Directions for Digital Libraries Research
The report of the joint National Science Foundation-European Union project to explore research agendas for digital libraries is now available on the Web. This is a collaborative effort among leading researchers from the United States and Europe. The group has been exploring the possibilities of a joint international research agenda. Five working groups were formed and each has explored a key research topic: Intellectual Property and Economics, Interoperability, Global Resource Discovery, Metadata, and Multilingual Information Access.

TEI and XML in Digital Libraries
The Digital Library Federation has issued the proceedings from "TEI and XML in Digital Libraries" held in June 1998. The two major purposes of the meeting were:

The Library of Congress National Digital Library Program RFPs Now Available
The Library of Congress National Digital Library Program has made the following documents available on its Web site.

Additionally available are important documents from previous projects such as Paper Scanning - Digital Images from Original Documents, Text Conversion and SGML-Encoding.


Five College Archives Digital Access Project
The Five College Archives Digital Access Project, now in its third year, seeks to digitize and make available on the Web a variety of archival and manuscript materials pertaining to the history of women's higher education. The five colleges are Amherst College, Hampshire College, Mount Holyoke College, Smith College, and the University of Massachusetts at Amherst. The digitization of all the selected collections at the Mount Holyoke College Archives and Special Collections has been completed. The next phase involves material at the Smith College Archives.

Final Report of the Library of Congress Manuscript Document Digitization Demonstration Project
Sponsored by the Library's Preservation Directorate, this demonstration project produced images of 10,000 document pages from the New Deal Era Federal Theatre collection held by the Music Division at the Library of Congress. The final report explores the following questions: What type of image is best suited for the digitization of large manuscript collections, especially collections consisting mostly of twentieth century typescripts? What level of image quality strikes the best balance between production economics and the requirements set by future uses of the images? Will the same high quality image that might be appropriate for preservation reformatting also provide efficient online access for researchers?

The Digital Scriptorium of the Rare Book, Manuscript, and Special Collections Library at Duke University
Duke University has just released the Historic American Sheet Music Web site. The site includes digital images of over 16,000 pages of sheet music from 3,042 pieces published in the United States between 1850 and 1920. The sheet music chosen for digital reproduction represents a wide variety of music types including bel canto, minstrel songs, protest songs, sentimental songs, patriotic and political songs, plantation songs, Civil War songs, spirituals, dance music, songs from vaudeville and musicals, "Tin Pan Alley" songs, and songs from World War I.

FAQs

Question:
I read the press release from UMI about their plans to digitize their microfilm collection. Can you tell me some of the technical issues associated with this effort?

Answer:
To answer this question, we contacted UMI to ask about their digital imaging specifications. The following information about the Digital Vault Initiative is provided by Jeff Moyer, UMI Vice President of Product Management.

The goal of UMI's Digital Vault Initiative is to create a digitized collection of printed works based on the company's extensive microfilm collection that is stored at the company's headquarters in Ann Arbor, Michigan. The initiative will open the doors to 500 years of information, including hundreds of thousands of books, newspapers, and periodicals, totaling more than 5.5 billion pages. The first phase of the initiative focuses on UMI's collection of early English literature, including nearly every English-language book published from the invention of printing in 1475 to 1700. The Early English Books Online (EEBO) system, designed to complement the high quality microfilm images, is intended to give scholars access to Web-based bibliographic citations and full-page images, which are also available via Adobe PDF (Portable Document Format).

After reviewing various options, the project team decided to create and deliver binary master images at 400 dpi as CCITT Group 4 TIFF files. The fast delivery of images is one of UMI's top goals. For on-line viewing, UMI uses an innovative image compression technique developed at AT&T Labs called DjVu (see Highlighted Web Site) that dramatically reduces the size of the files with very little loss in image quality. Users will also have the option of downloading the images for off-line viewing in Adobe PDF format.

The Early English Books collection is stored in UMI's vaults on 1,000-foot reels of 35mm microfilm. Second-generation negative masters are scanned. Quality degradation is not a factor with second-generation negatives because the difference between the camera negative and the second-generation is virtually undetectable at 400 dpi. The original reduction ratio on film varies between 8x and 20x. A standard reduction ratio of 15x is used during film to digital conversion.

UMI's Vault Duplication Department begins the production process by cleaning the film, visually inspecting each frame for damage, and then delivering the film to the Scanning Division of UMI's Xerography department. Five scanners are dedicated to the project. They are Sunrise Imaging Proscan 3 models, powered by 300 MHz Pentium P3 processors, with 128 Mb of RAM, running the Windows NT 4.0 operating system. The scanners are in use 24 hours per day, seven days per week.

The operators place the film on the scanners, and adjust the image-sizing parameters. During scanning, the operators monitor the process on an image detect screen. A blue line crossing the center of the screen is a scan field, which senses the film and creates a graph below the image window based on the data it acquires. The data represents the microfilm image densities (number of pixels). Spikes appear when the scan field crosses a page with an illustration or other dense image. For best quality, an operator must manually set an average line between the peaks and valleys of the graph while avoiding spikes. This challenges the operator's skill because much of the Early English Books collection is rich with illustrations.

Scanned images are stored on a hard drive until quality inspection is complete. This quality control process involves inspecting all images through proprietary software developed by UMI. Quality control specialists also index illustrations. After the digital images pass inspection, they enter a queue until about 650 megabytes of data are compiled. The images then are written to a CD-ROM disc, which is stored in the EEBO CD-ROM jukebox towers. They hold 5,280 compact discs with a total storage capacity of over 3.4 terabytes.
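
Those figures are internally consistent, as a quick check of the arithmetic shows:

    discs = 5_280
    mb_per_disc = 650          # roughly one full CD-ROM per batch
    total_tb = discs * mb_per_disc / 1_000_000
    print(f"{total_tb:.2f} TB")   # about 3.43 TB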

UMI began the EEBO project in June 1998 and is releasing the material on the Web in December 1998, via an interface that is suitable for both novice users and seasoned scholars. The interface links the images to corresponding bibliographic records through a search engine that allows browsing as well as field and Boolean searching.

The scanning phase of the EEBO project will be completed by July 1999. The next stage will involve collaborating with several libraries and universities to convert the EEBO works to ASCII text. Future phases of the Digital Vault Initiative will encompass scanning UMI's extensive collection of newspapers and periodicals, including full runs of such publications as Time magazine and the New York Times.

RLG News

Digital Preservation Needs and Requirements in RLG Member Institutions
In late April 1998, RLG commissioned a study of digital archiving needs and requirements in member institutions. The research was conducted by Margaret Hedstrom, Associate Professor at the School of Information, University of Michigan, and Sheon Montgomery, Graduate Student Research Assistant. The purpose of the study was twofold: 1) to gather baseline data on the nature and extent of digital preservation problems in member institutions and the status of their digital preservation programs, and 2) to identify needs and requirements of RLG members in meeting their responsibilities for preserving digital information. Of particular interest was whether digital preservation is a common concern in libraries, archives, museums, and special collections, or whether this problem is still limited to large institutions that were early adopters of digital technologies. As well, RLG wanted to learn more about the policies and practices that are being used to preserve digital materials. This information has helped to determine the extent to which successful models and prescriptive guidelines are known and are being replicated, and to gain a deeper understanding of obstacles to digital preservation in member institutions.

The following is the executive summary, excerpted from the final report. The final report, Digital Preservation Needs and Requirements of RLG Member Institutions, will be available as both a print and on-line publication. The on-line version will be available very soon through the RLG PRESERV web site (http://www.rlg.org/preserv or http://www.thames.rlg.org/preserv/, if connecting from Europe).

 

Digital Preservation Needs and Requirements in RLG Member Institutions
Executive Summary

Margaret Hedstrom and Sheon Montgomery

This research on digital preservation responsibilities and the nature and extent of digital holdings provides concrete evidence about digital preservation problems facing libraries, archives, museums, and other repositories. Digital preservation is an increasingly common concern for RLG member institutions. Two-thirds of the 54 institutions in the study assume responsibility for preserving material in digital form. Typically, institutions with digital preservation responsibilities acquire materials in digital form and create digital files through conversion of print, manuscript, and visual materials to digital form. A few institutions have large collections of digital materials, but most are of modest size. The frequency of acquisitions and the size of digital collections are growing rapidly. Many RLG members have taken on digital preservation responsibilities during the last three to five years, and almost all anticipate more digital acquisitions and more conversion projects during the next three years.

Digital preservation policies and practices are under-developed in member institutions given the increasing prevalence of digital materials. Only half of the institutions with digital preservation responsibilities have policies that govern acquisition, conversion, storage, refreshing, and/or migration of digital materials. Less than half of the institutions place limitations on the types of digital formats that they accession and only 20 percent have adopted standards for the master files of materials that they generate through conversion. As a consequence, the majority of institutions maintain digital materials in at least six different formats, and more than one-third of the institutions maintain 10 or more different digital storage formats.

The institutions with the largest digital collections have the most developed policies and practices. Typically, institutions transfer some of their materials to new storage media when the materials are acquired, but most institutions also maintain some materials in their original format. Less than half of the institutions with digital holdings refresh or migrate digital materials in their holdings. Most institutions that do refresh or migrate digital materials carry out these activities on an ad hoc basis or in conjunction with system upgrades, rather than as an integral part of a digital preservation program. Many institutions have not yet confronted problems of technology obsolescence or systems upgrades because they began acquiring digital materials and conducting conversion projects within the last few years. Nevertheless, two-thirds of the institutions with digital holdings cannot access some of their materials because they lack the operational or technical capacity to mount, read or access files stored on some of the storage media in their holdings. Three-quarters of the institutions believe that irreplaceable information will be lost if digital preservation concerns are not resolved.

This study assessed administrators' perceptions of the greatest threats to digital holdings and identified common obstacles to digital preservation. Collection managers view technology obsolescence as the greatest threat to digital preservation, followed by insufficient resources and insufficient planning. Most administrators consider the physical condition of materials to be the least significant threat. Lack of expertise is a significant obstacle to effective digital preservation programs. About 80 percent of RLG members rate the highest level of in-house expertise as either novice or intermediate, and only eight institutions consider their staff expertise to be at the expert level. Although 20 institutions acquire some expertise from outside sources, such as consultants, institutions with expert or intermediate levels of expertise on their staffs are also more likely to have access to outside experts.

RLG members are planning a variety of measures to improve their capacity for digital preservation and respond to anticipated growth in both acquisition and conversion activities. The majority of institutions plan to use training provided by professional organizations, local training programs, and vendors as well as independent study to increase digital preservation expertise among their staff members. More than half of the institutions plan to hire additional staff with expertise in digital preservation, and about one-third plan to hire consultants. Almost all of the institutions anticipate developing new policies for preserving digital materials in the next three years, including all of the institutions that currently have written policies, as well as 33 of 36 that do not currently have a policy.

Member institutions are seeking leadership in the development of standards and best practices, guidance on model policies and practices, and various types of training from consortia, such as RLG. The respondents also expressed an interest in a variety of services from third party vendors, especially conversion services, migration services, hardware and software to meet archival needs, and archiving or preservation storage services. Acceptance of third party services is contingent on reliability and a reasonable cost.

There are several concrete steps that RLG and other organizations can take to support efforts in libraries, archives, museums, and other repositories to develop digital preservation capabilities. RLG can make a significant contribution to digital preservation by compiling a concrete set of guidelines, standards, and best practices for digital preservation and providing leadership and coordination in emerging standards and practices. RLG should further investigate its potential role in providing digital preservation services, especially in the areas of consultation and training.

There are also actions that RLG members can take in anticipation of emerging standards, guidelines, and best practices for digital preservation. In most institutions, digital preservation activities are distributed among different administrative units and different types of professional staff because of the variety of expertise needed to address this problem. Coordinating mechanisms within institutions would enhance utilization of existing resources, foster more consistent policies and practices, draw attention to this issue at higher levels of administration, and in some cases help achieve economies of scale. Assigning responsibility for storage, maintenance, and migration can help institutions avoid crises and minimize the impact of unanticipated changes in hardware and software. For planning purposes, explicit policies on the scope of digital collections and on digital conversion goals should be developed within institutions. Such policies are also important in building a framework of shared responsibilities across institutions.

This study suggests that there are opportunities for third party service providers to develop or enhance services in the areas of training, consultation, conversion, migration, and affordable, reliable storage services. RLG members are favorably inclined toward third party services if they are reliable, meet strict library and archival standards, and are provided at a reasonable cost.

This report examined one component of the evolving infrastructure for long-term preservation of digital information: archives, libraries, museums, and other repositories that have been instrumental in preserving and providing access to scholarly communications, documentary heritage, and other cultural resources in traditional formats. Many of these institutions are beginning to add digital preservation to their array of preservation responsibilities, and by the year 2001, 98 percent of the institutions that responded to the survey expect to assume responsibility for preserving some digital information.

There is a gap, however, between current models for digital preservation and the status of digital preservation in many institutions. Institutions with large digital collections and more years of experience generally have policies in place that govern acquisition, storage, refreshing, and migration of digital materials, but the majority of institutions have not developed digital preservation policies or established methods to preserve digital information. Institutions are seeking better methods and more affordable services to tackle the problems posed by technology obsolescence. Insufficient resources and inadequate planning are also considered major obstacles to digital preservation, and only a few institutions have experts in digital preservation on their staff or available through consultation arrangements. Leadership from RLG and other organizations in developing and promoting standards and best practices for digital preservation, along with reliable and affordable services from third party providers, is an essential component of an evolving infrastructure for distributed digital collections.

 

REACH Project Findings
The REACH project was launched by RLG and the Getty Information Institute (GII) to investigate whether information about museum objects could be extracted from collection management systems and made useful for research. As part of the process, it was hoped that museum management software vendors would be encouraged to build in export mechanisms to facilitate ongoing extraction.

During the course of the project, museums and vendors identified the access points that would be most useful to researchers. The intent was for a number of museums to identify and coordinate information not otherwise accessible and make it available for research (both with and without images), so that the resulting online database could be evaluated for its value to researchers.

What was learned as the project got started:

During the course of the project:

Outcome:

The RLG/GII staff decided that much of the planned evaluation could take place with resources already in hand. The creation of the REACH element set, however, answered two important questions:

The resulting data set has many commonalities with other cultural heritage data standards and provides a useful starting point for discussions aimed at identifying core data that could inform efforts to integrate networked cultural heritage resources for the benefit of research.

Hotlinks Included in This Issue

Feature Article
Broadcast Wave Format (BWF): http://www.ebu.ch/pmc_bwf.html
The Data Archive: http://dawww.essex.ac.uk/
Data Documentation Initiative (DDI): http://www.icpsr.umich.edu/DDI/codebook/codedtd.html
National Library of Norway (NLN): http://www.nbr.no/e_index.html
Networked European Social Science Tools and Resources (NESSTAR): http://dawww.essex.ac.uk/projects/nesstar.html
Norwegian Broadcasting Corporation (NRK): http://www.nrk.no/info/engelsk/
UPF Web site: http://info.wgbh.org/upf/

Technical Feature
Cornell Institute for Digital Collections: http://CIDC.library.cornell.edu/
Dicomed: http://www.dicomed.com
Herbert F. Johnson Museum: http://www.museum.cornell.edu/HFJ/
Luna Imaging: http://www.luna-imaging.com/
Phase One: http://www.phaseone.com

Highlighted Web Site
DjVu: http://djvu.research.att.com/home_mstr.htm

Calendar of Events
Museums and the Web 1999: http://www.archimuse.com/mw99
Third IEEE Metadata Conference: http://www.llnl.gov/liv_comp/metadata/md99/md99.html

Announcements
The Digital Scriptorium of the Rare Book, Manuscript, and Special Collections Library at Duke University: http://scriptorium.lib.duke.edu/sheetmusic/
Document Digitization Demonstration Project: http://memory.loc.gov/ammem/pictel/index.html
Five College Archives Digital Access Project: http://clio.fivecolleges.edu
The Library of Congress National Digital Library Program RFPs Now Available: http://memory.loc.gov/ammem/ftpfiles.html
A Research Agenda for Digital Libraries: Summary Report of the Series of Joint NSF-EU Working Groups on Future Directions for Digital Libraries Research: http://www.iei.pi.cnr.it/DELOS//NSF/Brussels.htm
TEI and XML in Digital Libraries: http://www.hti.umich.edu/misc/ssp/workshops/teidlf/

FAQs
Digital Vault Initiative: http://www.umi.com/hp/Features/DVault/

RLG News
REACH element set: http://www.rlg.org/reach.elements.html
The REACH project: http://www.rlg.org/reach.html
Visual Resources Association/RLG/GII initiative: http://www.rlg.org/vision.html

 

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR), it is available internationally via the RLG PRESERV Web site (http://www.rlg.org/preserv/). It will be published six times in 1998. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite the URL of the article) when using the material; please contact Jennifer Hartzell at bl.jlh@rlg.org, RLG Corporate Communications, when citing RLG DigiNews.

Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Oya Y. Rieger; Production Editor, Barbara Berger; Associate Editor, Robin Dale (RLG); Technical Support, Allen Quirk.

All links in this issue were confirmed accurate as of December 11, 1998.

Please send your comments and questions to preservation@cornell.edu.

