RLG DigiNews February 15, 2002, Volume 6, Number 1



ISSN 1093-5371

Table of Contents

Feature Article 1
"We don't know the first thing about digitization:" Assessing the Need for Digitization Training in Illinois, by Trevor Jones and Beth Sandore

Feature Article 2
Integrating a Free Digital Resource: The Status of Making of America in Academic Library Collections, by Kizer Walker

Highlighted Web Site
METS: Metadata Encoding and Transmission Standard

FAQ
Web Log Analysis, by Lee LaFleur

Calendar of Events

Announcements

RLG News

Editors' Note:

This issue marks the beginning of our sixth year of publishing RLG DigiNews, and we will be celebrating our anniversary in several ways. First, you will note the new look and feel of the publication. RLG DigiNews has had several facelifts over the years, but we are quite taken with this new design and hope you agree. We're also adding new features. Beginning with this issue, you will be able to print individual articles and FAQs. Just click on the printer icon or accompanying text at the top of each article/FAQ.

We also are pleased to announce that Nancy Y. McGovern has become the co-editor of RLG DigiNews. Nancy is the Digital Preservation Officer at Cornell University Library and currently leads its digital imaging and preservation research unit. Prior to coming to Cornell in August 2001, Nancy managed and coordinated digital implementation projects at the U.S. National Archives and Records Administration, and later served as a consultant to a number of research projects, most recently the Digital Preservation Testbed Project of the Dutch Government. She is working on her PhD in digital preservation at University College London.

In April we will publish a special anniversary issue of RLG DigiNews, and would like to get your feedback on the journal, as well as suggestions for the future. We always welcome comments, but for this issue we have devised a special online survey. Please help us make RLG DigiNews responsive to your needs by taking the ten minutes to complete this survey. Many thanks!


"We don't know the first thing about digitization:"
Assessing the Need for Digitization Training in Illinois

Trevor Jones
Illinois Digitization Institute,
University Library,
University of Illinois at Urbana-Champaign
trevorj@staff.uiuc.edu

Beth Sandore
University Library,
University of Illinois at Urbana-Champaign
sandore@uiuc.edu

Ask non-specialists what it takes to complete a digital imaging project, and responses will range from a desire to "slap it on a scanner and go" to uncomprehending glassy-eyed stares. The reality lies somewhere between these two extremes, but it is apparent that many cultural heritage professionals are confused by the digitization process. Most are interested in digitizing some part of their collections, but often possess only a vague idea of how to begin. Although great advances have been made in the development of standards and best practices for digitization, these principles have yet to filter down to the majority of non-specialists. In Illinois, as in many states, there is such pressure to "get materials on the Web" that digitization projects are often hastily planned and poorly executed.

In January 2001, the Illinois Digitization Institute was created to develop digitization training materials for cultural heritage organizations throughout the state. The Institute is part of the Digital Imaging and Media Technology Initiative at the University Library at the University of Illinois at Urbana-Champaign. Funded by a Library Services and Technology Act grant administered by the Illinois State Library, the Institute's first priority was to determine the extent and type of digitization training needed in the state. One of the primary goals of the Institute was to develop training to provide cultural heritage professionals with the means to mainstream digitization into their institutions' activities. We were interested in developing a model that differed from the nationally acclaimed workshops offered by the Cornell University Library and the Northeast Document Conservation Center (NEDCC), by providing both training opportunities and continuing advice. The success of national and regional digitization workshops offered by groups such as Cornell and the NEDCC made it clear that cultural heritage institutions have a strong need for digitization training. Although there is anecdotal information about training activities and needs throughout the country, we found no examples of systematic assessment of digitization training programs. Since we were working with a geographically defined population, we felt it would be useful to gather information about Illinois institutions' prior training, their perceived needs, and their digitization activities to date.

To meet this objective, the Institute sent surveys to 459 libraries, museums, and archives throughout Illinois. We surveyed institutions of all sizes, ranging from large academic libraries to all-volunteer historical museums. The survey was designed to determine:

The extent to which digitization training is needed
The types of training formats that are most desired
The types of digital projects currently under way
The extent to which current digital projects follow best practices
The amount and type of digitization equipment at cultural heritage institutions in the state.

We sent the survey to a random stratified sample of public, academic, and school libraries, as well as "special" cultural heritage institutions, including museums, historical societies, and archives. The overall response rate for the survey was 32%, and the results were tabulated by the Survey Research Laboratory at the University of Illinois at Chicago. Forty-seven percent of responses were from public libraries, 30% from schools, 5.6% from academic libraries, 4.2% from library systems, and 11.8% from museums and archives. Not surprisingly, the responses indicated a substantial need for digitization training in Illinois. Although the survey was limited to one state, it is probable that similar surveys conducted elsewhere would produce comparable results.

One of the survey's most surprising findings was the percentage of institutions that already own some type of digitization equipment. Eighty-two percent of all respondents reported owning a flatbed scanner, digital camera, or some other digitization tool (See figure 1).

Figure 1: Types of Digital Equipment Owned by Survey Respondents

However, relatively few institutions had the knowledge required to effectively digitize cultural heritage collections. Only 15% of respondents reported that they or other staff members at their institution had attended digitization workshops or in-depth training sessions like those hosted by the Northeast Document Conservation Center or Cornell University Library (See figure 2).

Figure 2: Digitization Training Attended by Survey Respondents

Despite the prevalence of digitization equipment in the state, comparatively few institutions had begun to digitize their collections at the time of the survey. Although 82% of respondents owned digitization tools, only 35% had conducted digital projects, and overall the results of these projects had been discouraging. Only 51% of reported digital projects were available on the Web, while 28% of completed digital projects had not yet begun to provide any public access. More than two-thirds of the digital projects reported on in the survey did not use any type of metadata, and only 8% had made use of the common Dublin Core metadata element set (See figure 3).

Figure 3: Types of Metadata used by Survey Respondents

If the trends identified in this survey continue, the vast majority of digital projects in Illinois will fail to meet even basic standards for Internet access to digital materials. Unless more training is provided, cultural heritage institutions will continue to underutilize their equipment and produce substandard digital content. This failure could have long-term consequences for the state's cultural heritage institutions. The lack of robust metadata in the majority of the state's digital projects will make it difficult if not impossible to share data, and will certainly result in increased labor and material costs in the future.

We found that cultural heritage institutions are somewhat receptive to learning more about the theory and practice of digitization. Over half of the respondents expressed interest in one-day workshops on digitization basics (56%), followed by Web-based tutorials (19%) (See figure 4).

Figure 4: Types of Digitization Training Favored by Survey Respondents

However, only 17% were interested in workshops longer than one day, and although many expressed a desire to learn about "Digital Capture" and "Materials Selection," the majority was indifferent to the drier subjects of "Metadata" and "Project Planning." These answers suggest that many respondents are primarily seeking an introduction to digitization, and lack knowledge of the importance of project planning in the digitization process. A cross-tabulation of the survey results indicates that cultural heritage professionals are not fully aware of the prerequisites for successfully completing a digitization project. While only 17% of all respondents expressed interest in multi-day training sessions, those who had already received formal digitization training were more than twice as likely (50%) to express an interest in multi-day workshops. This disparity suggests that individuals who have received at least some digitization training understand the complexities of the process and realize that it takes more than a day to learn how to successfully implement a digitization program.
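The comparison of the two interest rates can be checked with a line of arithmetic. The short Python sketch below uses only the two percentages reported above; the function and variable names are illustrative and do not come from the survey itself. Note that the 17% figure covers all respondents, including those with prior training, so this is a ratio against the overall rate rather than against untrained respondents alone.

```python
def likelihood_ratio(group_rate: float, baseline_rate: float) -> float:
    """Return how many times larger group_rate is than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return group_rate / baseline_rate


# Percentages reported in the survey discussion above.
ALL_RESPONDENTS_MULTI_DAY = 17.0  # % of all respondents interested in multi-day workshops
TRAINED_MULTI_DAY = 50.0          # % of previously trained respondents interested

ratio = likelihood_ratio(TRAINED_MULTI_DAY, ALL_RESPONDENTS_MULTI_DAY)
print(f"{ratio:.1f}x")  # prints "2.9x"
```

On these figures, respondents with prior training were close to three times as likely as respondents overall to want multi-day workshops, which is consistent with the "more than twice as likely" statement.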

The Illinois Digitization Institute is using the results of the survey to design training materials for cultural heritage organizations in Illinois. Because the survey indicated that novices would most likely attend one-day training sessions, we began offering a series of free one-day workshops covering the basics of digitization. Limited to 15 participants, these sessions focus on project planning and choosing equipment, and also provide hands-on opportunities to work with a flatbed scanner and digital camera. The aim of these introductory sessions is not to convince cultural heritage organizations to embark on digitization projects, but rather to help them make informed choices about digitization and its role in their institutions. If participants decide to proceed with a digital project, they are encouraged to do additional readings or enroll in the Institute's series of interactive Web-based course modules. Using WebCT and WebBoard, these two-week modules make use of discussion boards and collaborative assignments to help participants plan and develop their own digitization program.

The Institute has also developed online training for recipients of digitization grants funded by the Illinois State Library. Recipients of LSTA Educate and Automate digitization grants are now required to complete a digitization course before receiving their grant funds. For this training, the Institute has adopted a slightly different approach. Because participation in this training is mandatory, training begins with a two-week online course module covering the basics of digitization. Students are asked to do readings, answer questions online, participate in WebBoard discussions, and prepare a formal evaluation of another institution's digitization project. The follow-up to this online training is a two-day hands-on intensive workshop held at the University of Illinois at Urbana-Champaign. Because project planning and evaluation have been covered in the online course, more workshop time is available to address the practical aspects of scanning, image manipulation, and problems specific to the grantees' own digital projects. Although there was some grumbling from the grant recipients about the time commitment required for the training, evaluations have been almost uniformly positive. Participants in all of the Institute's training leave with a detailed digitization bibliography, links to a technical insert providing an overview of the digitization process, and access to an Image Quality Calculator program that assists in determining optimal resolution for scanning text documents. As we continue to assess the efficacy of these training methods, the Institute is hopeful that these efforts will eliminate some of the confusion surrounding digitization, and thus "raise the bar" for digital projects throughout Illinois.

Acknowledgements: The Illinois Digitization Institute has been developed pursuant to a Library Services and Technology Act grant administered by the Illinois State Library. The authors would like to thank Anne Craig, Joe Natale, Connie Frankenfeld, and Alyce Scott from the Illinois State Library for their assistance.

Please help us make the next five years of RLG DigiNews even better by spending 5-10 minutes filling out our online survey.


Integrating a Free Digital Resource:
The Status of Making of America in Academic Library Collections

Kizer Walker
Cornell University
kw33@cornell.edu

The Making of America (MOA) projects at Cornell University and the University of Michigan provide searchable, full-text digital access to a growing body of primary materials documenting American social history in the second half of the 19th century. Between the separate sites maintained by the two collaborating institutions nearly 9,000 monograph volumes and approximately 150,000 journal articles are currently available through MOA, free of charge, to users around the world. In June and July 2001, as part of an on-going evaluation of this resource, Cornell University Library surveyed academic libraries that link to MOA to gather an impression of how and why these institutions are integrating MOA into their collections. The survey assessed the impact of the availability of the digital resource on collection development and management decisions regarding print versions of titles duplicated in the MOA collections. The report that follows presents the survey results and situates them in relation to actual use of the MOA collections as tracked over three weeks in Web logs. The author also interviewed principals of the projects at Cornell and Michigan to determine their reactions to the survey findings and future plans for MOA; the interview follows the survey report.

MOA Institutional Use Survey
The MOA survey adapted and expanded on surveys conducted by JSTOR in 1999 and 2000, the results of which suggest that libraries in increasing numbers have been willing to let go of print journal backruns and rely on JSTOR to archive and provide access to these materials in digital form. Would libraries' handling of the openly-accessible MOA collections show similar tendencies?

Librarians involved in administration, collection development, reference, and acquisitions at approximately 250 institutions were invited to respond to a Web-based survey on their institutions' use of MOA; a single response was requested for each institution. Along with a series of multiple-choice questions, survey participants had the option of submitting open-ended comments on the MOA collections. We compiled the list of invited participants from academic library Web pages containing links to the URLs for the Cornell and Michigan MOA sites as identified via commercial search engines (1). Institutions of all sizes received the survey mailing: 58 of the 112 Association of Research Libraries (ARL) member institutions were among the survey recipients, including 22 of the libraries ranked among the top 25 in the 1999-2000 ARL Membership Index. Approximately 10 percent of the survey mailings were sent to institutions outside North America.

Librarians from 93 institutions answered the survey. Two responses came from Canadian institutions and eight from libraries outside North America. We received 28 responses from ARL member libraries, including 13 that ranked among ARL's top 25. As figure 1 illustrates, around half of the U.S. respondents represented Ph.D.-granting institutions, approximately one-third Master's colleges and universities, and the rest undergraduate institutions.

Figure 1: MOA Survey U.S. Respondents by Carnegie Classification Category

Who uses MOA?
The MOA survey focuses on the integration of MOA into academic library collections, but academic research is one among many uses of MOA. Sample Web logs of the MOA site administered by the Cornell Library provide a revealing—albeit lower-than-normal use—snapshot of the collection. Recorded over three one-week periods in December 2001 and January 2002, the logs indicate that over 90% of 97,378 distinct visits to the Cornell MOA site originated from users of machines registered to the commercial (e.g., ".com") and network (e.g., ".net") domains. Presumably, private individual users of commercial Internet service providers account for a considerable number of these visits. Visits originating from U.S. academic institutions—that is, from education domains (e.g., ".edu"), whether from library computers or not—accounted for approximately 7% of the total. Visits to the Cornell MOA site for the period under review are broken down by domain type in figure 2. Logs show 126,119 visits referred to MOA from other sites. Academic sites (including Cornell Library pages, but not pages within the MOA site itself) comprised approximately 33% of all such referrals.

Figure 2: Top-Level Domain Types by Visits to MOA
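A breakdown of this kind amounts to grouping the visiting hostnames in a Web log by their top-level domain and counting. The Python sketch below illustrates the idea under stated assumptions: the hostnames are invented placeholders, the category labels are our own shorthand, and the code is not the procedure used to produce the MOA figures, which came from WebTrends reports.

```python
from collections import Counter

# Assumed mapping from top-level domain to a domain type label (illustrative only).
DOMAIN_TYPES = {"com": "commercial", "net": "network", "edu": "education"}


def domain_type(hostname: str) -> str:
    """Classify a hostname by its final dot-separated label."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return DOMAIN_TYPES.get(tld, "other")


def tally_visits(hostnames):
    """Count visits per domain type."""
    return Counter(domain_type(h) for h in hostnames)


def shares(counts):
    """Convert counts to percentages of the total."""
    total = sum(counts.values())
    return {kind: 100 * n / total for kind, n in counts.items()}


# Hypothetical hostnames, one per logged visit; not taken from the MOA logs.
sample_hosts = [
    "dialup-12.example.com",
    "cache.example.net",
    "lib-pc4.example.edu",
    "proxy.example.com",
]

visit_counts = tally_visits(sample_hosts)
visit_shares = shares(visit_counts)
print(visit_counts)
print(visit_shares)
```

Classifying by top-level domain is a coarse measure. A ".com" or ".net" address identifies the service provider rather than the person, which is why the text above can only presume that private individuals account for many of those visits.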

Though outside the scope of the present report, detailed study of non-academic use of the MOA collections is needed. Meanwhile, Web usage statistics provide necessary context for our findings regarding the status of MOA in academic library collections.

The reports, generated from usage logs with WebTrends Web analytic software, rank 200 organizations for each log period according to the number of visits to MOA from machines registered to that organization. Of the 89 U.S. academic institutions among these recurrent visitors, Ph.D.-granting universities represented a sizable majority at nearly 80%. Users at these universities accounted for around 90% of academic visits in the three weeks under consideration. The libraries at 61 of the 89 institutions from which the visits originated are ARL members. Users from institutions that responded to the MOA survey comprised 22% of academic visits. Figure 3 breaks down visits to the Cornell MOA site for the logged period according to the Carnegie typology.

Figure 3: MOA Visits from U.S. Academic Institutions by Carnegie Classification Category

Why do academic libraries provide access to MOA?
By and large, libraries seem to regard MOA as a valuable enhancement to their print holdings, but not as a suitable replacement for print collections. 85% of all librarians responding to the MOA survey reported that they provide access in order "to add titles not held in the library's print collection"; adding new titles was a motivation for 82% of responding ARL institutions and 69% of the top-ranked ARL members. 69% of all the libraries surveyed and 86% of ARL libraries reportedly link to MOA in order to "provide text searchable alternative versions to supplement titles already held in the library's print collection." A number of librarians commented that the ability to access the collections remotely is valuable for student and faculty users, particularly where libraries are supporting a distributed learning curriculum. Only 4% of respondents said that "replac[ing] titles held in the library's print collection" was a motivation for providing MOA access.

Integration of MOA into library collections
We have taken the presence of MOA titles in OPAC records as one measure of the degree to which libraries conceive of the resource as an integral piece of their collections. Fourteen respondents reported that their libraries' OPACs provide links to individual titles in the MOA collection at present. Half of these were ARL member institutions (25% of the ARL members surveyed). Nine of the 14 libraries are at U.S. Ph.D.-granting institutions, one at a Master's university, two at undergraduate institutions, and two at universities outside North America. In the majority of libraries surveyed, access to the MOA collections is from a comprehensive electronic resources page, a subject-based list of digital resources, or course-specific lists maintained by the library.

We asked survey participants a series of questions about the implications of access to the MOA titles for the management of their libraries' print holdings. This part of the survey closely followed JSTOR's bound volume survey, but the responses diverged markedly from those submitted by JSTOR subscribers, as figure 4 illustrates below. Asked whether, "given the availability of the titles in the MOA collection," their libraries had moved bound volumes to remote storage, 6% of respondents answered "yes," and another 6% answered that bound volumes had not been moved, but that there were plans to move them in the future. 78% reported that no items had been moved to remote storage and that their libraries had made no such plans. Although two respondents said their libraries had "entered into a group remote storage project with other institutions to consolidate . . . print collections," only one of these reported that MOA access had been a factor in the decision. Four percent noted future plans for a group storage arrangement, but 81% did not foresee any such coordination with other institutions. Only a single respondent reported that "bound volumes of titles included in the MOA collection" had been "discarded outright." A further 3% related that their libraries planned to discard some of these volumes in the future, but 85% responded that no bound volumes had been discarded in light of MOA accessibility and that there were no plans to do so.

Management of print titles offered in the digital collections	MOA 2001 Institutional Use Survey (93 total responses)	JSTOR 2000 Bound Volume Survey (138 total responses)	JSTOR 1999 Bound Volume Survey (214 total responses)
Moved bound volumes to remote storage?	6% (6 institutions)	25% (34 institutions)	20% (42 institutions)
Made plans to move bound volumes?	6% (6 institutions)	20% (27 institutions)	24% (52 institutions)
Discarded bound volumes?	1% (1 institution)	22% (31 institutions)	13% (28 institutions)
Made plans to discard bound volumes?	3% (3 institutions)	22% (30 institutions)	25% (54 institutions)
Entered into a group remote storage project with other institutions to consolidate print collections?	2% (2 institutions)	3% (4 institutions)	2% (4 institutions)
Made plans to enter into group storage project?	4% (4 institutions)	7% (10 institutions)	7% (16 institutions)

Figure 4: MOA and JSTOR Results Compared
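The percentages in figure 4 can be recomputed from the institution counts and the total responses given in the table header. The Python sketch below does this; the counts and totals are copied from the table, while the variable names and the rounding to whole percentages are our own assumptions about how the figures were derived.

```python
# Total responses per survey, from the header row of figure 4.
TOTALS = {"MOA 2001": 93, "JSTOR 2000": 138, "JSTOR 1999": 214}

# Institution counts per question, in the order MOA 2001, JSTOR 2000, JSTOR 1999.
COUNTS = {
    "Moved bound volumes to remote storage": (6, 34, 42),
    "Made plans to move bound volumes": (6, 27, 52),
    "Discarded bound volumes": (1, 31, 28),
    "Made plans to discard bound volumes": (3, 30, 54),
    "Entered into a group remote storage project": (2, 4, 4),
    "Made plans to enter into group storage project": (4, 10, 16),
}


def percent(count: int, total: int) -> int:
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


totals = tuple(TOTALS.values())
table = {
    question: tuple(percent(c, t) for c, t in zip(counts, totals))
    for question, counts in COUNTS.items()
}

for question, row in table.items():
    print(f"{question}: " + ", ".join(f"{p}%" for p in row))
```

Recomputing the rows this way reproduces the published percentages and makes the contrast easy to see: 6% of MOA respondents had moved bound volumes to remote storage, against 25% and 20% in the two JSTOR surveys.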

Another series of questions offered examples of more restrained possible actions affecting libraries' print holdings. Queried about "other cost or shelf-space saving solutions" developed "as a result of access to the MOA collection," participants were reluctant to allow MOA to influence their management of print materials. Five percent of all respondents answered that their libraries had "removed duplicate items" and 7% had plans to do so. Nine percent responded that their libraries had "stopped replacing lost or damaged print issues" of journals represented in MOA, and 11% more reported plans to stop. Fourteen percent said their institutions were planning to or had already discontinued purchasing microfilm backruns. Five percent of respondents said that their libraries have installed compact shelving (presumably reflecting a decrease in the priority afforded to accessibility of MOA titles in print), and 5% indicated plans to do so. The rate of positive responses to this series of questions was similar or lower for ARL institutions; however, these gave a significantly higher number of unequivocally negative responses ("no, and no plans…").

In their comments, a number of librarians indicated that their institutions have, in fact, withdrawn or remotely stored print materials that could be replaced with electronic versions, but that the MOA collections have not been factored into such decisions. That MOA's impact on collection management policies has not approached that of JSTOR does not reflect any doubt about the usefulness of the MOA collections. Indeed, respondents praised MOA as a "tremendous resource," an "excellent and useful collection" that is "invaluable for small libraries," and a "fantastic service to the historical profession." Instead, librarians' relative tentativeness likely has to do with perceptions of the stability of the resource. Comments of some of the respondents suggest that MOA may not be widely viewed as a permanent digital repository. Ruth Dickstein, Subject Specialist for History and Women's Studies at the University of Arizona Library, wrote: "we are removing JSTOR titles, and could consider doing the same with the MOA titles, [but] just have never assumed that the MOA titles had the stability of always being available." Michael Stoller, Director of Collections & Research Services at the New York University Libraries, expressed similar reservations:

We have treated Making of America as a supplement to our own holdings and not as a replacement for any locally-held resources. In future we might view MOA as a form of ready access to materials locally held offsite. But we are not presently inclined to view it as a 100% reliable digital archive, whose paper equivalents can or should be withdrawn from our collections.

In an August 2000 interview with RLG DigiNews, JSTOR's president, Kevin Guthrie, emphasized that his organization's decision-making and communication with stakeholders has at all times centered on JSTOR's core mission of establishing a trusted digital archive. JSTOR has vigorously cultivated relationships with libraries and sought to make its preservation policies clear to librarians. Although MOA has been an important testing ground for digital preservation techniques at both Cornell and Michigan, to date neither university has forcefully articulated its policies and practices regarding digital preservation of the MOA holdings. More active communication with librarians could help clarify MOA's long-term strategies. A few survey respondents proposed that MOA supply the usage statistics that commercial vendors typically make available to libraries; this and other services to libraries would enhance MOA's visibility.

As digital preservation and archiving projects multiply and evolve, standards and oversight mechanisms are emerging that will facilitate communication of preservation strategies (2). Such communication should be central to MOA's outreach to academic libraries as well as to other communities.

Interview with Anne Kenney, Wendy Lougee, and John Wilkin

A number of respondents to our institutional use survey indicated that MOA would weigh more heavily in collection management decision-making if there were clearer assurance that these materials would be accessible for the long term. How would you characterize the commitment of the Cornell and University of Michigan libraries to maintaining this resource?

Anne Kenney (MOA 1 Project Director, Cornell): Cornell University Library (CUL) is committed to maintaining and strengthening its digital holdings including the MOA collection. To date, this commitment has been de facto, but will be made explicit in the library's new Master Plan, which will be adopted by early spring. Over the past 5 years, CUL has actively developed its digital preservation capabilities to ensure the long-term accessibility of its digital content. Through an IMLS-funded initiative, the library developed a digital preservation strategy for its image-based collections and last year assumed long-term responsibility for the arXiv.org e-Print archive. The blueprint for creating a Central Depository for digital content will be completed within the next two months, with development planned for the summer and fall. Just recently Nancy McGovern has been appointed CUL's first Digital Preservation Officer, charged with developing digital preservation policies and coordinating various digital archiving efforts library-wide. Cornell Library has also participated heavily in research and development efforts focusing on digital preservation, through such projects as the Mellon E-Journal Archiving Project (Project Harvest), the Digital Libraries Initiative Phase 2 project (Project Prism), and the Risk Management Study on the effects of format migration. Because the online MOA collection serves our clientele so well, we will be moving the bound volumes comprising the collection to off-site storage over the next several years.

Wendy Lougee (MOA 1 Project Director, University of Michigan): We have developed archiving procedures and policies (currently in draft and undergoing internal review) and are committed to sustaining our locally created digital collections. Current mechanisms include methods to ensure the longevity and long-term access to the digital master. Creation and conversion practices use standards-based methods and storage on media with long-term viability. Access systems, where possible, use the digital master as an access copy and rely on redundancy (storage and multiple locations) and frequent backups.

We have recently adopted a policy to move our preservation reformatting to digital methods as a default method. Consequently, in the future we will be reviewing additional brittle and endangered volumes for digital conversion and utilizing similar methods.

Since the original MOA project, we have made cataloging records (via ftp) available for all items included in the MOA collection to facilitate access at other institutions.

What has been the approach to date to publicizing the MOA project, particularly with regard to establishing MOA's status as a stable, reliable resource? Do you envision new features or services that might increase MOA's value to its users or broaden its readership to new communities? What can MOA learn from other digital library projects, such as JSTOR, in this regard?

Anne Kenney: This survey revealed some very interesting trends: institutional faith in JSTOR has steadily increased for good reasons but also because JSTOR is overt about its commitment to its customers. I believe that CUL could follow the lead of JSTOR and others in offering the same commitment to current and future customers, not just within the CUL community but beyond to the growing secondary clientele. In meeting the needs of the former, we can also serve the latter with a manageable overhead. We are particularly taken with the National Library of Australia's "Safekeeping Project" that is building a distributed and permanent collection of digital resources in digital preservation through negotiations with resource owners or their designees to provide long-term access to their material. Those resources for which safekeeping strategies have been put in place are marked on the PADI Web site.

I also believe that in the next couple of years we will see various strategies evolve for developing the business case for underwriting the costs of digital archiving. CUL has already expended a great deal of time and money in the care and feeding of this resource and will continue to do so. The financial arrangements for doing so will inevitably change, however. The extent to which we integrate our holdings into the collections of other libraries will be closely monitored. The future will lie in greater inter-institutional dependencies for maintaining digital assets that are valued by all yet managed in a distributed manner.

Wendy Lougee: Our publicity has conveyed information about stability and methods for long-term access, as well as use and functionality. While we have not, thus far, advocated collection management decisions as a result of MOA, we anticipate that the planned digital registry (under development through the Digital Library Federation) would be an appropriate venue to communicate this information.

Finally, can you describe how you perceive MOA's relationship with other digital library projects, and how you would like to see such relationships develop in the future? Is MOA involved in any plans to integrate access to separate digital collections, or other collaborative projects that would reduce redundancy among digital resources? What steps were taken in the development of MOA to provide for future interoperability with other databases?

Anne Kenney: Wendy has already mentioned the DLF's registry initiative, which is being designed in part to reduce duplication of effort in digitization. Through various projects and initiatives—notably the Open Archives Initiative and collaborative efforts with other research institutions, in particular the Library of Congress and Michigan—Cornell is actively pursuing a program to integrate access across institutional boundaries. We are also intrigued by the suggestion of the survey respondents who asked whether we could provide them with statistical data covering their institution's use of MOA materials.

John Wilkin (Head, Digital Library Production Service, University of Michigan): Formally, we are exploring integration of digital collections through a National Science Foundation grant with Cornell, Goettingen, and Michigan to extend the Dienst protocol to support full text access. Michigan's OAI metadata harvesting project (supported by Mellon) will bring together freely available digital collections. We have made MOA cataloging records available via ftp to other institutions for inclusion in local catalogs. Our local systems development using our Digital Library Extension Service (DLXS) incorporates support for cross-repository searching.

Footnotes
(1) Google and Altavista were searched for links to MOA, using the Michigan MOA URL and the two URLs for the Cornell MOA site in our search strings. For example, a search for "link:moa.umdl.umich.edu/ -host:umich.edu" at Altavista yields links to the Michigan site, excluding links at the University of Michigan host. Links from academic library sites were selected "by hand" from the results. We compiled an email address list of individual librarians from information available at the library Web sites.
(2) See for instance the draft report of the RLG/OCLC Working Group, Attributes of a Trusted Digital Repository: Meeting the Needs of Research Resources (Mountain View, CA: RLG, 2001).

Highlighted Web Site

METS: Metadata Encoding and Transmission Standard

The METS project, sponsored by the Digital Library Federation, is developing an XML document format for descriptive, structural and administrative metadata for digital works. This site includes a new version of the METS schema, released in December 2001. The site also offers sample documents and other information on applying the schema, technical documentation, and a useful introductory tutorial. This site is the prime source for information on an important project that promises to have a significant impact on the development of digital libraries in the near future.

FAQ
Lee LaFleur
Cornell University
ljl26@cornell.edu

My institution is interested in monitoring the use of our online resources. Is Web log analysis an effective means of doing this?

Recently a number of organizations, including the Digital Library Federation (DLF) and the Association of Research Libraries (ARL), have urged libraries to take responsibility for documenting the use of the digital resources they manage. The ARL New Measures Initiative has issued its Emetrics Phase II report, offering guidelines for usage statistics that libraries should collect in order to document changes in the use of Web-based resources. It is recommended that libraries begin tracking the number of downloads, page views, queries and search sessions by users. Statistics such as these can be obtained by examining the data produced by the Web servers on which these resources are stored. Each transaction on the Web consists of a request issued through the client's browser and a corresponding response from the Web server. These transactions are automatically recorded by the server in files known as Web logs.
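To give a sense of what one recorded transaction looks like, here is a minimal Python sketch that parses a single log entry. It assumes the server writes the NCSA Common Log Format, which many servers (Apache among them) can produce; your server's layout may differ. The sample line, the function name parse_clf_line, and the field names are invented for illustration.

```python
import re
from datetime import datetime

# Assumed layout (NCSA Common Log Format):
# host ident user [time] "method path protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the text fields that are really dates or numbers.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Invented sample line, for illustration only.
SAMPLE = ('192.0.2.10 - - [15/Feb/2002:10:15:32 -0500] '
          '"GET /collection/index.html HTTP/1.0" 200 5120')

if __name__ == "__main__":
    print(parse_clf_line(SAMPLE))
```

Each parsed entry carries the requesting host, the time of the request, the file requested, the server's status code, and the number of bytes sent, which are the raw materials for counts of downloads and page views.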

The data stored in Web log files consists of long strings of text and numerical data, so reading them can be very difficult and unintuitive. You will probably want to use a log file analysis program to interpret the data. These programs are available as shareware or through various commercial vendors. Some types of log analysis software run on the administrator's desktop, in which case the log files must be transferred from the server to the desktop before analysis is carried out. Other analysis programs run on the server itself and can gather data from the log files directly, either in "real time" or at scheduled intervals. Depending on the size of the log file and the capacity of the hardware, the analysis process can be labor intensive and time consuming. In general, the more detailed the analysis desired, the more complicated and expensive the software tends to be.

The software works by comparing different data sets from the logs and making inferential calculations based on a number of factors. For example, "visits" are determined by grouping requests from a single IP address over a period of time. After a preset period of inactivity (e.g., 30 minutes) on a Web page or site, a visit is considered terminated, and any later activity by the same user is counted as a new visit. The length of a visit is generally calculated as the difference between the date/time stamps of the user's first and last requests.
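The visit logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of any particular product: requests are supplied as (IP address, timestamp) pairs, the 30-minute timeout is a parameter, and a gap of exactly 30 minutes is treated as part of the same visit. The function name count_visits and the sample data are invented.

```python
from datetime import datetime, timedelta

def count_visits(requests, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp) pairs into visits.

    Returns {ip: [(first_request, last_request), ...]}.
    """
    visits = {}
    # Sort by IP address, then by time, so each address is read in order.
    for ip, when in sorted(requests, key=lambda r: (r[0], r[1])):
        spans = visits.setdefault(ip, [])
        if spans and when - spans[-1][1] <= timeout:
            # Still within the inactivity window: extend the current visit.
            spans[-1] = (spans[-1][0], when)
        else:
            # First request from this address, or the timeout has passed.
            spans.append((when, when))
    return visits

# Invented sample data, for illustration only.
REQUESTS = [
    ("192.0.2.10", datetime(2002, 2, 15, 10, 0)),
    ("192.0.2.10", datetime(2002, 2, 15, 10, 12)),
    ("192.0.2.10", datetime(2002, 2, 15, 11, 30)),
    ("198.51.100.7", datetime(2002, 2, 15, 10, 5)),
]

if __name__ == "__main__":
    for ip, spans in count_visits(REQUESTS).items():
        for start, end in spans:
            print(ip, start, "length:", end - start)
```

Note that in this sketch a visit consisting of a single request has a length of zero, because the log records when a page was requested but not how long the user spent reading it.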

Some analysis software packages offer details on the geographic location of users, even down to the city level. More expensive packages can analyze log file types from a wide variety of servers (Apache, Microsoft IIS, Netscape, etc.). Higher-end software packages may provide hundreds of different types of reports. Some offer customizable reports, generated on the fly from databases in which the analyzed data is stored. Such reports may list the top pages visited or the number of unique visitors, and some provide reports that may be re-sorted for viewing along any number of different variables. Most commercial analysis programs also offer some type of graphing function through which the report data may be represented visually. Select packages also allow the user to download data from the report into PDF, Word documents or Excel spreadsheets.

Figure 1. Screenshot from WebTrends Live showing the top twenty countries from which visitors came when they visited the Cornell Department of Preservation and Conservation Web site.

Figure 2. Screenshot from NetTracker Log Analysis software package showing the top ten visitors to the Louis Agassiz Fuertes Web site.

"Live tracking" is another means of analyzing Web traffic that is often used to obtain detailed information about online users. Live trackers are typically third party services that monitor Web traffic by requiring Web site administrators to place special JavaScript code into each of their Web pages. Thus, live tracking doesn't rely on Web logs at all. Instead usage data is derived through the JavaScript each time a page on your site is loaded. Like Web logging, this method also requires an analysis process, but in this case the work is usually done by the live tracking service and you receive a finished report. Live tracking results are usually available in real time, allowing up-to-the minute reporting. Compared to Web log analysis, live tracking can provide an equally in-depth analysis of Web traffic activity, while also offering a more detailed profiling of users' system requirements. The JavaScript employed in live tracking can identify a variety of user display properties, including monitor resolution, pixel dimensions and bit depth, screen widths and available color palettes, as well as information on whether or not cookies (a technology for passing personalized data between Web clients and servers), Java, and JavaScript are enabled. When combined with data on users' Internet connection speeds, this information can help guide decisions about the presentation of digital information, including images.

Figure 3. This report lists the most common screen resolutions used by visitors to Cornell's Preservation site. Screen resolutions are given in terms of pixel dimensions.

Web traffic analysis can be a valuable asset to librarians who want to understand current and potential users of their collections. User statistics can help libraries gain continued funding and administrative support for new and existing digital projects. By analyzing the contents of Web log files, we can learn a great deal about online visitors and about which resources are being used and which ones are not. Such data can assist librarians and archivists in answering questions of whether users are visiting expected pages, which sections they are spending the most time on, and which types of content they appear to be most interested in. Web server administrators can rely on user data to assess file structure and server load over the network. Libraries can determine users' locations (by IP address), and which referring Web pages (links), search engines, and keywords (queries) are transporting them to the digital library. Log files can also provide some indication of whether users are navigating a Web site or resource properly based on click stream data that allows us to see the paths (internal references) through which users are traveling, as well as information on which pages they visit first, which pages they exit from, how long they stay, and what files they choose to access. Log file data allows libraries to assess the number of files that have been downloaded and those for which the download was aborted. The logs also identify any errors that may occur during an online transaction. Additionally, log files contain technical information regarding the user's operating system and Web browser, which is of interest in designing resources for the system requirements of particular audiences.
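The kind of tallying that lies behind such reports can be sketched briefly in Python. The sketch assumes each log entry has already been parsed into a dictionary with the hypothetical keys path, status, and referrer; the function name summarize and the sample entries are invented for illustration.

```python
from collections import Counter

def summarize(entries):
    """Tally successful page requests, referring pages, and error codes."""
    pages, referrers, errors = Counter(), Counter(), Counter()
    for entry in entries:
        if entry["status"] >= 400:
            errors[entry["status"]] += 1  # e.g. 404 Not Found
            continue
        pages[entry["path"]] += 1
        referrer = entry.get("referrer", "-")
        if referrer != "-":  # "-" conventionally marks a missing value
            referrers[referrer] += 1
    return pages, referrers, errors

# Invented sample entries, for illustration only.
ENTRIES = [
    {"path": "/exhibit/index.html", "status": 200,
     "referrer": "http://www.example.edu/links.html"},
    {"path": "/exhibit/index.html", "status": 200, "referrer": "-"},
    {"path": "/exhibit/browse.html", "status": 200,
     "referrer": "http://www.example.edu/links.html"},
    {"path": "/exhibit/missing.html", "status": 404, "referrer": "-"},
]

if __name__ == "__main__":
    pages, referrers, errors = summarize(ENTRIES)
    print("Top pages:", pages.most_common(3))
    print("Top referrers:", referrers.most_common(3))
    print("Errors:", dict(errors))
```

Counting requests per page shows which resources are used and which are not, while the referrer tally indicates which outside links are bringing users to the site.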

There are many good reasons for libraries to use Web traffic analysis software. However, there are a number of important factors to keep in mind. Usage data does not provide rich qualitative information, such as a user's overall satisfaction with resources, and it certainly won't explain why people are searching for particular information. In this regard Web traffic analysis is not a substitute for more qualitative studies (focus groups, surveys, etc.) that the library should also be conducting.

Web traffic and log analysis is essentially an inferential process that relies on heuristic specifications set up by the companies that design the software. Although the reports provide a helpful view of user interaction with library resources, much of the information may be inconclusive. Different software packages use different methods for deriving their reports, and the lack of documentation for many analysis programs makes some of their specifications suspect. For instance, the prevalent use of robots or spiders over the Internet may affect the accuracy of user statistics. Robots are commonly used to comb the Web for data, and when doing so they make frequent visits to each page on a Web site. Analysis programs attempt to control for robots, but many of these visits still slip through the cracks, thereby inflating the number of "actual" reported visitors.
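As a rough illustration of how an analysis program might try to control for robots, the Python sketch below flags a visitor as a probable robot using two simple heuristics. Both are assumptions made for this example rather than the documented method of any product: the visitor's user-agent string contains a telltale word, or the visitor requested the file /robots.txt. The user-agent string is only available if the server is configured to log it. The function name and the list of hints are invented and far from complete.

```python
# Assumed, incomplete list of substrings that suggest an automated client.
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(user_agent, requested_paths=()):
    """Guess whether a visitor is a robot; a heuristic, not a certainty."""
    agent = (user_agent or "").lower()
    if any(hint in agent for hint in ROBOT_HINTS):
        return True
    # Well-behaved robots usually fetch /robots.txt; browsers rarely do.
    return "/robots.txt" in requested_paths

if __name__ == "__main__":
    print(looks_like_robot("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
    print(looks_like_robot("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
                           ["/index.html"]))
```

A robot that identifies itself as an ordinary browser and never requests /robots.txt passes both checks, which is one way such visits slip through and inflate the reported number of visitors.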

Additional Sources of Information

There are many Web log analysis software packages and services available, and the market is in considerable flux. The following sites may prove helpful in evaluating the various products currently in use.

AWStats Official Web Site

Dan Grossman, "Analyzing Your Web Site Traffic," iBoost Journal

Software QA/Test Resource Center, "Web Site Test Tools and Site Management Tools"

Makiko Itoh, "Web Site Statistics: How, Why and What to Count"

"Web Site Analysis," from PC Magazine (June 27, 2000)

Elaine Nowick, "Using Server Logfiles to Improve Website Design," Library Philosophy and Practice, Vol. 4, No. 1 (Fall 2001)

Calendar of Events

Digital Resources for the Humanities: DRH 2002
Call for Papers: Due March 1, 2002
To be held September 8-11, 2002, Edinburgh, Scotland
This annual conference is a forum for all those involved in the digitization of cultural heritage materials.

Second International Workshop on New Developments in Digital Libraries (NDDL2002)
April 2-3, 2002
Ciudad Real, Spain
This workshop will serve as a forum for researchers and practitioners to discuss new developments in digital libraries. Topics include: Metadata Issues, Digital Library Prototypes, Systems Interoperability, and New Roles of Librarians in Digital Libraries.

Museums and the Web 2002
April 17-20, 2002
Boston, MA
In its sixth year, the program addresses Web-related issues for museums, archives, libraries, and other cultural institutions.

CLIR Hosts International Workshop on Digital Preservation
April 24-25, 2002
Washington, D.C.
The Council on Library and Information Resources will hold a workshop entitled The State of Digital Preservation: An International Perspective. The focus will be on international developments in digital preservation and identifying the emerging challenges. Registration information may be found here.

The European Library-Milestone Conference
April 29-30, 2002
Frankfurt am Main, Germany
Nine European national libraries are working with the Conference of European National Librarians to develop a portal for The European Library (TEL) project. This conference will address topics related to the portal, such as: National Libraries and Publishers; Business of Digital Libraries; and Describing and Handling Digital Publications.

Announcements

Digital-Copyright Listserv
To meet the developing application of copyright laws in the online environment, The Center for Intellectual Property has initiated a new listserv. It will provide a forum for the analysis of topics such as copyright law and policy, technologies, and federal information law and policies that impact higher education, particularly digital distance education.

The International Federation of Library Associations and Institutions (IFLA) and the International Publishers' Association (IPA) Establish a Joint Steering Group
One of the goals of this alliance is to develop a joint statement on the archiving and preserving of digital information and to make long-term archiving and preservation a key agenda item internationally.

OSSNLibraries Portal
This portal is a prototype of open source software (OSS) in libraries. It is a combination directory of OSS projects and information resources designed for and useful in library settings.

National Information Standards Organization (NISO)/Book Industry Study Group (BISG) Meeting Report on Digital Archiving
This report looks at three ongoing projects that examine cost-effective business models for archiving, exploring rights issues, and identifying needed standards.

Cedars Project Evaluation for 1998-2001
The evaluation of the first three years of the Cedars Project is now available.

Archival Preservation of Smithsonian Web Resources: Strategies, Principles and Best Practices
The Smithsonian Institution Archives commissioned this study to assess the requirements for the archival preservation of Smithsonian Institution Web sites, and to develop a strategy, guidelines, and best practices that would facilitate access to usable and trustworthy Web sites.

Colorado Digitization Project Best Practice Document for Digital Audio
Feedback is being requested on this document. The draft document provides guidelines for the technical issues, and a set of best practices for converting analog cassette tape recordings of oral histories into digital format.

RLG News

RLG Creates New Discussion List Related to Digital Preservation and Digital Repositories

RLG has created oais-implementers@lists2.rlg.org, a new discussion list intended for individuals and institutions who are actively working with the Open Archival Information System (OAIS) Reference Model as part of an overall effort to model, build, and manage their own digital archive or repository. Currently an International Organization for Standardization (ISO) draft standard, the OAIS provides a common reference model, a common terminology, and a common conceptual framework with which to work, enabling discussion among the many types of organizations and institutions grappling with digital preservation and digital repository creation and management.

It is expected that oais-implementers list members will come from a variety of disciplines including (though not restricted to) libraries, archives, space data centers, corporations, universities, and others. The list and its supporting web pages were created to enable communication and provide information about OAIS reference model implementations, applications, and related standards development. The list also provides a forum for discussion and the opportunity for the exchange of information, ideas, and experience among people engaged in similar activities. The supporting web pages will alert researchers to OAIS activities occurring in similar disciplinary or geographical areas, as well as provide links to further OAIS-related standards development. List members are encouraged to contribute project and contact information to be included in these resources.

To subscribe to the new list:

Send the following message to listmanager@lists2.rlg.org

Subscribe oais-implementers <FirstName LastName>

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site (http://www.rlg.org/preserv/). It will be published six times in 2002. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.

Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello.

All links in this issue were confirmed accurate as of February 14, 2002.

Please send your comments and questions to preservation@cornell.edu.

