RLG DigiNews
February 15, 2002, Volume 6, Number 1
ISSN 1093-5371

FAQ
Lee LaFleur

Cornell University
ljl26@cornell.edu



My institution is interested in monitoring the use of our online resources. Is Web log analysis an effective means of doing this?

Recently a number of organizations, including the Digital Library Federation (DLF) and the Association of Research Libraries (ARL), have urged libraries to take responsibility for documenting the use of the digital resources they manage. The ARL New Measures Initiative has issued its E-Metrics Phase II report, offering guidelines for the usage statistics libraries should collect in order to document changes in the use of Web-based resources. The report recommends that libraries begin tracking the number of downloads, page views, queries, and search sessions by users. Statistics such as these can be obtained by examining the data produced by the Web servers on which the resources are stored. Each transaction on the Web consists of a request issued through the client's browser and a corresponding response from the Web server. These transactions are automatically recorded by the server in files known as Web logs.
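To make the structure of these files concrete, many Web servers record each transaction in some variant of the common log format, one line per request. The short Python sketch below parses a single such line; the sample entry and field names are purely illustrative and are not drawn from any particular server's logs.

    import re

    # A hypothetical log entry in the widely used common log format (one line per request).
    SAMPLE = '157.55.32.10 - - [15/Feb/2002:10:31:07 -0500] "GET /preservation/index.html HTTP/1.0" 200 4512'

    # Regular expression matching the common log format fields.
    CLF_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\S+)'
    )

    match = CLF_PATTERN.match(SAMPLE)
    if match:
        entry = match.groupdict()
        print(entry['host'])       # client IP address
        print(entry['timestamp'])  # date/time stamp of the request
        print(entry['path'])       # file requested
        print(entry['status'])     # server response code (200 = success)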

Web log files consist of long strings of textual and numerical data, so reading them directly can be difficult and unintuitive. You will probably want to use a log file analysis program to interpret the data. These programs are available as shareware or from various commercial vendors. Some types of log analysis software run on the administrator's desktop, in which case the log files must be transferred from the server to the desktop before analysis is carried out. Other analysis programs run on the server itself and can gather data from the log files directly, either in "real time" or at scheduled intervals. Depending on the size of the log files and the capacity of the hardware, the analysis process can be labor-intensive and time-consuming. In general, the more detailed the analysis desired, the more complicated and expensive the software tends to be.

The software works by comparing different data sets from the logs and making inferential calculations based on a number of factors. "Visits," for example, are determined by counting requests from a single IP address over a period of time, and the length of a visit is generally calculated as the difference between the date/time stamps of a user's arrival and departure requests. After a preset period of inactivity (e.g., 30 minutes) on a Web page or site, a visit is considered terminated; any activity by the same user after this period is counted as a new visit.
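As a rough illustration of this inferential process, the Python sketch below groups a list of hypothetical (IP address, time stamp) pairs into visits using the 30-minute inactivity rule described above; commercial packages apply more elaborate, and usually undisclosed, heuristics.

    from datetime import timedelta

    INACTIVITY_LIMIT = timedelta(minutes=30)  # preset period of inactivity

    def count_visits(requests):
        """Group (ip_address, datetime) pairs from a parsed log into visits."""
        visits = []
        last_seen = {}   # most recent request time per IP address
        current = {}     # the visit currently open for each IP address

        for ip, when in sorted(requests, key=lambda r: r[1]):
            previous = last_seen.get(ip)
            if previous is None or when - previous > INACTIVITY_LIMIT:
                # No earlier activity, or the inactivity limit has passed: a new visit.
                current[ip] = {'ip': ip, 'start': when, 'end': when}
                visits.append(current[ip])
            else:
                # The same visit continues; extend its end time.
                current[ip]['end'] = when
            last_seen[ip] = when

        for visit in visits:
            # A visit's length is the difference between its arrival and departure times.
            visit['length'] = visit['end'] - visit['start']
        return visits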

Some analysis software packages offer details on the geographic location of users, even down to the city level. More expensive packages can analyze log file formats from a wide variety of servers (Apache, Microsoft IIS, Netscape, etc.). Higher-end packages may provide hundreds of different types of reports. Some offer customizable reports, generated on the fly from databases in which the analyzed data is stored. Such reports may list the top pages visited or the number of unique visitors, and some can be re-sorted for viewing along any number of different variables. Most commercial analysis programs also offer some type of graphing function through which the report data may be represented visually. Select packages also allow the user to export report data to PDF, Word, or Excel formats.
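To suggest how such figures are derived, the sketch below tallies the top pages visited and the number of unique visitors (distinct IP addresses) from a list of parsed log entries; the field names are hypothetical, carried over from the parsing example earlier.

    from collections import Counter

    def summarize(entries):
        """Report top pages and unique visitors from parsed log entries.

        `entries` is a list of dicts with at least 'host' and 'path' keys,
        as produced by parsing the log line by line.
        """
        page_counts = Counter(e['path'] for e in entries)
        unique_visitors = len({e['host'] for e in entries})  # distinct IP addresses

        print("Unique visitors (distinct IP addresses):", unique_visitors)
        print("Top pages visited:")
        for path, hits in page_counts.most_common(10):
            print(f"{hits:6d}  {path}")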


Figure 1. Screenshot from WebTrends Live showing the top twenty countries from which visitors came to the Cornell Department of Preservation and Conservation Web site.


Figure 2. Screenshot from the NetTracker log analysis software package showing the top ten visitors to the Louis Agassiz Fuertes Web site.

"Live tracking" is another means of analyzing Web traffic that is often used to obtain detailed information about online users. Live trackers are typically third-party services that monitor Web traffic by requiring Web site administrators to place special JavaScript code into each of their Web pages. Thus, live tracking doesn't rely on Web logs at all; instead, usage data is gathered by the JavaScript each time a page on your site is loaded. Like Web logging, this method also requires an analysis process, but in this case the work is usually done by the live tracking service and you receive a finished report. Live tracking results are usually available in real time, allowing up-to-the-minute reporting. Compared to Web log analysis, live tracking can provide an equally in-depth analysis of Web traffic while also offering more detailed profiling of users' systems. The JavaScript employed in live tracking can identify a variety of display properties, including monitor resolution (pixel dimensions), color bit depth, and available color palettes, as well as whether cookies (a technology for passing personalized data between Web clients and servers), Java, and JavaScript are enabled. When combined with data on users' Internet connection speeds, this information can help guide decisions about the presentation of digital information, including images.


Figure 3. This report lists the most common screen resolutions used by visitors to Cornell's Preservation site. Screen resolutions are given in terms of pixel dimensions.
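To illustrate the mechanism behind reports like the one in Figure 3, the minimal sketch below stands in for the server side of a live tracking service: the JavaScript placed on each page requests a tiny image from the tracker with the detected display properties appended to the URL, and the handler simply records whatever arrives before returning an invisible 1x1 GIF. The URL, parameter names, and output format here are hypothetical and do not describe any particular commercial service.

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    # A transparent 1x1 GIF, the classic "tracking pixel" returned to the browser.
    PIXEL = (b'GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff'
             b'!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00'
             b'\x00\x02\x02D\x01\x00;')

    class TrackerHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The JavaScript on a tracked page would request something like:
            #   /collect?page=/index.html&width=1024&height=768&depth=32&cookies=1
            query = parse_qs(urlparse(self.path).query)
            record = {
                'client': self.client_address[0],
                'page': query.get('page', [''])[0],
                'resolution': query.get('width', ['?'])[0] + 'x' + query.get('height', ['?'])[0],
                'color_depth': query.get('depth', ['?'])[0],
                'cookies_enabled': query.get('cookies', ['?'])[0],
            }
            with open('livetracker.log', 'a') as out:
                out.write(repr(record) + '\n')
            # Reply with the invisible pixel so the tracked page renders normally.
            self.send_response(200)
            self.send_header('Content-Type', 'image/gif')
            self.send_header('Content-Length', str(len(PIXEL)))
            self.end_headers()
            self.wfile.write(PIXEL)

    if __name__ == '__main__':
        HTTPServer(('localhost', 8080), TrackerHandler).serve_forever()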


Web traffic analysis can be a valuable asset to librarians who want to understand current and potential users of their collections. User statistics can help libraries gain continued funding and administrative support for new and existing digital projects. By analyzing the contents of Web log files, we can learn a great deal about online visitors and about which resources are being used and which are not. Such data can help librarians and archivists determine whether users are visiting the expected pages, which sections they spend the most time on, and which types of content they appear to be most interested in. Web server administrators can rely on the same data to assess file structure and server load over the network.

Libraries can determine users' locations (by IP address) and which referring Web pages (links), search engines, and keywords (queries) are bringing them to the digital library. Log files can also give some indication of whether users are navigating a Web site or resource as intended: click stream data reveals the paths (internal references) users travel, which pages they visit first, which pages they exit from, how long they stay, and which files they choose to access. Log file data allows libraries to count the files that have been downloaded and those for which the download was aborted, and the logs identify any errors that occur during an online transaction. Additionally, log files record technical information about the user's operating system and Web browser, which is useful when designing resources for the system capabilities of particular audiences.
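Building on the visit grouping sketched earlier, the hypothetical snippet below shows how entry pages, exit pages, and top referring pages might be tallied; the data structures are assumptions carried over from the previous examples rather than the output of any actual analysis package.

    from collections import Counter

    def navigation_report(visit_paths, entries):
        """Summarize entry pages, exit pages, and referrers.

        `visit_paths` is a list of visits, each an ordered list of the pages
        requested during that visit; `entries` is the full list of parsed log
        records, each with an optional 'referrer' field recorded by the server.
        """
        entry_pages = Counter(paths[0] for paths in visit_paths if paths)
        exit_pages = Counter(paths[-1] for paths in visit_paths if paths)
        referrers = Counter(e['referrer'] for e in entries
                            if e.get('referrer') and e['referrer'] != '-')

        print("Most common entry pages:", entry_pages.most_common(5))
        print("Most common exit pages:", exit_pages.most_common(5))
        print("Top referring pages:", referrers.most_common(5))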

There are many good reasons for libraries to use Web traffic analysis software, but there are also important factors to keep in mind. Usage data does not provide rich qualitative information, such as a user's overall satisfaction with resources, and it certainly won't explain why people are searching for particular information. In this regard, Web traffic analysis is not a substitute for the more qualitative studies (focus groups, surveys, etc.) that the library should also be conducting.

Web traffic and log analysis is essentially an inferential process that relies on heuristics built into the software by the companies that design it. Although the reports provide a helpful view of user interaction with library resources, much of the information may be inconclusive. Different software packages use different methods for deriving their reports, and the lack of documentation for many analysis programs makes some of their specifications suspect. For instance, the prevalent use of robots or spiders on the Internet may affect the accuracy of user statistics. Robots are commonly used to comb the Web for data, and in doing so they make frequent visits to each page on a Web site. Analysis programs attempt to control for robots, but many of these visits still slip through the cracks, inflating the number of "actual" reported visitors.
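One common way of controlling for robots is to discard requests whose user-agent field matches a list of known crawlers, or that fetch the robots.txt file; the sketch below shows the idea. The marker list is illustrative rather than exhaustive, which is also why some robot traffic slips through: a robot that does not identify itself looks just like a person.

    # Illustrative substrings found in the user-agent fields of well-known robots;
    # a real list would be far longer and would still miss crawlers that do not
    # identify themselves.
    ROBOT_MARKERS = ('googlebot', 'slurp', 'crawler', 'spider', 'bot')

    def looks_like_robot(entry):
        """Guess whether a parsed log entry came from a robot rather than a person."""
        agent = entry.get('user_agent', '').lower()
        if any(marker in agent for marker in ROBOT_MARKERS):
            return True
        # Well-behaved robots also request robots.txt before crawling a site.
        return entry.get('path') == '/robots.txt'

    def human_entries(entries):
        """Filter out probable robot traffic before computing visit statistics."""
        return [e for e in entries if not looks_like_robot(e)]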

Additional Sources of Information

There are many Web log analysis software packages and services available, and the market is in considerable flux. The following sites may prove helpful in evaluating the various products currently in use.

AWStats Official Web Site

Dan Grossman, "Analyzing Your Web Site Traffic," iBoost Journal

Software QA/Test Resource Center, "Web Site Test Tools and Site Management Tools"

Makiko Itoh, "Web Site Statistics: How, Why and What to Count"

"Web Site Analysis," from PC Magazine (June 27, 2000)

Elaine Nowick, "Using Server Logfiles to Improve Website Design," Library Philosophy and Practice, Vol. 4, No. 1 (Fall 2001)



Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) from 1998 to 2000, it is available internationally via the RLG PRESERV Web site (http://www.rlg.org/preserv/). It will be published six times in 2002. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: please cite the individual author and RLG DigiNews (including the URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.


Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.


RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello.


All links in this issue were confirmed accurate as of February 14, 2002.


Please send your comments and questions to preservation@cornell.edu.

end of issue