Saturday, February 16, 2008

Digital Preservation

Advances in technology have brought about an area of great concern for librarians around the globe. How do we preserve digital information in order to ensure it is around for future generations? Why is this information at risk? What information needs to be preserved? Who needs to be involved in the preservation? How are we going to accomplish this task? In December 2000 the United States Congress, in recognition of the need to address the issue of preserving our digital heritage, approved $100 million for a national strategy initiative, the National Digital Information Infrastructure and Preservation Program. The Congress tasked the Library of Congress with the development and execution of this preservation plan in partnership with other organizations and institutions, both public and private. The Library of Congress’ mission is “to make its resources available and useful to Congress and the American people to sustain and preserve a universal collection of knowledge and creativity for future generations” (“Importance of digital preservation”, n.d.). However, this is not just an issue for North America. Archival institutions throughout the world are working on this same issue. The challenge is to develop rationales, protocols, and methods that may be applied and utilized by institutions now and in the future. In addition, the methods must be robust and yet flexible enough to allow progressive preservation or migration of material to new preservation formats as new technologies are discovered. This posting addresses four of the questions stated earlier: why is some material at risk, what material should be preserved, who is involved with this preservation, and how will it be accomplished.

Why is Digital Preservation Needed

Much of the information available on the Internet has been created solely in a digital format. Websites containing information on an upcoming election, maps used for political redistricting or utilities management, surveys on health care, or the war in Iraq: all of these are considered at-risk objects. Studies have shown that “13% of Internet sources cited in three prestigious journals were not retrievable from the original hyperlink 27 months after publication” (Fenton, 2006). Add to this the increasing number of electronic sources that are more conveniently accessed and easily searched than their print counterparts. Due to that ease of searching for and accessing information via electronic resources, many publishers are moving to electronic publication of media and forgoing paper copies. If this information is not successfully preserved by migrating it into multiple storage formats, it may not be around for future generations to use. The same fate may be in store for other types of media such as videotapes, phonograph records, and cassette tapes, to name a few. Because they rely on specific equipment or software for access, once the equipment or software is obsolete, the data is lost. In addition to the loss of information, there is a cost involved with not standardizing a method of preserving digital information. In 2003-2004 a survey of libraries completed by the Association of Research Libraries revealed that on average 31% of library expenditures were for licensing of electronic resources. However, because of concerns with the authenticity of electronic preservation, many librarians pay not only for the electronic resource but also for subscriptions to print copies, where they are available (Fenton, 2006).

What Needs to Be Preserved

Even though there is agreement that digital information must be preserved, there is not necessarily agreement on what that information should be. Article 7 of the United Nations Educational, Scientific and Cultural Organization (UNESCO) Charter on the Preservation of Digital Heritage, adopted in 2003, emphasizes the need for selection criteria on the basis of “significance and lasting cultural, scientific, evidential or other value” (Lusenet, 2007). The term cultural heritage may be found throughout the charter. However, cultural heritage may apply to movies and television shows as well as to state and local government statistics. On the other side of the world, in 2004 partners of the National Digital Information Infrastructure and Preservation Program agreed to collect and preserve specific types of information which do not exist in print, namely social science data sets, public television programming, political websites, geospatial data, and the history of the dot-com era of the 1990s. In effect this leads to two approaches: everything goes or selective preservation. The everything-goes approach is referred to as a harvesting method, where everything in the national domain is gathered. The mindset is to save everything and then let future generations determine whether they want to keep it or not. This is criticized as an act of storage rather than preservation. On the other hand, selective preservation involves collecting only those pieces of information that are foreseen to be of interest to future generations. This method involves judgment and selection by professionals familiar with the information they are reviewing. There is a third group with a much narrower focus on archiving objects that are digital variations of print documents such as journals. Surprisingly enough, the technical aspect of how to go about preserving large amounts of information has not proven to be the most difficult part of any of the projects. The most difficult part is deciding what to preserve and getting agreement from project partners as well as information owners.

Who is Responsible and How Can They Accomplish It

There are numerous partners in the quest to preserve the world’s digital heritage. The information contained in this paper highlights but a few of those involved and what portion of the overall project they are focusing on.
1) Stanford University Libraries
In 2006 the Library of Congress entered into a three-year agreement with Stanford University funding the CLOCKSS (Controlled Lots of Copies Keep Stuff Safe) digital archiving pilot. This project is intended to provide a secure, long-term archiving solution that is decentralized. CLOCKSS is based on another Stanford program, LOCKSS (Lots of Copies Keep Stuff Safe), described as “open-source software that provides libraries with an easy and inexpensive way to collect, store, preserve, and provide access to their own local copy of authorized content” (NDIIPP, 2006a).
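To make the "lots of copies" idea concrete, here is a minimal Python sketch, offered as an illustration only. It is not the LOCKSS software or its actual protocol; the function name `audit_and_repair`, the holder names, and the simple majority rule are all assumptions made for this example. It shows how several institutions holding the same item could compare checksums and restore a damaged copy from the copies that still agree.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return a SHA-256 checksum used to compare copies without reading them by eye."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Compare all copies by checksum and overwrite any copy that
    disagrees with the strict majority. Returns the repaired holders.

    `copies` maps a (hypothetical) holder name to the bytes it stores.
    """
    votes = Counter(digest(content) for content in copies.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(copies) / 2:
        # Without a strict majority we cannot tell which copy is intact.
        raise ValueError("no majority among copies; manual review needed")

    # Pick any copy that matches the majority checksum as the repair source.
    good_copy = next(c for c in copies.values() if digest(c) == winning_digest)

    repaired = []
    for holder, content in list(copies.items()):
        if digest(content) != winning_digest:
            copies[holder] = good_copy
            repaired.append(holder)
    return repaired


if __name__ == "__main__":
    holdings = {
        "library_a": b"Volume 12, Issue 3",
        "library_b": b"Volume 12, Issue 3",
        "library_c": b"Volume 12, Iss\x00e 3",  # simulated bit damage
    }
    print(audit_and_repair(holdings))  # ['library_c']
```

The point of the sketch is the design choice, not the details: no single copy is trusted as the master, so the loss or corruption of any one holder's copy does not mean the loss of the item.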
2) Library of Congress (LOC)
One of several programs under the leadership of the Library of Congress is the National Digital Information Infrastructure and Preservation Program (NDIIPP). The goals of this program, as dictated by the United States Congress, are threefold: first, to identify a national network of libraries and other organizations with responsibilities for collecting digital materials that will provide access to and maintain those materials; second, to set forth, in concert with the Copyright Office, the policies, protocols and strategies for the long-term preservation of such materials, including the technological infrastructure required at the Library of Congress; and lastly, to advance digital preservation methods and determine the best practice. Another LOC program is IRENE (Image, Reconstruct, Erase Noise, Etc.), a conservation technology that creates digital audio files by taking high-resolution images of media such as early phonograph records. In the same vein is SAMMA (System for the Automated Migration of Media Assets), a robotic system that creates preservation-quality digital files from cassette-based media. The Library of Congress has also undertaken the Preserving Creative America initiative along with eight partners. This initiative, part of NDIIPP, targets digital preservation of creative media such as movies, sound recordings, digital photography, and video games (LC, 2007). Finally, there is the Web Capture Program, which creates archives of websites on topics such as Supreme Court nominations, Hurricane Katrina, and the papal transition following the death of John Paul II (NDIIPP, 2006a), as well as the Iraq war; the elections of 2000, 2002, 2004, and the upcoming 2008 election; 9/11 remembrance; the 107th Congress; and the Winter Olympic Games of 2002 (Web capture, n.d.).
3) North Carolina State University Libraries in partnership with the North Carolina Center for Geographic Information and Analysis and NC OneMap
In February 2003 NC OneMap was unveiled as a combined state, federal, and local initiative focused on providing view access to geographical data across North Carolina. In addition, it allows users to search for and download data, view and query metadata, and identify who is in possession of what data (Morris, 2006). Later that year the North Carolina State University Libraries, in partnership with the North Carolina Center for Geographic Information and Analysis and NC OneMap, announced the North Carolina Geospatial Data Archiving Project, which will focus on the collection and preservation of digital geospatial data resources from state and local government agencies in North Carolina. The project objectives are to identify resources using OneMap, gather at-risk data, develop a reliable method for storing the data, enhance the metadata for better identification, and develop a model for data archiving. The initial project plan is to retain the data objects in the format received, and then export the content into a more reliable commercial vector format.
4) University of California at Santa Barbara and Stanford University
One of the eight projects funded for the NDIIPP is the National Geospatial Digital Archive (NGDA), whose goal is to design repository infrastructures at each university and to collect materials across a broad spectrum of geographic formats (Sweetkind-Singer, 2006). Geospatial data is particularly complex when it comes to archiving, due in part to the multiple file format layers that accompany each object. Each layer is needed in order to view the file, but each may be stored in a completely different format than the main file. The NGDA will include prototype archives for data housing as well as a geospatial format registry that will describe the stored data.
5) PREMIS – Preservation Metadata Implementation Strategies
PREMIS is a data dictionary developed by the PREMIS working group as a specification with the goal of creating a set of core preservation metadata elements (Guenther, 2007). Metadata elements are information on digital objects such as identifiers, size, relationships with other objects, and creating application information. The intent is to establish a dictionary to be used by archiving institutions in order to adequately and uniformly identify the objects that are being archived. The dictionary is not only meant for ‘after’ creation but for ‘during’ creation as well. If objects are created with metadata that sufficiently describe what they are and how they came to be, preservation of those objects is greatly simplified.
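As a rough illustration of what recording such metadata at creation time could look like, here is a small Python sketch. The class and field names (`PreservationRecord`, `identifier`, `size_bytes`, `creating_application`, `related_identifiers`) are made up for this example and are not the element names defined in the PREMIS data dictionary; they simply mirror the four kinds of information listed above.

```python
from dataclasses import dataclass, field


@dataclass
class PreservationRecord:
    """Hypothetical core metadata for one digital object (not actual PREMIS names)."""
    identifier: str                  # unique name for the object
    size_bytes: int                  # size of the stored content
    creating_application: str        # software that produced the object
    related_identifiers: list[str] = field(default_factory=list)  # links to other objects


def describe(identifier: str, content: bytes, creating_application: str,
             related=()) -> PreservationRecord:
    """Build a record at creation time, measuring the size from the content itself."""
    return PreservationRecord(
        identifier=identifier,
        size_bytes=len(content),
        creating_application=creating_application,
        related_identifiers=list(related),
    )


if __name__ == "__main__":
    # All values below are invented sample data.
    record = describe(
        identifier="newsletter-2008-02",
        content=b"February issue text",
        creating_application="ExampleWriter 2.1",
        related=["newsletter-2008-01"],
    )
    print(record)
```

Because every archived object would carry the same fields, an archive could search, compare, and migrate its holdings uniformly, which is the consistency the dictionary is meant to provide.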
6) JSTOR in partnership with Ithaka, The Andrew W. Mellon Foundation, and The Library of Congress
JSTOR and partners have launched a project, Portico, which is a not-for-profit electronic archiving service established in order to address the scholarly community’s critical and urgent need for a robust, reliable means to preserve electronic scholarly journals (Fenton, 2006). Portico’s mission is to preserve scholarly literature published in electronic form and to ensure that these materials remain accessible to future generations of scholars, researchers, and students (NDIIPP, 2006b).
7) UNESCO – United Nations Educational, Scientific and Cultural Organization
The UNESCO Charter on the Preservation of Digital Heritage, adopted in 2003 as one way of safeguarding documentary heritage, is closely connected to the Memory of the World Programme which aims to preserve and promote cultural heritage through digitization projects, the publication of guidelines, and the Memory of the World Register of over a hundred works of exceptional importance. The charter defines cultural heritage as cultural, educational, scientific, administrative, technical and medical resources created in digital format or converted into digital format from print sources. Resources include text, databases, still and moving images, audio, graphics, software, and web pages (Lusenet, 2007). The charter is important because it affirms the role of archival institutions and extends existing preservation systems to include digital media.
8) The British Library (BL)
The British Library is working with the Internet Archive on using the Heritrix crawler and harvester tool. The focus of the BL is to create a digital archive infrastructure that would allow storage of information harvested from websites in the UK domain (Hawkins, 2007).
9) International Internet Preservation Consortium (IIPC)
The IIPC involves the Library of Congress; the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, and Sweden; the British Library; and the Internet Archive (USA) (Mission statement, 2007). The consortium's goals are to collect Internet content; develop common tools, standards and techniques for archiving; be a strong advocate for initiatives and legislation that support archiving Internet content; and support preservation of Internet content by libraries, archives, museums, and cultural heritage institutions around the world.

Digital preservation is a complex undertaking both from the perspective of archiving data created in obsolete formats and preparation of newly created data objects for archiving. Because of the complexity and the enormity of data that does or will need archiving it is a process that requires partnerships in order to divide and conquer. There are many hurdles for the digital preservation project: establishing criteria to determine what data objects will be preserved, obtaining copyright agreements that will allow duplication of information, and establishing a strong foundation that future preservation actions will build upon. With project deadlines of 2010 through 2015 the projects will have to clear these hurdles in order to deliver a repeatable, reliable, and secure method for long-term preservation of the world’s digital heritage.


References

Fenton, E. (2006, April). Preserving electronic scholarly journals: Portico. Ariadne, 47. Retrieved December 5, 2007 from http://www.ariadne.ac.uk/issue47/fenton/intro.html

Guenther, R. (2007, April). PREMIS what it stands for: Preservation metadata implementation strategies [Electronic version]. Computers in Libraries, 27(4), 19.

Hawkins, D. (2007, May). The incredible digital journey [Electronic version]. Information Today, 24(5), 22-23.

Importance of digital preservation. (n.d.). Library of Congress website. Retrieved December 2, 2007 from http://www.digitalpreservation.gov/importance/

LC announces digital preservation partnerships [Electronic version]. (2007, September). American Libraries, 38(8), 40.

Library of Congress. (2007). Sustainability of digital formats, Planning for the Library of Congress collections. Retrieved December 5, 2007 from http://www.digitalpreservation.gov/formats/sustain/sustain.shtml

Lusenet, Y. (2007, Summer). Tending the garden or harvesting the fields: Digital preservation and the UNESCO charter on the preservation of the digital heritage [Electronic version]. Library Trends, 56(1), 164-182.

Mission statement. (2007, August). International Internet Preservation Consortium website. Retrieved December 8, 2007 from http://netpreserve.org/about/index.php

Morris, S. (2006, Fall). Geospatial web services and geoarchiving: New opportunities and challenges in geographic information services [Electronic version]. Library Trends, 55(2), 285-303.

NDIIPP supports CLOCKSS: Library makes preservation award to Stanford [Electronic version]. (2006a, July/August). Library of Congress Information Bulletin, 65(7/8), 176.

NDIIPP supports Portico: Nonprofit electronic archiving service receives award [Electronic version]. (2006b, January). Library of Congress Information Bulletin, 65(1), 13.

New LC audiovisual center boasts state-of-the-art technology [Electronic version]. (2007, September). American Libraries, 38(8), 40.

Sweetkind-Singer, J., Larsgaard, M., & Erwin, T. (2006, Fall). Digital preservation of geospatial data [Electronic version]. Library Trends, 55(2), 304-314.

Web capture. (n.d.). Library of Congress website. Retrieved December 8, 2007 from http://www.loc.gov/webcapture/index.html

8 comments:

Richelle Rininger said...

Wow, you have really done your homework. I have learned much from your posting about the history and future of information preservation. A part of me is still worried that if everything is digital only and if we have a digital meltdown, then what would happen to all that information that we preserved? Is it all gone or is there going to be a double backup system to ensure that that does not happen? Just food for thought...

Kate Dunigan AtLee said...

This information is excellent, Lisa. Thank you, I enjoyed your thorough exploration of digital preservation.

It seems like there are two motivations behind what gets preserved. One being that those who decide what gets preserved are guessing at what historical information future societies will want. The other is the idea that we are preserving what we want future generations to know about us. Are there other motivations for preservation you have come across? Which is the most prevalent?

Kate Dunigan AtLee said...

You mentioned that there is a movement by some to preserve everything and let future generations decide what they are interested in. I'm fascinated that these people think this is actually possible. What a burden to place on future information professionals!

Carol Winfield said...

Great work, Lisa! This is a critical area in our field. It pains me to think of how much digital information we have already lost. Where I work, we publish a bi-monthly journal. We lost 2-1/2 years of our archives that were stored on Syquest disks (anybody even remember those?) A few disks went bad. Luckily, we moved the rest of the archives to another medium before Syquest drives became obsolete.

Emily W. said...

Your post is thorough, well-organized, and thought provoking. Digital preservation is an admirable and necessary undertaking in these technologically advanced times. My concerns are in reference to the costs involved and possible funding sources. What implications will our weakened economy have on these timelines for digitization?

Maureen said...

Like Emily, I'm very concerned about the cost issue affecting the decision of what is preserved. There is so much that we need to keep for present & future generations, but who can fund this? Kate asks what some of the motivations are for preserving info. Unfortunately, what is profitable is one big motivation (corporations and those with funding can and do preserve), yet something like cultural info, or research studies (not linked to corporations) may not have the funding.
It is scary to think what has been lost already due to this being an issue that is not dealt with sufficiently yet. This contribution by Lisa points to some farsighted organizations that are trying to raise this issue.
It is not a new issue, by far. In ancient times, decisions were made about what to preserve. An example is the monks that copied over and preserved writings... yet a decision making process went on then, too. Perhaps they focused on religious writings. There may have been so much literature and other views (not permitted by the church) that did not get copied. We always have a slanted view of what info exists since we don't know the full scope (it is not all available or preserved).

Bubbly Bibliophile said...

Awesome article. Very well researched and informative. I wasn't surprised by the statistic of unavailable links. I find this frequently in my research. I am very excited about the future of digital archiving!

Lisa Anderson said...

Kate,
The only motivation I detected was that of the specific organization wanting to preserve information related to its specialty (e.g. art, movies, television, geospatial data). This is a topic that you could dig and dig and still not reach the bottom. Amazing in that it hasn't been around for that long. I am astounded, though, that someone could think that a legitimate way of tackling this issue is to save everything and let the future generations sort it out. What a waste of money and labor to save something that the librarian of the future will just turn around and discard.