[May 16-17, 2011] Global Interoperability and Linked Data Workshop
On May 16-17, 2011, the Berkman Center together with Open Knowledge Commons and the Institute for Information Law at the University of Amsterdam convened a group of technical and legal experts from public and research libraries and government agencies in the United States and Europe for a workshop focused on key questions regarding global interoperability in digital libraries.
Segment I: Welcome and Setting the Stage
Maura Marx (DPLA Steering Committee; Executive Director of Open Knowledge Commons) introduced the workshop by emphasizing the importance of conceiving of the DPLA in a global context; looking at lessons from existing projects such as Europeana; and applying these lessons not only to the DPLA itself but also to a global digital library infrastructure.
Lucie Guibault (Institute for Information Law at the University of Amsterdam) spoke briefly about her experiences working with Europeana, outlining some of the key interoperability issues and challenges that arise from working with a wide range of institutions including museums, libraries, and archives.
John Palfrey (DPLA Steering Committee Chair; Harvard Law School; Berkman Center for Internet & Society) gave a short history of the DPLA, describing the efforts that have already been made to connect groups of people in the United States who can interoperate and expressing hopes that this “big tent” approach can extend globally. Interoperability is a core principle of the DPLA; efforts to build such a library should begin with a presumption of high-level interoperability both within the DPLA and across other projects.
Segment II: Linked data and interoperability in Europeana
The second segment of the workshop began with an introduction to linked data followed by presentations on how linked open data is being used by Europeana.
Dan Brickley explained that the idea of linked data dates back to Tim Berners-Lee’s original conception of the web, as outlined in his 1989 proposal for information management. Brickley introduced Berners-Lee’s four principles of linked data, described the history of the concept of the “semantic web,” and gave a short overview of the Resource Description Framework (RDF).
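The core idea Brickley outlined can be sketched without any RDF tooling: data is expressed as (subject, predicate, object) triples whose identifiers are URIs, so statements from different datasets can link to one another. The URIs, property names, and query helper below are illustrative placeholders, not real DPLA or Europeana identifiers.

```python
# A hand-rolled sketch of RDF's core idea: data as (subject, predicate, object)
# triples, where identifiers are URIs that can link records across datasets.
# All URIs and property names here are invented for illustration.

triples = [
    ("http://example.org/book/moby-dick", "dc:title", "Moby-Dick"),
    ("http://example.org/book/moby-dick", "dc:creator", "http://example.org/person/melville"),
    ("http://example.org/person/melville", "foaf:name", "Herman Melville"),
]

def query(data, s=None, p=None, o=None):
    """Return triples matching a simple pattern; None acts as a wildcard."""
    return [t for t in data
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Follow a link: the creator of the book is itself a URI we can query further.
creator = query(triples, s="http://example.org/book/moby-dick", p="dc:creator")[0][2]
print(query(triples, s=creator, p="foaf:name")[0][2])  # Herman Melville
```

The point of the second query is that the object of one triple can be the subject of another, which is what makes the data "linked" rather than a set of isolated records.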
Paul Keller (Kennisland), who works on the Europeana Connect project, described the challenges that have arisen while working with cultural heritage organizations to assemble and share metadata. The biggest challenge is licensing. Many participants argued that factual data (e.g., the author of a work or the date it was produced) should be free of copyright; however, some cultural heritage institutions are reluctant to share their metadata freely, since they invest considerable resources in its creation and see licensing it as a way to recoup some of those costs. As a way of cooperating with these organizations, Europeana is proposing that they start by sharing smaller quantities of data or less complete data sets as a way of easing into data-sharing arrangements. Europeana is also pushing toward a “CC0” license for data, which by removing constraints on data reuse would enable innovative, commercially attractive services to be built on top of the data.
Antoine Isaac (Vrije Universiteit Amsterdam) described the transition from Europeana’s initial data model to the new Europeana Data Model (EDM). The initial model was the “lowest common denominator” for metadata; it forced interoperability by sacrificing some of the original detail. The EDM is geared toward preserving this detail, enabling richer data while still allowing for interoperability. It is being developed collaboratively; Europeana is posting the evolving specifications online and soliciting feedback.
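The trade-off Isaac described can be sketched in miniature. The field names and records below are invented for illustration: a “lowest common denominator” mapping keeps only the fields every provider shares, while an EDM-style mapping preserves provider-specific detail alongside that shared core.

```python
# Invented example records from two kinds of contributing institutions.
museum_record = {"title": "Night Watch", "creator": "Rembrandt",
                 "medium": "oil on canvas", "dimensions_cm": (363, 437)}
library_record = {"title": "Leaves of Grass", "creator": "Walt Whitman",
                  "edition": "1855 first edition", "shelfmark": "XX.123"}

COMMON = {"title", "creator"}  # the only fields every provider supplies

def lcd(record):
    # Forced interoperability: drop everything outside the shared core.
    return {k: v for k, v in record.items() if k in COMMON}

def edm_style(record, provider):
    # Keep the shared core for cross-collection search, but retain the
    # original detail so nothing is lost in aggregation.
    return {"core": lcd(record), "provider": provider, "original": dict(record)}

print(lcd(museum_record))                                   # detail is gone
print(edm_style(museum_record, "museum")["original"]["medium"])  # detail survives
```

In the first approach the painting's medium and dimensions are simply discarded; in the second they remain queryable even though cross-collection search still runs over the common core.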
Dr. Stefan Gradmann (Humboldt University Berlin) introduced the Linked Open Data 2 (LOD2) project, which enables application building on top of linked open data. He stressed the importance of understanding that open is not the same thing as free, and that free is not the same thing as anti-commercial: open data projects should not exclude commercial reuse.
During discussion participants raised the question of data persistence and how best to ensure that permalinks are truly permanent. Participants noted that persistence would bring value not only in an academic sense, for scholars who need to be able to reference sources online, but also by allowing users to easily circulate links through social networks. The presentations also sparked a debate about data licensing, including over whether potential licenses should require attribution. Some participants suggested creating a common set of open data licensing standards to which institutions could publicly commit.
Segment III: Interoperable discovery: bibliographic metadata
The third segment of the workshop focused on various existing approaches to handling bibliographic metadata.
Rufus Pollock (Open Knowledge Foundation) described his work with openbiblio.net, publicdomainworks.net, and bibliographica.org. His presentation emphasized the need for a standard for openness, which he argued is fundamental to interoperability, scaling, and building a global data infrastructure.
Lorcan Dempsey (OCLC) presented the Virtual International Authority File (VIAF), which uses data from libraries and bibliographic records to connect the names (and variant names) of authors and creators and present them in a way that makes them easier to manage.
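A toy model makes the shape of an authority file like VIAF concrete: each person gets one cluster identifier, and variant name forms from different sources all resolve to it. The identifier and clustering below are invented placeholders; real VIAF matching reconciles records using bibliographic evidence, not a simple lookup table.

```python
# Toy authority file: one cluster per person, with variant forms from
# different national libraries. "viaf:0001" is an invented placeholder ID.
authority = {
    "viaf:0001": {
        "preferred": "Tolstoy, Leo, 1828-1910",
        "variants": {"Tolstoï, Lev Nikolaevitch",
                     "Tolstoj, Lev N.",
                     "Толстой, Лев Николаевич"},
    },
}

# Build a reverse index so any known form resolves to the cluster.
lookup = {}
for ident, rec in authority.items():
    lookup[rec["preferred"]] = ident
    for variant in rec["variants"]:
        lookup[variant] = ident

print(lookup["Tolstoj, Lev N."])  # viaf:0001
```

Once variant forms resolve to a single identifier, records catalogued under different spellings can be managed, deduplicated, and presented together.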
Ed Summers (Library of Congress) discussed issues of data synchronization, stressing the importance of aligning data efforts with existing web trends instead of creating specialized, library-only approaches to handling data. One example of this is allowing Google to index content, rather than hiding it behind a search box or a single user interface, in order to make it discoverable and useful.
Jonathan Rothman (HathiTrust) explained how HathiTrust aggregates and distributes bibliographic metadata from partners who contribute objects. One challenge HathiTrust faces is matching different versions of bibliographic records that describe the same object.
John Weise (HathiTrust) described HathiTrust’s work on full text search and ways to extend discovery. Part of the organization’s agreement with Google requires that it prevent systematic download of its data, which limits opportunities for discovery. Weise noted that as new discovery tools emerge, users will expect to be able to search more than metadata; some architectures will be better able to handle this than others, and the DPLA should continue to keep users’ best interests in mind while building its technical system.
Much of the discussion focused on where and how the DPLA should be open, and in what ways. Participants noted that working with institutions to license their data for reuse in digital library contexts is one of the most challenging aspects facing existing organizations in this space. John Palfrey (DPLA Steering Committee Chair; Harvard Law School; Berkman Center for Internet & Society) described three layers of the DPLA—code, metadata, and content—and suggested that the code be kept completely open source and the metadata as open as possible while tackling thornier copyright issues surrounding content later.
Segment IV: Interoperable use: licensing frameworks and rights language
The fourth segment of the workshop focused on approaches to identifying rights and rights holders and on developing an overarching licensing framework for digital libraries.
Paul Keller (Kennisland) described efforts taken via the Europeana Public Domain Charter to clearly mark works in the public domain with a machine-readable label. Europeana is attempting to establish a set of best practices around cultural heritage objects that aren’t protected by copyright. Efforts are also underway to create a “public domain calculator” that analyzes whether works are definitively in the public domain. This is an incredibly complex project, given the multiplicity of jurisdictions and copyright laws in Europe, and it is perhaps most helpful in combination with ARROW, the Durationator, etc. Complicating the issue is the fact that much of the content in Europeana that is public domain has been mislabeled (for example, with a CC license by someone who doesn’t understand the difference) or is unlabeled.
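A drastically simplified sketch suggests what a “public domain calculator” computes, and why the real project is so complex: actual tools must encode dozens of jurisdiction-specific rules, exceptions, and unknowns. The function below uses only the familiar baseline rules (life plus 70 years in the EU; pre-1923 publication in the US, as of 2011) and treats everything else as requiring review.

```python
# Simplified "public domain calculator" sketch. Real calculators model many
# jurisdiction-specific rules and edge cases; these baseline terms are for
# illustration only, evaluated as of 2011 (the workshop year).

def public_domain_status(jurisdiction, death_year=None, pub_year=None, as_of=2011):
    if jurisdiction == "EU":
        if death_year is None:
            return "unknown"        # cannot compute the term without author data
        # Baseline EU term: protection runs 70 years past the author's death.
        return "public domain" if as_of > death_year + 70 else "protected"
    if jurisdiction == "US":
        if pub_year is None:
            return "unknown"
        # Pre-1923 US publications were out of copyright as of 2011;
        # later works need the kind of full review HathiTrust performed.
        return "public domain" if pub_year < 1923 else "needs full review"
    return "unknown"                # unmodeled jurisdiction

print(public_domain_status("EU", death_year=1910))  # author died 1910: public domain
print(public_domain_status("US", pub_year=1930))    # needs full review
```

Even this toy version returns "unknown" whenever a required fact is missing, which mirrors the workshop's observation that mislabeled or unlabeled content is a major obstacle to automated determination.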
John Weise (HathiTrust) described the copyright review undertaken by HathiTrust on books published in the United States between 1923 and 1963. Just over half of the 135,000 books reviewed were found to be in the public domain; HathiTrust is hoping to expand this review into non-US works. He argued that pushing for orphan works legislation may not be the best approach; HathiTrust is currently working to establish best practices for orphan identification and for publicizing titles identified as orphans.
Paola Mazzucchi (ARROW) gave an overview of ARROW’s process for rights information management. She emphasized the need to involve the “entire value chain”—libraries, authors, publishers, publisher and author associations, books in print organizations, and collective rights management organizations—in the process. She also stressed the importance of bridging various kinds of gaps: cultural gaps among stakeholders, interoperability gaps among data sources, cost-benefit gaps in the search for rights holders, and content gaps in digital library collections.
Lucie Guibault (Institute for Information Law, University of Amsterdam) pointed out that libraries, museums, and archives have different perspectives and levels of familiarity and capability with respect to metadata. She suggested that extended collective licensing, where collecting societies negotiate rights agreements that extend to rights holders who are not members of these societies, might be the best solution to some of the challenges faced by Europeana and others.
Urs Gasser (Berkman Center for Internet & Society) noted that legal interoperability is particularly challenging for libraries because it involves not only copyright law but also the inconsistency and opacity of private contracts with content providers. These issues are compounded by the challenge of negotiating through multiple jurisdictions to operate across borders. He emphasized the need for transparency and the importance of collective action power in ensuring interoperability.
Segment V: Interoperability for mining and research: full text
The last segment of the workshop focused on research applications for the data managed by digital libraries.
Repke de Vries (Heritage of the People’s Europe) focused on multilinguality: situations in which a researcher who reads one language wants to access a work published in another language. He also touched on related applications such as handwriting and speech recognition.
Robert McDonald (Indiana University) described the work of the HathiTrust Research Center (HTRC), which will provide access to a comprehensive body of published works for scholarship and education and for computational research purposes. The HTRC is building themed collections and is sharing ingestion and replication mechanisms with the HathiTrust.
Greg Crane (Tufts University) introduced several academic use cases for the DPLA, one of which was the idea that anyone watching a Discovery Channel or History Channel show should be able to take any content in the broadcast and trace the evidence behind it. He emphasized the need for the DPLA to connect with Europeana and other initiatives in order to encompass not just the modern world but also global history.
In discussion, participants debated the best approach to a global digital library system. Many felt the DPLA should be one point in a global network composed of a wide range of projects and initiatives, including existing language engineering communities, research organizations, and collections.
Summary and Key Questions
The workshop concluded with a session on key points drawn from the presentations and discussion.
Projects of the DPLA and Europeana could interoperate in many ways
Nearly every presentation at the workshop offered relevant insights for the DPLA, from the process ARROW follows to identify rights and rights holders to the ways in which Europeana is attempting to automate identification of works that are in the public domain. These potential points of connection raise the question of whether there is a specific project on which the DPLA and Europeana could collaborate, and if so, what that project might be.
The DPLA should explore incentives for participation for content contributors
Presentations highlighted the need to work with the underlying cultural heritage organizations that will contribute content to ensure that their participation benefits them as much as it benefits the DPLA. Participants noted that many of these organizations are approaching these sorts of activities with risk management as their primary concern. In thinking about licensing, the DPLA should be sensitive to this multiplicity of perspectives while emphasizing the potential of the connections that can grow from linked data and metadata.
Open, rich, interoperable linked data is crucial for the DPLA
Throughout the workshop, participants emphasized the need for open, rich, interoperable linked data to be at the core of the DPLA’s efforts. Included in this principle is the need to formulate policies across the DPLA and Europeana to ensure interoperability; the goal should be to reach for the most workable, rather than the highest, level of interoperability possible.
The DPLA should or could:
- provide persistent, reliable long-term links to information (i.e., serve a quasi-preservation function);
- support other open data projects;
- support the work of language translation projects, including text to speech efforts;
- support use cases around teaching and learning; and
- support existing large text, “big humanities,” and “small humanities” projects around the world.
The DPLA might support an open access-type movement for open (meta)data
Participants suggested that the DPLA, Europeana, and/or other similar institutions might consider making a claim for open data standards in the way that other institutions have for open access.
Next Steps
Beta Sprint
On May 20, 2011, the DPLA Steering Committee announced a Beta Sprint to surface innovations that could play a part in the building of a digital public library. The Beta Sprint seeks ideas, models, prototypes, technical tools, user interfaces, etc.—put forth as a written statement, a visual display, code, or a combination of forms—that demonstrate how the DPLA might index and provide access to a wide range of broadly distributed content. The Beta Sprint also encourages development of submissions that suggest alternative designs or that focus on particular parts of the system, rather than on the DPLA as a whole.
A review panel appointed by the Steering Committee and composed of experts in the fields of library science, information management, and computer science will review Beta Sprint submissions in early September. Creators of the most promising betas will be invited to present their ideas to interested stakeholders and community members during a public meeting in Washington, DC.
More information is available at http://blogs.law.harvard.edu/dpla/.
Workstreams
The work of the DPLA over the next two years will take place largely within six, or possibly seven, workstreams:
- Audience and Participation
This workstream will collaborate with each of the other workstreams to ensure that decisions are being made that will best support the current and future needs of the broadest possible user group. Specifically, it will examine models to establish and serve communities of users and stakeholders and to define the privileges and benefits the DPLA will offer to them.
- Content and Scope
This workstream will make recommendations for a collection development policy for the DPLA. One primary goal is to begin to identify and articulate the criteria for including materials in a proposed DPLA. This workstream will also confront questions regarding management of and access to distributed materials.
- Financial/Business Models
This workstream will make recommendations for a sustainable business plan for the DPLA.
- Governance
This workstream will make recommendations for a system of decision making and management for the DPLA. The DPLA must be as broad, open, and non-partisan as possible.
- Legal Issues
This workstream will make recommendations regarding how to approach and influence the legal and copyright environment in order to support equitable knowledge distribution in a digital world.
- Technical Aspects
This workstream will explore the desired architecture for the DPLA and will make recommendations regarding technology to be used for its development and to build or facilitate building the discovery environment.
- (Research Uses)
This potential workstream, closely related to the Audience and Participation workstream, will make recommendations regarding how the DPLA might support various forms of computational research, teaching functions, both large- and small-scale humanities projects, etc.
Plenary Meetings
These workstreams will periodically come together in a series of plenary meetings, to be held every six months beginning in the fall of 2011. These meetings will provide an avenue for the workstreams to share their ongoing efforts with each other and with key stakeholders and for stakeholders to give feedback.