Going live with Harvard’s catalog

[Note: Dec. 3, 2013: We’ve updated the links on this page and a bit of the text to reflect the current reality about where things are.]

We’re very pleased not only that Harvard University has decided to make virtually its entire catalog of bibliographic records available for bulk download under a Creative Commons 0 (public domain) license, but that we’re providing programmatic access to those records in their entirety through the LibraryCloud API. That’s over 12 million full records in the MARC21 format.

It’s live now. Begin with the API documentation (which includes some legal usage notes) here. If you instead want to do a bulk download, please go here.

We are using a two-tier schema. We have a simplified core which combines and extends Dublin Core and Schema.org. It works across data sets as well as we can manage. But we are preserving all the metadata that doesn’t fit into that core. You can access it if you know the schema. In the case of Harvard’s data, it’s MARC21, so the keys are well-known. You can retrieve entire MARC21 records if that’s where your bliss is, or you can grab the fields you want.
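To make the two-tier idea concrete, here is a minimal Python sketch. The record shape, the `core` and `raw` keys, the sample values, and the helper `get_field` are all illustrative assumptions, not the LibraryCloud API's actual response format. The only things taken from MARC21 itself are the tags: 245 is the title statement and 100 is the personal-name main entry.

```python
# Hypothetical record shape: a simplified core plus the untouched source metadata.
record = {
    "core": {"title": "Moby-Dick", "creator": "Melville, Herman"},
    "raw": {
        "schema": "MARC21",
        "fields": {
            "245": {"a": "Moby-Dick ;", "b": "or, The whale /"},
            "100": {"a": "Melville, Herman,", "d": "1819-1891."},
        },
    },
}


def get_field(rec, key):
    """Look up a key in the core tier first, then fall back to the raw tier.

    A raw key is written as "<tag>" or "<tag>.<subfield>", e.g. "245.a".
    Returns None when the tag (or subfield) is not present.
    """
    if key in rec["core"]:
        return rec["core"][key]
    tag, _, subfield = key.partition(".")
    field = rec["raw"]["fields"].get(tag)
    if field is None:
        return None
    # With no subfield requested, hand back the whole field.
    return field.get(subfield) if subfield else field
```

With this shape, `get_field(record, "title")` reads from the core tier, while `get_field(record, "100.d")` reaches into the preserved MARC21 data, which is the "you can access it if you know the schema" case.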

The API is an early alpha. Please let us know about problems you encounter.

We’ve also capped access at 3 queries per second from a single IP address. We are feeling our way here, and we think that that’s probably more than any app is going to need for now, unless it’s trying to absorb all the data through the API, in which case we repeat: Go bulk download it. It’s all there, and we’ll all be much happier.
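If you want your app to stay under the cap on its own, a client-side throttle is one way to do it. The sketch below is an assumption about how a client might behave, not anything the API provides; the class name `Throttle` and its parameters are made up for illustration.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` calls start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Create one `Throttle(3)` per process and call its `wait()` method immediately before each request. Spacing calls evenly is stricter than the cap requires (it rules out short bursts), but it is simple and keeps a single client comfortably under 3 queries per second.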

Thank you, Harvard!

And please note and respect the statement of community norms, including the norm that attribution be given to those who are providing this information, including Harvard and also, importantly, the OCLC. Thank you.

List of Hackathon projects

Here are the notes SJ Klein took at the end of our first Hackathon on April 5 when we each described what we had built. The original notes from the session are here.

Corey – The main thing I did was write a really hackish ruby harvester, get json back and parse it into columns with dates, urls, data source, title, and record id. Dump that into a pipe-delimited file.
I played around with loading this into viewshare, a LOC project implementing a lot of Simile and Exhibit tools that can import data and visualize it.
I’m looking at attempts to plot the word ‘exploration’ on a timeline based on pub date, and a pie chart showing that 30% of results for ‘exploration’ come from NPR YouTube and Biodev Library.
I did the same thing with monkeys and turtles. There were more monkeys in LOC than in the biodev heritage library…
David – User-created book covers for works for which DPLA doesn’t have covers (which right now is all books). I put in a search term to get results from dpla, and pick one. It goes to flickr, taking the subject and description and author and title from dpla, mashing them together as a string, removing stopwords, separating strings by commas, sending them to flickr as tags, and getting the set of flickr images. Let the user choose which one she wants to use as a cover and put the title and author on the image.
Andromeda – Timeline! On the backend, python is querying dpla and taking the JSON it gets and munging it into this timeline tool, put out by Knight recently (http://timeline.verite.co/). It autogenerates a timeline that helps you see where things are in history. If it is a multimedia item, it will embed that by default. It doesn’t know how to handle npr stuff yet, but it is a fun way to see an idea evolve over time.
User data could be dumped into this script, but it’s not yet clear how to wire those two pages together. In theory you could put in your own search terms. https://github.com/thatandromeda/DPLA
Jason – I already showed what I was working on earlier. I just used the NPR data to embed an mp3 onto another page. Taking a blacklight app and using it… I did release the related gem, so that’s [progress].
David C. – Working on an html file on github. I worked on a javascript improvement to dpla search, so that on every keyup it comes back with results that match the keywords you’ve entered so far. You can change the searchtype from title to keyword. Just a jquery experiment!
Ralph – I’ve been trying to enhance the data in the db. So I tried a monkey search like Corey, and found an author David Lipsky. I copy that author and throw it at this little api for looking up names: this comes back and tells me there are 3 David Lipskys it knows about, and gives me control ids into various services that know about him. So we pick one that is most known: the Natl Lib of Australia, LOC, and DNB all know about him. We throw that up into worldcat and find out more about him.
This turns up things like related works, links to his WP article. I wanted to see if I could cook down the google refine API to query freebase and come out with better IDs all at once than by going through all of these services.
Dan – Working on Covered: with Brad, we refined how this works — it lets you stack up a set of criteria in one search. Say “monkey”. This lets you paginate through the result set. This pulls in covers from openlibrary; if I click on a result, this will do a flickr search for terms in the title. If it doesn’t find anything, it shows nothing.
If I want to refine this down, I can find subsets of the matches. This is all done clientside.
James and Nate – Working on a map mashup, showing how one can generate lists from dpla queries and place them on a map. Right now we have random locations in Boston; we picked 10 spots and 10 lists of books in those locations. These are the nearest local public libraries. It geolocates your position and finds nearby public libraries. One happens to be in the middle of a river… don’t mind that.
We can click on one of them, and you will see different books… these are live links to relevant media. Ideally this would be connected to a tool for making lists, and you could find out what summer reading lists people were making around you.
James – This morning we worked on adding the DPLA api to the set of apis that Zeega works with. Now you can ingest things from the DPLA api into Zeega, and expose it — to add extra metadata, geolocate it.
Reinhard and Ryan – I am amazed that Ryan was able to merge the things we were doing. “The world’s gnarliest merge in the past few minutes.” What we have here are several visualizations. You are seeing a treemap. This shows – with imperfect colors – several dimensions of data: all the subjects at the top, bigger boxes being ones with more items in the total search resultset.
Colors, from white to green, show how many were present in the 20 items we actually retrieved; just to show how you could have multiple dimensions of data. At the bottom there’s a timeline bargraph: results by date from the 1800s to the current time. No labels on it yet.
And below that there is a tag cloud, another way of representing this data. We built all of this purely from results from the facets of the API response. I imagine we could make a visual way to drill down into results this way. For instance, you could click on one of these vis’s and requery.
We got many results for people, who have birth and death dates. I looked at all creators, averaged their birth/death dates, to get a single value here.
Jay – I created a really simple python wrapper for the DPLA API – based on the Solr api someone else wrote. It is much simpler – it lets you query in different ways, and define facets and sort parameters. It is on github as dplapy. http://github.com/lbjay/dplapy
Matt – I showed something this morning which I’ll show again v. quickly — pulling a list of trending twitter topics and seeing what dpla matches we can get. Not too useful, but here are things trending in the last 5 minutes. You have to click around a bit to find matches – but there are things we match on for say Stenson. Monkey (As a failsafe, just to get a working link)
I can click on one of the matches and pull up a page on http://api.dp.la
This morning quickly I took Dan’s Covered app, and included an “add to shlv.me” link which should take you to a shlv.me page where you can fill in various required fields and add that dp.la item to my own shelf here. That’s it!
Paul – OCLC has an XID service, so during ingestion when an ISBN shows up, they can call out, get related items, and create a Work record (in FRBR speak) and then find all matching records in WorldCat.
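For readers who want to see one of these hacks in miniature, the tag-building step in David’s cover tool (mash the fields together, remove stopwords, separate by commas) can be sketched in a few lines of Python. This is a reconstruction from the notes above, not David’s code: the function name `build_tags`, the field names, and the tiny stopword list are assumptions, and dropping duplicate words is an extra step the notes do not mention.

```python
import re

# A tiny illustrative stopword list; a real tool would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "by", "with"}


def build_tags(item, fields=("subject", "description", "author", "title")):
    """Mash the chosen fields together, drop stopwords, return comma-separated tags."""
    # Missing fields contribute an empty string.
    text = " ".join(str(item.get(f, "")) for f in fields)
    words = re.findall(r"[a-z0-9']+", text.lower())
    tags = []
    for word in words:
        # Skip stopwords and words already seen (deduplication is our addition).
        if word not in STOPWORDS and word not in tags:
            tags.append(word)
    return ",".join(tags)
```

The resulting string is what would be handed to the image search as its tag list; the search call itself is left out here because the notes do not say how it was made.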

Hack day!

Yesterday we had our first Hackathon. It was a great day and insanely useful to us.

About a dozen developers showed up. After a very brief overview and introductions, they set to work. Scroll to the bottom of this PiratePad to see the list of what they worked on.

Thanks to Martin Kalfatovic and Chris Freeland, co-chairs of the Technical Workstream, for having the idea and for putting it together, along with the Secretariat and fabulous Berkman staff.


Andromeda Yelton has written an important and moving post about being quite literally the only woman at the table. We must do better.

Report from CNI session

Matt, Paul, and I went to CNI in Baltimore and held a discussion session about the platform. The three of us described the aims and purposes, the proposed schema strategy, ingestion processes, the API, and our initial thoughts about linked data. The discussion with the group was lively and helpful.

I scribbled notes as people were talking. Here’s some of what was discussed.

A bunch of comments clustered around Linked Open Data. We are proposing ingesting RDF and making all of the platform’s metadata accessible as triples. But, why aren’t we using a triple store and more fully embracing Linked Data? Why are we using schemas instead of ontologies? Are we going to publish RDF and not just consume it, and does the world need Yet Another RDF Publisher (YARDFP…nah, probably not going to catch on :)? We replied that we are very open to discussing this, of course. Going native with RDF would (it seems) allow us and others to do the sort of metadata enhancement (clustering, semantic enhancements, rich relationships among works, etc.) that we think is one of the ways the DPLA platform can have distinctive value. But a big part of the answer is that the platform is supposed to go live 12 months from now, so we didn’t feel we had time to develop (or, more exactly, adopt and adapt) an ontology. I also replied that we wanted to keep it simple for institutions that want to make their metadata available to the DPLA, but Dean Krafft of Cornell correctly pointed out that we could still let organizations do simple mappings, but could let them go hog wild with the ontology if that’s what they want to do. We take this as an open topic. Well, we take everything as an open topic. But we’d love to hear more from you about this.
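As a rough picture of what “accessible as triples” could mean for a schema-based record, here is a small Python sketch that flattens a record into (subject, predicate, object) tuples. It is only an illustration: the function `record_to_triples`, the example item URI, and the choice to hang every key off the Dublin Core terms namespace are assumptions, not the platform’s design.

```python
# The Dublin Core terms namespace is real; using it for every key is an assumption.
DCTERMS = "http://purl.org/dc/terms/"


def record_to_triples(uri, record):
    """Turn a flat metadata record into a list of (subject, predicate, object) tuples."""
    triples = []
    for key, value in record.items():
        # A multi-valued field yields one triple per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((uri, DCTERMS + key, v))
    return triples
```

For example, calling it with the made-up URI `http://example.org/item/1` and a record holding one title and two subjects yields three triples, one per value. A real ontology would add typed relationships among works, which is exactly the part this flat mapping cannot express.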

Among the other questions:

Have we looked at the Networked Digital Library of Theses and Dissertations as a model? (Matt is very familiar with that project.)

How are we going to sync?

Why are we using Dublin Core instead of something that the search engines pay attention to? (We are using a mix of DC and Schema.org, and will allow developers to query the API using either set of terms when they overlap.)
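One way to let developers query with either vocabulary is to normalize overlapping terms to a single canonical key before running the query. The Python sketch below shows the idea; the alias table and the function `normalize_query` are hypothetical, and the particular term pairs are our own guess at plausible overlaps, not the platform’s actual mapping.

```python
# Illustrative mapping only; the platform's real term overlap is not specified here.
ALIASES = {
    "name": "title",            # Schema.org name          -> DC title
    "author": "creator",        # Schema.org author        -> DC creator
    "datePublished": "date",    # Schema.org datePublished -> DC date
    "about": "subject",         # Schema.org about         -> DC subject
}


def normalize_query(params):
    """Rewrite query parameters so overlapping terms share one canonical key."""
    out = {}
    for key, value in params.items():
        canonical = ALIASES.get(key, key)
        # Refuse queries that give two different values for the same concept.
        if canonical in out and out[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        out[canonical] = value
    return out
```

Under this sketch, a query written as `{"name": "Walden"}` and one written as `{"title": "Walden"}` end up identical, so the search layer only ever has to understand one set of keys.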

At the end, one of the attendees said that he still didn’t get what we’re trying to do, and asked what would count as a measure of success. I said that we’re building a platform + services aimed at developers so that they can build apps that take advantage of the cultural richness distributed across the nation’s online collections, local libraries, museums, archives, etc. Because it’s an open platform, the measure of success will be the number, diversity, and utility of the apps that are developed on it.

CNI overall was a rich event with a great set of people. Lots to learn.

Technical overview posted

Way too long since the last post. Sorry. We’ve been head down writing a technical overview/specification/scope for the prototype platform we’re building. The first rev is now up, as a pdf for your reading pleasure, and as a Google Doc for your commenting pleasure. (These links are on the wiki’s Technical Overview page.)

The doc includes a functional description, a scope section explaining what will be in Phase 1 (April 27) and beyond, and a set of scenarios.

Read and comment! We’re eager to know where we’re going wrong, what we haven’t thought about, better ways of doing what we’re proposing, and anything else.

Building the scope…

We’ve been quiet on the blog, and have also delayed the second build (which is all but done and will be really interesting, we think), because we’re heads down building a provisional scope document, tracing it out from strategy to functionality to a task list. We hope to have something in good enough shape to look at quite soon so that you all can debug it.

Discussion with BibServer/BibSoup

We had a really helpful discussion this morning with Richard Jones and Mark MacGillivray about their experiences with BibServer and BibSoup. Great projects. Among other useful points, they have steered us toward elasticsearch. We are currently using Solr. BibJSON also looks like a useful way to deliver bib data.

There are obviously ways in which BibSoup and the DPLA platform can interoperate. For example, the DPLA platform might be a useful source of metadata about objects in bibliographies at BibSoup, and BibSoup could be a rich source of metadata about the relations among items. And lots more.

It was a fun and fascinating discussion. We’re going to stay in touch.