Validating ORE From the Command-line

I’ve been periodically poking at getting Linked Data/RDF views hooked into the World Digital Library web application, following Ed Summers’s lead from his work on Chronicling America. The RDF views also use the OAI-ORE vocabulary to express aggregations – in WDL, an item is an aggregation of its constituent files. The goal is to provide a semantically rich and holistic representation of a WDL item (identifier, constituent files, metadata, translations, and so on).

The ORE format is new to me, so it’s hard to say whether the output of my dev branch is valid ORE or not. Plus I’m a sucker for validators. Turns out Rob Sanderson has developed a Python library for validating ORE, and this little snippet is what I’ve been using to check my output. I put less effort into making it readable than into banging out something functional so I can meet deadlines, so mea culpa and all that. But without further hemming and hawing, the code:

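# validate.py
# Usage: python validate.py {URL}
import sys
from foresite import *

# Parse the Resource Map found at the URL given on the command line
rem = RdfLibParser().parse(ReMDocument(sys.argv[1]))
aggr = rem.aggregation
# Re-serialize the aggregation as N3 and print the result
n3 = RdfLibSerializer('n3')
rem2 = aggr.register_serialization(n3)
print(rem2.get_serialization(n3).data)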

Most of this code is naively copied and pasted from Rob’s excellent Foresite documentation.

I invoke it thusly: python validate.py {URL}

And the output:

@prefix _27: .
@prefix _28: .
@prefix _29: .
@prefix bibo: .
@prefix dc: .
@prefix dcterms: .
@prefix ore: .
@prefix rdf: .
@prefix rdfs: .
@prefix rdfs1: .

 _28:ResourceMap a ore:ResourceMap;
     dc:format "text/rdf+n3";
     dcterms:created "2009-07-31T14:23:31Z";
     dcterms:modified "2009-07-31T14:23:31Z";
     ore:describes _29:id. 

 _29:id a bibo:Image,
         ore:Aggregation;
     dcterms:DDC "973";
     dcterms:alternative "Antietam, Maryland. Allan Pinkerton, President Lincoln, and Major General John A. McClernand"@en;
     dcterms:created "1862年10月3日"@zh,
         "3 de octubre de 1862"@es,
         "3 de outubro de 1862"@pt,
         "3 octobre 1862"@fr,
         "3 октября 1862 года"@ru,
         "October 3, 1862"@en,
         " ٣ آكتوبر، ١٨٦٢"@ar;
     dcterms:creator "Gardner, Alexander"@en,
         "Gardner, Alexander"@es,
         "Gardner, Alexander"@fr,
         "Gardner, Alexander"@pt,
         "Гарднер, Александр"@ru,
         "جاردنر, أليكسندر"@ar,
         "加德纳, 亚历山大"@zh;
... (and so on and so forth)
     dcterms:title "Antietam, Maryland. Allan Pinkerton, President Lincoln, and Major General John A. McClernand: Another View"@en,
         "Antietam, Maryland. Allan Pinkerton, el Presidente Lincoln y el General Principal John A. McClernand: Otra visión"@es,
         "Antietam, Maryland. Allan Pinkerton, le président Lincoln et le général-major John A. McClernand: Autre vue"@fr,
         "Antietam, Maryland. Allan Pinkerton,  Presidente Lincoln e Major-General John A. McClernand: Outra Vista"@pt,
         "Антитэм, штат Мэриленд. Аллан Пинкертон, президент Линкольн и генерал-майор Джон А. Макклернанд: Другой снимок"@ru,
         "أنتينام، ميريلاند ألان بينكرتون، الرئيس لينكولن، واللواء جون أ. ماكليرناند: منظر آخر"@ar,
         "安蒂特姆,马里兰州 艾伦·平克顿、林肯总统和少将约翰·A ·马克克拉南: 另一个视角"@zh;
     ore:aggregates ,
         ;
     ore:isDescribedBy ;
     rdfs:seeAlso . 

  a _27:FileDataObject;
     dcterms:format "image/gif";
     _27:fileSize "34531"^^. 

  a _27:FileDataObject;
     dcterms:format "image/tiff";
     _27:fileSize "1301614"^^. 

 ore:Aggregation rdfs1:isDefinedBy ;
     rdfs1:label "Aggregation". 

 ore:ResourceMap rdfs1:isDefinedBy ;
     rdfs1:label "ResourceMap". 

You might pick up on some warts I have yet to fix, but there you go.

A Digital Object Defined

What happens to a digital object defined? ((Inspired by Langston Hughes’s “A Dream Deferred” and a spirited conversation in the office today.))


Does its identifier dry up
like a raisin in the sun?
Or its relationships fester like a sore–
And then run?
Do its bits rot like meat?
Or become overwritten–
like some throw-away sheet?


Maybe its metadata just sags
like a heavy load.


Or does it fade into code?

WDL Metadata Mapping, and, Parsing TEI in Python

Context

Early on in the effort to develop the first public version of the World Digital Library web application, we developed a (non-public) Django-based cataloging application where Library of Congress catalogers could manage metadata for WDL items. Management in this sense includes creation of records, editing of records, versioning of edits, mapping of source records, and some light workflow for assignment of records to individual catalogers and for hooking into translation processes ((Catalogers cataloged stuff in the English language, but every metadata record needed to be translated into the other six U.N. languages: Spanish, Russian, French, Arabic, Chinese, and Portuguese.)).

I worked primarily on the source record mapping tools. They take a number of formats as input and are called by the cataloging application to map metadata from these formats into the WDL domain model. Several of these formats, though not all, are XML-based and thus easily dealt with in Python via the etree module in the lxml package.

Dan recently kicked off a new R&D project for evaluating (any) metadata against any number of metadata profiles, mapping into a generic data dictionary, the goal being to determine how feasible it would be to develop a toolset for aiding remediation of metadata across any number of digital collections. I have been working on this project with Dan, and got started by seeing how generalizable the WDL metadata mapping tools are. Turns out they’re fairly generalizable once you tweak the various format-specific mapping rules to map into the generic data dictionary model rather than the WDL model (around 15 elements, and somewhere between Dublin Core and MODS in terms of specificity but flatly structured like DC).
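To make the idea of format-specific mapping rules a little more concrete, here is a minimal sketch. It is not the WDL code: the rule table, the field names, the sample record, and the `map_record` function are all made up for illustration. It also assumes the standard library’s `xml.etree.ElementTree` rather than lxml, so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical rule table: target field in the flat data dictionary
# mapped to the qualified element name in the source format.
DC_RULES = {
    "title": "{%s}title" % DC_NS,
    "creator": "{%s}creator" % DC_NS,
    "date": "{%s}date" % DC_NS,
}


def map_record(xml_text, rules):
    """Map one XML record into a flat dict of field -> list of values."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, tag in rules.items():
        # Collect the non-empty text of every matching element
        values = [el.text.strip() for el in root.iter(tag)
                  if el.text and el.text.strip()]
        if values:
            record[field] = values
    return record


# Made-up sample record for illustration only
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Antietam, Maryland</dc:title>
  <dc:creator>Gardner, Alexander</dc:creator>
</record>"""

mapped = map_record(SAMPLE, DC_RULES)
print(mapped)
```

Under this sketch, retargeting from one model to another would mean swapping the rule table rather than touching the mapping function, which is roughly the kind of tweak described above.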

Some of the test data I am working with now, which has nothing to do with WDL, is SGML-based TEI 2 markup. The closest thing I worked with on WDL was TEI P5 for manuscript description, which is serialized in XML. Turns out my TEI mapping rules from before blew up on this TEI 2 stuff, as lxml.etree (naturally) wasn’t digging the non-XML input. I googled around a bit for how best to parse TEI (or any SGML) in Python and then discovered it’s actually simple as pie.

Code

If you’ve got the BeautifulSoup module installed ((And you are but one sudo easy_install BeautifulSoup away from that.)):

>>> from BeautifulSoup import BeautifulSoup
>>> tei = open('foo.sgm').read()
>>> BeautifulSoup(tei).findAll('title')[0].string
u'[Memorandum to Dr. Botkin]: a machine readable transcription.'

If not, the lxml.html module works too:

>>> from lxml import html
>>> h = html.parse(open('foo.sgm'))
>>> h.xpath('//title')[0].text
'[Memorandum to Dr. Botkin]: a machine readable transcription.'

Data

And here’s what the sample data looks like:

 %images;
]>




wpa0-07010101
[Memorandum to Dr. Botkin]: a machine readable transcription.
Life Histories from the Folklore Project, WPA Federal Writers' Project, 1936-1940; American Memory, Library of Congress.


Selected and converted.
American Memory, Library of Congress.


Washington, DC, 1994.

Preceding element provides place and date of transcription only.

For more information about this text and this American Memory collection, refer to accompanying matter.

U.S. Work Projects Administration, Federal Writers' Project (Folklore Project, Life Histories, 1936-39); Manuscript Division, Library of Congress. Copyright status not determined; refer to accompanying matter.

The National Digital Library Program at the Library of Congress makes digitized historical materials available for education and scholarship.

This transcription is intended to have an accuracy of 99.95 percent or greater and is not intended to reproduce the appearance of the original work. The accompanying images provide a facsimile of this work and represent the appearance of the original.

1994/03/15 2002/04/05
0001

Memorandum to Dr. Botkin from G. B. Roberts, May 26, 1941

Subject: Alabama Material

This material has not yet been accessioned and has only beeen been roughly classified as life histories, folklore, and miscellaneous data and copy save in the case of the 2 ex-slave items and the essay on Jesse Owens, each of which was recommended.

Total no. of items recommended: 3 (14 pp.) In progress

I2: Survey

[Series]

Near the end of my strawman post, I wrote:

The I2 repositories subgroup will be sending out its survey on identifier use cases in the coming week. It will be interesting to see if the requirements we have thus far identified still obtain in light of the data we collect from the survey.

We completed the survey late last week and began distributing it. Here’s what we sent out:

The NISO I2 Working Group is surveying repository managers to determine the current practices and needs of the repository community regarding institutional identifiers. We value your time and your input in the process to create a standard for a new institutional identifier. We hope that you will complete the survey which should take less than 15 minutes. The survey will remain open through Monday, July 6th.

Here is a link to the survey: http://www.surveymonkey.com/s.aspx?sm=RGQgZ3090DVrb3kFzr3P3Q_3d_3d

Please feel free to share this message with other interested parties.

First we used Survey Monkey to send the survey link to approximately one hundred repository managers that the subgroup identified. Our process for identifying repository managers involved pulling together a list of prominent repositories from subgroup members, and then gathering more from OpenDOAR, “an authoritative directory of academic open access repositories.” Then subgroup members were encouraged to share the survey link with colleagues, and post it far and wide via blogs, listservs, and tweets. The listservs we targeted were: JISC-REPOSITORIES, metadataLibrarians, digital-curation, SPARC-IR, ir-net, REPOMAN-L, PALINET-IR-L, dspace-general, fedora-commons-users, DC-IDENTIFIERS, and code4lib.

I’ve already received a few responses and have gotten useful feedback. Two of the hardest questions to answer so far have been: “What is an institutional identifier?” and “What is a repository?”

Institutional identifier

An institutional identifier is defined as a symbol or code that uniquely identifies an institution. Domain-specific examples of existing identifiers include SAN, IPEDS, GLN, MARC Org Code, and ISIL. Another example might be a Handle prefix or ARK name authority assigning number.

Repository

Institutional repositories and subject repositories like arxiv.org are clearly ‘repositories’, but beyond that it is a somewhat ill-defined term. One might look to the Kahn-Wilensky architecture, or the OAIS reference model (PDF), or even Wikipedia for definitions, but it’s not clear that even the authorities agree on what constitutes a repository.

It’s a system. It’s network-accessible and typically has a web interface of some sort. Files and groups of files sometimes known as objects tend to be deposited in them, perhaps for some combination of management, access, or preservation. Many run Fedora, DSpace, or ePrints, and factor heavily in scholarly communication. Some are document-centric. Some will accept anything. To some, a learning management system may be a repo. To others, a content management system may fit.

My background is in academia so my own definition is somewhat based in that context, but I wouldn’t say the term is necessarily limited to that context. There are other NISO I2 scenarios for library workflows and electronic resources, so it’s safe to assume that repository does not mean ILS or OPAC or ERP system. My hope is that folks have their own working definitions of the term and can decide for themselves what it means.

We’ve given folks a little over two weeks to respond to the survey, so the constant I2 drum-beating will quiet down for a while around here. I am very interested in what sorts of responses we get from the survey. Fun times!

Oh, and perhaps it goes without saying, but if you’re a repository owner, manager, expert, developer, or stakeholder with an interest in identifiers, please feel free to take the survey!

I2: Strawman

[Series]

In the prior I2 post, I wrote about the requirements the repositories subgroup has come up with for an institutional identifier standard (with the hope that our findings re: repositories could be generalized to other scenarios).

My strawman proposal of sorts is to explore how well linked data patterns fit this problem space. Linked data, briefly, is a way to expose and link data on the web in a more semantically meaningful way, and is often summarized using the four principles put forward by Tim Berners-Lee:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs so that they can discover more things.

That’s the crux of it.  Linked data takes well-known patterns on the web (linking, dereferencing, etc.) and applies them to data, which in this case could be metadata for identifying institutions.
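As a rough illustration of principles 2 and 3, here is what “looking up” an identifier might look like from the client side. This is only a sketch under assumptions: the URI is a made-up placeholder, the `build_lookup` helper is hypothetical, and asking for Turtle via the `Accept` header is just one plausible way to request an RDF representation. The request is built but deliberately not sent.

```python
from urllib.request import Request


def build_lookup(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET asking for an RDF view of uri."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


# Hypothetical institutional identifier, for illustration only
req = build_lookup("http://example.org/institution/123")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual dereference; whatever comes back is the “useful information” of the third principle.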

Let’s examine each of the requirements and the applicability of linked data thereto.

  1. Should be agnostic to type of institution, e.g., libraries, museums, personal collections, historical societies: The web is already agnostic to type of institution.  HTTP URIs do not favor one type of institution over another.
  2. Should handle varying institutional granularity, e.g., institution-level, campus-level, division-level, unit-level: HTTP URIs are flexible in this regard.  Hierarchy, should one wish it to be surfaced in the identifier, may be encoded in either a DNS hostname or the path appended to the DNS name.  One can imagine a URI like “http://department.division.institution.tld/unit/subunit” or “http://institution.tld/campus/office/individual”.

    Hierarchy needn’t be surfaced in the identifier if one favors opacity, in which case “http://registry.tld/xnjsdasd” would suffice as an identifier; the hierarchy may instead be reflected entirely in the (RDF) representation returned by dereferencing the URI.
  3. Should handle linking among institutions and subordinate units: Linked data handles linking via well-known HTTP mechanisms, referenced in the fourth principle of linked data.  Unlike the HTTP link, which has limited semantics, linked data links are semantically rich and extensible.
  4. Should express different sorts of relationships among these institutions and units: The “useful information” in the third principle of linked data is typically provided by an RDF representation, which is itself a list of assertions.  These assertions, or triples, consist of subjects, predicates, and objects.  The ability to express the relationships in this requirement is limited only by the availability of vocabularies that contain sets of predicates and classes for subjects and objects.  Think of the predicates as elements defined within a metadata standard, e.g., Dublin Core “creator”, MODS “relatedItem”, and so forth.  Vocabularies that contain these predicates and classes are growing and evolving daily, and should there not be a vocabulary that contains the relationship one wishes to express, it is fairly easy to create a custom vocabulary.

    The ability to mix and match vocabularies provides an expressiveness that is often not found in document-based metadata formats and the flexibility to express radically different relationships on a per-industry or per-institution basis.  This latter point is important as the I2 group has identified both core metadata elements for identifying institutions of different types and additional elements for specific types of institutions.  Why re-invent a new metadata format or schema when all one needs to express may already be contained in others?
  5. Should relate to existing relevant identifiers and registries: Same as requirement #4.  Linked data is all about expressing relationships between things, e.g., institutions, identifiers, registries, etc.
  6. Should be globally unique: HTTP URIs are guaranteed to be globally unique by virtue of the distributed DNS system and hierarchical naming within each HTTP service.
  7. Should be actionable: HTTP URIs provide dereferenceability/actionability via the well-known HTTP protocol.
  8. Should enable retrieval of metadata sufficient to identify the institution, which may vary widely by institution: HTTP URIs are actionable per requirement #7 and the metadata returned is flexible per requirement #4.
  9. Should accommodate changes as institutions come and go and re-organize and be able to relate defunct institutions to new ones: Linked data patterns provide for redirecting from defunct representations (institutional identifiers) to new ones via HTTP redirects.  One may also add assertions to institutional metadata such as owl:sameAs, for instance, which says that the institution identified by the given URI is the same as another institution identified by another URI.
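To make requirements 3, 4, and 9 a bit more tangible, here is a small sketch that treats assertions as plain (subject, predicate, object) tuples. Everything in it is an assumption for illustration: the institution URIs are placeholders, the choice of dcterms:isPartOf for hierarchy is mine, and the `ancestors` and `resolve` helpers are hypothetical. It also reads owl:sameAs in one direction only, from defunct URI to current one, which is a simplification of what the predicate formally means.

```python
# Predicate URIs from existing vocabularies
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
PART_OF = "http://purl.org/dc/terms/isPartOf"

# Placeholder assertions about made-up institutions
TRIPLES = [
    ("http://institution.tld/campus/office", PART_OF, "http://institution.tld/campus"),
    ("http://institution.tld/campus", PART_OF, "http://institution.tld/"),
    ("http://old-institution.tld/", OWL_SAME_AS, "http://institution.tld/"),
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def ancestors(triples, uri):
    """Follow isPartOf links upward, returning the chain of parent units."""
    chain, seen = [], {uri}
    while True:
        parents = objects(triples, uri, PART_OF)
        if not parents or parents[0] in seen:  # stop at the top or on a cycle
            return chain
        uri = parents[0]
        seen.add(uri)
        chain.append(uri)


def resolve(triples, uri):
    """Follow owl:sameAs from a defunct URI to the one currently in use."""
    seen = {uri}
    while True:
        targets = objects(triples, uri, OWL_SAME_AS)
        if not targets or targets[0] in seen:  # nothing newer, or a cycle
            return uri
        uri = targets[0]
        seen.add(uri)


print(ancestors(TRIPLES, "http://institution.tld/campus/office"))
print(resolve(TRIPLES, "http://old-institution.tld/"))
```

A real implementation would presumably use an RDF library and a proper graph store; the point here is only that hierarchy and succession both reduce to following links between URIs.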

This seems like a compelling path to follow for the I2 standard.

The I2 repositories subgroup will be sending out its survey on identifier use cases in the coming week.  It will be interesting to see if the requirements we have thus far identified still obtain in light of the data we collect from the survey.  If so, I would like to explore the idea of linked data for institutional identifiers a bit more.

I2: Requirements

[Series]

The I2 IR scenario subgroup approached the issue of institutional identifiers in repositories by first brainstorming about the various issues, problems, and sticking points that make identifiers in this space (and elsewhere) such a complex topic. Folks on the subgroup are repository managers or are otherwise involved with or knowledgeable about the repository space, so the brainstorming exercise yielded a good number of concerns.

The purpose of the exercise was to enumerate concerns and issues that could inform a draft survey to be administered to repository managers and experts around the globe in different organizational contexts: libraries, subject disciplines, archives, historical societies, etc. The purpose of the survey is to get an idea of the use cases and constraints around institutional identifiers in these different repository contexts, the assumption being that we ought to have requirements grounded in real world usage before we go off building a standard.

I will note here that the subgroup has worked up a draft survey that has just recently been reviewed by a small group of folks who know about survey design, and we hope to administer the survey to the aforementioned Reporati this week ((We will also x-post to repo-related mailing lists, and some of us may blog or tweet about it. My inclination is to cast as wide a net as possible so as not to miss important use cases. We can always scope things out later on, but it’s useful to be inclusive at this point lest our own assumptions carry the group forward.)). Which is to say that I don’t yet have a strong grasp of the use cases out there in the wild, and this series should be construed as my own premature cognitive fumblings. But let’s assume for now that what we learn from the survey results matches our initial brainstorming exercise.

Here is a slightly modified and boiled down version of the concerns and issues the subgroup came up with for a potential institutional identifier standard, which resembles a set of minimum requirements:

  1. Should be agnostic to type of institution, e.g., libraries, museums, personal collections, historical societies
  2. Should handle varying institutional granularity, e.g., institution-level, campus-level, division-level, unit-level
  3. Should handle linking among institutions and subordinate units
  4. Should express different sorts of relationships among these institutions and units
  5. Should relate to existing relevant identifiers and registries
  6. Should be globally unique
  7. Should be actionable
  8. Should enable retrieval of metadata sufficient to identify the institution, which may vary widely by institution
  9. Should accommodate changes as institutions come and go and re-organize and be able to relate defunct institutions to new ones

I doubt the list is exhaustive; I am almost certain we will uncover all sorts of tangly and esoteric use cases that add requirements. I expect it. Why else would we be gathering to discuss the need for an institutional identifier if it were a solved problem or a simple one? ((The cynical among you might have interesting answers to this question.))

Nevertheless, looking at the above list, the task we’ve taken on starts to feel less onerous. And thinking about identifier systems constrained by the list of concerns, the mind starts to cook up all sorts of possible solutions. I’ll share one in the next post in this series, a strawman proposal of sorts, and how it addresses each of these requirements.

I2: Background

[Series]

This is the first in a series of posts about institutional identifiers ((I offer that very tentatively, knowing what a spectacular failure my last attempt at a series was.)).

In my last post, I alluded to some documentation that I’ve written. That was somewhat misleading, which will soon be apparent, but I liked the parallel construction I had going, and I am but a slave to orderliness.

For about the past six months, I have been working with a NISO group looking into how institutions are identified within information systems:

The I2 (Institutional Identifiers – pronounced “I 2”) working group will build on work from the Journal Supply Chain Efficiency Improvement Pilot (http://www.journalsupplychain.com/), which demonstrated the improved efficiencies of using an institutional identifier in the journal supply chain. The NISO working group will develop a standard for an institutional identifier that can be implemented in all library and publishing environments. The standard will include definition of the metadata required to be collected with the identifier and what uses can be made of that metadata. …

The I2 group is split into a few subgroups which have been charged with looking into how institutional identifiers are used in particular scenarios. These scenarios are e-resources, repositories and e-learning systems, and library resource workflows. The scenario names pain me a bit, but so be it; this is our industry, and there are bigger windmills to tilt at.

I am currently co-chairing the subgroup looking at repositories and e-learning, and apparently I am its “tech lead.” I don’t want to get caught up on names and roles and titles, though; this series isn’t about those at all. I’m just setting the scene and explaining why my head’s in this space and laying bare my stake in the issue.

The remainder of this series will provide a bit more detail on the issues around institutional identifiers, share how the repository subgroup is grappling with identifier issues and engaging the repository community to assess needs, propose an approach for an identifier system that may meet said needs, and explore what seems to be the thorniest issue ((Hint: management. I know, “duh,” right?)).

State of the Me

Has it really been two months? Why, yes, it has. Oh me, oh my. I have tried to stick somewhat loosely to a schedule of writing here once a month ((Here I extend my hand and then imagine you, whoever you may be, smacking it ever so gently)), but alas, April came and went and I simply made no time to write.

That’s not entirely true; I did plenty of writing:

I wrote code. After a year of working on the World Digital Library project at $MPOW, we went live on April 21st. The last few weeks were very busy for the development team, but I did find a few moments to breathe and blink.

I wrote microblog updates. After months of trying to figure out what microblogging is all about ((Wondered: Is it IM? Status updates? Blogging? And how is it related to these? Concluded: it’s a little of each, and somehow it fits my status/vanity/sharing needs perfectly.)), it found its way into my daily routine. When time is short or thoughts arise fast and fuzzy, microblogging is a useful public scratchpad.

I wrote slides. The kind folks over at the College and University Section of the New Jersey Library Association invited me to be a panelist at the 2009 NJLA conference. The panel addressed recentish developments in open source integrated library systems. I spoke about the Evergreen ILS ((Hat tip to Equinox Software Inc.’s Karen G. Schneider for her kind assistance.)) and my co-panelists spoke about Koha and the Open Library Environment Project.

And, ever the dutiful technologist, I wrote documentation. And that will be the subject of my next post.

Ada Lovelace Day

I confess: prior to today, I had never heard of Ada Lovelace. A number of bloggers whom I follow wrote about Ms. Lovelace today, which is apparently Ada Lovelace Day: “an international day of blogging to draw attention to women excelling in technology.”

Inspired by their words, I thought I would say my piece as well. And so, this being the first Ada Lovelace Day, I’d like to celebrate the woman who is most responsible for my own love of libraries ((I realize Ada Lovelace Day is about technology, not about libraries, but I hope you’ll give me some slack.)) and technology: my mother, Diane. My mother is neither a technologist nor a mathematician, and I’m pretty sure she’s not comfortable in front of a Python interpreter. She was an employee at Rutgers University’s Alexander Library during their first automation efforts in the ’70s, partly while I was in utero. I like to think that library automation entered my bloodstream through osmosis back in 1973 and I’ve been working at this, well, not quite since then, but long enough. More than that, she got me hooked on libraries many years ago through frequent trips to neighborhood libraries and also by including me, in snot-nosed kid form, in her genealogical research that took us to some rural Maryland libraries and, yes, the Library of Congress. This thirst for knowledge (not to mention her constant and unwavering support for me despite the wacky paths I’ve chosen over the years) is why I celebrate my mother today.

Rutgers SCILS: What’s in a Name?

Former colleague Trevor Dawes has written a thorough piece about a name change proposed by the faculty of Rutgers’ School of Communication, Information and Library Studies (SCILS). They have voted on and approved a new name, School of Communication and Information, and it is now awaiting approval from the Board of Governors.

Trevor received e-mail from a current SCILS faculty member after getting involved in a discussion of the name change on a listserv. I find part of that e-mail ((Taken out of context, true.)), specifically the rationale for the name change, absolutely puzzling:

We just have so many programs now – we can’t possibly cover all of them in our school’s name. School of Communication and Information is something of a compromise name, but it does encompass all our departments and programs in the school.

So in order to cover more programs, the name of the school ought to communicate less? Does dropping “Library Studies” somehow represent Journalism, Media Studies, and Informatics students more?

I fail to see how removing “Library Studies” makes the name of the school more meaningful. Why not follow this rationale to its logical conclusion, then, and shorten the name to School of Information? Or iSchool? Or how about “School?” Yes, that’s it, “School!” Then all the departments and programs are equally well-represented. Huzzah, faculty!

I should be clear about my objection. I don’t mind SCILS becoming an iSchool. In fact, I think my education there could have benefited from a more iSchoolish curriculum. But any problems with the school then were not related to the name, and I doubt they are now. What I object to is the oddball rationale for the name change, and the notion that in order to effect change and improve the school, well, clearly a change in name will do the trick! It’s putting the cart before the horse, especially when the MLIS program lacks a core curriculum ((An opportunity for real change, though I will admit that there are good arguments against having one.)). This is change in name only and that is perhaps a missed opportunity.