Context

Early on in the effort to develop the first public version of the World Digital Library web application, we developed a (non-public) Django-based cataloging application where Library of Congress catalogers could manage metadata for WDL items. Management in this sense includes creation of records, editing of records, versioning of edits, mapping of source records, and some light workflow for assignment of records to individual catalogers and for hooking into translation processes ((Catalogers cataloged in English, but every metadata record needed to be translated into six other languages: Portuguese plus the other five official U.N. languages, namely Spanish, Russian, French, Arabic, and Chinese.)).

I worked primarily on the source record mapping tools. They take a number of formats as input and are called by the cataloging application to map metadata from these formats into the WDL domain model. Several of these formats, though not all, are XML-based and thus easily dealt with in Python via the etree module in the lxml package.

Dan recently kicked off a new R&D project for evaluating any metadata against any number of metadata profiles and mapping it into a generic data dictionary. The goal is to determine how feasible it would be to develop a toolset for aiding remediation of metadata across any number of digital collections. I have been working on this project with Dan, and got started by seeing how generalizable the WDL metadata mapping tools are. Turns out they’re fairly generalizable once you tweak the various format-specific mapping rules to map into the generic data dictionary model rather than the WDL model (around 15 elements, and somewhere between Dublin Core and MODS in terms of specificity but flatly structured like DC).
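To make "format-specific mapping rules" a bit more concrete, here is a minimal sketch of the general idea, not the actual WDL tooling. The rule table, the target element names, the `map_record` helper, and the sample record are all made up for illustration. It uses the standard library's `xml.etree.ElementTree` so it runs without lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: target element in a generic data dictionary
# -> element name (Clark notation) in the source format. None of these
# names come from the WDL tools; they only illustrate the shape of a rule set.
DC_NS = "{http://purl.org/dc/elements/1.1/}"
DC_RULES = {
    "title": DC_NS + "title",
    "creator": DC_NS + "creator",
    "date": DC_NS + "date",
}


def map_record(xml_text, rules):
    """Apply one format's rules to a record and return a flat dict of lists."""
    root = ET.fromstring(xml_text)
    mapped = {}
    for target, source_tag in rules.items():
        # Collect the non-empty text of every matching element.
        values = [el.text.strip() for el in root.iter(source_tag)
                  if el.text and el.text.strip()]
        if values:
            mapped[target] = values
    return mapped


# Made-up Dublin Core-style record for demonstration purposes.
RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Memorandum to Dr. Botkin</dc:title>
  <dc:creator>G. B. Roberts</dc:creator>
</record>"""

print(map_record(RECORD, DC_RULES))
```

With rules kept as data like this, pointing the same code at a different target model is mostly a matter of swapping in a different rule table.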

Some of the test data I am working with now, which has nothing to do with WDL, is SGML-based TEI 2 markup. The closest thing I worked with on WDL was TEI P5 for manuscript description, which is serialized in XML. Turns out my TEI mapping rules from before blew up on this TEI 2 stuff, as lxml.etree (naturally) wasn’t digging the non-XML input. I googled around a bit for how best to parse TEI (or any SGML) in Python and then discovered it’s actually simple as pie.
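For a sense of why a strict XML parser chokes here, the sketch below feeds a small SGML-style fragment to the standard library's `xml.etree.ElementTree`, used as a stand-in on the assumption that lxml.etree rejects this kind of input in much the same way by default. The fragment and the `xml_parse_error` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up TEI 2-style SGML: lowercase doctype keyword, a public identifier
# with no system identifier, and <p> elements that are never closed.
SGML_SAMPLE = """<!doctype tei2 public "-//Example//DTD example.dtd//EN">
<tei2>
<text><body><p>First paragraph
<p>Second paragraph
</body></text>
</tei2>"""


def xml_parse_error(text):
    """Return the XML parser's error message, or None if text parses cleanly."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The SGML fragment is rejected; a well-formed XML fragment is not.
print(xml_parse_error(SGML_SAMPLE) is not None)
print(xml_parse_error("<tei2><p>ok</p></tei2>"))
```

All of those shortcuts are legal in SGML when the DTD allows them, which is why a more forgiving parser is the easier route.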

Code

If you’ve got the BeautifulSoup module installed ((And you are but one sudo easy_install BeautifulSoup away from that.)):

>>> from BeautifulSoup import BeautifulSoup
>>> tei = open('foo.sgm').read()
>>> BeautifulSoup(tei).findAll('title')[0].string
u'[Memorandum to Dr. Botkin]: a machine readable transcription.'

If not, the lxml.html module works too:

>>> from lxml import html
>>> h = html.parse(open('foo.sgm'))
>>> h.xpath('//title')[0].text
'[Memorandum to Dr. Botkin]: a machine readable transcription.'
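If neither package is available, the same lenient-parsing trick can be approximated with only the standard library. This is a rough Python 3 sketch built on `html.parser.HTMLParser`, not what the mapping tools use. The `TitleCollector` class, the `extract_titles` helper, and the tagged fragment standing in for foo.sgm are assumptions made for the example.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element in lenient, HTML-like markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.titles = []     # finished title strings
        self._buffer = None  # text chunks gathered while inside a <title>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "title":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_titles(sgml_text):
    """Return the text of all <title> elements found in sgml_text."""
    parser = TitleCollector()
    parser.feed(sgml_text)
    parser.close()
    return parser.titles


# Hypothetical fragment standing in for foo.sgm; the element nesting is assumed.
SAMPLE = """<!doctype tei2 public "-//Example//DTD example.dtd//EN">
<tei2>
<teiheader>
<filedesc>
<titlestmt>
<title>[Memorandum to Dr. Botkin]: a machine readable transcription.</title>
</titlestmt>
</filedesc>
</teiheader>
</tei2>"""

print(extract_titles(SAMPLE)[0])
```

The parser knows nothing about the TEI DTD, so this only suits simple extraction jobs like grabbing a title. Anything that depends on SGML tag omission rules being resolved correctly needs a real SGML toolchain.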

Data

And here’s what the sample data looks like:

<!doctype tei2 public "-//Library of Congress - Historical Collections (American Memory)//DTD ammem.dtd//EN" 
[
<!entity % images system "07010101.ent"> %images;
]>




wpa0-07010101
[Memorandum to Dr. Botkin]: a machine readable transcription.
Life Histories from the Folklore Project, WPA Federal Writers' Project, 1936-1940; American Memory, Library of Congress.


Selected and converted.
American Memory, Library of Congress.


Washington, DC, 1994.

Preceding element provides place and date of transcription only.

For more information about this text and this American Memory collection, refer to accompanying matter.

U.S. Work Projects Administration, Federal Writers' Project (Folklore Project, Life Histories, 1936-39); Manuscript Division, Library of Congress. Copyright status not determined; refer to accompanying matter.

The National Digital Library Program at the Library of Congress makes digitized historical materials available for education and scholarship.

This transcription is intended to have an accuracy of 99.95 percent or greater and is not intended to reproduce the appearance of the original work. The accompanying images provide a facsimile of this work and represent the appearance of the original.

1994/03/15 2002/04/05
0001

Memorandum to Dr. Botkin from G. B. Roberts, May 26, 1941

Subject: Alabama Material

This material has not yet been accessioned and has only beeen been roughly classified as life histories, folklore, and miscellaneous data and copy save in the case of the 2 ex-slave items and the essay on Jesse Owens, each of which was recommended.

Total no. of items recommended: 3 (14 pp.) In progress