A comparative analysis of keyword extraction techniques [excerpt]

March 6, 2006

With the widespread digitization of printed materials and the steady growth of "born-digital" resources, certain questions arise about access and discoverability. One such question is whether the full text of this content, produced by advanced optical character recognition (OCR) techniques, is sufficient as a descriptor of the content. Will the model of mass digitization and full-text searching enable users to find the information they need? Or will we need to continue employing the classification skills of highly qualified human beings to ensure that information is discoverable? The latter model seems to have worked well for the library community, with trained indexers and catalogers summarizing documents according to established standards and widely used thesauri or controlled vocabularies. The predictability of these techniques has some obvious benefits: consistency across different systems, the ability to construct browse interfaces in addition to search interfaces, and fewer common inconsistencies such as differences in case, punctuation, and spelling. The process of human classification has thus proven to be quite effective in our endeavors to organize information.

The question of whether we will continue to classify digital content in a similar manner ought to be asked. Is there any hope of keeping up with the dizzying pace at which documents are digitized? Classification is a costly, time-consuming process, requiring highly trained individuals to consume a large amount of information and summarize it. If the goal is to continue digitizing information and making it accessible at the current rate, it is improbable that human catalogers and indexers will be able to keep up without sacrificing some of the quality that results from their considerable skills. Yet the goal of enhancing access to and discoverability of digital content is one that ought to be pursued, and it will likely not be realized through full-text searching alone. Indeed, why should we put so much time and effort into the process of digitization if it does not benefit our users?

Fortunately, automatic keyword extraction has received much attention. As the phrase implies, automatic keyword extraction is a process by which representative terms are systematically extracted from a text with either minimal or no human intervention, depending on the model. The goal of automatic extraction is to apply the power and speed of computation to the problems of access and discoverability, adding value to information organization and retrieval without the significant costs and drawbacks associated with human indexers. Research is taking place in numerous fields across the globe, and there is no clear frontrunner among the technologies and algorithms. This paper explores five approaches to keyword extraction, as presented in research papers, to demonstrate the different ways keywords may be extracted, to reflect commonalities between the approaches, and to evaluate their results. Each paper is presented in its own section, for ease of organization.
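
To make the idea of extracting representative terms with no human intervention concrete, here is a minimal sketch in Python. It is not one of the five approaches the paper surveys; it is only a naive term-frequency baseline, and the function name `extract_keywords`, the tiny stopword list, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use far larger lists).
STOPWORDS = {
    "the", "of", "and", "a", "an", "to", "in", "is", "for", "that",
    "with", "are", "as", "by", "on", "or", "it", "this", "be", "so", "from",
}

def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms in text."""
    # Lowercasing and keeping only alphabetic runs sidesteps differences
    # in case and punctuation between occurrences of the same term.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical sample text, written for this example.
sample = (
    "Keyword extraction selects representative terms from a text. "
    "Automatic extraction applies computation to the text, so extraction scales."
)
print(extract_keywords(sample, top_n=2))
```

A frequency count like this requires no trained indexer, but it also has none of the consistency that a controlled vocabulary provides, which is part of why more sophisticated approaches are worth comparing.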

... Read the paper in its entirety.
