Leveraging Human Intelligence: Semi-automated Processing in Assuring Access to Digital Content

Research output: Other contribution (peer-reviewed)

Abstract

The need for standardization in the content production industry has led producers of popular authoring and publishing applications to adopt structured mark-up languages, such as XML, to implement their content file formats. As part of our effort to ensure long-term access to such content, we need to consider the properties of the mark-up schemas and devise methods to enable effective mapping among them. The methods may range from a fully automated mapping between two formats to semi-automated format transformation of individual artifacts through human intervention. The former is the ideal scenario and is achievable when full specifications of the original and target formats are available and when the development of a full converter is economically feasible. However, these resources are commonly lacking, which creates challenges and requires exploration of alternative approaches.

To that end, we propose a concerted research effort on the formal characterization of mark-up languages and of the programming languages that can be used to express transformations of content and structures described through document mark-up. Among these, we anticipate an important role for methods akin to programming-by-example, where the document transformation is learnt by observing user interaction with contemporary applications as the user manually performs changes in the document. Based on the observed examples, the program then generalizes the transformations across that document and to similar documents in the corpus. Such approaches, which leverage human input and effectively infer the desired transformations, are essential both for creating and testing automatic document converters in general and for handling the long tail of structured document formats for which converters are not available.
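
As an illustration only, the programming-by-example idea can be sketched in a few lines of Python. This is not the method proposed in this work; it is a deliberately minimal stand-in that assumes the user's manual edit only renamed elements and left the tree shape unchanged. The function names `learn_tag_mapping` and `apply_tag_mapping` and the sample tag names are hypothetical.

```python
import xml.etree.ElementTree as ET


def learn_tag_mapping(before_xml: str, after_xml: str) -> dict:
    """Infer a tag-renaming map from one before/after example pair.

    Assumption: the user's edit only renamed elements, so both trees
    have the same shape and can be walked in parallel.
    """
    before = list(ET.fromstring(before_xml).iter())
    after = list(ET.fromstring(after_xml).iter())
    if len(before) != len(after):
        raise ValueError("example pair differs in structure, not just in tags")

    mapping = {}
    for old, new in zip(before, after):
        # Record the first observed renaming and reject contradictions.
        seen = mapping.setdefault(old.tag, new.tag)
        if seen != new.tag:
            raise ValueError(f"inconsistent example for <{old.tag}>")
    return mapping


def apply_tag_mapping(xml_text: str, mapping: dict) -> str:
    """Apply a learned renaming to another document of the same format."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags never seen in the example are left unchanged.
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")
```

Given one manually converted document, the learned mapping can then be replayed on similar documents. A realistic system would also have to infer structural changes, attribute rewrites, and content edits, and would ask the user for further examples whenever the observed ones are ambiguous.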
Original language: English
Publication status: Published - Sept 2013
Externally published: Yes
