Metadata Extraction and Digital Preservation: An Overview
Milena Dobreva¹,², Yunhyong Kim¹ and Seamus Ross¹

¹ Digital Curation Centre (DCC) & Humanities Advanced Technology and Information Institute (HATII), University of Glasgow, 11 University Gardens, Glasgow, G12 8QJ, UK. {s.ross, y.kim, m.dobreva}@hatii.arts.gla.ac.uk

² Digital Humanities Department, Institute of Mathematics and Informatics, 8 Acad. G. Bonchev St., 1113 Sofia, Bulgaria. dobreva@math.bas.bg
Abstract. Preservation metadata are at the core of the activities that guarantee the long-term sustainability and usability of digital resources. The field of preservation metadata is currently more advanced on theoretical issues, with most effort invested in developing preservation schemas and studying interoperability. Recent research trends in automated metadata generation are not well integrated into preservation metadata workflows, although preservation metadata, like all other types of metadata, cannot be created manually at a pace compatible with that at which digital resources are being created. In this paper we investigate where the current needs of the preservation metadata field intersect with the achievements of automated metadata generation. We also place this in the context of the preservation activities framework of the DELOS reference model.

Keywords: preservation metadata, metadata generation, manual and automatic generation, DELOS DLRM
1 Introduction
Never before has such a wealth of information been accessible to the global public as on the World Wide Web (WWW); this situation is commonly referred to as the information deluge. Estimates of the growing volume of information yield figures which are hard even to comprehend. For example, an IDC analysis [22] estimated that 97 billion emails, over 40 billion of which were spam messages, were being sent daily worldwide in 2007, and that the total volume of business emails sent annually worldwide in 2007 approached 5 exabytes.
Emails are only one of many varieties of digital objects, and long-term sustainability is a common issue for all types of digital content. To ensure it, electronic resources should be accompanied by preservation metadata. Over the last few years, various institutions and consortia have worked on proposals for preservation metadata element sets (see, e.g., [21]). Preservation metadata, like all other types of metadata, are affected by the metadata bottleneck [23]: the human effort available to create metadata cannot keep pace with the creation of new digital resources.
The consequences of this situation are worrying. Zhang and Jastram [35] recently published a study of human behaviour in metadata creation, based on a sample of 2,400 websites comprising four groups of 600 sites each, grouped by the professional community of their origin, in order to analyse the different approaches to metadata entry displayed by these four communities. The study showed that 51.17% of the websites created within the Library and Information Science community had embedded metadata; websites created by the Information Technology community featured metadata in 66.5% of the studied cases, while 66.7% of the websites of governmental and non-profit organisations and 67% of the websites from the Business and Industries sector did so. Another worrying example comes from a recent evaluation of a German national digitisation programme, which revealed "insufficient metadata practice, endangering the usage of the digital documents, not to speak of their preservation: 33% of the objects had no metadata at all, 33% bibliographic metadata only, 10% had both bibliographic and subject metadata (rest: no information). Less than a third of the metadata was digital."¹
However, coping with the volume of digital information and the consequent shortfall in metadata production is not the only concern. Researchers have found that the quality of manually created metadata relies heavily on a combination of two factors: institutional processes and personal behaviour. Motivation, the difficulty of working with the application, the difficulty of understanding the scope of the project, and the subsequent use of the metadata in information retrieval are mentioned among the basic factors which influence quality [4].
Although it may seem unexpected in this setting of deficient metadata quantity and quality, a further problem is emerging: information redundancy in metadata collections [8]. Information redundancy arises when the same digital object is supplied with metadata in different places (duplicated effort where human resources are already scarce), and when vast numbers of objects supplied with similar metadata are ingested into a digital repository, making them hard to distinguish.
Under these circumstances, the application of automated metadata extraction at the time of ingest into a digital repository is a necessity. Automation would help supply more objects with metadata and improve metadata quality; it could also help to improve metadata content in cases of redundancy.
In this paper we present an overview of current work in the field of preservation metadata in Section 2. In Section 3 we outline and analyse current research trends in metadata generation in general. Finally, in Section 4 we discuss this work in the context of preservation metadata and attempt to map it onto the preservation activities of the DELOS Digital Library Reference Model, in order to present the topic in the wider digital library context.
¹ DELOS brainstorming on Metadata topics, Juan les Pins, 05/12/2005, http://www.ukoln.ac.uk/ukoln/staff/t.koch/pres/Brainst200512-MDc.html
2 Preservation Metadata
Preservation metadata are defined as 'descriptive, structured and administrative metadata that supports the long-term preservation of digital materials' [21]. This definition is structural on the one hand, placing preservation metadata within the general metadata classification; on the other hand, it is functional, because it explains the rationale behind having this set of metadata elements. Preservation metadata should help to solve issues caused by the technology dependence of digital materials; they should also address the mutable nature of digital objects.
In the preservation metadata field, recent work has concentrated on modelling schemas, while the development of automatic extraction tools and their integration into practice lags behind [21]. This means that one important question to which we still do not have a good answer is: how can we ensure, through the proper use of automatic tools, that we produce and use well-documented digital resources with improved metadata quality?
This question is important for both content and service providers. It influences the quality of the product which the content providers produce, while service providers rely on the quality of metadata and, where it is not high enough, must apply measures to improve it. It has been suggested that preservation metadata cover five major areas: provenance, authenticity, preservation activity, technical environment and rights management [21].
A detailed framework specifically for choosing a preservation planning procedure is suggested by Strodl et al. [31]. The framework consists of defining requirements, evaluating alternatives and considering results; metadata requirements are defined during the first stage. However, metadata creation methods are not fully explored.
Preservation Workflows in Digital Archives
Oltmans et al. present preservation functionality in a digital archive implemented within the e-Depot (http://www.kb.nl/dnp/e-depot/factsandfigures-en.html), the digital archiving system of the National Library of the Netherlands [26]. As of November 2007, the e-Depot held 10 million e-journal articles from more than 5,000 e-journal titles; the articles are either online publications or published on CD-ROMs and other offline media. The workflow of the e-Depot includes automated validation and pre-processing of the electronic publication; automated generation and resolution of identifier numbers; automated search and retrieval functions; and automated identification, authentication and authorization of users. The cataloguing, i.e. the creation of metadata for the ingested material, is performed manually. The deposit system is based on OAIS [29]. Another OAIS-oriented example is the DIGARCH project presented by JaJa [17], which aims to build a multi-institution testbed for scalable digital archiving.
Portico (http://www.portico.org), an archive of electronic scholarly journals, was launched by JSTOR in 2002 as the Electronic-Archiving Initiative. As of November 2007, 2,784,947 articles had been ingested into the archive. Owens [27] describes the automated ingest workflow of Portico: the publishers' document type definitions (DTDs) are used, with random sampling to check for possible problems. The descriptive metadata extracted for use within Portico METS (Metadata Encoding and Transmission Standard, http://www.loc.gov/standards/mets/) files do not include all the metadata accompanying the publication, because some of it is found to be redundant (reflecting in-house publishing processes only). Among the directions for enhancement, the generation of minimal descriptive metadata is mentioned for cases where no XML file is supplied; this in fact means that currently the quality of the descriptive metadata does not meet a pre-defined common level.

These examples show that in practical workflows, once a particular standard has been chosen, the matter of metadata content and production appears to be settled in advance. However, this is no guarantee of good quality metadata.
3 Basic Trends in Metadata Generation
Current research on metadata extraction is directed toward general metadata and is not specifically focused on preservation metadata.
What hints can manual metadata creation give us? Human operators follow three basic steps:
1. Visual scanning of the document;
2. Mental analysis, which results in the identification of the metadata types and their values; and
3. Entering the recognised/generated metadata in the proper form.
To do such work, operators need proper training and familiarity with the metadata structure, the computing standards used and the quality requirements: what types of metadata should be entered and how detailed they should be. Manual metadata entry, especially in specialized fields (e.g. the description of mediaeval manuscripts or archival documents, or linguistic annotation within a text), is not guaranteed to be correct and complete, because the quality of the work depends on the experience and level of involvement of the operator.

Although the fundamental challenge in automated metadata extraction is to find ways to execute the second step (the analysis which results in the identification of metadata types, and the values for these metadata), even the first step (the scanning of the document) is not a trivial task, owing to the variety of document content and electronic formats.
Recent research can be grouped into three directions, which fit into various stages of the metadata lifecycle:
1. Methods for automated extraction. These are most commonly based on domain-specific indexing, formalisms for knowledge representation (e.g. ontologies), automatic abstracting, document genre recognition, and automatic generation from semi-structured metadata. They are best suited for implementation as part of the process of ingesting digital resources into a repository.
2. Methods for metadata enrichment. These can be used during the ingest of digital materials, as well as for improving the quality of digital repositories.
3. Methods for generating preservation metadata for web resources at the time of dissemination. These are tailored for web resources and appropriate for workflows reflecting their life cycle [30].
Extracting Specific Document Elements
The methods suggested for extracting metadata from specific document elements fall into three categories: rule-based approaches, neural-network-based approaches and statistical approaches.
A. Rule-based approaches
This group of methods applies rules over different information characteristics (the layout of the source documents and natural language features). Giuffrida et al. developed a rule-based system for metadata extraction from research papers in PostScript [9]. The authors used general layout rules, such as "titles are usually located on the upper portions of the first pages and they are usually in the largest font sizes".
Yilmazel et al. developed the MetaExtract system, which assigns Dublin Core + GEM (Gateway to Educational Materials) metadata to educational materials (lesson plans and web-based educational activities in mathematics and science at secondary school level) using rule-based natural language processing technologies [34]. MetaExtract has three distinct extraction modules: (i) the eQuery module, a rule-based system using shallow parsing rules to extract terms and phrases within single sentences, which are then assigned to the following metadata elements: Creator, Title, Date, Grade, Duration, Essential Resources, Pedagogy-Teaching Method, Pedagogy-Grouping, Pedagogy-Assessment, Pedagogy-Process, Audience, Standards, Publisher, and Relations; (ii) an HTML-based Extraction module, which operates by comparing the text to a previously developed list of clue words; and (iii) a Keyword Generator module, which operates by computing the standard TF-IDF (term frequency - inverse document frequency) metric on each document. The quality of the extracted metadata was evaluated through a web-based survey, conducted via a questionnaire providing a lesson plan and its associated metadata, either manually or automatically assigned. The survey showed a significant difference between manual and automated extraction for two of the elements, Title and Keyword (the quality was higher when they were extracted manually). The quality of the remaining extracted elements (Description, Grade, Duration, Essential Resources, Pedagogy-Teaching Method, and Pedagogy-Group) was not found to differ significantly between the automated and manual approaches.
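The TF-IDF computation behind a keyword generator such as MetaExtract's module (iii) can be sketched in a few lines; the snippet below uses scikit-learn's TfidfVectorizer and illustrates the general technique only, not MetaExtract's implementation.

# Illustrative TF-IDF keyword generator; not MetaExtract's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(documents: list[str], k: int = 5) -> list[list[str]]:
    """Return the k highest-TF-IDF terms for each document in the corpus."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)   # shape: (docs, terms)
    terms = vectorizer.get_feature_names_out()
    results = []
    for row in tfidf.toarray():
        best = row.argsort()[::-1][:k]            # indices of top-k weights
        results.append([terms[i] for i in best if row[i] > 0])
    return results

if __name__ == "__main__":
    corpus = [
        "Lesson plan: introducing fractions with pizza slices.",
        "Lesson plan: measuring plant growth in the school garden.",
    ]
    print(top_keywords(corpus, k=3))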
Mao et al. performed automatic metadata extraction from medical research papers using rules on formatting information [25]. Their work concerns a system to generate descriptive metadata (title, author, affiliation, and abstract) from scanned medical journals, for the preservation of scanned and online medical journal articles at the U.S. National Library of Medicine (NLM). The system consists of two modules: (i) ZoneMatch, which generates geometric and contextual features from a set of issues of each journal, and (ii) ZoneCzar, a rule-based labelling module which uses the generated features to perform labelling independently of journal layout styles.
B. Neural-network approaches
Automatic extraction of metadata has also been attempted using a neural network [33]. This patented method is intended for use in data archiving systems and is adaptable to non-standard documents where metadata locations are unknown. The claim is that the method extracts more metadata with greater accuracy and reliability, though no estimates are given.

The first step of the method is to provide a computer-readable text document, an authority list consisting of common uses of a set of words, and a neural network trained to extract metadata from groupings of data called compounds. In the next step, the words within the document are compared against the authority list. In the third step, the compounds are processed through the neural network to generate metadata guesses. The metadata may then be derived from these guesses by selecting the document, compound, and word guesses having the largest document, compound, and word confidence factors, respectively.
C. Statistical approaches
Another track in extracting specific metadata elements is based on the application of statistical methods.

Han et al. describe a Support Vector Machine (SVM) classification-based method as a machine learning method with better performance (higher precision) than Hidden Markov Models (HMM) [13]. They cast the problem as classifying the lines of a document into metadata categories and propose using an SVM as the classifier. This method is also used in the research of Council et al. [3], where the objects of study are acknowledgements in research publications. The paper describes a mixed method for the automatic identification and extraction of acknowledgements from research documents using a combination of a Support Vector Machine and regular expressions. The algorithm has been implemented as a plug-in to the CiteSeer Digital Library. As a demonstration, the authors used CiteSeer's autonomous citation indexing (ACI) feature to measure the relative impact of acknowledged entities and presented the top twenty acknowledged entities within the archive. The experimental results showed a precision of 0.7845 and a recall of 0.8955.
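The line-classification formulation used by Han et al. can be illustrated with a small sketch; the pipeline, label set and training lines below are invented for the example and do not reproduce the authors' feature engineering.

# Illustrative sketch of SVM-based line classification in the spirit of [13];
# features, labels and training data are invented for the example and do not
# reproduce the authors' actual feature engineering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny, hypothetical training set: document lines paired with metadata labels.
train_lines = [
    "A Study of Layout Rules in Digital Documents",
    "John Doe and Jane Roe",
    "University of Nowhere, Department of Examples",
    "Abstract. We study how layout rules help metadata extraction.",
]
train_labels = ["title", "author", "affiliation", "abstract"]

# TF-IDF features over character n-grams cope with short, noisy lines.
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
classifier.fit(train_lines, train_labels)

print(classifier.predict(["Richard Roe and John Stiles"]))  # e.g. ['author']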
Hu et al. present the automatic extraction of titles from the bodies of documents encoded in HTML [14]. The title field of HTML documents is often not correctly filled in; in such cases, Hu et al. suggest that the title be constructed from the body of the HTML document. The authors propose a supervised machine learning approach which uses formatting information (font size, position, and font weight) as additional features in the title extraction process. The proposed method is reported to significantly outperform the baseline of taking the lines in the largest font size as the title (a 20.9%-32.6% improvement).
The task of title extraction was developed further in subsequent publications of this group, which present title extraction from Word and PowerPoint documents [15], [16]. The authors again apply machine learning to title extraction from general documents belonging to a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. In their approach, titles in sample documents (for Word and PowerPoint respectively) are annotated and used as training data, on which machine learning models are built and then applied to perform title extraction. The method is distinctive in that it mainly uses formatting information, such as font size, as model features. The results show that the use of formatting information can lead to quite accurate extraction from general documents: reported precision and recall for title extraction from Word documents are 0.810 and 0.837 respectively, and precision and recall for PowerPoint are 0.875 and 0.895. Another significant result is that models can be trained in one domain and then applied to another.
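The use of formatting information as learning features can be sketched as follows; the feature names and the classifier choice are our illustrative assumptions, not those of [15], [16].

# Illustrative per-line formatting features for learned title extraction, in
# the spirit of [15], [16]; feature names and the classifier choice are
# assumptions made for the example only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def line_features(font_size: float, max_font: float, bold: bool,
                  line_index: int) -> dict:
    """Encode the formatting cues reported as most useful for titles."""
    return {
        "relative_font_size": font_size / max_font,
        "is_bold": 1.0 if bold else 0.0,
        "near_top": 1.0 if line_index < 3 else 0.0,
    }

# Hypothetical training data: formatting of lines labelled title / not title.
X = [
    line_features(18.0, 18.0, True, 0),   # big, bold, first line
    line_features(11.0, 18.0, False, 5),  # body text
    line_features(9.0, 18.0, False, 1),   # running header
]
y = [1, 0, 0]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(X, y)
print(model.predict([line_features(16.0, 16.0, True, 1)]))  # likely [1]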
The extraction of metadata from news items using the SVM method is discussed by Debnath and Giles [6], who performed experiments on headline metadata extraction. News metadata include DateLine, ByLine and HeadLine; the paper demonstrates that the HeadLine is especially helpful in locating explanatory sentences for major events, such as significant changes in stock prices reported in financial news articles. Another application of the support vector machine is presented by Diekema et al. [7], who use it for hierarchical text categorization (assigning predefined labels to text documents). The aim of this research was to provide search tools for NSDL (The National Science Digital Library, http://nsdl.org/), a digital library of teaching resources, which would take account of the educational standards of the different states. Automated standards alignment was performed using 27 state standards for training the system and 20 standards for testing. The recall reported for Mathematics and Science standards was over 99%; precision was 72.89% for Mathematics and 60.55% for Science standards.
Liu et al. address the automatic identification, extraction, and search of the contents of tables in documents [24]. To extract table contents and their metadata, an automatic table metadata extraction algorithm was designed and tested on PDF documents. The algorithm includes three processing steps: first, the PDF document is converted into formatted text; second, table candidates are detected based on location analysis and keyword matching, and table metadata are extracted; finally, table candidates are confirmed or rejected. The algorithm was tested on 120 randomly selected PDF documents from digital libraries; the reported experimental results show good performance, with overall precision, recall and accuracy of over 95%.
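The keyword-matching step of such a table detector can be sketched simply; the regular expression and the column-gap test below are illustrative assumptions and do not reproduce the algorithm of [24].

# Illustrative sketch of table-candidate detection by keyword matching over
# text extracted from a PDF; the heuristics are assumptions for illustration
# and do not reproduce the algorithm of [24].
import re

CAPTION = re.compile(r"^\s*(Table|TABLE)\s+([IVX]+|\d+)[.:]?\s*(.*)")

def find_table_candidates(text: str) -> list[dict]:
    """Return caption metadata for lines that look like table captions."""
    candidates = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        m = CAPTION.match(line)
        if not m:
            continue
        # Simple location analysis: a real table should follow the caption,
        # so require at least one nearby line with multi-space column gaps.
        following = lines[i + 1 : i + 6]
        has_columns = any(re.search(r"\S\s{2,}\S", l) for l in following)
        candidates.append({
            "label": f"{m.group(1)} {m.group(2)}",
            "caption": m.group(3).strip(),
            "line": i,
            "confirmed": has_columns,
        })
    return candidates

sample = ("Table 1. Functionality elements.\n"
          "Function      Notes on Use\n"
          "visualize     preserves look and feel\n")
print(find_table_candidates(sample))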
Day et al. applied a hierarchical template-based reference metadata extraction method to scholarly publications [5]. The authors implemented a hierarchical knowledge representation framework called INFOMAP, which automatically extracts author, title, journal, volume, number (issue), year, and page information. The experimental results show that, using INFOMAP, these fields can be extracted from different kinds of reference styles with a high degree of precision: an overall average accuracy of 92.39% is reported for the six major reference styles compared in the study.
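Template-based reference parsing can be illustrated with a single regular-expression template; the pattern below covers one hypothetical reference style only and is far simpler than the hierarchical INFOMAP framework.

# Illustrative single-template reference parser; INFOMAP itself is a
# hierarchical framework covering many styles, which this sketch is not.
import re

# Hypothetical template for references shaped like:
# "Doe, J., Roe, J.: Title of the Paper. Journal Name 43, 152--167 (2007)."
TEMPLATE = re.compile(
    r"(?P<authors>[^:]+):\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<journal>.+?)\s+(?P<volume>\d+),\s+"
    r"(?P<pages>\d+--\d+)\s+\((?P<year>\d{4})\)"
)

def parse_reference(ref: str) -> dict | None:
    m = TEMPLATE.search(ref)
    return m.groupdict() if m else None

print(parse_reference(
    "Day, M., Tsai, R.: Reference Metadata Extraction Using a Hierarchical "
    "Knowledge Representation Framework. Decision Support Systems 43, "
    "152--167 (2007)."
))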
Metadata Extraction in Specific Subject Domains
There are also studies on automating metadata generation in specific subject domains. Cardinaels et al. [2] discuss an interface for generating learning object metadata (LOM). As categories of metadata sources they suggest document content analysis, document context analysis, document usage, and composite document structure.

In the specialized area of geospatial metadata generation, Batcheller [1] suggested using an appropriate GIS (geographic information system). This approach sits between the fields of data management and metadata generation; in selected cases where geospatial information is needed, it could contribute to more correct and complete data entry.
Performance Evaluation of Metadata Extraction
Performance evaluation of the various metadata extraction methods, as well as comparison between manual and automatic metadata extraction, is important for benchmarking metadata extraction tools. As a rule, papers report precision and recall results for their approaches and/or give comparisons with a baseline method for the specific field.
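Since precision and recall recur throughout this section, we recall their standard definitions (a textbook formulation, not specific to any cited system):

\[
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}
\]

where \(TP\) is the number of correctly extracted items, \(FP\) the number of spurious extractions, and \(FN\) the number of items that should have been extracted but were missed.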
Greenberg [11] explores the capabilities of two Dublin Core automatic metadata generation applications, Klarity (a tool which appears no longer to be supported since the producing company was bought) and DC.dot. The top-level web page for each resource in a sample of 29 resources obtained from the National Institute of Environmental Health Sciences (NIEHS) was submitted to both generators. The results indicate that text extraction algorithms can contribute to automated metadata generation, and that harvesting metadata from META tags created by humans can have a positive impact on automatic metadata generation. The conclusion of the study is that integrating automated extraction methods will contribute to the creation of optimal metadata.
A survey of metadata experts' opinions on the functionalities of automated metadata generation applications was carried out by Greenberg et al. [12]. The paper reports on the Automatic Metadata Generation Applications (AMeGA) project's metadata expert survey. Participants anticipated greater accuracy from automatic techniques when dealing with technical metadata (e.g. ID, language, and format metadata) than when dealing with metadata requiring intellectual discretion (e.g. subject and description metadata). Support for implementing automatic techniques paralleled the anticipated accuracy results: metadata experts are in favour of using automatic techniques, although they are generally not in favour of eliminating human evaluation or production for the more intellectually demanding metadata extraction processes. The results were incorporated into Version 1.0 of the Recommended Functionalities for automatic metadata generation applications.

Additional research is needed to identify how automatic generation could be combined with manual metadata entry so that the best possible quality is achieved. There is also a need for more active research, twinned with implementation activities, leading to the extraction of preservation metadata, or to the enrichment of existing metadata records, at the time of ingest into the repository or at a later stage of the metadata lifecycle. It is encouraging that ongoing research on metadata generation includes experiments with various methods which could be applied at different metadata lifecycle stages (ingest, enrichment and dissemination).
Ongoing research involves the extraction of various metadata elements from different types of documents, and the reported precision and recall results differ considerably across document types. We therefore believe that future solutions should include components which analyse the document genre, as suggested by Kim and Ross [18], [19], [20], and accordingly select the method which is likely to give better results for that document type.
4 Preservation as a Part of the Digital Library Reference Model
The analysis of the ongoing research shows that methods for preservation metadata generation are still not in use, and that the basic 'guarantee' of metadata quality is the use of an established standard. To illustrate the place of preservation in the digital library world, we use the DELOS Digital Library Reference Model (http://www.delos.info/index.php?option=com_content&task=view&id=345), a formal conceptual framework describing the characteristics of digital libraries as information systems. It introduces the main concepts (entities) and the relationships between them, grouped into six domains (Content, User, Functionality, Architecture, Quality and Policy) on three levels: digital library, digital library system and digital library management system.

The DELOS Digital Library Reference Model provides a general framework for discussing preservation-related objects and processes through the definitions of resource and information object within the Content domain. The specific concepts which are part of the DELOS DLRM and may be used to model preservation are listed below; they are also highlighted in Fig. 1, which presents the basic concepts of the Content domain.
− Resource <hasMetadata> Information Object
Ideally, information objects would be accompanied by metadata that can be used to automate decisions about preservation. This includes, for example, the date when an information object can be destroyed, and the format of the information object, which can be used to determine when the technology needed for interpreting the format disappears, necessitating migration to a different format.
− Resource <hasFormat> Resource Format
The format of an information object is important for its correct interpretation. The issue of format applies both to primary InformationObjects and to MetadataObjects, which are themselves InformationObjects.
− Resource <hasQuality> Quality Parameter
− Resource <linkedTo> Resource
Seeing an information object in its original context is important for the correct understanding of its meaning.
− Ontology, with its specialization Format, is crucial for preservation. Format specifications need to be preserved so that information objects using an old format, or a previous version of an existing format, can still be interpreted. Likewise, the different versions of a subject ontology need to be preserved, so that subject metadata prepared using an old version of an ontology can be interpreted properly.
Fig. 1. Concepts and relationships within the Content domain presented as a concept map.
Preservation-related concepts are highlighted.
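As a minimal sketch, the Content-domain relationships listed above could be encoded as follows; the class and field names are our illustrative reading of the model, not an official DELOS DLRM binding.

# Illustrative encoding of the Content-domain relationships used above;
# class and field names are our reading of the model, not an official binding.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ResourceFormat:
    name: str           # e.g. "PDF 1.4"
    specification: str  # pointer to the preserved format specification

@dataclass
class InformationObject:
    identifier: str
    # Resource <hasFormat> Resource Format
    format: ResourceFormat | None = None
    # Resource <hasMetadata> Information Object: metadata objects are
    # themselves information objects, hence the self-reference.
    metadata: list[InformationObject] = field(default_factory=list)
    # Resource <linkedTo> Resource: preserves the original context.
    linked_to: list[InformationObject] = field(default_factory=list)
    # Resource <hasQuality> Quality Parameter (names only, for brevity).
    quality: dict[str, str] = field(default_factory=dict)

article = InformationObject("doc-42", ResourceFormat("PDF 1.4", "iso-32000"))
premis = InformationObject("doc-42-md", ResourceFormat("XML 1.0", "w3c-xml"))
article.metadata.append(premis)
article.quality["Integrity"] = "checksum verified 2007-11-01"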
The functions in the Functionality domain which are important for preservation are presented in Table 1.
Table 1. Functionality elements in the DELOS DLRM model.

convertTransform: a function used for the conversion of files, including format conversion.
visualize: a function which is important for preserving look and feel.
compare: a function which ascertains whether two instances of an information object are the same.
withdraw: a function which supports deciding whether to maintain a withdrawn object in a secondary store or to delete the object completely.
export: a function which exports an entire digital library, or pieces of it, to create a mirror site or a backup copy; it also makes information objects, especially metadata objects, available for importing by another system (harvesting).
Configure DL: a function which saves the configuration state after any changes.
EvaluateMetadata: a function which initiates the evaluation of a selected set of quality parameters used to decide on the quality of the metadata accompanying a digital resource; the results are used in determining whether metadata should be extracted or enriched.
ExtractMetadata: a function which initiates extraction of metadata.
EnrichMetadata: a function which initiates automated enrichment of metadata.
Log Keeping: a function which supports logging of system actions and use. It is important for preservation in two ways: (1) it allows preserving the state of the total system at any given time, and (2) it provides a usage history of objects, which preserves the context for later uses.
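A minimal sketch of how the three metadata-related functions in Table 1 might be wired together at ingest time follows; the quality threshold and the stub function bodies are illustrative assumptions, not part of the DELOS DLRM.

# Illustrative wiring of EvaluateMetadata, ExtractMetadata and EnrichMetadata
# at ingest time; thresholds and stub bodies are assumptions, not DLRM content.
from dataclasses import dataclass, field

@dataclass
class DigitalResource:
    identifier: str
    metadata: dict[str, str] = field(default_factory=dict)

REQUIRED_ELEMENTS = ("title", "creator", "format", "provenance")

def evaluate_metadata(resource: DigitalResource) -> float:
    """EvaluateMetadata: score completeness against a required element set."""
    present = sum(1 for e in REQUIRED_ELEMENTS if resource.metadata.get(e))
    return present / len(REQUIRED_ELEMENTS)

def extract_metadata(resource: DigitalResource) -> None:
    """ExtractMetadata: stub for an automated extractor (e.g. title rules)."""
    resource.metadata.setdefault("title", "<extracted title>")

def enrich_metadata(resource: DigitalResource) -> None:
    """EnrichMetadata: stub for automated enrichment of existing records."""
    resource.metadata.setdefault("provenance", "<derived provenance note>")

def ingest(resource: DigitalResource, threshold: float = 0.75) -> None:
    """Run extraction/enrichment only when evaluation falls below threshold."""
    if evaluate_metadata(resource) < threshold:
        extract_metadata(resource)
        enrich_metadata(resource)

r = DigitalResource("obj-1", {"creator": "J. Doe", "format": "PDF"})
ingest(r)
print(sorted(r.metadata))  # ['creator', 'format', 'provenance', 'title']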
In the Policy domain, two policies relate directly to preservation: the Preservation policy and the Disposal policy. In particular, digital rights govern which preservation measures can be taken, for instance with respect to making backup copies.
Among the quality parameters, the following are of particular importance for preservation:

Generic quality parameters:
− Security Enforcement
− Interoperability Support
− Documentation Coverage

Content quality parameters:
− Integrity
− Authenticity
− Authoritativeness
− Fidelity
− Provenance

Functionality quality parameters:
− Performance or behaviour
− Fault management

Architecture quality parameter:
− Compliance to standards
As shown above, the concepts and relationships of the DELOS Digital Library Reference Model are capable of modelling the overall preservation process. Using the model helps to address preservation issues consistently and at the needed level of detail.
5 Conclusion
The preservation of digital material is recognised as one of the vital issues for safeguarding the European heritage. As Viviane Reding, EU Commissioner for Information Society and Media, stressed in February 2007 [28]:

"… if we do not actively pursue the preservation of digital material now, we risk having a gap in our intellectual record. If you allow me another historical reference, we do not want to experience the digital equivalent of the destruction of the Alexandria Library. Scientific assets are just too valuable to be put at risk."
While manual metadata extraction definitely cannot meet the current needs of metadata production, automatic extraction cannot be seen as a universal solution for obtaining metadata content either. More attention should be paid to developing combined approaches based on a comparison of manual and automatic extraction quality with respect to different metadata elements. Another promising research direction is adding intelligent elements to the preservation metadata lifecycle, e.g. analysing the document genre in order to select the best automated extraction tool, and implementing self-documenting components. Given the high management costs of digital collections, it is also necessary to find ways of adding the value of preservation metadata to other digital library functionalities. The DELOS Digital Library Reference Model helps one to understand better the preservation-related components and processes within the digital world and to model the processes in each separate case according to the specific needs of the collection.
Acknowledgements. The research is being conducted as part of the Digital Curation Centre's (DCC) research programme. It has been supported by DELOS: Network of Excellence on Digital Libraries (G038-507618), funded under the European Commission's IST 6th Framework Programme.
References
[1] Batcheller J.K.: Automating Geospatial Metadata Generation--An Integrated Data
Management and Documentation Approach. In: Proc. of the 10th AGILE International
Conference on Geographic Information Science, 7 pp. (2007).
[2] Cardinaels, K., Meire, M., Duval, E.: Automating Metadata Generation: the Simple
Indexing Interface. In Proc. 14th Int. Conf. on World Wide Web (Chiba, Japan, May 10--14,
2005). WWW '05. ACM, New York, NY, 548--556. (2005).
[3] Council, I., Giles, C., Han H., Manavoglu, E.: Automatic Acknowledgement Indexing:
Expanding the Semantics of Contribution in the CiteSeer Digital Library. Proc. of the 3rd int.
conf. on Knowledge capture, Banff, Alberta, Canada, 19--26, ISBN:1-59593-163-5 (2005).
[4] Crystal A., Greenberg J.: Usability of a Metadata Creation Application for Resource
Authors, Library & Information Science Research V. 27(2), 177--189 (2005).
[5] Day, M., Tsai, R., Sung, C., Hsieh, C., Lee, C., Wu, C., Wu, K., Ong, C., Hsu, W.: Reference Metadata Extraction Using a Hierarchical Knowledge Representation Framework. Decision Support Systems 43, 152--167 (2007).
[6] Debnath, S., Giles, C.: A Learning Based Model for Headline Extraction of News Articles
to Find Explanatory Sentences for Events. Proc. of the 3rd Int. Conf. on Knowledge
Capture, Banff, Alberta, Canada, 189--190, ISBN:1-59593-163-5 (2005).
[7] Diekema, A. R., Yilmazel, O., Bailey, J., Harwell, S. C., Liddy, E. D.: Standards Alignment for Metadata Assignment. In: Proc. 2007 Conference on Digital Libraries (Vancouver, BC, Canada, June 18--23, 2007). JCDL '07. ACM, 398--399 (2007).
[8] Foulonneau M.: Information Redundancy across Metadata Collections. Information
Processing & Management V. 43 (3), Special Issue on Heterogeneous and Distributed IR,
740--751 (2007).
[9] Giuffrida, G., Shek, E., Yang, J.: Knowledge-based Metadata Extraction from PostScript Files. In: Proc. 5th ACM Int. Conf. on Digital Libraries, 77--84 (2000).
[10] Glick, K. L., Wilczek, E., Dockins, R.: The Ingest and Maintenance of Electronic Records:
Moving from Theory to Practice. Proc. 6th ACM/IEEE-CS Joint Conf. on Digital Libraries
(Chapel Hill, NC, USA). JCDL '06. ACM Press, New York, NY, 359--359. (2006).
[11] Greenberg J.: Metadata Extraction and Harvesting: A Comparison of Two Automatic
Metadata Generation Applications, Journal of Internet Cataloging, 6(4): 59--82 (2004).
[12] Greenberg, J., Spurgin, K., Crystal, A.: Functionalities for Automatic Metadata Generation
Applications: a Survey of Metadata Experts’ Opinions. Int. J. of Metadata, Semantics &
Ontologies, 1(1), 3--20. (2006).
[13] Han, H., Giles, L., Manavoglu, E., Zha, H., Zhang, Z., Fox, E.A.: Automatic Document Metadata Extraction Using Support Vector Machines. In: Proc. 3rd ACM/IEEE-CS Joint Conf. on Digital Libraries, 37--48 (2003).
[14] Hu, Y., Xin, G., Song, R., Hu, G., Shi, S., Cao, Y., Li, H.: Title Extraction from Bodies of
HTML Documents and its Application to Web Page Retrieval. Proc. 28th Int. ACM SIGIR
Conf. on Research and Development in Information Retrieval, Salvador, Brazil, 250--257,
ISBN:1-59593-034-5 (2005).
[15] Hu, Y., Li, H., Cao, Y., Meyerzon, D., Zheng, Q.: Automatic Extraction of Titles from General Documents using Machine Learning. In: Proc. 5th ACM/IEEE-CS Joint Conf. on Digital Libraries, Denver, CO, USA, 145--154, ISBN: 1-58113-876-8 (2005).
[16] Hu, Y., Li. H., Cao, Y., Teng, L, Meyerzon, D., Zheng, Q.: Automatic Extraction of Titles
from General Documents using Machine Learning. In: Information Processing and
Management 42, 1276--1293 (2006).
[17] JaJa J.: Robust Technologies for Automated Ingestion and Long-Term Preservation of
Digital Information. In Proc. of the 2006 Int. Conf. on Digital Government Research (San
Diego, California,). dg.o '06, vol. 151. ACM Press, New York, NY, 285--286 (2006).
[18] Kim Y., Ross S.: Genre Classification in Automated Ingest and Appraisal Metadata. Proc.
10th European Conference on research and advanced technology for digital libraries (ECDL
2006), Springer, LNCS 4172, ISBN 3-540-44636-2, pp. 63--74 (2006).
[19] Kim Y., Ross S.: Detecting Family Resemblance: Automated Genre Classification. Data Science Journal, Vol. 6, S172--S183, ISSN: 1683-1470 (2007).
[20] Kim Y., Ross S.: Examining Variations of Prominent Features in Genre Classification. To
appear in Proc. of the 41st Hawaiian International Conference on System Sciences, IEEE
Computer Society Press (2008).
[21] Lavoie, B., Gartner, R.: Preservation Metadata. A Joint Report of OCLC, Oxford Library Services, and the Digital Preservation Coalition (DPC), published electronically as a DPC Technology Watch Report (No. 05-01), http://www.dpconline.org/docs/reports/dpctw0501.pdf (2005).
[22] Levitt M.: Worldwide Email Usage 2007-2011 Forecast: Resurgence of Spam Takes Its
Toll. IDC Market Analysis #206038, 40 pp. (2007).
[23] Liddy, E.D.: A Breadth of NLP Applications. ELSENEWS of the European Network in
Human Language Technologies. Winter. (2002).
[24] Liu, Y., Mitra, P., Giles, C., Bai, K.: Automatic Extraction of Table Metadata from Digital Documents. In: Proc. 6th ACM/IEEE-CS Joint Conf. on Digital Libraries, 339--340, ISBN: 1-59593-354-9 (2006).
[25] Mao, S., Kim, J., Thoma, G.: A Dynamic Feature Generation System for Automated
Metadata Extraction in Preservation of Digital Materials. Proc of the First Int. Workshop on
Document Image Analysis for Libraries, Palo Alto, CA, 225--232 (2004).
[26] Oltmans, E., van Diessen, R., van Wijngaarden, H.: Preservation Functionality in a Digital Archive. In: Proc. 4th ACM/IEEE-CS Joint Conf. on Digital Libraries (Tucson, AZ, USA, June 7--11, 2004). JCDL '04. ACM Press, New York, NY, 279--286 (2004).
[27] Owens E.: Automated Workflow for the Ingest and Preservation of Electronic Journals. In:
Chapman S, Stovall SA (eds). Archiving 2006: Final Program and Proc. 109--112 (2006).
[28] Reding V.: Scientific Information In The Digital Age: How Accessible Should Publicly
Funded Research Be?, Closing speech, Conf. on Scientific Publishing in the European
Research Area Access, Dissemination and Preservation in the Digital Age, Brussels (2007).
[29] Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-B-1,
http://public.ccsds.org/publications/archive/650x0b1.pdf (2002).
[30] Smith, J. A. and Nelson, M. L.: Generating Best-Effort Preservation Metadata for Web
Resources at Time of Dissemination. Proc. of the 2007 Conf. on Digital Libraries.
Vancouver, BC, Canada, JCDL '07. ACM Press, New York, NY, 51--52 (2007).
[31] Strodl, S., Becker, C., Neumayer, R., Rauber, A.: How to Choose a Digital Preservation
Strategy: Evaluating a Preservation Planning Procedure. Proc. of the 2007 Conference on
Digital Libraries. Vancouver, BC, Canada, JCDL '07. ACM Press, New York, NY, 29-38
(2007).
[32] Stuckenschmidt H., van Harmelen F.: Generating and Managing Metadata for Web-Based
Information Systems, Knowledge-Based Systems Volume 17, Issues 5-6, Special Issue:
Web Intelligence, 201--206 (2004).
[33] US Patent 6044375: Automatic Extraction of Metadata Using a Neural Network,
Inventors: O. Shmueli, D. Greig, C. Staelin, T. Tamir. Assignee: Hewlett-Packard Company.
(2000).
[34] Yilmazel, O., Finneran, C., Liddy, E.: MetaExtract: an NLP System to Automatically Assign Metadata. In: Proc. 4th ACM/IEEE-CS Joint Conf. on Digital Libraries, Tucson, AZ, USA, 241--242, ISBN: 1-58113-832-6 (2004).
[35] Zhang, J., Jastram, I.: A Study of the Metadata Creation Behavior of Different User Groups on the Internet. Information Processing & Management, Vol. 42, 1099--1122 (2006).