
Metadata Extraction and Digital Preservation: An Overview

2007, C. Thanos, F. Borri and A. Launaro (eds.), Second DELOS Conference on Digital Libraries


Metadata Extraction and Digital Preservation: An Overview

Milena Dobreva1,2, Yunhyong Kim1 and Seamus Ross1

1 Digital Curation Centre (DCC) & Humanities Advanced Technology and Information Institute (HATII), University of Glasgow, 11 University Gardens, Glasgow, G12 8QJ, UK. {s.ross, y.kim, m.dobreva}@hatii.arts.gla.ac.uk
2 Digital Humanities Department, Institute of Mathematics and Informatics, 8 Acad. G. Bonchev St., 1113 Sofia, Bulgaria. dobreva@math.bas.bg

Abstract. Preservation metadata are at the core of the activities which guarantee the long-term sustainability and usability of digital resources. Currently the field of preservation metadata is more advanced in theoretical issues, with most of the effort invested in developing preservation schemas and studying interoperability issues. Recent research trends in automated metadata generation are not well integrated into preservation metadata workflows, although preservation metadata, like all other types of metadata, cannot be created manually at a pace compatible with that at which digital resources are being created. In this paper we investigate where the intersection of the current needs of the preservation metadata field and the achievements of automated metadata generation lies. We also place this in the context of the preservation activities framework of the DELOS reference model.

Keywords: preservation metadata, metadata generation, manual and automatic generation, DELOS DLRM

1 Introduction

Never before has such a wealth of information been accessible to the global public as on the World Wide Web (WWW); this situation is commonly referred to as the information deluge. Estimates of the growing volume of information yield figures which are hard even to comprehend. For example, an IDC analysis [22] estimates that 97 billion emails, over 40 billion of which are spam messages, are being sent daily worldwide in 2007, and that the total volume of business emails sent annually worldwide in 2007 approaches 5 exabytes. Emails are only one of the many varieties of digital objects.

Long-term sustainability is a common issue for all types of digital content. To ensure it, electronic resources should be accompanied by preservation metadata. Over the last few years, various institutions and consortia have worked on suggestions for preservation metadata element sets (see, e.g., [21]). Preservation metadata, like all other types of metadata, are affected by the metadata bottleneck [23]: the human effort available to create metadata cannot cope with the pace of creation of new digital resources.

The consequences of this situation are quite worrying. Recently, Zhang and Jastram [35] published the results of a study of human behaviour in metadata creation, on a sample of 2,400 websites in four groups of 600 web sites each, representing four professional communities, in order to analyse the different approaches to metadata entry displayed by these communities. The research showed that 51.17% of the web sites created within the Library and Information Science community had embedded metadata. The web sites created by the Information Technology community featured metadata in 66.5% of the studied cases; 66.7% of the web sites of governmental and non-profit organisations and 67% of the web sites from the Business and Industries sector had metadata.
Another worrying example comes from a recent evaluation of a German national digitisation programme, which reveals "insufficient metadata practice, endangering the usage of the digital documents, not to speak of their preservation: 33% of the objects had no metadata at all, 33% bibliographic metadata only, 10% had both bibliographic and subject metadata (rest: no information). Less than a third of the metadata was digital." (DELOS brainstorming on metadata topics, Juan les Pins, 05/12/2005, http://www.ukoln.ac.uk/ukoln/staff/t.koch/pres/Brainst200512-MDc.html)

However, coping with the volume of digital information and the consequent insufficient production of metadata is not the only concern. The quality of manually created metadata has been found by researchers to rely heavily on the combination of two factors: institutional processes and personal behaviour. Motivation, the difficulty of working with the application, the difficulty of understanding the scope of the project, and the subsequent use of the metadata in information retrieval are mentioned among the basic factors which influence quality [4].

Although unexpected in this setting of deficient metadata quantity and quality, new problems of information redundancy in metadata collections are emerging [8]. Information redundancy arises when the same digital object is supplied with metadata in different places (replicated effort where human resources are already insufficient) and when vast numbers of objects supplied with similar metadata are ingested into a digital repository, making them hard to identify. Under these circumstances, the application of automated metadata extraction at the time of ingest into a digital repository is a necessity. Automation would help to supply more objects with metadata and to improve metadata quality; it could also help to improve metadata content in cases of redundancy.

In this paper we present an overview of current work in the field of preservation metadata in Section 2. In Section 3 we outline and analyse current research trends in metadata generation in general. Finally, in Section 4 we discuss the work described in the preceding sections in the context of preservation metadata, and attempt to map it onto the preservation activities of the DELOS Digital Library Reference Model, to present the topic in the wider digital library context.

2 Preservation Metadata

Preservation metadata are defined as 'descriptive, structured and administrative metadata that supports the long-term preservation of digital materials' [21]. This definition is structural on the one hand, in that it places preservation metadata within the general metadata classification; on the other hand, it is functional, because it explains the rationale behind having this set of metadata elements. Preservation metadata should help to solve issues caused by the technology-dependence of digital materials. They should also address the mutable nature of digital objects.

In the preservation metadata field, more work has been done recently on modelling schemas, while the development of automatic extraction tools and their integration into practice still lag behind [21]. This means that one important question to which we still do not have a good answer is: how can we ensure that we produce and use well-documented digital resources with improved metadata quality, making proper use of automatic tools?
This question is important both for content providers and for service providers. It influences the quality of the product which content providers produce; service providers rely on the quality of metadata, and where it is not high enough they must apply measures to improve it.

It has been suggested that preservation metadata cover five major areas: provenance, authenticity, preservation activity, technical environment and rights management [21]. A detailed framework specifically for choosing a preservation planning procedure is suggested by Strodl et al. [31]. The framework includes defining requirements, evaluating alternatives and considering results; metadata requirements are defined during the first stage. However, metadata creation methods are not fully explored.

Preservation Workflows in Digital Archives

Oltmans et al. present the preservation functionality of a digital archive implemented within the e-Depot (http://www.kb.nl/dnp/e-depot/factsandfigures-en.html), the digital archiving system of the National Library of the Netherlands [26]. As of November 2007, the e-Depot includes 10 million e-journal articles from more than 5,000 e-journal titles. The articles are either online publications or published on CD-ROMs and other offline media. The workflow of the e-Depot includes automated validation and pre-processing of the electronic publication; automated generation and resolution of identifier numbers; automated search and retrieval functions; and automated identification, authentication and authorisation of users. The cataloguing, i.e. the creation of metadata for the ingested material, is performed manually. The deposit system is based on OAIS [29]. Another OAIS-oriented example is the DIGARCH project presented by JaJa [17], which aims to build a multi-institution testbed for scalable digital archiving.

The Portico archive of electronic scholarly journals (http://www.portico.org) was launched by JSTOR as the Electronic-Archiving Initiative in 2002. As of November 2007, 2,784,947 articles have been ingested into the archive. Owens [27] describes the automated ingest workflow of Portico. The publishers' document type definitions (DTDs) are used, with random sampling to check for possible problems. The descriptive metadata extracted for use within Portico METS (Metadata Encoding and Transmission Standard, http://www.loc.gov/standards/mets/) files do not include all the metadata which accompany a publication, because some of these are found to be redundant (reflecting in-house publishing processes only). Among the directions for enhancement, the generation of minimal descriptive metadata is mentioned for cases where no XML file is supplied; this in fact means that the quality of the descriptive metadata does not currently meet a pre-defined common level.

These examples show that in practical workflows, once a particular standard has been chosen, the matter of the content and production of metadata seems to be settled in advance. However, this does not guarantee the supply of good-quality metadata.

3 Basic Trends in Metadata Generation

Current research on metadata extraction is directed toward general metadata and does not focus on preservation metadata. What hints could manual metadata creation give us? Human operators follow three basic steps:
1. Visual scanning of the document.
2. Mental analysis which results in the identification of the metadata types and their values.
3. Entering the recognised/generated metadata in the proper form.
To be able to do such work, operators need proper training and familiarity with the metadata structure, the computing standards used and the quality requirements: what types of metadata should be entered and how detailed they should be. Manual metadata entry, especially in specialised fields (e.g. the description of mediaeval manuscripts or archival documents, or linguistic annotation within a text), is not guaranteed to be correct and complete, because the quality of the work depends on the experience and level of involvement of the operator.

Although the fundamental challenge in automated metadata extraction is to find ways to execute the second step (the analysis which results in the identification of metadata types, and the values for these metadata), even the first part (the scanning of the document) is not a trivial task, owing to the variety of document content and electronic formats.

Recent research can be grouped into work taking one of three directions, which would fit into various stages of the metadata lifecycle:
1. Methods for automated extraction. These are most commonly based on domain-specific indexing, formalisms for knowledge representation (e.g. ontologies), automatic abstracting, document genre recognition, or automatic generation from semi-structured metadata. These methods are best suited for implementation as part of the process of ingesting digital resources into a repository.
2. Methods for metadata enrichment. These could be used during the ingest of digital materials, as well as for improving the quality of digital repositories.
3. Methods for generating preservation metadata for web resources at the time of dissemination. This approach is tailored to web resources and appropriate for workflows reflecting their life cycle [30].

Extracting Specific Document Elements

The methods suggested for extracting metadata from specific document elements fall into three categories: rule-based approaches, neural-network-based approaches and statistical approaches.

A. Rule-based approach

This group of methods applies rules over different information characteristics (the layout of the source documents and natural language features). Giuffrida et al. developed a rule-based system for metadata extraction from research papers in PostScript [9]. The authors used general layout rules, such as "titles are usually located on the upper portions of the first pages and they are usually in the largest font sizes".
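As a minimal illustration of this kind of layout rule (the line representation and the position threshold below are our own assumptions for the sketch, not details of the system in [9]):

# Each parsed line of a document: (page number, y offset from page top, font size, text).
def guess_title(lines):
    """Layout rule: the title is usually on the upper portion of the first
    page and set in the largest font size."""
    first_page_top = [(size, text) for page, y, size, text in lines
                      if page == 1 and y < 300]  # illustrative "upper portion" threshold
    if not first_page_top:
        return ""
    largest = max(size for size, _ in first_page_top)
    # Join the line(s) set in the largest font into a title candidate.
    return " ".join(text for size, text in first_page_top if size == largest)

lines = [
    (1, 50.0, 18.0, "Metadata Extraction and Digital Preservation"),
    (1, 90.0, 11.0, "Milena Dobreva, Yunhyong Kim, Seamus Ross"),
    (1, 400.0, 10.0, "1 Introduction ..."),
]
print(guess_title(lines))  # prints the largest-font line near the top of page 1

Real systems combine many such rules and must cope with layouts where the heuristic fails, which is why evaluation against a baseline is standard practice in this literature.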
Yilmazel et al. developed the system MetaExtract, which assigns Dublin Core + GEM (Gateway to Educational Materials) metadata to educational materials (lesson plans and web-based educational activities in mathematics and science at secondary school level) using rule-based natural language processing technologies [34]. MetaExtract has three distinct extraction modules: (i) an eQuery module (a rule-based system using shallow parsing rules to extract terms and phrases within single sentences, which are then assigned to the following metadata elements: Creator, Title, Date, Grade, Duration, Essential Resources, Pedagogy-Teaching Method, Pedagogy-Grouping, Pedagogy-Assessment, Pedagogy-Process, Audience, Standards, Publisher, and Relations); (ii) an HTML-based Extraction module, which operates by comparing the text to a previously developed list of clue words; and (iii) a Keyword Generator module, which operates by computing the standard TF-IDF (term frequency - inverse document frequency) metric on each document.

The quality of the extracted metadata was evaluated through a web-based survey built around a questionnaire presenting a lesson plan and its associated metadata, either manually or automatically assigned. The survey showed a significant difference between manual and automated extraction for two of the elements, Title and Keyword (the quality was higher when they were extracted manually). The quality of the remaining extracted elements (Description, Grade, Duration, Essential Resources, Pedagogy-Teaching Method, and Pedagogy-Group) was not found to differ significantly between the automated and manual approaches.
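As a point of reference for the Keyword Generator module above, the TF-IDF weighting can be illustrated with a short sketch (toy sentences and a naive whitespace tokeniser of our own; this is not MetaExtract's implementation):

import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term of each tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

docs = [s.lower().split() for s in [
    "students measure the area of a triangle",
    "students graph a linear equation",
    "the lesson reviews the triangle inequality",
]]
for w in tf_idf(docs):
    # The highest-weighted terms are the candidate keywords for that document.
    print(sorted(w, key=w.get, reverse=True)[:3])

Terms frequent in one document but rare across the collection receive the highest weights, which is exactly what makes them plausible keywords.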
Mao et al. conducted automatic metadata extraction from medical research papers using rules over formatting information [25]. Their work is concerned with developing a system to generate descriptive metadata (title, author, affiliation, and abstract) from scanned medical journals, for the preservation of scanned and online medical journal articles at the U.S. National Library of Medicine (NLM). The system consists of two modules: (i) ZoneMatch, which generates geometric and contextual features from a set of issues of each journal, and (ii) ZoneCzar, a rule-based labelling module which uses the generated features to perform labelling independently of journal layout styles.

B. Neural networks approach

Automatic extraction of metadata has also been attempted using a neural network [33]. The patented method is intended for use in data archiving systems and is adaptable to non-standard documents where metadata locations are unknown. The claim is that the method extracts more metadata with greater accuracy and reliability (estimates are not given). The first step of the method is to provide a computer-readable text document, an authority list consisting of common uses of a set of words, and a neural network trained to extract metadata from groupings of data called compounds. In the next step, the words within the document are compared against the authority list. In the third step, the compounds are processed through the neural network to generate metadata guesses. The metadata may then be derived from the guesses by selecting those document, compound, and word guesses having the largest document, compound, and word confidence factors, respectively.

C. Statistical approach

Another track in the extraction of specific metadata elements is based on the application of statistical methods. Han et al. describe a Support Vector Machine (SVM) classification-based method, a machine learning method with better performance (higher precision) than Hidden Markov Models (HMM) [13]. They cast the task as classifying the lines of a document into metadata categories and propose using an SVM as the classifier. This method is also used in the research of Council et al. [3], where the objects studied are acknowledgements in research publications. The paper describes a mixed method for the automatic identification and extraction of acknowledgements from research documents, using a combination of a Support Vector Machine and regular expressions. The algorithm has been implemented as a plug-in to the CiteSeer digital library. As a demonstration, the authors used CiteSeer's autonomous citation indexing (ACI) feature to measure the relative impact of acknowledged entities, and present the top twenty acknowledged entities within the archive. The experimental results showed a precision of 0.7845 and a recall of 0.8955.

Hu et al. present the automatic extraction of titles from the bodies of documents encoded in HTML [14]. The title field of HTML documents is often not filled in correctly; in such cases, Hu et al. suggest that the title be constructed from the body of the HTML document. The authors propose a supervised machine learning approach which uses format information (font size, position, and font weight) as additional features in the title extraction process. It is reported that the proposed method significantly outperforms the baseline method of taking the lines in the largest font size as the title (a 20.9%–32.6% improvement). The task of title extraction was developed further in subsequent publications of this group, where title extraction from Word and PowerPoint documents was presented [15], [16]. The authors again apply machine learning to title extraction from general documents belonging to a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. In their approach, titles in sample documents (for Word and PowerPoint respectively) are annotated and taken as training data, on which machine learning models are constructed and then used to perform title extraction. The method is unique in that it mainly utilises formatting information, such as font size, as features in the models. The results show that the use of formatting information can lead to quite accurate extraction from general documents: the reported precision and recall for title extraction from Word documents are 0.810 and 0.837 respectively, and for PowerPoint 0.875 and 0.895. Another significant result is that models can be trained in one domain and then applied to another.
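Several of the systems above ([13], [3], [14]) cast extraction as the supervised classification of text units such as lines. A minimal sketch of this idea using scikit-learn (the training lines and labels are invented toy data; the published systems use far richer lexical and layout features):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set: document lines labelled with metadata categories.
lines = [
    "Automatic Document Metadata Extraction Using Support Vector Machines",
    "H. Han, C. L. Giles, E. Manavoglu",
    "Abstract. Automatic metadata generation provides scalability ...",
    "In this section we describe the proposed classifier ...",
]
labels = ["title", "author", "abstract", "body"]

# Character n-grams stand in here for the richer lexical and formatting
# features (font size, position, etc.) used in the published systems.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
clf.fit(lines, labels)

# Each unseen line is assigned the most likely metadata category.
print(clf.predict(["Y. Kim, S. Ross", "Genre Classification in Automated Ingest"]))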
Extraction of metadata from news using the SVM method is discussed by Debnath and Giles [6], who performed experiments on headline metadata extraction. News metadata include DateLine, ByLine and HeadLine. The paper demonstrates that HeadLine is especially helpful in locating explanatory sentences for major events, such as significant changes in stock prices reported in financial news articles.

Another application of the support vector machine is presented by Diekema et al. [7], who use it for hierarchical text categorisation (assigning predefined labels to text documents). The aim of this research was to provide tools for searching a digital library of teaching resources, the NSDL (The National Science Digital Library, http://nsdl.org/), which would incorporate the educational standards of the different states. Automated standards alignment was thus performed, with 27 state standards used for training the system and 20 standards used for testing. The recall reported for the Mathematics and Science standards was over 99%; precision was 72.89% for the Mathematics standards and 60.55% for the Science standards.

Liu et al. present the task of automatic identification, extraction, and search of the contents of tables in documents [24]. To extract the contents of tables and their metadata, an automatic table metadata extraction algorithm was designed and tested on PDF documents. The algorithm includes three processing steps: first, the PDF document is converted into formatted text; second, table candidates are detected based on location analysis and keyword matching, and table metadata are extracted; finally, table candidates are confirmed or rejected. The algorithm was tested on 120 randomly selected PDF documents from digital libraries. The reported experimental results show that the algorithm performs well, with an overall precision, recall and accuracy of over 95%.

Day et al. applied a hierarchical template-based reference metadata extraction method to scholarly publications [5]. The authors implemented a hierarchical knowledge representation framework called INFOMAP, which automatically extracts author, title, journal, volume, number (issue), year, and page information. The experimental results show that, by using INFOMAP, these fields can be extracted from different kinds of reference styles with a high degree of precision: an overall average accuracy of 92.39% is reported across the six major reference styles compared in the study.

Metadata Extraction in Specific Subject Domains

There are also studies on automating metadata generation in specific subject domains. Cardinaels et al. [2] discuss an interface for generating learning object metadata (LOM). As categories of metadata sources they suggest document content analysis, document context analysis, document usage, and composite document structure. In the specialised area of geospatial metadata generation, Batcheller [1] suggested using an appropriate GIS (geographic information system). This approach sits somewhere between data management and metadata generation; in selected cases where geospatial information is needed, it could contribute to more correct and complete data entry.

Performance Evaluation of Metadata Extraction

Performance evaluation of the various metadata extraction methods, as well as comparison between manual and automatic metadata extraction, is important for the benchmarking of metadata extraction tools. As a rule, papers report the precision and recall of their approaches and/or a comparison with a baseline method for the specific field.
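For reference, these two standard measures, quoted throughout this section, are defined over the true positives (TP), false positives (FP) and false negatives (FN) of an extraction run:

\[
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}
\]

Precision thus penalises spurious extractions, while recall penalises missed ones; a method can score highly on one at the expense of the other, which is why both are reported.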
Greenberg [11] explores the capabilities of two Dublin Core automatic metadata generation applications: Klarity (a tool which appears no longer to be supported since the producing company was bought) and DC.dot. The top-level web page of each resource in a sample of 29 resources obtained from the National Institute of Environmental Health Sciences (NIEHS) was submitted to both generators. The results indicate that text extraction algorithms can contribute to automated metadata generation. They also indicate that harvesting metadata from META tags created by humans can have a positive impact on automatic metadata generation. The conclusion of the study is that integrating automated extraction methods will contribute to the creation of optimal metadata.

A survey of metadata experts' opinions on the functionalities of automated metadata generation applications was conducted by Greenberg et al. [12], reporting on the Automatic Metadata Generation Applications (AMeGA) project's metadata expert survey. Participants anticipate greater accuracy from automatic techniques when dealing with technical metadata (e.g. ID, language, and format metadata) than when dealing with metadata requiring intellectual discretion (e.g. subject and description metadata). Support for implementing automatic techniques paralleled the anticipated accuracy results: metadata experts are in favour of using automatic techniques, although they are generally not in favour of eliminating human evaluation or production for the more intellectually demanding metadata extraction processes. The results are incorporated into Version 1.0 of the Recommended Functionalities for automatic metadata generation applications.

Additional research is needed to identify how automatic generation can be combined with manual metadata entry so that the best possible quality is achieved. There is also a need for more active research, twinned with implementation activities, leading to the extraction of preservation metadata, or the enrichment of existing metadata records, at the time of ingest into the repository or at a later stage of the metadata lifecycle. It is very positive that ongoing research on metadata generation includes experiments with various methods which could be applied at different stages of the metadata lifecycle (ingest, enrichment and dissemination), and involves the extraction of various metadata elements from different types of documents. The reported precision and recall results differ considerably across document types. We therefore believe that future solutions should include components which analyse the document genre, as suggested by Kim and Ross [18], [19], [20], and accordingly select the method which is likely to give better results for that document type.

4 Preservation as a Part of the Digital Library Reference Model

The analysis of ongoing research shows that methods for preservation metadata generation are still not in use, and that the basic 'guarantee' of metadata quality is the use of an established standard. To illustrate the place of preservation in the digital library world, we use the DELOS Digital Library Reference Model (http://www.delos.info/index.php?option=com_content&task=view&id=345), a formal and conceptual framework describing the characteristics of digital libraries as information systems. It introduces the main concepts (entities) and the relationships between them, grouped into six domains (Content, User, Functionality, Architecture, Quality and Policy) on three levels: digital library, digital library system and digital library management system. The DELOS Digital Library Reference Model provides the general framework for discussing preservation-related objects and processes through the definitions of resource and information object within the Content domain. The specific concepts which are part of the DELOS DLRM and may be used to model preservation are listed below.
They are also highlighted in Fig. 1, which presents the basic concepts of the Content domain.

− Resource <hasMetadata> Information Object. Ideally, information objects would be accompanied by metadata that can be used to automate decisions about preservation. This includes, for example, the date when an information object can be destroyed, and the format of the information object, which can be used to determine when the technology needed for interpreting the format disappears, necessitating migration to a different format.
− Resource <hasFormat> Resource Format. The format of an information object is important for its correct interpretation. The issue of format applies both to primary InformationObjects and to MetadataObjects, which are themselves InformationObjects.
− Resource <hasQuality> Quality Parameter.
− Resource <linkedTo> Resource. Seeing an information object in its original context is important for the correct understanding of its meaning.

Ontology, with its specialisation Format, is crucial for preservation. Format specifications need to be preserved so that information objects using an old format, or a previous version of an existing format, can still be interpreted. Likewise, the different versions of a subject ontology need to be preserved, so that subject metadata prepared using an old version of the ontology can be interpreted properly.

Fig. 1. Concepts and relationships within the Content domain presented as a concept map. Preservation-related concepts are highlighted.

The functions of the Functionality domain which are important for preservation are presented in Table 1; a sketch of how some of them might interact at ingest follows the table.

Table 1. Functionality elements in the DELOS DLRM model.
− convertTransform: a function used for the conversion of files, including format conversion.
− visualize: a function important for preserving look and feel.
− compare: a function which ascertains whether two instances of an information object are the same.
− withdraw: a function which supports deciding whether to maintain a withdrawn object in a secondary store or to delete the object completely.
− export: a function which exports an entire digital library, or pieces of it, to create a mirror site or a backup copy; it also makes information objects, especially metadata objects, available for import by another system (harvesting).
− Configure DL: a function which saves the configuration state after any changes.
− EvaluateMetadata: a function which initiates the evaluation of a selected set of quality parameters used to decide on the quality of the metadata accompanying a digital resource. The results are used to determine whether metadata should be extracted or enriched.
− ExtractMetadata: a function which initiates the extraction of metadata.
− EnrichMetadata: a function which initiates the automated enrichment of metadata.
− Log Keeping: a function which supports the logging of system actions and use. It is important for preservation in two ways: (1) it allows the state of the total system to be preserved at any given time, and (2) it provides a usage history of objects which preserves the context for later use.
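The following sketch illustrates how EvaluateMetadata could drive ExtractMetadata and EnrichMetadata at ingest, as described in Table 1. The class names, the required-element set and the threshold are our own illustrative assumptions; the DLRM specifies these functions only abstractly:

from dataclasses import dataclass, field

@dataclass
class InformationObject:
    content: str
    metadata: dict = field(default_factory=dict)

# Hypothetical required elements; the DLRM itself does not fix this set.
REQUIRED = {"title", "creator", "date", "format"}

def evaluate_metadata(obj):
    """EvaluateMetadata: score the record against selected quality parameters
    (here simply coverage of the required elements)."""
    return len(REQUIRED & obj.metadata.keys()) / len(REQUIRED)

def extract_metadata(obj):
    """ExtractMetadata: stub for an automatic extractor run over the content."""
    obj.metadata.setdefault("format", "text/plain")

def enrich_metadata(obj):
    """EnrichMetadata: stub for the automated enrichment of an existing record."""
    obj.metadata.setdefault("date", "2007")

def ingest(obj):
    # The evaluation result determines whether metadata is extracted or enriched.
    if evaluate_metadata(obj) < 0.5:
        extract_metadata(obj)
    else:
        enrich_metadata(obj)
    return obj

print(ingest(InformationObject("...", {"title": "An Overview"})).metadata)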
In the Policy domain, two policies relate directly to preservation: the Preservation policy and the Disposal policy. In particular, digital rights govern what preservation measures can be taken, for instance with respect to making backup copies.

Among the quality parameters, the following are of particular importance for preservation:
− Generic quality parameters: Security Enforcement; Interoperability Support; Documentation Coverage.
− Content quality parameters: Integrity; Authenticity; Authoritativeness; Fidelity; Provenance.
− Functionality quality parameters: Performance or behaviour; Fault management.
− Architecture quality parameter: Compliance to standards.

As we have shown above, the concepts and relationships of the DELOS Digital Library Reference Model are capable of modelling the overall preservation process. The use of the model helps to address preservation issues consistently and at the needed level of detail.

5 Conclusion

The preservation of digital material is recognised as one of the vital issues for safeguarding the European heritage. Viviane Reding, EU Commissioner for Information Society and Media, stressed in February 2007 [28]: "... if we do not actively pursue the preservation of digital material now, we risk having a gap in our intellectual record. If you allow me another historical reference, we do not want to experience the digital equivalent of the destruction of the Alexandria Library. Scientific assets are just too valuable to be put at risk."

While manual metadata extraction clearly cannot meet the current needs of metadata production, automatic extraction cannot be seen as a universal solution for obtaining metadata content either. More attention should be paid to the development of combined approaches, based on a comparison of manual and automatic extraction quality with respect to different metadata elements. Another promising research direction lies in adding intelligent elements to the preservation metadata lifecycle, e.g. analysing the document genre in order to select the best automated extraction tool, and implementing self-documenting components. Given the high management costs of digital collections, it is also necessary to find ways of adding the value of preservation metadata to other functionalities of digital libraries. The DELOS Digital Library Reference Model helps us to understand better the preservation-related components and processes within the digital world, and to model the processes in each separate case according to the specific collection needs.

Acknowledgements. The research is being conducted as part of the Digital Curation Centre's (DCC) research programme. It has been supported by DELOS: Network of Excellence on Digital Libraries (G038-507618), funded under the European Commission's IST 6th Framework Programme.

References

[1] Batcheller, J.K.: Automating Geospatial Metadata Generation: An Integrated Data Management and Documentation Approach. In: Proc. of the 10th AGILE International Conference on Geographic Information Science, 7 pp. (2007)
[2] Cardinaels, K., Meire, M., Duval, E.: Automating Metadata Generation: the Simple Indexing Interface. In: Proc. 14th Int. Conf. on World Wide Web (Chiba, Japan, May 10--14, 2005). WWW '05. ACM, New York, NY, 548--556 (2005)
[3] Council, I., Giles, C., Han, H., Manavoglu, E.: Automatic Acknowledgement Indexing: Expanding the Semantics of Contribution in the CiteSeer Digital Library. In: Proc. of the 3rd Int. Conf. on Knowledge Capture, Banff, Alberta, Canada, 19--26, ISBN 1-59593-163-5 (2005)
[4] Crystal, A., Greenberg, J.: Usability of a Metadata Creation Application for Resource Authors. Library & Information Science Research 27(2), 177--189 (2005)
[5] Day, M., Tsai, R., Sung, C., Hsieh, C., Lee, C., Wu, C., Wu, K., Ong, C., Hsu, W.: Reference Metadata Extraction Using a Hierarchical Knowledge Representation Framework. Decision Support Systems 43, 152--167 (2007)
[6] Debnath, S., Giles, C.: A Learning Based Model for Headline Extraction of News Articles to Find Explanatory Sentences for Events. In: Proc. of the 3rd Int. Conf. on Knowledge Capture, Banff, Alberta, Canada, 189--190, ISBN 1-59593-163-5 (2005)
[7] Diekema, A.R., Yilmazel, O., Bailey, J., Harwell, S.C., Liddy, E.D.: Standards Alignment for Metadata Assignment. In: Proc. 2007 Conference on Digital Libraries (Vancouver, BC, Canada, June 18--23, 2007). JCDL '07. ACM, 398--399 (2007)
[8] Foulonneau, M.: Information Redundancy across Metadata Collections. Information Processing & Management 43(3), Special Issue on Heterogeneous and Distributed IR, 740--751 (2007)
[9] Giuffrida, G., Shek, E., Yang, J.: Knowledge-based Metadata Extraction from PostScript Files. In: Proc. 5th ACM Int. Conf. on Digital Libraries, 77--84 (2000)
[10] Glick, K.L., Wilczek, E., Dockins, R.: The Ingest and Maintenance of Electronic Records: Moving from Theory to Practice. In: Proc. 6th ACM/IEEE-CS Joint Conf. on Digital Libraries (Chapel Hill, NC, USA). JCDL '06. ACM Press, New York, NY, 359 (2006)
[11] Greenberg, J.: Metadata Extraction and Harvesting: A Comparison of Two Automatic Metadata Generation Applications. Journal of Internet Cataloging 6(4), 59--82 (2004)
[12] Greenberg, J., Spurgin, K., Crystal, A.: Functionalities for Automatic Metadata Generation Applications: a Survey of Metadata Experts' Opinions. Int. J. of Metadata, Semantics & Ontologies 1(1), 3--20 (2006)
[13] Han, H., Giles, L., Manavoglu, E., Zha, H., Zhang, Z., Fox, E.A.: Automatic Document Metadata Extraction Using Support Vector Machines. In: Proc. 3rd ACM/IEEE-CS Joint Conf. on Digital Libraries, 37--48 (2003)
[14] Hu, Y., Xin, G., Song, R., Hu, G., Shi, S., Cao, Y., Li, H.: Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval. In: Proc. 28th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Salvador, Brazil, 250--257, ISBN 1-59593-034-5 (2005)
[15] Hu, Y., Li, H., Cao, Y., Meyerzon, D., Zheng, Q.: Automatic Extraction of Titles from General Documents using Machine Learning. In: Proc. of the 5th ACM/IEEE-CS Joint Conf. on Digital Libraries, Denver, CO, USA, 145--154, ISBN 1-58113-876-8 (2005)
[16] Hu, Y., Li, H., Cao, Y., Teng, L., Meyerzon, D., Zheng, Q.: Automatic Extraction of Titles from General Documents using Machine Learning. Information Processing and Management 42, 1276--1293 (2006)
[17] JaJa, J.: Robust Technologies for Automated Ingestion and Long-Term Preservation of Digital Information. In: Proc. of the 2006 Int. Conf. on Digital Government Research (San Diego, California). dg.o '06, vol. 151. ACM Press, New York, NY, 285--286 (2006)
[18] Kim, Y., Ross, S.: Genre Classification in Automated Ingest and Appraisal Metadata. In: Proc. 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), Springer, LNCS 4172, ISBN 3-540-44636-2, 63--74 (2006)
[19] Kim, Y., Ross, S.: Detecting Family Resemblance: Automated Genre Classification. Data Science Journal 6, S172--S183, ISSN 1683-1470 (2007)
[20] Kim, Y., Ross, S.: Examining Variations of Prominent Features in Genre Classification. To appear in: Proc. of the 41st Hawaii International Conference on System Sciences, IEEE Computer Society Press (2008)
[21] Lavoie, B., Gartner, R.: Preservation Metadata. A Joint Report of OCLC, Oxford Library Services, and the Digital Preservation Coalition (DPC), published electronically as DPC Technology Watch Report No. 05-01, http://www.dpconline.org/docs/reports/dpctw0501.pdf (2005)
[22] Levitt, M.: Worldwide Email Usage 2007-2011 Forecast: Resurgence of Spam Takes Its Toll. IDC Market Analysis #206038, 40 pp. (2007)
[23] Liddy, E.D.: A Breadth of NLP Applications. ELSENEWS of the European Network in Human Language Technologies, Winter (2002)
[24] Liu, Y., Mitra, P., Giles, C., Bai, K.: Automatic Extraction of Table Metadata from Digital Documents. In: Proc. of the 6th ACM/IEEE-CS Joint Conf. on Digital Libraries, 339--340, ISBN 1-59593-354-9 (2006)
[25] Mao, S., Kim, J., Thoma, G.: A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. In: Proc. of the First Int. Workshop on Document Image Analysis for Libraries, Palo Alto, CA, 225--232 (2004)
[26] Oltmans, E., van Diessen, R., van Wijngaarden, H.: Preservation Functionality in a Digital Archive. In: Proc. of the 4th ACM/IEEE-CS Joint Conf. on Digital Libraries (Tucson, AZ, USA, June 7--11, 2004). JCDL '04. ACM Press, New York, NY, 279--286 (2004)
[27] Owens, E.: Automated Workflow for the Ingest and Preservation of Electronic Journals. In: Chapman, S., Stovall, S.A. (eds.): Archiving 2006: Final Program and Proc., 109--112 (2006)
[28] Reding, V.: Scientific Information in the Digital Age: How Accessible Should Publicly Funded Research Be? Closing speech, Conf. on Scientific Publishing in the European Research Area: Access, Dissemination and Preservation in the Digital Age, Brussels (2007)
[29] Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-B-1, http://public.ccsds.org/publications/archive/650x0b1.pdf (2002)
[30] Smith, J.A., Nelson, M.L.: Generating Best-Effort Preservation Metadata for Web Resources at Time of Dissemination. In: Proc. of the 2007 Conf. on Digital Libraries, Vancouver, BC, Canada. JCDL '07. ACM Press, New York, NY, 51--52 (2007)
[31] Strodl, S., Becker, C., Neumayer, R., Rauber, A.: How to Choose a Digital Preservation Strategy: Evaluating a Preservation Planning Procedure. In: Proc. of the 2007 Conf. on Digital Libraries, Vancouver, BC, Canada. JCDL '07. ACM Press, New York, NY, 29--38 (2007)
[32] Stuckenschmidt, H., van Harmelen, F.: Generating and Managing Metadata for Web-Based Information Systems. Knowledge-Based Systems 17(5-6), Special Issue: Web Intelligence, 201--206 (2004)
[33] US Patent 6044375: Automatic Extraction of Metadata Using a Neural Network. Inventors: O. Shmueli, D. Greig, C. Staelin, T. Tamir. Assignee: Hewlett-Packard Company (2000)
[34] Yilmazel, O., Finneran, C., Liddy, E.: MetaExtract: An NLP System to Automatically Assign Metadata. In: Proc. of the 4th ACM/IEEE-CS Joint Conf. on Digital Libraries, Tucson, AZ, USA, 241--242, ISBN 1-58113-832-6 (2004)
[35] Zhang, J., Jastram, I.: A Study of the Metadata Creation Behavior of Different User Groups on the Internet. Information Processing & Management 42, 1099--1122 (2006)