Information Extraction
Information extraction (IE) is the task of automatically extracting structured information from
unstructured and/or semi-structured machine-readable documents and other electronically represented
sources. In most cases, this activity involves processing human language texts by means of natural language processing (NLP).[1] Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video and documents, can also be seen as information extraction.
Due to the difficulty of the problem, current approaches to IE (as of 2010) focus on narrowly restricted domains. An example is the extraction from newswire reports of corporate mergers, denoted by a formal relation such as MergerBetween(company1, company2, date), from an online news sentence such as:
"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific
goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically
well-defined data from a chosen target domain, interpreted with respect to category and context.
Information extraction is part of a greater puzzle which deals with the problem of devising automatic
methods for text management, beyond its transmission, storage and display. The discipline of information
retrieval (IR)[2] has developed automatic methods, typically of a statistical flavor, for indexing large
document collections and classifying documents. Another complementary approach is that of natural language processing (NLP), which has modelled human language processing with considerable success, given the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks between those of IR and NLP. In terms of input, IE assumes the existence of a
set of documents in which each document follows a template, i.e. describes one or more entities or events in
a manner that is similar to those in other documents but differing in the details. As an example, consider a group of newswire articles on Latin American terrorism, with each article presumed to be based upon one or more terrorist acts. For any given IE task, we also define a template, which is a case frame (or a set of case frames) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template.
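A minimal Python sketch of such a template, using illustrative slot names rather than any particular MUC specification, might look as follows:

    from dataclasses import dataclass
    from typing import Optional

    # A case frame for the terrorism example. Every slot starts empty and
    # is filled only if the system can locate the corresponding information.
    @dataclass
    class AttackTemplate:
        perpetrator: Optional[str] = None
        victim: Optional[str] = None
        weapon: Optional[str] = None
        date: Optional[str] = None

    # Hypothetical filled template for a single attack article.
    filled = AttackTemplate(perpetrator="unidentified gunmen",
                            victim="a local judge",
                            weapon="automatic rifles",
                            date="12 April 1989")
    print(filled)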
History
Information extraction dates back to the late 1970s in the early days of NLP.[3] An early commercial system
from the mid-1980s was JASPER built for Reuters by the Carnegie Group Inc with the aim of providing
real-time financial news to financial traders.[4]
Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), who
wished to automate mundane tasks performed by government analysts, such as scanning newspapers for
possible links to terrorism.
Present significance
The present significance of IE pertains to the growing amount of information available in unstructured
form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of
documents[6] and advocates that more of the content be made available as a web of data.[7] Until this
transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge
contained within these documents can be made more accessible for machine processing by means of
transformation into relational form, or by marking up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.[8] Typical subtasks of IE include the following:
Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators,
victims, time, etc. from a newspaper article about a terrorist attack.
Event extraction: Given an input document, output zero or more event templates. For
instance, a newspaper article might describe multiple terrorist attacks.
Knowledge base population: Fill a database of facts given a set of documents. Typically the database is in the form of triples, (entity 1, relation, entity 2), e.g. (Barack Obama, Spouse, Michelle Obama).
Named entity recognition: recognition of known entity names (for people and
organizations), place names, temporal expressions, and certain types of numerical
expressions, by employing existing knowledge of the domain or information extracted
from other sentences.[9] Typically the recognition task involves assigning a unique
identifier to the extracted entity. A simpler task is named entity detection, which aims at
detecting entities without having any existing knowledge about the entity instances. For
example, in processing the sentence "M. Smith likes fishing", named entity detection
would denote detecting that the phrase "M. Smith" does refer to a person, but without
necessarily having (or using) any knowledge about a certain M. Smith who is (or, "might
be") the specific person whom that sentence is talking about.
Coreference resolution: detection of coreference and anaphoric links between text
entities. In IE tasks, this is typically restricted to finding links between previously-
extracted named entities. For example, "International Business Machines" and "IBM"
refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing.
But he doesn't like biking", it would be beneficial to detect that "he" is referring to the
previously detected person "M. Smith".
Relationship extraction: identification of relations between entities[9] (a minimal sketch follows this list), such as:
PERSON works for ORGANIZATION (extracted from the sentence "Bill works for
IBM.")
PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as the extraction of tables and their contents from documents.[10][11]
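As a concrete illustration of the named-entity and relationship-extraction subtasks, the sketch below uses the open-source spaCy library and its small English model; this is an assumption made for the example only, since no particular toolkit is prescribed, and the relation rule is deliberately naive:

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Bill works for IBM. He is in France.")

    # Named entity recognition: label spans such as PERSON, ORG, GPE.
    for ent in doc.ents:
        print(ent.text, ent.label_)

    # Naive relationship extraction: emit (entity1, relation, entity2)
    # triples when a PERSON and an ORG occur in a sentence containing
    # the phrase "works for". Purely illustrative, not a robust method.
    for sent in doc.sents:
        persons = [e.text for e in sent.ents if e.label_ == "PERSON"]
        orgs = [e.text for e in sent.ents if e.label_ == "ORG"]
        if "works for" in sent.text:
            for p in persons:
                for o in orgs:
                    print((p, "works_for", o))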
This list is not exhaustive, the exact scope of IE activities is not commonly agreed upon, and many approaches combine multiple IE sub-tasks in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.
World Wide Web applications
In Web-oriented IE, extraction procedures known as wrappers typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent efforts on adaptive information extraction have motivated the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured texts.
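A minimal sketch of a hand-written wrapper for a highly structured page, assuming the open-source BeautifulSoup library (real wrappers are often induced automatically from example pages rather than coded by hand):

    # Requires: pip install beautifulsoup4
    from bs4 import BeautifulSoup

    # A toy product-catalog page; real wrappers target live web pages.
    html = """
    <table>
      <tr><td class="name">Foo Widget</td><td class="price">9.99</td></tr>
      <tr><td class="name">Bar Gadget</td><td class="price">19.50</td></tr>
    </table>
    """

    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("tr"):
        name = row.select_one("td.name").get_text(strip=True)
        price = float(row.select_one("td.price").get_text(strip=True))
        records.append({"name": name, "price": price})
    print(records)

    # Such a wrapper breaks as soon as the layout changes or the text becomes
    # less structured, which is what motivates adaptive IE systems.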
Approaches
The following standard approaches are now widely accepted:
Hand-written regular expressions (or nested groups of regular expressions)
Classifiers, either generative (e.g. the naive Bayes classifier) or discriminative (e.g. maximum-entropy models)
Sequence models such as hidden Markov models, maximum-entropy Markov models and conditional random fields (CRF), which have been applied to tasks ranging from extracting information from research papers[18] to extracting navigation instructions[19]
Numerous other approaches exist for IE including hybrid approaches that combine some of the standard
approaches previously listed.
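As an illustration of the simplest of these approaches, the sketch below uses a single hand-written regular expression to extract "PERSON works for ORGANIZATION" pairs; real systems would combine many such rules or replace them with statistical models:

    import re

    # One hand-written rule for a single relation type. The pattern is
    # deliberately simplistic and would miss many real-world phrasings.
    WORKS_FOR = re.compile(r"([A-Z][a-z]+) works for ([A-Z][A-Za-z]+)")

    text = "Bill works for IBM. Alice works for Acme."
    for person, organization in WORKS_FOR.findall(text):
        print((person, "works_for", organization))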
See also
Extraction
Data extraction
Keyword extraction
Knowledge extraction
Ontology extraction
Open information extraction
Table extraction
Terminology extraction
Mining, crawling, scraping, and recognition
Enterprise search
Faceted search
Semantic translation
References
1. Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal;
Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023).
"Precision information extraction for rare disease epidemiology at scale" (https://www.ncbi.nl
m.nih.gov/pmc/articles/PMC9972634). Journal of Translational Medicine. 21 (1): 157.
doi:10.1186/s12967-023-04011-y (https://doi.org/10.1186%2Fs12967-023-04011-y).
PMC 9972634 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9972634). PMID 36855134
(https://pubmed.ncbi.nlm.nih.gov/36855134).
2. Freitag, Dayne. "Machine Learning for Information Extraction in Informal Domains" (htt
p://www.cs.bilkent.edu.tr/~guvenir/courses/CS550/Seminar/freitag2000-ml.pdf) (PDF). 2000
Kluwer Academic Publishers. Printed in the Netherlands.
3. Cowie, Jim; Wilks, Yorick (1996). Information Extraction (https://web.archive.org/web/201902
20184608/http://pdfs.semanticscholar.org/2c90/fa59c6d9beed8dcb0e844725b872d3f33a35.
pdf) (PDF). p. 3. CiteSeerX 10.1.1.61.6480 (https://citeseerx.ist.psu.edu/viewdoc/summary?d
oi=10.1.1.61.6480). S2CID 10237124 (https://api.semanticscholar.org/CorpusID:10237124).
Archived from the original (http://pdfs.semanticscholar.org/2c90/fa59c6d9beed8dcb0e84472
5b872d3f33a35.pdf) (PDF) on 2019-02-20.
4. Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg,
Irene B.; Weinstein, Steven P. (1992). "Automatic Extraction of Facts from Press Releases to
Generate News Stories" (https://www.aclweb.org/anthology/A92-1024). Proceedings of the
Third Conference on Applied Natural Language Processing. pp. 170–177.
CiteSeerX 10.1.1.14.7943 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.79
43). doi:10.3115/974499.974531 (https://doi.org/10.3115%2F974499.974531).
S2CID 14746386 (https://api.semanticscholar.org/CorpusID:14746386).
5. Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008.
ISBN 978-1-84564-146-7
6. "Linked Data - The Story So Far" (http://tomheath.com/papers/bizer-heath-berners-lee-ijswis
-linked-data.pdf) (PDF).
7. "Tim Berners-Lee on the next Web" (https://web.archive.org/web/20110410204952/http://ww
w.ted.com/talks/tim_berners_lee_on_the_next_web.html). Archived from the original (http://w
ww.ted.com/talks/tim_berners_lee_on_the_next_web.html) on 2011-04-10. Retrieved
2010-03-27.
8. R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level
Information Extraction Engine", Journal of Natural Language Engineering (https://web.archiv
e.org/web/20080507153920/http://journals.cambridge.org/action/displayIssue?iid=359643),
Cambridge U. Press, 14(1), 2008, pp.33-69.
9. Dat Quoc Nguyen and Karin Verspoor (2019). "End-to-end neural relation extraction using
deep biaffine attention". Proceedings of the 41st European Conference on Information
Retrieval (ECIR). arXiv:1812.11275 (https://arxiv.org/abs/1812.11275). doi:10.1007/978-3-
030-15712-8_47 (https://doi.org/10.1007%2F978-3-030-15712-8_47).
10. Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for
information extraction from tables in biomedical literature". International Journal on
Document Analysis and Recognition. 22 (1): 55–78. arXiv:1902.10031 (https://arxiv.org/abs/
1902.10031). Bibcode:2019arXiv190210031M (https://ui.adsabs.harvard.edu/abs/2019arXiv
190210031M). doi:10.1007/s10032-019-00317-0 (https://doi.org/10.1007%2Fs10032-019-00
317-0). S2CID 62880746 (https://api.semanticscholar.org/CorpusID:62880746).
11. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in
biomedical documents (https://www.research.manchester.ac.uk/portal/files/70405100/FULL_
TEXT.PDF) (PDF) (PhD). University of Manchester.
12. Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for
information extraction from tables in biomedical literature". International Journal on
Document Analysis and Recognition. 22 (1): 55–78. arXiv:1902.10031 (https://arxiv.org/abs/
1902.10031). Bibcode:2019arXiv190210031M (https://ui.adsabs.harvard.edu/abs/2019arXiv
190210031M). doi:10.1007/s10032-019-00317-0 (https://doi.org/10.1007%2Fs10032-019-00
317-0). S2CID 62880746 (https://api.semanticscholar.org/CorpusID:62880746).
13. Milosevic N, Gregson C, Hernandez R, Nenadic G (June 2016). "Disentangling the structure
of tables in scientific literature" (https://www.research.manchester.ac.uk/portal/en/publication
s/disentangling-the-structure-of-tables-in-scientific-literature(473111c2-52e9-493a-be8c-1a7
8c5b7ce36).html). 21st International Conference on Applications of Natural Language to
Information Systems. Lecture Notes in Computer Science. 21: 162–174. doi:10.1007/978-3-
319-41754-7_14 (https://doi.org/10.1007%2F978-3-319-41754-7_14). ISBN 978-3-319-
41753-0. S2CID 19538141 (https://api.semanticscholar.org/CorpusID:19538141).
14. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in
biomedical documents (https://www.research.manchester.ac.uk/portal/files/70405100/FULL_
TEXT.PDF) (PDF) (PhD). University of Manchester.
15. A.Zils, F.Pachet, O.Delerue and F. Gouyon, Automatic Extraction of Drum Tracks from
Polyphonic Music Signals (http://www.csl.sony.fr/downloads/papers/2002/ZilsMusic.pdf)
Archived (https://web.archive.org/web/20170829163036/http://www.csl.sony.fr/downloads/pa
pers/2002/ZilsMusic.pdf) 2017-08-29 at the Wayback Machine, Proceedings of WedelMusic,
Darmstadt, Germany, 2002.
16. Chenthamarakshan, Vijil; Desphande, Prasad M; Krishnapuram, Raghu; Varadarajan,
Ramakrishnan; Stolze, Knut (2015). "WYSIWYE: An Algebra for Expressing Spatial and
Textual Rules for Information Extraction". arXiv:1506.08454 (https://arxiv.org/abs/1506.0845
4) [cs.CL (https://arxiv.org/archive/cs.CL)].
17. Baumgartner, Robert; Flesca, Sergio; Gottlob, Georg (2001). "Visual Web Information
Extraction with Lixto". pp. 119–128. CiteSeerX 10.1.1.21.8236 (https://citeseerx.ist.psu.edu/v
iewdoc/summary?doi=10.1.1.21.8236).
18. Peng, F.; McCallum, A. (2006). "Information extraction from research papers using
conditional random fields☆". Information Processing & Management. 42 (4): 963.
doi:10.1016/j.ipm.2005.09.002 (https://doi.org/10.1016%2Fj.ipm.2005.09.002).
19. Shimizu, Nobuyuki; Haas, Andrew (2006). "Extracting Frame-based Knowledge
Representation from Route Instructions" (https://web.archive.org/web/20060901085639/htt
p://www.cs.albany.edu/~shimizu/shimizu+haas2006frame.pdf) (PDF). Archived from the
original (http://www.cs.albany.edu/~shimizu/shimizu+haas2006frame.pdf) (PDF) on 2006-
09-01. Retrieved 2010-03-27.
External links
Alias-I "competition" page (http://alias-i.com/lingpipe/web/competition.html) A listing of
academic toolkits and industrial toolkits for natural language information extraction.
Gabor Melli's page on IE (http://www.gabormelli.com/RKB/Information_Extraction_Task)
Detailed description of the information extraction task.