Information Extraction

Information extraction (IE) is the task of automatically extracting structured information from
unstructured and/or semi-structured machine-readable documents and other electronically represented
sources. In most cases this activity concerns processing human language texts by means of natural
language processing (NLP).[1] Recent activities in multimedia document processing, such as automatic
annotation and content extraction from images, audio, video and documents, can also be seen as information
extraction.

Due to the difficulty of the problem, current approaches to IE (as of 2010) focus on narrowly restricted
domains. An example is the extraction from newswire reports of corporate mergers, such as denoted by the
formal relation:

from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific
goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically
well-defined data from a chosen target domain, interpreted with respect to category and context.
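
As a minimal sketch of how the merger sentence above could be turned into a structured record, the following uses a single hand-written regular expression. The record name `MergerBetween`, its field names and the pattern itself are illustrative assumptions rather than part of any particular IE system, and the pattern only covers this exact phrasing.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """Hypothetical relation record; the field names are assumptions."""
    company1: str
    company2: str
    date: str  # kept as the raw expression found in the text, e.g. "Yesterday"


# A company name is modelled (very roughly) as capitalized words ending in Inc. or Corp.
COMPANY = r"(?:[A-Z][\w&]*\s)+?(?:Inc|Corp)\."
PATTERN = re.compile(
    rf"^(?P<date>\w+),.*?(?P<c1>{COMPANY}) announced (?:their|its) "
    rf"acquisition of (?P<c2>{COMPANY})"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return a MergerBetween record, or None if the toy pattern does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return MergerBetween(match.group("c1"), match.group("c2"), match.group("date"))


sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
print(extract_merger(sentence))
```

Once the sentence is in this form, ordinary computation becomes possible: the record can be compared, stored or joined with other records, which the raw sentence does not allow.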

Information extraction is part of a larger puzzle: the problem of devising automatic
methods for text management, beyond its transmission, storage and display. The discipline of information
retrieval (IR)[2] has developed automatic methods, typically of a statistical flavor, for indexing large
document collections and classifying documents. Another complementary approach is that of natural
language processing (NLP) which has solved the problem of modelling human language processing with
considerable success when taking into account the magnitude of the task. In terms of both difficulty and
emphasis, IE deals with tasks that lie between IR and NLP. In terms of input, IE assumes the existence of a
set of documents in which each document follows a template, i.e. describes one or more entities or events in
a manner that is similar to those in other documents but differing in the details. As an example, consider a
group of newswire articles on Latin American terrorism, with each article presumed to be based upon one or
more terrorist acts. We also define for any given IE task a template, which is a case frame (or a set of case
frames) to hold the information contained in a single document. For the terrorism example, a template would have
slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the
event happened. An IE system for this problem is required to “understand” an attack article only enough to
find data corresponding to the slots in this template.
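
The template described above can be sketched as a small record type with one field per slot. The class name `AttackTemplate`, the helper `unfilled_slots` and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttackTemplate:
    """One case frame per document; every slot starts out empty."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None

    def unfilled_slots(self) -> List[str]:
        """Names of the slots the extractor has not filled yet."""
        return [name for name, value in vars(self).items() if value is None]


# Invented values standing in for what an extractor might find in one article.
template = AttackTemplate(perpetrator="unidentified group", weapon="car bomb")
print(template.unfilled_slots())
```

An extractor only has to find values for these four fields; everything else in the article can be ignored.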

History
Information extraction dates back to the late 1970s in the early days of NLP.[3] An early commercial system
from the mid-1980s was JASPER built for Reuters by the Carnegie Group Inc with the aim of providing
real-time financial news to financial traders.[4]

Beginning in 1987, IE was spurred by a series of Message Understanding Conferences. MUC is a
competition-based conference[5] that focused on the following domains:
MUC-1 (1987), MUC-2 (1989): Naval operations messages.
MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
MUC-5 (1993): Joint ventures and microelectronics domain.
MUC-6 (1995): News articles on management changes.
MUC-7 (1998): Satellite launch reports.

Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), which
wished to automate mundane tasks performed by government analysts, such as scanning newspapers for
possible links to terrorism.

Present significance
The present significance of IE pertains to the growing amount of information available in unstructured
form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of
documents[6] and advocates that more of the content be made available as a web of data.[7] Until this
transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge
contained within these documents can be made more accessible for machine processing by means of
transformation into relational form, or by marking-up with XML tags. An intelligent agent monitoring a
news data feed requires IE to transform unstructured data into something that can be reasoned with. A
typical application of IE is to scan a set of documents written in a natural language and populate a database
with the information extracted.[8]
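
The database-population step can be sketched with Python's standard `sqlite3` module. The table name `fact`, its three columns and the relation labels are assumptions made for this example; the extracted facts are passed in as ready-made triples.

```python
import sqlite3
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def populate(facts: Iterable[Triple]) -> sqlite3.Connection:
    """Create an in-memory database and store each extracted fact as one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact (subject TEXT, relation TEXT, object TEXT)")
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?)", facts)
    conn.commit()
    return conn


# Triples as an upstream extractor might emit them.
conn = populate([("Bill", "works_for", "IBM"), ("Bill", "located_in", "France")])
rows = conn.execute(
    "SELECT object FROM fact WHERE subject = ? AND relation = ?",
    ("Bill", "works_for"),
).fetchall()
print(rows)
```

After this step the information can be queried like any other relational data.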

Tasks and subtasks


Applying information extraction to text is linked to the problem of text simplification: both aim to create a
structured view of the information present in free text. The overall goal is to produce text that is more easily
machine-readable, so that its sentences can be processed. Typical IE tasks and subtasks include:

Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators,
victims, time, etc. from a newspaper article about a terrorist attack.
Event extraction: Given an input document, output zero or more event templates. For
instance, a newspaper article might describe multiple terrorist attacks.
Knowledge Base Population: Fill a database of facts given a set of documents. Typically the
database is in the form of triplets, (entity 1, relation, entity 2), e.g. (Barack Obama, Spouse,
Michelle Obama).
    Named entity recognition: recognition of known entity names (for people and
    organizations), place names, temporal expressions, and certain types of numerical
    expressions, by employing existing knowledge of the domain or information extracted
    from other sentences.[9] Typically the recognition task involves assigning a unique
    identifier to the extracted entity. A simpler task is named entity detection, which aims at
    detecting entities without having any existing knowledge about the entity instances. For
    example, in processing the sentence "M. Smith likes fishing", named entity detection
    would mean detecting that the phrase "M. Smith" refers to a person, but without
    necessarily having (or using) any knowledge about a certain M. Smith who is (or might
    be) the specific person whom that sentence is talking about.
    Coreference resolution: detection of coreference and anaphoric links between text
    entities. In IE tasks, this is typically restricted to finding links between previously
    extracted named entities. For example, "International Business Machines" and "IBM"
    refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing.
    But he doesn't like biking", it would be beneficial to detect that "he" is referring to the
    previously detected person "M. Smith".
    Relationship extraction: identification of relations between entities,[9] such as:
        PERSON works for ORGANIZATION (extracted from the sentence "Bill works for
        IBM.")
        PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
Semi-structured information extraction, which may refer to any IE that tries to restore some
kind of information structure that has been lost through publication, such as:
    Table extraction: finding and extracting tables from documents.[10][11]
    Table information extraction: extracting information in a structured manner from the tables.
    This task is more complex than table extraction, as table extraction is only the first step,
    while understanding the roles of the cells, rows, columns, linking the information inside
    the table and understanding the information presented in the table are additional tasks
    necessary for table information extraction.[12][13][14]
    Comments extraction: extracting comments from the actual content of articles in order to
    restore the link between authors of each of the sentences.
Language and vocabulary analysis
    Terminology extraction: finding the relevant terms for a given corpus.
Audio extraction
    Template-based music extraction: finding relevant characteristics in an audio signal taken
    from a given repertoire; for instance,[15] time indexes of occurrences of percussive
    sounds can be extracted in order to represent the essential rhythmic component of a
    music piece.

Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly agreed upon, and
that many approaches combine multiple sub-tasks of IE in order to achieve a wider goal. Machine learning,
statistical analysis and/or natural language processing are often used in IE.
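
The two relationship-extraction examples in the list above ("Bill works for IBM." and "Bill is in France.") can be reproduced with a pair of hand-written patterns. This is only a sketch: the relation labels `works_for` and `located_in`, the `NAME` and `ENTITY` patterns, and the function name are assumptions, and the patterns do not check whether the right-hand side really is an organization or a location.

```python
import re
from typing import List, Tuple

NAME = r"(?:[A-Z]\.\s)?[A-Z][a-z]+"  # e.g. "Bill" or "M. Smith"
ENTITY = r"[A-Z][A-Za-z]+"           # e.g. "IBM" or "France"

# One surface pattern per relation; \b stops a match from starting inside a word.
RULES = [
    ("works_for", re.compile(rf"\b(?P<a>{NAME}) works for (?P<b>{ENTITY})")),
    ("located_in", re.compile(rf"\b(?P<a>{NAME}) is in (?P<b>{ENTITY})")),
]


def extract_relations(text: str) -> List[Tuple[str, str, str]]:
    """Return (entity 1, relation, entity 2) triples found by the toy rules."""
    triples = []
    for relation, pattern in RULES:
        for match in pattern.finditer(text):
            triples.append((match.group("a"), relation, match.group("b")))
    return triples


print(extract_relations("Bill works for IBM. Bill is in France."))
```

The output has the same triple shape as the knowledge base population example, so the result of this step could feed directly into a fact database.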

IE on non-text documents is becoming an increasingly interesting topic in research, and information
extracted from multimedia documents can now be expressed in a high-level structure, as is done for text.
This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

World Wide Web applications


IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need
for developing IE systems that help people to cope with the enormous amount of data that are available
online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in
development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover,
linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and the layout
formats that are available in online texts. As a result, less linguistically intensive approaches have been
developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular
page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high
level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to
induce such rules automatically.
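
A hand-written wrapper of the kind described above can be sketched with Python's standard `html.parser` module. The page layout assumed here (each product in an `li` element with class `product`, holding `span` elements with classes `name` and `price`) is hypothetical, as are the class name `CatalogWrapper` and the sample page.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class CatalogWrapper(HTMLParser):
    """Extraction rules tied to one assumed page layout."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Dict[str, str]] = []
        self._field: Optional[str] = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.records.append({})        # a new product record begins
        elif tag == "span" and css_class in ("name", "price") and self.records:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            record = self.records[-1]
            record[self._field] = record.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            record = self.records[-1]
            record[self._field] = record.get(self._field, "").strip()
            self._field = None


def run_wrapper(html: str) -> List[Dict[str, str]]:
    wrapper = CatalogWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


PAGE = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""
print(run_wrapper(PAGE))
```

The rules are accurate for this layout but break as soon as the markup changes, which illustrates why writing wrappers by hand is costly and why inducing them automatically is attractive.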

Wrappers typically handle highly structured collections of web pages, such as product catalogs and
telephone directories. They fail, however, when the text type is less structured, which is also common on
the Web. Recent efforts on adaptive information extraction motivate the development of IE systems that
can handle different types of text, from well-structured to almost free text (where common wrappers fail),
including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be
applied to less structured texts.

A recent development is Visual Information Extraction,[16][17] which relies on rendering a webpage in a
browser and creating rules based on the proximity of regions in the rendered web page. This helps in
extracting entities from complex web pages that may exhibit a visual pattern but lack a discernible pattern
in the HTML source code.
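
A proximity rule of this kind can be sketched as follows, assuming a rendering step has already produced a bounding box for each text region. The `Region` type, the coordinates and the "nearest region to the right on the same row" rule are all illustrative assumptions and are not taken from the cited systems.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """A rendered text region: top-left corner plus size, in pixels."""
    text: str
    x: float
    y: float
    width: float
    height: float


def value_right_of(label: str, regions: List[Region], max_gap: float = 50.0) -> Optional[str]:
    """Return the text of the closest region to the right of `label` on the same row."""
    anchor = next((r for r in regions if r.text == label), None)
    if anchor is None:
        return None
    candidates = []
    for region in regions:
        if region is anchor:
            continue
        gap = region.x - (anchor.x + anchor.width)              # horizontal distance
        same_row = abs(region.y - anchor.y) < anchor.height / 2
        if same_row and 0 <= gap <= max_gap:
            candidates.append((gap, region.text))
    return min(candidates)[1] if candidates else None


# Invented layout: two label/value pairs and an unrelated button.
LAYOUT = [
    Region("Price:", 10, 100, 40, 12),
    Region("19.99", 60, 101, 35, 12),
    Region("Add to cart", 200, 100, 80, 12),
    Region("Stock:", 10, 130, 40, 12),
    Region("In stock", 60, 130, 50, 12),
]
print(value_right_of("Price:", LAYOUT))
```

The rule never looks at the HTML structure, only at where things appear on screen, so it keeps working when the markup is irregular but the visual layout is consistent.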

Approaches
The following standard approaches are now widely accepted:

Hand-written regular expressions (or nested groups of regular expressions)
Using classifiers
    Generative: naïve Bayes classifier
    Discriminative: maximum entropy models such as multinomial logistic regression
Sequence models
    Recurrent neural network
    Hidden Markov model
    Conditional Markov model (CMM) / Maximum-entropy Markov model (MEMM)
    Conditional random fields (CRF), which are commonly used in IE for tasks ranging
    from extracting information from research papers[18] to extracting navigation
    instructions.[19]

Numerous other approaches exist for IE including hybrid approaches that combine some of the standard
approaches previously listed.
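
To make the sequence-model family concrete, here is a minimal sketch of tagging with a hidden Markov model and Viterbi decoding. The two states (`O` and `ORG`), the single capitalization feature and all probabilities are hand-set assumptions for illustration; a real system would estimate them from annotated data and use richer features.

```python
import math
from typing import List

STATES = ["O", "ORG"]
START = {"O": 0.8, "ORG": 0.2}                                # assumed start probabilities
TRANS = {"O": {"O": 0.8, "ORG": 0.2}, "ORG": {"O": 0.5, "ORG": 0.5}}
EMIT = {"O": {"cap": 0.1, "lower": 0.9}, "ORG": {"cap": 0.9, "lower": 0.1}}


def feature(token: str) -> str:
    """The only observation used: whether the token starts with a capital letter."""
    return "cap" if token[:1].isupper() else "lower"


def viterbi(tokens: List[str]) -> List[str]:
    """Return the most probable state sequence under the toy model."""
    if not tokens:
        return []
    observations = [feature(t) for t in tokens]
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state for reaching s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi(["shares", "of", "Foo", "Inc", "rose"]))
```

Unlike a per-token classifier, the decoder scores whole label sequences, so the transition probabilities influence where an entity span begins and ends.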

Free or open source software and services


General Architecture for Text Engineering (GATE) is bundled with a free Information
Extraction system.
Apache OpenNLP is a Java machine learning toolkit for natural language processing.
OpenCalais is an automated information extraction web service from Thomson Reuters
(free limited version).
Machine Learning for Language Toolkit (Mallet) is a Java-based package for a variety of
natural language processing tasks, including information extraction.
DBpedia Spotlight is an open source tool in Java/Scala (and free web service) that can be
used for named entity recognition and name resolution.
Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical
natural language processing (NLP) for the Python programming language.
See also CRF implementations

See also
Extraction
    Data extraction
    Keyword extraction
    Knowledge extraction
    Ontology extraction
    Open information extraction
    Table extraction
    Terminology extraction
Mining, crawling, scraping, and recognition
    Apache Nutch, web crawler
    Concept mining
    Named entity recognition
    Text mining
    Web scraping
Search and translation
    Enterprise search
    Faceted search
    Semantic translation
General
    Applications of artificial intelligence
    DARPA TIPSTER Program
Lists
    List of emerging technologies
    Outline of artificial intelligence

References
1. Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal;
Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023).
"Precision information extraction for rare disease epidemiology at scale"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9972634). Journal of Translational Medicine. 21 (1): 157.
doi:10.1186/s12967-023-04011-y (https://doi.org/10.1186%2Fs12967-023-04011-y).
PMC 9972634 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9972634). PMID 36855134
(https://pubmed.ncbi.nlm.nih.gov/36855134).
2. Freitag, Dayne (2000). "Machine Learning for Information Extraction in Informal Domains"
(http://www.cs.bilkent.edu.tr/~guvenir/courses/CS550/Seminar/freitag2000-ml.pdf) (PDF).
Kluwer Academic Publishers.
3. Cowie, Jim; Wilks, Yorick (1996). Information Extraction
(https://web.archive.org/web/20190220184608/http://pdfs.semanticscholar.org/2c90/fa59c6d9beed8dcb0e844725b872d3f33a35.pdf)
(PDF). p. 3. CiteSeerX 10.1.1.61.6480 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.6480).
S2CID 10237124 (https://api.semanticscholar.org/CorpusID:10237124).
Archived from the original (http://pdfs.semanticscholar.org/2c90/fa59c6d9beed8dcb0e844725b872d3f33a35.pdf)
(PDF) on 2019-02-20.
4. Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg,
Irene B.; Weinstein, Steven P. (1992). "Automatic Extraction of Facts from Press Releases to
Generate News Stories" (https://www.aclweb.org/anthology/A92-1024). Proceedings of the
third conference on Applied natural language processing. pp. 170–177.
CiteSeerX 10.1.1.14.7943 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.7943).
doi:10.3115/974499.974531 (https://doi.org/10.3115%2F974499.974531).
S2CID 14746386 (https://api.semanticscholar.org/CorpusID:14746386).
5. Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008.
ISBN 978-1-84564-146-7
6. "Linked Data - The Story So Far"
(http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf) (PDF).
7. "Tim Berners-Lee on the next Web"
(https://web.archive.org/web/20110410204952/http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html).
Archived from the original (http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html)
on 2011-04-10. Retrieved 2010-03-27.
8. R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level
Information Extraction Engine", Journal of Natural Language Engineering
(https://web.archive.org/web/20080507153920/http://journals.cambridge.org/action/displayIssue?iid=359643),
Cambridge U. Press, 14(1), 2008, pp. 33-69.
9. Dat Quoc Nguyen and Karin Verspoor (2019). "End-to-end neural relation extraction using
deep biaffine attention". Proceedings of the 41st European Conference on Information
Retrieval (ECIR). arXiv:1812.11275 (https://arxiv.org/abs/1812.11275).
doi:10.1007/978-3-030-15712-8_47 (https://doi.org/10.1007%2F978-3-030-15712-8_47).
10. Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for
information extraction from tables in biomedical literature". International Journal on
Document Analysis and Recognition. 22 (1): 55–78. arXiv:1902.10031 (https://arxiv.org/abs/1902.10031).
Bibcode:2019arXiv190210031M (https://ui.adsabs.harvard.edu/abs/2019arXiv190210031M).
doi:10.1007/s10032-019-00317-0 (https://doi.org/10.1007%2Fs10032-019-00317-0).
S2CID 62880746 (https://api.semanticscholar.org/CorpusID:62880746).
11. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in
biomedical documents (https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF)
(PDF) (PhD). University of Manchester.
12. Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for
information extraction from tables in biomedical literature". International Journal on
Document Analysis and Recognition. 22 (1): 55–78. arXiv:1902.10031 (https://arxiv.org/abs/1902.10031).
Bibcode:2019arXiv190210031M (https://ui.adsabs.harvard.edu/abs/2019arXiv190210031M).
doi:10.1007/s10032-019-00317-0 (https://doi.org/10.1007%2Fs10032-019-00317-0).
S2CID 62880746 (https://api.semanticscholar.org/CorpusID:62880746).
13. Milosevic N, Gregson C, Hernandez R, Nenadic G (June 2016). "Disentangling the structure
of tables in scientific literature"
(https://www.research.manchester.ac.uk/portal/en/publications/disentangling-the-structure-of-tables-in-scientific-literature(473111c2-52e9-493a-be8c-1a78c5b7ce36).html).
21st International Conference on Applications of Natural Language to
Information Systems. Lecture Notes in Computer Science. 21: 162–174.
doi:10.1007/978-3-319-41754-7_14 (https://doi.org/10.1007%2F978-3-319-41754-7_14).
ISBN 978-3-319-41753-0. S2CID 19538141 (https://api.semanticscholar.org/CorpusID:19538141).
14. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in
biomedical documents (https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF)
(PDF) (PhD). University of Manchester.
15. A. Zils, F. Pachet, O. Delerue and F. Gouyon, Automatic Extraction of Drum Tracks from
Polyphonic Music Signals (http://www.csl.sony.fr/downloads/papers/2002/ZilsMusic.pdf)
Archived (https://web.archive.org/web/20170829163036/http://www.csl.sony.fr/downloads/papers/2002/ZilsMusic.pdf)
2017-08-29 at the Wayback Machine, Proceedings of WedelMusic,
Darmstadt, Germany, 2002.
16. Chenthamarakshan, Vijil; Desphande, Prasad M; Krishnapuram, Raghu; Varadarajan,
Ramakrishnan; Stolze, Knut (2015). "WYSIWYE: An Algebra for Expressing Spatial and
Textual Rules for Information Extraction". arXiv:1506.08454 (https://arxiv.org/abs/1506.08454)
[cs.CL (https://arxiv.org/archive/cs.CL)].
17. Baumgartner, Robert; Flesca, Sergio; Gottlob, Georg (2001). "Visual Web Information
Extraction with Lixto". pp. 119–128. CiteSeerX 10.1.1.21.8236
(https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.8236).
18. Peng, F.; McCallum, A. (2006). "Information extraction from research papers using
conditional random fields". Information Processing & Management. 42 (4): 963.
doi:10.1016/j.ipm.2005.09.002 (https://doi.org/10.1016%2Fj.ipm.2005.09.002).
19. Shimizu, Nobuyuki; Haas, Andrew (2006). "Extracting Frame-based Knowledge
Representation from Route Instructions"
(https://web.archive.org/web/20060901085639/http://www.cs.albany.edu/~shimizu/shimizu+haas2006frame.pdf)
(PDF). Archived from the original (http://www.cs.albany.edu/~shimizu/shimizu+haas2006frame.pdf)
(PDF) on 2006-09-01. Retrieved 2010-03-27.

External links
Alias-I "competition" page (http://alias-i.com/lingpipe/web/competition.html) A listing of
academic toolkits and industrial toolkits for natural language information extraction.
Gabor Melli's page on IE (http://www.gabormelli.com/RKB/Information_Extraction_Task)
Detailed description of the information extraction task.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=1163481791"
