Text Preprocessing: Information Retrieval
PREPROCESSING
Information Retrieval
Role
Preprocessing is an important task and a critical first step in text
mining, Natural Language Processing (NLP) and
information retrieval (IR).
In text mining, data preprocessing is used to
extract interesting and non-trivial knowledge from
unstructured text data.
Information Retrieval (IR) is essentially a matter of
deciding which documents in a collection should be
retrieved to satisfy a user's information need, based
upon a query.
Techniques
Tokenization
Stop Words (Common Words)
Removal
Normalization
Stemming
Lemmatization
Acronym Expansion
Tokenization
Break the input into words.
Converts a character stream -> a token stream.
The component is called a tokenizer / lexer / scanner.
The aim of tokenization is to identify the words in a
sentence.
The tokenizer then hooks up to a parser.
A parser defines rules of grammar for these tokens
and determines whether the statements are
correct.
The tokenizer then feeds the tokens to the retrieval system
for further processing.
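The character-stream-to-token-stream step above can be sketched with a minimal regex tokenizer (a simplification for illustration; real tokenizers handle many more cases):

```python
import re

def tokenize(text):
    """Convert a character stream into a token stream.

    Minimal sketch: runs of word characters become tokens, each
    punctuation mark becomes its own token, whitespace is discarded.
    """
    return re.findall(r"\w+|[^\w\s]", text)

# The resulting token stream can then be fed to a parser or to the
# retrieval system for further processing.
print(tokenize("The quick brown fox, jumping over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', ',', 'jumping', 'over', 'the', 'lazy', 'dog', '.']
```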
Identifying Tokens
Divide on whitespace and throw away the
punctuation?
There are many documents that we see in day to
day life that are structured using markup tags.
HTML, XML, ePub etc
What is the best way to deal with these tags?
Use them as delimiters / tokens.
Filter them out entirely.
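The "use tags as delimiters" option can be sketched with a regex that replaces each tag with a space before tokenizing (a sketch only; production systems would use a real HTML/XML parser):

```python
import re

def strip_markup(text):
    """Treat markup tags as delimiters: replace each tag with a space
    so adjacent words are not glued together, then tokenize."""
    no_tags = re.sub(r"<[^>]+>", " ", text)       # filter the tags out
    return re.findall(r"\w+|[^\w\s]", no_tags)    # words + punctuation

print(strip_markup("<p>Hello <b>world</b>!</p>"))
# ['Hello', 'world', '!']
```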
Challenges in Tokenization
Challenges in tokenization depend on the type of
language.
Languages such as English and French are referred to as
space delimited.
Languages such as Chinese and Thai are referred to as
unsegmented as words do not have clear boundaries.
Tokenizing unsegmented language sentences requires
additional lexical and morphological information.
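The space-delimited vs. unsegmented distinction above is easy to demonstrate: whitespace splitting recovers the words of an English phrase but leaves a Chinese phrase as one undivided token (the Chinese example phrase is my own illustration):

```python
# Space-delimited language: whitespace splitting recovers the words.
english = "natural language processing"
print(english.split())        # three word tokens

# Unsegmented language: no whitespace between words, so the same
# approach yields a single token covering the whole phrase. Proper
# tokenization needs additional lexical/morphological information
# (i.e. a word segmenter).
chinese = "自然语言处理"        # "natural language processing" in Chinese
print(chinese.split())        # one undivided token
```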
Advanced Tokenization
Convert the character stream to tokens using any
programming language.
Libraries in the programming language use
specific grammars to tokenize a text.
They can easily identify comments and literals.
Java has tokenization libraries
java.util.Scanner
java.lang.String.split()
java.util.StringTokenizer
Python provides the nltk.tokenize package, which
tokenizes the input text.
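Python's standard-library `tokenize` module illustrates grammar-driven tokenization, including the recognition of comments and literals mentioned above:

```python
import io
import tokenize

source = 'count = 1  # a counter\nname = "IR"\n'

# generate_tokens consumes a line-producing callable and yields
# tokens classified according to the Python grammar.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
kinds = {tokenize.tok_name[tok.type] for tok in tokens}

# The grammar distinguishes names, number/string literals, operators,
# and comments.
print(kinds)
```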
Tweet Tokenizer
This tokenizer tokenizes tweets according to their writing context,
e.g. it keeps a link as a single complete token.
['RT', ':', 'Shocking', 'picture', 'of', 'the', 'earthquake', 'in', 'Nepal', ',',
'http://t.co/2N09Jz96lq']
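NLTK provides this as nltk.tokenize.TweetTokenizer; its behaviour on links can be approximated with a small regex that matches URLs before falling back to words and punctuation (a sketch, not NLTK's actual implementation):

```python
import re

def tweet_tokenize(text):
    """Keep URLs as single complete tokens; otherwise split into
    words and individual punctuation marks."""
    return re.findall(r"https?://\S+|\w+|[^\w\s]", text)

tweet = "RT : Shocking picture of the earthquake in Nepal, http://t.co/2N09Jz96lq"
print(tweet_tokenize(tweet))
# ['RT', ':', 'Shocking', 'picture', 'of', 'the', 'earthquake', 'in',
#  'Nepal', ',', 'http://t.co/2N09Jz96lq']
```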
Dropping Common Terms (Stop Words)
Stop words are the insignificant or most common words in a
language.
In English: articles, prepositions, etc.
Removing them reduces the vast amount of unnecessary information
retrieved for a search query.
Some tools specifically avoid removing these stop words to
support phrase search.
The general trend in IR systems over time has been from standard
use of quite large stop lists (200-300 terms) to very small stop lists
(7-12 terms) to no stop list whatsoever.
Some stop words: Accordingly, Across, Actually, After, Afterwards
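Stop-word removal is a simple set-membership filter. A minimal sketch with a deliberately tiny hand-picked stop list (real systems use lists from none up to a few hundred terms, as noted above):

```python
# Hand-picked stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "picture", "of", "the", "earthquake", "in", "Nepal"]
print(remove_stop_words(tokens))
# ['picture', 'earthquake', 'Nepal']
```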
Stemming and Lemmatization
Using nltk.stem.WordNetLemmatizer
Saw -> s by stemming
Saw -> see or saw by lemmatization
Depends on whether the use of the word was as a verb or a noun.
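The contrast above can be sketched with a toy suffix-stripping stemmer and a toy dictionary lemmatizer (illustrative only: the stemmer is NOT the Porter algorithm, and the lemma entries are a stand-in for what NLTK's WordNetLemmatizer looks up in the WordNet database):

```python
# Toy suffix-stripping stemmer: blindly chops common suffixes, so
# crude stemmers of this kind can mangle words (as with 'saw' -> 's'
# in the example above).
def toy_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a dictionary lookup keyed on the word and its
# part of speech ('v' = verb, 'n' = noun); entries are hypothetical.
LEMMAS = {("saw", "v"): "see", ("saw", "n"): "saw"}

def toy_lemmatize(word, pos):
    return LEMMAS.get((word, pos), word)

print(toy_stem("walking"))        # 'walk'
print(toy_lemmatize("saw", "v"))  # 'see' -- past tense of the verb
print(toy_lemmatize("saw", "n"))  # 'saw' -- the tool
```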
Acronym Expansion
A word or name formed as an abbreviation from the initial
components of a phrase or a word:
individual letters (as in NATO or laser) or syllables (as in Benelux).
Words can have multiple abbreviations depending on context:
Thursday could be abbreviated as any of Thurs, Thur, Thr, Th, or T.
The abbreviate library (Python) contains a dictionary of known
abbreviations.
Each word has a preferred abbreviation (Thr for Thursday).
The basic `abbreviate` method applies only preferred
abbreviations and no heuristics.
For the best possible match, a target length and an effort level can be given.
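Dictionary-based acronym expansion is the reverse lookup: map known abbreviations back to their full forms. A minimal sketch with hypothetical dictionary entries (the abbreviate library mentioned above works from a similar known-abbreviations dictionary):

```python
# Hypothetical dictionary mapping known abbreviations to expansions.
# The preferred abbreviation (e.g. 'Thr' for Thursday) is just one of
# several context-dependent forms.
EXPANSIONS = {
    "Thr": "Thursday",
    "Thurs": "Thursday",
    "NATO": "North Atlantic Treaty Organization",
    "IR": "information retrieval",
}

def expand(token):
    """Replace a known abbreviation with its expansion; leave other
    tokens unchanged."""
    return EXPANSIONS.get(token, token)

print([expand(t) for t in ["Meet", "on", "Thr"]])
# ['Meet', 'on', 'Thursday']
```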