Lecture 6 - From Unstructured Texts to Structured Data I
Cons:
• Creates many features (tens of thousands)
• Doesn’t take into account other documents
• Doesn’t provide state-of-the-art results
• Removing punctuation can discard characters that
carry meaning
N-Grams
“An n-gram is a contiguous sequence of n items from a
given sample of text or speech. The items can be
phonemes, syllables, letters, words or base pairs
according to the application” (from Wikipedia)
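
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `ngrams` and the sample sentence are assumptions made for the example. The same function handles word n-grams (pass a list of tokens) and character n-grams (pass a string).

```python
def ngrams(items, n):
    """Return every contiguous window of n items as a tuple."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence of length L has L - n + 1 windows (none if L < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word bigrams: split on whitespace first (a deliberately naive tokenizer).
text = "the cat sat on the mat"
word_bigrams = ngrams(text.split(), 2)

# Character trigrams: punctuation is kept as an ordinary item.
char_trigrams = ["".join(g) for g in ngrams("cat!", 3)]

print(word_bigrams[:2])   # [('the', 'cat'), ('cat', 'sat')]
print(char_trigrams)      # ['cat', 'at!']
```

Each distinct n-gram typically becomes one feature, which is why the feature count grows quickly as n increases.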
N-Grams
Pros:
• Easy to understand and simple to use
• Can be applied to both characters and words
• Can utilize useful punctuation and other special
characters
• Usually provides decent results
Cons:
• Creates many features (hundreds of thousands)
• Doesn’t take into account other documents
• Doesn’t provide state-of-the-art results
Term Frequency–Inverse Document Frequency (TF-IDF)
• TF-IDF is used to reflect how important a word is to
a document in a collection
• A word’s TF-IDF value increases proportionally to the
number of times it appears in a document and is
offset by the number of documents in the corpus that
contain it
Term Frequency
A term frequency, tf(t,d), can be the number of times a
term t appears in a document d (other measures are
also possible)
see Wikipedia
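
As a rough sketch of tf(t,d), the snippet below computes the raw count and, as one example of the "other measures", a count normalized by document length. The function names `raw_tf` and `relative_tf` and the toy document are assumptions made for illustration only.

```python
from collections import Counter

def raw_tf(tokens):
    """tf(t, d) as the raw count of each term in one document."""
    return Counter(tokens)

def relative_tf(tokens):
    """A common variant: raw count divided by document length."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}

doc = "the cat sat on the mat".split()
print(raw_tf(doc)["the"])        # 2
print(relative_tf(doc)["cat"])   # 1/6
```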
Inverse Document Frequency
The inverse document frequency, idf(t,D), measures
whether a term t is common or rare across all documents D.
tfidf(t,d,D) = tf(t,d)*idf(t,D)
see Wikipedia
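
The product above can be worked through on a toy corpus. The sketch below assumes the common definition idf(t,D) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The slide does not fix a particular idf formula, so this choice, the function names, and the three-document corpus are all assumptions.

```python
import math
from collections import Counter

def idf(term, docs):
    """idf(t, D) = log(N / df(t)); assumed variant, see lead-in."""
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        return 0.0  # assumption: unseen terms get weight 0
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D), with tf as the raw count."""
    return Counter(doc)[term] * idf(term, docs)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "mat"):
    print(term, round(tfidf(term, corpus[0], corpus), 3))
```

In this corpus "the" appears in every document, so its idf is log(3/3) = 0 and its score vanishes despite the highest raw count. "mat" appears in only one document and scores highest.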
TF-IDF
Pros:
• Easy to understand and simple to use
• Usually provides decent results
• Takes into account other documents
Cons:
• Creates many features (tens of thousands)
• Doesn’t provide state-of-the-art results
Topic Model
A topic model is a statistical model for discovering
the abstract "topics" that occur in a collection of
documents. Topic model algorithms are used to
discover hidden subjects in a large collection of
unstructured texts
Latent Dirichlet Allocation
“A generative statistical model that allows sets of
observations to be explained by unobserved groups
that explain why some parts of the data are similar.”
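
To make "generative" concrete, the sketch below simulates how LDA assumes a document is produced: draw a topic mixture for the document from a Dirichlet distribution, then for each word draw a topic from that mixture and a word from that topic. It is a simplified illustration, not a fitting algorithm. The two hand-written topics, the `alpha` values, and all function names are assumptions, and in full LDA the per-topic word distributions are themselves drawn from a Dirichlet prior rather than fixed by hand.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following LDA's generative story.

    topics: one dict per topic, mapping word -> probability.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[z])
        weights = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

rng = random.Random(0)  # fixed seed so the run is repeatable
toy_topics = [
    {"goal": 0.5, "team": 0.3, "match": 0.2},   # hypothetical "sports" topic
    {"vote": 0.5, "party": 0.3, "law": 0.2},    # hypothetical "politics" topic
]
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers the unobserved topic mixtures and topic-word distributions that most plausibly generated them.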