
Text Vectorization

Hina Arora
TextVectorization.ipynb
• Text Vectorization is the process of converting text into a numerical representation

• Some popular methods to accomplish text vectorization:


o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
o Word2Vec
o etc
Binary Term Frequency
• Captures presence (1) or absence (0) of a term in a document (a code sketch using these parameters follows the list below)
• token_pattern = '(?u)\\b\\w\\w+\\b'
The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is
completely ignored and always treated as a token separator).

• lowercase = True

• stop_words = ‘english’

• max_df (default 1.0):


When building the vocabulary, ignore terms that have a document frequency strictly higher
than the given threshold. If a float, the parameter represents a proportion of documents;
if an integer, it represents an absolute count.

• min_df (default 1):


When building the vocabulary, ignore terms that have a document frequency strictly lower
than the given threshold. If a float, the parameter represents a proportion of documents;
if an integer, it represents an absolute count.

• max_features (default None) :


If not None, the vocabulary is restricted to the top max_features terms, ordered by term
frequency across the corpus.

• ngram_range (default (1,1)):


The lower and upper boundary of the range of n-values for different n-grams to be
extracted. All values of n such that min_n <= n <= max_n will be used.
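These parameter names and defaults match scikit-learn's CountVectorizer, so the sketch below assumes that library; the toy corpus is made up for illustration only.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs can be pets.",
]

# binary=True records presence (1) / absence (0) rather than counts
vectorizer = CountVectorizer(
    binary=True,
    lowercase=True,                   # fold to lowercase before tokenizing
    stop_words="english",             # drop common English stop words
    token_pattern=r"(?u)\b\w\w+\b",   # default: tokens of 2+ alphanumeric characters
    ngram_range=(1, 1),               # unigrams only
)

X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                    # each cell is 0 or 1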
Bag of Words (BoW) Term Frequency
• Captures frequency of term in document
(L1) Normalized Term Frequency
• Captures normalized BoW term frequency in document
• TF typically L1-normalized
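A sketch contrasting raw BoW counts with L1-normalized term frequencies, again assuming scikit-learn (a TfidfVectorizer with use_idf=False and norm='l1' reduces to plain L1-normalized TF); the two-document corpus is illustrative only.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())

# (L1) Normalized TF: each count divided by the document's total count,
# so every row sums to 1 (use_idf=False turns off the IDF weighting)
l1_tf = TfidfVectorizer(use_idf=False, norm="l1")
print(l1_tf.fit_transform(corpus).toarray())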
(L2) Normalized TFIDF
• Captures normalized TFIDF of term in document
• TFIDF typically L2-normalized
• Number of documents in corpus: N

• Number of documents in corpus with term t: Nt

• Term Frequency of term t in document d: TF(t, d)


o Bag of Words (BoW) Term Frequency
o The more frequent a term is, the higher the TF
o With sublinear TF: log(TF) + 1

• Inverse Document Frequency of term t in corpus: IDF(t) = log[N/Nt] + 1


o Measures how common a term is among all documents.
o The more common a term is, the lower its IDF.
o With smoothing: IDF(t) = log[(1+N)/(1+ Nt)] + 1

• TFIDF(t, d) = Term Frequency * Inverse Document Frequency = TF(t, d) * IDF(t)


o If a term appears frequently in a document, it's important - give the term a high score.
o If a term appears in many documents, it's not a unique identifier - give the term a low score.

• The TFIDF score is then often L2-normalized (L1 normalization could also be considered), as in the sketch below
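These options map onto scikit-learn's TfidfVectorizer (smooth_idf for the smoothed IDF above, sublinear_tf for log(TF) + 1, norm='l2' for the final normalization); a minimal sketch with a made-up corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats chase dogs",
]

# smooth_idf=True   -> IDF(t) = log[(1 + N) / (1 + Nt)] + 1
# sublinear_tf=True -> would replace TF with log(TF) + 1
# norm="l2"         -> each document vector is scaled to unit L2 length
tfidf = TfidfVectorizer(smooth_idf=True, sublinear_tf=False, norm="l2")
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())  # vocabulary terms
print(tfidf.idf_)                     # one IDF value per term
print(X.toarray())                    # L2-normalized TF-IDF scores per document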


Word2Vec
• Captures embedded representation of terms

References:
Distributed Representations of Words and Phrases and their Compositionality
Efficient Estimation of Word Representations in Vector Space
Typical text representations provide localized representations of the word:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF

• n-grams try to capture some level of contextual information, but don’t really do a great job.
• Word2Vec provides a distributed or embedded representation of words

• Start with a one-hot encoded (OHE) representation of all words in the corpus

• Train a NN (with 1 hidden layer) on a very large corpus of data. The rows of the
resulting hidden-layer weight-matrix are then used as the word vectors.

• One of two methods is typically used for training the NN:


o Continuous Bag of Words (CBOW): Predict the vector representation of the center/target word based on a window of context words.
o Skip-Gram (SG): Predict the vector representations of the window of context words based on the center/target word.
(A minimal training sketch follows the figure below.)
[Figure: the Word2Vec context window - context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} surrounding the center/target word w_t]

"You shall know a word by the company it keeps." *Quote by J. R. Firth
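A minimal training sketch, assuming the gensim library (an assumption, since the slides don't name an implementation); sg=0 selects CBOW and sg=1 selects Skip-Gram, and the tiny tokenized corpus is made up for illustration only.

from gensim.models import Word2Vec

# Tiny tokenized corpus for illustration; real training needs a very large corpus
sentences = [
    ["you", "shall", "know", "a", "word", "by", "the", "company", "it", "keeps"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0, epochs=50)
sg_model   = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1, epochs=50)

vec = cbow_model.wv["word"]               # the embedded vector for a term
print(vec.shape)                          # (100,)
print(cbow_model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity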
Several factors influence the quality of the word vectors, including:

• Amount and quality of the training data.


If you don’t have enough data, you may be able to use pre-trained vectors created by others (for
instance, Google has shared a model trained on ~100 billion words from their News data; the
model contains 300-dimensional vectors for 3 million words and phrases). If you do end up using
pre-trained vectors, make sure their training data domain is similar to the data you’re working
with (a loading sketch follows this list).

• Size of the embedded vectors


In general, quality increases with higher dimensionality, but the marginal gains diminish beyond
a threshold. The dimensionality of the vectors is typically set between 100 and 1000.

• Training algorithm
Typically, CBOW trains faster and has slightly better accuracy for frequent words. SG works
well with small amounts of training data and does a good job of representing rare words or
phrases.
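If pre-trained vectors are the better fit, the Google News model mentioned above can be fetched through gensim's downloader under the name 'word2vec-google-news-300' (an assumption about tooling; the download is large):

import gensim.downloader as api

# Pre-trained Google News vectors: 300-dimensional, 3 million words and phrases
# (large download; the model name is gensim's identifier for this dataset)
wv = api.load("word2vec-google-news-300")

print(wv["king"].shape)                  # (300,)
print(wv.most_similar("king", topn=5))   # nearest neighbours in the embedding space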
Once we have the embedded vectors for each word, we can use them for NLP, for
instance:

• Compute similarity using cosine similarity between word vectors

• Create higher-order representations (sentence/document) using a weighted average of the word vectors and feed them to a classification task (sketched below)
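A sketch of both uses, assuming gensim for the word vectors and scikit-learn/NumPy for the similarity and averaging; the tiny corpus and the plain (unweighted) average are simplifications, and a TF-IDF-weighted average is a common refinement.

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Tiny model for illustration only; in practice use vectors trained on a large corpus
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
wv = Word2Vec(sentences, vector_size=50, window=2, min_count=1).wv

def document_vector(tokens, wv):
    """Unweighted average of the in-vocabulary word vectors
    (a weighted average, e.g. by TF-IDF, is a common refinement)."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

# Word-level: cosine similarity between two word vectors
print(wv.similarity("cat", "dog"))

# Document-level: average the word vectors, then compare documents
# (or feed these document vectors to a downstream classifier)
d1 = document_vector(["the", "cat", "sat", "on", "the", "mat"], wv)
d2 = document_vector(["the", "dog", "sat", "on", "the", "log"], wv)
print(cosine_similarity([d1], [d2]))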
