
Text Vectorization

Hina Arora
TextVectorization.ipynb
• Text Vectorization is the process of converting text into a numerical representation

• Some popular methods to accomplish text vectorization:


o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF
o Word2Vec
o etc
Binary Term Frequency
• Captures presence (1) or absence (0) of a term in a document (a code sketch using these parameters follows the list below)
• token_pattern = '(?u)\\b\\w\\w+\\b'
The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is
completely ignored and always treated as a token separator).

• lowercase = True

• stop_words = ‘english’

• max_df (default 1.0):


When building the vocabulary, ignore terms that have a document frequency strictly higher
than the given threshold. If a float, the parameter represents a proportion of documents;
if an integer, it represents an absolute count.

• min_df (default 1):


When building the vocabulary, ignore terms that have a document frequency strictly lower
than the given threshold. If a float, the parameter represents a proportion of documents;
if an integer, it represents an absolute count.

• max_features (default None) :


If not None, the vocabulary is restricted to the top max_features terms, ordered by term
frequency across the corpus.

• ngram_range (default (1,1)):


The lower and upper boundary of the range of n-values for different n-grams to be
extracted. All values of n such that min_n <= n <= max_n will be used.
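These parameter names and defaults match scikit-learn's CountVectorizer, so the sketch below assumes that library; the toy corpus is made up for illustration only.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs can be pets.",
]

# binary=True records presence (1) / absence (0) rather than counts
vectorizer = CountVectorizer(
    binary=True,
    lowercase=True,                   # fold to lowercase before tokenizing
    stop_words="english",             # drop common English stop words
    token_pattern=r"(?u)\b\w\w+\b",   # default: tokens of 2+ alphanumeric characters
    ngram_range=(1, 1),               # unigrams only
)

X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())                    # each cell is 0 or 1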
Bag of Words (BoW) Term Frequency
• Captures frequency of term in document
(L1) Normalized Term Frequency
• Captures normalized BoW term frequency in document
• TF typically L1-normalized
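A sketch contrasting raw BoW counts with L1-normalized term frequencies, again assuming scikit-learn (a TfidfVectorizer with use_idf=False and norm='l1' reduces to plain L1-normalized TF); the two-document corpus is illustrative only.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())

# (L1) Normalized TF: each count divided by the document's total count,
# so every row sums to 1 (use_idf=False turns off the IDF weighting)
l1_tf = TfidfVectorizer(use_idf=False, norm="l1")
print(l1_tf.fit_transform(corpus).toarray())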
(L2) Normalized TFIDF
• Captures normalized TFIDF of term in document
• TFIDF typically L2-normalized
• Number of documents in corpus: N

• Number of documents in corpus with term t: Nt

• Term Frequency of term t in document d: TF(t, d)


o Bag of Words (BoW) Term Frequency
o The more frequent a term is, the higher the TF
o With sublinear TF: log(TF) + 1

• Inverse Document Frequency of term t in corpus: IDF(t) = log[N/Nt] + 1


o Measures how common a term is among all documents.
o The more common a term is, the lower its IDF.
o With smoothing: IDF(t) = log[(1+N)/(1+ Nt)] + 1

• TFIDF(t, d) = Term Frequency * Inverse Document Frequency = TF(t, d) * IDF(t)


o If a term appears frequently in a document, it's important - give the term a high score.
o If a term appears in many documents, it's not a unique identifier - give the term a low score.

• The TFIDF score is then often L2-normalized (L1 normalization could also be considered), as in the sketch below
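These options map onto scikit-learn's TfidfVectorizer (smooth_idf for the smoothed IDF above, sublinear_tf for log(TF) + 1, norm='l2' for the final normalization); a minimal sketch with a made-up corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats chase dogs",
]

# smooth_idf=True   -> IDF(t) = log[(1 + N) / (1 + Nt)] + 1
# sublinear_tf=True -> would replace TF with log(TF) + 1
# norm="l2"         -> each document vector is scaled to unit L2 length
tfidf = TfidfVectorizer(smooth_idf=True, sublinear_tf=False, norm="l2")
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())  # vocabulary terms
print(tfidf.idf_)                     # one IDF value per term
print(X.toarray())                    # L2-normalized TF-IDF scores per document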


Word2Vec
• Captures embedded representation of terms

References:
Distributed Representations of Words and Phrases and their Compositionality
Efficient Estimation of Word Representations in Vector Space
Typical text representations provide localized representations of the word:
o Binary Term Frequency
o Bag of Words (BoW) Term Frequency
o (L1) Normalized Term Frequency
o (L2) Normalized TFIDF

• n-grams try to capture some level of contextual information, but don’t really do a great job.
• Word2Vec provides a distributed or embedded representation of words

• Start with a one-hot encoded (OHE) representation of all words in the corpus

• Train a NN (with 1 hidden layer) on a very large corpus of data. The rows of the
resulting hidden-layer weight-matrix are then used as the word vectors.

• One of two methods is typically used for training the NN:


o Continuous Bag of Words (CBOW): Predict the vector representation of the center/target word based on a window of context words.
o Skip-Gram (SG): Predict the vector representations of the window of context words based on the center/target word.
(A minimal training sketch follows the figure below.)
[Figure: the Word2Vec context window - context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} surrounding the center/target word w_t]

"You shall know a word by the company it keeps." *Quote by J. R. Firth
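A minimal training sketch, assuming the gensim library (an assumption, since the slides don't name an implementation); sg=0 selects CBOW and sg=1 selects Skip-Gram, and the tiny tokenized corpus is made up for illustration only.

from gensim.models import Word2Vec

# Tiny tokenized corpus for illustration; real training needs a very large corpus
sentences = [
    ["you", "shall", "know", "a", "word", "by", "the", "company", "it", "keeps"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0, epochs=50)
sg_model   = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1, epochs=50)

vec = cbow_model.wv["word"]               # the embedded vector for a term
print(vec.shape)                          # (100,)
print(cbow_model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity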
Several factors influence the quality of the word vectors, including:

• Amount and quality of the training data.


If you don’t have enough data, you may be able to use pre-trained vectors created by others (for
instance, Google has shared a model trained on ~100 billion words from their News data; the
model contains 300-dimensional vectors for 3 million words and phrases). If you do end up using
pre-trained vectors, make sure their training data domain is similar to the data you’re working
with (a loading sketch follows this list).

• Size of the embedded vectors


In general, quality increases with higher dimensionality, but the marginal gains diminish beyond
a threshold. The dimensionality of the vectors is typically set between 100 and 1000.

• Training algorithm
Typically, CBOW trains faster and has slightly better accuracy for frequent words. SG works
well with small amounts of training data and does a good job of representing rare words or
phrases.
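If pre-trained vectors are the better fit, the Google News model mentioned above can be fetched through gensim's downloader under the name 'word2vec-google-news-300' (an assumption about tooling; the download is large):

import gensim.downloader as api

# Pre-trained Google News vectors: 300-dimensional, 3 million words and phrases
# (large download; the model name is gensim's identifier for this dataset)
wv = api.load("word2vec-google-news-300")

print(wv["king"].shape)                  # (300,)
print(wv.most_similar("king", topn=5))   # nearest neighbours in the embedding space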
Once we have the embedded vectors for each word, we can use them for NLP, for
instance:

• Compute similarity using cosine similarity between word vectors

• Create higher-order representations (sentence/document) using a weighted average of the word vectors and feed them to a classification task (sketched below)
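A sketch of both uses, assuming gensim for the word vectors and scikit-learn/NumPy for the similarity and averaging; the tiny corpus and the plain (unweighted) average are simplifications, and a TF-IDF-weighted average is a common refinement.

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Tiny model for illustration only; in practice use vectors trained on a large corpus
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
wv = Word2Vec(sentences, vector_size=50, window=2, min_count=1).wv

def document_vector(tokens, wv):
    """Unweighted average of the in-vocabulary word vectors
    (a weighted average, e.g. by TF-IDF, is a common refinement)."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

# Word-level: cosine similarity between two word vectors
print(wv.similarity("cat", "dog"))

# Document-level: average the word vectors, then compare documents
# (or feed these document vectors to a downstream classifier)
d1 = document_vector(["the", "cat", "sat", "on", "the", "mat"], wv)
d2 = document_vector(["the", "dog", "sat", "on", "the", "log"], wv)
print(cosine_similarity([d1], [d2]))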
