CITS4012 Lecture 3
Outline: Why vectorising words? | Traditional Sparse Word Vectorisation | Vector Space Models for Dense Representation | Count-based Methods | Take-Aways | References
wei.liu@uwa.edu.au
Computer Science and Software Engineering
The University of Western Australia
Similarity Results
Embeddings → Similarities
Code: code/dist.py, code/plot_sim.py
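The scripts code/dist.py and code/plot_sim.py are not reproduced on these slides. Below is a minimal sketch, not the lecture's actual code, of how pairwise cosine similarities between word embeddings can be computed and visualised; the word list and the random vectors are placeholders.

```python
# Sketch: pairwise cosine similarities between word vectors, shown as a heatmap.
import numpy as np
import matplotlib.pyplot as plt

words = ["king", "queen", "apple", "orange"]   # placeholder vocabulary
vecs = np.random.rand(len(words), 50)          # placeholder 50-dimensional embeddings

# Cosine similarity = dot product of L2-normalised vectors.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T

fig, ax = plt.subplots()
im = ax.imshow(sim, cmap="viridis")
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words)
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)
fig.colorbar(im, ax=ax)
plt.show()
```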
One-Hot Encoding
Each unique token is represented by a vector full of zeros except for one position, the position corresponding to the token's index.
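A minimal sketch of one-hot encoding; the vocabulary below is a made-up example, not data from the lecture.

```python
# One-hot encoding: a vector of zeros with a single 1 at the token's index.
import numpy as np

vocab = ["time", "fruit", "flies", "like", "a", "an", "arrow", "banana"]  # example vocabulary
index = {tok: i for i, tok in enumerate(vocab)}   # token -> position in the vocabulary

def one_hot(token):
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[token]] = 1
    return vec

print(one_hot("flies"))   # [0 0 1 0 0 0 0 0]
```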
             D1  D2  D3  D4  ...
tezgüino      1   1   1   1
loud          0   0   0   0
motor oil     1   0   0   1
tortillas     0   1   0   1
choices       0   1   0   0
wine          1   1   1   0
...          ...  ... ...  ...
Table: Term-Document Matrix
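The point of the table is the distributional idea that words occurring in similar sets of documents (e.g. tezgüino and wine) are likely to have related meanings. A small sketch of that comparison, using the 0/1 rows shown above:

```python
# Compare rows of the term-document matrix: words with similar document
# distributions get a high cosine similarity.
import numpy as np

rows = {
    "tezgüino":  [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor oil": [1, 0, 0, 1],
    "tortillas": [0, 1, 0, 1],
    "choices":   [0, 1, 0, 0],
    "wine":      [1, 1, 1, 0],
}

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine(rows["tezgüino"], rows["wine"]))      # ~0.87: very similar distributions
print(cosine(rows["tezgüino"], rows["choices"]))   # 0.5: less similar
```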
The deep learning tsunami hit the shores of NLP around 2011.
Word Embeddings: Word2Vec was one of the massive waves of this tsunami and once again accelerated research in semantic representation.
Despite not being "deep", the model was a very efficient way of constructing compact vector representations by leveraging (shallow) neural networks.
Since then, the term "embedding" has almost replaced "representation" and has come to dominate the field of lexical semantics.
But such embeddings are static in nature and capture only the most popular meaning of a word, e.g. mouse has just one embedding regardless of context.
Contextualised representations address this by allowing the embedding to adapt itself to the context.
BERT embeddings and their variants dominate now.
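A minimal sketch of what "static" means in practice, assuming the gensim library is installed and that the pretrained model name below is available through its downloader:

```python
# Sketch: a static embedding assigns one vector per word type, so "mouse"
# (the animal vs. the computer device) has a single, context-independent vector.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # assumed pretrained model name
print(wv["mouse"].shape)                    # one fixed 300-d vector for "mouse"
print(wv.most_similar("mouse", topn=5))     # nearest neighbours of that single vector
```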
Count-based Methods
BoW Example
Step 1: Collect data.
It was the best of times.
It was the worst of times.
It was the age of wisdom.
It was the age of foolishness.
Step 2: Design the vocabulary, i.e. preprocess (lowercase, strip punctuation, collect the unique tokens).
Step 3: Create the term-document matrix (a code sketch follows the table).

             D1  D2  D3  D4
it            1   1   1   1
was           1   1   1   1
the           1   1   1   1
of            1   1   1   1
best          1   0   0   0
worst         0   1   0   0
age           0   0   1   1
times         1   1   0   0
wisdom        0   0   1   0
foolishness   0   0   0   1
Table: Term-Document Frequency Matrix
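A minimal sketch of Steps 1-3 using scikit-learn's CountVectorizer (one of several ways to build the matrix; this assumes a recent scikit-learn version):

```python
# Build a term-document matrix for the four documents above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It was the best of times.",
    "It was the worst of times.",
    "It was the age of wisdom.",
    "It was the age of foolishness.",
]

# Step 2: CountVectorizer lowercases and strips punctuation by default,
# which yields the same vocabulary as the table (in alphabetical order).
vectorizer = CountVectorizer(binary=True)   # binary=True gives 0/1 entries
X = vectorizer.fit_transform(docs)          # documents x terms, sparse

print(vectorizer.get_feature_names_out())
print(X.toarray().T)                        # transposed: terms x documents, as in the table
```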
Document Frequency
Document frequency (df_t) is defined as the number of documents in the collection that contain the term t.
TFIDF
Inverse document frequency: idf_t = log2(N / df_t), where N is the number of documents in the collection.
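The tf-idf weight itself is the product of the two components, w(t, d) = tf(t, d) × idf_t. A minimal sketch of computing idf and tf-idf by hand from the raw counts in the BoW table above (library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing):

```python
# tf-idf from raw counts: idf_t = log2(N / df_t); weight = tf * idf.
import numpy as np

# terms x documents raw counts, in the row order of the BoW table
counts = np.array([
    [1, 1, 1, 1],   # it
    [1, 1, 1, 1],   # was
    [1, 1, 1, 1],   # the
    [1, 1, 1, 1],   # of
    [1, 0, 0, 0],   # best
    [0, 1, 0, 0],   # worst
    [0, 0, 1, 1],   # age
    [1, 1, 0, 0],   # times
    [0, 0, 1, 0],   # wisdom
    [0, 0, 0, 1],   # foolishness
])

N = counts.shape[1]               # number of documents
df = (counts > 0).sum(axis=1)     # df_t: documents containing each term
idf = np.log2(N / df)             # idf_t = log2(N / df_t)
tfidf = counts * idf[:, None]     # w(t, d) = tf(t, d) * idf_t

print(idf)       # terms appearing in every document get idf = 0
print(tfidf)
```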
Word-Context Matrix
Code Example: constructing a word co-occurrence matrix
code/word–coocur.py
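The script code/word–coocur.py is not reproduced on these slides. Below is a minimal sketch of building a word-context co-occurrence matrix with a symmetric window; the toy corpus and window size are placeholders.

```python
# Sketch: word-context co-occurrence counts with a symmetric window of size 2.
import numpy as np

corpus = [
    ["it", "was", "the", "best", "of", "times"],
    ["it", "was", "the", "worst", "of", "times"],
]                                  # toy pre-tokenised corpus (placeholder)
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:                          # skip the centre word itself
                cooc[idx[w], idx[sent[j]]] += 1

print(vocab)
print(cooc)
```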
Pair-Pattern Matrix
Limitations of BoW
The bag-of-words model is very simple to understand and implement, and offers a lot of flexibility for customisation on your specific text data. It has been used with great success on prediction problems such as language modelling and document classification.
Nevertheless, it suffers from some shortcomings:
Vocabulary: The vocabulary requires careful design, most specifically to manage its size, which impacts the sparsity of the document representations.
Sparsity: Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons: the challenge is for models to harness so little information in such a large representational space.
Meaning: Discarding word order ignores context, and in turn the meaning of words in the document (semantics). Context and meaning can offer a lot to a model; if modelled, they could tell the difference between the same words arranged differently ('this is interesting' vs 'is this interesting'), synonyms ('old bike' vs 'used bike'), and much more.
Take-Aways
one-hot encoding
vectorisation
count-based methods (tf-idf, PMI, PPMI); see the PPMI sketch below
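PMI and PPMI are listed above but not worked through on these slides. As a reminder, PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))); below is a minimal sketch of applying it to a co-occurrence count matrix, with made-up toy counts.

```python
# PPMI re-weighting of a word-context co-occurrence matrix.
import numpy as np

cooc = np.array([[2., 1., 0.],
                 [1., 0., 3.],
                 [0., 3., 1.]])            # toy counts (placeholder)

total = cooc.sum()
p_wc = cooc / total                        # joint probability P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)      # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # marginal P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))      # -inf where the count is zero
ppmi = np.maximum(pmi, 0)                  # clip negatives (and -inf) to 0
ppmi = np.nan_to_num(ppmi)                 # guard against 0/0 cells

print(ppmi)
```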
References