CITS4012 Lecture 3
Outline: Why vectorising words? | Traditional Sparse Word Vectorisation | Vector Space Models for Dense Representation | Count-based Methods | Take-Aways | References
wei.liu@uwa.edu.au
Computer Science and Software Engineering
The University of Western Australia
Similarity Results
Embeddings → Similarities
Code: code/dist.py, code/plot_sim.py
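The scripts code/dist.py and code/plot_sim.py are not reproduced on these slides. Below is a minimal sketch, not the lecture's actual code, of how pairwise cosine similarities between word embeddings can be computed and visualised; the word list and the random vectors are placeholders.

```python
# Sketch: pairwise cosine similarities between word vectors, shown as a heatmap.
import numpy as np
import matplotlib.pyplot as plt

words = ["king", "queen", "apple", "orange"]   # placeholder vocabulary
vecs = np.random.rand(len(words), 50)          # placeholder 50-dimensional embeddings

# Cosine similarity = dot product of L2-normalised vectors.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T

fig, ax = plt.subplots()
im = ax.imshow(sim, cmap="viridis")
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words)
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)
fig.colorbar(im, ax=ax)
plt.show()
```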
One-Hot Encoding
Each unique token is represented by a vector full of zeros except for one position, the position corresponding to the token's index.
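A minimal sketch of one-hot encoding; the vocabulary below is a made-up example, not data from the lecture.

```python
# One-hot encoding: a vector of zeros with a single 1 at the token's index.
import numpy as np

vocab = ["time", "fruit", "flies", "like", "a", "an", "arrow", "banana"]  # example vocabulary
index = {tok: i for i, tok in enumerate(vocab)}   # token -> position in the vocabulary

def one_hot(token):
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[token]] = 1
    return vec

print(one_hot("flies"))   # [0 0 1 0 0 0 0 0]
```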
             D1  D2  D3  D4  ...
tezgüino      1   1   1   1
loud          0   0   0   0
motor oil     1   0   0   1
tortillas     0   1   0   1
choices       0   1   0   0
wine          1   1   1   0
...          ...  ... ...  ...
Table: Term-Document Matrix
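The point of the table is the distributional idea that words occurring in similar sets of documents (e.g. tezgüino and wine) are likely to have related meanings. A small sketch of that comparison, using the 0/1 rows shown above:

```python
# Compare rows of the term-document matrix: words with similar document
# distributions get a high cosine similarity.
import numpy as np

rows = {
    "tezgüino":  [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor oil": [1, 0, 0, 1],
    "tortillas": [0, 1, 0, 1],
    "choices":   [0, 1, 0, 0],
    "wine":      [1, 1, 1, 0],
}

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine(rows["tezgüino"], rows["wine"]))      # ~0.87: very similar distributions
print(cosine(rows["tezgüino"], rows["choices"]))   # 0.5: less similar
```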
The deep learning tsunami hit the shores of NLP around 2011.
Word Embeddings: Word2Vec was one of the massive waves of this tsunami and once again accelerated research in semantic representation.
Despite not being "deep", the model was a very efficient way of constructing compact vector representations by leveraging (shallow) neural networks.
Since then, the term "embedding" has almost replaced "representation" and has come to dominate the field of lexical semantics.
But such embeddings are static in nature and capture only the most popular meaning of a word, e.g. mouse has just one embedding regardless of context.
Contextualised representations address this by allowing the embedding to adapt itself to the context.
BERT embeddings and their variants dominate now.
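A minimal sketch of what "static" means in practice, assuming the gensim library is installed and that the pretrained model name below is available through its downloader:

```python
# Sketch: a static embedding assigns one vector per word type, so "mouse"
# (the animal vs. the computer device) has a single, context-independent vector.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # assumed pretrained model name
print(wv["mouse"].shape)                    # one fixed 300-d vector for "mouse"
print(wv.most_similar("mouse", topn=5))     # nearest neighbours of that single vector
```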
Count-based Methods
BoW Example
Step 1: Collect data.
It was the best of times.
It was the worst of times.
It was the age of wisdom.
It was the age of foolishness.
Step 2: Design the vocabulary, i.e. preprocess (lowercase, strip punctuation, collect the unique tokens).
Step 3: Create the term-document matrix (a code sketch follows the table).

             D1  D2  D3  D4
it            1   1   1   1
was           1   1   1   1
the           1   1   1   1
of            1   1   1   1
best          1   0   0   0
worst         0   1   0   0
age           0   0   1   1
times         1   1   0   0
wisdom        0   0   1   0
foolishness   0   0   0   1
Table: Term-Document Frequency Matrix
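A minimal sketch of Steps 1-3 using scikit-learn's CountVectorizer (one of several ways to build the matrix; this assumes a recent scikit-learn version):

```python
# Build a term-document matrix for the four documents above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It was the best of times.",
    "It was the worst of times.",
    "It was the age of wisdom.",
    "It was the age of foolishness.",
]

# Step 2: CountVectorizer lowercases and strips punctuation by default,
# which yields the same vocabulary as the table (in alphabetical order).
vectorizer = CountVectorizer(binary=True)   # binary=True gives 0/1 entries
X = vectorizer.fit_transform(docs)          # documents x terms, sparse

print(vectorizer.get_feature_names_out())
print(X.toarray().T)                        # transposed: terms x documents, as in the table
```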
Document Frequency
Document frequency (df_t) is defined as the number of documents in the collection that contain the term t.
TFIDF
Inverse document frequency: idf_t = log2(N / df_t), where N is the number of documents in the collection.
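The tf-idf weight itself is the product of the two components, w(t, d) = tf(t, d) × idf_t. A minimal sketch of computing idf and tf-idf by hand from the raw counts in the BoW table above (library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing):

```python
# tf-idf from raw counts: idf_t = log2(N / df_t); weight = tf * idf.
import numpy as np

# terms x documents raw counts, in the row order of the BoW table
counts = np.array([
    [1, 1, 1, 1],   # it
    [1, 1, 1, 1],   # was
    [1, 1, 1, 1],   # the
    [1, 1, 1, 1],   # of
    [1, 0, 0, 0],   # best
    [0, 1, 0, 0],   # worst
    [0, 0, 1, 1],   # age
    [1, 1, 0, 0],   # times
    [0, 0, 1, 0],   # wisdom
    [0, 0, 0, 1],   # foolishness
])

N = counts.shape[1]               # number of documents
df = (counts > 0).sum(axis=1)     # df_t: documents containing each term
idf = np.log2(N / df)             # idf_t = log2(N / df_t)
tfidf = counts * idf[:, None]     # w(t, d) = tf(t, d) * idf_t

print(idf)       # terms appearing in every document get idf = 0
print(tfidf)
```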
Word-Context Matrix
Code Example: constructing a word co-occurrence matrix
code/word–coocur.py
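The script code/word–coocur.py is not reproduced on these slides. Below is a minimal sketch of building a word-context co-occurrence matrix with a symmetric window; the toy corpus and window size are placeholders.

```python
# Sketch: word-context co-occurrence counts with a symmetric window of size 2.
import numpy as np

corpus = [
    ["it", "was", "the", "best", "of", "times"],
    ["it", "was", "the", "worst", "of", "times"],
]                                  # toy pre-tokenised corpus (placeholder)
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:                          # skip the centre word itself
                cooc[idx[w], idx[sent[j]]] += 1

print(vocab)
print(cooc)
```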
Pair-Pattern Matrix
Limitations of BoW
The bag-of-words model is very simple to understand and implement, and offers a lot of flexibility for customisation on your specific text data. It has been used with great success on prediction problems such as language modelling and document classification.
Nevertheless, it suffers from some shortcomings:
Vocabulary: The vocabulary requires careful design, most specifically to manage its size, which impacts the sparsity of the document representations.
Sparsity: Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons: the challenge is for models to harness so little information in such a large representational space.
Meaning: Discarding word order ignores context, and in turn the meaning of words in the document (semantics). Context and meaning can offer a lot to a model; if modelled, they could tell the difference between the same words arranged differently ('this is interesting' vs 'is this interesting'), synonyms ('old bike' vs 'used bike'), and much more.
Take-Aways
one-hot encoding
vectorisation
count-based methods (tf-idf, PMI, PPMI); see the PPMI sketch below
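PMI and PPMI are listed above but not worked through on these slides. As a reminder, PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))); below is a minimal sketch of applying it to a co-occurrence count matrix, with made-up toy counts.

```python
# PPMI re-weighting of a word-context co-occurrence matrix.
import numpy as np

cooc = np.array([[2., 1., 0.],
                 [1., 0., 3.],
                 [0., 3., 1.]])            # toy counts (placeholder)

total = cooc.sum()
p_wc = cooc / total                        # joint probability P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)      # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # marginal P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))      # -inf where the count is zero
ppmi = np.maximum(pmi, 0)                  # clip negatives (and -inf) to 0
ppmi = np.nan_to_num(ppmi)                 # guard against 0/0 cells

print(ppmi)
```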
References