Lecture#14: Word Embedding Process

Word Embedding

What is Word Embedding?
A word embedding is a learned representation for text where words that have the same meaning have a similar representation.
One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.
Word Embedding Applications

• Computing similar words
• Text classification
• Document clustering/grouping
• Feature extraction for text classification
• Natural language processing
Word Embedding in NLP
Word embedding, or word vectorization, is a methodology in NLP for mapping words or phrases from a vocabulary to corresponding vectors of real numbers, which are then used for word prediction and for measuring word similarity/semantics. The process of converting words into numbers is called vectorization. After the words are converted into vectors, we use techniques such as Euclidean distance or cosine similarity to identify similar words.
Word Embedding Algorithms
An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification. It requires that the document text be cleaned and prepared so that each word is one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the backpropagation algorithm. If a multilayer perceptron model is used, the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, each word may be taken as one input in a sequence.
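As an illustration, here is a minimal sketch of such a layer (assuming TensorFlow/Keras; the vocabulary size, embedding dimension, and document length are placeholder values, not taken from the slides). The word vectors are concatenated and fed to a small classifier, and the embedding weights are updated by backpropagation along with the rest of the network:

```python
# Minimal sketch: an embedding layer learned jointly with a document classifier.
# vocab_size, embed_dim, and max_len are assumed placeholder values.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 10000   # number of one-hot encoded words in the vocabulary
embed_dim = 100      # size of the learned word vectors (e.g. 50, 100, or 300)
max_len = 50         # words per (padded) input document

inputs = layers.Input(shape=(max_len,), dtype="int32")   # word indices
x = layers.Embedding(vocab_size, embed_dim)(inputs)      # (batch, max_len, embed_dim)
x = layers.Flatten()(x)                                   # concatenate the word vectors
outputs = layers.Dense(1, activation="sigmoid")(x)        # simple classification head
model = models.Model(inputs, outputs)

# Backpropagation updates the randomly initialized embedding vectors
# together with the classifier weights.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```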
Types of Word Embedding

Word Embedding
• Frequency-based
  •BOW
  •TF-IDF
• Prediction-based
  •Word2Vec
Word2vec
Word2vec is a widely used natural language processing
technique that uses a neural network to learn distributed
representations of words, also known as word embeddings.
These embeddings capture the semantics of a word in a
continuous vector space, such that similar words are close
together in the vector space. Word2vec has two main model
architectures: continuous bag-of-words (CBOW) and skip-gram.
CBOW predicts the current word based on the context of the
surrounding words, while skip-gram predicts the surrounding
words given the current word. Word2vec can be trained on a large
text dataset and is commonly used in various natural language
processing tasks, such as language translation, text classification,
and information retrieval.
Why Word2Vec

• BOW and TF-IDF techniques cannot capture semantic meaning. For example:
• Happy/Enjoy: BOW/TF-IDF cannot find the similarity between these two words.
• Google engineers developed this technique in 2013.
Vocabulary in Word2Vec
• In word embedding models such as word2vec, the vocabulary refers to the set of unique words on which the model has been trained. The vocabulary is typically created by preprocessing the input text data and selecting a subset of words to include based on certain criteria, such as frequency of occurrence or length.
• For example, the word2vec model can create the vocabulary by building a dictionary of all the unique words in the input text data and filtering out words that occur too infrequently or are too long. The vocabulary size is typically controlled by the parameter min_count, which specifies the minimum number of occurrences a word must have in the input data to be included in the vocabulary (see the sketch below).
• The vocabulary is then used to create the numerical vectors, also known as word embeddings, one vector per word in the vocabulary.
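For instance, a minimal Gensim sketch (the toy corpus reuses the example sentences that appear later in the lecture, and the parameter values are illustrative, not prescribed by the slides):

```python
# Minimal Gensim sketch: build a Word2Vec vocabulary with min_count filtering.
# The toy corpus and parameter values are illustrative placeholders.
from gensim.models import Word2Vec

sentences = [
    ["julie", "loves", "john", "more", "than", "linda", "loves", "john"],
    ["jane", "loves", "john", "more", "than", "julie", "loves", "john"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word embeddings
    window=5,          # context window size
    min_count=1,       # keep every word that occurs at least once
    sg=0,              # 0 = CBOW, 1 = skip-gram
)

print(model.wv.key_to_index)   # the learned vocabulary (word -> index)
print(model.wv["john"][:5])    # first values of one word's embedding vector
```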
Continuous Bag-of-Words (CBOW) Model

The continuous bag-of-words (CBOW) model is a neural network for natural language processing tasks such as language translation and text classification. It predicts a target word based on the context of the surrounding words and is trained on a large dataset of text using an optimization algorithm such as stochastic gradient descent. Once trained, the CBOW model generates numerical vectors, known as word embeddings, which capture the semantics of words in a continuous vector space and can be used in various NLP tasks. It is often combined with other techniques and models, such as the skip-gram model, and can be implemented using libraries like Gensim in Python.
Difference (BOW/TF-IDF vs. Word2Vec)

BOW/TF-IDF:
• Cannot capture semantic meaning or relations between words
• Very high-dimensional vectors, because most entries are zeros
• Sparse vectors
• Prone to overfitting

Word2Vec:
• Captures the semantic meaning of words
• Low-dimensional vectors, because every word gets real-valued entries
• Dense vectors
• Less prone to overfitting
Install these libraries and modules for Word2Vec

https://colab.research.google.com/dri...

https://github.com/campusx-official/g...
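A typical setup, assuming a standard Python environment (the package list below is an assumption, not copied from the linked notebook):

```python
# Typical setup for the Word2Vec examples (package names assumed, versions unpinned):
#   pip install gensim numpy
from gensim.models import Word2Vec   # Word2Vec / CBOW / skip-gram training
import numpy as np                    # vector arithmetic and similarity measures
```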
Let’s see an example
• Julie loves John more than Linda loves John
• Jane loves John more than Julie loves John

Counting how often each vocabulary word occurs in each sentence, the two count vectors are:

Item 1: [2, 0, 1, 1, 0, 2, 1, 1]
Item 2: [2, 1, 1, 0, 1, 1, 1, 1]
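With plain NumPy, their cosine similarity can be computed directly from the two count vectors (a small illustrative sketch):

```python
# Cosine similarity between the two count vectors from the example above.
import numpy as np

item1 = np.array([2, 0, 1, 1, 0, 2, 1, 1])
item2 = np.array([2, 1, 1, 0, 1, 1, 1, 1])

cosine = item1 @ item2 / (np.linalg.norm(item1) * np.linalg.norm(item2))
print(round(cosine, 3))   # roughly 0.82: the two sentences are quite similar
```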
               King   Queen   Man   Woman   Monkey
Gender (Male)  1      0       1     0       1
Wealth         1      1       0.7   0.3     0
Power          1      0.7     0.6   0.5     0
Weight         0.7    0.5     0.6   0.5     0.3
Speak          1      1       1     1       0

King - Man + Woman (per feature):
Gender: 1 - 1 + 0 = 0
Wealth: 1 - 0.7 + 0.3 = 0.6
Power:  1 - 0.6 + 0.5 = 0.9
Comparing the result with the remaining columns, it is closest to Queen, which is the familiar King - Man + Woman ≈ Queen analogy.
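The same arithmetic over the full feature vectors, as a NumPy sketch (the numbers are the toy feature values from the table above, not learned embeddings):

```python
# King - Man + Woman on the toy feature vectors from the table above.
import numpy as np

# Feature order: Gender(Male), Wealth, Power, Weight, Speak
king   = np.array([1.0, 1.0, 1.0, 0.7, 1.0])
queen  = np.array([0.0, 1.0, 0.7, 0.5, 1.0])
man    = np.array([1.0, 0.7, 0.6, 0.6, 1.0])
woman  = np.array([0.0, 0.3, 0.5, 0.5, 1.0])
monkey = np.array([1.0, 0.0, 0.0, 0.3, 0.0])

result = king - man + woman   # [0.0, 0.6, 0.9, 0.6, 1.0]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Excluding the query words themselves (as word2vec analogy queries do),
# the result vector is closest to Queen: King - Man + Woman ≈ Queen.
for name, vec in [("queen", queen), ("monkey", monkey)]:
    print(name, round(cosine(result, vec), 3))   # queen ≈ 0.96, monkey ≈ 0.11
```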
Why Cosine Similarity
• Counting common words, or computing the Euclidean distance, is the general approach used to match similar documents; both are based on the number of words the documents have in common.

• This approach does not work well: as documents grow longer, the number of common words tends to increase even if the documents talk about different topics. To overcome this flaw, the "Cosine Similarity" approach is used to find the similarity between the documents.
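A small sketch of that flaw, using made-up count vectors for a short and a long document about the same topic:

```python
# Why cosine similarity: a short and a long document about the same topic.
# The count vectors are made-up toy values for illustration.
import numpy as np

short_doc = np.array([2, 1, 0, 1])     # term counts in a short document
long_doc  = np.array([20, 10, 0, 10])  # same topic, ten times longer

euclidean = np.linalg.norm(short_doc - long_doc)
cosine = short_doc @ long_doc / (np.linalg.norm(short_doc) * np.linalg.norm(long_doc))

print(euclidean)  # large (about 22), suggesting the documents are very different
print(cosine)     # 1.0, because the vectors point in the same direction
```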
Types of Word2Vec

Word2Vec
• CBOW
• Skip-gram
Word2Vec
Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus. It was developed by Tomas Mikolov et al. at Google in 2013 as a response to make the neural-network-based training of embeddings more efficient, and it has since become the de facto standard for developing pre-trained word embeddings. It offers two model architectures:
•Continuous Bag-of-Words (CBOW) model
•Continuous Skip-Gram model
CBOW tries to predict a word on the basis of its neighbors, while skip-gram tries to predict the neighbors of a word.
• Word2Vec first sets the context window (see the sketch below):
• If the window size is 3:
  ____, Target, ____

• If the window size is 5:
  ____, ____, Target, ____, ____

• If the window size is 7:
  ____, ____, ____, Target, ____, ____, ____
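A small plain-Python sketch of how (context, target) pairs fall out of a chosen window size (the sentence reuses an earlier example; the helper function name is made up for illustration):

```python
# Generate (context words, target word) pairs for a given window size.
# Here window_size counts the target itself, matching the slide (3 -> 1 word each side).
def context_target_pairs(tokens, window_size=3):
    half = (window_size - 1) // 2   # words taken on each side of the target
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]
        pairs.append((context, target))
    return pairs

sentence = "julie loves john more than linda loves john".split()
for context, target in context_target_pairs(sentence, window_size=3):
    print(context, "->", target)
# e.g. ['julie', 'john'] -> loves
```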
What is a Continuous Bag of Words (CBOW)?

Continuous Bag of Words (CBOW) is a popular natural language


processing technique used to generate word embeddings. Word
embeddings are important for many NLP tasks because they capture
semantic and syntactic relationships between words in a language.
CBOW is a neural network-based algorithm that predicts a target word
given its surrounding context words. It is a type of “unsupervised”
learning, meaning that it can learn from unlabeled data, and it is often
used to pre-train word embeddings that can be used for various NLP
tasks such as sentiment analysis, text classification, and machine
translation.
Word2Vec
Implementation of CBOW Model
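A minimal illustrative CBOW sketch (assuming TensorFlow/Keras; the vocabulary size, window, and embedding dimension are placeholder values): it averages the context-word embeddings and predicts the target word with a softmax over the vocabulary.

```python
# Minimal CBOW sketch in Keras: average the context embeddings, predict the target.
# vocab_size, embed_dim, and window are assumed placeholder values.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 5000   # size of the vocabulary
embed_dim = 100     # dimensionality of the word embeddings
window = 2          # context words taken on each side of the target

context = layers.Input(shape=(2 * window,), dtype="int32")    # context word indices
embedded = layers.Embedding(vocab_size, embed_dim)(context)    # (batch, 2*window, embed_dim)
averaged = layers.GlobalAveragePooling1D()(embedded)           # mean of the context vectors
target_probs = layers.Dense(vocab_size, activation="softmax")(averaged)

cbow = Model(context, target_probs)
cbow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
cbow.summary()
```

Training this on (context, target) pairs like those generated earlier fits the embedding matrix; after training, the weights of the Embedding layer are the word vectors.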
Thanks for Listening
