The FastText Model
The FastText model was first introduced by Facebook in 2016 as an extension and purported
improvement of the vanilla Word2Vec model. It is based on the original paper titled ‘Enriching
Word Vectors with Subword Information’ by Bojanowski et al. (with Mikolov as a co-author),
which is an excellent read to gain an in-depth understanding of how this model works. Overall,
FastText is a framework for learning word representations and also performing robust, fast and
accurate text classification. The framework has been open-sourced by Facebook on GitHub.
Though I haven’t implemented this model from scratch, based on the research paper, the following is
what I learnt about how the model works. In general, predictive models like Word2Vec
typically consider each word as a distinct entity (e.g. where) and generate a dense
embedding for the word. However, this poses a serious limitation for languages with
massive vocabularies and many rare words that may not occur often in different corpora. The
Word2Vec model ignores the morphological structure of each word and treats a
word as a single entity. The FastText model, in contrast, considers each word as a bag of character n-
grams. This is also called a subword model in the paper.
We add special boundary symbols < and > at the beginning and end of words. This enables us to
distinguish prefixes and suffixes from other character sequences. We also include the
word w itself in the set of its n-grams, to learn a representation for each word (in addition to its
character n-grams). Taking the word where and n=3 (tri-grams) as an example, it will be
represented by the character n-grams: <wh, whe, her, ere, re> and the special
sequence <where> representing the whole word. Note that the sequence <her>, corresponding to the
word her, is different from the tri-gram her from the word where.
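To make this concrete, here is a minimal sketch (not the paper’s or FastText’s actual implementation) of how a word can be broken into character tri-grams with the boundary symbols described above.

```python
def char_ngrams(word, n=3):
    """Break a word into character n-grams, plus the whole bracketed word."""
    token = f"<{word}>"                                   # add boundary symbols
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    grams.append(token)                                   # keep <word> itself
    return grams

print(char_ngrams('where'))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```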
In practice, the paper recommends extracting all the n-grams for 3 ≤ n ≤ 6. This is a
very simple approach, and different sets of n-grams could be considered, for example taking all
prefixes and suffixes. We typically associate a vector representation (embedding) with each n-gram
of a word. Thus, we can represent a word by the sum of the vector representations of its n-grams
or by the average of these n-gram embeddings. Due to this effect of leveraging n-grams built
from a word’s characters, rare words have a better chance of getting a good representation,
since their character-based n-grams should also occur across other words of the corpus.
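Building on the sketch above, the following hypothetical snippet extracts all character n-grams for 3 ≤ n ≤ 6 and averages their embeddings to form a word vector; ngram_vectors is assumed to be a pre-learned lookup from n-gram strings to dense vectors, not part of any real library.

```python
import numpy as np

def word_vector(word, ngram_vectors, min_n=3, max_n=6):
    """Average the embeddings of all character n-grams (3 to 6) of a word."""
    token = f"<{word}>"
    grams = {token}                                       # include the whole word
    for n in range(min_n, max_n + 1):
        grams.update(token[i:i + n] for i in range(len(token) - n + 1))
    vecs = [ngram_vectors[g] for g in grams if g in ngram_vectors]
    return np.mean(vecs, axis=0) if vecs else None
```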
The gensim package has nice wrappers that provide interfaces to leverage the FastText model,
available under the gensim.models.fasttext module. Let’s apply this once again on our Bible
corpus and look at our words of interest and their most similar words.
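Here is a minimal sketch of how this could look with gensim 4.x; tokenized_corpus is assumed to be our tokenized Bible corpus (a list of token lists), and the hyperparameters and the words of interest below are illustrative rather than the exact ones used earlier.

```python
from gensim.models import FastText

# Train a FastText skip-gram model on the tokenized corpus (illustrative settings)
ft_model = FastText(sentences=tokenized_corpus, vector_size=100, window=5,
                    min_count=5, sg=1, epochs=10, min_n=3, max_n=6)

# Inspect the most similar words for a few illustrative words of interest
for word in ['god', 'jesus', 'moses', 'egypt']:
    print(word, '=>', ft_model.wv.most_similar(word, topn=5))
```

The min_n and max_n parameters control the character n-gram range, mirroring the 3-to-6 range recommended in the paper.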
You can see a lot of similarity between these results and those of our Word2Vec model, with
relevant similar words for each of our words of interest. Do you notice any interesting
associations and similarities?