FastText
Learn how Facebook's FastText improves upon Word2Vec by utilizing subword information (character n-grams) to handle Out-Of-Vocabulary words.
FastText: Subword Embeddings
Created by Facebook's AI Research (FAIR) lab in 2016, FastText is an extension of the Word2Vec model. While Word2Vec and GloVe treat every word as a distinct, atomic entity, FastText breaks words down into smaller pieces called character n-grams.
The Subword Breakdown Example
How does FastText view the word "apple" using an n-gram size of n=3?
FastText adds special boundary characters < and > to denote the beginning and end of a word.
N-grams (n=3): [ "<ap", "app", "ppl", "ple", "le>" ]
The final embedding for "apple" is the sum of the embeddings of all of these n-grams, plus the embedding for the whole word itself (stored as the special sequence "<apple>")!
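To make the breakdown concrete, here is a minimal Python sketch of the extraction step. The helper name char_ngrams is purely illustrative (it is not part of any FastText library), and the real model additionally uses a range of n-gram sizes (3 to 6 by default) and hashes them into a fixed number of buckets.

def char_ngrams(word, n=3):
    # Add the boundary markers "<" and ">" that FastText uses
    padded = f"<{word}>"
    # Slide a window of size n across the padded word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']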
Why is this Revolutionary?
- Handles Typos: If a user types "appple", Word2Vec has no vector for it at all, because that exact token never appeared in training (in gensim, the lookup raises a KeyError). FastText still produces a sensible vector because "appple" shares most of its subword n-grams with "apple" (a quick overlap check appears after this list).
- Solves the OOV Problem: It can generate embeddings for Out-Of-Vocabulary (OOV) words it has never seen before, by summing their character parts.
- Great for Morphologically Rich Languages: Highly effective for languages like Turkish or Finnish, where words are built by attaching many suffixes to a stem.
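To back up the typo point above, here is a small illustrative check; the trigrams helper is hypothetical, and the exact overlap percentage depends on which n-gram sizes you count.

def trigrams(word):
    # Character trigrams with FastText-style boundary markers
    padded = f"<{word}>"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

shared = trigrams("apple") & trigrams("appple")
print(sorted(shared))
# ['<ap', 'app', 'le>', 'ple', 'ppl']  -> 5 of the typo's 6 trigrams
# also belong to "apple", so their summed subword vectors stay close.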
from gensim.models import FastText
corpus = [["hello", "world", "this", "is", "nlp"],
          ["machine", "learning", "is", "awesome"]]
# Train FastText
# min_n and max_n control the character n-gram sizes
model = FastText(sentences=corpus, vector_size=10,
                 window=3, min_count=1, min_n=3, max_n=6)
# The model has never seen "learnings", but it can still build a
# vector for it by summing the vectors of its character n-grams,
# which overlap heavily with those of "learning"!
oov_word = "learnings"
# This works, unlike Word2Vec, which would raise a KeyError here!
vector = model.wv[oov_word]
print(f"Vector for {oov_word} generated successfully!")