N-grams
Preserve local word-order context by grouping contiguous sequences of n items from a given text.
Bag of Words (BoW) discards sentence structure entirely because it treats every word as an independent token (a unigram). N-grams address this by extracting contiguous sequences of n words from a given text, which allows the model to capture local context and short grammatical structures.
Types of N-grams
Consider the sentence: "The weather is very good"
- Unigrams (n=1): ["The", "weather", "is", "very", "good"] (This is standard BoW!)
- Bigrams (n=2): ["The weather", "weather is", "is very", "very good"] (Captures 2-word contexts)
- Trigrams (n=3): ["The weather is", "weather is very", "is very good"]
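The sliding-window idea above is simple enough to sketch by hand. The helper name below (`extract_ngrams`) is hypothetical, chosen just for this illustration:

```python
def extract_ngrams(text, n):
    """Return the contiguous n-word sequences in text (whitespace tokenized)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The weather is very good"
print(extract_ngrams(sentence, 1))  # ['The', 'weather', 'is', 'very', 'good']
print(extract_ngrams(sentence, 2))  # ['The weather', 'weather is', 'is very', 'very good']
print(extract_ngrams(sentence, 3))  # ['The weather is', 'weather is very', 'is very good']
```

Note that a sentence of m words yields m - n + 1 n-grams, so higher n produces fewer but more specific features.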
Why are N-grams crucial for NLP?
1. Resolving Negations
Standard BoW misses negation. If a negative review says "not good", a unigram BoW treats "not" and "good" as separate tokens, which can mislead the classifier into detecting positive sentiment. A bigram model tracks "not good" as a single feature that explicitly signals negativity.
2. Named Entities
Names of people, places, and organizations are often multi-word. "New York" refers to a specific city, whereas "New" and "York" separately are just an adjective and an English city. A bigram captures "New York" as a single feature.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food is not good but the service is very good"]

# ngram_range=(min_n, max_n):
# (2, 2) means ONLY bigrams; (1, 2) means unigrams AND bigrams.
# Here (2, 3) extracts both bigrams and trigrams.
vectorizer = CountVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(docs)

# Print all extracted N-gram features (lowercased by CountVectorizer's default preprocessing)
features = vectorizer.get_feature_names_out()
print("Extracted N-grams:")
for f in features:
    print(f"- '{f}'")
'''
Output:
Extracted N-grams:
- 'but the'
- 'but the service'
- 'food is'
- 'food is not'
...
- 'is not good' <-- Trigram successfully captured the true sentiment
- 'not good' <-- Bigram captured negation
- 'very good'
'''