Stemming
Understand how stemming algorithms such as the Porter, Snowball, and Lancaster stemmers reduce words to a root form.
Stemming Techniques
Stemming is a text normalization technique that reduces words to a base or root form by stripping affixes (in practice, mostly suffixes) according to a fixed set of rules. It is a heuristic process that operates purely on string manipulation, so the resulting stem is not guaranteed to be a dictionary word.
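To make the "fixed set of rules" idea concrete, here is a minimal sketch of a suffix stripper. It is a toy and not any of the published algorithms: the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen for illustration.

```python
# Toy rule-based stemmer: an illustration only, not Porter, Snowball, or Lancaster.
# The suffix list and minimum stem length are assumptions made for this example.
SUFFIX_RULES = ["ingly", "edly", "ing", "ed", "ly", "es", "s"]
MIN_STEM_LEN = 3


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        # Only strip when the remaining stem keeps at least MIN_STEM_LEN characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


for w in ["jumping", "jumped", "jumps", "quickly", "bus"]:
    print(w, "->", naive_stem(w))
```

The length guard is what keeps a short word such as "bus" intact. Real stemmers use more careful conditions for the same purpose.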
Porter Stemmer
Published by Martin Porter in 1980, this is one of the oldest and most widely used suffix-stripping algorithms. It applies five sequential phases of reduction rules to each word.
- "ponies" → "poni"
- "caresses" → "caress"
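Both examples above come from the plural-handling rules in the first phase of the original Porter algorithm (step 1a). The sketch below implements only that sub-step, so it is not a full Porter stemmer. The function name `porter_step_1a` is made up for this example.

```python
def porter_step_1a(word: str) -> str:
    """Apply only the step 1a plural rules of the original Porter algorithm."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (left unchanged)
    if word.endswith("s"):
        return word[:-1]  # S -> (removed)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried in this order so that the longest matching suffix is used. The later phases of the algorithm would then process the result further.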
Snowball Stemmer
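Snowball is Martin Porter's framework for defining stemmers, and its English stemmer is also known as Porter2. It revises several rules of the original Porter stemmer, is generally regarded as more consistent, and provides stemmers for many languages besides English.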
Lancaster Stemmer
The most aggressive of the three stemmers covered here, also known as the Paice/Husk stemmer. It is very fast but often over-truncates, producing stems that are hard to read.
- "maximum" → "maxim"
Overstemming
Occurs when words with different meanings are reduced to the same stem.
- "universal", "university", "universe" → "univers"
Understemming
Occurs when forms of the same word are reduced to different stems.
- "alumnus", "alumni" → "alumnus", "alumni"
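The two error types above are commonly called overstemming (distinct words merged into one stem) and understemming (related forms kept apart). One way to look for them is to group a vocabulary by stem, as in the sketch below. The helper `group_by_stem` and the stand-in stemmer `example_stem` are hypothetical. The lookup table simply holds the stems quoted above, so the block runs without any library. A real stemmer's `stem` method could be passed in instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_stem(words: Iterable[str], stem: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each stem to the list of words that reduce to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for w in words:
        groups[stem(w)].append(w)
    return dict(groups)


# Stand-in stemmer (assumption): a lookup table holding the stems quoted in the text.
EXAMPLE_STEMS = {
    "universal": "univers",
    "university": "univers",
    "universe": "univers",
    "alumnus": "alumnus",
    "alumni": "alumni",
}


def example_stem(word: str) -> str:
    return EXAMPLE_STEMS.get(word, word)


groups = group_by_stem(EXAMPLE_STEMS, example_stem)
for stem, members in groups.items():
    print(f"{stem}: {members}")

# A group with several unrelated members is an overstemming candidate.
# Related words landing in different groups point to understemming.
print(example_stem("alumnus") == example_stem("alumni"))
```

Deciding whether the members of a group really differ in meaning still takes human judgment. The grouping only narrows down where to look.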
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
words = ["running", "generously", "history", "historical", "better", "universities"]
print(f"{'Word':<15} | {'Porter':<15} | {'Snowball':<15} | {'Lancaster':<15}")
print("-" * 69)
for w in words:
    # Stem the same word with each algorithm so the rows can be compared side by side
    p_stem = porter.stem(w)
    s_stem = snowball.stem(w)
    l_stem = lancaster.stem(w)
    print(f"{w:<15} | {p_stem:<15} | {s_stem:<15} | {l_stem:<15}")
# Observations:
# 'better' is never reduced to 'good': Porter and Snowball leave it unchanged,
#   because irregular forms are beyond what suffix stripping can resolve
# 'universities' -> 'univers' (Porter, Snowball, and Lancaster)
# 'historical' -> 'histor' (Porter/Snowball), 'hist' (Lancaster, far more aggressive)