Tokenization
Master word, sentence, and subword tokenization with practical examples using NLTK and spaCy.
Tokenization in NLP
Tokenization is the process of breaking a stream of text into smaller, meaningful units called tokens. Why can't we just use Python's .split(' ')? Because naive whitespace splitting fails on punctuation, contractions, and abbreviations!
"Mr. O'Neill doesn't go." → ["Mr.", "O'Neill", "doesn't", "go."]NLP Tokenized: →
["Mr.", "O", "'", "Neill", "does", "n't", "go", "."]
Types of Tokenization
Sentence Tokenization
Splits paragraphs into sentences. Must be smart enough to know that "Dr." or "U.S.A." doesn't end a sentence.
Word Tokenization
Splits sentences into words and independent punctuation marks such as commas and periods.
Subword Tokenization (BPE)
Used in modern LLMs (e.g., BERT and GPT). Resolves out-of-vocabulary errors by splitting rare words into known subword units.
"Unfriendly" → ["un", "friend", "ly"]
Implementation Code
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer models
text = "Dr. Smith went to the U.S.A. Did he buy apples?"
# 1. Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Output: ['Dr. Smith went to the U.S.A.',
# 'Did he buy apples?']
# 2. Word Tokenization
words = word_tokenize(sentences[0])
print("\nWords:", words)
# Output: ['Dr.', 'Smith', 'went', 'to',
# 'the', 'U.S.A.']
import spacy

# python -m spacy download en_core_web_sm  (run once if the model is missing)
nlp = spacy.load("en_core_web_sm")
text = "Dr. Smith went to the U.S.A."
# Process the text
doc = nlp(text)
# Tokenize
tokens = [token.text for token in doc]
print("spaCy Tokens:", tokens)
# Output: ['Dr.', 'Smith', 'went', 'to',
# 'the', 'U.S.A.', '.']
# Notice spaCy separates the final period!