Preprocess for Tesseract
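import cv2
img_bgr = cv2.imread("page.png")  # hypothetical input path; imread returns None if the file cannot be read
gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)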
gray = cv2.GaussianBlur(gray, (3, 3), 0)
_, th = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Optional: morphology to close gaps in strokes
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
th = cv2.morphologyEx(th, cv2.MORPH_CLOSE, kernel)
Deskewing, denoising, and contrast normalization often help a line-based OCR engine such as Tesseract more than feeding it raw color photos.
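The snippet above covers denoising and binarization but not deskewing. A minimal sketch of one way to estimate skew is a projection-profile search, written here with NumPy only. The function name `estimate_skew_angle` and the `max_angle` and `step` parameters are illustrative assumptions, not part of OpenCV or Tesseract. It expects text pixels to be non-zero, so for dark text on a light background pass `th == 0`.

```python
import numpy as np

def estimate_skew_angle(binary, max_angle=10.0, step=0.5):
    """Estimate the skew (degrees) of text lines in a binary image.

    `binary` is a 2-D array whose text pixels are non-zero.
    A positive result means lines drift downward toward the right
    in image coordinates (y grows downward).
    """
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return 0.0  # nothing to measure on a blank image

    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        theta = np.deg2rad(angle)
        # Row coordinate each text pixel would have after undoing `angle`.
        rows = np.round(ys * np.cos(theta) - xs * np.sin(theta)).astype(int)
        profile = np.bincount(rows - rows.min()).astype(np.float64)
        # Well-aligned lines concentrate pixels in few rows, which
        # maximizes the sum of squared row counts.
        score = float(np.sum(profile ** 2))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

One way to apply the result is to pass the angle to cv2.getRotationMatrix2D and cv2.warpAffine; check the sign convention on a test image before relying on it.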
pytesseract
# pip install pytesseract; install Tesseract OCR binary on PATH
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.fromarray(th), lang="eng")
data = pytesseract.image_to_data(Image.fromarray(th), output_type=pytesseract.Output.DICT)
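With Output.DICT, image_to_data returns parallel lists (text, conf, left, top, width, height, and others) that give per-word boxes and confidences, which is useful for debugging.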
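A small sketch of how that dictionary might be filtered when debugging. The helper name `confident_words` and the default threshold of 60 are assumptions chosen for illustration. Rows that are not words carry a confidence of -1, and depending on the pytesseract version the confidences may arrive as strings, so the sketch converts them with float().

```python
def confident_words(data, min_conf=60.0):
    """Return (text, conf, (left, top, width, height)) for confident words.

    `data` is assumed to be the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may be a str or a number
        if conf < min_conf or not text.strip():
            continue  # skip layout rows (conf -1), blanks, low confidence
        box = (data["left"][i], data["top"][i],
               data["width"][i], data["height"][i])
        words.append((text, conf, box))
    return words
```

Calling confident_words(data) on the result above gives a list that can be drawn onto the image or logged to see where recognition is weak.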
Scene text (idea)
OpenCV’s DNN module can run a frozen EAST text-detection model or ONNX recognition models. A typical pipeline detects quadrilaterals or axis-aligned boxes, warps each crop to a fixed height, then runs a recognition network on it. Training on in-domain data usually yields better accuracy than generic English-only models.
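The "warp crops to fixed height" step can be sketched as pure geometry. The helper names `order_quad` and `fixed_height_target` and the height of 32 pixels are assumptions for illustration; the height a recognition model expects depends on the model. The corner ordering is a common heuristic that holds for modestly rotated boxes, not for arbitrary quadrilaterals.

```python
import numpy as np

def order_quad(pts):
    """Order four (x, y) corners as TL, TR, BR, BL (heuristic)."""
    pts = np.asarray(pts, dtype=np.float32)
    s = pts.sum(axis=1)          # x + y: smallest at TL, largest at BR
    d = pts[:, 1] - pts[:, 0]    # y - x: smallest at TR, largest at BL
    return np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                     pts[np.argmax(s)], pts[np.argmax(d)]],
                    dtype=np.float32)

def fixed_height_target(quad, target_h=32):
    """Return (src, dst, (width, height)) for a fixed-height warp."""
    src = order_quad(quad)
    tl, tr, br, bl = src
    width = (np.linalg.norm(tr - tl) + np.linalg.norm(br - bl)) / 2.0
    height = (np.linalg.norm(bl - tl) + np.linalg.norm(br - tr)) / 2.0
    # Keep the aspect ratio while scaling the crop to target_h.
    scale = target_h / max(float(height), 1e-6)
    target_w = max(1, int(round(float(width) * scale)))
    dst = np.array([[0, 0], [target_w - 1, 0],
                    [target_w - 1, target_h - 1], [0, target_h - 1]],
                   dtype=np.float32)
    return src, dst, (target_w, target_h)
```

With OpenCV, src and dst could be passed to cv2.getPerspectiveTransform, and the resulting matrix plus the (width, height) size to cv2.warpPerspective, to produce the crop for the recognizer.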
Takeaways
- Match engine to layout: printed forms vs street signs vs handwriting.
- Evaluate with character/word accuracy on a held-out set.
- Privacy: redact sensitive text, or avoid storing it, unless a policy covers its retention.
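The evaluation point above can be sketched with a plain Levenshtein distance. The function names here are illustrative, not from any OCR library. Accuracy is often reported as 1 minus the error rate, computed per character (CER) or per whitespace-separated word (WER) against ground-truth transcriptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)

def cer(ref_text, ocr_text):
    """Character error rate."""
    return error_rate(ref_text, ocr_text)

def wer(ref_text, ocr_text):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref_text.split(), ocr_text.split())
```

Averaging these over a held-out set of image and transcription pairs gives a number that can be compared across preprocessing choices and engines.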
Quick FAQ
For other languages, install the matching tessdata pack and pass lang="hin+eng" (example) to pytesseract.