Related Machine Learning Links
Learn Knn Machine Learning Tutorial, validate concepts with Knn Machine Learning MCQ Questions, and prepare interviews through Knn Machine Learning Interview Questions and Answers.
Machine Learning
KNN
Instance-based
K-Nearest Neighbours (KNN)
KNN is a simple, non-parametric algorithm that classifies a point based on the majority label among its K closest neighbours.
Intuition
- Store the entire training set in memory.
- To classify a new point, compute distances to all training points.
- Pick the top K closest points and vote their labels.
KNN with scikit-learn
KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
knn = KNeighborsClassifier(
n_neighbors=5,
metric="minkowski",
p=2 # Euclidean distance
)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
Distance Metrics & Variants
KNN relies entirely on the notion of distance. Common choices include:
- Euclidean distance (\(p = 2\)) – default for continuous features.
- Manhattan distance (\(p = 1\)) – more robust to outliers.
- Cosine similarity – for high‑dimensional sparse data (e.g., text).
scikit‑learn exposes this via the metric parameter, and you can also weight neighbours by distance:
knn = KNeighborsClassifier(
n_neighbors=10,
weights="distance", # closer neighbours have more influence
metric="manhattan"
)
Choosing K & Practical Tips
- Small K (e.g. 1–3) can overfit noise.
- Large K oversmooths decision boundaries and may underfit.
- Always scale numeric features (e.g. with StandardScaler) so that one feature’s units do not dominate distance.
- KNN can be expensive for very large datasets because prediction is \(O(N)\) per query.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
knn_clf = Pipeline(steps=[
("scaler", StandardScaler()),
("knn", KNeighborsClassifier(n_neighbors=7))
])