# Supervised Learning
Learning from labeled data — regression, classification, and generalization.
## What Is Supervised Learning?
Supervised learning trains a model on labeled examples — input-output pairs — so it can predict outputs for unseen inputs. The "supervision" comes from the labels.
Two main tasks:
- Classification: predict a discrete category (spam/not spam, cat/dog)
- Regression: predict a continuous value (house price, temperature)
## The Learning Pipeline
1. Collect data — gather labeled examples
2. Feature engineering — transform raw data into useful inputs
3. Split data — train / validation / test sets (typically 70/15/15)
4. Train model — minimize a loss function on training data
5. Evaluate — measure performance on held-out test data
6. Iterate — tune hyperparameters, add features, try different models
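The steps above can be sketched end to end with plain NumPy on synthetic data. The dataset, the 70/15/15 split sizes, and the least-squares fit are illustrative assumptions, not a specific library's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Collect data: 200 labeled examples, y = 3x + 1 + noise (synthetic)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.1, size=200)

# 2. Feature engineering: append a bias column
X_feat = np.hstack([X, np.ones((200, 1))])

# 3. Split: 70% train / 15% validation / 15% test
idx = rng.permutation(200)
train, val, test = idx[:140], idx[140:170], idx[170:]

# 4. Train: minimize MSE via the least-squares solution
w, *_ = np.linalg.lstsq(X_feat[train], y[train], rcond=None)

# 5. Evaluate on the held-out test set
mse = np.mean((X_feat[test] @ w - y[test]) ** 2)
print(w, mse)
```

With noise of standard deviation 0.1, the recovered weights land close to the true (3, 1) and the test MSE sits near the noise floor; the validation indices would be used in step 6 for hyperparameter tuning.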
## Loss Functions
The loss function measures how wrong the model's predictions are:
| Task | Loss Function | Formula |
|---|---|---|
| Regression | Mean Squared Error | (1/n) Σ(y - ŷ)² |
| Regression | Mean Absolute Error | (1/n) Σ\|y - ŷ\| |
| Binary Classification | Binary Cross-Entropy | -Σ(y log(ŷ) + (1-y) log(1-ŷ)) |
| Multi-class | Categorical Cross-Entropy | -Σ y_i log(ŷ_i) |
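The table's losses are a few lines each in NumPy. This sketch writes ŷ as `y_hat` and, unlike the sum form in the table, averages the cross-entropy over examples (both conventions are common):

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: average squared residual
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: average absolute residual
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1.0, 2.0])
y_pred = np.array([1.0, 3.0])
print(mse(y_true, y_pred))  # squared errors 0 and 1, mean 0.5
print(mae(y_true, y_pred))  # absolute errors 0 and 1, mean 0.5
```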
## Bias-Variance Tradeoff
Every model's error decomposes into:
Total Error = Bias² + Variance + Irreducible Error
- High bias (underfitting): model is too simple, misses patterns
- High variance (overfitting): model memorizes training data, fails on new data
The goal is the sweet spot — complex enough to capture patterns, simple enough to generalize.
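One way to see the tradeoff concretely: a flexible model always drives *training* error at least as low as a simple one, and that extra flexibility is exactly what produces variance on unseen data. A small illustrative sketch (the data and polynomial degrees are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=20)

def train_mse(degree):
    # Fit a polynomial of the given degree and score it on the SAME data
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple, flexible = train_mse(1), train_mse(9)
# The degree-9 fit has lower (or equal) training error than the line,
# but would wiggle wildly between points on fresh data: high variance.
print(simple, flexible)
```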
## Regularization
Regularization constrains the model to prevent overfitting:
### L1 Regularization (Lasso)
Adds λ Σ|w_i| to the loss. Drives some weights to exactly zero, performing feature selection. Produces sparse models.
### L2 Regularization (Ridge)
Adds λ Σw_i² to the loss. Shrinks all weights toward zero but doesn't eliminate any. Smoother solution.
### Elastic Net
Combines L1 and L2: λ₁ Σ|w_i| + λ₂ Σw_i². Gets the benefits of both.
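The L2 shrinkage effect is easy to demonstrate because ridge regression has a closed form, w = (XᵀX + λI)⁻¹Xᵀy. A sketch on synthetic data (sizes and λ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=50)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^{-1} X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=100.0)
# Larger lambda pulls every weight toward zero (but not exactly to zero,
# which is the contrast with L1).
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```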
### Other Techniques
- Dropout — randomly zero out neurons during training (neural networks)
- Early stopping — stop training when validation loss starts increasing
- Data augmentation — artificially expand the training set
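Early stopping in particular is simple to sketch: run gradient descent, watch the validation loss, and keep the best weights seen so far. Everything below (sizes, learning rate, `patience`) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.1, size=120)
Xtr, ytr, Xval, yval = X[:90], y[:90], X[90:], y[90:]

w = np.zeros(5)
best_val, best_w, patience, bad = np.inf, w, 5, 0
for epoch in range(500):
    # One gradient-descent step on the training MSE
    grad = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)
    w = w - 0.05 * grad
    # Track validation loss; stop after `patience` epochs without improvement
    val = np.mean((Xval @ w - yval) ** 2)
    if val < best_val:
        best_val, best_w, bad = val, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:
            break
print(epoch, best_val)
```

The weights returned are `best_w`, the snapshot from the best validation epoch, not the final `w`.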
## Evaluation Metrics

### Classification
- Accuracy — fraction correct (misleading with imbalanced classes)
- Precision — of predicted positives, how many are truly positive
- Recall — of actual positives, how many did we catch
- F1 Score — harmonic mean of precision and recall
- AUC-ROC — area under the receiver operating characteristic curve
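Precision, recall, and F1 fall straight out of the confusion-matrix counts. A from-scratch sketch on a tiny hand-picked example:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives: 2
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives: 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives: 1

precision = tp / (tp + fp)  # of predicted positives, fraction correct: 2/3
recall = tp / (tp + fn)     # of actual positives, fraction caught: 2/3
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```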
### Regression
- MSE / RMSE — penalizes large errors heavily
- MAE — treats all errors equally
- R² — proportion of variance explained (1.0 = perfect)
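R² is one minus the ratio of residual to total sum of squares, so predicting the mean everywhere scores 0 and a perfect fit scores 1. A minimal sketch:

```python
import numpy as np

def r2_score(y, y_hat):
    # 1 - SS_res / SS_tot: proportion of variance explained
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0])
print(r2_score(y, y))                          # 1.0: perfect predictions
print(r2_score(y, np.array([1.0, 2.0, 2.0])))  # 1 - 1/2 = 0.5
```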
## Cross-Validation
K-fold cross-validation provides a robust performance estimate:
1. Split data into K folds (typically K=5 or K=10)
2. For each fold: train on K-1 folds, evaluate on the held-out fold
3. Average the K scores
This reduces the variance of the estimate and uses all data for both training and validation.
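The procedure is short enough to write from scratch; this sketch runs 5-fold cross-validation for a least-squares linear model on synthetic data (the shuffle and fold construction follow the usual convention):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(0, 0.1, size=100)

K = 5
# Shuffle indices, then cut them into K roughly equal folds
folds = np.array_split(rng.permutation(100), K)
scores = []
for k in range(K):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # Train on K-1 folds, score MSE on the held-out fold
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
print(scores, float(np.mean(scores)))
```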
## Common Algorithms
| Algorithm | Type | Strengths |
|---|---|---|
| Linear/Logistic Regression | Both | Interpretable, fast, good baseline |
| Decision Trees | Both | Interpretable, handles non-linear relationships |
| Random Forest | Both | Robust, handles high dimensions, less overfitting |
| Gradient Boosting (XGBoost) | Both | State-of-the-art on tabular data |
| SVM | Both | Effective in high dimensions, kernel trick |
| Neural Networks | Both | Flexible, scales with data, learns features |