Supervised Learning

Learning from labeled data — regression, classification, and generalization.

What Is Supervised Learning?

Supervised learning trains a model on labeled examples — input-output pairs — so it can predict outputs for unseen inputs. The "supervision" comes from the labels.

Two main tasks:

  • Classification: predict a discrete category (spam/not spam, cat/dog)
  • Regression: predict a continuous value (house price, temperature)

The Learning Pipeline

  1. Collect data — gather labeled examples
  2. Feature engineering — transform raw data into useful inputs
  3. Split data — train / validation / test sets (typically 70/15/15)
  4. Train model — minimize a loss function on training data
  5. Evaluate — measure performance on held-out test data
  6. Iterate — tune hyperparameters, add features, try different models
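Step 3 of the pipeline can be sketched as follows. This is an illustrative pure-Python version (the function name, proportions, and seed are arbitrary choices for the example; in practice a library helper such as scikit-learn's `train_test_split` would be used):

```python
import random

def train_val_test_split(data, labels, fracs=(0.70, 0.15, 0.15), seed=0):
    """Shuffle example indices, then slice into train/validation/test."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # fixed seed: reproducible split
    n_train = int(fracs[0] * len(idx))
    n_val = int(fracs[1] * len(idx))

    def take(ix):
        return [data[i] for i in ix], [labels[i] for i in ix]

    return (take(idx[:n_train]),
            take(idx[n_train:n_train + n_val]),
            take(idx[n_train + n_val:]))

# 100 toy examples split 70 / 15 / 15
X = list(range(100))
y = [x % 2 for x in X]
train, val, test = train_val_test_split(X, y)
```

Shuffling before slicing matters: without it, any ordering in the data (e.g. by class or by date) leaks into the split.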

Loss Functions

The loss function measures how wrong the model's predictions are:

  Task                    Loss Function                Formula
  Regression              Mean Squared Error           (1/n) Σ(y - ŷ)²
  Regression              Mean Absolute Error          (1/n) Σ|y - ŷ|
  Binary Classification   Binary Cross-Entropy         -Σ(y log(ŷ) + (1-y) log(1-ŷ))
  Multi-class             Categorical Cross-Entropy    -Σ y_i log(ŷ_i)
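The four losses in the table translate directly into code. A minimal sketch (the cross-entropy losses are averaged per example here, and predictions are clipped away from 0 and 1 so the logarithm stays finite):

```python
import math

def mse(y, yhat):
    """Mean Squared Error: (1/n) Σ(y - ŷ)²"""
    return sum((t - p) ** 2 for t, p in zip(y, yhat)) / len(y)

def mae(y, yhat):
    """Mean Absolute Error: (1/n) Σ|y - ŷ|"""
    return sum(abs(t - p) for t, p in zip(y, yhat)) / len(y)

def binary_cross_entropy(y, yhat, eps=1e-12):
    """Binary Cross-Entropy, averaged over n examples."""
    yhat = [min(max(p, eps), 1 - eps) for p in yhat]   # keep log() finite
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y, yhat)) / len(y)

def categorical_cross_entropy(y, yhat, eps=1e-12):
    """Categorical Cross-Entropy for one example: y and ŷ are
    distributions over classes (y is typically one-hot)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y, yhat))
```

Note the asymmetry: MSE punishes large errors quadratically, while MAE grows linearly and is therefore more robust to outliers.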

Bias-Variance Tradeoff

Every model's error decomposes into:

Total Error = Bias² + Variance + Irreducible Error

  • High bias (underfitting): model is too simple, misses patterns
  • High variance (overfitting): model memorizes training data, fails on new data

The goal is the sweet spot — complex enough to capture patterns, simple enough to generalize.
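Both failure modes can be seen on a toy problem. The sketch below (synthetic data, arbitrary seed and noise level) compares a mean predictor, which is too simple to capture the signal, against a 1-nearest-neighbour predictor, which memorizes the noisy training labels exactly:

```python
import math
import random

random.seed(0)
xs = [i / 10 for i in range(30)]
def f(x): return math.sin(2 * x)                 # true underlying function
ys = [f(x) + random.gauss(0, 0.2) for x in xs]   # noisy training labels
xs_new = [x + 0.033 for x in xs]                 # nearby unseen inputs
ys_new = [f(x) for x in xs_new]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# High bias: always predict the training mean (too simple to follow the sine)
mean_y = sum(ys) / len(ys)
bias_train = mse([mean_y] * len(ys), ys)
bias_test = mse([mean_y] * len(ys_new), ys_new)

# High variance: 1-nearest-neighbour memorizes the training set
def one_nn(x):
    return min(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[1]

var_train = mse([one_nn(x) for x in xs], ys)         # exactly 0: memorization
var_test = mse([one_nn(x) for x in xs_new], ys_new)  # > 0: the noise carries over
```

The mean predictor has large error on both sets (bias); the 1-NN predictor has zero training error but nonzero test error, because the label noise it memorized does not generalize (variance).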

Regularization

Regularization constrains the model to prevent overfitting:

L1 Regularization (Lasso)

Adds λ Σ|w_i| to the loss. Drives some weights to exactly zero, performing feature selection. Produces sparse models.

L2 Regularization (Ridge)

Adds λ Σw_i² to the loss. Shrinks all weights toward zero but doesn't eliminate any. Smoother solution.

Elastic Net

Combines L1 and L2: λ₁ Σ|w_i| + λ₂ Σw_i². Gets the benefits of both.
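One way to see why L1 produces exact zeros while L2 only shrinks is to compare the closed-form one-weight updates each penalty induces. A minimal sketch (soft-thresholding is the proximal step for the L1 penalty; the L2 update is the single-weight ridge solution):

```python
def l1_shrink(w, lam):
    """Soft-thresholding: weights with |w| <= lam become exactly zero,
    which is where Lasso's sparsity / feature selection comes from."""
    sign = 1.0 if w >= 0 else -1.0
    return sign * max(abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """Ridge shrinkage: scales w toward zero, but a nonzero w
    never becomes exactly zero."""
    return w / (1.0 + 2.0 * lam)

weights = [3.0, 0.4, -0.05]
print([l1_shrink(w, 0.1) for w in weights])  # the smallest weight is zeroed
print([l2_shrink(w, 0.1) for w in weights])  # all shrunk, none exactly zero
```

Elastic Net applies both updates' effects at once: small weights are still zeroed (L1) while correlated features share the shrinkage more evenly (L2).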

Other Techniques

  • Dropout — randomly zero out neurons during training (neural networks)
  • Early stopping — stop training when validation loss starts increasing
  • Data augmentation — artificially expand the training set
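Dropout in particular is simple enough to sketch directly. This is the "inverted dropout" variant commonly used in practice: survivors are scaled by 1/(1-p) during training so the expected activation is unchanged, and inference applies no mask at all:

```python
import random

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    scale survivors by 1/(1-p); identity at inference time."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(1)
acts = [0.5, -1.2, 0.8, 2.0]
print(dropout(acts, p=0.5))                  # some units zeroed, rest doubled
print(dropout(acts, p=0.5, training=False))  # unchanged at inference
```

Forcing the network to tolerate randomly missing units discourages co-adaptation, which is why dropout acts as a regularizer.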

Evaluation Metrics

Classification

  • Accuracy — fraction correct (misleading with imbalanced classes)
  • Precision — of predicted positives, how many are truly positive
  • Recall — of actual positives, how many did we catch
  • F1 Score — harmonic mean of precision and recall
  • AUC-ROC — area under the receiver operating characteristic curve
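The first four metrics follow directly from the confusion-matrix counts. A minimal sketch, including an imbalanced example that shows why accuracy alone can mislead:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Imbalanced data: 10 positives, 90 negatives
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 5 + [0] * 5 + [0] * 85 + [1] * 5
m = classification_metrics(y_true, y_pred)
# accuracy looks strong (0.9) while precision/recall/F1 are only 0.5
```

With 90% negatives, a model that misses half the positives still scores 0.9 accuracy; F1 exposes the weakness.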

Regression

  • MSE / RMSE — penalizes large errors heavily
  • MAE — treats all errors equally
  • R² — proportion of variance explained (1.0 = perfect)
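The regression metrics are equally direct to compute. A minimal sketch, with R² defined as 1 minus the ratio of residual to total sum of squares:

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R² for paired prediction lists."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

m = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
```

Note that R² can be negative: a model worse than always predicting the mean has ss_res > ss_tot.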

Cross-Validation

K-fold cross-validation provides a robust performance estimate:

  1. Split data into K folds (typically K=5 or K=10)
  2. For each fold: train on K-1 folds, evaluate on the held-out fold
  3. Average the K scores

This reduces the variance of the estimate and uses all data for both training and validation.
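The three steps above can be sketched as an index generator (a minimal version: real code would shuffle first and often stratify by class, as scikit-learn's KFold and StratifiedKFold do):

```python
def kfold_indices(n, k=5):
    """Yield (train_indices, val_indices) for each of the k folds.
    Fold sizes differ by at most one when n is not divisible by k."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

# 10 examples, 3 folds: every example is validated exactly once
for train_idx, val_idx in kfold_indices(10, k=3):
    print(len(train_idx), "train /", len(val_idx), "val")
```

Averaging the per-fold scores is then one line: sum the K scores and divide by K.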

Common Algorithms

  Algorithm                     Type    Strengths
  Linear/Logistic Regression    Both    Interpretable, fast, good baseline
  Decision Trees                Both    Interpretable, handles non-linear relationships
  Random Forest                 Both    Robust, handles high dimensions, less overfitting
  Gradient Boosting (XGBoost)   Both    State-of-the-art on tabular data
  SVM                           Both    Effective in high dimensions, kernel trick
  Neural Networks               Both    Flexible, scales with data, learns features

Review Questions