# Supervised Learning
Learning from labeled data — regression, classification, and generalization.
## What Is Supervised Learning?
Supervised learning trains a model on labeled examples — input-output pairs — so it can predict outputs for unseen inputs. The "supervision" comes from the labels.
Two main tasks:
- Classification: predict a discrete category (spam/not spam, cat/dog)
- Regression: predict a continuous value (house price, temperature)
## The Learning Pipeline
1. Collect data — gather labeled examples
2. Feature engineering — transform raw data into useful inputs
3. Split data — train / validation / test sets (typically 70/15/15)
4. Train model — minimize a loss function on training data
5. Evaluate — measure performance on held-out test data
6. Iterate — tune hyperparameters, add features, try different models
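The steps above can be sketched end to end with plain NumPy on synthetic data. The dataset, the 70/15/15 split sizes, and the least-squares fit are illustrative assumptions, not a specific library's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Collect data: 200 labeled examples, y = 3x + 1 + noise (synthetic)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.1, size=200)

# 2. Feature engineering: append a bias column
X_feat = np.hstack([X, np.ones((200, 1))])

# 3. Split: 70% train / 15% validation / 15% test
idx = rng.permutation(200)
train, val, test = idx[:140], idx[140:170], idx[170:]

# 4. Train: minimize MSE via the least-squares solution
w, *_ = np.linalg.lstsq(X_feat[train], y[train], rcond=None)

# 5. Evaluate on the held-out test set
mse = np.mean((X_feat[test] @ w - y[test]) ** 2)
print(w, mse)
```

With noise of standard deviation 0.1, the recovered weights land close to the true (3, 1) and the test MSE sits near the noise floor; the validation indices would be used in step 6 for hyperparameter tuning.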
## Loss Functions
The loss function measures how wrong the model's predictions are:
| Task | Loss Function | Formula |
|---|---|---|
| Regression | Mean Squared Error | (1/n) Σ(y - ŷ)² |
| Regression | Mean Absolute Error | (1/n) Σ\|y - ŷ\| |
| Binary Classification | Binary Cross-Entropy | -Σ(y log(ŷ) + (1-y) log(1-ŷ)) |
| Multi-class | Categorical Cross-Entropy | -Σ y_i log(ŷ_i) |
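The table's losses are a few lines each in NumPy. This sketch writes ŷ as `y_hat` and, unlike the sum form in the table, averages the cross-entropy over examples (both conventions are common):

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: average squared residual
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: average absolute residual
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1.0, 2.0])
y_pred = np.array([1.0, 3.0])
print(mse(y_true, y_pred))  # squared errors 0 and 1, mean 0.5
print(mae(y_true, y_pred))  # absolute errors 0 and 1, mean 0.5
```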
## Bias-Variance Tradeoff
Every model's error decomposes into:
Total Error = Bias² + Variance + Irreducible Error
- High bias (underfitting): model is too simple, misses patterns
- High variance (overfitting): model memorizes training data, fails on new data
The goal is the sweet spot — complex enough to capture patterns, simple enough to generalize.
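One way to see the tradeoff concretely: a flexible model always drives *training* error at least as low as a simple one, and that extra flexibility is exactly what produces variance on unseen data. A small illustrative sketch (the data and polynomial degrees are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=20)

def train_mse(degree):
    # Fit a polynomial of the given degree and score it on the SAME data
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple, flexible = train_mse(1), train_mse(9)
# The degree-9 fit has lower (or equal) training error than the line,
# but would wiggle wildly between points on fresh data: high variance.
print(simple, flexible)
```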
## Regularization
Regularization constrains the model to prevent overfitting:
### L1 Regularization (Lasso)
Adds λ Σ|w_i| to the loss. Drives some weights to exactly zero, performing feature selection. Produces sparse models.
### L2 Regularization (Ridge)
Adds λ Σw_i² to the loss. Shrinks all weights toward zero but doesn't eliminate any. Smoother solution.
### Elastic Net
Combines L1 and L2: λ₁ Σ|w_i| + λ₂ Σw_i². Gets the benefits of both.
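The L2 shrinkage effect is easy to demonstrate because ridge regression has a closed form, w = (XᵀX + λI)⁻¹Xᵀy. A sketch on synthetic data (sizes and λ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=50)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^{-1} X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=100.0)
# Larger lambda pulls every weight toward zero (but not exactly to zero,
# which is the contrast with L1).
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```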
### Other Techniques
- Dropout — randomly zero out neurons during training (neural networks)
- Early stopping — stop training when validation loss starts increasing
- Data augmentation — artificially expand the training set
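Early stopping in particular is simple to sketch: run gradient descent, watch the validation loss, and keep the best weights seen so far. Everything below (sizes, learning rate, `patience`) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.1, size=120)
Xtr, ytr, Xval, yval = X[:90], y[:90], X[90:], y[90:]

w = np.zeros(5)
best_val, best_w, patience, bad = np.inf, w, 5, 0
for epoch in range(500):
    # One gradient-descent step on the training MSE
    grad = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr)
    w = w - 0.05 * grad
    # Track validation loss; stop after `patience` epochs without improvement
    val = np.mean((Xval @ w - yval) ** 2)
    if val < best_val:
        best_val, best_w, bad = val, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:
            break
print(epoch, best_val)
```

The weights returned are `best_w`, the snapshot from the best validation epoch, not the final `w`.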
## Evaluation Metrics

### Classification
- Accuracy — fraction correct (misleading with imbalanced classes)
- Precision — of predicted positives, how many are truly positive
- Recall — of actual positives, how many did we catch
- F1 Score — harmonic mean of precision and recall
- AUC-ROC — area under the receiver operating characteristic curve
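Precision, recall, and F1 fall straight out of the confusion-matrix counts. A from-scratch sketch on a tiny hand-picked example:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives: 2
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives: 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives: 1

precision = tp / (tp + fp)  # of predicted positives, fraction correct: 2/3
recall = tp / (tp + fn)     # of actual positives, fraction caught: 2/3
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```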
### Regression
- MSE / RMSE — penalizes large errors heavily
- MAE — treats all errors equally
- R² — proportion of variance explained (1.0 = perfect)
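R² is one minus the ratio of residual to total sum of squares, so predicting the mean everywhere scores 0 and a perfect fit scores 1. A minimal sketch:

```python
import numpy as np

def r2_score(y, y_hat):
    # 1 - SS_res / SS_tot: proportion of variance explained
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0])
print(r2_score(y, y))                          # 1.0: perfect predictions
print(r2_score(y, np.array([1.0, 2.0, 2.0])))  # 1 - 1/2 = 0.5
```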
## Cross-Validation
K-fold cross-validation provides a robust performance estimate:
1. Split data into K folds (typically K=5 or K=10)
2. For each fold: train on K-1 folds, evaluate on the held-out fold
3. Average the K scores
This reduces the variance of the estimate and uses all data for both training and validation.
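The procedure is short enough to write from scratch; this sketch runs 5-fold cross-validation for a least-squares linear model on synthetic data (the shuffle and fold construction follow the usual convention):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(0, 0.1, size=100)

K = 5
# Shuffle indices, then cut them into K roughly equal folds
folds = np.array_split(rng.permutation(100), K)
scores = []
for k in range(K):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # Train on K-1 folds, score MSE on the held-out fold
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
print(scores, float(np.mean(scores)))
```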
## Common Algorithms
| Algorithm | Type | Strengths |
|---|---|---|
| Linear/Logistic Regression | Both | Interpretable, fast, good baseline |
| Decision Trees | Both | Interpretable, handles non-linear relationships |
| Random Forest | Both | Robust, handles high dimensions, less overfitting |
| Gradient Boosting (XGBoost) | Both | State-of-the-art on tabular data |
| SVM | Both | Effective in high dimensions, kernel trick |
| Neural Networks | Both | Flexible, scales with data, learns features |