Train/Test Split
You can't evaluate a model on data it was trained on — it has already seen the answers. Splitting data into train, validation, and test sets gives you an honest measure of how well the model generalizes to new inputs.
Intuition First
Imagine a student who gets the exam questions in advance and memorizes all the answers. They score 100% — but have they learned anything?
Of course not. The exam stopped being a real test the moment they saw the questions.
Machine learning has the same problem. If you train and test on the same data, the model has already "seen the answers." Even a model that memorizes random noise would score perfectly. The test score becomes meaningless.
What's Actually Happening
During training, the model adjusts its weights to minimize loss on the training set. If that's also your evaluation set, the model has been specifically optimized for those examples.
To get an honest answer to "how does this model perform on data it has never seen?", you need to hold out some data and never show it during training.
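You can see the memorization problem directly with a tiny experiment. This is an illustrative sketch (not part of the formal material): an unconstrained decision tree fit to pure noise scores perfectly on the rows it memorized and at chance level on held-out rows.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Pure noise: labels have no relationship to features, so nothing real can be learned
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

tree = DecisionTreeClassifier().fit(X[:100], y[:100])  # unlimited depth -> memorizes
print(tree.score(X[:100], y[:100]))  # ~1.0 on the training rows (memorized)
print(tree.score(X[100:], y[100:]))  # ~0.5 on held-out rows (chance level)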
Build the Idea Step-by-Step
Formal Explanation
The Three Splits
Training set (~70–80% of data)
- Model sees this data and learns from it
- Loss is minimized on this set
- Model can overfit to this set
Validation set (~10–15%)
- Model never trains on this
- Used to tune hyperparameters (learning rate, regularization strength, architecture)
- If you use validation loss to make decisions, the model indirectly "learns" from it over many iterations (the early-stopping sketch after this list shows one such decision)
Test set (~10–15%)
- Touched once, at the very end, to report final performance
- Represents the true population the model will face
- If you tune anything based on test performance, it's no longer an honest estimate
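Here is a minimal early-stopping sketch showing how the validation set drives a training decision without ever contributing gradients. It assumes a linear model fit by gradient descent on synthetic data; all names and constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X_tr, X_v = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
true_w = rng.normal(size=5)
y_tr = X_tr @ true_w + rng.normal(scale=0.5, size=200)
y_v = X_v @ true_w + rng.normal(scale=0.5, size=50)

w = np.zeros(5)
best_val, best_w = np.inf, w.copy()
patience, wait = 10, 0
for epoch in range(1000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # MSE gradient on TRAIN only
    w -= 0.01 * grad
    val_loss = np.mean((X_v @ w - y_v) ** 2)           # monitored on VALIDATION
    if val_loss < best_val:
        best_val, best_w, wait = val_loss, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:  # stop once validation loss stops improving
            break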
Why You Need Both Validation AND Test
If you tune hyperparameters using the test set, you've effectively trained on it — you're using feedback from it to make decisions. The reported test performance will be optimistically biased.
The validation set absorbs this "leakage" from tuning. The test set stays clean.
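As a concrete protocol, the sketch below tunes a single hyperparameter using only the validation set and touches the test set exactly once at the end. Ridge and the 70/15/15 split are illustrative choices, not the only option.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Toy data and a quick 70/15/15 split (same scheme as the Examples section below)
X = np.random.randn(1000, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(1000)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.176, random_state=0)

best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:           # candidate hyperparameters
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = model.score(X_val, y_val)          # decisions use VALIDATION only
    if score > best_score:
        best_alpha, best_score = alpha, score

final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(f"test R^2: {final.score(X_test, y_test):.3f}")  # test touched once, at the end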
Key Properties / Rules
| Split | Model trains on it? | Used for? |
|---|---|---|
| Train | Yes | Weight updates |
| Validation | No | Hyperparameter tuning, early stopping |
| Test | No | Final reported accuracy (one time only) |
K-Fold Cross-Validation
When you have limited data and can't afford to hold out a fixed validation set:
- Divide data into k equal "folds" (typically k=5 or k=10)
- Train on k-1 folds, validate on the remaining fold
- Repeat k times, each time holding out a different fold
- Average the k validation scores
This gives a more reliable estimate of generalization performance, at the cost of k× more training runs.
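If you use scikit-learn, cross_val_score wraps this whole loop in one call. A minimal sketch (Ridge is just a stand-in model):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = np.random.randn(200, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(200)

scores = cross_val_score(Ridge(), X, y, cv=5)  # one validation score per fold
print(f"{scores.mean():.3f} ± {scores.std():.3f}")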
Why It Matters
Train/test separation is the foundation of honest model evaluation. Without it, you have no idea whether your model has actually learned anything.
In production:
- Distribution shift: train and test may come from different time periods or sources. Monitor this (a minimal drift check is sketched after this list).
- Data leakage: if future information leaks into training features (e.g., using tomorrow's price to predict today's), the model will appear to work in evaluation but fail in deployment.
- Test set contamination: if you build a model, check test error, tweak, check again... your "test" set has become a validation set. You need a fresh hold-out.
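One simple way to monitor distribution shift, sketched under the assumption that you log production inputs, is a per-feature two-sample test comparing training values against recent production values:

import numpy as np
from scipy.stats import ks_2samp

# Hypothetical 1-D feature: values at training time vs. values seen in production
train_feature = np.random.randn(1000)
prod_feature = np.random.randn(1000) + 0.3       # simulated shift in the mean

stat, p = ks_2samp(train_feature, prod_feature)  # two-sample Kolmogorov–Smirnov test
if p < 0.01:
    print(f"possible drift: KS={stat:.3f}, p={p:.2e}")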
Common Pitfalls
- Splitting before or after preprocessing? Always split first, then preprocess. If you normalize using statistics from the full dataset (including test), you've leaked information. Fit the scaler on train only, then transform validation and test.
- Random split for time-series data. For sequential data, you must split by time — train on past, test on future. A random split lets the model see "future" data during training, which will never be available in production.
- Tiny test sets. With 50 test examples, accuracy estimates have huge variance — a few examples either way swing results by 2–4%. Use cross-validation for small datasets (see the arithmetic after this list).
- Reporting the best validation score as test performance. The validation score is optimistically biased (you chose the run with the best validation score). The test set is for honest reporting.
- Reusing test set across multiple experiments. Each time you check test performance and make a decision, the test set becomes slightly "used." Reserve it for final reporting only.
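The tiny-test-set pitfall is just binomial arithmetic: the standard error of an accuracy estimate is sqrt(p(1−p)/n). A quick illustration, assuming a true accuracy of 80%:

import numpy as np

p = 0.80                               # assumed true accuracy
for n in [50, 500, 5000]:
    se = np.sqrt(p * (1 - p) / n)      # std of the accuracy estimate
    print(f"n={n:5d}: ±{se:.1%} (one std), each example worth {1/n:.1%}")
# n=50 gives roughly ±5.7%, and each example is worth 2 points of accuracy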
Examples
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
import numpy as np

X = np.random.randn(1000, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(1000)  # targets with real signal
# --- Standard three-way split ---
# First split off test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
# Then split train/val from the remaining
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.176, random_state=42
)  # 0.176 of the remaining 85% ≈ 15% of the total
print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
# Train: 700, Val: 150, Test: 150
# --- CORRECT: fit scaler on train only ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learns mean and std from train
X_val_scaled = scaler.transform(X_val) # applies train's mean/std
X_test_scaled = scaler.transform(X_test) # applies train's mean/std
# --- WRONG: fitting on all data ---
# scaler.fit(X) ← leaks test statistics into training
# --- K-Fold Cross-Validation ---
kf = KFold(n_splits=5, shuffle=True, random_state=42)
val_scores = []
for train_idx, val_idx in kf.split(X_trainval):
    X_tr, X_v = X_trainval[train_idx], X_trainval[val_idx]
    y_tr, y_v = y_trainval[train_idx], y_trainval[val_idx]
    model = Ridge().fit(X_tr, y_tr)            # train on k-1 folds
    val_scores.append(model.score(X_v, y_v))   # validate on the held-out fold
print(f"CV score: {np.mean(val_scores):.3f} ± {np.std(val_scores):.3f}")
Time-series split (never random):
# For sequential data: always split by time
n = len(X)
train_end = int(0.7 * n)
val_end = int(0.85 * n)
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]
# Past → train, present → val, future → test
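If you also want cross-validation for sequential data, scikit-learn's TimeSeriesSplit keeps the same past → future ordering across folds. A minimal sketch, reusing X from above:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices in every fold
    assert train_idx.max() < val_idx.min()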