Train/Test Split
You can't evaluate a model on data it was trained on — it has already seen the answers. Splitting data into train, validation, and test sets gives you an honest measure of how well the model generalizes to new inputs.
Intuition First
Imagine a student who gets the exam questions in advance and memorizes all the answers. They score 100% — but have they learned anything?
Of course not. The exam stopped being a real test the moment they saw the questions.
Machine learning has the same problem. If you train and test on the same data, the model has already "seen the answers." Even a model that memorizes random noise would score perfectly. The test score becomes meaningless.
What's Actually Happening
During training, the model adjusts its weights to minimize loss on the training set. If that's also your evaluation set, the model has been specifically optimized for those examples.
To get an honest answer to "how does this model perform on data it has never seen?", you need to hold out some data and never show it during training.
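You can see the memorization problem directly with a tiny experiment. This is an illustrative sketch (not part of the formal material): an unconstrained decision tree fit to pure noise scores perfectly on the rows it memorized and at chance level on held-out rows.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Pure noise: labels have no relationship to features, so nothing real can be learned
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

tree = DecisionTreeClassifier().fit(X[:100], y[:100])  # unlimited depth -> memorizes
print(tree.score(X[:100], y[:100]))  # ~1.0 on the training rows (memorized)
print(tree.score(X[100:], y[100:]))  # ~0.5 on held-out rows (chance level)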
Build the Idea Step-by-Step
Formal Explanation
The Three Splits
Training set (~70–80% of data)
- Model sees this data and learns from it
- Loss is minimized on this set
- Model can overfit to this set
Validation set (~10–15%)
- Model never trains on this
- Used to tune hyperparameters (learning rate, regularization strength, architecture)
- If you use validation loss to make decisions, the model indirectly "learns" from it over many iterations (the early-stopping sketch after this list shows one such decision)
Test set (~10–15%)
- Touched once, at the very end, to report final performance
- Represents the true population the model will face
- If you tune anything based on test performance, it's no longer an honest estimate
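Here is a minimal early-stopping sketch showing how the validation set drives a training decision without ever contributing gradients. It assumes a linear model fit by gradient descent on synthetic data; all names and constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X_tr, X_v = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
true_w = rng.normal(size=5)
y_tr = X_tr @ true_w + rng.normal(scale=0.5, size=200)
y_v = X_v @ true_w + rng.normal(scale=0.5, size=50)

w = np.zeros(5)
best_val, best_w = np.inf, w.copy()
patience, wait = 10, 0
for epoch in range(1000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # MSE gradient on TRAIN only
    w -= 0.01 * grad
    val_loss = np.mean((X_v @ w - y_v) ** 2)           # monitored on VALIDATION
    if val_loss < best_val:
        best_val, best_w, wait = val_loss, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:  # stop once validation loss stops improving
            break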
Why You Need Both Validation AND Test
If you tune hyperparameters using the test set, you've effectively trained on it — you're using feedback from it to make decisions. The reported test performance will be optimistically biased.
The validation set absorbs this "leakage" from tuning. The test set stays clean.
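As a concrete protocol, the sketch below tunes a single hyperparameter using only the validation set and touches the test set exactly once at the end. Ridge and the 70/15/15 split are illustrative choices, not the only option.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Toy data and a quick 70/15/15 split (same scheme as the Examples section below)
X = np.random.randn(1000, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(1000)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.176, random_state=0)

best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:           # candidate hyperparameters
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = model.score(X_val, y_val)          # decisions use VALIDATION only
    if score > best_score:
        best_alpha, best_score = alpha, score

final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(f"test R^2: {final.score(X_test, y_test):.3f}")  # test touched once, at the end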
Key Properties / Rules
| Split | Model trains on it? | Used for? |
|---|---|---|
| Train | Yes | Weight updates |
| Validation | No | Hyperparameter tuning, early stopping |
| Test | No | Final reported accuracy (one time only) |
K-Fold Cross-Validation
When you have limited data and can't afford to hold out a fixed validation set:
- Divide data into k equal "folds" (typically k=5 or k=10)
- Train on k-1 folds, validate on the remaining fold
- Repeat k times, each time holding out a different fold
- Average the k validation scores
This gives a more reliable estimate of generalization performance, at the cost of k× more training runs.
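If you use scikit-learn, cross_val_score wraps this whole loop in one call. A minimal sketch (Ridge is just a stand-in model):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = np.random.randn(200, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(200)

scores = cross_val_score(Ridge(), X, y, cv=5)  # one validation score per fold
print(f"{scores.mean():.3f} ± {scores.std():.3f}")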
Why It Matters
Train/test separation is the foundation of honest model evaluation. Without it, you have no idea whether your model has actually learned anything.
In production:
- Distribution shift: train and test may come from different time periods or sources. Monitor this (a minimal drift check is sketched after this list).
- Data leakage: if future information leaks into training features (e.g., using tomorrow's price to predict today's), the model will appear to work in evaluation but fail in deployment.
- Test set contamination: if you build a model, check test error, tweak, check again... your "test" set has become a validation set. You need a fresh hold-out.
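One simple way to monitor distribution shift, sketched under the assumption that you log production inputs, is a per-feature two-sample test comparing training values against recent production values:

import numpy as np
from scipy.stats import ks_2samp

# Hypothetical 1-D feature: values at training time vs. values seen in production
train_feature = np.random.randn(1000)
prod_feature = np.random.randn(1000) + 0.3       # simulated shift in the mean

stat, p = ks_2samp(train_feature, prod_feature)  # two-sample Kolmogorov–Smirnov test
if p < 0.01:
    print(f"possible drift: KS={stat:.3f}, p={p:.2e}")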
Common Pitfalls
- Splitting before or after preprocessing? Always split first, then preprocess. If you normalize using statistics from the full dataset (including test), you've leaked information. Fit the scaler on train only, then transform validation and test.
- Random split for time-series data. For sequential data, you must split by time — train on past, test on future. A random split lets the model see "future" data during training, which will never be available in production.
- Tiny test sets. With 50 test examples, accuracy estimates have huge variance — a few examples either way swing results by 2–4%. Use cross-validation for small datasets (see the arithmetic after this list).
- Reporting the best validation score as test performance. The validation score is optimistically biased (you chose the run with the best validation score). The test set is for honest reporting.
- Reusing test set across multiple experiments. Each time you check test performance and make a decision, the test set becomes slightly "used." Reserve it for final reporting only.
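The tiny-test-set pitfall is just binomial arithmetic: the standard error of an accuracy estimate is sqrt(p(1−p)/n). A quick illustration, assuming a true accuracy of 80%:

import numpy as np

p = 0.80                               # assumed true accuracy
for n in [50, 500, 5000]:
    se = np.sqrt(p * (1 - p) / n)      # std of the accuracy estimate
    print(f"n={n:5d}: ±{se:.1%} (one std), each example worth {1/n:.1%}")
# n=50 gives roughly ±5.7%, and each example is worth 2 points of accuracy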
Examples
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
import numpy as np

X = np.random.randn(1000, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(1000)  # targets with real signal
# --- Standard three-way split ---
# First split off test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
# Then split train/val from the remaining
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.176, random_state=42
)  # 0.176 of the remaining 85% ≈ 15% of the total
print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
# Train: 700, Val: 150, Test: 150
# --- CORRECT: fit scaler on train only ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learns mean and std from train
X_val_scaled = scaler.transform(X_val) # applies train's mean/std
X_test_scaled = scaler.transform(X_test) # applies train's mean/std
# --- WRONG: fitting on all data ---
# scaler.fit(X) ← leaks test statistics into training
# --- K-Fold Cross-Validation ---
kf = KFold(n_splits=5, shuffle=True, random_state=42)
val_scores = []
for train_idx, val_idx in kf.split(X_trainval):
    X_tr, X_v = X_trainval[train_idx], X_trainval[val_idx]
    y_tr, y_v = y_trainval[train_idx], y_trainval[val_idx]
    model = Ridge().fit(X_tr, y_tr)            # train on k-1 folds
    val_scores.append(model.score(X_v, y_v))   # validate on the held-out fold
print(f"CV score: {np.mean(val_scores):.3f} ± {np.std(val_scores):.3f}")
Time-series split (never random):
# For sequential data: always split by time
n = len(X)
train_end = int(0.7 * n)
val_end = int(0.85 * n)
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]
# Past → train, present → val, future → test
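If you also want cross-validation for sequential data, scikit-learn's TimeSeriesSplit keeps the same past → future ordering across folds. A minimal sketch, reusing X from above:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices in every fold
    assert train_idx.max() < val_idx.min()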