Mnemosyne
Quiz
156 questions across all topics.
Overfitting and Underfitting
What is the difference between overfitting and high variance?
Backpropagation
What is the vanishing gradient problem, and why does ReLU mitigate it where sigmoid does not?
Rank of a Matrix
LoRA fine-tunes LLMs by adding low-rank updates ΔW = BA. Why is the assumption that 'the update is low-rank' justified?
Matrices
Given A = [[1,2],[3,4]] and B = [[0,1],[1,0]], compute A@B and B@A. What does this confirm?
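A quick NumPy check of the computation (any array library would behave the same):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

print(A @ B)  # [[2 1] [4 3]] -- right-multiplying by B swaps A's columns
print(B @ A)  # [[3 4] [1 2]] -- left-multiplying by B swaps A's rows
print(np.array_equal(A @ B, B @ A))  # False: matrix multiplication is not commutative
```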
Learning Rate Effects
What is learning rate warmup and why is it needed for transformer training?
Learning Rate Effects
Describe the learning rate finder technique and how you would use it in practice.
Expectation
What is the expectation of a random variable, and what does it represent intuitively?
Cross-Entropy
Why is cross-entropy the standard loss function for classification, rather than, say, mean squared error?
Bias vs Variance
A model has 95% training accuracy and 60% test accuracy. Is this high bias or high variance? What should you do?
Orthogonality
What does it mean for two vectors to be orthogonal, and why is a zero dot product the right test for it?
Singular Value Decomposition (SVD)
PCA can be computed either via eigendecomposition of the covariance matrix or via SVD of the centered data matrix. When would you prefer SVD?
Train/Test Split
What is the difference between a validation set and a test set?
Vectors
Compute the cosine similarity between u = [1, 0, 1] and v = [0, 1, 1].
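A minimal NumPy sketch of the computation:

```python
import numpy as np

u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 1.0, 1.0])

cos_sim = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)  # 0.5 -- dot product 1 divided by ||u||*||v|| = sqrt(2)*sqrt(2) = 2
```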
Chain Rule
Differentiate y = (3x² + 1)⁴ using the chain rule. Show each step.
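One way to verify the chain-rule result dy/dx = 4(3x² + 1)³ · 6x = 24x(3x² + 1)³ is to compare against autograd at an arbitrary point (x = 2 here is just an example):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3 * x**2 + 1) ** 4
y.backward()

manual = 24 * 2.0 * (3 * 2.0**2 + 1) ** 3  # 24x(3x^2+1)^3 evaluated at x = 2
print(x.grad.item(), manual)  # both 105456.0
```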
Regularization
What is L2 regularization and what effect does it have on weights during training?
Entropy
A model's softmax output for an image is [0.6, 0.3, 0.1]. Compute the entropy of this distribution and explain what it tells you.
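A minimal sketch of the computation, in nats and in bits:

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
print(-np.sum(p * np.log(p)))   # ~0.898 nats
print(-np.sum(p * np.log2(p)))  # ~1.295 bits, vs. the 3-class maximum log2(3) ~ 1.585
```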
Softmax and Cross-Entropy
Derive ∂L/∂zᵢ for the combined softmax + cross-entropy loss. Why is this gradient particularly elegant?
Gradient Descent Intuition
Write the gradient descent update rule and explain what happens to training if you double the learning rate α.
Expectation
What is the expected loss in ML, and why does minimizing the empirical loss on a training set approximate minimizing it?
Derivatives
The loss function L has ∂L/∂w = -5 at the current weight value. Should you increase or decrease w to reduce the loss? How much does L change if you increase w by 0.01?
Matrix-Vector Multiplication
What are the two ways to interpret matrix-vector multiplication Ax, and which is more useful for understanding neural networks?
Convex vs Non-Convex Functions
What is a convex function, and why does gradient descent always find the global minimum for convex objectives?
Probability Rules
A classifier outputs softmax probabilities [0.7, 0.2, 0.1] for classes A, B, C. A test example is class A. What is the cross-entropy loss, and which probability rule justifies why the probabilities must sum to 1?
Gradient Descent Variants
How does momentum help gradient descent, and what does the hyperparameter β control?
Activation Functions
When should you use softmax vs sigmoid at the output layer, and what happens if you confuse them?
Partial Derivatives
A network has ∂L/∂w₁ = 3.0 and ∂L/∂w₂ = -0.1. Which weight is having the bigger effect on the loss right now? What does the sign tell you about which direction each should move?
Distributions
Describe the Normal distribution and explain what μ and σ control geometrically.
Basis and Dimensionality
How does the dimensionality of an embedding space affect a model's representational capacity?
Chain Rule
A network layer computes h = ReLU(Wx + b). To find ∂L/∂W during backprop, you need (∂L/∂h)·(∂h/∂W). What does each factor represent, and which direction does information flow?
Independence
What is conditional independence A ⊥ B | C, and why does it matter for Naive Bayes?
Gradient Descent Variants
What is the difference between batch gradient descent, SGD, and mini-batch SGD? Which is used in practice and why?
Orthogonality
What is an orthonormal basis and why is it preferred over just an orthogonal basis?
Learning Rate Effects
What is cosine annealing and why does decaying the learning rate near the end of training help?
Cross-Entropy
What is label smoothing and why does it improve model calibration?
Singular Value Decomposition (SVD)
How does SVD reveal the rank of a matrix, and why is computing rank via SVD more numerically stable than via row reduction?
Independence
Why does the i.i.d. assumption matter for mini-batch gradient descent, and what goes wrong when it's violated?
Cross-Entropy
What is the relationship between cross-entropy loss and KL divergence? Why does minimizing cross-entropy minimize KL divergence?
Regularization
How does weight decay (L2 regularization) prevent overfitting?
Convex vs Non-Convex Functions
What is a saddle point, and why are saddle points more common than local minima in high-dimensional neural network loss surfaces?
Gradient Descent Variants
Adam uses two moving averages. What are they, and why does each help?
Train/Test Split
What is k-fold cross-validation and when is it useful?
Variance and Standard Deviation
State the rule for Var(aX+b) and Var(X+Y), and explain when the sum rule fails.
Probability Rules
State the addition rule and explain when it simplifies to P(A) + P(B).
Bayes' Theorem
Explain the prosecutor's fallacy using Bayes' theorem.
Partial Derivatives
Why do we write ∂f/∂x instead of df/dx when a function has multiple inputs?
Loss Functions
Why does cross-entropy penalize confident wrong predictions so harshly?
Vectors
What does the dot product u·v = 0 tell you geometrically, and why does this matter for attention in transformers?
Backpropagation
Why must you call optimizer.zero_grad() before each backward pass in PyTorch? What goes wrong if you forget?
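A minimal PyTorch sketch of the accumulation behavior the question is pointing at (the scalar 'model' is purely illustrative):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

loss = (2.0 * w) ** 2   # dL/dw = 8w = 8.0
loss.backward()
print(w.grad)           # tensor(8.)

loss = (2.0 * w) ** 2   # same loss, second backward WITHOUT zeroing
loss.backward()
print(w.grad)           # tensor(16.) -- gradients accumulate rather than overwrite

w.grad.zero_()          # what optimizer.zero_grad() does for every parameter
```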
Eigenvalues and Eigenvectors
What is an eigenvector and eigenvalue, and what is the geometric intuition?
Random Variables
What is a CDF and how is it related to the PMF and PDF?
Activation Functions
Compute the sigmoid gradient at x = 0 and x = 5. Why does the value at x = 5 explain the vanishing gradient problem?
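A minimal sketch using σ'(x) = σ(x)(1 − σ(x)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (0.0, 5.0):
    s = sigmoid(x)
    print(x, s * (1 - s))  # 0.25 at x=0; ~0.0066 at x=5, so upstream gradients get crushed
```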
KL Divergence
How does KL divergence appear in the VAE loss function, and what would happen if you removed the KL term?
Matrix Decomposition
What do U, Σ, and Vᵀ represent in the SVD A = UΣVᵀ, and what is the geometric interpretation?
Basis and Dimensionality
What is the difference between an orthogonal and orthonormal basis, and why do orthonormal bases simplify computations?
Optimization Basics
Why is ∇L(w) = 0 necessary but not sufficient to confirm w is a local minimum?
Softmax and Cross-Entropy
What is the log-sum-exp trick and why is it needed for numerical stability in softmax?
Random Variables
A model outputs logits [2.0, 1.0, 0.1] for three classes. After softmax, what is the probability distribution, and what kind of random variable is the predicted class?
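A minimal sketch of the softmax computation:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)  # ~[0.659, 0.242, 0.099] -- a categorical (discrete) distribution over 3 classes
```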
Rank of a Matrix
A system of equations Ax = b has A with shape (5, 3) and rank 2. How many solutions exist and what does this mean geometrically?
KL Divergence
Explain how KL divergence is used in RLHF (Reinforcement Learning from Human Feedback). Why is it necessary?
Loss Functions
Why doesn't MSE work well as a loss function for classification tasks?
Regularization
What is the key difference between L1 and L2 regularization in terms of weight behavior?
Distributions
Why does Normal weight initialization use specific variance values (e.g., 2/n for He initialization), and what goes wrong with too-large or too-small variance?
Distributions
What is a Bernoulli distribution, when is it used, and how does it determine the correct loss function?
Variance and Standard Deviation
How do He and Xavier initializations use variance to prevent vanishing/exploding gradients?
Vectorized Operations
A Python loop processes 1,000 samples through a linear layer one at a time. How does vectorization change this, and what is the speedup mechanism?
Gradients
What does it mean for gradients to 'vanish' in a deep neural network, and what causes it?
Entropy
How is entropy connected to data compression? What does Shannon's source coding theorem tell us?
Regularization
Why does L1 regularization produce sparse models (weights equal to zero)?
Matrix Decomposition
Why should you almost never explicitly compute a matrix inverse, and what should you use instead?
Gradients
After a backward pass, a weight's gradient is ∂L/∂w = +8.5. In gradient descent with learning rate 0.01, what is the exact weight update? What if the gradient were -8.5?
Vectorized Operations
What is the difference between elementwise and matrix multiplication in PyTorch, and when would you use each?
Backpropagation
Derive the gradient ∂L/∂W for a single linear layer with pre-activation z = Wx + b and loss L. What shape must the gradient have?
KL Divergence
What does KL(P ‖ Q) measure, and what happens when Q assigns zero probability to an event that P says is possible?
Random Variables
Why does the choice of output distribution (discrete vs continuous) determine which loss function is appropriate?
Expectation
Why is E[f(X)] ≠ f(E[X]) in general? Give an example.
Derivatives
What does f'(x) = 0 tell you, and why isn't it enough to confirm you've found a minimum?
Linear Combinations and Span
Three vectors in ℝ²: v₁=[1,0], v₂=[0,1], v₃=[3,2]. What is their span, and are they linearly independent?
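A quick rank check with NumPy (vectors stacked as rows):

```python
import numpy as np

V = np.array([[1, 0], [0, 1], [3, 2]])
print(np.linalg.matrix_rank(V))  # 2 -- the span is all of R^2, so three vectors there must be dependent
```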
Eigenvalues and Eigenvectors
Matrix A = [[2,0],[0,5]]. What are its eigenvalues and eigenvectors, and what does this matrix do geometrically?
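A quick NumPy check:

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 5.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [2. 5.]
print(eigenvectors)  # columns are the standard basis vectors e1, e2
```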
Basis and Dimensionality
A dataset has 1000 features but the data matrix has rank 15. What does this mean, and what does it imply for a linear model?
Linear Combinations and Span
What is a linear combination and what is the span of a set of vectors?
Matrix-Vector Multiplication
What does it mean geometrically that a matrix transformation always maps the origin to the origin?
Convex vs Non-Convex Functions
Logistic regression has a convex loss surface, but a neural network doing the same binary classification does not. What causes this difference?
Linear Combinations and Span
Why does the span always pass through the origin, and how does this constrain what a linear model can represent?
Bias vs Variance
What is the bias-variance tradeoff in simple terms?
Overfitting and Underfitting
How do you diagnose overfitting from a training curve?
Convex vs Non-Convex Functions
Why does it not matter that gradient descent finds local minima instead of the global minimum in deep learning?
Softmax and Cross-Entropy
How does temperature scaling work in softmax, and why do language models use it during inference?
Variance and Standard Deviation
What is variance, how is it calculated, and why do we square the deviations?
Train/Test Split
What happens if you tune hyperparameters using test set performance?
Evaluation Metrics
Give an example where high recall is more important than high precision, and one where the opposite is true.
Derivatives
Using the power rule, find f'(x) for f(x) = 4x⁵ - 3x² + 7. What is the slope at x=1?
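One way to check f'(x) = 20x⁴ − 6x is an autograd comparison:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
f = 4 * x**5 - 3 * x**2 + 7
f.backward()
print(x.grad)  # tensor(14.) -- matches 20(1)^4 - 6(1) = 14
```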
Partial Derivatives
Given f(x, y) = x³y + 2y², find ∂f/∂x and ∂f/∂y.
Orthogonality
Why does orthogonal weight initialization (nn.init.orthogonal_) prevent gradient vanishing/explosion at the start of training?
Softmax and Cross-Entropy
Why does softmax use exponentiation rather than just normalizing the raw scores by dividing by their sum?
Singular Value Decomposition (SVD)
Why is the rank-k SVD truncation (keeping only the top k singular values) the *optimal* rank-k approximation to A?
Rank of a Matrix
What does the rank of a matrix measure, and what is the geometric interpretation of a rank-1 matrix?
Rank of a Matrix
Why does rank(A) = number of non-zero singular values, and what does a near-zero singular value tell you about the matrix?
Bias vs Variance
How does model complexity relate to bias and variance?
Vectorized Operations
What does torch.einsum('bqd,bkd->bqk', Q, K) compute, and why might you prefer einsum over explicit transposes?
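A minimal sketch showing the equivalence (the batch and dimension sizes here are arbitrary):

```python
import torch

B, Tq, Tk, D = 2, 4, 6, 8
Q = torch.randn(B, Tq, D)
K = torch.randn(B, Tk, D)

scores = torch.einsum('bqd,bkd->bqk', Q, K)  # unscaled attention scores, shape (B, Tq, Tk)
same = Q @ K.transpose(-2, -1)               # explicit-transpose version
print(torch.allclose(scores, same))          # True
```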
Expectation
State the linearity of expectation property and explain why it holds even for dependent random variables.
Partial Derivatives
Without computing anything: what is ∂f/∂z for f(x, y) = x²y + 3xy? Explain your reasoning.
Eigenvalues and Eigenvectors
What does a zero eigenvalue tell you about a matrix, and what are the practical implications?
Chain Rule
Why does the vanishing gradient problem arise specifically from stacking many sigmoid layers?
Gradient Descent Intuition
Adam optimizer uses adaptive per-weight learning rates. Weight w₁ has received gradient 8.0 every step for 100 steps. Weight w₂ has received gradient 0.01 every step. How does Adam adjust each weight's effective step size compared to vanilla SGD?
Optimization Basics
What is the difference between a local and global minimum? Why does finding the global minimum not matter much for neural network training in practice?
Entropy
Why does -log(p) measure the 'surprise' or information content of an event with probability p?
KL Divergence
KL divergence is not symmetric. What does KL(P ‖ Q) vs KL(Q ‖ P) each minimize, and how does this lead to different behavior?
Backpropagation
What are the two passes in backpropagation, and why does the forward pass need to save intermediate values?
Evaluation Metrics
A model predicts 'positive' for all 100 samples. 90 are actually negative, 10 are actually positive. Compute accuracy, precision, and recall.
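The confusion-matrix arithmetic, as a minimal sketch:

```python
tp, fp, fn, tn = 10, 90, 0, 0  # every sample predicted positive

print((tp + tn) / (tp + fp + fn + tn))  # accuracy  = 0.10
print(tp / (tp + fp))                   # precision = 0.10
print(tp / (tp + fn))                   # recall    = 1.00
```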
Matrices
What is the dimension rule for matrix multiplication, and what does each output entry represent?
Chain Rule
For y = sin(x²), use the chain rule to find dy/dx, then verify numerically at x = 1.
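A minimal numerical check of dy/dx = 2x·cos(x²) via a central difference:

```python
import numpy as np

f = lambda x: np.sin(x**2)
x, h = 1.0, 1e-6

numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = 2 * x * np.cos(x**2)
print(numeric, analytic)  # both ~1.0806 at x = 1
```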
Overfitting and Underfitting
Name three techniques to prevent overfitting.
Loss Functions
What is the key difference between MSE and cross-entropy loss, and when do you use each?
Independence
What does it mean for two events to be independent, and what is the formal test?
Overfitting and Underfitting
What is overfitting and why does it happen?
Gradient Descent Variants
Why is AdamW preferred over Adam for regularized training, and what does weight decay actually do?
Entropy
What does entropy measure, and why is a uniform distribution the highest-entropy distribution?
Eigenvalues and Eigenvectors
How does PCA use eigenvalues and eigenvectors, and what does each eigenvalue tell you?
Loss Functions
Calculate the MSE for predictions [2.5, 3.0, 5.0] against targets [1.0, 3.0, 4.0].
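A minimal sketch of the computation:

```python
import numpy as np

pred = np.array([2.5, 3.0, 5.0])
target = np.array([1.0, 3.0, 4.0])
print(np.mean((pred - target) ** 2))  # (2.25 + 0 + 1) / 3 ~ 1.083
```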
Derivatives
Why does PyTorch's x.backward() give you a derivative even though you never wrote the derivative formula yourself?
Matrix Decomposition
What is a low-rank approximation and why does it work for compression and LoRA?
Vectors
Why should you use cosine similarity instead of raw dot product when comparing embedding vectors?
Learning Rate Effects
What symptoms indicate a learning rate that is too high vs too low, and what is the typical fix for each?
Matrices
Why is matrix multiplication not commutative, and why does this matter when writing backpropagation?
Activation Functions
Why does a neural network with no activation functions collapse to a single linear transformation, regardless of depth?
Linear Combinations and Span
What does it mean for a set of vectors to be linearly dependent, and why does dependence matter for neural network weight matrices?
Probability Rules
What is the multiplication rule, and how does it extend to the chain rule for sequences?
Optimization Basics
f(x) = x³ - 3x. Find all critical points, classify each (min/max/saddle), and compute the function values there.
Optimization Basics
The training loss has plateaued at 0.45 for 20 epochs. You suspect a saddle point. What is the gradient at a saddle point, and why might mini-batch SGD help escape it while full-batch gradient descent might not?
Evaluation Metrics
What does the F1 score capture that accuracy misses?
Matrix-Vector Multiplication
A weight matrix W has shape (4, 3). What is the shape of the output when you multiply it by an input vector of shape (3,)? What about a batch of 8 inputs?
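A quick shape check in PyTorch (the batch-of-rows layout matches the usual nn.Linear convention):

```python
import torch

W = torch.randn(4, 3)
x = torch.randn(3)
print((W @ x).shape)    # torch.Size([4])

X = torch.randn(8, 3)   # batch of 8 inputs, one per row
print((X @ W.T).shape)  # torch.Size([8, 4])
```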
Train/Test Split
Why can't you evaluate a model on its training data?
Distributions
What is a Binomial distribution, and how does it relate to the Bernoulli?
Basis and Dimensionality
What makes a set of vectors a basis, and why does every basis of the same space have the same number of vectors?
Matrices
What is a symmetric matrix and why are covariance matrices always symmetric?
Gradients
Why is the gradient vector always perpendicular to the level curves of a function?
Bayes' Theorem
State Bayes' theorem and name each component in the context of a disease test.
Bias vs Variance
A model's test error doesn't improve when you add more training data. What does this tell you about bias vs variance?
Vectors
What is the projection of u = [3, 1] onto v = [1, 0], and what does projection represent geometrically?
Random Variables
What is the difference between a discrete and continuous random variable, and what do PMF and PDF each represent?
Vectorized Operations
Explain NumPy/PyTorch broadcasting rules. What shape is the output of (100, 1) + (5,)?
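A one-line check of the shape question:

```python
import numpy as np

print((np.zeros((100, 1)) + np.zeros(5)).shape)  # (100, 5): (5,) -> (1, 5), then both size-1 axes stretch
```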
Bayes' Theorem
How does Bayes' theorem explain why L2 regularization (weight decay) is equivalent to assuming a Gaussian prior on weights?
Orthogonality
In transformer self-attention, how does orthogonality relate to which tokens attend to each other?
Independence
Are mutually exclusive events with positive probability independent? Explain.
Matrix-Vector Multiplication
Why does a linear neural network (no activations) always reduce to a single matrix multiplication, no matter how deep?
Gradient Descent Intuition
After 50 training epochs, training loss = 0.05 and validation loss = 0.48. Is this a gradient descent failure? What should you do?
Bayes' Theorem
A disease affects 1% of people. A test has 95% sensitivity (true positive rate) and 95% specificity (true negative rate). What is P(disease | positive test)?
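The Bayes computation, as a minimal sketch:

```python
p_disease = 0.01
sensitivity = 0.95   # P(positive | disease)
specificity = 0.95   # P(negative | no disease)

p_positive = sensitivity * p_disease + (1 - specificity) * (1 - p_disease)
print(sensitivity * p_disease / p_positive)  # ~0.161 -- most positives are false positives
```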
Cross-Entropy
A model assigns probability 0.7 to the correct class. What is the cross-entropy loss? What if it assigns 0.1?
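The arithmetic, as a one-liner:

```python
import numpy as np

for p in (0.7, 0.1):
    print(p, -np.log(p))  # ~0.357 at p=0.7, ~2.303 at p=0.1 -- the loss blows up as p -> 0
```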
Evaluation Metrics
What is the difference between precision and recall?
Activation Functions
What is the dead ReLU problem, and how does Leaky ReLU fix it?
Gradient Descent Intuition
Why does mini-batch SGD often generalize better than full-batch gradient descent, even though its gradient estimate is noisier?
Singular Value Decomposition (SVD)
SVD decomposes any matrix A as UΣVᵀ. What do U, Σ, and V each represent geometrically?
Probability Rules
What is conditional probability P(A|B), and how does it differ from P(A)?
Variance and Standard Deviation
Why does batch normalization work, and which properties of variance does it exploit?
Gradients
For f(x,y) = 3x² + xy, compute ∇f at the point (2, 1). What direction does this gradient point, and what does its magnitude tell you?
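An autograd check of ∇f = (6x + y, x):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)
f = 3 * x**2 + x * y
f.backward()
print(x.grad, y.grad)  # tensor(13.) tensor(2.) -- the direction of steepest ascent at (2, 1)
```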
Matrix Decomposition
A 100×100 matrix has singular values [50, 20, 10, 1, 1, 0.1, ...]. What is the effective rank, and how much of the 'energy' does a rank-2 approximation capture?
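A minimal sketch of the 'energy' computation (using squared singular values, i.e. the Frobenius norm):

```python
import numpy as np

s = np.array([50.0, 20.0, 10.0, 1.0, 1.0, 0.1])  # leading singular values; the rest are ~0
energy = s**2
print(energy[:2].sum() / energy.sum())  # ~0.966 -- a rank-2 cut already keeps ~97% of the energy
```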