Mnemosyne

Reinforcement Learning

Learning through interaction — agents, rewards, and optimal policies.

What Is Reinforcement Learning?

Reinforcement learning (RL) trains an agent to make decisions by interacting with an environment. The agent takes actions, receives rewards (or penalties), and learns a policy — a mapping from states to actions that maximizes cumulative reward.

Unlike in supervised learning, there are no labeled examples telling the agent the correct action. The agent discovers what works through trial and error.

The RL Framework

The core loop:

  1. Agent observes state s
  2. Agent takes action a according to its policy
  3. Environment returns reward r and next state s'
  4. Agent updates its policy based on the experience

Formally described as a Markov Decision Process (MDP): a tuple (S, A, P, R, γ) where S is the state space, A is the action space, P is the transition function, R is the reward function, and γ is the discount factor.
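The loop above can be sketched in a few lines of Python. The corridor environment below is a made-up toy, and its reset/step interface is an assumption of this sketch, not a standard API:

```python
import random

class Env:
    """Toy 1-D corridor MDP: states 0..4, reward only at the right end (hypothetical example)."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):                    # a is -1 (left) or +1 (right)
        self.s = max(0, min(4, self.s + a))
        reward = 1.0 if self.s == 4 else 0.0
        done = self.s == 4
        return self.s, reward, done       # environment returns r and next state s'

def random_policy(s):
    """A placeholder policy: pick a direction uniformly at random."""
    return random.choice([-1, +1])

env = Env()
s = env.reset()
total = 0.0
for _ in range(100):
    a = random_policy(s)                  # 2. act according to the policy
    s, r, done = env.step(a)              # 3. observe reward and next state
    total += r                            # (a learning agent would update here: step 4)
    if done:
        s = env.reset()
```

A learning agent replaces `random_policy` with something that improves from the `(s, a, r, s')` tuples it observes.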

Key Concepts

Discount Factor (γ)

A value between 0 and 1 that determines how much the agent values future rewards versus immediate rewards. γ=0 is purely greedy (only immediate reward). γ=0.99 strongly considers long-term consequences.
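A quick way to see the effect of γ is to compute the discounted return of a fixed reward sequence. The helper below is just for illustration:

```python
def discounted_return(rewards, gamma):
    """G = r0 + gamma*r1 + gamma^2*r2 + ..., computed back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# The same rewards are worth far less to a short-sighted agent:
rewards = [1.0, 1.0, 1.0, 1.0]
discounted_return(rewards, 0.0)   # 1.0 (only the immediate reward counts)
discounted_return(rewards, 0.99)  # 3.940399 (long-term rewards nearly fully valued)
```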

Value Function

V(s) estimates the expected cumulative reward starting from state s and following the policy. Used to evaluate how good a state is.

Q-Function

Q(s, a) estimates the expected cumulative reward of taking action a in state s, then following the policy. Used to choose the best action.
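For a greedy policy the two are linked by V(s) = max over a of Q(s, a). A minimal tabular sketch; the states, actions, and Q-values below are invented:

```python
Q = {                                     # hypothetical Q-table: (state, action) -> value
    ("s0", "left"): 0.2, ("s0", "right"): 0.8,
    ("s1", "left"): 0.5, ("s1", "right"): 0.1,
}

def greedy_value(Q, s, actions):
    """Under a greedy policy, V(s) is the best achievable Q-value in s."""
    return max(Q[(s, a)] for a in actions)

def greedy_action(Q, s, actions):
    """The action the greedy policy picks in state s."""
    return max(actions, key=lambda a: Q[(s, a)])

greedy_value(Q, "s0", ["left", "right"])   # 0.8
greedy_action(Q, "s1", ["left", "right"])  # "left"
```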

Policy

The agent's strategy. Can be deterministic (maps state to a single action) or stochastic (maps state to a probability distribution over actions).
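Both kinds can be written as plain functions over states; the states, actions, and probabilities below are hypothetical:

```python
import random

def deterministic_policy(s):
    """Maps each state to exactly one action."""
    return "right" if s >= 0 else "left"

def stochastic_policy(s):
    """Maps each state to a distribution over actions, then samples from it."""
    probs = {"left": 0.3, "right": 0.7}   # made-up probabilities
    return random.choices(list(probs), weights=list(probs.values()))[0]
```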

Exploration vs Exploitation

The fundamental dilemma: should the agent exploit what it already knows works, or explore unknown actions that might be better?

Strategies

  • ε-greedy: with probability ε take a random action, otherwise take the best known action. Decay ε over time.
  • UCB (Upper Confidence Bound): choose the action with the highest upper confidence bound on its value, balancing exploitation with uncertainty.
  • Thompson Sampling: maintain a probability distribution over each action's expected reward; draw one sample from each distribution and pick the action whose sample is highest.
  • Boltzmann Exploration: choose actions with probability proportional to their estimated value (softmax over Q-values).
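As a concrete example, ε-greedy is only a few lines. The Q-value dictionary shape is an assumption of this sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> estimated value. Returns a chosen action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore: any action uniformly
    return max(q_values, key=q_values.get)      # exploit: best known action

# In practice epsilon is decayed over time, e.g. eps = max(0.05, eps * 0.995)
q = {"a": 0.1, "b": 0.9}
epsilon_greedy(q, 0.0)   # "b" (epsilon=0 always exploits)
```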

Model-Free Methods

Learn directly from experience without building a model of the environment.

Q-Learning (Off-Policy)

Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]

Learns the optimal Q-function regardless of the policy being followed. The "max" makes it off-policy — it assumes optimal future behavior.
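A single tabular Q-learning step might look like this; the helper name, table layout, and hyperparameters are illustrative choices, not a fixed API:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step: off-policy, bootstraps on the max over next actions."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)                    # all Q-values start at 0
q_learning_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
Q[("s0", "right")]   # 0.1  (alpha * TD error, since the target is 1.0 and Q was 0)
```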

SARSA (On-Policy)

Update rule: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]

Uses the actual next action a' taken by the policy, making it on-policy. More conservative than Q-learning — accounts for exploration.
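A tabular SARSA step differs from Q-learning only in the bootstrap target: it plugs in the actual next action a' instead of a max. A sketch under the same illustrative tabular assumptions:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: on-policy, bootstraps on the action the policy actually took."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)
Q[("s1", "left")] = 0.5                   # suppose the policy explores "left" next
sarsa_update(Q, "s0", "right", 1.0, "s1", "left")
# target = 1.0 + 0.99 * 0.5 = 1.495, so Q[("s0","right")] becomes 0.1 * 1.495 = 0.1495
```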

Policy Gradient

Directly optimize the policy by computing gradients of expected reward with respect to policy parameters. Handles continuous action spaces naturally. Foundation of modern deep RL (PPO, A3C).
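The idea can be shown on the smallest possible case: REINFORCE with a softmax policy on a two-armed bandit. The arm reward probabilities below are invented for the demonstration:

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]                        # one policy parameter per action
true_reward = [0.2, 0.8]                  # hypothetical success probability per arm

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

alpha = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    r = 1.0 if random.random() < true_reward[a] else 0.0
    # For a softmax policy, grad of log pi(a) w.r.t. theta_i is 1[i == a] - pi(i)
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * r * grad_log  # ascend the reward-weighted gradient

# After training, the softmax shifts probability mass toward the better arm.
```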

Model-Based Methods

Learn a model of the environment (transition dynamics and rewards), then use it to plan.

Pros: much more sample efficient — can "imagine" experiences without real interaction. Cons: model errors compound over long planning horizons.

Examples: Dyna-Q (combines real and simulated experience), MuZero (learned model for game planning).
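A minimal Dyna-Q sketch, assuming a deterministic environment so the learned model can be a simple lookup table (all names and hyperparameters here are illustrative):

```python
import random
from collections import defaultdict

Q = defaultdict(float)
model = {}                                # (s, a) -> (r, s') for a deterministic world
alpha, gamma, actions = 0.1, 0.99, ["left", "right"]

def update(s, a, r, s_next):
    """Standard tabular Q-learning update."""
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

def dyna_q_step(s, a, r, s_next, planning_steps=5):
    update(s, a, r, s_next)               # learn from the real transition
    model[(s, a)] = (r, s_next)           # remember it in the model
    for _ in range(planning_steps):       # "imagine" remembered transitions
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        update(ps, pa, pr, ps_next)

dyna_q_step("s0", "right", 1.0, "goal")
# One real step plus five simulated replays: Q grows much faster than with one update.
```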

Deep Reinforcement Learning

Combine deep neural networks with RL:

  • DQN — neural network approximates Q-function. Used to play Atari games.
  • PPO (Proximal Policy Optimization) — stable policy gradient method. Widely used (robotics, RLHF for LLMs).
  • SAC (Soft Actor-Critic) — maximizes reward and entropy (encourages exploration). Good for continuous control.

Reward Shaping

Designing the reward function is critical and difficult:

  • Sparse rewards (only at goal): hard to learn, agent may never discover the goal
  • Dense rewards (every step): faster learning, but risk of reward hacking — agent finds unintended shortcuts
  • Curriculum learning: start with easy tasks, gradually increase difficulty
  • RLHF: use human feedback as the reward signal (used to align LLMs)

Review Questions