How Did AlphaGo Beat Lee Sedol?
From AlphaGo to Tic-Tac-Toe: Building Your First AI Game Player

Watching AlphaGo face Lee Sedol in 2016 was unforgettable. There was an AI program taking on one of the greatest Go players in history — and winning. If you haven’t seen AlphaGo — The Movie yet, stop reading and watch it. Seriously. I’ll wait.
The documentary captures something extraordinary: not just the DeepMind team’s breakthrough but also Lee Sedol’s incredible grace in what must have been one of his most challenging moments.
After watching those matches, I couldn’t help but wonder: How exactly does a computer learn to play a game? While Go is immensely complex, the principles behind game-playing AI can be explored with simpler games, like Tic-Tac-Toe.
In this article, we’ll build a simple AI using those same principles. We won’t be creating the next Go champion, but you’ll grasp the fundamental ideas that can eventually be scaled up to beat a world champion.
Using modern tools like Gymnasium, we will create an AI that learns through the same basic process — by playing against itself and learning from experience.
Don’t worry if you’re new to AI — if you know some Python basics, you’re ready for this journey. I’ve put all the code in this GitHub repository so you can follow along or experiment on your own.
My path into reinforcement learning
After 30 years of programming and a few years of working with deep learning and TensorFlow, reinforcement learning remained my blind spot. But I kept seeing these incredible examples of AIs mastering Atari games — playing Breakout better than any human, discovering strategies no one had thought of. That caught my attention.
AI was doing something beyond pattern matching — learning to play through pure trial and error, just like we do.
That curiosity led me down several learning paths. I started with Hugging Face’s reinforcement learning course by Thomas Simonini, which lets you recreate classic reinforcement learning experiments on Atari games like Space Invaders and Breakout.
There was something fascinating about watching my agents grow from random button-mashing to skilled gameplay. The course’s global leaderboard added a competitive spark that kept me returning.

To deepen my understanding, I explored several other outstanding resources:
- David Silver’s reinforcement learning lectures, offering insights from a key member of the AlphaGo team
- UC Berkeley’s CS 285 course on Deep Reinforcement Learning (Fall 2023), providing a comprehensive theoretical foundation
- Stanford’s CS234 Reinforcement Learning course (Winter 2023), offering another valuable academic perspective
- Nicholas Renotte’s practical tutorials on implementing reinforcement learning in games
Having some deep learning background helped, but reinforcement learning introduced this extra dimension of agents actively exploring and learning from their environment. It was like adding a strategic gaming layer to everything I knew about AI.
This curiosity led me to the Tic-Tac-Toe project. After exploring various approaches, I chose Q-learning because it strikes a perfect balance — simple enough to understand completely but powerful enough to teach an AI real strategy.
Building our AI player
While AlphaGo mastered the complex game of Go using deep neural networks and advanced algorithms, we will start with something more manageable: Tic-Tac-Toe. Don’t let its simplicity fool you — this game is perfect for learning core AI concepts because it:
- Has clear rules and limited moves
- Provides quick feedback with short games
- Requires both tactical moves and strategic thinking
- Demonstrates key concepts like state evaluation and decision-making
Our project breaks down into three essential parts:
- Creating a game environment where our AI can play
- Teaching the AI through Q-learning
- Training through self-play
Think of it as teaching a child to play Tic-Tac-Toe. First, you create the playing space and explain the rules (our environment). Then, you help the child understand good and bad moves (Q-learning). Finally, they improve by practicing with others (self-play).
We will build this using Gymnasium, the framework that powers more complex AI training environments. While our game is simpler than Go, the fundamental principles of learning, decision-making, and improvement remain the same.
The journey begins with Q-learning, the algorithm powering our solution.
Understanding Q-Learning
Q-learning might seem overwhelming at first. But here’s the thing — it’s pretty straightforward when you break it down. Let’s visualize how Q-learning works in our Tic-Tac-Toe game as a simple learning cycle.

At each step of the cycle, the agent:
- Observes the current game state
- Decides between exploring new moves and exploiting known good moves
- Receives a reward (e.g., +1 for a win, -1 for a loss, 0 for a draw)
- Updates its knowledge (Q-values) for that state-action pair
- Moves to the new game state
But how does it start learning? Like a beginner, it starts by making random moves. This exploration phase is essential for the AI to try various moves and learn from the outcomes.
Each game ends with a reward: positive for winning, negative for losing, and zero for a draw. Over time, these rewards are linked to the moves that led to them.
In Q-learning, each action is assigned a ‘quality’ score, or Q-value. The Bellman equation helps update this score for better decision-making. Higher scores mean better moves.
For instance, the AI might learn that taking the center square early leads to better outcomes, giving that action a high Q-value.
At the heart of Q-learning is the Bellman equation:
Q(s, a) ← Q(s, a) + α [R + γ · max_a' Q(s', a') − Q(s, a)]
This equation updates the Q-value for a given state-action pair (s, a). Here:
- Q(s, a) is the current Q-value for state s and action a,
- α is the learning rate, controlling how much new information overrides old,
- R is the immediate reward received,
- γ is the discount factor, weighing future rewards,
- max_a' Q(s', a') is the maximum expected future reward achievable from the next state s'.
The Bellman equation’s idea is simple but powerful: the quality of a move considers both its immediate reward and the future rewards it leads to. In chess, a move’s value is more than its immediate outcome; it also reflects the winning positions it sets up later.
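To make the update concrete, here is the same formula applied once in plain Python with hand-picked numbers (the values are illustrative, not taken from the trained agent):

alpha, gamma = 0.1, 0.95
q_current = 0.2        # current estimate Q(s, a)
reward = 0.0           # a non-terminal move with no immediate reward
max_future_q = 0.6     # best Q-value available from the next state s'

target = reward + gamma * max_future_q
q_updated = q_current + alpha * (target - q_current)
print(round(q_updated, 3))   # 0.237: the estimate nudges toward the target

The learning rate keeps each update small, so a single lucky game never rewrites the agent’s entire strategy.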
Q-learning is ideal for tabular settings — situations where we can list all possible states and actions in a table. Tic-Tac-Toe, with its limited board positions and moves, is perfect for this. But complex games like Go need more sophisticated approaches, like AlphaGo’s deep learning model.
The real power of Q-learning comes from self-play. Our AI builds a strategy guide by playing thousands of games against itself. It learns from successes and failures, becoming a stronger player over time.
As it gains experience, it relies less on random moves and more on learned strategies — just like a human player. While Q-learning has many nuances, this foundational understanding helps us follow its progress as we build our game.
Let’s set up our Tic-Tac-Toe environment and see these concepts in action.
Setting up our game environment
To learn, the AI needs an environment to play in. Think of it as a video game. The AI needs to see the game state, make moves, and receive feedback on its performance.
Enter Gymnasium, the actively maintained successor to OpenAI’s Gym toolkit for reinforcement learning. Its main goal is to provide a rich collection of RL environments behind a unified interface, and it powers everything from simple arcade games to complex simulations. Whether you’re teaching an AI to play Space Invaders, land a rocket, or drive a car, Gymnasium provides a consistent way to:
- Show the AI the game state.
- Allow it to make moves.
- Provide feedback through rewards.
- Update the game state.
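To see what that unified interface looks like in practice, here is the standard Gymnasium interaction loop with one of its built-in environments (CartPole is used purely as an illustration; it is not part of this project):

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

for _ in range(10):
    action = env.action_space.sample()                            # a random move, for now
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()

Every Gymnasium environment, including the one we build next, follows this same reset/step rhythm.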
We will create our own Tic-Tac-Toe environment using Gymnasium’s framework. Here’s the core of our game environment:
import random
from typing import Any, Dict, Literal, Optional, Tuple

import gymnasium as gym
from gymnasium import spaces

# PygameRenderer and Player are small helper classes from the project repository.

class TicTacToeEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 30}

    def __init__(self, render_mode: Literal["human", "rgb_array"] = "human"):
        super().__init__()
        self.renderer = PygameRenderer()
        self.render_mode = render_mode
        self.observation_space = spaces.Dict({
            "board": spaces.Text(min_length=9, max_length=9, charset="XO-"),
            "current_player": spaces.Discrete(2),
        })
        self.action_space = spaces.Discrete(9)
        self.state = self._create_initial_state()

    def reset(self, seed: Optional[int] = None, options: Optional[Dict] = None) -> Tuple[Dict[str, Any], Dict]:
        super().reset(seed=seed)
        self.state = self._create_initial_state()
        self.state.current_player = random.choice([Player.X_SYMBOL, Player.O_SYMBOL])
        return self._get_observation(), {}

    def step(self, action: int) -> Tuple[Dict[str, Any], float, bool, bool, Dict]:
        if self._is_valid_move(action):
            self._make_move(action)
            reward, self.state.done = self._evaluate_game_state()
            if not self.state.done:
                self._switch_player()
            return self._get_observation(), reward, self.state.done, False, {}
        # An invalid move ends the episode with a -1 penalty
        return self._get_observation(), -1, True, False, {}
This code creates a game environment where:
- The AI sees the board as a string of X’s, O’s, and dashes
- It can make moves by choosing a number 0–8 (representing board positions)
- It gets rewards for winning moves and penalties for invalid ones
- The game tracks whose turn it is and when the game ends
Each method serves a specific purpose:
- __init__: Sets up the game rules and what the AI can see and do
- reset: Starts a new game with a clean board
- step: Handles one move and returns the results
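Putting it together, a single interaction with the environment might look like this (a minimal sketch that assumes the class above and its repository helpers are importable):

env = TicTacToeEnv(render_mode="rgb_array")
obs, info = env.reset()
print(obs["board"])                                # e.g. "---------" for an empty board

obs, reward, done, truncated, info = env.step(4)   # claim the center square
print(obs["board"], reward, done)                  # one mark placed, game continues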
Later, we will see how our Q-learning agent uses this environment to learn optimal play.
Teaching the AI to play Tic-Tac-Toe
Now that we have our game environment, let’s build the brain of our AI player. Our QLearningAgent class implements the concepts we discussed earlier: learning from experience, balancing exploration with known good moves, and updating its strategy based on rewards.
from dataclasses import dataclass

@dataclass
class AgentConfig:
    learning_rate: float = 0.1          # How quickly the AI learns from new experiences
    discount_factor: float = 0.95       # How much future rewards matter vs. immediate ones
    exploration_rate: float = 1.0       # Starting chance of trying random moves
    exploration_decay: float = 0.9995   # How quickly to reduce random exploration
    min_exploration_rate: float = 0.01  # Always keep some randomness
    initial_q_value: float = 0.01       # Starting value for new moves
The heart of our AI is the QLearningAgent class. It keeps track of what it has learned in a Q-table, essentially a big lookup table that stores how good each move is in each board position:
from gymnasium.spaces import Discrete

class QLearningAgent:
    def __init__(self, action_space: Discrete, config: Optional[AgentConfig] = None):
        self.action_space_size = action_space.n
        self.config = config or AgentConfig()
        self.q_table: Dict[str, Tuple[float, ...]] = {}
        self.exploration_rate = self.config.exploration_rate
When it’s the AI’s turn to move, it follows these steps:
1. Look at the current board (the choose_action method):
    def choose_action(self, state: str) -> int:
        available_moves = self.find_available_moves(state)
        # Sometimes make a random move (exploration)
        if random.uniform(0, 1) < self.exploration_rate:
            return random.choice(available_moves)
        # Otherwise choose the best known move (exploitation)
        return self.select_best_move(state, available_moves)
2. After each move, update what it learned (the update method):
    def update(self, state: str, action: int, reward: float, next_state: str, done: bool):
        # Make sure both states have a row in the Q-table before reading from it
        default_row = tuple([self.config.initial_q_value] * self.action_space_size)
        self.q_table.setdefault(state, default_row)
        self.q_table.setdefault(next_state, default_row)
        current_value = self.q_table[state][action]
        # Calculate the value of this move based on the reward and future possibilities
        if done:
            target_value = reward
        else:
            available_moves = self.find_available_moves(next_state)
            future_values = [self.q_table[next_state][move] for move in available_moves]
            max_future_value = max(future_values) if future_values else 0
            target_value = reward + self.config.discount_factor * max_future_value
        # Update our knowledge slightly toward the target value
        updated_value = current_value + self.config.learning_rate * (target_value - current_value)
        # Store the new estimate back in the Q-table (rows are stored as tuples)
        row = list(self.q_table[state])
        row[action] = updated_value
        self.q_table[state] = tuple(row)
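Both methods above call find_available_moves and select_best_move, which aren’t shown here. A minimal sketch of how they might look inside QLearningAgent, inferred from the call sites (these bodies are my assumption, not the repository’s exact code):

    def find_available_moves(self, state: str) -> List[int]:
        # Empty squares are marked with '-' in the nine-character board string
        return [i for i, cell in enumerate(state) if cell == "-"]

    def select_best_move(self, state: str, available_moves: List[int]) -> int:
        # Unseen states fall back to the configured initial value for every move
        default_row = tuple([self.config.initial_q_value] * self.action_space_size)
        q_values = self.q_table.get(state, default_row)
        return max(available_moves, key=lambda move: q_values[move])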
Over time, the AI:
- Starts by trying lots of random moves to explore
- Gradually focuses more on moves it has found to work well
- Updates its Q-table after every move based on the rewards it gets
- Considers both immediate rewards and future possibilities
This balance of exploration and exploitation lets our AI discover effective strategies while still being able to adapt to different opponents.
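The exploration_decay and min_exploration_rate settings in AgentConfig imply a small decay step that runs after each game. A possible implementation inside QLearningAgent (the method name decay_exploration is my assumption):

    def decay_exploration(self) -> None:
        # Shrink epsilon a little after every episode, but never below the configured floor
        self.exploration_rate = max(
            self.config.min_exploration_rate,
            self.exploration_rate * self.config.exploration_decay,
        )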
From novice to expert: The training process
There are two ways to train our AI, and we will try both:
- Training against a random player (like learning against a beginner)
- Two AIs training against each other (like two students practicing together)

Training against a random player
Let’s start with the basics: training our AI against a random player. This approach helps our agent learn fundamental strategies without being overwhelmed by complex counter-play.
Here’s how we track the learning process:
@dataclass
class TrainingConfig:
    episodes: int = 1_000_000       # Total games to play
    render_interval: int = 100_000  # Show game visually every N episodes
    stats_interval: int = 1_000     # Print stats every N episodes
    window_size: int = 1_000        # Track recent performance over N games

class TrainingStats:
    def __init__(self, window_size: int):
        self.recent_results: List[str] = []  # Track recent game outcomes
        self.wins_x = 0  # Total X wins
        self.wins_o = 0  # Total O wins
        self.draws = 0   # Total draws

    def get_stats(self, episode: int) -> Tuple[float, float]:
        # Calculate both overall and recent performance
        recent_x_wins = self.recent_results.count("X")
        total_episodes = episode + 1
        overall_winrate = (self.wins_x / total_episodes) * 100
        recent_winrate = (recent_x_wins / len(self.recent_results)) * 100
        return overall_winrate, recent_winrate
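Something has to feed recent_results and the win counters after every game. A minimal sketch of such a recording step inside TrainingStats (the method name and signature are my guess, not the repository’s code):

    def record_result(self, winner: str, window_size: int) -> None:
        # winner is "X", "O", or anything else for a draw
        if winner == "X":
            self.wins_x += 1
        elif winner == "O":
            self.wins_o += 1
        else:
            self.draws += 1
        # Keep only the most recent outcomes for the sliding-window statistics
        self.recent_results.append(winner)
        self.recent_results = self.recent_results[-window_size:]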
The training loop alternates between our learning agent (X) and random moves (O):
    def run_episode(self, episode: int) -> None:
        obs, _ = self.env.reset()
        state = obs["board"]
        current_player = obs["current_player"]
        done = False
        while not done:
            # X's turn (learning agent)
            if current_player == "X":
                action = self.agent.choose_action(state)
                obs, reward, done, _, _ = self.env.step(action)
                next_state = obs["board"]
                self.agent.update(state, action, reward, next_state, done)
            # O's turn (random player)
            else:
                action = self.select_random_move(state)
                obs, reward, done, _, _ = self.env.step(action)
                next_state = obs["board"]
            current_player = obs["current_player"]
            state = next_state
Every thousand games, we get a progress report:
Episode 1000
Overall - X W: 623, O W: 298, Dr: 79
Overall Win Rate: 62.3%
Recent Win (last 1000): 62.3%
Recent - X W: 623, O W: 298, Dr: 79
X Exploration Rate: 0.606
This tells us:
- How often our AI wins overall
- Recent performance (last 1000 games)
- Current exploration rate (how often it tries random moves)
Two AIs training against each other
While training against random moves teaches basic strategy, the real challenge comes from playing against an equal opponent. Let’s see how we can train two AIs that learn from each other, similar to two students practicing together.
First, we set up two learning agents with identical starting conditions:
def main():
    env = TicTacToeEnv(render_mode="human")

    # Configuration shared by both agents
    agent_config = AgentConfig(
        learning_rate=0.1,           # How quickly agents learn
        discount_factor=0.99,        # Weight of future rewards
        exploration_rate=1.0,        # Start fully exploratory
        exploration_decay=0.999995,  # Very slow decay for thorough learning
        initial_q_value=0.0,         # Initial move valuations
    )

    # Create two separate learning agents
    agent_x = QLearningAgent(env.action_space, agent_config)
    agent_o = QLearningAgent(env.action_space, agent_config)

    training_config = TrainingConfig()
    trainer = DualAgentTrainer(env, agent_x, agent_o, training_config)
To track their progress, we will monitor wins, losses, and draws, just as we did with the random opponent:
@dataclass
class TrainingConfig:
    episodes: int = 500_000         # Total training games
    render_interval: int = 100_000  # Show visual game every N episodes
    stats_interval: int = 1_000     # Print progress every N episodes
    render_delay: float = 0.5       # Delay between moves (for visualization)

class TrainingStats:
    def __init__(self):
        self.wins_x = 0
        self.wins_o = 0
        self.draws = 0

    def print_progress(self, episode: int, agent_x: QLearningAgent, agent_o: QLearningAgent):
        print(f"\nEpisode {episode}")
        print(f"Agent X Wins: {self.wins_x}, Agent O Wins: {self.wins_o}, Draws: {self.draws}")
        print(f"X Exploration Rate: {agent_x.exploration_rate:.3f}")
        print(f"O Exploration Rate: {agent_o.exploration_rate:.3f}")
The key to self-play learning is how agents learn from each game. When one agent wins, both agents update their strategies:
    def update_agents(self, current_player: str, last_moves: Dict[str, GameState],
                      state: str, action: int, reward: float, next_state: str, done: bool):
        current_agent = self.agents[current_player]
        other_player = "O" if current_player == "X" else "X"
        other_agent = self.agents[other_player]

        # Current player learns from their move
        current_agent.update(state, action, reward, next_state, done)

        # If the game ends with a winner, the other player learns from their loss
        if done and reward != 0:
            other_last_move = last_moves[other_player]
            if other_last_move.action is not None:
                other_agent.update(
                    other_last_move.board,
                    other_last_move.action,
                    -reward,  # Opposite reward for the losing player
                    next_state,
                    done,
                )
During training, we see several stages of development:
1. Early Games (Episodes 1–1000):
Episode 1000
Agent X Wins: 358, Agent O Wins: 341, Draws: 301
X Exploration Rate: 0.607
O Exploration Rate: 0.607
- Both agents make many random moves.
- Wins are pretty evenly distributed.
- Few strategic patterns emerge.
2. Middle Stage (Episodes ~50,000):
Episode 50000
Agent X Wins: 16920, Agent O Wins: 16881, Draws: 16199
X Exploration Rate: 0.082
O Exploration Rate: 0.082
- Agents start showing strategic behavior
- More games end in draws
- Less random exploration
3. Late Stage (Episodes ~100,000):
Episode 100000
Agent X Wins: 33841, Agent O Wins: 33762, Draws: 32397
X Exploration Rate: 0.018
O Exploration Rate: 0.018
- Very strategic play
- A high percentage of draws
- Primarily using learned strategies
Key observations from self-play training:
- Win rates stay close to 50–50 as both agents improve together.
- Draws become more common as agents learn optimal play.
- Exploration rates gradually decrease as agents build confidence.
- Games get longer as agents make fewer mistakes.
This dual-learning approach produces stronger players than training against random moves because:
- Both agents continuously adapt to counter each other.
- They learn from both winning and losing positions.
- The challenge level increases as both improve.
- They discover and counter advanced strategies.
It’s fascinating to watch two AI agents start from complete randomness and gradually develop sophisticated strategies through self-play — much like how human players improve by practicing against worthy opponents.
How well does it play?
Training stats reveal the agent’s progress, reaching a win rate of around 75–80% against a random player. However, to understand how well our agent learned to play Tic-Tac-Toe, we must evaluate its performance in both scenarios.
Against a random player
First, let’s examine the results against a random player. How did the agent achieve this level of performance? The training graphs reveal several patterns.

- Rapid Initial Learning: The agent’s win rate climbed sharply in the first 100,000 episodes, starting from around 45% and quickly reaching about 65%. This shows how quickly it learned basic winning strategies.
- Steady Improvement: The win rate gradually improved after the initial surge, stabilizing around 75–80%. This pattern is typical in reinforcement learning — quick early gains followed by slower, incremental improvements.
- Exploration vs Exploitation: The exploration rate graph shows how the agent transitioned from mostly exploring (trying random moves) to exploiting its learned strategies. This balance was crucial for developing robust gameplay.
Against another AI agent
We see a different pattern when we pit two trained agents against each other.

The self-play results reveal sophisticated learning dynamics:
- Perfect Balance: Both agents achieve remarkably similar win rates (~35%), with X winning 286,609 games and O winning 289,341 games out of nearly 900,000 played. This near-perfect balance shows that neither player has found an exploitable advantage.
- Rising Draw Rate: Perhaps most interesting is the steady increase in draws from ~15% to ~35% as training progresses. This mirrors what we see in human expert play — more games end in draws as players get better.
- Strategic Depth: The high number of draws (298,051) suggests both agents have learned optimal defensive strategies. Like human experts, they’ve learned that perfect play often leads to a draw in Tic-Tac-Toe.
- Exploration Strategy: Both agents show identical exploration patterns, starting with random moves and gradually transitioning to strategic play. This shared learning journey helps explain their evenly matched performance.
This dual evaluation shows that our agent has not just learned to beat random opponents—it’s discovered the fundamental strategic principles of Tic-Tac-Toe. Against a random player, it exploits weaknesses to win consistently. It plays cautiously against an equally skilled opponent and often forces a draw, much like human experts.
Setup and Installation
Requirements:
- Python 3.12
- Poetry package manager
- Standard CPU (no GPU needed)
1. Clone the repository:
git clone https://github.com/PatrickKalkman/TicTacToe-RL-QTables.git
2. Set up the environment:
poetry install
poetry shell
3. Run the training scripts:
cd src
python train_agent_vs_random.py # Train against a random opponent
python train_agent_vs_agent.py # Train against self
From Tic-Tac-Toe to Go: Scaling AI in Complexity
Our journey from Tic-Tac-Toe to AlphaGo demonstrates the need for varied approaches in different games. While Tic-Tac-Toe has just 5,478 possible valid board positions, Go is entirely different in complexity.
A 19x19 Go board has approximately 2.1 × 10¹⁷⁰ possible positions — more than the atoms in the observable universe. This vast number of possibilities makes a Q-learning table impractical for Go, as storing every possible board position in memory is impossible.
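A rough back-of-the-envelope calculation makes the gap concrete (my estimate, not a measurement from the project):

ttt_states = 5_478      # legal Tic-Tac-Toe positions
actions = 9             # one Q-value per square
bytes_per_value = 8     # a 64-bit float per state-action pair

table_bytes = ttt_states * actions * bytes_per_value
print(f"{table_bytes / 1024:.0f} KiB")   # about 385 KiB: trivial to keep in memory

# A comparable Go table would need roughly 2.1e170 rows, far beyond any conceivable storage.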
Advanced AI systems like AlphaGo use techniques beyond simple Q-learning to tackle such complexity. For example:
- Search Algorithms: AlphaGo uses Monte Carlo Tree Search (MCTS) to evaluate possible moves efficiently.
- Pattern Recognition: Deep neural networks recognize board patterns, predicting favorable moves.
- Policy Networks and Value Networks: Policy networks suggest promising moves, while value networks estimate how favorable a board position is.
Even in our Tic-Tac-Toe project, tuning hyperparameters like learning rate, exploration rate decay, and discount factor was essential. For complex games, tuning these becomes even more critical, as AlphaGo’s neural networks involve millions of parameters that must be carefully adjusted.
What’s Next?
If you’re ready to build on this foundation, here are a few paths to deepen your understanding:
- Try minimax with alpha-beta pruning for Tic-Tac-Toe: Minimax, combined with alpha-beta pruning, makes move evaluation more efficient by eliminating branches that cannot affect the outcome. It’s a great next step for understanding decision-making in simpler games (see the sketch after this list).
- Explore deep Q-learning for more complex games: Moving from tabular to deep Q-learning introduces neural networks to handle larger state spaces, paving the way for tackling games with higher complexity.
- Create AI for intermediate games like Connect Four or Othello: These games introduce more complexity than Tic-Tac-Toe but remain manageable, providing a balanced challenge and further developing your skills.
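If you want a concrete starting point for the first suggestion, here is a small, self-contained sketch of minimax with alpha-beta pruning on the same nine-character board encoding used earlier (illustrative code, not part of the repository):

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board: str) -> str | None:
    for a, b, c in WIN_LINES:
        if board[a] != "-" and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board: str, player: str, alpha: float, beta: float) -> int:
    # Score from X's perspective: +1 X wins, -1 O wins, 0 draw
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if "-" not in board:
        return 0
    best = -2 if player == "X" else 2
    for i, cell in enumerate(board):
        if cell != "-":
            continue
        child = board[:i] + player + board[i + 1:]
        score = minimax(child, "O" if player == "X" else "X", alpha, beta)
        if player == "X":
            best = max(best, score)
            alpha = max(alpha, best)
        else:
            best = min(best, score)
            beta = min(beta, best)
        if beta <= alpha:
            break  # prune: this branch cannot change the final decision
    return best

print(minimax("---------", "X", -2, 2))   # 0: perfect play from both sides ends in a draw

Unlike Q-learning, minimax computes the value of every position exactly instead of estimating it from experience, which is only feasible because Tic-Tac-Toe’s game tree is tiny.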
The journey from Tic-Tac-Toe to Go demonstrates how AI methods scale with problem complexity. Our Q-learning approach serves as a stepping stone, helping us appreciate advanced systems like AlphaGo and the breakthroughs behind them.
This is just the beginning. Stay tuned for more insights as I dive deeper into the Hugging Face course. Let’s keep exploring and learning together.