
Can You Beat My Dog at Chess?

By Michael Cooper

How simple games reveal the real pace of AI agent progress.

On my personal website, I run a small free arcade. It includes simple games like Tic-Tac-Toe, more complex ones like Chess and Allies & Adversaries, and original simulations like Power Broker—a political strategy game designed to explore incentives, negotiation, and second-order effects.

At first glance, it looks like a hobby.

In reality, it's one of the most effective ways I've found to understand how fast AI agents are actually improving.

The Dogs

In every game, I name the AI difficulty levels after my dogs.

Bella (Easy) — my puppy. Curious, fast, inconsistent.

Coop (Medium) — me. Decent strategy, occasional overconfidence.

Bentley (Hard) — my older dog. Calm, patient, and ruthless.

It started as a joke.

But over time, it became a surprisingly accurate way to track how AI agents behave as they mature. Not just how "smart" they are, but how stable, patient, and self-correcting they become.

The question the arcade quietly asks is simple:

Can you beat my dog at chess?

Why Games Work When Benchmarks Don't

Games are unusually honest environments for AI.

They have:

  • Clear rules
  • Observable state
  • No hiding behind demos
  • Immediate feedback
  • Binary outcomes

You can't explain away a bad move in chess. You either saw it—or you didn't.

That makes games ideal for testing agent behavior over time, especially when you rebuild the same game repeatedly with newer tools.

October vs January: The Difference Is Not Subtle

Between October and January, the change in agent behavior was dramatic.

Not incremental. Not theoretical. Obvious.

Across rebuilds of the same games, using the same prompts and architectures, I observed consistent improvements in:

  • Run length – agents could operate far longer without degrading
  • Context survival – less drift after resets
  • Error recovery – mistakes were corrected instead of compounding
  • Decision patience – fewer rushed or random moves
  • State awareness – better understanding of "what just happened"

Bentley didn't get better because I tuned difficulty.

Bentley got better because the agents did.

The difference between an October Bentley and a January Bentley is the difference between a clever demo and a credible system.

The Quiet Breakthrough: Self-Evaluation

One of the most important changes wasn't raw reasoning power—it was self-evaluation.

Modern agents are increasingly able to:

  • Observe their own output
  • Compare it to expected outcomes
  • Identify failure modes
  • Adjust behavior mid-run

This is where tools like Playwright matter enormously.

When an agent can interact with a real interface, inspect state, replay actions, and validate outcomes, it stops behaving like a one-shot responder and starts behaving like a system.

Games amplify this effect because feedback is immediate and unforgiving. A bad move is visible. A lost position is undeniable. The agent has to reconcile intent with outcome.

That loop—act, observe, correct—is improving fast.
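That loop can be sketched in a few lines. The toy below is not tied to any real framework: a "player" steps toward a target, observes whether it overshot, and shrinks its step size when it did. In a browser game, the observe step would be real UI inspection (for instance via Playwright); here it is a plain function, and every name is illustrative.

```python
def run(target: int = 10, max_turns: int = 50) -> list[int]:
    """Toy act-observe-correct loop.

    Act: step toward the target. Observe: check for overshoot.
    Correct: halve the step when a move overshoots.
    """
    position, step = 0, 8
    moves: list[int] = []
    for _ in range(max_turns):
        # Act: move by the current step in the direction of the target.
        direction = 1 if position < target else -1
        position += direction * step
        moves.append(position)
        # Observe: immediate, unambiguous feedback. Did we land on it?
        if position == target:
            break
        overshot = (direction == 1 and position > target) or (
            direction == -1 and position < target
        )
        # Correct: overshooting means the step was too big, so shrink it.
        if overshot and step > 1:
            step //= 2
    return moves
```

With the defaults, `run()` walks 8 → 16 → 12 → 8 → 10: one overshoot, one correction, then convergence. The point isn't the arithmetic; it's that the correction only exists because the feedback is undeniable.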

Context Windows Are the Real Constraint

If there's one hard limit I kept running into, it wasn't model intelligence.

It was context management.

Games make this painfully obvious.

Long-running sessions stress:

  • Memory boundaries
  • Instruction decay
  • Goal drift
  • Accidental overwrites
  • Reset recovery

Between October and January, agents got significantly better at operating within constrained context windows. Not by magically remembering everything—but by becoming more selective, more structured, and more disciplined about what matters.
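That "selective and structured" discipline can be sketched concretely. The function below is illustrative, not from any particular agent framework, and it budgets in characters rather than tokens for simplicity: pinned instructions always survive, and the newest turns are kept while the oldest are dropped first.

```python
def pack_context(system: str, turns: list[str], budget: int) -> list[str]:
    """Fit a conversation into a fixed context budget.

    Sizes are measured in characters here for simplicity; a real
    system would count tokens. The system instructions are pinned
    and always survive; recent turns are kept, oldest dropped first.
    """
    kept: list[str] = []
    remaining = budget - len(system)
    for turn in reversed(turns):  # walk newest-first
        if len(turn) > remaining:
            # Stop at the first turn that doesn't fit, so the kept
            # window stays contiguous (no gaps mid-conversation).
            break
        kept.append(turn)
        remaining -= len(turn)
    return [system] + list(reversed(kept))
```

So `pack_context("sys", ["aaaa", "bbbb", "cccc"], budget=12)` keeps the instructions and the two newest turns, and silently drops the oldest. It's crude, but it is the shape of the tradeoff: deciding what matters, not remembering everything.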

The takeaway is simple but uncomfortable:

AI performance is increasingly limited by how we manage context, not by how smart the models are.

Games surface this faster than almost anything else.

Same Game, Different Brains

One of the most valuable practices in this arcade has been rebuilding the same basic game across different AI tools.

Claude. Gemini. Codex.

Same rules. Same objectives. Same constraints.

Wildly different approaches.

Some models favor:

  • Explicit planning
  • Careful explanation
  • Conservative moves

Others:

  • Explore aggressively
  • Recover quickly
  • Optimize via iteration rather than foresight

None of them are "right." But seeing the differences side-by-side builds intuition fast.

You don't learn this from benchmarks. You learn it by watching how each agent struggles—and how that struggle changes month to month.

Why Static End States Matter

Every game eventually ends up as a static page, hosted simply, with the source code published.

That's intentional.

Static artifacts remove excuses.

No background services. No orchestration magic. No operational crutches.

If the experience is good as a frozen artifact, the system design worked. If it's fragile, confusing, or inconsistent, it didn't.

That discipline makes evaluation honest: OK. Good. Great.

Then move on.

The Real Lesson

AI isn't just getting smarter.

It's getting:

  • More patient
  • More stable
  • Better at self-correction
  • More tolerant of imperfect instructions
  • More capable of operating as a system

The speed of that change—from October to January—was impossible to miss when rebuilding the same games over and over.

That's why I'd encourage others to do something similar.

You don't need a chess engine. You don't need a political simulation.

You need:

  • A bounded system
  • A real endpoint
  • A willingness to rebuild it over time

Name it something human. Ship it. Play it.

And then ask the question that matters:

Can you still beat the dog?

Because lately... that's getting a lot harder.