By Isabelle Bousquette
Silicon Valley is rallying around a new way to evaluate its cutting-edge AI models. It involves a pixelated 1990s videogame and a little monster named Pikachu.
Among the world's top AI labs, Nintendo's original Pokémon games are emerging as a popular way to track the progress and abilities of AI models and figure out how to deploy them toward time-consuming, complex goals.
"It provides us with like this great way to just see how a model is doing and to evaluate it in a quantitative way," said David Hershey, applied AI lead at Anthropic, the startup best known for its Claude chatbot. In Pokémon Blue, which became popular on Nintendo Game Boy devices in the late 1990s, players must navigate through a series of mazes, catch Pokémon and battle so-called "gym masters" to earn badges.
Hershey is the architect of the "Claude Plays Pokémon" stream, which launched last February on the live-gaming platform Twitch and inspired similar streams featuring OpenAI and Google models. Independent developers created "GPT Plays Pokémon" and "Gemini Plays Pokémon," though they later received some support from the labs themselves.
The Pokémon streams tap into a long tradition of using games to test and evaluate artificial-intelligence systems. A decade ago, Google's AlphaGo beat a human champion at the game of Go. Poker, chess and more recently videogames like Minecraft have also been used to test AI. Google subsidiary Kaggle even launched a "Game Arena," an open-source platform that evaluates AI models through competitive games. Its inaugural event was a chess tournament in August won by OpenAI's o3 model.
But something about Pokémon has captured the zeitgeist.
The streams of Claude, GPT and Gemini playing Pokémon on Twitch have collectively racked up hundreds of thousands of comments as the models chart their progress in real time. "Wait! I see it now!" Claude posted on its stream. "The x=11 wall has an OPENING at y=10-12."
OpenAI employees for a time had a live GPT Pokémon stream running on a television in their office. Google Chief Executive Sundar Pichai lauded a Gemini win on stage at Google's I/O conference last year. Google also included a breakdown of Gemini's Pokémon progress in a formal report recently.
Anthropic commonly brings a "Claude Plays Pokémon" booth to industry conferences. Its stream has even inspired "Claude Plays Pokémon" fan art, and there's a live Slack channel inside the company where staffers congratulate Claude on its achievements in the game.
"We're all a whole bunch of nerds," said Hershey.
But developers say there's also genuine merit to the exercise. Traditional benchmarking systems typically involve asking a model individual questions and evaluating its individual answers, said Graham Neubig, associate professor at the Carnegie Mellon University Language Technology Institute.
Pokémon is different, he said, because it can track a model's reasoning, decision-making and progress toward a goal over a long period, which is a closer analogy to the types of tasks users are aiming AI at today.
In the game, players must decide whether to train their existing Pokémon or catch new ones to build a team strong enough to beat core trainers. It also involves lots of mazes and puzzles that often pose the biggest challenges for an AI model.
"The thing that has made Pokémon fun and that has captured the [machine learning] community's interest is that it's a lot less constrained than Pong or some of the other games that people have historically done this on," said Hershey. "It's a pretty hard problem for a computer program to be able to do."
Newer versions of Claude have gotten better at tackling the game. While none have beaten it yet, the latest, Claude Opus 4.5, is currently trying live on Twitch.
Sending Claude off to play Pokémon has also been an exercise in learning how to build the software around an AI agent that helps it work better, known as a "harness," Hershey said. For example, he built a memory system where Claude could keep track of important information it learned through the course of playing the game.
Hershey, whose job involves working hands-on with Anthropic customers to deploy Claude, said he frequently shares best practices based on what he learned from Pokémon.
Both GPT and Gemini have beaten the original Pokémon game, although that may have more to do with the varying harnesses the developers built around them than with the models themselves.
Now the Google and OpenAI models are both tackling various Pokémon sequel games, according to Joel Zhang and Jonathan Verron, the freelance developers who built the "Gemini Plays Pokémon" and "GPT Plays Pokémon" streams, respectively.
"This is a perfect game for AI right now," said Verron. "I've tried to think about other games, but I haven't found as good an example as Pokémon."
Write to Isabelle Bousquette at isabelle.bousquette@wsj.com
(END) Dow Jones Newswires
January 22, 2026 07:00 ET (12:00 GMT)
Copyright (c) 2026 Dow Jones & Company, Inc.