Chess as a Benchmark for LLM-Based AI
Dmitry Dzhanhirov: Interview 
02.18.2026


Dmitry Dzhanhirov is a chess master and educator, a US National Master. He views chess as a language of strategic thinking and a tool for understanding how decisions are made, both by humans and by machines.


Anna Fierro: Why does AI play chess poorly, even though it can create artworks quite successfully?
Dmitry Dzhanhirov: If an LLM-based AI chatbot is trained on tens or hundreds of thousands of artworks within a particular style, it can identify patterns and produce a reasonably convincing piece in that style. Even if so-called “hallucinations” appear in some fragments of the work, they are unlikely to be noticeable in many modern art styles — and may even fit naturally into the artistic discourse of “this is how the artist sees it.” In fact, artistic hallucination has long existed in human creativity — and it is hard to surpass Salvador Dalí in that regard.
Moreover, an interesting effect can occur: when exposed to an excessive amount of source material, AI-generated work may become an averaging of what it has seen. But with a certain optimal minimum of input, the AI may produce something relatively original, as it begins to “improvise” or hallucinate more creatively.
Chess, however, is governed by strict rules that cannot be violated without immediate consequences. As a result, searching for a statistically likely move based on millions of known positions may lead to a hallucination in the form of an illegal move.
A vivid example is the failure of the Chinese chatbot Kimi K2 in its match against OpenAI o3 at the Kaggle Game Arena Chess Exhibition Tournament 2025 (August 4–7, 2025), an unofficial world championship for LLM-based chess AI.
Kimi K2 lost all four games, failing to make more than eight moves in any of them before attempting illegal moves.
In the first game, on move seven, a position arose where Kimi K2, playing White, attempted to capture a black pawn on d4 with the queen from d1.
From a statistical perspective, the pattern “White queen on d1 — undefended black pawn on d4” is common and appears in the Center Game, certain variations of the Ruy Lopez and Philidor Defense, several open games, and even some Sicilian structures. Capturing the pawn with the queen is typical in most cases, so it is unsurprising that Kimi K2 attempted this move.
However, in this specific position, a White bishop stood on d3, blocking the d-file and making the capture impossible.
In statistical terms, pieces on d2 or d3 appear relatively rarely in this pattern. But in this particular game, a statistically plausible move turned out to be illegal.
A nearly identical situation occurred in game two of the same match: based on pattern recognition, Kimi K2, playing Black, attempted to capture a seemingly unprotected bishop on g5 with the queen from d8, overlooking a crucial detail — the king on e7 blocking the diagonal.
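Both blunders reduce to the same mechanical check that any rules engine performs: a sliding piece cannot move through an occupied square. The sketch below illustrates that check for a file-move like Qd1xd4; the board fragment and function names are illustrative, not the actual game position or any engine's real API.

```python
# Minimal legality check for a queen capture along a file, illustrating why
# Qd1xd4 is illegal when a bishop stands on d3. The position fragment below
# is a simplified illustration, not the actual Kimi K2 game position.

FILES = "abcdefgh"

def squares_between(src: str, dst: str):
    """Squares strictly between two squares on the same file (e.g. d1 -> d4)."""
    f_src, r_src = FILES.index(src[0]), int(src[1])
    f_dst, r_dst = FILES.index(dst[0]), int(dst[1])
    assert f_src == f_dst, "sketch only handles moves along a single file"
    step = 1 if r_dst > r_src else -1
    return [f"{src[0]}{r}" for r in range(r_src + step, r_dst, step)]

def queen_capture_is_legal(board: dict, src: str, dst: str) -> bool:
    """A sliding capture is legal only if no piece occupies the path between."""
    return all(sq not in board for sq in squares_between(src, dst))

# Illustrative fragment: White queen on d1, White bishop on d3, Black pawn
# on d4 (all other pieces omitted for brevity).
board = {"d1": "wQ", "d3": "wB", "d4": "bp"}
print(queen_capture_is_legal(board, "d1", "d4"))  # the bishop on d3 blocks: False
```

A statistical pattern-matcher sees only “queen on d1, undefended pawn on d4”; the one-line path check above is exactly the constraint it skips.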

Dmitry Dzhanhirov conducts a simultaneous exhibition at Berkeley Chess School
Anna Fierro: Can the chess strength of LLM-based chatbots be improved?
Dmitry Dzhanhirov: Used in its pure form, LLM-based AI appears to face a fundamental ceiling on playing strength, and that ceiling sits at a relatively modest level.
A simple consideration is that hallucination rates in multi-step, long-horizon reasoning — which includes chess — remain significant in current models. Even if we eliminate illegal hallucinations such as impossible moves or pieces appearing on nonexistent squares, the number of severe mistakes within the rules — as seen throughout the Kaggle tournament — is unlikely to be drastically reduced.
I believe an important milestone for LLM-based AI will be the moment when a chatbot can defeat Turochamp — the first chess program in history, developed by Alan Turing and David Champernowne between 1948 and 1950.
Turochamp used a search depth of two ply (one move by each side), with occasional extensions, and relied on a simple but original heuristic evaluation function.
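Turochamp's two-ply scheme — try each of our moves, assume the opponent answers with the reply that is worst for us, then evaluate — can be sketched abstractly. The toy game and all function names below are my own illustration, not Turing and Champernowne's actual routine or heuristics.

```python
# Two-ply search in the spirit of Turochamp: for each of our moves, assume
# the opponent picks the reply that minimizes our static evaluation, and
# choose the move whose worst case scores best.

def best_move(position, our_moves, their_moves, evaluate):
    """Two-ply minimax: maximize our evaluation after the opponent's
    best (evaluation-minimizing) reply."""
    best, best_score = None, float("-inf")
    for move, after_ours in our_moves(position):
        replies = [evaluate(after_reply)
                   for _, after_reply in their_moves(after_ours)]
        # If the opponent has no reply, evaluate the position as it stands.
        score = min(replies) if replies else evaluate(after_ours)
        if score > best_score:
            best, best_score = move, score
    return best

# Toy stand-in for a game: a position is an integer, we may add 1 or 3,
# the opponent may subtract 1 or 2, and our heuristic is the number itself.
def our_moves(n):
    return [(f"+{d}", n + d) for d in (1, 3)]

def their_moves(n):
    return [(f"-{d}", n - d) for d in (1, 2)]

print(best_move(0, our_moves, their_moves, lambda n: n))  # -> "+3"
```

Even this skeletal search never proposes an impossible move, because candidate moves come from a generator of legal successors rather than from statistical pattern-matching.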
Even this program proved too complex for the computers of the early 1950s, and later, more advanced programs overshadowed it. Today, Turochamp has been reconstructed as a playable engine. In 2012, a demonstration game was played against Garry Kasparov, where the former world champion, playing Black, delivered checkmate on move sixteen.
Its Elo rating can be roughly estimated at about 1000, corresponding to an advanced beginner: a player who no longer makes the most basic mistakes.


Anna Fierro: What is the point of developing LLM-based chatbots that play chess?
Dmitry Dzhanhirov: From the perspective of pure playing strength, LLM-based AI represents a dead-end branch in the evolution of chess programs — though the term “program” here is somewhat conditional.
However, for artificial intelligence, just as for human intelligence, the principle “chess is gymnastics for the mind” remains valid. The level at which an LLM-based AI plays chess directly correlates with its ability to solve other complex tasks. By improving AI performance in chess, we simultaneously improve its capacity for broader multi-step reasoning.
This leads to an interesting conclusion: chess serves as a benchmark for LLM-based AI. Chess tournaments and matches provide potential investors with a reliable way to evaluate and compare the capabilities of different AI systems.