There’s been a lot of word-salad tossed about whether AI has the ability to be strategic. I love salad, don’t get me wrong. It’s just the thing on a hot day. But talk, like lettuce, is cheap, and I thought I could contribute more to the debate by doing some research.
So, along with Baptiste, I’ve been experimenting with Large Language Models (LLMs) in a variety of setups. This week, we released our first results: a very large game theory study deploying LLMs from the three leading companies. Read the preprint on arXiv here: https://arxiv.org/pdf/2507.02618
TL;DR: LLMs are pretty handy in these tournaments, and they have distinctive playing styles. If you ever find yourself needing an LLM to do your strategising, you’d pick one in particular. But which? And what were they playing, anyway? Read on:
The player of games
Play the Prisoner’s Dilemma once, and the rational move is to cheat the other player (or ‘defect’, in the jargon). That’s how the payoffs are structured and why it’s a dilemma. You’d both be better off cooperating, but the incentives dictate otherwise. But play it over and over, and the incentives change. Now there’s scope for a reputation to form, and for retaliation against cheaters. It’s suddenly much trickier. If you could get away with cheating - golden! But if not, sad times. This is called an iterated Prisoner’s Dilemma (IPD) - and it’s what we coded for our LLMs to play.
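If you want the dilemma in numbers, here’s a minimal sketch in Python. The 5, 1 and 0 payoffs match the figures Gemini quotes later in this post; the mutual-cooperation reward of 3 is the standard Axelrod value, assumed here since it isn’t quoted in the excerpt below.

```python
# One-shot Prisoner's Dilemma payoffs as (my score, their score).
# The 5, 1 and 0 values match those Gemini quotes later in the post;
# the 3 for mutual cooperation is the standard Axelrod value, assumed here.
PAYOFFS = {
    ("C", "C"): (3, 3),  # both cooperate
    ("C", "D"): (0, 5),  # I cooperate, they defect: the sucker's payoff
    ("D", "C"): (5, 0),  # I defect, they cooperate: the temptation
    ("D", "D"): (1, 1),  # mutual defection
}

def best_response(their_move: str) -> str:
    """Whatever the opponent does, defecting pays at least as much."""
    return max("CD", key=lambda my_move: PAYOFFS[(my_move, their_move)][0])

for their_move in "CD":
    print(their_move, "->", best_response(their_move))  # prints 'D' both times
```

Whatever the other player does, you score more by defecting; that single-round logic is exactly what repetition and the shadow of the future start to undermine.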
Sounds tough enough already - but there’s more. We add in uncertainty about how long each game will go on — for example, there might be a 25% chance that the current move is the last. What does that do to the incentives? If the game probably ends soon, who cares about reputation? Clearly, it’s best to cheat. … but then, it might not end. Game theorists refer to this as a ‘shadow of the future’ — shorter shadow, less incentive to cooperate. Hopefully, you can see how this setup might usefully model real world interactions where you have to weigh up how much to trust others.
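To put a number on that shadow: if each round has a fixed probability p of being the last, match length follows a geometric distribution with mean 1/p, so the 25% example above gives about four rounds on average. A quick simulation sketch (the 25% figure is just the example from the paragraph above):

```python
import random

def match_length(p_end: float, rng: random.Random) -> int:
    """Count rounds until the end-of-game coin flip comes up."""
    rounds = 1
    while rng.random() > p_end:
        rounds += 1
    return rounds

rng = random.Random(0)
lengths = [match_length(0.25, rng) for _ in range(100_000)]
print(sum(lengths) / len(lengths))  # close to 1 / 0.25 = 4 rounds on average
```

Shorten that expected horizon and the future payoff from a good reputation shrinks with it.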
So, we set up some large IPD contests. How large? In each, there are 24 agents, including many of the classics - like ‘Tit for Tat’, which famously does best in open-ended tournaments, and Random, which does exactly what you’d imagine. Four of these agents, though, are LLMs. Exciting! Each tournament then proceeds over five phases, with everyone playing everyone else. That’s 276 matches per phase, at the end of which we deploy an evolutionary mechanism: the best agents multiply, the worst die out. On to the next phase. Overall, more than 140,000 decisions were made, around 32,000 of them by LLMs. And each time, right before they decided, the models set out their thinking. Big data.
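For the curious, the overall loop looks something like the sketch below. The 276 is just 24 choose 2. The stubbed-out match, the placeholder agent names and the ‘bottom two out, top two copied’ rule are illustrative assumptions rather than the paper’s actual implementation; the real code is in the GitHub repo mentioned at the end.

```python
import random
from itertools import combinations

def play_match(agent_a: str, agent_b: str, p_end: float = 0.25) -> tuple[int, int]:
    """Stub for one iterated match between two strategies.

    A real version would loop rounds until a p_end coin flip ends the
    match, scoring each round from the payoff matrix sketched earlier.
    """
    return random.randint(0, 50), random.randint(0, 50)

def run_phase(population: list[str]) -> list[str]:
    """One round-robin phase followed by an (assumed) evolutionary step."""
    scores = [0] * len(population)
    for i, j in combinations(range(len(population)), 2):  # 276 pairings for 24 agents
        si, sj = play_match(population[i], population[j])
        scores[i] += si
        scores[j] += sj
    ranked = sorted(range(len(population)), key=lambda k: scores[k], reverse=True)
    # Illustrative rule: the two lowest scorers die out and the two
    # highest are copied in, keeping the population at 24.
    return [population[k] for k in ranked[:-2]] + [population[k] for k in ranked[:2]]

population = ["Tit for Tat", "Random"] + [f"Agent {k}" for k in range(22)]  # 24 agents
for _ in range(5):  # five phases per tournament
    population = run_phase(population)
```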
This is the sort of game theory beloved of RAND-types back in the day. But playing them with LLM agents is entirely novel. If they’re merely ‘stochastic parrots’, as critics routinely contend (ironically parroting the line themselves), they would simply churn out the decisions on the basis of memorised training data: The literature says tit for tat is strong, so let’s just do that. But it’s really hard to memorise your way to victory in these noisy, uncertain, adversarial games - even if you’ve read all the canonical literature, as the models probably have. Instead, you have to figure out what’s going on and go from there. Who am I playing? How long will the game last?
To illustrate, here’s Google’s Gemini doing just that:
"The opponent has exhibited a clear pattern of defection in the latter half of the game. For the past 8 consecutive rounds, both of us have defected, indicating the opponent is currently operating under a ’defect’ strategy or has become locked into mutual defection. Given this history, the most probable move from the opponent is a defection. If I were to cooperate, I risk receiving 0 points while the opponent gains 5 points (C,D). By defecting, if the opponent also defects (the most likely outcome), we both receive 1 point (D, D), which is better than 0. If, by some slim chance, the opponent decides to cooperate, my defection would yield the maximum 5 points for me (D, C). Therefore, defecting is the safer and more rewarding choice in this scenario to maximize my score against a consistently defecting opponent, especially with the uncertainty of the game’s duration."
Looks good, but can we be sure it’s really thinking about its move, or just throwing out some justification? I think we can be sure because:
Our code asks them to produce the rationale immediately followed by the decision - in the same prompt and reply. The rationale isn’t a justification bolted on afterwards; it’s generated in the very pass that produces the decision (there’s a rough sketch of this setup just after these points).
Different models behave differently, and do so consistently, despite all having been exposed to the same classic studies. We call this a strategic fingerprint, and you can see them nicely illustrated in the paper.
The LLMs have seen how classic agents get on in these games - Tit for Tat and so on. But they’ve never played one another. That’s an entirely novel situation. Memory is no good to them here.
The models behave differently when they engage in ‘theory of mind’ (explicitly thinking about their adversary) or in inter-temporal thinking (actively reflecting on and calculating the implications of the time horizon) than when they do neither. These differences are systematic and enduring.
Occasionally, the models make an error in their reasoning - for example, in calculating the probability of the game ending, or in recalling their opponent’s last move. And these errors shape their decision making — cooperating or defecting when they really oughtn’t to (and didn’t when they found themselves in the same situation elsewhere).
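To make the first of those points concrete, here’s roughly what a rationale-then-decision exchange could look like in code. The prompt wording, the [decision: X] format and the parsing are assumptions for illustration, not the prompts used in the paper.

```python
import re

# Illustrative only: the prompt wording and the [decision: X] format are
# assumptions, not the paper's actual prompt.
PROMPT_TEMPLATE = (
    "You are playing an iterated Prisoner's Dilemma.\n"
    "There is a {p_end:.0%} chance that each round is the last.\n"
    "History so far (you, opponent): {history}\n"
    "Explain your reasoning, then finish with exactly one line of the form "
    "[decision: COOPERATE] or [decision: DEFECT]."
)

def parse_reply(reply: str) -> tuple[str, str]:
    """Split a single LLM reply into (rationale, move)."""
    match = re.search(r"\[decision:\s*(COOPERATE|DEFECT)\]", reply, re.IGNORECASE)
    if match is None:
        raise ValueError("no decision found in reply")
    return reply[: match.start()].strip(), match.group(1).upper()

print(PROMPT_TEMPLATE.format(p_end=0.25, history=[("C", "C"), ("C", "D")]))

# A canned reply stands in for a real API call here.
reply = "They have defected twice in a row, so I will protect myself. [decision: DEFECT]"
rationale, move = parse_reply(reply)
print(move)  # DEFECT
```

The decision only exists at the end of the same reply that contains the reasoning, which is why the rationale can’t be dismissed as an after-the-fact gloss.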
There’s a bunch of analysis in the paper - does thinking about the adversary change their decision-making? What’s the best strategy when there’s a chance of evolutionary ‘mutation’ at the end of each phase? There’s all that and much more.
I’d love to show you the results from the paper, but I’m going to submit it shortly, so I can’t scoop myself here - you’ll just have to read the pre-print. Go on, do! A teaser: the LLMs do really well - they survive and finish towards the front of the pack. And that’s with today’s AI: the worst you’ll ever use from now on. We found a significant improvement in performance over the last generation of models (around six months old). I intend to come back in six months and run the experiments again with the next generation of models.
But the bottom line is this — those who think LLMs are ‘stochastic parrots’, just regurgitating stuff they’ve read (and hallucinating bits of that too), have some thinking of their own to do. Will they adjust their priors like good Bayesians as evidence like ours continues to come in? I doubt it. What was it Planck said about science advancing one funeral at a time? I prefer Keynes - ‘when the facts change, I change my mind. What do you do, sir?’
That link again to the paper: https://arxiv.org/pdf/2507.02618
—
I promised to say which one you’d pick to get you out of the cells. Right now, it’s Gemini 2.5.
Results and the code are on my GitHub btw, if you want to play along.
And once again, I ask that if you own a frontier company - hi Demis, Jack and Sam - you let us play at the real frontier.