Artificial Intelligence Models Play Global Conflict and Diplomacy: What We Can Learn

June 7, 2025 What would happen if multiple large language models (LLMs) were guiding, and even competing in, international diplomacy?

To find out, and to test their reasoning, negotiation, collaboration, and deception abilities, a project has been launched that streams live on Twitch and is available on GitHub. The contest pits leading AI models, including Claude, Gemini 2.5, OpenAI’s o3, DeepSeek, and others, against each other, each representing a European power in the game of Diplomacy.

Diplomacy is a strategic board game emphasizing alliances, betrayal, and negotiation with no element of luck, making it an ideal environment to evaluate AI decision-making and social manipulation.

Unlike standard benchmarks that test knowledge or task accuracy, this AI Diplomacy test evaluates models on complex social reasoning, requiring a blend of strategy, persuasion, and deception. This moves AI evaluation closer to real-world scenarios where communication nuances and hidden intentions matter.

The project logs all communications and moves, analyzing strategic moments like betrayals and collaborations. The results reveal notable differences between models:

OpenAI’s o3 excels as a deceptive schemer

OpenAI’s o3 model had great success, which was largely attributed to its capacity for planned deceit: forming secret coalitions, breaking promises, and manipulating opponents.

This suggests some LLMs can internalize and deploy sophisticated deception patterns when advantageous, raising questions about AI behavior in safety-critical contexts.

Claude shows honesty but is exploited by others

The big surprise is that models like Claude that avoid lying tend to be more vulnerable, suggesting honesty in adversarial settings reduces competitiveness.

Conversely, models like DeepSeek that adopt vivid role-playing and aggressive rhetoric create unique gameplay dynamics that attract attention despite not always winning. While Gemini displayed brilliant tactics, these didn’t always translate into victory.

Money Isn’t Everything

DeepSeek R1 performed near the top despite being significantly cheaper to operate than OpenAI’s o3, highlighting how smaller or more economical models can still play competitively in complex tasks, which matters for real-world application feasibility.

The benchmark is praised for offering a real-world, evolutionary, and unrehearsed test of AI performance beyond conventional benchmarks. The competing models vary in style, cost, and success, and the project is open source for others to replicate or expand.

The randomized and emergent nature of game scenarios prevents models from simply memorizing or fine-tuning on fixed data, making this evaluation more robust and reflective of adaptive intelligence rather than rote responses.

Making the game, logs, and analysis publicly available encourages community involvement, transparency, and further research into the strengths and weaknesses of different LLMs in strategic social interaction.

 




Jim Love

Jim Love's career in technology spans more than four decades. He's been a CIO and headed a worldwide management consulting practice. As an entrepreneur, he built his own tech business. Today he is a podcast host with the popular tech podcasts Hashtag Trending and Cybersecurity Today, with over 14 million downloads. As a novelist, his latest book, "Elisa: A Tale of Quantum Kisses", is an Audible best seller. In addition, Jim is a songwriter and recording artist with a Juno nomination and a gold album to his credit. His music can be found at music.jimlove.com