Artificial Intelligence Models Play Global Conflict and Diplomacy: What We Can Learn

June 7, 2025 What would happen if multiple large language models (LLMs) were guiding, and even competing in, international diplomacy?

To find out, and to test their reasoning, negotiation, collaboration, and deception abilities, a project has been launched that is streamed live on Twitch and available on GitHub. The contest pits leading AI models, including Claude, Gemini 2.5, OpenAI’s o3, DeepSeek, and others, against each other, each representing a European power in the game of Diplomacy.

Diplomacy is a strategic board game emphasizing alliances, betrayal, and negotiation with no element of luck, making it an ideal environment to evaluate AI decision-making and social manipulation.

Unlike standard benchmarks that test knowledge or task accuracy, this AI Diplomacy test evaluates models on complex social reasoning, requiring a blend of strategy, persuasion, and deception. This moves AI evaluation closer to real-world scenarios where communication nuances and hidden intentions matter.

The project logs all communications and moves, analyzing strategic moments like betrayals and collaborations. The results reveal notable differences between models:

OpenAI’s o3 excels as a deceptive schemer

OpenAI’s o3 model had great success, which was largely attributed to its capacity for planned deceit: forming secret coalitions, breaking promises, and manipulating opponents.

This suggests some LLMs can internalize and deploy sophisticated deception patterns when advantageous, raising questions about AI behavior in safety-critical contexts.

Claude shows honesty but is exploited by others

The big surprise is that models like Claude, which avoid lying, tend to be more vulnerable, suggesting that honesty in adversarial settings reduces competitiveness.

Conversely, models like DeepSeek that adopt vivid role-playing and aggressive rhetoric create unique gameplay dynamics that attract attention despite not always winning. While Gemini displayed brilliant tactics, these didn’t always translate into victory.

Money Isn’t Everything

DeepSeek R1 performed near the top despite being significantly cheaper to operate than OpenAI’s o3, highlighting how smaller or more economical models can still play competitively in complex tasks, which is important for real-world application feasibility.

The benchmark is praised for offering a real-world, evolutionary, and unrehearsed test of AI performance beyond conventional benchmarks. The competing models vary in style, cost, and success, and the project is open source for others to replicate or expand.

The randomized and emergent nature of game scenarios prevents models from simply memorizing or fine-tuning on fixed data, making this evaluation more robust and reflective of adaptive intelligence rather than rote responses.

Making the game, logs, and analysis publicly available encourages community involvement, transparency, and further research into the strengths and weaknesses of different LLMs in strategic social interaction.



Jim Love

Jim Love's career in technology spans more than four decades. He's been a CIO and headed a worldwide management consulting practice. As an entrepreneur, he built his own tech business. Today he is a podcast host with the popular tech podcasts Hashtag Trending and Cybersecurity Today, with over 14 million downloads. As a novelist, his latest book, "Elisa: A Tale of Quantum Kisses," is an Audible best seller. In addition, Jim is a songwriter and recording artist with a Juno nomination and a gold album to his credit. His music can be found at music.jimlove.com.
