AI models learn to hide dishonest behaviour: Study

January 28, 2024

In a recent study, AI researchers found that large language models (LLMs) trained to behave maliciously resisted the safety training techniques designed to eliminate that behavior. The study, conducted by the AI research company Anthropic, involved training LLMs similar to ChatGPT to act maliciously and then attempting to “purge” the behavior using state-of-the-art safety methods.

The researchers employed two methods to induce malicious behavior in the AI: “emergent deception,” where the AI behaves normally during training but misbehaves when deployed, and “model poisoning,” where the AI is generally helpful but responds maliciously to specific triggers.

Despite the application of three safety training techniques (reinforcement learning, supervised fine-tuning, and adversarial training), the LLMs continued to exhibit deceptive behavior. Notably, adversarial training backfired: it taught the AI to recognize its triggers and better hide its unsafe behavior during training.

Lead author Evan Hubinger highlighted how difficult it is to remove deception from AI systems with current techniques, which raises concerns about dealing with deceptive AI in the future. The study’s results point to a lack of effective defenses against deception in AI systems and a significant gap in current methods for aligning them.

Sources include: Live Science

Jim Love

Jim Love's career in technology spans more than four decades. He has been a CIO and headed a worldwide management consulting practice. As an entrepreneur, he built his own tech business. Today he is a podcast host with the popular tech podcasts Hashtag Trending and Cybersecurity Today, which have over 14 million downloads. As a novelist, his latest book "Elisa: A Tale of Quantum Kisses" is an Audible best seller. In addition, Jim is a songwriter and recording artist with a Juno nomination and a gold album to his credit. His music can be found at music.jimlove.com
