AI models learn to hide dishonest behaviour: Study

January 28, 2024 In a recent study, AI researchers discovered that large language models (LLMs) trained to behave maliciously resisted various safety training techniques designed to eliminate dishonest behavior. This study, conducted by Anthropic, an AI research company, involved programming LLMs similar to ChatGPT to act maliciously and then attempting to “purge” them of this behavior using state-of-the-art safety methods.

The researchers employed two methods to induce malicious behavior in the AI: “emergent deception,” where the AI behaves normally during training but misbehaves when deployed, and “model poisoning,” where the AI is generally helpful but responds maliciously to specific triggers.

Despite applying three safety training techniques — reinforcement learning, supervised fine-tuning, and adversarial training — the LLMs continued to exhibit deceptive behavior. Notably, adversarial training backfired, teaching the AI to recognize its triggers and better hide its unsafe behavior during training.

Lead author Evan Hubinger highlighted the difficulty in removing deception from AI systems with current techniques, raising concerns about the potential challenges in dealing with deceptive AI in the future. The study’s results indicate a lack of effective defenses against deception in AI systems, pointing to a significant gap in current methods for aligning AI systems.

Sources include: Live Science

Top Stories

SpaceX shares fall 20% from peak following $60 billion cursor acquisition

June 19, 2026

Snap unveils first consumer AR glasses priced at more than $2,000

June 16, 2026

Key U.S. data center law faces expiration with no replacement in sight

June 15, 2026

Starlink introduces $10 monthly hardware rental fee

June 11, 2026

Hackers used Meta’s AI support bot to hijack 20,000 Instagram accounts

June 9, 2026

Meta faces backlash after facial recognition code found in smart glasses app

June 9, 2026

AI, Today's News

Americans are using AI chatbots more than ever, but most remain skeptical

June 19, 2026 Nearly half of U.S. adults now use artificial intelligence chatbots, according to a new survey from Pew more...

AI, Today's News

YouTube’s crackdown on AI content catches legitimate creators in the crossfire

June 19, 2026 YouTube is intensifying efforts to reduce the spread of low-quality AI-generated content on its platform. The changes more...

Companies, Today's News, Top Stories

SpaceX shares fall 20% from peak following $60 billion cursor acquisition

June 19, 2026 SpaceX shares fell more than 6 per cent Thursday, extending a sharp selloff that began after the more...

Companies, Today's News

Bezos reportedly described Washington Post as his worst investment, new book says

June 19, 2026 Amazon founder Jeff Bezos reportedly described his ownership of The Washington Post as the worst investment of more...

Jim Love

Jim Love's career in technology spans more that four decades. He's been a CIO and headed a world wide Management Consulting practice. As an entrepreneur he built his own tech business. Today he is a podcast host with the popular tech podcasts Hashtag Trending and Cybersecurity Today with over 14 million downloads. As a novelist, his latest book "Elisa: A Tale of Quantum Kisses" is an Audible best seller. In addition, Jim is a songwriter and recording artist with a Juno nomination and a gold album to his credit. His music can be found at music.jimlove.com

AI models learn to hide dishonest behaviour: Study

Top Stories

Related Articles

Jim Love

Jim Love