ChatGPT hides copyright training data, research finds

August 23, 2023

ChatGPT is trying to hide that it was trained on copyrighted material, according to a new paper by a group of AI scientists at ByteDance.

The researchers found that ChatGPT now disrupts its outputs when users try to extract the next sentence of a text from a prompt. This behavior was not present in previous versions of ChatGPT.

The researchers believe that ChatGPT's developers have implemented a mechanism to detect whether a prompt aims to extract copyrighted content. They also found that ChatGPT still responds to some prompts with copyrighted material, even with these new measures in place.

ChatGPT is not the only large language model (LLM) found to reproduce copyrighted material. Other LLMs, such as OPT-1.3B from Meta and FLAN-T5 from Google, have also been found to respond to prompts with copyrighted text.

The researchers suggest that this is because LLMs are trained on massive amounts of data, including text from books, articles, and websites. This data often includes copyrighted material, which can then be inadvertently reproduced by the LLMs.

The sources for this piece include an article in Business Insider.

Jim Love

Jim is an author and podcast host with over 40 years in technology.
