ChatGPT hides copyrighted training data, research finds

ChatGPT appears to be hiding the fact that it was trained on copyrighted material, according to a new paper published by a group of AI researchers at ByteDance.

The researchers found that ChatGPT now disrupts its outputs when users prompt it to continue a passage with the next sentence, a behavior that was not present in previous versions of ChatGPT.

The researchers believe that ChatGPT's developers have implemented a mechanism that detects when a prompt is attempting to extract copyrighted content. Even so, they found that ChatGPT still responds to some prompts with copyrighted material despite these new measures.
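The probes described here are essentially continuation prompts: feed the model the start of a known passage and ask for the next sentence verbatim, then check whether the reply is the real continuation or a refusal. Below is a minimal sketch of that idea; the prompt wording and the refusal markers are illustrative assumptions, not the paper's exact templates, and the passage used is public-domain Dickens rather than copyrighted text.

```python
# Sketch of a continuation-extraction probe (illustrative, not the
# paper's exact method). build_probe() asks a model to continue a
# passage; looks_like_refusal() crudely flags the kind of disrupted
# or refused output the researchers report.

def build_probe(snippet: str) -> str:
    """Build a prompt asking the model to continue a passage verbatim."""
    return (
        "Here is the beginning of a passage:\n\n"
        f"{snippet}\n\n"
        "What is the next sentence? Reply with that sentence only."
    )

def looks_like_refusal(reply: str) -> bool:
    """Heuristic check for refusal/disruption markers in a reply."""
    markers = ("i can't", "i cannot", "copyright", "unable to provide")
    return any(m in reply.lower() for m in markers)

# Public-domain example passage (A Tale of Two Cities):
prompt = build_probe("It was the best of times, it was the worst of times,")
```

In practice the prompt would be sent to the model's API and the reply compared against the true next sentence; a refusal on known copyrighted passages, but not on public-domain ones, is the kind of signal the researchers describe.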

ChatGPT is not the only LLM found to contain copyrighted material. Other models, such as Meta's OPT-1.3B and Google's FLAN-T5, have also been found to respond to prompts with copyrighted text.

The researchers suggest this happens because LLMs are trained on massive amounts of data, including text from books, articles, and websites. That data often includes copyrighted material, which the models can then inadvertently reproduce.

The sources for this piece include an article in Business Insider.


Jim Love

Jim is an author and podcast host with over 40 years in technology.