Researchers say ChatGPT was trained on text from copyrighted books

May 4, 2023

Researchers at the University of California, Berkeley have published a paper titled “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4,” in which they discuss how OpenAI’s ChatGPT and the GPT-4 large language model were trained on material from copyrighted novels.

According to the study team of Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman, GPT-4 has memorized a wide range of copyrighted content, and the degree of memorization is linked to how often excerpts from the books appear on the web. The team has made its code and data available on GitHub, along with a list of the recognized novels, which includes titles such as Harry Potter, The Lord of the Rings, and The Hitchhiker’s Guide to the Galaxy.

The researchers observe that science fiction and fantasy literature lead the list, which they attribute to the genres’ popularity on the internet. They also note that memorizing certain titles has a knock-on effect: the models respond more accurately to specific instructions about those books. ChatGPT, by contrast, shows less understanding of works in other genres, with which the models are less familiar. The researchers advocate the use of publicly available training data to make the models’ behavior more transparent.

Although the researchers focus less on the copyright implications of memorizing copyrighted texts, text-generating applications built on these models may produce passages that are substantially similar or identical to the copyrighted texts the models ingested.

The sources for this piece include an article in The Register.


Jim Love

Jim is an author and podcast host with over 40 years in technology.
