Internet data unknowingly contributes to the training of chatbots

April 25, 2023

The internet and the enormous amount of data it has generated have had a tremendous influence on the advancement of artificial intelligence (AI). According to a recent Washington Post investigation, the AI industry trained its neural networks using a publicly available dataset spanning 30 years of web publication.

The investigation found that everyday online contributions, such as blogs, web pages, and social media threads, helped AI chatbots learn without their authors' knowledge. In the process, people unintentionally built a large archive of human expression, which allows AI models such as ChatGPT to perform astounding sentence-completion tasks.

The investigation lets users enter any internet domain name and see its contribution to a specific AI training dataset. The researchers examined a dataset that included more than 500,000 personal blogs, which accounted for 3.8 percent of its total “tokens.” However, because some cultures, groups, and subjects may be oversampled while others are neglected, AI training data may carry the biases, limitations, and toxic parts of internet culture.

The advances in AI technology that we see today rest on the immense quantity of information, thoughts, and emotions that people have posted on the internet, a body of material that may be compared to digital stockpiles and landfills.

The sources for this piece include an article in Axios.

Jim Love

Jim is an author and podcast host with over 40 years in technology.
