'went dark' in protest over the platform's plans to enable AI developers to access the mass of
forum conversations they'd played a vital role in creating.45 Contributors to Stack Overflow, an
internet forum for developers, have been banned from the site after they deleted their
content in order to stop it from being used to train large AI models.46
An increasing number of web publishers appear to be blocking their content from being used
for AI model training, or from being crawled at all. In 2023, the New York Times updated its
terms of service to explicitly prohibit its content from being crawled and asked Common
Crawl to remove any of its existing news articles from its dataset.47 This practice extends to
smaller publishers, too. According to Cloudflare, more than 84% of the websites that use its
hosting services have now applied technical restrictions to stop crawlers from accessing
their content.48 An analysis of the C4 training dataset found that between 2023 and 2024
there had been