coherent theory or practice has emerged for maintaining attribution (and thereby being able
to give credit) for the sources of information used.27 What does a knowledge ecosystem look
like when the very underpinnings of context, container, and credibility are divorced from the
outputs made available to the public?
In the 2000s and 2010s, most machine use of web data addressed targeted
research questions or served contained objectives, such as extracting facts from a body of
scientific literature or patterns from a database. Large AI models, in contrast, require web
data at a much larger scale and are almost indiscriminate in scope, ingesting vast quantities of
multimodal content. As a result, web publishers have begun to report large increases in
machine crawling activity by AI developers.
Numerous open source software projects have described this increase as overwhelming and
amounting to a denial-of-service (DoS) attack, with some estimating that more than 70% of
traffic on their infrastructure is from AI crawlers.28 Some domains are reporting particularly
aggressive practices from AI crawlers, including deliberate circumvention of standard
blocking measures, disregarding terms of service and licenses, ignoring robots.txt directives,
spoofing user agents, and rotating IP addresses to avoid detection.29
These practices demonstrate bad faith and place a huge burden on the web in the form of increased
bandwidth costs and service instability, including on the many nonprofit institutions and
communities that maintain it. Wikipedia, for example, has described how a high volume of
crawling is creating significant work for its site reliability team.30 Some digital collections
maintained by libraries, archives, and museums have been knocked offline.31 This issue is only
likely to worsen as the crawlers used by AI developers not only crawl in order to build static
27 Chandrasekhar, R. (2025, May 12). Legal frictions for data openness. French National Centre for
Scientific Research; InnoCube; Open Knowledge Foundation.
https://ok.hypotheses.org/files/2025/03/Legal-frictions-for-data-openness-open-web-and-AI-RC-2025-final.pdf
28 Venerandi, N. (2025, March 20). FOSS infrastructure is under attack by AI companies. LibreNews.
https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
29 Edwards, B. (2025, March 25). Open Source devs say AI crawlers dominate traffic, forcing blocks
on entire countries. Ars Technica.
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
30 Mueller, B., Danis, C. & Lavagetto, G. (2025, April). How crawlers impact the operations of the
Wikimedia projects. Wikimedia Foundation.
https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/
31 Maiberg, E. (2025, June 17). AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums.
404 Media.
https://www.404media.co/ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/