Governing Machine Use of Web Data, To Date
The use of large volumes of data, including from across the public web, isn't specific to
training and deploying large AI models.
Text and data mining (TDM), the process of transforming large amounts of unstructured text
into structured formats in order to identify patterns, trends, and other insights,14 has long
been deployed in many fields of research, from computer science and linguistics to
environmental sciences and humanities. There is broad international convergence on the
potential social value of TDM, and while the scope and details vary widely, every copyright law
in the world has at least one exception that promotes research purposes.15
Creating archives of the web, such as the Internet Archive16 and Wayback Machine,17 relies on
using machines to systematically navigate and make copies of data from billions of websites.
Common Crawl, mentioned above, was established as a nonprofit foundation in 2007 to
produce large crawls of web data for anyone to access and use for analysis, rather than only
the handful of companies who, at the time, could afford to undertake their own crawling at
scale.18 Before Common Crawl became a key source of training data for large AI models, most
creators and web users would have been unaware of it, and its largely research-oriented use
did not spark major debate.
Web search is similarly predicated on the use of machines to find and store information from
across the web.19 In their traditional form, search engines presented a fairly simple 'deal' for
website owners, which was: if your search product sends us traffic, then we'll allow crawling.
In this context, this reciprocal exchange of value has been important, especially given much
of the web's reliance on advertising revenue based on traffic and clicks.
This doesn't mean machine use of web data has been entirely uncontested or an
anything-goes free-for-all. Some websites and news publishers, for example, sued the
14 IBM. (2021, October 15). Text Mining. IBM. https://www.ibm.com/think/topics/text-mining
15 Flynn, S., Schirru, L., Palmedo, M., & Izquierdo, A. (2022). Research Exceptions in Comparative
Copyright. Joint PIJIP/TLS Research Paper Series.
https://digitalcommons.wcl.american.edu/research/75/
16 Internet Archive. (n.d.). Internet Archive. Internet Archive. https://archive.org/
17 Wikipedia Contributors. (2019, March 8). Wayback Machine. Wikipedia; Wikimedia Foundation.
https://en.wikipedia.org/wiki/Wayback_Machine
18 Baack, S. (2024, February 6). Training Data for the Price of a Sandwich. Mozilla Foundation.
https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/
19 Google. (n.d.). Organizing Information – How Google Search Works. Google.
https://www.google.com/search/howsearchworks/how-search-works/organizing-information/