This paper describes why we've arrived at these beliefs and are taking this action. We're
publishing this alongside an initial prototype of CC signals and a request for feedback.
We can't make this a reality without community. Join us.
A Note on Terminology
In this paper, we use the terms 'AI' and 'large AI models' as shorthand terms for what we
know is a complex field of technologies and practices. We recognize that AI is not really
'artificial' (in that it is created and used by humans), nor 'intelligent' (at least in the way
we think of human intelligence), and that model size is relative (we use 'large' to
describe models developed since the late 2010s that are able to process large volumes
of multimodal data following the introduction of transformer architectures). We talk
more specifically about certain types or capabilities of AI, such as generative AI models,
where it is necessary.
Background
How AI Models Use Data
Recent progress in AI has been characterized by models of large scale and complex
architectures, capable of tasks such as natural language processing and content generation.
Many of these models have been developed using large amounts of data from the public
web.1 Web crawling plays a significant part in this. It involves using automated programs to
systematically navigate and make copies of data from websites, blogs, forums, books, social
media platforms, and other sources.
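
As a rough illustration of the crawling process described above, the following Python sketch follows links breadth-first and keeps a copy of each page it reaches. It is a minimal sketch under stated assumptions, not any particular developer's crawler: the names `crawl`, `LinkExtractor`, and `SITE` are hypothetical, the "website" is a small in-memory dictionary so that nothing is fetched over the network, and politeness rules that real crawlers typically apply (such as honoring robots.txt or rate limits) are omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for each page copied.

    `fetch` is any callable that returns a page's HTML, or None if the
    page cannot be retrieved (an assumption made for this sketch).
    """
    copies = {}
    queue = deque([start_url])
    while queue and len(copies) < max_pages:
        url = queue.popleft()
        if url in copies:
            continue  # already copied this page
        html = fetch(url)
        if html is None:
            continue  # unreachable page, nothing to copy
        copies[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in copies:
                queue.append(link)
    return copies


# Hypothetical three-page site held in memory (illustrative data only).
SITE = {
    "https://example.org/": '<a href="/blog">Blog</a> <a href="/forum">Forum</a>',
    "https://example.org/blog": '<a href="/">Home</a> <p>A post.</p>',
    "https://example.org/forum": '<a href="/blog">Blog</a> <a href="/missing">Gone</a>',
}

copies = crawl("https://example.org/", SITE.get)
print(sorted(copies))
```

Starting from a single address, the sketch ends up holding copies of every reachable page, which is the sense in which crawling "systematically navigates and makes copies" of web content.
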
Some AI developers, for example, rely on crawlers to extract textual content from different
sources in order to train models to detect patterns and then generate human-like text in
response to prompts.2 The datasets used to train large AI models are often made up of
multiple datasets generated through web crawling (especially Common Crawl3 and
1. Huang, S. & Siddarth, D. (2023, February 6). Generative AI and the Digital Commons. The
Collective Intelligence Project. https://cip.org/research/generative-ai-digital-commons
2. Murgia, M. (2023, September 9). Generative AI Exists Because of the Transformer. Financial Times.
https://ig.ft.com/generative-ai/
3. Common Crawl. (n.d.). Common Crawl. Common Crawl. https://commoncrawl.org/