derivatives of it, such as LAION-5B4 and The Pile5) along with supplemental data that an AI
developer has crawled themselves.6 An analysis of the widely used C4 training dataset, a
public dataset based on Common Crawl's corpus, found that its content originated from
more than 14 million different web domains.7
AI developers also make use of large datasets that are created and maintained with the
express purpose of being widely used, including by collaborative communities (e.g., datasets
derived from Wikipedia8), open source communities (e.g., WikiSQL9), scientific projects (e.g.,
AlphaFold Protein Structure Database10), and governments (e.g., official statistics). Platforms
for AI development such as Hugging Face11 and Kaggle12 now host large collections of training
datasets of varying provenance. Some AI developers enter into partnerships with other
organizations to gain access to valuable data sources. Large technology organizations, such
as Google and Meta, repurpose the masses of data generated through users' interactions
with their platform services for model training.
Large AI models rely on access to data throughout their lifecycle. Various types and sources
of data, and approaches to accessing it, are used in the process of testing, validating,
benchmarking, and fine-tuning models.13 Once models have been deployed, techniques such
as retrieval augmented generation (RAG) enable them to retrieve, in real time, information
from the web or a user's system in response to queries, as opposed to generating the
response from the trained model alone.
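The retrieval step described above can be sketched in miniature. This is a hedged illustration only: the toy in-memory corpus and the keyword-overlap scoring in `retrieve` are assumptions for demonstration, whereas production RAG systems typically retrieve from the web or a user's documents via search APIs or vector indexes.

```python
# Minimal sketch of retrieval augmented generation (RAG).
# Assumption: a tiny in-memory corpus and a word-overlap retriever stand in
# for real retrieval over the web or a user's files.

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many query words they share, return the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved context so the model can answer from fresh
    information rather than from its trained parameters alone."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The Pile is an 800GB dataset of diverse text for language modelling.",
    "AlphaFold predicts protein structures from amino acid sequences.",
]
print(build_prompt("What does AlphaFold predict?", corpus))
```

The augmented prompt, not the bare question, is what a deployed model would receive, which is why RAG can surface information the model never saw during training.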
4. LAION. (n.d.). Projects. LAION. https://laion.ai/projects/
5. Eleuther AI. (n.d.). The Pile. Eleuther AI. https://pile.eleuther.ai/
6. Baack, S. (2024, February 6). Training Data for the Price of a Sandwich. Mozilla Foundation. https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/
7. Schaul, K., Chen, S. Y., & Tiku, N. (2023, April 19). Inside the secret list of websites that make AI like ChatGPT sound smart. Washington Post. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
8. Wikipedia. (n.d.). Wikipedia.org; Wikimedia Foundation. https://www.wikipedia.org/
9. Salesforce. (n.d.). WikiSQL: A large annotated semantic parsing corpus for developing natural language interfaces. GitHub. https://github.com/salesforce/WikiSQL
10. Google DeepMind & EMBL-EBI. (n.d.). AlphaFold Protein Structure Database. https://alphafold.ebi.ac.uk
11. Hugging Face. (n.d.). Datasets. Hugging Face. https://huggingface.co/docs/datasets/en/index
12. Kaggle. (2024). Datasets. Kaggle.com. https://www.kaggle.com/datasets
13. Hardinges, J., & Simperl, E. (2024, October 15). A data for AI taxonomy. Open Data Institute. https://theodi.org/news-and-events/blog/a-data-for-ai-taxonomy/