The new dataset also has limitations. Much public domain material is dated (in the United States, for example, copyright protection typically lasts for 70 years after the author's death). A dataset like this therefore won't be able to ground an AI model in current events or, say, teach it how to write a blog post in current slang. (On the other hand, it could make for a nifty Proust pastiche.)
“As far as I know, this is currently the largest public domain dataset for training LLMs,” says Stella Biderman, executive director of EleutherAI, an open source collective that develops and releases AI models. “It’s an invaluable resource.”
Projects like this are also extremely rare. No other LLMs besides KL3M have been submitted to Fairly Trained for certification. But some of those who want to make AI fairer to the artists whose work has been ingested into systems like GPT-4 hope that Common Corpus and KL3M can demonstrate that there is a contingent of the AI world skeptical of the argument that scraping data without permission is justified.
“It’s a selling point,” says Mary Rasenberger, CEO of the Authors Guild, which represents book authors. “We’re starting to see a lot more licenses and licensing requests. It’s a growing trend.” The Authors Guild and the performers union SAG-AFTRA, along with a few other professional groups, were recently named official supporters of Fairly Trained.
Although it doesn’t have any additional LLMs under its belt, Fairly Trained recently certified its first company offering AI voice models, the Spanish voice-changing startup VoiceMod, as well as its first “AI band,” a heavy metal project called Frostbite Orckings.
“We were always going to see great language models created legally and ethically,” says Newton-Rex. “It just took a little while.”
Updated March 20, 2024 at 2:45 p.m. EDT: The Common Corpus dataset contains 500 billion tokens, not 500 million.