Data is the new oil, as they say, and perhaps that makes Harvard University the new Exxon. The school announced Thursday the launch of a dataset containing nearly one million public domain books that can be used for training AI models. The project, run under the newly formed Institutional Data Initiative, has received funding from both Microsoft and OpenAI, and consists of books scanned as part of Google Books that are old enough for their copyright protection to have expired.
Wired, in a piece on the new project, says the dataset includes a wide variety of books, with “classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries.” As a general rule, copyright protection lasts for the lifetime of the author plus an additional 70 years.
Foundation language models, like the ones that power ChatGPT, which can convincingly imitate a real human, require an immense amount of high-quality text for their training—generally, the more information they ingest, the better the models perform at imitating humans and serving up knowledge. But that thirst for data has caused problems, as the likes of OpenAI have hit walls on how much new information they can find—without stealing it, at least.
Publishers including the Wall Street Journal and the New York Times have sued OpenAI and its competitor Perplexity for ingesting their data without permission. Proponents of AI companies have made various arguments in defense of the practice. They will sometimes say that humans themselves produce new works by studying and synthesizing material from other sources, and AI isn’t any different: everyone goes to school, reads books, and then produces new work using the knowledge they gained. Remixing can legally qualify as fair use if the new creation is materially different from the original. But that argument fails to account for the fact that humans cannot ingest billions of pieces of text at the speed a computer can, so it’s not exactly a fair comparison. The Wall Street Journal, in its lawsuit against Perplexity, has said the startup “copies on a massive scale.”
Players in the space have also put forth the argument that any content made available on the open web is essentially fair game and that the user of a chatbot is the one accessing copyrighted content by requesting it through a prompt. Basically, a chatbot like Perplexity is akin to a web browser. It will be some time before these arguments play out in court.
OpenAI has struck deals with some content providers in response to the criticisms, and Perplexity has rolled out an ad-supported partner program with publishers. But it is clear they have done so begrudgingly.
Even as AI companies run out of new content to use, commonly used web sources that are already included in training sets have begun restricting access. Companies including Reddit and X have been aggressive about limiting the use of their data as they have recognized its immense value, especially the value of real-time data that can augment foundation models with more up-to-date information about the world.
Reddit makes hundreds of millions of dollars licensing its corpus of subreddits and comments to Google for training its models. Elon Musk’s X has an exclusive arrangement with his other company, xAI, giving its models access to the social network’s content for training and for retrieval of current information. It is somewhat ironic that these companies closely guard their own data while essentially treating content from media publishers as valueless and free for the taking.
One million books won’t be enough to satisfy any AI company’s training needs, especially considering these books are old and don’t contain modern information, like the slang Gen Z kids are using. And to differentiate themselves from competitors, AI companies will want continued access to other data—especially the exclusive kind—so they are not all creating identical models. The Institutional Data Initiative’s dataset can at least offer some help to AI companies trying to train their initial foundation models without getting into legal trouble.