AI doesn't read or write like humans, and we shouldn't act like it does.
We shouldn’t be affording companies the ability to profit off other people’s creations without their consent, yet despite its intentions, that’s effectively what current copyright law permits.
Sarah Silverman is suing OpenAI and Meta. The lawsuit claims that the training datasets for ChatGPT and LLaMA, the companies’ LLMs (large language models), use pirated content taken from LibGen and Z-Library.
The companies state their models are trained on books, articles, and Common Crawl, a repository of publicly available websites. If they did use pirated content, the companies could be in hot water, but it could be argued that even if they only used legally purchased content, it is no different from a person reading a novel and summarising it, like CliffsNotes.
I disagree.
Machine learning algorithms are an incredibly new way of processing data.
Those scenarios require a human to do the reading and summarising, which for most authors is fine: they expect people to read their work, summarise it, and quote it. They want people to read, interpret, and interact with their content; they want people to engage with their ideas and generate ideas of their own in response.
What they don’t expect is for that work to be fed in full into a private company’s dataset to train a machine to duplicate their content at speeds completely incomparable to human capabilities.
Imagine a single human, reading a novel. Now pretend that human has a photographic memory and they can store that data perfectly. Then they spend the next 450 million years reading without the need for sleep or any kind of rest.
We’re talking about something completely new, completely unseen, and we do not even have the capacity to comprehend the amount of data being fed into these models.
Without any guidelines or restrictions on what can be fed into these models, we are disregarding the rights of those creators to not want their art, music or writing to be fed into the endless churn of data for these megacorporations.
Acting as though a human writing a summary is the same thing as a vast network of computers processing data hundreds if not thousands of times faster than any human is foolish.
Perhaps it is also foolish to try and apply our current copyright laws (which already favour large corporations and not individual creators) to this slew of new technology, but just ignoring the fundamental difference between the two is no way of going about it.
We need copyright reform, we need protections for creators, and we need to stop acting as though machine learning algorithms are remotely comparable to humans in their capabilities, responsibilities, and rights.
There is a perfectly reasonable way of doing this ethically: using content that people have provided to the model of their own volition, with their consent either volunteered or paid for, and not scraped from an epub, regardless of whether you bought it or downloaded it from LibGen.
There are already companies training machine learning models ethically in this manner, and if creators do not want their content used as training data, it should not be.
I cannot believe I’m saying this, but we can look at Adobe’s use of training data for its visual machine learning algorithm Adobe Firefly, collected from their own Adobe Stock library and from public domain art whose copyright has expired. Adobe is actually at the forefront of sourcing this data ethically, and, their other transgressions notwithstanding, they are also creating standards to help determine whether artwork and images are authentic, ideally preventing deepfakes from being used in defamatory ways.
We need to see similar initiatives and transparency taking place in the LLM world. Steps toward ethical use of human created content. A review of consent and fair use of content when it is used not by humans, but by algorithms.
We’re new to this technology, and we’re slowly trying to decide how best to react to it, in a way that protects individuals’ rights and the things they create in a world and an internet that is becoming less and less human.
I believe we need a new legal structure: copyright protections against an individual’s content being used to train models without their consent.
You may believe me a luddite, but I am genuinely excited for the ways this technology can be used, from creating realistic dialogue for video game NPCs to debugging code, to a myriad of use cases we haven’t dreamt of.
I have no illusions that this cat can be stuffed back into the bag, nor do I think it should be. I just want to ensure that in this new world of the machine, there will be a space for the human.
Bunkazine is a monthly (although sometimes more often than that) ramble by sci-fi/fantasy author and general opinionated person, Alexander Robertson.
I totally agree. Another aspect that could be relevant: if I, as a human, memorize copyrighted works and then reproduce them, I am still violating copyright.
GPT, for example, is doing exactly that. Ask it, for example, for the lyrics to a specific song, and it will perfectly reproduce them. Same with pages of books. It has tons of copyrighted works in its storage, and it’s happy to reproduce them. If a human was doing that, it would be a clear copyright violation. If it was a non-AI computer doing it, it would be too.