I’ve recently started a newsletter — imaginatively titled AI and Copyright — with the goal of tracking the shifting legal picture around the copyright status of Foundation Models, including Large Language Models (LLMs) and their cousins, the image- (and video- and 3D-) generating models like DALL-E 2. For the last month I've been sending these weekly. I’m not sure how long I will keep this up, or how long they will be needed. But for as long as it lasts, please subscribe!
There is little doubt that Foundation Models will create great value, both for the world at large and for the companies and entrepreneurs that are starting to build on them. But the AI community now recognizes that data-driven predictive tools like machine learning come with downsides, and these apply, often in different ways, to Foundation Models. My focus here is deliberately narrow: I am zooming in on a single question, namely, how will the legal uncertainty about these models be resolved?
We just don’t know enough about the text corpora these models were trained on. (Who are you, Books2?) But we do know that some nontrivial proportion of the text (and, later, images) they were trained on was copied without permission from its creators, in other words, without respecting copyright.
The people who did the copying have said, when they say anything about this at all, which is not often, that the doctrine of fair use as established in US jurisprudence makes the copying legitimate. In other cases, the copying is said to be allowed under a text and data mining (TDM) research exception, even when that exception covers research only and the resulting model is used for commercial purposes.
These questions have been in public discussion since at least 2016, but are only now really getting traction, and I’m looking forward to a more robust conversation on the matter. I'm no lawyer, but my understanding is that a fair use determination must be based on the facts of the case and weigh several factors, including the economic harm to creators set in motion by the copying in question. It is now clear that Foundation Models trained on unauthorized copies are being used to build services that harm or narrow the market for those creators. What does this mean for the fair use argument?
These same services are creating important new tools that will boost creativity as well, but the fact that some people benefit doesn’t mean the harm isn’t real.
OK, we’ll get into all this in more detail as we go along. There is a lot of uncertainty, and a long conversation to be had! This newsletter will be a platform for me to track that evolving conversation, and to develop and refine my own views on the subject.
Here are the newsletters I've issued so far, in reverse order:
- October 27, 2022: “We're all going to poop rainbows”
- October 21, 2022: Great poetry? Or lifehack listicles?
- October 19, 2022: Breaking: The first LLM training data legal case?
- October 16, 2022: I hear Steve Jobs laughing
- October 8, 2022: How Big Tech is exploiting existing copyright exceptions
- October 1, 2022: Algorithmic disgorgement, ISIS executions...
- September 25, 2022: This requires a legislative solution