I presented this paper in mid-October 2022 to the annual conference of the Asian Pacific Copyright Association (the candid shot above is by Vincent Nghai; I’m the blur one!). The field has zoomed ahead at lightning speed in the three months since then, but if anything, the copyright situation has become even murkier. I’m writing a Substack newsletter to try to keep up with events.


The latest AI models, foundation models, require a rethinking of much that has been written about the copyright status of AI training data. The tech industry’s assumption that copying of training data is US-style fair use was always a bit shaky (given that determinations of fair use should be fact-based and specific to the particular use), but the ability of these new models to generate high-quality content, and the way that they are trained, present at least three more reasons why fair use might not apply. First, these models are now strongly impacting the economic interests of the creators whose work has been copied. Second, the new models are built on, and wholly derived from, protected expression, not unprotected facts or ideas. Third, and relatedly, the models are capable of reproducing protected content within the large but not infinite field of their potential outputs. Furthermore, while some have argued that unauthorized copying of training data is legal under European and UK text and data-mining exceptions, those exceptions have certain limits that might also constrain the use of training data for these models. Halting AI development because of copyright violations is hardly going to be the right outcome, but at the very least the troubled copyright status of the models should help push an urgent consideration of the many issues at stake in AI governance.


A new generation of AI tools is here. Models like GPT-3, Stable Diffusion, DALL-E 2, or PaLM represent a phase transition in machine learning, a new set of capabilities of staggering power, including the ability to generate useful content, in images and text, and increasingly in video and 3D environments. Their proponents argue that these models allow the disaggregation of the processes of conception and substantiation in the creation of a work. That is to say, you don’t need to know how to draw to create a professional-quality illustration; you just need to have an idea you can put into words. The model will create the image. Nearly all commentators see that these models have enormous disruptive potential. They are already disrupting copyright markets. There are many urgent questions to answer around these models as they are rushed from development to commercial deployment; one of these is to understand their legal status.

In order to make my overall point, that new thinking is required for these new sorts of models, I am going to have to spend a fair amount of time describing how they work. Terminology is important here, and I am using the term “foundation models” to describe these new tools. This terminology was popularized earlier this year by Stanford University’s Institute for Human-Centered Artificial Intelligence, which has spun off a dedicated Center for Research on Foundation Models.[1]

The phase transition in AI

How did we get here? What brought us to this “phase transition”?

The story begins with Large Language Models (LLMs),[2] models like Google’s BERT and OpenAI’s GPT, its third version now in public release. At their base, these are text generators, trained to predict something very simple: they are given a number of words, and asked what the next words will be in that sequence. (To be only slightly more precise, the models don’t know what “words” are, so what they are given are “tokens”: sequences of letters or punctuation marks, a language-neutral way of feeding language to a model. They predict the next token in a sequence of tokens.) It is from this very basic challenge, one of interest originally only in the geeky world of natural language processing (NLP), that the first of the foundation models emerged.
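To make the idea of tokens concrete, here is a minimal sketch in Python. It is a toy scheme that assumes tokens are simply whole words and punctuation marks; the tokenizers actually used by BERT or GPT work on sub-word pieces and are more elaborate. The function names `tokenize` and `build_vocab` are illustrative, not taken from any library.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (a toy scheme, not a real LLM tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

tokens = tokenize("the cat sat on the mat.")
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # the text as a sequence of tokens
print(ids)     # the same sequence as integers, which is what a model is actually fed
```

The point of the sketch is only that the model sees a sequence of integers; whatever it learns about language, it learns from the patterns in those sequences.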

The phase transition comes from two things: first, the vast increases in scale of these models, and then, the emergence of unexpected new capabilities as these models grew. Big changes in quantity have led to marked changes in quality.

There are three sorts of increases in scale: A) adding computing power to enable bigger models, with more analysis of more data, B) new software approaches that allow the models to process more information in parallel, and C) a new approach that allows much larger amounts of data to be used in training.[3]

Let’s start with the data question, central to our concerns here. As a reminder, much of the text used to train these models is copyrighted, and was copied without authorization. The foundation models use a new form of training, self-supervision, which means data does not have to be organized, parsed, placed in tables, tagged and cleaned, as was often the key requirement in earlier generations of supervised machine learning (and sometimes the source of copyright holders’ value-add for their text-and-data mining license business, estimated for the UK to be worth GBP 350m annually[4]). Self-supervised models by contrast can work with huge amounts of relatively raw text or images (or audio, etc.), and this is a key reason why the amounts of data and number of works processed have increased so dramatically in recent years. There is no mining of facts out of the ore of expression here. Expression is raw material for the foundation models.

Major breakthroughs were achieved in early 2019 with the BERT LLM, trained on tens of millions of tokens, from English language Wikipedia and from some eleven thousand novels on the Smashwords self-publishing website. Three years later, the amount of training data has increased 10,000 fold. Google’s PaLM, for example, was trained on 780 billion tokens, roughly 585 billion words, mostly scraped in various ways from copyrighted content on the Internet.

Data is essential to these models in a way that might not be familiar from older metaphors of software as a machine to process data. First of all, the models are not “expert systems”. This was last century’s approach to AI, which attempted to break down expertise into a series of decisions. Previous generations of generative art software work this way: they follow formulas and rules to create new content, as when an equation or program describes a series of shapes, or creates a minuet in the style of Mozart. Neither do the LLMs receive data simply as an “input” to the model, running through a set of operations until it comes out the other side processed and packaged, sliced and diced. No, the models in question here are much more closely tied to the data used to train them. The LLMs are in fact a powerful statistical abstraction of the training data, a representation of the data as vectors or sets of relationships between the different tokens, creating a many-dimensional space of relationships between tokens, which generates predictions when prompted. The model is a kind of potent, highly compressed representation of text (or image), a representation which also has the ability to reproduce the trained-on texts, and a huge space of variations on them. As one technical publication puts it, “the dataset is practically compressed into the weights of the network”.[5]

The second area where we’ve seen an increase in scale is in the increasing sophistication of the software that trains models, that refines the statistical model above. The key breakthrough, confusingly called a “transformer” in the industry, is in the way the software manages its prediction against both the text in the prompt or cue, and the length of text that the model is working with as it decides what token to spit out next. Transformers allow processing of much larger amounts of text in parallel, in both directions, at the same time. The literature refers to the formula for what to consider when making a statistical prediction as “attention”, one of the very many anthropomorphisms that litter the conceptual field.
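For readers who want to see what an “attention” formula looks like, here is a minimal sketch of scaled dot-product attention, the standard formulation from the transformer literature, written in plain Python. It is an illustration under simplifying assumptions: real models learn the query, key and value vectors, run many attention “heads” in parallel, and stack many layers. The function names and the tiny input vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    The weight given to each position depends on how well the query
    matches that position's key (dot product, scaled by sqrt of key size).
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Two positions with identical keys receive equal weight,
# so the output is the plain average of their values.
result = attention(queries=[[1.0, 0.0]],
                   keys=[[1.0, 0.0], [1.0, 0.0]],
                   values=[[0.0, 2.0], [4.0, 6.0]])
print(result)
```

Every query is scored against every key independently, which is why the computation can be done for all positions in parallel.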

And for our third factor, this parallel processing approach on large amounts of data requires large amounts of computing power. One report sees the amount of computing power used by the major models as doubling every six months since 2016.[6] OpenAI saw a doubling every 3.4 months, and a 300,000x increase in the computing power devoted to machine learning from 2012 to 2018.[7] One estimate has it that training GPT-3 took as much electrical power as 126 Northern European households use in a year.[8]
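The two OpenAI figures can be checked against each other with a little arithmetic. The sketch below assumes steady exponential growth, which is a simplification, and the helper names are illustrative. A 300,000x increase is about 18 doublings; at one doubling every 3.4 months that takes a little over five years, which fits the 2012 to 2018 window.

```python
import math

def doublings_needed(growth_factor):
    """Number of doublings required to reach a given overall growth factor."""
    return math.log2(growth_factor)

def months_for_growth(growth_factor, doubling_months):
    """Months needed to reach the growth factor at a fixed doubling time."""
    return doublings_needed(growth_factor) * doubling_months

n = doublings_needed(300_000)
fast = months_for_growth(300_000, 3.4)   # doubling every 3.4 months
slow = months_for_growth(300_000, 6.0)   # doubling every six months

print(f"{n:.1f} doublings")
print(f"{fast / 12:.1f} years at a 3.4-month doubling time")
print(f"{slow / 12:.1f} years at a six-month doubling time")
```

At the slower six-month doubling rate the same growth would take roughly nine years, which shows how sensitive these projections are to the doubling time assumed.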

So when we talk about scaling up these models, we are talking about exponential increases in scale along three axes. Exponential growth in three dimensions. And if any limit to this growth is approaching, it will be the supply of data, though increasingly models are trained on data that they themselves generate. Costs of chips and the energy they require are becoming a practical limit, and one reason most of this research is now performed within large companies and not in research institutions, or if it is performed in research institutions, the required computing is sponsored by very well-capitalized companies. According to the most recent of the influential State of AI reports,

“Over [the last decade], the percentage of these projects run by academics has plummeted from ~60% to almost 0%. If the AI community is to continue scaling models, this chasm of ‘have’ and ‘have nots’ creates significant challenges for AI safety, pursuing diverse ideas, talent concentration, and more.”[9]

So the text models were being scaled up, with more and more data fed to them, with this new transformer architecture being used and greater computational space allocated. By the time of GPT-3 new capabilities emerged that surprised and delighted researchers.[10]

These improvements in capacity and task range could not have been predicted from the capabilities of older, smaller models. And to this day no one can really say in any definitive way why these new capabilities emerged. Researchers are still trying to work out what is going on. But the leap in capabilities has inspired a fresh burst of energy in an already pretty energized sector of technology development. The Silicon Valley buzz is now entirely here.

But why “foundation” models? The Stanford team gives several reasons, aside from this question of capability.

First, as hinted above, the models turn out to be highly generalizable. They can be used for purposes for which they were not specifically trained. (Remember the LLMs were trained only to predict what word comes next in a sequence.) Models are so generalizable, so amenable to being used for different purposes, that they are understood within the field to be the foundation of the next cycle of developments in AI, a cycle which is having (or is about to have) much deeper, much stronger impacts in the real world than previous AI cycles. In transfer learning, models trained in one task can be retrained in entirely different arenas. Models “pre-trained” to predict words could be usefully retrained to, say, predict protein sequences.[11]

Not only are they general in themselves, they can be used in a modular way. A model can be “fine-tuned”, trained on specific tasks in the same general realm. Much of the usefulness of the foundation models is that work can be done “on top of”, or as adaptations of, them.[12] They serve as a foundation upon which different sorts of edifices of function can be built, through a variety of different processes of adaptation, fine-tuning and “few-shot” prompting. Also, different foundation models can be connected together in interesting ways. Understanding this modular architecture is crucial, and has direct relevance to the copyright questions around these models.

Also, these are foundation models in the sense that we are still at a very early stage in their development, and there is much that is not understood about how they work. They are highly powerful, but they can also fail unexpectedly. The LLMs are so good at language, they seem to deliver more than they actually do. (I have a few colleagues like that.)

And finally, Stanford’s Center says they use the word foundation to remind us that the decisions we make about how to govern the use of these models today will determine much about the edifice that is being built, at lightning speed, and which will emerge on top of them.

What are the sorts of capabilities that have emerged from these models? Let’s confine ourselves only to the text models here. Not only did they demonstrate a huge leap in performance in generating plausible-sounding text, and in natural language processing tasks like summarization. They can translate between human languages, despite not being trained for that task. They turned out to be able to write software. They can translate between software languages. They can write documentation of software. Microsoft’s commercial GitHub Copilot product already has around 1.2m developers using the $10/month product to help them write software. The GitHub CEO recently claimed that about 40% of the code produced by Copilot-enabled developers comes from the algorithm.[13]

But there’s more. Text prediction models can now do math. They can play chess. In 2019, an adapted foundation model scored 91.7% on the NY Regents 8th grade science exam. A recent paper describes using a text prediction model to power a general purpose domestic robot. It knows to pour water into the kettle before switching it to boil.[14] Remember all of this comes from models initially trained only to fill in missing words.

So great are both the reality and the hype here (and I am not even bringing in the image-generation models, and the stunning text-to-image developments) that it’s worth considering what the models can’t (yet) do. One important failure point is that the models are strangely bad with facts. GPT-3’s performance on the “truthful question answering” benchmark is actually worse than random[15]. (In order to get around this, GPT-3 is sometimes connected to a model which checks facts with Wikipedia — this is part of the modular nature of the foundation models referred to above.)[16]

Foundation models are a system

Developments in the field are best understood as a system that involves different activities, each of which depends on the one prior. This is essential to know for those of us outside the system working to understand it as a whole.

Using the Stanford analysis, these are the five main activities which make up an AI system, in which the foundation models are embedded:

Data generation. This is the starting point. Data is generated by people, by human activity (if only to measure natural phenomena). All data belongs to someone. The largest amounts of data are owned by the big tech companies, but they do not own large repositories of high quality text.

Data curating. What data will you train your models with? When tech company researchers looked for large bodies of text on which to train their models to be better at generating text, they had to go to copyright holders. Or rather, they chose to copy from copyright holders, without their permission.[17] Benjamin Sobel has developed a typology of AI training data that is useful:[18]

  • public-domain data,
  • licensed data, and
  • copyrighted data used without a license.

He then splits the unlicensed copyrighted data into two categories that depend on how the data is used: A) data used for market-encroaching purposes, and B) data used for purposes which are not.

Training. This is the process of creating the statistical model that is used to predict new sequences of text. The LLM is asked to complete sentences, or fill in a missing word, and of course will get the answer wrong many times if it starts from an untrained neural network. But over time, after corrections, it will work its way through the possibilities, and start to build up a model, a map, of which words come close together, and of how words used in the beginning of a sentence will influence the probabilities of which words will complete the sentence. Of how a certain sentence will predict the next sentence, and so on. All based on massive computing, massive amounts of training data and increasingly clever software. It does not learn grammar, it learns to predict what comes next. As discussed earlier, the best metaphor might be to think of training as actually creating a compressed statistical representation of the content, one that can also be used to generate many plausible variations of the content.
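The idea of a statistical map of “what comes next” can be illustrated with a deliberately tiny stand-in: a bigram model that simply counts which token follows which. To be clear, this is not how an LLM is trained (LLMs adjust the weights of a neural network over many rounds of correction); the counting approach is swapped in here because it fits in a few lines, and the names `train_bigram` and `predict_next` are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for every token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of a token, or None if it was never seen with one."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))    # the follower seen most often after "the"
print(predict_next(model, "slept"))  # nothing ever followed "slept" in the corpus
```

Even in this toy, the “model” is nothing but a condensed record of the training text: every prediction it can make is built from expression it was given.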

Adaptation. The probabilities mapped by a foundation model, and its outputs, can then be tweaked in different ways as a model is adapted for different uses. This chart from the Stanford paper gives us three sorts of adaptation: 1) task specialization, 2) model patching, and 3) temporal adaptation.[19]

Figure 18 from Bommasani et al., p. 85

I don’t think this is necessarily definitive: published in July 2022, it is already ancient history, as new ways of connecting and adapting models are being described every day. See the first strategy of adapting a model to be better at question answering by connecting it to a fact-check process (typically using Wikipedia). We will definitely come back to this diagram, particularly the “Copyright Warning” model patch.

The simplest way to “task specialize” a foundation model is with a prompt. Simply by giving the model a few concrete examples of what you are looking for, you can effectively tune the statistical model, without having to go back and redo the whole enormous thing. This is called “few-shot” prompting. Here is an example of a prompt that helps adapt the model for a specific purpose.

Tweet: "I hate it when my phone battery dies."
Sentiment: Negative

Tweet: "My day has been 👍"
Sentiment: Positive

Tweet: "This is the link to the article"
Sentiment: Neutral

Tweet: "This new music video was incredibile"

And then you carry on from there. This prompting turns the foundation model into a sentiment analysis tool, one that does not require the painful building of vocabulary lists or attempts to come up with equations that explain irony and sarcasm, problems that plagued earlier approaches. It’s hard to believe, but these few-shot prompts are enough to increase the model’s performance significantly on given tasks.
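A few-shot prompt of this kind is just text, so it can be assembled mechanically. The sketch below builds one from labeled examples and leaves the final label blank for the model to complete. The function name `build_few_shot_prompt` is hypothetical, and the step of actually sending the prompt to a model is deliberately left out, since that depends on whichever service is being used.

```python
def build_few_shot_prompt(examples, query):
    """Assemble labeled examples plus one unlabeled query into a single prompt string."""
    blocks = [f'Tweet: "{text}"\nSentiment: {label}' for text, label in examples]
    # The final block ends at "Sentiment:" so that the model's most likely
    # continuation is the label itself.
    blocks.append(f'Tweet: "{query}"\nSentiment:')
    return "\n\n".join(blocks)

examples = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been 👍", "Positive"),
    ("This is the link to the article", "Neutral"),
]

prompt = build_few_shot_prompt(examples, "This new music video was incredible")
print(prompt)
```

Note that nothing in the model is changed by this: the examples steer its prediction only for the duration of the prompt.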

Sometimes the fine-tuning involves a lot more data. Codex, the model behind Microsoft’s Copilot software assistance tool, already earning millions on the market, is a GPT model that was fine-tuned on another large data set: publicly available software code scraped from GitHub, a popular software repository.

Deployment. Once models are tuned, adapted, filtered for compliance, etc., they are deployed into the real world. Many of the AI businesses launching every day now are creating adapted models on top of GPT-3, accessed via API from OpenAI. This means the model itself is not available for study, by the end-user or by the API licensee. It is a black box to those companies and individuals who sign up to use it. Neither do we have visibility on the filtering of outputs done by OpenAI to attempt to “de-bias” the model, and filter out harmful content. To come back to the Stanford paper,

“...the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties.”[20]

That is a whirlwind tour of the new world of the foundation models, and really, just a whirlwind tour of the LLM corner of that world.

Is it Fair Use?

Let us turn now to the arguments used to justify the unauthorized copying of copyrighted training data, beginning with fair use.

BookCorpus is one of the key early data sets, used in the training of BERT and then in RoBERTa, BART and others in this lineage. It was also used as part of the training of the GPT lineage of models[21]. It was created by scraping and copying some 11,000 novels found on the Smashwords website (actually on investigation there was found to be quite a bit of duplication in the dataset downloaded, with only around 7000 unique novels left after de-duping). In 2016, when BERT was announced (as having been used to help improve the auto-complete function in Gmail), Guardian journalist Richard Lea noticed that while most Smashwords novels were free to download, they often came with copyright statements that were very clear that novels were “licensed for your personal enjoyment only”.

A later audit of this dataset concluded “we found that many books contained copyright claims that, based on a plain language interpretation, should have prevented their distribution in the form of a free machine learning dataset. Many books explicitly claimed that they ‘may not be redistributed to others for commercial or non-commercial purposes,’ and thus should not have been included in BookCorpus. Also, at least 323 books were included in BookCorpus for free even though the authors have since increased the price of the book.”[22]

Journalist Lea also got a comment from an unnamed Google spokesperson, which is the first relevant fair use claim that I’ve been able to track. “The machine learning community has long published open research with these kinds of datasets, including many academic researchers with this set of free ebooks – it doesn’t harm the authors and is done for a very different purpose from the authors’, so it’s fair use under US law.”[23]

The fair use assertion has been made often enough that it has entered the consciousness of the users of these new models. Rather than list more of the claims of fair use, by tech CEOs and also by legal scholars, let me repeat the justifications recently given by a young Nigerian programmer known as MysteryInc152 for his creation of a tool aimed at reproducing the art style of well-known professional 2D artist Hollie Mengert. In order to fine-tune his version of image generation model Stable Diffusion, he fed unauthorized copies of 32 of Mengert’s images[24] to the model. Tech blogger Andy Baio reports on his discussions with MysteryInc152:

“His take was very practical: he thinks it’s legal to train and use, likely to be determined fair use in court, and you can’t copyright a style....He also thinks it’s inevitable: Adobe is adding generative AI tools to Photoshop, Microsoft is adding an image generator to their design suite. ‘The technology is here, like we’ve seen countless times throughout history.’”[25]

But of course not everyone has been so fast to imagine that unauthorized copying to train AIs would be fair use. Benjamin Sobel raised the matter in urgent terms, during his tenure at the Berkman Klein Center for Internet and Society, with a late 2017 paper on “Artificial Intelligence's Fair Use Crisis”[26]. The paper is summarized as “current fair use doctrine threatens either to derail the progress of machine learning or to disenfranchise the human creators whose work makes it possible.”  His paper has been cited by key critics of the current AI governance system.[27]

It would be far beyond the scope of this paper to try for a quack adjudication of fair use, but let me at least rehearse the arguments why unauthorized copying of training data like BookCorpus may not be fair use.

Argument one: Unauthorized copying now leads to market encroachment

Models built on unauthorized copying are now being used in ways that not only narrow or encroach on the market for copyright content, against the interests of the creators whose works were copied, but may entirely reshape some markets for creative work.

Sobel’s later 2022 paper anticipates this, as his typology further splits copyrighted training data into two parts: that which will be used for market-encroaching purposes, and that used in ways which do not affect markets. As an example of the second, the interests of photographers are perhaps not harmed when their photographs are copied to train self-driving cars to recognize stop signs. This is very useful conceptually, but I’m not sure how it helps us in a practical sense. The same data is used and re-used to train many different models. The same foundation model may be combined with others or fine-tuned with additional data to perform tasks unimagined by the creators of the original foundation model.

The market encroachment has moved extremely quickly just in the weeks before this writing, to the point where “generative content” has become an important investment theme for some venture capitalists. One well-established VC, Sequoia Capital, sees “the potential to generate trillions of dollars of economic value”.[28] While the caveat applies that this represents tech-promoter hype, there is also real money being invested on the strength of similar views of the market.

Here is a very quick selection of some of the hundreds of businesses being launched on the back of AI-tooled content, built from unauthorized copying:

  • Scale.AI announced a product to “enable marketers to AI-generate UNLIMITED and INFINITELY CREATIVE images of their products for ad creatives, brand campaigns and  social media”. The AI model can generate background images for product shots, while preserving the integrity of the product image, in a move that looks like it will significantly damage the product photography market.[29]
  • Charlie 2.0, AI content creation for everyone - https://GoCharlie.ai. From the website: “Charlie is your new best friend for creating content. He can generate HD, 2K, 4K, Widescreen, Vertical and square images. In addition, he can create highly engaging blogs and other text content for branding and marketing needs.” Who needs to hire troublesome writers and illustrators when you can have a friendly AI robot dog cartoon to do all that hard work for you…
  • Jasper.ai raised a Series A round of US$ 125m to fund growth of a business to auto-generate promotional blog posts and other marketing materials. It already has 80,000 subscribers and US$ 35 million in revenues according to the Wall Street Journal.[30]

It is interesting to note that before market encroaching content generation hit the market, some tech commentators accepted that use of training data might no longer be fair use were it to be used in market-encroaching fashion. To pick one example, a 2020 submission to the US Copyright Office by the Business Software Alliance argued that

“creating a database of lawfully accessed works for use as training data for machine learning will almost always be considered non-infringing in circumstances where the output of that process does not compete with the works used to train the AI system.”

By 2022 we have a host of well-capitalized companies building businesses that do indeed compete with the works used to train their AI systems. I haven’t seen a statement by the Business Software Alliance warning their members that the use of the training data may not be fair use, but many in the technology world do fully understand the legal uncertainty.[31]

Even the original 2016 Google defence of the Smashwords copying was predicated on the fact that authors were not harmed, and the copying was done with a very different end purpose in mind. Neither condition remains true with the latest models and their many adaptations.

Argument two: Copying training data is not reproducing unprotected facts or ideas, it is reproducing protected expression

In an influential February 2020 paper titled “Fair Learning”, Mark A. Lemley, a professor at Stanford Law School, and his colleague Bryan Casey discuss why unauthorized copies should be allowed to train AIs, even though “AIs aren’t transforming the databases they train on; they are using the entire database, and for a commercial purpose at that.” They start with a policy argument (copyright shouldn’t stand in the way of AI development), and then a practicality argument (it is practically impossible to license all the data one would need to train an AI) before coming to their legal argument, that AIs might copy expression, but they are actually only interested in learning facts. And facts are not meant to be protected by copyright.

This I would argue does not apply to the LLMs. GPT-3 is not at all interested in facts or ideas, it only deals with expression. It is a statistical model of expression, not a model of concepts or ontologies, a knowledge graph. What is style if not a statistical measure — the frequency of use of certain words, the average length of sentences?  Despite GPT-3’s great power, on question answering benchmarks it performs worse than random guessing.[32] As one recent paper put it “to maximise their objective function they [LLMs] strategize to be plausible instead of truthful”.[33] In order to get a text generator to answer questions correctly, the LLM has to be connected to another module which will run assertions of facts against Wikipedia, in order to improve the model’s ability to answer questions. This is the “question answering” adaptation mentioned in the diagram above.

This is not to argue that such adaptations are not useful. It is to argue that they have the unintended consequence of hiding aspects of the operation of the LLMs, in this case masking the fact that the LLM copies and reproduces expression, not facts. Foundation models do not learn facts, ideas or concepts. These are written (and debated about) by humans in Wikipedia.

Lemley and Casey argue that “fair use should consider under both factors one and two whether the purpose of the defendant’s copying was to appropriate the plaintiff’s expression or just the ideas.” A fair point, but the outcome of this enquiry in the case of the LLMs will be that the defendants are appropriating expression and it is hard to see how that can be protected by fair use. Here we should understand that copies were fed to the LLM in order to appropriate expression, even if the copied expression is not reproduced exactly by the outputs requested from the model.

One of the elements of the class action that was being considered against the Microsoft-owned, OpenAI-powered Copilot software authoring tool is that it can and does reproduce copyrighted code, reproducing protected chunks of the software it was trained on.[34]

In 2021 a group of researchers (including some from Google and OpenAI) were “able to extract hundreds of verbatim text sequences from the model’s [GPT-2’s] training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.”[35]

[Ed note: This para and example were added in January 2023] Here is one example I was able to create that revealed something about ChatGPT’s ability to copy copyrighted works, and the way that ability is being filtered in model patching stages:

Screenshots from ChatGPT

I call your attention back to the diagram on adaptation above, which includes, as an example of “model patching”, a copyright compliance patch to prevent the model from reproducing copyrighted text. The owners of these models have been clear that they do add “compliance patches” to filter the output of their models, to avoid obvious harms, abusive content and perceptions of bias. They do not share the nature of these patches publicly. In any case, it would seem that it is in the nature of the LLMs to be able to reproduce the copyrighted content they were trained on, and if they do not in practice, it is both because they produce so many possibilities, and because such behavior is masked.
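Since the actual patches are not public, the following is purely a hypothetical sketch of one way verbatim reproduction could be detected: comparing runs of consecutive words (n-grams) in a model’s output against a reference text. Nothing here describes what any model owner actually does; the function names, the five-word window and the sample sentences are all assumptions chosen for illustration.

```python
def ngrams(tokens, n):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(output_text, reference_text, n=5):
    """Fraction of the output's n-word runs that also appear word for word in the reference."""
    out = ngrams(output_text.lower().split(), n)
    ref = ngrams(reference_text.lower().split(), n)
    if not out:
        return 0.0  # output too short to contain any n-word run
    return len(out & ref) / len(out)

# Hypothetical sample texts, for illustration only.
reference = "it was the best of times it was the worst of times"
copied = "she said it was the best of times indeed"
original = "a completely different sentence about kettles and robots today"

print(verbatim_overlap(copied, reference))
print(verbatim_overlap(original, reference))
```

A filter built on a score like this would catch only near-exact copying; it would say nothing about close paraphrase or imitation of style, which is part of why masking verbatim output does not settle the underlying question.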

Copying training data under exception

Do the text and data mining (TDM) exceptions in various jurisdictions allow unauthorized copying for training data? The TDM exceptions now on the books in the UK, Switzerland, Japan and Singapore, and embodied in the EU’s Digital Single Market Directive, were mostly conceived in the context of the earlier wave of AI. They were not designed for foundation models and generative content. Singapore’s exception, for example, which came into force just under a year ago, was first mooted in 2017.

Still, most of the exceptions do have various limits or guardrails to protect the interests of creators. Arguments that copying training data under exception is “perfectly legal” need to reckon with these limits rather more than they have to date.[36]

Commercial vs non-commercial purpose / research only

The relevant clause (29A) in the UK Copyright, Designs and Patents Act 1988 makes very clear that “a person who has lawful access to the work may carry out a computational analysis of anything recorded in the work for the sole purpose of research for a non-commercial purpose.” I’m not sure if the distinction between the work and “anything recorded in the work” has any relevance, but the crucial point here is that the sole purpose of the copying must be non-commercial. The Swiss exception (Federal Act on Copyright and Related Rights, Chapter 5, Article 24(d)) doesn’t use the commercial/non-commercial distinction, but it insists that the copying be done only for the purposes of scientific research.

Here we come to the points discussed earlier about the foundation models existing in a system, in which data curation and collection, training, adaptation and deployment are linked and dependent on each other. Is it the intent of the law that if a non-profit research center makes unauthorized copies to collect the training data, then a commercial organization may use the same training data to train a model for commercial purposes? How about if the computer resources required to gather and clean the data were donated by the commercial entity in question?[37]

Admittedly, the distinction between commercial and non-commercial purposes was never going to be easy to apply. Still, the accusation by Baio that big tech is “data-laundering” would seem to be worth taking seriously.

Reproduction, but no Communication

The UK exception, at 29A(2)(a), says: “Where a copy of a work has been made under this section, copyright in the work is infringed if the copy is transferred to any other person, except where the transfer is authorised by the copyright owner…”.

The corpora used to train LLMs are often posted online, for downloading and checking of various kinds. They are routinely communicated within the machine learning community, and such communication would therefore seem, on the face of it, not to be covered by the exception. Data may be copied by a non-commercial research group, but it is then often communicated to a business entity.

Depending on the extent to which one accepts the paradigm that a machine-learning model is actually a compressed representation of the data it was trained on, it may also be that the models themselves communicate the material copied, in their routine operation.

Thoughts and sentiments

A 2018 law in Japan also includes a text and data mining exception, but it too has guardrails. Recently the use of the work of popular manga and anime artists to train image generation models which can then reproduce works “in the style” of those artists has created a backlash among artists in Japan.[38]

An article on the controversy quotes a Tokyo-based lawyer as saying that Japanese copyright law allows copying for machine learning under exception. My non-lawyer understanding of the relevant Chapter 5, Subsection 2, Article 30-4 of the Japanese law[39] is that it takes into account the uses of the models trained under exception. Specifically, such copying would only be allowed if “it is not [the copier’s] purpose to personally enjoy or cause another person to enjoy the thoughts or sentiments expressed in that work”. And separately, that “this does not apply if the action would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation.”

So it is not at all clear that copying of works for the purposes of training foundation models would be allowed under the Japanese exception, given that much generated content is indeed created to cause someone to enjoy the thoughts or sentiments expressed in the works copied, and that the copying does seem to unreasonably prejudice the interests of the copyright owner, as we have discussed under the fair use claim.

Unfortunately, time does not allow a consideration of how training data interacts with the moral rights of attribution and integrity applicable in European contexts, which would also seem to be issues for much of the copying of works as training data.


Please don’t conclude from this paper that I am arguing against the use of the foundation models. My interest in this issue is from a position of wanting these tools to be deployed in a way that is equitable and which recognizes the interests of the creators whose works have been copied. These same creators, and rights holders, will be the best users of these new tools once we have addressed issues of fairness and designed better governance of them. As we speak I am fine-tuning the GPT-3 model to create an editor’s assistant tool for my own employer, NUS Press, though I am not sure whether I will feel able to deploy it once it is ready. I am thoroughly enjoying generating images with Stable Diffusion, learning more about the dark arts of prompting text-to-image generators.

The practical question of how to reconcile the social interest in AI development and the rights of creators as protected by copyright will not be a simple one. Other issues, especially rights of privacy, may be even more important than copyright. But I do feel very strongly that simply accepting these new tools as inevitable, a genie already out of the bottle, is not the correct response. Surely our experience with social media has taught us that a bit more reflection on the take-up of powerful new technologies might be a useful thing. Copyright law and expertise are among the tools at our disposal to do that, and so I believe we should engage on the issue.


[1] The key paper setting their thinking out is Rishi Bommasani et al., ‘On the Opportunities and Risks of Foundation Models’ (arXiv, 12 July 2022), http://arxiv.org/abs/2108.07258.

[2] LLMs are not the only foundation models, but it was via LLMs that the foundation model breakthroughs were made.

[3] This view of the three dimensions of scale is now a commonplace in the AI community. For a very clear discussion of these three dimensions, and the way they interact, see the interview with OpenAI President and Chairman Greg Brockman at https://exchange.scale.com/home/videos/llms-generative-models-foundation-openai-greg-brockman

[4] Publishers Association (UK), ‘Publishers Association Briefing for the Intellectual Property Office on Text and Data Mining’ (Publishers Association (UK), 26 August 2022), https://www.publishers.org.uk/publications/publishers-association-briefing-for-the-intellectual-property-office-on-text-and-data-mining/.

[5] Chuan Li, “OpenAI's GPT-3 Language Model: A Technical Overview”, the Lambda blog, June 3, 2020, https://lambdalabs.com/blog/demystifying-gpt-3

[6] Jaime Sevilla et al., ‘Compute Trends Across Three Eras of Machine Learning’ (arXiv, 9 March 2022), http://arxiv.org/abs/2202.05924.

[7] Citation to come

[8] See the University of Copenhagen press release at https://news.ku.dk/all_news/2020/11/students-develop-tool-to-predict-the-carbon-footprint-of-algorithms/

[9] See Nathan Benaich and Ian Hogarth, ‘State of AI Report 2022’, Annual, State of AI, n.d., slide 82. https://docs.google.com/presentation/d/1WrkeJ9-CjuotTXoa4ZZlB3UPBXpxe4B3FMs9R9tn34I/

[10] Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, et al. “Emergent Abilities of Large Language Models.” arXiv, June 15, 2022. https://doi.org/10.48550/arXiv.2206.07682.

[11] https://ai.facebook.com/blog/protein-folding-esmfold-metagenomics/

[12] The GPT-3 model available via the OpenAI API is already highly adapted from its foundation. First, it was further trained on a set of text processing instructions, and given feedback via human evaluations. See Long Ouyang et al., ‘Training Language Models to Follow Instructions with Human Feedback’, 2022, https://doi.org/10.48550/ARXIV.2203.02155. Secondly, it filters outputs so as to avoid potential liability for OpenAI.

[13] And just two weeks ago the announcement of a potential copyright infringement class action lawsuit against the Copilot project attracted a great deal of attention. See https://githubcopilotinvestigation.com/ or https://www.theregister.com/2022/10/19/github_copilot_copyright/

[14] Citation needed.

[15] Wei et al, p.3.

[16] For more on the limits of LLMs, see Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, et al. “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.” arXiv, June 10, 2022. https://doi.org/10.48550/arXiv.2206.04615.

[17] This was first brought to public attention in 2016, see Richard Lea, ‘Google Swallows 11,000 Novels to Improve AI’s Conversation’, The Guardian, 28 September 2016, sec. Books, https://perma.cc/LG94-ZXZA.

[18] Sobel, Benjamin. “A Taxonomy of Training Data: Disentangling the Mismatched Rights, Remedies, and Rationales for Restricting Machine Learning.” In Artificial Intelligence and Intellectual Property, 221–42. Oxford University Press, 2021. https://doi.org/10.1093/oso/9780198870944.003.0011.

[19] Figure 18 from Bommasani et al., p. 85

[20] Rishi Bommasani et al., p. 1

[21] The HuggingFace page for this dataset shows 355 different models trained with this dataset. https://huggingface.co/models?dataset=dataset:bookcorpus

[22] Bandy, Jack, and Nicholas Vincent. “Addressing ‘Documentation Debt’ in Machine Learning: A Retrospective Datasheet for BookCorpus,” November 11, 2021. https://openreview.net/forum?id=Qd_eU1wvJeu, p. 9

[23] Interestingly enough the BookCorpus dataset is no longer as readily available as it was in 2017 or even 2020, and the webpage dedicated to it instead leads to instructions on how one might download material from the Smashwords site in order to reproduce the original dataset for yourself.

[24] You can see them here: https://imgur.com/a/8YRCGsW

[25] See https://waxy.org/2022/11/invasive-diffusion-how-one-unwilling-illustrator-found-herself-turned-into-an-ai-model/

[26] See Sobel, Benjamin L. W. “Artificial Intelligence’s Fair Use Crisis.” The Columbia Journal of Law & the Arts, December 5, 2017, 45-97. https://doi.org/10.7916/JLA.V41I1.2036.

[27] Citations to come…

[28] See for example this post from Sequoia Capital: https://www.sequoiacap.com/article/generative-ai-a-creative-new-world/

[29] See https://twitter.com/alexandr_wang/status/1585660889067290625

[30] https://www.wsj.com/articles/generative-ai-startups-attract-business-customers-investor-funding-11666736176

[31] See the syllabus of the most recent edition of Stanford’s CS324 - Large Language Models at https://stanford-cs324.github.io/winter2022/lectures/legality/, which has a useful précis of the issues, with its legality lecture notes concluding: “the future of copyright and machine learning in light of large language models is very much open.”

[32] Wei et al, “Emergent Capabilities”, p. x

[33] Sobieszek, Adam, and Tadeusz Price. “Playing Games with Ais: The Limits of GPT-3 and Similar Large Language Models.” Minds and Machines 32, no. 2 (2022): 341–64. https://doi.org/10.1007/s11023-022-09602-0

[34] For a good journalistic summary of the issues, see Thomas Claburn, “How GitHub Copilot could steer Microsoft into a copyright storm”, The Register, October 19, 2022, https://www.theregister.com/2022/10/19/github_copilot_copyright/, although it turns out that copyright was dropped from the claim as filed. See https://githubcopilotlitigation.com/pdf/1-0-github_complaint.pdf. See also the discussion in Andres Guadamuz, ‘Copilot: The next Stage in the AI Copyright Wars?’, personal blog, Technollama (blog), 20 October 2022, https://www.technollama.co.uk/copilot-the-next-stage-in-the-ai-copyright-wars

[35] Carlini, Nicholas, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, et al. “Extracting Training Data from Large Language Models,” December 14, 2020. https://doi.org/10.48550/arXiv.2012.07805.

[36] Guadamuz, ‘Copilot...’

[37] According to Andy Baio, image generator software company Stability has funded the collection of the data needed to train Stable Diffusion by the Germany-based LAION non-profit, and the computation time needed to train the model by the Machine Vision & Learning research group at the Ludwig Maximilian University of Munich. Commercialisation of the model is done by Stability, now a billion-dollar company. See Andy Baio, ‘AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability’, personal blog, Waxy (blog), 30 September 2022, https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/.

[38] Andrew Deck, ‘AI-Generated Art Sparks Furious Backlash from Japan’s Anime Community’, Rest of the World, 27 October 2022, https://restofworld.org/2022/ai-backlash-anime-artists/.

[39] The English translation of the Japanese law is from the website of the Copyright Research and Information Centre of Japan, a public-interest corporation authorized by the government.