Publishers, authors: has your book been used to train an AI model without your permission?
Here’s one place you can look. https://github.com/psmedia/Books3Info
I’ve placed on GitHub a list of some 85,600 ISBNs extracted from the Books3 dataset that was used to train Facebook’s latest model, LLaMA, launched Feb 24th. Other big companies that have used Books3 to train models include Microsoft and Nvidia. We can’t know for certain whether OpenAI used Books3 to train the GPT family, because OpenAI refuses to share any details of the data it has used for training. However, Books3 is very similar in size to the Books2 dataset that OpenAI used but will not describe or share, and ChatGPT was pretty good at reproducing text from the first Harry Potter novel…
Some background on Books3. In their paper announcing LLaMA, Facebook described it as “a publicly available dataset for training large language models.” Well, it sure is publicly available, for download here or from AI community HuggingFace. But that description elides the fact that this is a cache of (mostly) pirated ebooks, “all of bibliotik”, described in torrent directories as **“the largest private torrenting site for downloading ebooks”.** Someone (according to HuggingFace, an AI software engineer named Shawn Presser) had to download the bibliotik cache. Then he (or someone) had to do a great deal of processing to turn that cache of epub files into text that could be fed to an LLM for training, stripped of all its HTML tags, etc.
I haven’t spent much time with the data, but I can tell you there are 65 converted epub files that have Harry Potter in the title. Not all the ISBNs are from the US or UK: there is a healthy sprinkling of ISBNs from places like Denmark, Japan and Singapore, and many of the books are in languages other than English.
You are welcome to download the ISBN list as a text file from the GitHub repository above, and search it for your own ISBNs.
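The list mixes hyphenated, space-separated and unpunctuated ISBNs, so a plain text search can miss a match. Here is a minimal Python sketch of a format-insensitive search. It assumes you have saved the list locally with one ISBN per line; the file name `isbns.txt` and the function names are my own placeholders, not anything defined in the repository.

```python
import re
from pathlib import Path


def normalize_isbn(raw):
    """Drop hyphens, spaces, bullets and other punctuation; keep digits and 'X'."""
    return re.sub(r"[^0-9Xx]", "", raw).upper()


def load_isbns(lines):
    """Build a set of normalized ISBNs from an iterable of text lines."""
    normalized = (normalize_isbn(line) for line in lines)
    return {isbn for isbn in normalized if isbn}


def find_matches(my_isbns, listed):
    """Return the entries of my_isbns that appear in the listed set, in input order."""
    return [isbn for isbn in my_isbns if normalize_isbn(isbn) in listed]


if __name__ == "__main__":
    # Assumption: the downloaded list is saved as isbns.txt, one ISBN per line.
    path = Path("isbns.txt")
    if path.exists():
        listed = load_isbns(path.read_text(encoding="utf-8").splitlines())
        mine = ["978-1-4766-0562-3"]  # replace with your own ISBNs
        for hit in find_matches(mine, listed):
            print("Found in the Books3 list:", hit)
```

Normalizing both sides means that `978 17803 1000 8` in the list and `9781780310008` in your own records compare as equal.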
For more information on the ambiguous legal status of generative AI, see my newsletter at https://aicopyright.substack.com.
The first 100 ISBNs, just to give you a flavour:
- 978 17803 1000 8
- 978-1-4766-0562-3
- 978-1-60486-090-0
- 978-0-545-76594-7
- 978-2-7024-3455-0
- 978-0-374-10021-6
- 978-0-545-76593-0
- 978-0-545-76600-5
- 978-0-385-37033-2
- 9780596002985
- 9780596100797
- 978-1-60868-092-4
- 978-0-545-77780-3
- 978-0-545-84800-8
- 978-0-394-83013-1
- 978-0-231-52800-9
- 9781616200640
- 978-1904859-66-6
- 978-1-568-58646-5
- 978-84-376-3225-4
- 978-1-62157-274-9
- 978-958-9007-81-5
- 978-0-307-83252-8
- 978-9974-955-16-5
- 978-0-345-80739-7
- 978-84-92567-43-0
- 978-607-11-3450-9
- 978 1 78087 036 6
- 978-607-16-3073-5
- 978-987-04-3655-3
- 978-0-307-83286-3
- 978-1-101-14952-2
- 978-84-339-3452-9
- 978-607-317-219-6
- 978-1-60255-182-4
- 978 1 78001 229 2
- 978-0-06-227312-3
- 978-1-68331-834-7
- 978-1-5445-0355-4
- 978-0-06-269623-6
- 978-1-368-01485-4
- 978-1-938231-44-5
- 9780007594580
- 978-1-63409-975-2
- 9780731407002
- 978-0-84875-626-0
- 978-1-4412-6439-8
- 9781250011459
- 9780316361354
- 978-0-544-30318-8
- 978-1-59921-822-9
- 978-1-76014-318-3
- 978-1-5040-1257-7
- 9780006476009
- 978-1-5040-1258-4
- 978-0-19-273584-3
- 978-0-14-192658-2
- 9782072757495
- 978-0-307-27907-1
- 9781855752504
- 978-1-4597-3034-2
- 978-1-4592-1389-0
- 978-0-7582-8465-5
- 978-1-4081-4432-9
- 978-1-61147-000-0
- 9781409092414
- 978-1-59229-374-2
- 9781409092421
- 9781101560419
- 978-1-101-56891-0
- 978-1-937007-72-0
- 978-0-698-16507-6
- 9780698165083
- 9781101988534
- 978-0-441-01615-0
- 978-0-914671-25-1
- 978-1-4976-8940-4
- 978-1-4411-4203-0
- 978-1-4411-5086-8
- 978-1-4411-1086-2
- 978-0-231-51068-4
- 978-0-7653-3312-4
- 978-0-7653-7940-5
- 978-0-7653-3574-6
- 978-0-7653-7940-5
- 978-0-7653-7942-9
- 978-0-7653-9588-7
- 978-1-940427-09-6
- 978-0-345-53148-3
- 978-1-943051-90-8
- 978-0-06-213596-4
- 9780385539203
- 9781101964958
- 9781101964965
- 978-0-252-04088-7
- 978-1-60807-869-1
- 978-1-101-55238-4
- 9781466877771
- 9781101562659
- 978-1-101-66226-7
- 978-0-698-16426-0
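
The entries above are all 13-digit ISBNs, but you may know an older edition only by its 10-digit ISBN. As a rough sketch (the function names are illustrative, not part of the repository), the standard ISBN-13 check-digit rule can convert a 10-digit ISBN before you search, and can flag entries that were mangled during extraction:

```python
DIGITS = "0123456789"


def isbn13_check_digit(first12):
    """Check digit for the first 12 digits: weights alternate 1, 3, 1, 3, ..."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def is_valid_isbn13(raw):
    """True if raw holds exactly 13 digits and the last one is the correct check digit."""
    digits = "".join(ch for ch in raw if ch in DIGITS)
    return len(digits) == 13 and isbn13_check_digit(digits[:12]) == digits[12]


def isbn10_to_isbn13(raw):
    """Convert an ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit.

    Note: this sketch does not verify the ISBN-10's own check character.
    """
    core = "".join(ch for ch in raw if ch in DIGITS or ch in "Xx")
    if len(core) != 10:
        raise ValueError("expected 10 ISBN characters, got %d" % len(core))
    first12 = "978" + core[:9]
    return first12 + isbn13_check_digit(first12)


print(is_valid_isbn13("978-0-545-76594-7"))  # True
print(isbn10_to_isbn13("0-596-00298-X"))     # 9780596002985
```

A failed check does not mean a book is absent from the dataset, only that the ISBN string itself is probably damaged.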