Extracting memorized pieces of books from open-weight language models

by fzliu on 6/16/2025, 5:41 PM with 105 comments

by ai_legal_sus on 6/20/2025, 1:34 AM

I feel like role-playing as a lawyer. I'm curious: how would you defend against this in court?

I don't think anyone denies that frontier models were trained on copyrighted material - it's well documented and public knowledge. (Fair use and acquisition are separate legal questions.)

I also don't think anyone denies that a model that strongly fits the training data approximates the copy-paste function (or, at the very least, behaves like one: given input A, it consistently produces B).

In practice, training resembles lossy compression of the data. Technically one could frame an LLM as a database of compressed training inputs.
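
For intuition, here is a minimal sketch of what that extraction looks like against an open-weight model (the model choice, prompt, and decoding settings are illustrative assumptions, not the paper's setup):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative model choice; any open-weight causal LM works the same way.
    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Prompt with a prefix from a well-known book and decode greedily.
    prefix = "It was a bright cold day in April, and the clocks were"
    ids = tok(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=30, do_sample=False)
    # If the continuation matches the source verbatim, that span was
    # effectively stored in (and decompressed from) the weights.
    print(tok.decode(out[0, ids.shape[1]:]))

Greedy decoding is the conservative choice here: if the verbatim text comes out with no sampling luck involved, the "compressed database" framing is hard to dismiss.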

This paper argues and demonstrates that "extraction is evidence of memorization", which affirms the above.

In terms of LLM output (the valuable product customers are paying for), this is familiar, albeit grey, legal territory.

https://en.wikipedia.org/wiki/Substantial_similarity

When a customer pays for an AI service, they're paying for access to a database of compressed training data - the additional layers of indirection sometimes produce novel output, and often do not.

Unless you advocate discarding the whole regime of intellectual property, or can argue for a better model of IP law, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their output meets the "substantial similarity" test against said copyrighted works? Why should such violations be immune from legal action?

by Animats on 6/20/2025, 2:51 AM

Can they generate a list of books for which at least, say, 10% of the text can be recovered from the weights? Is this a generic problem, or is there just so much fan material around the Harry Potter books that their weight in the training data is exaggerated?
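
One naive way to measure that (a sketch only, with arbitrary window and stride sizes; the paper's actual protocol may differ):

    import torch

    def fraction_recovered(model, tok, book_text,
                           prefix_len=50, target_len=50, stride=500):
        """Slide over the book: prompt with prefix_len tokens, decode
        target_len tokens greedily, count windows reproduced verbatim."""
        ids = tok(book_text, return_tensors="pt").input_ids[0]
        hits = total = 0
        for start in range(0, len(ids) - prefix_len - target_len, stride):
            prefix = ids[start:start + prefix_len].unsqueeze(0)
            target = ids[start + prefix_len:start + prefix_len + target_len]
            with torch.no_grad():
                out = model.generate(prefix, max_new_tokens=target_len,
                                     do_sample=False)
            hits += int(torch.equal(out[0, prefix_len:], target))
            total += 1
        return hits / max(total, 1)

Run that over a corpus of books and you'd have the requested list; whether Harry Potter stands out against less-quoted titles would answer the fan-material question directly.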

by Huxley1 on 6/20/2025, 7:16 AM

I think this is somewhat like how we memorize what we read, but the model isn't doing pure rote memorization; it's more like compressing and recombining content. The copyright issue is definitely complicated, and I'm curious how the law will adapt to these technologies in the future.

by iamleppert on 6/20/2025, 4:27 PM

The tech companies have consolidated so much power, and are so invested in AI, that none of this really matters. If there is any defense, even an illogical or contrived one that can reasonably be expected to play out, expect that defense to win as the final outcome of a protracted legal battle. The law at its highest levels is less about interpreting black-and-white rules (as many people think it is) and more about the biases and motivations of those doing the interpreting.

by andy99 on 6/19/2025, 11:44 PM

There are two legitimate points where copyright violation can occur with LLMs. (I'm not arguing the merits of copyright, just working from the concept as it stands.)

One is when copyrighted material is "pirated" for use in training, i.e. you torrent "the Pile" instead of paying to acquire the books.

The other is when someone uses an LLM to generate a work that violates copyright.

Training itself isn't a violation; that's common sense. I am aware of lots of copyrighted things, and I could generate a work that violates copyright, but my knowing this in and of itself isn't a violation.

The fact that an LLM agrees to help someone violate copyright is a failure mode, on par with telling them how to make meth or whatever other things their creators don't want them doing. There's a good argument for hardening them against requests to generate copyrighted content, and this already happens.
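
A crude sketch of what that hardening could look like at the output layer - an n-gram overlap check against an index of protected texts (the function names, n=8, and the 20% threshold are all made-up illustrations, not how any vendor actually does it):

    def build_ngram_index(protected_texts, n=8):
        """Index every n-word shingle of the protected corpus."""
        index = set()
        for text in protected_texts:
            words = text.split()
            for i in range(len(words) - n + 1):
                index.add(tuple(words[i:i + n]))
        return index

    def looks_like_verbatim_copy(output, index, n=8, threshold=0.2):
        """Flag output whose n-grams overlap the index too heavily."""
        words = output.split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        if not grams:
            return False
        return sum(g in index for g in grams) / len(grams) >= threshold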

by jrm4 on 6/19/2025, 9:31 PM

And hopefully this puts to rest all the painfully bad, often anthropomorphizing takes about how what LLMs do isn't copyright infringement.

It's simple. If you put the works into the LLM, it can later make immediately identifiable, if imperfect, copies of the work. If you didn't put the work in, it wouldn't be able to do that.

The fact that you can't "see the copy" inside is wildly irrelevant.

by landl0rd on 6/20/2025, 12:30 AM

Important note: they likely don't “memorize” Harry Potter and 1984 almost completely - it's no coincidence that some of the most popular, most-quoted books are the ones “memorized”. What they're actually memorizing is likely fair-use quotes from the books, at least mostly, which makes those passages some of the most represented in the training set.

by w10-1 on 6/20/2025, 5:37 AM

This approach is misguided, as are most applications of copyright to AI.

Copyright violations are a form of stealing, like conversion or misappropriation, where limited rights granted are later expanded.

The "substantial similarity" test is just a way courts have evolved to see if there was copying, and if it was important -- in the context of human beings. But because it doesn't really matter if people make personal copies, and because you have to quote something to criticize it, and because some art is like other art -- because that level of stealing is normal -- copyright built a bunch of exceptions.

But IMHO there is no doubt that, though a book grants the right to read it for enjoyment, the right to process the text for recall or replication by automated means is not included in any sale of any copy -- regardless of whether one can trigger output that meets a substantial-similarity test.

"All Rights Reserved"

I understand case law and statutes state nothing like this, and that prior law does more to obscure than clarify the issue. But that's the take from first principles.