

I think it’s critically important to be very specific about what LLMs are “able to do” vs what they tend to do in practice.
The argument is that the initial training data is sufficiently altered and “transformed” that the model doesn’t infringe copyright. If the model is capable of reproducing the majority of the book unaltered, then we know that is not the case. Whether or not it’s easy to access is irrelevant. The fact that the people performing the study had to “jailbreak” the models to get past checks tells you that the model’s creators are well aware that it is capable of producing an un-transformed version of the copyrighted work.
From the end-user’s perspective, if the model is sufficiently gated from distributing copyrighted works, it doesn’t matter what it’s inherently capable of. But the argument shouldn’t be “the model isn’t breaking the law”; it should be “we have a staff of people working around the clock to make sure the model doesn’t try to break the law.”


Interesting. Didn’t know about the Google Books case. I agree that it applies here.