Market Dilution: Extending Fair Use for Artificial Intelligence

Questions of when the use of a copyrighted work is acceptable (a “fair use”) are questions of borderlines: Where do creativity and transformation end, and where do repetition and monopolization begin? In the context of large language models (LLMs), the use of vast datasets of copyrighted online works for training has come under recent legal scrutiny. Two recent cases from the Northern District of California—Bartz v. Anthropic PBC (2025) (“Bartz”) [1] and Kadrey v. Meta Platforms, Inc. (2025) (“Kadrey”) [2]—have shed some light on the issue, though disagreeing on a critical factor: Although both courts agreed that, of the four factors that determine copyright fair use, the fourth—the effect on the potential market—plays the largest role in LLM-training cases, they diverged on the factor’s application. This article argues that courts should interpret the fourth factor of fair use in LLM training cases through the Kadrey court’s market dilution framework rather than through the Bartz court’s human learning analogy.

Article I, Section 8, Clause 8 of the Constitution (the so-called Copyright Clause) forms the foundation of U.S. copyright law; its laconic text, however, has invited centuries of judicial interpretation. The first major fair use opinion, Justice Story’s 1841 circuit court decision in Folsom v. Marsh, outlined the considerations that became the four factors “to be explored, and the results weighed together” in fair use cases: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion taken, and (4) the effect of the use upon the potential market [3]. These factors were later enshrined into law by the 1976 Copyright Act, the most recent general revision of copyright law by Congress [4]. Though now statutory, the four factors’ precise application continues to evolve with the courts’ case-by-case discretion, allowing them, to quote the late Justice O’Connor, to “avoid rigid application of the copyright statute when, on occasion, it would stifle the very creativity which that law is designed to foster” [5]. Fair use, then, can be granted when copyright might otherwise stifle creativity or the incentives in a creative sector [6].

At first blush, fitting LLM training into a copyright fair-use analysis may seem unreasonable: After all, language models seem to only predict and generate text, not copy it for profit. However, this interpretation fails to consider that LLMs train by ingesting and internalizing as many works as possible [7]. Put simply, more data makes better models [8]. Books and other such verifiably well-written copyrighted content make for especially good training data, as their length and coherence allow LLMs to learn larger patterns [9]. Of course, if these works were all properly licensed, such training would not violate copyright law. The issue arises when developers train models on works taken from “shadow libraries” of unlicensed literature or scraped directly from the web, using copyrighted works without permission [10]. As LLMs have grown larger, companies have increasingly turned to such “shadow libraries,” pushing the envelope of fair use jurisprudence. A careful analysis of such activity vis-à-vis the four factors of fair use, then, becomes of great importance.

Regarding the first factor—the purpose and character of the use—and third factor—the amount and substantiality of the portion used—of copyright fair use, the Bartz and Kadrey courts largely agreed: As LLMs can’t directly reproduce more than a few fragments of a training work, their function was considered sufficiently transformative in character under the first factor [11], and, given the highly transformative nature of LLM training, the amount copied was held reasonable under the third [12]. Specifically, both the Bartz and Kadrey courts followed the Supreme Court precedent holding that the first factor can mitigate the third: the third factor “will generally weigh in favor of fair use where … the amount of copying was tethered to a valid, and transformative, purpose [emphasis added]” [13]. In short, the first and third factors favor LLM training as fair use.

The courts’ analysis of the second factor was more nuanced, but ultimately insignificant. Although the creative nature of the works in question generally weighs against fair use [14], their published nature (and the writers’ having exercised their right to publish) diminishes the second factor’s weight in fair use analysis [15]. Given the factor’s general unimportance (its effects are largely overshadowed by the other three factors, and “[t]he second factor has rarely played a significant role in the determination of a fair use dispute” [16]), its role in LLM training cases is marginal at best.

A fair use verdict, then, hinges on the fourth factor. The fourth factor has been called “undoubtedly the single most important element of fair use” because any determination of whether a use stifles incentives to create—the key question in copyright cases—largely depends on the effects it has on the creative market [17]. Recent caselaw has reinvigorated the fourth factor, showing a “renewed sensitivity to market substitution” [18]. Market substitution, in this context, refers to whether the derivative work “usurps the market of the original work” [19]. For instance, if a teacher incorporates a photocopy of a journal article into their teaching notes, demand for licensed copies of the original journal article still exists, and thus the teacher’s use was not market substitution. Derivative works, to the extent that they diminish the original copyright holder’s ability to license or otherwise sell their work, can function as market substitutes. The plaintiffs in the Bartz and Kadrey cases advanced three theories for market substitution, namely that LLMs would harm the licensing market for the copyrighted works, that they would create works identical (or nearly identical) to the training material, and that they would flood the market with competing works [20]. These rationales will be discussed below.

Permitting LLMs to train on unlicensed works could harm the licensing market for that purpose; that is, the ability of writers to license their books as training data for a profit. Both courts recognized that a market of this kind is not inherently protected under the Copyright Act, as such protection would make the analysis “circular and favor[] the rightsholder in every case” [21]. After all, any unlicensed use diminishes the possibility of licensing for that use. If one does not have a right to license a work for a given use, one cannot claim that one’s rights are infringed because another is making it harder to license the work for that very use; that is, as scholars have observed, “we cannot know the market effect until we first decide whether there is a market to be affected [emphasis added]” [22]. Courts, therefore, do not need to consider whether such a market actually exists, is likely to develop, or is likely to be significant in value.

If LLMs output identical replicas of their training text, people could read the output instead of the actual work, thus harming the creative market [23]. This fear, however, is largely unfounded. As LLMs can output “almost any string with the right prompt” [24], merely reproducing training text with “adversarial” prompting does not necessarily show that the model tends to reproduce its training data [25]. In addition, experts in Kadrey were unable to reproduce more than fifty words from the training works, a minuscule portion of novels that often run to 80,000 words [26]. Reproduction of such a small portion of text in such a specific circumstance, both courts concluded, did not harm the creative market.

The plaintiffs’ final rationale, that LLMs could flood the market with competing works, is the source of the split between Kadrey and Bartz. The Kadrey court distinguishes this case from other fourth-factor cases through a framework of “market dilution,” in which, instead of directly copying another’s work and thereby causing market harm, LLM training enables countless outputs that compete with the original works, diluting their overall market share [27]. The critical factor is scale: though individual LLM outputs themselves may not infringe, automated LLM ghostwriting can generate related content at such a scale that the originals would be drowned out [28].

The Bartz court, however, dismisses this logic out of hand, instead analogizing LLM training to “training schoolchildren to write well” [29]. This analogy allowed the Bartz court to undermine the Kadrey framework in two ways. First, under such a metaphor, market dilution is not a cognizable harm because teaching the skills for further creative expression “is not the kind of competitive or creative displacement that concerns the Copyright Act” [30]. Second, the court asserted that the market dilution framework protects authors as a class against competition from LLMs. Since copyright law is designed to protect the expression of individual authors, not the entire class of human authors, the court reasoned that adopting the market dilution framework would fall outside copyright’s purview and only stifle the novel creative expression of new technologies [31]. These claims, however, do not hold against the Kadrey framework.

Regarding the first claim, although teaching schoolchildren to write is indeed not cognizable under the Copyright Act, LLM training “is not even remotely like” teaching schoolchildren to write [32]. Teaching is a time-consuming, intentional, and interpersonal process whereby children develop understandings of what constitutes good writing and how to express their original ideas in such a manner later in life. LLM training, in stark contrast, teaches an algorithm to mimic as closely as possible the style and expression of the training data, generating “countless competing works with a miniscule fraction of the time and creativity it would otherwise take” [33]. In a slogan, teaching children allows them to express their ideas; LLM training allows the LLM to regurgitate others’ ideas.

Regarding the second claim, market dilution does not broadly protect writers as a class; the determination of market dilution itself is highly fact-specific, so “not all copyrighted works would have their markets diluted equally” [34]. Determining whether an author’s work has suffered market harm is, as in all fair use cases, a case-by-case analysis. Though the Kadrey court comments on whether certain types of works are more likely to be protectable under its market dilution framework, these comments do not translate into broad protection of authors generally. As with the “nature of the work” distinctions drawn under the second factor, certain qualities of a work can render its market more or less susceptible to dilution [35].

The Bartz court offered another objection against the market dilution framework: Since LLM training and LLM outputs are sufficiently removed from each other, the court argued, the market dilution caused by the (non-infringing) outputs would be legitimate downstream competition, rather than a direct violation of the copyrights in the training works themselves. This analysis misjudges the distance between training and output; the two are inextricable. Recent advances have allowed researchers to trace the outputs of LLMs (even those with billions of parameters and trillions of training tokens) back to specific training snippets, without which the model would have generated another string of text [36]. With every work of a particular style it ingests, the trained LLM becomes better at crowding out works in that style. One analysis of posts on the popular blogging platform Medium, for instance, found that “over 47 percent were likely AI-generated” [37]. The stylistic differences between LLMs trained with and without copyrighted content are large. OpenAI has told a House of Lords committee in the United Kingdom that “[l]imiting training data to public domain books … would not provide AI systems that meet the needs of today’s citizens,” admitting that copyrighted works not yet in the public domain are essential to modern LLMs’ outputs [38]. Even if the Bartz court were right that LLMs are only downstream competition, to quote the Kadrey court: “indirect substitution is still substitution” [39]. Such behavior still causes market harm to the works in question.

Another possible counterargument to the Kadrey framework comes from the Supreme Court’s Campbell v. Acuff-Rose Music (1994) decision [40], which implied that if a use is transformative enough, this alone drowns out the fourth factor’s commercial considerations [41]. Despite this seemingly persuasive language, such first-factor supremacy runs counter to copyright’s goal “to promote the progress of science and the arts, without diminishing the incentive to create [emphasis added]” [42]. A highly transformative use that nonetheless hinders creation, then, cannot be considered fair use. Analysis must be case-by-case, fact-dependent, and attentive to “significant changes in technology” [43]. While emphasizing the first factor may streamline cases in which only a small body of secondary work may infringe (such as in Cariou v. Prince (2013) [44]), the rationale is a heuristic rather than a bright-line rule (of which fair use has, famously, none [45]) and fails to consider the magnitude of market dilution specifically for LLM outputs. Indeed, Campbell itself left room for cases where transformation and market harm coexist (e.g., a transformative parody that nonetheless directly substitutes for the original [46]). Reading a binary tradeoff between the first and fourth factors into such a decision misstates the nuances of fair use.

LLM training underscores the core tension of copyright: balancing technological innovation with creative progress. Novel harms require novel doctrines, and Kadrey’s market dilution framework best acknowledges the unprecedented nature of LLM outputs vis-à-vis market harms. Acceptance and continued application of this framework will lay the groundwork for future cases to further refine its boundaries, allowing fair use to keep pace with new technologies. The Bartz rationale abstracts away the algorithmic, industrial magnitude of LLM training and outputs, misdiagnosing them as merely artificial versions of human intelligence-gathering, rather than fundamentally distinct processes. As fair use cases continue to bombard the courts, this abstraction would only impede creative incentives in the long term [47]. Only by recognizing market dilution as a cognizable market harm under the fourth factor can copyright law retain the flexibility needed to deal with emerging technologies like generative LLMs.

 

Footnotes

[1] Bartz et al. v. Anthropic PBC, No. C 24-05417 WHA (N.D. Cal. June 23, 2025) (settled).

[2] Kadrey et al. v. Meta Platforms, Inc., No. 23-CV-03417 VC (N.D. Cal. June 25, 2025) (pending).

[3] Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994) at 578. The Court has repeatedly held this to be the case.

[4] 17 U.S.C. § 107.

[5] Stewart v. Abend, 495 U.S. 207 (1990) at 236, quoting Justice Sandra Day O’Connor.

[6] Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023), at 15.

[7] Bartz, No. C 24-05417 WHA at 11. “The LLMs ‘memorize[d] A LOT, like A LOT.’”

[8] Pablo Villalobos, Anson Ho, Jamie Sevilla, et al., “Position: Will we run out of data? Limits of LLM scaling based on human-generated data,” Proceedings of Machine Learning Research 235 (2024): 49523-49544, available from https://proceedings.mlr.press/v235/villalobos24a.html.

[9] Bartz, No. C 24-05417 WHA at 6.

[10] Shadow libraries are unauthorized, often online collections of copyrighted works from which users can download such content. See generally Joe Karaganis, Shadow Libraries: Access to Knowledge in Global Higher Education (Cambridge, MA: MIT Press: International Development Research Centre, 2018).

[11] Bartz, No. C 24-05417 WHA at 11. The Bartz court noted that “using works to train LLMs was transformative—spectacularly so.” Kadrey, No. 23-CV-03417 VC at 16. “There is no serious question that Meta’s use of the plaintiffs’ books had a ‘further purpose’ and ‘different character’ than the books—that it was highly transformative.”

[12] Bartz, No. C 24-05417 WHA at 25. Judge Alsup puts the issue particularly clearly:

“Was all this copying reasonably necessary to the transformative use?

Yes.”

Kadrey, No. 23-CV-03417 VC at 25. “The amount that Meta copied was reasonable given its relationship to Meta’s transformative purpose.”

[13] Google LLC v. Oracle America, Inc., 593 U.S. 1 (2021). Hereafter referred to as Oracle.

[14] Bartz, No. C 24-05417 WHA at 24; Harper & Row, Publishers, Inc. v. Nation Enterprises, 471 U.S. 539 (1985), at 564-565.

[15] VHT, Inc. v. Zillow Group, Inc., et al., No. 22-35147 (9th Cir. 2023). A key consideration for copyright law is protecting authors’ right to publish; if an author has already published a work (hence the published nature of the work), copyright has less obligation to further protect that work.

[16] Authors Guild v. Google, Inc., 804 F.3d 206 (2d Cir. 2015) at 220.

[17] Harper & Row, 471 U.S. at 566.

[18] Jane C. Ginsburg, “Fair Use Factor Four Revisited: Valuing the ‘Value of the Copyrighted Work.’” Journal, Copyright Society of the USA 67 (2020): 35.

[19] NXIVM Corp. v. Ross Inst., 364 F.3d 471 (2d Cir. 2004).

[20] Campbell, at 593. “the only harm to derivatives that need concern us, as discussed above, is the harm of market substitution [emphasis added].”

[21] Kadrey, No. 23-CV-03417 VC at 28.

[22] James Gibson, “Risk Aversion and Rights Accretion in Intellectual Property Law,” The Yale Law Journal 116, no. 5 (March 1, 2007): 896, https://doi.org/10.2307/20455747.

[23] Ibid.

[24] John X. Morris, Chawin Sitawarin, Chuan Guo, et al., “How Much Do Language Models Memorize?” arXiv, May 30, 2025, arXiv:2505.24832 at 2.

[25] Kadrey, No. 23-CV-03417 VC at 27.

[26] Ibid.

[27] Matthew Sag, “Fairness and Fair Use in Generative AI,” Fordham Law Review 92, no. 5 (2024): 1919; Kadrey, No. 23-CV-03417 VC at 28.

[28] Kadrey, No. 23-CV-03417 VC at 31-32. Notably, this new doctrine doesn’t affect “traditional” fair use cases, where courts compare small volumes of original works against secondary uses; “it’s unlikely that harm from market dilution would be significant enough to matter.”

[29] Bartz, No. C 24-05417 WHA at 28.

[30] Bartz, No. C 24-05417 WHA at 28.

[31] Bartz, No. C 24-05417 WHA at 28; Nichols v. Universal Pictures Corp., 45 F.2d 119 (2d Cir. 1930). Hand’s abstraction test distinguishes authors (the class) vs an individual author: “…there is a point in this series of abstractions where they are no longer protected, since otherwise the playwright could prevent the use of his ‘ideas,’ to which, apart from their expression, his property is never extended.”

[32] Kadrey, No. 23-CV-03417 VC at 3.

[33] Ibid.

[34] Ibid., at 28.

[35] Harper & Row, 471 U.S. at 564-565. Noting, for example, the factual-creative distinction: “The law generally recognizes a greater need to disseminate factual works than works of fiction or fantasy.” Noting also the published-unpublished distinction: “…the scope of fair use is narrower with respect to unpublished works.”

[36] Jiacheng Liu, Taylor Blanton, Yanai Elazar, et al., “OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens,” arXiv, April 9, 2025, arXiv:2504.07096.

[37] Kate Knibbs, “AI Slop Is Flooding Medium,” Wired, October 28, 2024, https://www.wired.com/story/ai-generated-medium-posts-content-moderation/.

[38] OpenAI, Evidence to the House of Lords Communications and Digital Select Committee, HL LLM0113 (2023) at 4, https://committees.parliament.uk/writtenevidence/126981/pdf.

[39] Kadrey, No. 23-CV-03417 VC at 31.

[40] Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994).

[41] Campbell, at 569. “The more transformative the new work, the less will be the significance of other factors, like commercialism, that may weigh against a finding of fair use.”

[42] Andy Warhol, 598 U.S. at 18.

[43] Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984) at 430.

[44] Cariou v. Prince, 714 F.3d 694 (2d Cir. 2013).

[45] Kadrey, No. 23-CV-03417 VC at 3.

[46] Campbell, at 569. See note 41. The phrasing “less will be the significance” can’t impute such a tradeoff. See also Campbell, at 569. “The four statutory factors are to be explored and weighed together in light of copyright’s purpose of promoting science and the arts [emphasis added].”

[47] Edward Lee, “Updated Map of US Copyright Suits v. AI Companies (Oct. 19, 2025),” ChatGPT Is Eating the World, October 19, 2025, https://chatgptiseatingtheworld.com/2025/10/19/updated-map-of-us-copyright-suits-v-ai-companies-oct-19-2025/.
