Inference Time for Copyright Law
It seems as though the New York Times has developed a sustained interest in suing OpenAI. This long-running case has attracted significant attention, both because of the cultural prominence of the two companies and because it calls for revisiting copyright law at a time when market-dominant companies depend on automated text generation built on publicly available but institutionally owned data.
Copyright’s statutory design was built for a world defined by discrete outputs. The concept of copyright presumes a stable distinction between protected and unprotected works, and treats infringement as something that can be detected through comparison. Large language models, however, complicate the task of detecting copying. They are trained on massive volumes of text, which they compress into statistical representations, and their outputs arise from probabilistic transformations rather than direct duplication. This makes reproduction harder to identify, because those transformations rarely echo copyrighted material verbatim. Given these complexities, courts have not yet articulated a framework for determining how such training fits within the Copyright Act or whether it can be squared with fair use, and leading cases such as Authors Guild v. Google do not clearly resolve the issue.
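To see why statistical training frustrates side-by-side comparison, consider a deliberately minimal sketch: a word-level bigram model, far simpler than any production LLM, but analogous in one structural respect, namely that training yields co-occurrence statistics rather than a filed-away copy of the text. The corpus and function names below are invented for illustration.

```python
import random
from collections import defaultdict

# Tiny stand-in for a training corpus (invented text, illustrative only).
corpus = (
    "the court held that the copying was transformative and "
    "the court found that the snippets served a different purpose"
).split()

# "Training": record which word follows which. The model retains
# aggregate statistics about the text, not the text itself in sequence.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start: str, length: int = 10) -> str:
    """Sample a continuation one word at a time from the learned statistics."""
    words = [start]
    for _ in range(length):
        followers = transitions.get(words[-1])
        if not followers:
            break
        words.append(random.choice(followers))  # probabilistic, not lookup
    return " ".join(words)

print(generate("the"))
# Possible output: "the court found that the copying was transformative
# and the court": a recombination of training patterns, not a retrieved passage.
```

Scaled up by many orders of magnitude, this is the shape of the problem courts face: whatever was copied during training is not straightforwardly present in the resulting artifact, so the familiar exercise of comparing an allegedly infringing copy against the original has no obvious object.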
In Authors Guild v. Google, the plaintiffs attempted to force a novel computational process back into the older doctrinal structure regulating copyright. The case involved Google’s creation of a searchable index of millions of scanned books, which required copying entire works but exposed only short text snippets to users. The court ultimately held that this transformative use fell within fair use. The Times’s complaint echoes the Authors Guild plaintiffs’: it frames training as direct reproduction and characterizes certain outputs as unauthorized derivatives of Times journalism. This framing marks a decisive shift in the Times’s treatment of technology companies. For years, the organization relied on negotiated licensing and partnerships as its primary tools for governing technology companies’ use of its journalism. The lawsuit thus signals a turn toward litigation as a strategy for protecting institutional relevance and preserving the economic viability of professional reporting, as the rise of generative AI weakens the effectiveness of those cooperative arrangements. Models can extract informational value from Times reporting without generating traffic or licensing revenue, undermining the financial link between journalistic labor and the markets that support it.
OpenAI’s response rests on a fundamentally different theory of data use. The company argues that model training is analogous to reading, research, and indexing, all of which have long been understood as lawful uses of publicly available information. It insists that internal model weights are not stored copies and that the Times is effectively seeking a novel right to control facts, ideas, and statistical inferences drawn from its journalism, none of which copyright has previously protected. OpenAI further contends that the verbatim excerpts highlighted in the complaint were elicited through adversarial prompting rather than ordinary model behavior, framing those examples not as evidence of systemic infringement but as artifacts of manipulation.
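OpenAI’s adversarial-prompting argument can likewise be made concrete with the same kind of toy model. The sketch below (again with an invented corpus; nothing here describes how the Times actually prompted GPT models) shows that even a model storing only counts can be steered into verbatim regurgitation when the prompt targets a distinctive passage and decoding is made deterministic.

```python
from collections import Counter, defaultdict

# Invented corpus containing one distinctive passage (illustrative only).
corpus = (
    "exclusive investigation reveals hidden costs of the program "
    "officials say the program works as intended"
).split()

# Same statistical "training" as before, now with explicit frequencies.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def greedy(start: str, length: int = 7) -> str:
    """Deterministic decoding: always take the most frequent next word."""
    words = [start]
    for _ in range(length):
        followers = counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# A targeted prompt plus deterministic decoding walks straight down the
# unique statistical path left by the distinctive passage:
print(greedy("exclusive"))
# -> "exclusive investigation reveals hidden costs of the program"
```

The asymmetry is the point of the illustration: ordinary sampled generation recombines, while a sufficiently targeted prompt can reliably elicit a training passage verbatim. That distinction between typical behavior and engineered output is the one OpenAI asks the court to treat as legally significant.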
The mere existence of this dispute exposes a structural issue in American information governance. Although Congress has acknowledged that the current statute provides little guidance for assessing AI training, it has not updated the Copyright Act to address machine learning, and neither the Copyright Office nor the judiciary has developed a coherent framework for evaluating AI training practices at scale. In this vacuum, private actors end up defining the boundaries of lawful data use through litigation. This shifts practical rulemaking power to those with the resources to sue and produces ad hoc standards shaped by litigants’ interests, leaving courts to improvise de facto copyright rules case by case.
The Times’s case is the most visible recent attempt to import longstanding copyright theory into the AI domain, but it also highlights that theory’s limits. The statute was not designed to regulate, and could not have anticipated, learning systems that extract patterns rather than merely store copies.
Beyond the statute itself, the case tests how far the Copyright Clause can be stretched to govern forms of information processing that do not fit traditional models of reproduction. Asking courts to resolve these novel questions raises concerns of institutional competence, since expanding copyright doctrine to cover machine learning risks generating de facto constitutional rules without congressional guidance. The stakes extend far beyond the litigants, shaping the legal boundaries for news organizations, AI developers, researchers, and platforms as they navigate uncertain rules governing data use and model training.
Courts may attempt to adapt existing doctrine to accommodate learning systems, or they may signal that Congress must establish a new statutory framework tailored to AI. Either way, this case reveals that the existing copyright regime cannot sustain modern technologies. Judicial improvisation will produce only legal uncertainty and ad hoc policy, leaving fundamental questions of information governance to whichever party can afford the most expensive lawyers. Congress must provide a statutory framework designed for systems that learn rather than copy, or resign itself to governing, through case-by-case improvisation, technologies that are here to stay.