Robot Judges?
As artificial intelligence automates ever-growing parts of the global economy, questions about the role of AI in the world's judiciaries become increasingly important. Some countries, notably Estonia, China, and Brazil, appear to be welcoming AI's ability to help manage judicial backlogs, while others remain silent on the issue. How can we make sense of this global landscape?
To describe and assess these developments, we need to distinguish between different potential uses of AI in the judiciary. First, AI could help judges with simple administrative tasks, like drafting emails or summarizing texts. England and Wales, for example, allow such uses but caution against anything further. Because this approach requires that humans continue to do all of the legal research and reasoning, the downsides seem minimal as long as judges verify AI outputs. Moreover, the time savings involved, coupled with the low apparent risk, will likely ensure widespread adoption.
Second, judges can outsource some of their actual legal research to AI systems. Anecdotally, this already happens unofficially in the United States. Kevin Newsom, a judge on the Eleventh Circuit, recently reported that he used ChatGPT as one input on the ordinary meaning of the word "landscaping," reasoning that ChatGPT would have a useful perspective given that it is trained on millions of appearances of that word. More worryingly, a recent scandal around error-laden court orders in two U.S. District Courts revealed that law clerks were outsourcing entire portions of their research to LLMs without manually verifying the results. Meanwhile, Brazil, India, and China, citing the need to manage backlogs of low-value but high-volume cases, have developed their own AI platforms to assist with judicial research and suggest how precedent should be applied, although these systems still retain human oversight over all final decisions. Relatedly, countries around the world use Online Dispute Resolution (ODR) systems, which do not render judicial decisions but instead mediate conflicts and suggest settlements.
Third, some jurisdictions may elect to have AI autonomously handle certain kinds of cases in their entirety, although none appear to have done so. Some scholars have suggested using an AI system to sort cases into "easy" and "hard" buckets, and then having an AI judge decide the "easy" cases (with litigants retaining the option of appealing to a human judge). For some time, Estonia appeared to be building something similar to this proposal: "robot judges" that would autonomously decide small-claims disputes valued at less than €7,000, with human judges reviewing final decisions. This system, however, does not seem to have materialized yet, and it is unclear whether it ever will.
That being said, there are several compelling arguments for more widespread adoption of AI in judicial research and decisionmaking. First, even if some "hard" cases require human judgment, AI could likely handle "easy" cases efficiently, which would dramatically reduce backlogs and increase access to justice. Second, AI could reduce human bias. Empirical research has demonstrated that different judges render different decisions in similar cases and, more worryingly, that an individual judge's decisions shift with factors like meal timing and weather. AI would face no such human biases and, indeed, would likely apply precedent much more strictly. Finally, AI might be better suited to discovering the "ordinary meaning" of statutory terms, which would make the application of law more predictable.
But even if some future ideal superintelligence could optimally apply the law, research indicates that current systems fall far short. According to a recent study, LLMs like ChatGPT do not produce stable outputs or definitions; their answers shift with the style of prompting and the broader context of the conversation. Although human judges likely do something similar, the degree of variance in LLM responses is certainly a cause for concern. Moreover, a swath of empirical evidence shows that LLMs, rather than eliminating bias, encode the assumptions in their training data, including harmful stereotypes. For example, LLMs are more likely to recommend convicting defendants who speak African American Vernacular English (AAVE), and COMPAS, the risk-assessment algorithm that many U.S. jurisdictions use to evaluate bail requests, is demonstrably biased against people of color.
It is an open question, then, whether more widespread adoption of AI in judiciaries would be beneficial. Several countries have moved in that direction to combat judicial backlogs, but it remains to be seen how far they will go. What is clear, however, is that jurisdictions should proceed with care; the upsides are large, but so too are the downsides.