Quantitative Methods in the Survey of Predictability in Bench Trials

In your average legal drama, the verdict is the zenith of dramatic tension. Courtroom proceedings lend themselves particularly well to a narrative structure: adversarial parties, high stakes (life, honor, and money), and above all, a dénouement that is anything but certain. The biggest element of drama, namely the (un)predictability of the finding, has been the object of various attempts at demystification, with degrees of complexity ranging from back-of-the-envelope calculations (that is, estimates meant to produce ballpark figures) to machine learning, the frontier of legal prediction. Attempts at predicting judicial outcomes in bench trials have highlighted one salient fact: intrinsically “human” factors continue to undermine even the most sophisticated predictive systems. The prevalence of these extraneous factors has the potential to shake our confidence in our current legal system and pave the way for a legal future that circumvents the human factor altogether.

Every day, lawyers and general counsel make predictions about the outcome of a case. Senior partners are highly valued because their experience in the field has sharpened their legal intuition and predictive abilities.[1] Notwithstanding the incredible capacities of the human mind, even the best lawyers encounter only so many cases in their careers. With the help of quantitative methods, can non-experts beat even the most eminent jurist?

Using back-of-the-envelope calculations, it is possible to make rough predictions about a judge’s findings. The following predictive system is largely inspired by an early attempt by Professor Stuart Nagel, published in the American Behavioral Scientist in 1960.[2] In criminal cases, one could start by identifying the arguments made by the prosecution that are most highly correlated with a conviction, for a certain class of crimes in a specific jurisdiction (for example, cases with a single charge of assault in Texas). By calculating correlation coefficients between the occurrence of those arguments and conviction rates, and subsequently identifying when those same arguments are made in a new case, one could reasonably predict the outcome of an ongoing case. Based on the correlation coefficients, weights can be assigned to each argument such that the system predicts a conviction if the prosecution makes a critical number of heavily weighted arguments. The main advantage of this method is its transparency. Each prediction is explained by the weight of the arguments made by the prosecution, such that it is possible to “rationalize” adjudication. This rests on the comforting assumption that judges rationally consider every argument and are consistent in their findings for the same category of offense, an assumption that I will show to be unwarranted.
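To make the mechanics concrete, here is a minimal sketch of such a system in Python. The three argument categories, the toy case data, and the 0.5 decision threshold are invented for illustration; they are not drawn from Nagel’s study.

```python
# A minimal sketch of a Nagel-style weighted-argument predictor.
# The argument categories, the toy case data, and the 0.5 threshold
# are invented for illustration, not taken from the 1960 study.
import numpy as np

# Each row is a past case; each column is a prosecution argument
# (e.g. eyewitness, prior record, confession): 1 if made, 0 if not.
past_arguments = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
])
convicted = np.array([1, 1, 0, 0, 1])  # 1 = conviction, 0 = acquittal

# The correlation of each argument's presence with conviction is its weight.
weights = np.array([
    np.corrcoef(past_arguments[:, j], convicted)[0, 1]
    for j in range(past_arguments.shape[1])
])

def predict_conviction(new_case, threshold=0.5):
    """Predict a conviction if the weighted sum of the arguments made
    in the new case exceeds the (assumed) critical threshold."""
    return float(weights @ new_case) > threshold

print(predict_conviction(np.array([1, 1, 0])))  # True for this toy example
```

The appeal of the method is visible in the code itself: every prediction decomposes into named arguments and their weights, which is precisely the transparency claimed above.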

I argue that the above approach fails to capture the complexity of a trial, and is thus likely to produce inaccurate predictions. Indeed, this method treats each argument as having its own discrete impact on the outcome of the case, not accounting for the fact that arguments influence each other and can be amplified by an eloquent prosecutor. It also fails to reflect the fact that a trial is a dynamic system of interrelated legal and psychological forces, not a rational calculus of causes and effects. A study of judges’ parole decisions conducted in Israel shows that the rate of approval significantly decreases as the workday goes by.[3] More specifically, the probability of a favorable parole decision steadily declines, from roughly 0.65 at the beginning of the day to nearly 0 right before the judge’s break. The rate of parole approval returns to about 0.65 after a break, suggesting that judges can suffer from mental exhaustion that inhibits their ability to make rational decisions. Transposing this finding from hearings to trials, we can reasonably expect that arguments made at the end of the day will have a smaller impact on the judge’s decision than those made early on.
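One crude way to fold this exhaustion effect into the earlier weighted-argument sketch is to discount an argument’s weight by how long the judge has been sitting since the last break. The linear decay and the 120-minute session length below are assumptions loosely inspired by the shape of the parole study’s curve, not parameters the study reports.

```python
# A hedged sketch: discounting an argument's weight by judicial fatigue.
# The linear decay and the 120-minute session length are assumptions
# loosely inspired by the parole study, not parameters it reports.

def fatigue_factor(minutes_since_break, session_minutes=120):
    """Scale from 1.0 (fresh after a break) down toward 0.0 just before
    the next break, mirroring the reported decline in favorable rulings."""
    return max(0.0, 1.0 - minutes_since_break / session_minutes)

def effective_weight(base_weight, minutes_since_break):
    # The same argument counts for less the later in the session it is made.
    return base_weight * fatigue_factor(minutes_since_break)

print(effective_weight(0.8, 10))   # early in the session: ~0.73
print(effective_weight(0.8, 110))  # just before the break: ~0.07
```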

Even if we were able to include the judge’s level of exhaustion as a variable in our previous model, that would not resolve the issue of having to rely on labor-intensive human coding. Indeed, the inability to thoroughly survey thousands of cases is a major hurdle to scaling up legal predictions. But today, we have new quantitative methods that use large datasets and allow us to build predictive systems that do not limit themselves to rational arguments.

Big data applied to the courtroom, or the large-scale statistical analysis of judgments combined with machine learning, is the frontier of legal prediction. Rather than limiting ourselves to the arbitrary category of “legal arguments” in our prediction, we let language analysis identify which combinations of words are important for judicial decisions. In this article, we will look at a predictive system that uses machine learning to predict decisions of the European Court of Human Rights.[4]
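Concretely, “letting language analysis identify combinations of words” usually means extracting n-gram features from the text of each case. The sketch below shows the idea using scikit-learn’s CountVectorizer (an assumed tool choice; the two toy fact summaries are invented):

```python
# A minimal sketch of turning case text into word-combination ("n-gram")
# features with scikit-learn (an assumed tool choice). The two fact
# summaries are invented examples.
from sklearn.feature_extraction.text import CountVectorizer

facts = [
    "the applicant complained to the district prosecutor office",
    "the state attorney office reviewed the complaint",
]

# ngram_range=(1, 3) extracts single words, pairs, and triples, so a
# phrase like "district prosecutor office" becomes a single feature.
vectorizer = CountVectorizer(ngram_range=(1, 3))
matrix = vectorizer.fit_transform(facts)
print(vectorizer.get_feature_names_out()[:10])
```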

In this study, the binary outcome is the finding, or not, of a human rights violation by the Court. Theoretically, by analyzing the facts of the case, the machine learning algorithm identifies which words or sets of words are the best predictors of a violation or non-violation. Predictions are then made based on the frequency of those predictors in the Court’s summary of the facts. The model is trained using an algorithm that can then be tested using cross-validation. To test the model, we start by splitting the data into a number of groups, called “folds.” For example, if we let k=5, we split the data into five groups, each containing the same number of cases. The model is then trained on k-1 folds (in our example, four folds), and then tested on the fold that was not used for training. This operation is repeated until each fold has served as the testing fold. This allows the researchers to determine the optimal parameters of the machine learning system for out-of-sample performance, since a model tested on the same cases it was trained on would become overly specific to the current dataset. In this type of model, it is important to have a balanced dataset, with the training data consisting of an equal number of “violation” cases and “non-violation” cases. If there is an imbalance, the algorithm will simply learn that there are many more “non-violation” cases than “violation” cases and predict “non-violation” for every new case.[5]
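The following sketch illustrates this balanced k-fold workflow. It assumes scikit-learn and uses a linear support vector machine over n-gram features, in the spirit of the study; the ten toy fact summaries and their labels are invented stand-ins for the Court’s actual case documents.

```python
# A sketch of the balanced 5-fold cross-validation described above,
# assuming scikit-learn. The ten "fact summaries" and labels below are
# invented stand-ins for real ECtHR case documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "applicant detained without judicial review",
    "ill treatment in custody was not investigated",
    "no effective remedy was available to the applicant",
    "complaint to the district prosecutor was ignored",
    "detention was extended without a hearing",
    "the domestic courts examined the claim fully",
    "compensation was awarded promptly by the state",
    "the applicant received a fair public hearing",
    "the interference was proportionate and lawful",
    "the authorities investigated the complaint effectively",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = violation, 0 = non-violation (balanced)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4)),  # single words up to 4-word phrases
    LinearSVC(),                          # linear classifier over those features
)

# k=5: train on four folds, test on the held-out fifth, and rotate
# until every fold has served once as the test set.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=folds)
print(scores.mean())  # average out-of-sample accuracy across the five folds
```

StratifiedKFold keeps the violation/non-violation ratio equal in every fold, which is one way of enforcing the balance the paragraph above calls for.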

While a blind guess would yield a 50% accuracy rate, this model averages an accuracy of 75%. Amongst the top predictors of violation, we find the phrases “district prosecutor office,” “district prosecutor,” and “of the Chechen republic.” The top predictors of non-violation include “state attorney office,” “Bosnia and Herzegovina,” and “the death penalty.” These results suggest that the algorithm is capturing the frequency of winning (and losing) parties in previous judgments rather than the relevant facts. In predicting future cases, this algorithm might incorrectly predict a violation just because a case involves the Chechen republic to some degree. Furthermore, the inclusion of predictors that denote the parties (“district prosecutor” or “state attorney”) suggests that the parties most often cited (and thus possibly also the more convincing) are likely to win. Yet those predictors tell us nothing about the decision-making process of the judge. Finally, this method also falters when we try to predict future cases. When the training data is limited to earlier cases and the model is tested on more recent ones, the average performance ranges from 0.58 to 0.68. The larger the time gap between the training cases and the testing cases, the lower the performance.
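A minimal way to reproduce this temporal test, continuing from the previous sketch, is to assign each case a judgment year (invented below), train only on cases before a cutoff, and test on the later ones:

```python
# A sketch of the temporal evaluation: train on earlier cases, test on
# later ones. `model`, `texts`, and `labels` come from the previous
# sketch; the judgment years below are invented for illustration.
import numpy as np

years = np.array([2000, 2001, 2002, 2010, 2011,
                  2000, 2001, 2002, 2010, 2011])
texts_arr, labels_arr = np.array(texts), np.array(labels)

cutoff = 2005
train = years < cutoff            # earlier cases only
model.fit(texts_arr[train], labels_arr[train])
print(model.score(texts_arr[~train], labels_arr[~train]))  # accuracy on later cases
```

Widening the gap between the training years and the test years is what drives the reported drop from 0.75 down to the 0.58–0.68 range.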

These limitations are an indication of how important it is to choose an appropriate training set. Balancing the training set is crucial to ensure that the algorithm captures the actual predictive features. The low performance of a model trained on earlier cases and tested on later ones suggests that the law could be, to some extent, evolving, thus making it harder to predict future cases using older ones. Finally, for certain types of human rights violations, knowing just the name of the judge is enough to yield a correct prediction in 79% of violation cases.[6]

The two methods I have introduced both have advantages and drawbacks. The correlation coefficient approach allows us to frame our predictive system in terms of legal arguments, which helps make our predictions more credible. The machine learning approach, while it has the advantage of not limiting itself to arguments alone, captures elements (like the recurrence of a country’s name) that bias the algorithm. Finally, the fact that the mood of the judge or their character (as captured by their name) is a significant predictor of case outcomes suggests that, perhaps, predictive systems aimed at replacing human decisions should not be trained on past datasets, since they would merely reproduce the insufficiencies of the current system. Merely predicting a judge’s finding, rather than aiming to replace the judge, is in itself a valuable endeavor, and one that will require further improvements. For now, human judgments and predictions will likely continue to dominate the legal field, so don’t worry: machines will not be ruining your Suits binge-watching anytime soon.[1]

[1] Katz, Daniel M. “Quantitative Legal Predictions –or– How I Learned to Stop Worrying and Start Preparing for the Data-Driven Future of the Legal Services Industry.” Emory Law Journal, vol. 62, p. 59.

[2] Nagel, Stuart. “Using Simple Calculations to Predict Judicial Decisions.” American Behavioral Scientist, vol. 4, no. 4, 1960, pp. 24–28.

[3] Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. “Extraneous Factors in Judicial Decisions.” Proceedings of the National Academy of Sciences, vol. 108, no. 17, 26 Apr. 2011, pp. 6889–6892.

[4] Medvedeva, Masha, Michel Vols, and Martijn Wieling. “Using Machine Learning to Predict Decisions of the European Court of Human Rights.” Artificial Intelligence and Law, 2019, pp. 1–30.

[5] Medvedeva, Masha, Michel Vols, and Martijn Wieling. “Using Machine Learning to Predict Decisions of the European Court of Human Rights.” Artificial Intelligence and Law, 2019, pp. 1–30.

[6] In this phase, the researchers did not consider how each judge voted, but simply if their name was associated with a “violation” or a “non-violation” case.
