J-Term: Data and Predictive Coding for Lawyers

by Jinbao Hu


Back in June 2018, GLS sent an email encouraging us to reflect on our motivation for pursuing the LL.M. degree and to make a plan for the LL.M. year. In light of the rapidly growing use of quantitative data by legal professionals, one of the goals I set was to learn more at CLS about the interaction between law and quantitative methods. Data and Predictive Coding for Lawyers (“Data”), taught by Professor Mitts during J-term, perfectly suited this goal.


Data provided a rare opportunity to gain both theoretical and hands-on experience with inferential statistics and machine learning. On the theoretical side, we used introductory textbooks for both statistics and machine learning. The two books were well chosen so that students without a quantitative background could learn the fundamentals of both subjects. On the practical side, we used BigML (https://bigml.com), a user-friendly machine learning platform, to build and automate classification, regression, and topic modeling tasks with models such as decision trees, ensembles, and deep neural networks. The machine learning jargon was intimidating at first, but it all made sense once we followed Professor Mitts’s instructions and worked through the exercises during and after class.


The application of machine learning was structured in two parts: the first covered traditional applications to numerical data, and the second covered applications to textual data such as voluminous judgments. The first part served as a foundation, familiarizing us with the models used in machine learning. The second part demonstrated the application of machine learning to legal data. Professor Mitts used a paper* to illustrate how to analyze thousands of judgments using the topic model provided by BigML. The conclusions drawn from the data analysis were innovative and powerful. By contrast, prior literature drew conclusions from only a few examples, which may not be representative of the population of relevant judgments.


I would also like to emphasize that all assignments in Data were completed in teams. The cooperation and discussion among team members were very constructive in helping us better understand machine learning.


As the statistics textbook said, you can tell lies with statistics, but you can never tell the truth without statistics. In a world of massive legal data, we cannot responsibly state a conclusion after reviewing only a small portion of the data. At the same time, it is costly and time-consuming to manually review every piece of legal data. This dilemma makes machine learning an indispensable tool for analyzing legal data correctly and efficiently. Although machine learning for text analysis is currently far from perfect, I believe its future is very promising.


Jinbao is an LL.M. student from China. He obtained LL.B. and LL.M. degrees from Peking University and an additional Master of Common Law degree from Hong Kong University. Before coming to Columbia Law School, he practiced corporate law at Sullivan & Cromwell in Beijing.


*Jonathan Macey & Joshua Mitts, Finding Order in the Morass: The Three Real Justifications for Piercing the Corporate Veil, 100 Cornell L. Rev. 99 (2014).