Want to improve AI for law? Let's talk about public data and collaboration
When data scientists want to know if their artificial intelligence software can recognize handwritten digits, they have to test it. For most, this means taking a dataset of black-and-white handwritten symbols and running it through the software.
MNIST is one of the older and more well-known datasets used in this task. Called a training dataset, this data trains software to spot patterns so it can later apply those patterns to analyze new handwriting samples.
The popularity of the MNIST dataset among those working on image processing led it to become a benchmark, a dataset that people could use to compare their software’s accuracy. The dataset, like a racetrack, allows developers to compete for the best score. This is one way that artificial intelligence and machine learning get better.
With expanded applications of machine learning in law, the time has come to develop MNIST-like datasets for legal system applications.
Creating robust, publicly available training data on a variety of legal topics would improve accuracy and adoption while lowering the cost of entry, which will increase the number of people experimenting and researching in machine learning applications for law.
“Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it’s the data collection and labeling,” writes Luke de Oliveria, co-founder of Vai Technologies, an AI software company. “Standard datasets can be used as validation or a good starting point for building a more tailored solution.”
This is as true in image processing as it is in legal applications of AI. But when it comes to legal applications, the data is not always there.
“This is a missing thing,” says David Colarusso, director of the Legal Innovation and Technology Lab at Suffolk University Law School in Boston. “You can’t find the datasets because the people that have done this work” consider it proprietary or claim attorney-client privilege.
Colarusso says this dearth of data limits the capacity of developers and researchers to use machine learning to tackle legal issues, like the access-to-justice problem. This is because collecting and labeling this data, necessary steps in developing a training dataset, is arduous and often expensive.
Josh Becker, the CEO of Lex Machina, a legal analytics company, and leader of the LexisNexis accelerator program, explains that access to data is a sticking point for new or expanding companies.
He says that every time a company like his wants to expand into a new subject matter area, it will spend upwards of $1 million to build the appropriate dataset from PACER, the federal courts’ document portal. This is an immense hurdle for a startup, and it creates a near-impossible roadblock for a nonprofit organization or an academic researcher.
In reaction, there are attempts to liberate legal data. Free Law Project created RECAP to build a free version of PACER. Carl Malamud’s work to free public legal data at the state and federal levels is well documented. Chicago-Kent College of Law professor Dan Katz’s company LexPredict recently released a framework to build datasets from the Securities and Exchange Commission’s EDGAR database (a fight Malamud has also undertaken). And Measures for Justice, a nonprofit, is traveling the country county-by-county, collecting criminal justice data to aid cross-jurisdictional analysis.
These projects have had varying success, and they often fall short of collecting the complete datasets they seek. This is not for lack of trying; it is a clear sign that freeing legal system data is hard. (In the case of LexPredict’s project, we do not know its potential because it was released this month.)
Collecting this data is only one step to building a training dataset.
With this in mind, the LIT Lab teamed up with Stanford’s Legal Design Lab, led by Margaret Hagan, to create a taxonomy of legal questions as asked by laypeople that can be used to label datasets that machine-learning models can be trained on.
Colarusso explains that this project is necessary because there is a “matchmaking problem” when it comes to websites providing legal information. The current dominant model is listing topics based on legal terms of art like “family law.”
Drawing on more than 75,000 questions covering several dozen legal issues, the project aims to build a training dataset that can support “algorithmically driven issue spotting,” Colarusso says. The goal is to help online legal help portals connect users with information and resources more accurately and to narrow the access-to-justice gap. The project is currently seeking help from volunteer attorneys.
He hopes that making the labeled dataset public will give it a use beyond the lab’s own work: benchmarking.
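The kind of issue spotting described here can be illustrated with a minimal sketch. It assumes a tiny, invented set of labeled questions and a simple word-count scorer; the names `LABELED`, `train` and `spot_issue` are hypothetical, and nothing here reflects the method or data the LIT Lab or the Legal Design Lab actually use.

```python
from collections import Counter, defaultdict

# Invented example questions with made-up labels, for illustration only.
LABELED = [
    ("my landlord is trying to evict me", "housing"),
    ("the landlord will not return my security deposit", "housing"),
    ("how do I get custody of my child", "family"),
    ("my ex stopped paying child support", "family"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(tokenize(text))
    return counts


def spot_issue(question, counts):
    """Return the label whose training words overlap most with the question."""
    words = tokenize(question)
    scores = {
        label: sum(word_counts[w] for w in words)
        for label, word_counts in counts.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(LABELED)
    print(spot_issue("can my landlord evict me without notice", model))
```

A real system would need far more labeled questions and a stronger model, which is why a large public dataset matters.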
Colarusso and his partners are among only a small cadre of people looking to fill this need for legal system training data, even as legal AI applications are growing. According to contract review company LawGeex, between 2017 and 2018 the number of AI legal technology companies increased from 40 to 66, or by 65 percent. Similarly, algorithmic bail risk-assessment tools have grown in popularity and use by criminal justice system stakeholders over the past decade.
Creating robust, public training datasets for law has a few potential benefits.
First, large, available datasets like the one being created by Suffolk and Stanford would lower the cost of entry for new companies and researchers in this space and embolden exploration of these important issues. These datasets would create a ripple effect through the profession that building a single, proprietary dataset does not.
Second, these datasets have the potential to provide insight for consumers confronting machine learning tools in court or the marketplace.
If, for example, there were a large, labeled, public dataset of business-to-business contract disputes from federal district courts, every platform that claims to predict these types of cases could be tested on it, which would illustrate the relative accuracy of each tool.
While this would not lift the veil on private datasets, consumers would have some comparative analysis on which to base their purchasing decisions besides marketing material and online reviews.
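As a rough sketch of what such a comparison could look like, the example below scores two prediction tools against the same labeled cases and ranks them by accuracy. Everything in it is an assumption made for illustration: the four case summaries and their outcomes are invented, and `tool_a` and `tool_b` are hypothetical stand-ins, not real products.

```python
from typing import Callable, Dict, List, Tuple

# Invented benchmark of (case summary, outcome) pairs, for illustration only.
BENCHMARK: List[Tuple[str, str]] = [
    ("supplier missed delivery deadline under signed contract", "plaintiff"),
    ("buyer alleges breach but no written agreement exists", "defendant"),
    ("vendor failed to pay invoice under signed contract", "plaintiff"),
    ("claim filed after limitations period expired", "defendant"),
]


def tool_a(summary: str) -> str:
    """Hypothetical tool that always predicts a win for the plaintiff."""
    return "plaintiff"


def tool_b(summary: str) -> str:
    """Hypothetical tool that predicts plaintiff only if a signed contract is mentioned."""
    return "plaintiff" if "signed contract" in summary else "defendant"


def accuracy(predict: Callable[[str], str],
             benchmark: List[Tuple[str, str]]) -> float:
    """Fraction of benchmark cases where the prediction matches the label."""
    correct = sum(1 for text, label in benchmark if predict(text) == label)
    return correct / len(benchmark)


def leaderboard(tools: Dict[str, Callable[[str], str]],
                benchmark: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Score every tool on the same data and sort from best to worst."""
    scores = [(name, accuracy(fn, benchmark)) for name, fn in tools.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    tools = {"tool_a": tool_a, "tool_b": tool_b}
    for name, score in leaderboard(tools, BENCHMARK):
        print(f"{name}: {score:.0%}")
```

Because every tool is scored on the same public data, the resulting ranking is comparable across vendors without any of them disclosing their private training sets.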
However, Colarusso notes: “To reach the stage of benchmarking, there needs to be community consensus that the dataset is a gold standard.” This would require collaboration among companies, law firms and researchers in the space.
This is not an unattainable goal, and luckily there is an example worth replicating.
From 2006 to 2011, the National Institute of Standards and Technology held a competition called “Legal Track Interactive Task” at its Text REtrieval Conference to evaluate automated document review.
This voluntary event provided competing companies and researchers with still-public datasets containing millions of documents to evaluate three areas of competency and then rate them on an accuracy scale of 0 to 100.
“Not all machine learning-enabled processes in document review are very effective, few have in fact been shown to do as well or better than humans, almost all have difficulty assessing their own performance [accuracy] adequately,” says Nicolas Economou, CEO of e-discovery firm H5. He argues that TREC allowed for a scientifically rigorous comparison rarely seen in the field. H5 took part in the event twice.
This type of cross-platform comparison can be helpful to firms and in-house counsel considering one of many e-discovery services on the market. With the right datasets, this same approach can be applied to bail risk assessments, case outcome prediction models and contract review platforms. No longer would there need to be reliance on “man v. machine” PR stunts.
Beyond legitimizing this technology for the consumer, Economou says, “these studies resulted, in part, in the greater acceptance of machine learning in discovery.”
Supporting this conclusion, he points to a 2012 order from then-U.S. Magistrate Andrew Peck, which recognized “that computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.” The opinion cited work produced by TREC, among others, as evidence for this conclusion.
An opinion like this should have every legal machine-learning company clamoring for public training data and the opportunity to benchmark against competitors in a scientifically valid way.
“In my view, these studies serve as a shining (and to this day, pretty unique) example of how independent government measurement laboratories can provide tools and protocols that can help with the safe deployment of AI,” says Economou of the NIST trials.
This type of work does not have to be done by a government agency, as illustrated by industry-led examples like MLPerf. However, if machine learning for law wants to mature and improve its adoption and efficacy, then tech companies, law firms, researchers and universities are going to have to step up and work together.
Corrects to 2011 in 27th paragraph on May 23.