In Search of the Perfect Search

By Jason Krause

April 2, 2009, 3:40 am CDT

Illustration by Robin Bartholick/Corbis

It would be the ultimate discovery for e-discovery: a perfect method to turn terabytes of digital data into a collection of case-relevant documents.

Three years ago, a handful of lawyers and scientists started the quest, a project to save litigation from being buried in an avalanche of electronic documents. Since then, the Text Retrieval Conference Legal Track has been using different types of computer searches to wade through huge piles of digital information, hoping to get closer to a complete picture of what is issue-important in a computer’s data stores.

The good news: The TREC Legal Track team believes it is close to finding a protocol that can work. The bad: The project also found disturbing problems with the way lawyers work today.

And the harshest conclusion: Keyword searching—what most lawyers use to find litigation documents—misses the majority of relevant documents. Or as Jason Baron, one of the Legal Track study coordinators, puts it, “Lawyers need to understand that the way they have been searching for electronic documents has some serious flaws.”

So as they search for a solution, the Legal Track team has tossed a ton of online documents, the efforts of academics worldwide and commercial e-discovery advisers, the skills of senior litigators, a lawyer collecting frog sounds and the ghost of Ludwig Wittgenstein into the challenge.

And results are on the way.

DIGITAL DRAMA

Ever since Bill Gates turned into a whiny, twitching mess on the stand as his own e-mails were read back to him during the 1998 Microsoft monopoly trial, lawyers have known that digital documents, especially e-mail, are a key to winning cases.

But without improvements in technology, those “gotcha” moments might be hard to come by. Facing the prospect of monumental e-discovery costs, some lawyers may settle important cases, further reducing the number of trials. And mishandling e-discovery demands has cost firms millions in court fines and lost claims. In fact, it was a landmark case that spurred the creation of TREC Legal Track.

It’s something of an accident of history that Baron is now an expert on information retrieval. He’s a lawyer by training with no special background in computer search technology. In addition to seven years as a Social Security litigator for class actions, he has been involved in high-profile cases for the federal government for decades, including the legal challenge to the Communications Decency Act, which restricted online pornography.

In the closing days of the Reagan administration, a federal court granted a temporary order to retain backup tapes containing Iran-Contra records from the National Security Council. At that point, electronic records in litigation were pretty much unheard of. But as the Justice Department’s lead counsel on the case and its 7 million pages of e-mail, documents and calendars from the mid-1980s through 1989, Baron got his first glimpse at the inadequacies of search techniques.

Later, as director of litigation for the U.S. National Archives and Records Administration, Baron was assigned a request to review documents pertaining to tobacco litigation in U.S. v. Philip Morris. (The case itself is a data monster. The final opinion in PDF runs 1,683 pages, including a 30-page table of contents.)

No big deal. Searching for records is what an archivist does, he thought. But for this litigation, archivists were expected to search for paper documents going back 50 years, and also to search millions of e-mails dated as far back as the 1980s.

Even with 25 archivists and lawyers on task, it was impossible to review every page, document and e-mail. So Baron applied the same technology lawyers have been using to tackle document sets for a long time—simple keyword computer searches.

About 1 percent of the 20 million Clinton-era White House e-mails they combed through came up as potentially relevant through simple keyword searches. But even that tiny fraction brought Baron’s team near the breaking point.

“It was obvious to me that the volume of information was overwhelming us in litigation, and the technology we have to deal with it was just not sufficient,” Baron says.

He figured someone somewhere in the federal government must have done some research on the topic of information retrieval. In fact, he discovered that the U.S. Department of Commerce’s National Institute of Standards and Technology had been conducting a 15-year investigation on retrieval of text from large document collections.

When Baron approached the government scientists involved, they were thrilled to have a real-world problem to tackle as part of what had been a pure research project. TREC Legal Track, begun in 2006, is now co-sponsored by NIST and subagencies of the U.S. Office of the Director of National Intelligence.

Since then, Baron’s career hasn’t been the same. He now gives lectures and seminars on e-discovery, works on Legal Track and keeps his day job as a top government lawyer. “Ever since I’ve been swept up in this battle with e-mail and electronic records, it’s only grown in importance and scandal,” he says. “It’s been a fascinating career, but nothing I anticipated, that’s for sure.”

THE LANGUAGE CHALLENGE

The basic problems of e-discovery relate directly to language itself. George Paul, a partner at Lewis and Roca in Phoenix, has written extensively on e-discovery, which helped make him a devotee of 20th century philosopher Wittgenstein, who also grappled with the limits of language.

“The big-picture issue is that we are up against a fundamental, philosophical problem,” says Paul. “This is not a computer problem. … Words don’t stand for behavior, but are elastic and change their meaning depending on their context.”

During his time off, Paul sometimes makes amateur field recordings of frogs, which he sends to Legal Track colleagues. It’s a hobby that reflects a deep interest in the fundamental questions of communication.

“I’m not sure what the recordings mean, but I think he may have discovered frog language,” says Ken Withers, director of judicial education and content with the Sedona Conference, a Phoenix-based, law-oriented think tank.

Whether or not croaking is communicating, the problems of human language can be even more engaging.

“Misses and false alarms are facts of life in any system because language use necessarily involves ambiguity,” says TREC Legal Track co-coordinator Doug Oard. “You can reduce errors, but you cannot eliminate them.”

The fundamental ambiguity of language is compounded by human error. Add in the fact that scanning documents with optical character recognition software introduces mistakes and garbled text, and that’s just if you’re talking about one language. You also have the problems of nontextual information like sound, video, images, spreadsheets and other database files.

HERE COMES THE JUDGE

But most lawyers and their clients aren’t interested in philosophical debates, frogspeak or the hassles of proofreading, and they don’t care what search technology reveals about language. They just want an e-discovery technology that will stand up in court. And, while keyword searching is deficient, it is defensible.

“It may be dumb and naive to use keywords, but you can’t defend any of these new, advanced search technologies to a judge,” says Bill Speros, a Cleveland-based e-discovery consultant. “If the judge asks you how you got a result, you can’t just say, ‘I don’t know; that’s just what the computer told me.’ ”

There is no mystery about how keyword searches work: You sit a lawyer or paralegal at a computer and you show the court the search terms you negotiated with your opponents and what those terms found. That looks to many a judge like a good-faith effort.

However, the strange thing about using computers to search massive computer databases is that, even though massive data stores are now the norm, there has been almost no research on the topic. And that leaves the door open for unsubstantiated claims.

“Right now anyone can say anything about search technology,” says John Jessen, founder and chairman of Daticon Electronic Evidence Discovery, which is based in Kirkland, Wash. “You can say your search is 8 percent more accurate than the next guy’s, but there’s no common benchmark about what those claims mean. Right now it’s a marketing world; whatever marketers say goes.”

The most commonly cited research on information retrieval is a 1985 article, “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System,” by David C. Blair and M.E. Maron. Published in Communications of the Association for Computing Machinery, it is also about the only major research project on the problem.

Blair and Maron asked lawyers and paralegals to use keyword searches to find documents on a given topic among 40,000 documents comprising 350,000 pages. Legal researchers estimated they would find 75 percent of relevant documents, but the research showed only 20 percent had been found. In turn, Legal Track tested search technology from more than 20 different institutions and e-discovery vendors, finding that no single technology provided any better results than the standard Boolean keyword searches commonly used.

Baron says those last-century results are stunningly similar to results from the first two years of the TREC Legal Track more than two decades later. Across a range of hypothetical topics, Boolean keyword searches, which combine terms with operators such as “and,” “or” and “within so many words,” found only between 22 and 57 percent of all the relevant documents that a variety of alternative search methods retrieved cumulatively. But the Boolean search was no better or worse than the other, more sophisticated search methods tested, and it still represents the current standard.

Keyword searches done thoughtfully can return a workable number of documents. E-discovery consultant Speros recalled an insurance case in which lawyers needed documents pertaining to young people. Rather than just search for that term and some synonyms, he used words like mother, father, dad and words for activities correlated with children like baseball and football.

“If I can do it in a thoughtful way,” he says, “I can get better results than some fancy new search technology.”

Notes Phoenix litigator Paul, “The top people in the world have been working on this problem; they’ve had years and years and years and thousands of times more computing power since the days of Blair and Maron, and yet there’s been no material advance. How come we have 10,000 times more computing power than we did years ago and see no more advance?”

Still, something very important did come out of those earlier tests: While almost every test found roughly 20 percent of potentially relevant documents, each different type of search basically found different documents. When testers threw different combinations of search technologies at the database, they were able to find roughly 78 percent of the total number of relevant documents.
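The arithmetic behind that finding can be sketched in a few lines of Python. This is an illustration only: the document IDs, the three method names and their result sets are invented for the sketch and are not Legal Track data. It shows how methods that each find a small share of the relevant documents can together find far more, provided they find different documents.

```python
def recall(found, relevant):
    """Fraction of the relevant documents that a search retrieved."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(found & relevant) / len(relevant)

# Hypothetical toy collection: ten documents judged relevant (doc1..doc10).
relevant = {f"doc{i}" for i in range(1, 11)}

# Hypothetical results from three search methods (doc11, doc12 are false alarms).
results = {
    "boolean": {"doc1", "doc2", "doc11"},
    "fuzzy": {"doc3", "doc4", "doc12"},
    "concept": {"doc2", "doc5", "doc6"},
}

# Recall of each method on its own.
per_method = {name: recall(found, relevant) for name, found in results.items()}

# Recall when the results of all methods are pooled.
combined = set().union(*results.values())
combined_recall = recall(combined, relevant)

print(per_method)
print(combined_recall)
```

In this toy setup no single method finds more than three of the ten relevant documents, while the pooled results cover six of them.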

Baron believes these paradoxical and confounding findings can be reconciled if “lawyers come to realize that to improve the results of searching, one needs to use a variety of available search methods and tools. No one off-the-shelf method will solve all of your e-discovery issues.”

MIX AND MATCH

Baron and the Legal Track team are trying to create a credible process and protocol to improve digital searches. It won’t exactly be the perfect search; no one expects that. The researchers are using all the computing power and search techniques they can muster to try to crack the problem.

Here’s where the tobacco litigation archive comes in. Legal Track is using the nearly 7 million publicly available documents from the master settlement agreement database, a collection of tobacco documents produced in relation to several state lawsuits against the industry. That database was chosen because it contains a wide spectrum of types of documents.

At that target cache, TREC Legal Track is aiming 13 hypothetical legal complaints. Written like normal legal documents, they contain all the information included in real-world complaints for fictional tobacco-related lawsuits, such as campaign finance violations, class actions, antitrust investigations, securities litigation, patent infringement and wrongful death suits. The most important part is the search terms these hypotheticals lay out.

Baron says the Legal Track team has had fun dreaming up hypotheticals on subjects ranging from the music of Bob Dylan and Joan Baez to research on pigeon deaths. “Basically, anything you can think of has been contained in some subset of documents that were gathered together for purposes of the prior tobacco litigation, and we have taken full advantage,” he says.

For example, the request for documents pertaining to folk singers is part of a hypothetical complaint targeting a fictional tobacco company for securities fraud. The company’s fictional marketing campaign features counterculture music and film icons of the 1960s, so the complaint requests a search for all documents making a connection between the music of Peter, Paul, and Mary; Baez; or Dylan and the sale of cigarettes. It even asks for documents discussing the use of psychedelic colors and documents referencing Bonnie and Clyde, James Bond movies or the films of Stanley Kubrick.

Baron says it’s too early to know how these different topics will work out, but he expects graduate students will be poring over the results and writing Ph.D. theses to produce all kinds of further scholarship. The tests will use a number of leading search strategies to see which combination of search strategies gets the best results.

The participating teams of information scientists from around the world are mostly from academic institutions, with computer scientists like Paul Thompson from Dartmouth and Stephen Tomlinson from e-discovery vendor Open Text helping to design and run tests.

E-discovery vendors have been reluctant to help with Legal Track. Last year a half-dozen companies were involved, and some larger ones have offered advice, but only this year did two companies offer full participation.

“What we’ve tried to represent is that this is not going to be some sort of U.S. News & World Report ranking,” says Bruce Hedin, who is with San Francisco-based e-discovery company H5. “Some firms may stay on the sidelines because they’re reluctant to have their search technology measured, but I think they’ll see that we can offer standards and protocol that can actually validate their approach once they subscribe to what we’re doing.”

In Hedin’s critique of the project’s first two years, he said the results were interesting, but the research protocol used didn’t fully reflect the challenges lawyers face in the real world. He argued that a senior litigator usually makes the ultimate decision about relevance as a check on decisions made by a large number of researchers.

Hedin soon found himself enlisted in creating a new task for the project, and the group is trying a new approach called the interactive track, in which those who participate will be expected to concentrate their resources on finding relevant documents modeled on how a senior litigator in a real e-discovery setting operates.

While Hedin’s company is not officially involved, he has volunteered to help coordinate the effort. For year three, Legal Track has enlisted an estimated 110 volunteers and three topic authorities, expert attorneys who have experience similar to that of a senior litigator.

Experienced legal minds are hard to get on a volunteer basis. “As you’d expect, a lot of these people are busy with their day jobs,” Hedin says. “And it’s not like having people on staff where you can call a meeting anytime you need to.”

So far the TREC Legal Track research has identified a couple of practices that improve on the baseline keyword search. To start, lawyers need to work with opposing counsel to identify good search terms and to negotiate proposed Boolean search strings.

And it is important to use sampling—testing to see whether the search engines are finding documents known to be relevant. That means deploying what e-discovery experts call iterative feedback loops. These involve a team of lawyers and other in-house experts conducting searches in stages, and conferring with counsel and experts from the opposing party to determine whether the process is working.
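One way to picture the sampling step is as a recall estimate drawn from a random sample rather than from a full review. The Python sketch below rests on stated assumptions: the function name estimate_recall, the numbered toy collection and the reviewer rule are all hypothetical, and a real review would rely on human relevance judgments, not a formula.

```python
import random

def estimate_recall(collection, retrieved, is_relevant, sample_size, seed=0):
    """Estimate recall from a random sample instead of reviewing everything.

    is_relevant stands in for a human reviewer's judgment on one document.
    Returns None when the sample happens to contain no relevant documents.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(sorted(collection), sample_size)
    relevant_sample = [doc for doc in sample if is_relevant(doc)]
    if not relevant_sample:
        return None
    found = sum(1 for doc in relevant_sample if doc in retrieved)
    return found / len(relevant_sample)

# Hypothetical collection: 1,000 numbered documents.
collection = set(range(1000))

def reviewer(doc):
    """Pretend every tenth document is relevant."""
    return doc % 10 == 0

# Pretend the search under test retrieved only every twentieth document.
retrieved = {doc for doc in collection if doc % 20 == 0}

# Estimate from a 200-document sample, then check against a full review.
estimate = estimate_recall(collection, retrieved, reviewer, sample_size=200)
full_review = estimate_recall(collection, retrieved, reviewer, sample_size=1000)
print(estimate, full_review)
```

In an iterative feedback loop, a low estimate would prompt the parties to revise the search terms and sample again.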

Experts say that when litigators set up a search, they should identify the data types and then prove that the search tool they’re using works with those data types.

“Judges don’t want to get into a fight about tools, but want to hear a reasonable plan,” says Jessen, the e-discovery firm founder who is a volunteer in Legal Track. “This is not about perfection, but did you set up, enable and audit a process in good faith?”

Legal Track has an example of an imaginary negotiation in which lawyers offer counterproposals over whether search strings using the terms asthma, bronch, respire, breath, trach, child, young and juve are better than strings using severe asthma or asthma attack and child.
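The difference between those two proposals can be sketched in Python. The terms come from the negotiation described above, but how they are grouped is an assumption made for this sketch: the broad string is read as (any health term) AND (any youth term), with each term treated as a truncated stem, and the narrow string as ("severe asthma" OR "asthma attack") AND child. The four memos are invented examples.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def any_stem(tokens, stems):
    """True if any word begins with any of the truncated terms."""
    return any(tok.startswith(stem) for tok in tokens for stem in stems)

def has_phrase(tokens, phrase):
    """True if the words of the phrase appear consecutively, in order."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

HEALTH = ["asthma", "bronch", "respire", "breath", "trach"]
YOUTH = ["child", "young", "juve"]

def broad_query(text):
    tokens = tokenize(text)
    return any_stem(tokens, HEALTH) and any_stem(tokens, YOUTH)

def narrow_query(text):
    tokens = tokenize(text)
    phrase_hit = has_phrase(tokens, "severe asthma") or has_phrase(tokens, "asthma attack")
    return phrase_hit and any_stem(tokens, ["child"])

# Hypothetical documents, invented for illustration.
docs = {
    "memo1": "The child suffered a severe asthma attack at school.",
    "memo2": "Juvenile breathing complaints rose sharply last year.",
    "memo3": "Quarterly cigarette sales increased.",
    "memo4": "Bronchitis rates among young smokers.",
}

broad_hits = [name for name, text in docs.items() if broad_query(text)]
narrow_hits = [name for name, text in docs.items() if narrow_query(text)]
print(broad_hits)   # ['memo1', 'memo2', 'memo4']
print(narrow_hits)  # ['memo1']
```

The broad string sweeps in more documents to review, while the narrow string risks missing memos that describe the same subject in different words, which is the trade-off the two sides would be negotiating.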

Unfortunately, only a small minority of lawyers actually sit down with opposing counsel to negotiate document search strings. The iterative process employed in Legal Track is an aspirational model for the profession, which still largely treats discovery as adversarial.

The interactive task used experienced lawyers to review search results in a feedback loop, as happens in actual litigation. In some limited circumstances, advanced search technologies could beat Boolean in a head-to-head comparison. Previously, TREC researchers were able to find more documents than Boolean only by employing multiple search technologies together.

Oard says it’s not clear yet why this is happening, but the results are an improvement. “A lawyer can go to a judge and tell him with a straight face that, if well-implemented, our system is a reasonable alternative to Boolean.”

Baron expects to get more commercial participants and academic teams in the next year of TREC Legal Track, when the tests will target the online collection of documents from the Enron litigation. It is newer than the tobacco litigation database, which was made up primarily of scanned records, and should produce even more useful results.

BOOSTER SHOTS

Legal Track has gotten a big boost from two sources. The first is the Sedona Conference, which has been the major forum for lawyers to hash out e-discovery issues. The conference, which backed Baron and Oard’s original TREC research in 2006, was highly influential in the creation of recent amendments to the Federal Rules of Civil Procedure for e-discovery.

The second comes from federal judges. They have become keenly aware of the deficiencies in search technology and are increasingly impatient with lawyers who complain about the limitations without offering solutions.

“For lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread,” wrote Magistrate Judge John Facciola of the U.S. District Court in Washington, D.C., in U.S. v. O’Keefe.

In Victor Stanley Inc. v. Creative Pipe, Judge Paul Grimm at the U.S. District Court in Maryland wrote that experts are likely to be necessary in complex e-discovery cases, which immediately drives up cost and complexity. However, he also cited the Sedona Conference’s “Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery” and its eight practice pointers on successful keyword searching.

But federal rulings, even with amended federal procedure rules, can’t do the whole job. “The Federal Rules of Civil Procedure carried with them an unspoken optimism that technology will get cheaper and better, and offset the huge costs and complexity involved in e-discovery,” says Paul. “I think now there’s a growing sense of ‘Uh-oh, where are the efficiencies we thought we’d see?’ ”

Lawyers tend to leave search technology to consultants or technologists, but as judges become increasingly aware of the deficiencies of search, lawyers will need to know what their tech guys are up to.

As for the lawyers currently conducting e-discovery, they are finding themselves in an untenable position. “If we use some fancy search technology that we can’t explain, we’re put in harm’s way in front of the judge,” says Speros. “And if we use dumb and naive keyword searches, we’re in harm’s way. For lawyers like me, we just want to know one simple thing: What will work for me and my client?”

TREC researchers warn that their work has not yet found that ultimate search method, but they have created a viable test environment. The TREC Legal Track is to the point where it may soon offer lawyers a common language and defined processes for search that can account for the inherent deficiencies of the technology.

“What I hope to get out of this is some common language about search,” Jessen says, with a caveat. “Of course, if judges don’t buy into what we’re doing, this is not going anywhere.”

Sidebar

Search Models and Methods

Besides Boolean, TREC Legal Track is using several search technologies, including fuzzy search models, probabilistic (or Bayesian) models, statistical methods (also called clustering), machine-learning approaches, categorization tools and social network analysis. These search technologies fall into several classes:

• Boolean, the most familiar model, uses keywords to pull results, joining them with connecting words like “and” or “or” to find specific combinations. In more complex litigation, more sophisticated Boolean strings are often used with a fuzzy search technique, designed to account for spelling mistakes and word variations.

• Fuzzy search models attempt to refine a search beyond specific words, recognizing that words can have multiple forms, so that even if search terms don’t use the exact words in a relevant document, the document might still be found.

• Algebraic search is based on the premise that mathematical models can figure out the meaning of a document and then retrieve relevant documents by looking at the proximity of related words.

• Probabilistic search uses language models, including Bayesian belief networks, which make inferences about the relevance of documents based on computer learning about how concepts are communicated in a collection.

• Alternative search methods, usually grouped under the rubric of concept searching, often involve complex mathematical and linguistic models, but usually take human training to “teach” the computers to recognize terms and concepts.

• Clustering searches group terms found in similar contexts to figure out which words are used in connection with a search topic.

• Concept and categorization tool search systems rely on a thesaurus to capture documents that use different words to express the same thought.
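As a rough illustration of the fuzzy approach listed above, the Python sketch below accepts words that are close to a search term, not just identical to it. It uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure; the 0.8 cutoff, the function names and the sample sentences are assumptions for the sketch, and commercial tools may work quite differently.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def exact_contains(text, term):
    """Plain keyword match: the term must appear as a whole word."""
    return term in tokenize(text)

def fuzzy_contains(text, term, cutoff=0.8):
    """True if any word is similar enough to the term (ratio >= cutoff)."""
    return any(
        SequenceMatcher(None, tok, term).ratio() >= cutoff
        for tok in tokenize(text)
    )

# Hypothetical sentences; the first contains a misspelling of the kind
# that typing errors or OCR scanning can introduce.
garbled = "Ashtma in young patients"
unrelated = "Quarterly cigarette sales increased"

print(exact_contains(garbled, "asthma"))    # False
print(fuzzy_contains(garbled, "asthma"))    # True
print(fuzzy_contains(unrelated, "asthma"))  # False
```

Lowering the cutoff catches more garbled words but also more false alarms, so the threshold itself is something a team would want to test by sampling.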

Jason Krause is a freelance journalist in Madison, Wis., who contributes regularly to the ABA Journal.