Smart Sampling in e-Discovery

Reduce Document Review Costs Without Compromising Results

In the past few years, the volume of electronic content has increased dramatically. Email, word-processing files, spreadsheets and more are churned out and distributed at a rate few of us could have imagined when we started our professional careers. Back then, big cases involved a few thousand documents: most files could fit in your briefcase. Today, similar cases might easily require review of tens or hundreds of thousands of documents. Some involve collections running into the millions.

It used to be accepted practice that attorneys representing clients in a lawsuit would read every document at least once, and often more than once. With discovery populations mushrooming, that is no longer a possibility. Clients can no longer afford the cost to review every document, even if counsel could find the time. There has to be a better way to find what is relevant and push aside what is not.

In recent years, promising techniques have evolved to help lighten the review burden. Using these strategies can dramatically reduce the number of electronic records that require attorney review. Some of these techniques include:

  • Find and omit duplicate and near-duplicate documents from the review. (Near-duplicate documents are identical except for minor differences, such as the same letter addressed to two different people.)
  • Develop an agreed-upon list of key search terms and/or date ranges for identifying potentially relevant documents for further treatment. With counsel and court in agreement over the key terms, the parties simply ignore the mountain of documents that don’t meet the search criteria.

What troubles the courts, and often counsel as well, is this: How can the parties be assured that the agreed-upon search terms did not overlook other documents that are relevant to the case? One answer is data sampling, a process whereby the producing party reviews a sample set of documents and extrapolates the results to the entire population.

When Should You Sample?

When you consider the stages of e-discovery, as depicted by the Electronic Discovery Reference Model (EDRM), sampling can be useful at several points along the way. During processing, for example, sampling can be a check on your procedures. For early case assessment, sampling can help identify key themes. During review, sampling can be used to check for inconsistencies in coding calls. Before production, sampling can be used for quality control.

Even so, the most common — and perhaps most critical — use of sampling is during review and analysis. With data populations exploding, sampling is an essential method of check and balance. It serves two key purposes:

1. To make reasonably sure that responsive documents are identified, reviewed and produced.

Courts do not typically demand that litigants find and produce all responsive documents. In today’s electronic world, that is almost an impossibility, or at least hugely impractical. Rather, they require that litigants make reasonable efforts to find and produce responsive documents.[1]

In a typical search and review, you might run a set of responsive search terms across the document population, segregate the responsive documents from the nonresponsive documents, and produce only the responsive documents. Sampling provides a check on the results. You review a statistically valid random sample of the nonresponsive documents to see if the search terms missed capturing any responsive documents. If they did, you modify your search, run it again and then sample it again, until you are reasonably satisfied with the results.

To better illustrate this point, consider this: A widely used standard for statistically valid sampling is a 95 percent confidence level with a 2 percent margin of error (that is, a confidence interval of plus or minus 2 percentage points).[2] Therefore, if you had one million documents to review, by applying statistically valid sampling methodologies, you would need to review only about 2,400 documents (~0.24 percent of the document population) to project the results onto the entire one million documents. You can then focus on reviewing the documents that are most likely to be relevant.
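
The arithmetic behind a figure like this can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes the standard sample-size formula for estimating a proportion, with a finite population correction and the most conservative assumption that half the documents are responsive. The function name sample_size and its parameters are hypothetical. Exact results vary slightly with the formula and rounding conventions used.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.02, proportion: float = 0.5) -> int:
    """Sample size for estimating a proportion in a finite population.

    Hypothetical helper: `proportion` is the assumed share of responsive
    documents; 0.5 is the most conservative choice (largest sample).
    """
    # Two-sided critical value, about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size needed if the population were effectively unlimited.
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Finite population correction: a smaller collection needs fewer documents.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    for docs in (100_000, 1_000_000):
        print(f"{docs:>9,} documents -> review {sample_size(docs):,}")
```

Under these assumptions the required sample grows very slowly with the size of the collection, which is why a review of a few thousand documents can speak for a population of a million.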

Through this iterative process of search and sampling, you add an extra layer of quality control, one that provides assurance to a court that you took reasonable efforts, and that could help avoid unwanted sanctions.
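
The check on the nonresponsive set can be sketched as follows. This is a rough illustration under stated assumptions: the function name project_missed is hypothetical, the sample is assumed to be a simple random draw from the nonresponsive documents, and the margin uses the normal approximation, which becomes unreliable when very few or no responsive documents turn up in the sample.

```python
import math


def project_missed(sample_calls, population, z=1.96):
    """Project the share of responsive documents left in the nonresponsive set.

    sample_calls: one boolean per sampled document, True if the reviewer
    found it responsive. population: size of the nonresponsive set.
    z: critical value (1.96 corresponds to 95 percent confidence).
    """
    n = len(sample_calls)
    if n == 0:
        raise ValueError("sample is empty")
    rate = sum(sample_calls) / n
    # Normal-approximation margin of error around the observed rate.
    margin = z * math.sqrt(rate * (1 - rate) / n)
    low = max(0.0, rate - margin)
    high = min(1.0, rate + margin)
    return {
        "rate": rate,
        "margin": margin,
        "projected_low": round(low * population),
        "projected_high": round(high * population),
    }


if __name__ == "__main__":
    # Illustrative numbers only: 24 responsive documents in a sample of 2,400.
    calls = [True] * 24 + [False] * 2376
    print(project_missed(calls, population=900_000))
```

Whether the projected range is low enough to stop, or high enough to justify revising the search terms and sampling again, remains a judgment for counsel; the code only supplies the estimate.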

2. To safeguard against inadvertent production of privileged documents.

Inadvertent production of privileged documents can carry serious consequences and may cripple your case. Even so, a manual, linear privilege review of every document can be enormously time-consuming, if not virtually impossible.

Sampling can help ensure that privileged documents do not slip through the cracks. If, through sampling, you find any that did slip through, you can take appropriate measures to correct the mistake, such as by revising search terms and rerunning the search or by creating review rules to flag such documents for a second level review.

Judicial Requirements for Sampling

Recent court opinions suggest that sampling is not only useful but may be required. Several decisions in the past few years have penalized lawyers for not sampling documents before they were produced (waiver of privilege) and for not sampling the documents that were not produced (omission of responsive data).

In two landmark decisions, U.S. Magistrate Judges John M. Facciola and Paul W. Grimm issued key rulings discussing sampling. Specifically, they criticized counsel who hoped to be excused for inadvertent waiver of privilege even though they had not sampled the documents produced after keyword searches.[3]

Even more recently, another court found waiver of privilege in a “smoking gun” attorney-client communication because counsel failed to sample.[4]

Courts understand that there can be mistakes and that the explosion of data makes it impossible to look at everything. When counsel seek forgiveness for an inadvertent production, courts are increasingly likely to ask them whether they used sampling technologies. While not perfect, sampling is a reliable way to check your review. It provides a higher level of comfort before the production that nonproduced documents do not include something that should have been produced and that the produced documents do not contain privileged material.

Sampling also provides a way to confirm that the scope of the discovery sought is appropriate for the case. In fact, Tennessee courts have required sampling or “tiered discovery” to narrow down certain discovery requests. In Frees Inc. v. McMillian,[5] the court found that the discovery should be accomplished in stages. In the “first tier,” the plaintiff was to identify at least one project involving files allegedly removed from a disputed laptop, which would be subsequently searched and produced. If any of the produced documents and/or files could be shown to be relevant to the case, the parties would proceed to the “second tier” of discovery and the plaintiff could request documents related to other projects. However, if no responsive documents could be found in the first tier, the plaintiff would be required to make a sufficient showing to the court as to why discovery should proceed further.

The Tennessee Rules of Civil Procedure mention sampling in e-discovery multiple times: to determine if information sources are reasonably accessible, to learn about burdens and costs for electronic discovery, to understand what the data consists of, to establish if the data is pertinent and valuable to the litigation, to determine if additional production is warranted, to consider whether to shift some or all of the discovery costs to the requesting party, and to investigate whether the responding party has deleted any electronic information after litigation was probable or had commenced.

What Does Sampling Entail?

So what, exactly, is sampling all about? The concept comes from the world of statistics and is broadly applied in any number of common circumstances.

Sampling is an iterative process that continues until you reach a point where you can be confident about your results. Data sampling, properly done, allows counsel to review a small, representative portion of the total document universe and extrapolate the findings to the larger population.

Generally, sampling involves one of three methods:

Judgmental sampling
For this method, the sampler exercises judgment in selecting the elements to be sampled. That means that not every item of data has an equal chance of being selected. Typically, judgmental samples are used when staff or time resources are limited or there is no need to generalize about the entire population. For example, a judgmental sample may be sufficient to show a control weakness or to prompt management to take corrective action.

Random sampling
In this method, any piece of data in the population has an equal chance of being selected. This ensures that no bias is used in the sample selection. However, a random sample is not necessarily a “statistical sample,” and without additional statistical criteria the results cannot be projected to the population. A random, nonstatistical sampling method would typically be used as a way of emphasizing that the results were not biased or exaggerated by selecting, for example, known cases of noncompliance.

Statistically valid sampling
This method combines random sampling with additional statistical criteria such as confidence level, confidence intervals, expected error rate and precision. This method is used to make a statement about the population from which the sample was selected. In this case, outcome measures can be projected to the population.
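
The random selection step that underlies the second and third methods can be sketched with Python's standard library. This is only an illustration: draw_random_sample and its parameters are hypothetical names, and the fixed seed is an assumption added so the same draw can be reproduced later (on the same Python version) if the selection is ever questioned.

```python
import random


def draw_random_sample(doc_ids, size, seed):
    """Return `size` distinct document IDs, each equally likely to be chosen."""
    # A dedicated generator with a recorded seed makes the draw repeatable.
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), size)


if __name__ == "__main__":
    population = [f"DOC-{i:07d}" for i in range(1, 10_001)]
    sample = draw_random_sample(population, size=25, seed=2010)
    print(sample[:5])
```

A judgmental sample, by contrast, would be a list of documents picked by hand. For a statistically valid sample, the size argument would come from a calculation based on the chosen confidence level and margin of error rather than from convenience.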

Sampling for Early Case Assessment

An important reason for conducting early case assessment (ECA) is to direct the review process to be more efficient and effective. For the litigation team, sampling provides a “bird’s eye view” of what the data contains, helps in prioritizing tasks, assists in identifying search terms with high responsive rates, and aids in isolating relevant and junk data.

For lawyers, ECA is often a delicate balance between precision and recall. Judgmental sampling is generally sufficient to establish a solid baseline to begin with. Sampling helps lead you to “fish where the fish are.”

ECA sampling helps gauge the strength of your case, determine whether you have gathered the most relevant data, and assess the most effective way to cull the information. The cost of ECA can be offset later in the process by predetermining the best search terms and methods. ECA can save further costs if the information gleaned from one or two custodians lets you know that it is not worth pursuing 20 others.
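
The balance between precision and recall mentioned above can be made concrete with a short calculation. Precision is the share of retrieved documents that are actually responsive; recall is the share of all responsive documents that the search retrieved. The sketch below uses hypothetical names and assumes the counts come from reviewing a sample of a search's results.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from review counts.

    true_pos:  retrieved documents that are responsive
    false_pos: retrieved documents that are not responsive
    false_neg: responsive documents the search failed to retrieve
    """
    retrieved = true_pos + false_pos
    responsive = true_pos + false_neg
    # Guard against division by zero when nothing was retrieved or responsive.
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / responsive if responsive else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative counts only.
    p, r = precision_recall(true_pos=80, false_pos=20, false_neg=120)
    print(f"precision={p:.2f} recall={r:.2f}")
```

A narrow search term tends to raise precision at the expense of recall, while a broad one does the reverse, which is why candidate terms are worth testing on a sample before they are applied to the full collection.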

Regardless of the result, ECA puts lawyers in a better position to negotiate terms, determine strategy or even assess whether a case should be litigated or settled. If ECA reveals weaknesses, litigants can cut their losses early in the game. If the case proceeds, the search criteria can be applied to the full universe of data.

Quality Control Using Sampling

Sampling is a helpful tool for quality control in e-discovery. As mentioned earlier, sampling can be used at all stages of the discovery process. Depending on the stage and the intent of sampling, one can use one or all of the types of sampling methodologies described previously.

The main goal of quality control sampling is to minimize risk. It ensures that the correct documents are produced to the opposing counsel and enhances overall confidence in the e-discovery process. This extra step can increase costs, depending on the quantity of data used and the number of times sampling is done. However, any added cost is outweighed by the value of avoiding problems in court.

Sampling Techniques

While e-discovery law increasingly mandates sampling, it does not mandate a specific technique. E-discovery vendors offer a variety of approaches. Among the leading vendors, the distinction is not whether they sample, but the degree to which they sample and the effectiveness of their techniques. Healthy debate can occur over the number and placement of checkpoints, the technology used and the percentage of data tested.

Among common sampling techniques are the following:

  • Clustering is a technique that groups documents that are similar in some way based on certain underlying concepts. This is useful when trying to identify low-hanging fruit within a large group of mostly irrelevant documents or to identify similar responsive or privileged documents within a huge set of unreviewed documents. By sampling a few documents from a cluster, you can make coding decisions and apply them to the entire cluster, thereby saving on attorney review time and costs.
  • Auto-classifiers learn the relevancy of documents. As the lawyer starts to review a small sample of documents, auto-classifiers learn the pattern of the lawyer’s coding calls. Once the system understands the pattern in totality, it applies it to the complete set of data. Auto-classifiers can help make predictions about coding based on actions taken before. In most cases, the reviewers should need to review less than 10 percent of the total document population before the system learns what it needs to know.
  • Predictive scoring is a statistical analysis based upon coding decisions made by counsel during the initial document review and coupled with weighted key concepts and search terms. The higher the weighting of a document, the more likely it is to be relevant. Documents that hit on the higher-weighted search terms are given priority for review, thereby making the review more efficient. This is a simple, easily implemented and cost-effective method for prioritizing review.
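
The weighted-term part of predictive scoring can be sketched as follows. This is a deliberately simplified illustration, not any vendor's method: score_documents, the weights and the sample texts are all hypothetical, terms are matched as single lowercase words, and the influence of counsel's earlier coding decisions is left out.

```python
def score_documents(documents, term_weights):
    """Rank documents by the summed weight of the search terms they contain.

    documents: mapping of document ID to text.
    term_weights: mapping of single-word search term to its weight.
    """
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # Add the weight of every search term that appears in the document.
        score = sum(w for term, w in term_weights.items() if term in words)
        scored.append((doc_id, score))
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    docs = {
        "A": "Quarterly invoice for the merger",
        "B": "Lunch on Friday",
        "C": "Merger talks",
    }
    weights = {"merger": 3.0, "invoice": 1.5}
    for doc_id, score in score_documents(docs, weights):
        print(doc_id, score)
```

Reviewers would then work down the ranked list, so the documents most likely to be relevant are seen first.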

The key to an effective analysis lies in finding a statistically valid sample from which to work. Often we find that collections focus on single custodians, sometimes the ones thought to be most relevant. Later collections broaden the net and return a wider sampling of data both by type and content.

If you are going to develop a sample set, be careful that the initial documents being considered are not too one-dimensional to be helpful. Going back to a paper metaphor, imagine what your impression of a case would be if all you sampled initially were boxes of invoices. You might have a lot of documents in your sample, but you would not have representative documents from which to extrapolate your case analysis.

Should sampling show that you have somehow missed the proverbial “smoking gun,” there is a relatively simple fix. The data is still in a format and structure where it can be searched and processed. New key terms can be applied to the data universe, reviewers can revise the parameters of what is responsive or privileged, or other checkpoints can be added into the workflow to ensure that the discovery process delivers the desired results. Before final production, additional rounds of sampling should be done to confirm that the responsive documents are included, the nonresponsive and privileged documents are removed and that documents with redactions have the correct versions produced.

The Future of Sampling in e-Discovery

The use of sampling in e-discovery is in its early stages, with new techniques appearing almost by the day. With document-by-document review becoming physically and financially unfeasible, sampling will of necessity be a key tool for lawyers and e-discovery professionals.

There is a myth, perpetuated by fear of the unknown, that skilled lawyers could do a review without missing relevant documents or accidentally producing privileged documents. Computers are not less accurate than people. As a matter of fact, research reported in eDiscovery Institute’s Survey on Predictive Coding[6] points to a high degree of fallibility in human review. As e-discovery evolves and the volume of information continues to grow, a necessary outcome will be greater reliance on technology to analyze data.

The courts’ scrutiny of e-discovery methods requires counsel to ensure that their processes are reasonable and sound. Sampling is a persuasive way for counsel to demonstrate effective procedures in information management. When sampling is executed properly, there are no downsides to having those results and understanding more about the data; there are only benefits.

The authors wish to thank the senior consultants at Catalyst Search & Analytics Consulting for their assistance, especially Nirupama Bhatt, James Eidelman and Ron B. Tienzo.

Notes

1. Federal Rules of Civil Procedure, 26(g)(1).
2. Rumsey, Deborah. 2003. Statistics for Dummies, 83.
3. United States v. O’Keefe, 537 F. Supp. 2d 14 (D.D.C. 2008) (Facciola); Victor Stanley Inc. v. Creative Pipe Inc., 250 F.R.D. 251 (D. Md. 2008) (Grimm).
4. Mt. Hawley Ins. Co. v. Felman Prod. Inc., 2010 WL 1990555 (S.D. W. Va. May 18, 2010).
5. Frees Inc. v. McMillian, 2006 WL 2668843 (E.D. Tenn. Sept. 15, 2006).
6. Kershaw, Anne, and Howie, Joe. 2010. “eDiscovery Institute Survey on Predictive Coding,” eDiscovery Institute. www.ediscoveryinstitute.org/pubs/PredictiveCodingSurvey.pdf.


TOM TURNER is president and co-founder of Document Solutions Inc. (DSi), an e-discovery, digital forensics and litigation support services company that provides a wide range of traditional and technology-driven services. He can be reached at tturner@document-solutions.biz.

JOHN TREDENNICK was a trial lawyer and litigation partner for 20 years before founding Catalyst Repository Systems Inc., which provides secure, hosted document repositories. For more than a decade, his company has helped counsel search, review and sample large volumes of discovery documents.