
ICAIL 2011 / DESI IV Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings
June 6, 2011
Thirteenth International Conference on ARTIFICIAL INTELLIGENCE and LAW (ICAIL 2011)
University of Pittsburgh School of Law, Pittsburgh PA

DESI IV Workshop Organizing Committee:
Jason R. Baron, US National Archives and Records Administration, College Park MD
Laura Ellsworth, Jones Day, Pittsburgh PA
Dave Lewis, David D. Lewis Consulting, Chicago IL
Debra Logan, Gartner Research, London, UK
Douglas W. Oard, University of Maryland, College Park MD

TABLE OF CONTENTS

Research Papers
1. Thomas I. Barnett and Svetlana Godjevac, Faster, Better, Cheaper Legal Document Review, Pipe Dream or Reality?
2. Maura R. Grossman and Gordon V. Cormack, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?
3. Richard T. Oehrle, Retrospective and Prospective Statistical Sampling in Legal Discovery

Position Papers
4. Steve Akers, Jennifer Keadle Mason and Peter L. Mansmann, An Intelligent Approach to E-Discovery
5. Susan A. Ardisson, W. Scott Ardisson and Decker, bit-x-bit, LLC
6. Cody Bennett, A Perfect Storm for Pessimism: Converging Technologies, Cost and Standardization
7. Bennett B. Borden, Monica McCarroll and Sam Strickland, Why Document Review is Broken
8. Macyl A. Burke, Planning for Variation and E-Discovery Costs
9. David van Dijk, Hans Henseler and Maarten de Rijke, Semantic Search in EDiscovery
10. Foster Gibbons, Best Practices in Managed Document Review
11. Chris Heckman, Searches Without Borders
12. Logan Herlinger and Jennifer Fiorentino, The Discovery Process Should Account for Iterative Search Strategy
13. Amanda Jones, Adaptable Search Standards for Optimal Search Solutions
14. Chris Knox and Scott Dawson, ISO 9001: A Foundation for E-Discovery
15. Sean M. McNee, Steve Antoch and Eddie O'Brien, A Call for Processing and Search Standards in E-Discovery
16. Eli Nelson, A False Dichotomy of Relevance: The Difficulty of Evaluating the Accuracy of Discovery Review Methods Using Binary Notions of Relevance
17. Christopher H. Paskach and Michael J. Carter, Sampling – The Key to Process Validation
18. Jeremy Pickens, John Tredennick and Bruce Kiefer, Process Evaluation in eDiscovery as Awareness of Alternatives
19. Venkat Rangan, Discovery of Related Terms in a Corpus using Reflective Random Indexing
20. Howard Sklar, Using Built-in Sampling to Overcome Defensibility Concerns with Computer-Expedited Review
21. Doug Stewart, Application of Simple Random Sampling in eDiscovery

Faster, better, cheaper legal document review, pipe dream or reality?
Using statistical sampling, quality control and predictive coding to improve accuracy and efficiency

Thomas I. Barnett and Svetlana Godjevac1
Iron Mountain

Contents: Abstract; Introduction; Background; Data Set and Experiment (Data Set, Training, The Task); Coding Results; Analysis (Global Agreement Analysis, Pair-wise Analysis, Kappa); Other Industry Standards; Discussion; Recommendations; Conclusion; References; Appendix
Abstract

This paper examines coding applied by seven different review groups on the same set of twenty-eight thousand documents. The results indicate that the level of agreement between the reviewer groups is much lower than might be suspected based on the general level of confidence on the part of the legal profession in the accuracy and consistency of document review by humans. Each document from a set of twenty-eight thousand documents was reviewed for responsiveness, privilege and relevance to specific issues by seven independent review teams. Examination of the seven sets of coding tags for responsiveness revealed an inter-reviewer agreement of 43% for either responsive or non-responsive determinations.
The agreement on the responsive determination alone was 9%, and on the non-responsive determination it was 34%, of the total document family count. Pair-wise analysis of the seven groups of reviewers produced higher rates; however, no pairing of the teams indicated an unequivocally superior assessment of the dataset by any of the teams.1 This paper considers the ramifications of the low agreement of human manual review in the legal domain and the need for industry benchmarks and standards. Suggestions are offered for improving the quality of human manual review using statistical quality control (QC) measures and machine-learning tools for pre-assessment and document categorization.

1 Thomas I. Barnett is the leader of the e-Discovery, records and information management consulting division of Iron Mountain, Inc.; Svetlana Godjevac is a senior consultant at Iron Mountain, Inc.

Introduction

In the world of technology-assisted searching, analysis, review and coding of documents in litigation, review by human beings is typically viewed as the gold standard by which the accuracy and reliability of computer designations is measured. Similarly, humans are expected to be able to make judgments with computer-like accuracy and consistency across large sets of data. Expecting computer-like consistency from humans and expecting human-like reasoning from computers is bound to lead to disappointment all the way around. The quality of human review of a small number of documents by an expert reviewer familiar with the facts and issues in the matter is in fact a gold standard. But the typical case involves review of large amounts of data by professional review teams not immersed in the subject matter of the case, and the levels of accuracy and consistency vary greatly. The levels of accuracy demanded of automated approaches to document classification are expected to conform to the subject-matter-expert gold standard, not to the standard of the typical professional review team. Yet the vast majority of data in legal document review is coded by professional review teams, not by subject matter experts. Thus, holding automated approaches to a gold standard that is barely, if ever, reached in human review in actual matters creates an unreasonable and likely unachievable goal. This paper proposes that the comparisons be done on a level playing field and that each approach, human and automated review, be applied to the tasks to which it is best suited.

As more human reviewers are applied to the same set of data, the level of consistency and agreement predictably declines. This paper suggests that statistical sampling and statistical quality control are needed to establish a uniform framework from which to assess and compare human and automated review. The tools used to search, analyze and make determinations about documents in a set of data need to be calibrated and guided by human understanding of the underlying facts and issues in the matter. For now at least, and with acknowledgement of the resounding victory by IBM's Watson on Jeopardy!, computers don't "understand" things in the way human beings do. Computers can execute vast numbers of simple binary calculations at speeds that are difficult to contemplate. Such calculations can be aggregated and structured in complex ways to mimic human analysis and decision making. But in the end, computers do exactly what they are told; they are incapable of independent thought and cannot make decisions outside the scope of their programmatic instructions.
Conversely, human beings do not blindly execute precise, complex instructions at lightning speed in a predictable and measurable way as computers do. Human creativity and independent thought result in variability and unpredictability when attempting to make large numbers of fine distinctions. The independence and creativity that allow a person to make a novel observation or discovery are the flip side of an inability to make fast, mechanically precise, consistent determinations about documents.

This paper proposes considering a set of documents for review in a litigation as a continuum of relevance to a set of criteria rather than as a set of uniform, discrete yes/no determinations. Under that model, the review process can be designed to play to the relative strengths of computer and human analysis. Within any typical set of data, certain documents will be clearly responsive. Others will be clearly non-responsive. The remaining documents can be characterized as having an ambiguous classification. Trying to get computers to accurately assess documents that humans find ambiguous is not effective; it plays to the computer's weakness. Computers should be utilized where they are strongest: making quick, accurate, clear-cut binary determinations. By contrast, for documents that are not clearly responsive or non-responsive, human judgment, creativity and flexibility are best suited to make the judgment calls. Based on this model, this paper asserts that computers should be used to classify non-ambiguous documents while human reviewers should focus their attention on documents whose classification is ambiguous.

This paper examines coding applied by seven different review groups on the same set of twenty-eight thousand documents. The results indicate that the level of agreement between the reviewer groups is much lower than might be suspected based on the general level of confidence on the part of the legal profession in the accuracy and consistency of document review by humans (see Grossman and Cormack, 2011 for a similar position). However, a comparison to other industries, such as medical text coding, suggests that the legal industry is on a par with the results in other industries. This should not be surprising, considering that both are language-based tasks involving interpretation and translation of vast amounts of text into a single numeric code. This paper argues that the identified distribution of disagreements among human reviewers suggests that the nature of the task itself will never allow significant improvement in human review without disproportionate additional cost and time spent reviewing and cross-checking document determinations. A proposed method to achieve higher consistency and accuracy lies in redistributing the task between humans and computers. Computers should be allowed to jump-start the review, as they can easily recognize the high-certainty sets, and humans should focus on the ambiguous, middle-of-the-scale sets, as only human analytical and inferential ability can successfully classify documents of ambiguous classification.

Background

This experiment was originally conducted as a pilot by a company for the purpose of selecting a provider of document review services. The intent was to compare the document coding of five different document review providers against a control set of the same documents coded by outside counsel.
The results of the six-team review (five document review vendors and the outside counsel team) proved inconclusive to the client in determining which provider to select. Subsequently, the client decided to assess the quality and accuracy of the providers' coding of the documents using the assessments of a different outside counsel who had reviewed the same set of documents. This second control group constituted the seventh set of human manual assessments for each document in this set. The additional control group's document coding determinations were ultimately not considered definitive, and the pilot did not result in any clear "winner." The analysis was performed on the final aggregate set of document coding from all review teams and does not assume that the coding of any one group is the ground truth. The client concluded that neither of the two control groups was able to provide coding of sufficient accuracy to be considered a gold standard. From the client's perspective, the experiment failed, as it was not possible to determine a winner among the document review service providers. Nevertheless, for purposes of this analysis, the data provided a unique and valuable source of information for the eDiscovery industry, and it is hoped that the results can be instructive in conducting comparisons of document review groups as well as in creating quality control standards and workflow improvements for legal document review.

Data Set and Experiment

Data Set

The reviewed document population for this experiment consisted of a sample of the electronically stored information (ESI) from six different custodians. The starting set contained 12,272 families comprising 28,209 documents. Of the total 28,209 documents, most were emails and Microsoft Office application files. The basic data composition is represented in Figure 1. The most common family-unit2 size was two. The majority of the corpus, 99%, consisted of families with no more than eight attachments. The family size frequencies are provided in Figure 2.

FIGURE 1 – DATA COMPOSITION OF THE REVIEW SET – Email and MS Office files account for the bulk of the corpus (45% and 38%), with image and other file types making up the remaining 17%.

FIGURE 2 – FREQUENCY DISTRIBUTION OF FAMILY-UNIT SIZE – Most families consisted of two or one members.

Bin          Frequency   Cumulative %
2            5023        40.93%
1            4432        77.05%
3            1375        88.25%
4            542         92.67%
5            318         95.26%
6            235         97.17%
7-10         233         99.07%
11-15        66          99.61%
16-20        25          99.81%
21-30        13          99.92%
31 or more   10          100.00%

TABLE 1 – HISTOGRAM TABLE FOR THE FREQUENCY DISTRIBUTION OF FAMILY-UNIT SIZES

Due to errors in coding, the original set had to be cleaned up for the purpose of analysis. Forty-seven document families were excluded because at least one member had been coded "Technical Issue." Ninety-five families were excluded because one or more members of the family were not coded consistently with the rest of the family. A summary of the data exclusions is presented in Table 2.

                                       Documents   Families
Original                               28,209      12,272
Excluded (technical errors)            205         47
Excluded (inconsistent families)       350         95
Final count (consistent families)      27,654      12,130

TABLE 2 – DATA SETS EXCLUDED FROM THE ORIGINAL SET AND THE FINAL SET COUNTS

2 A "family unit" for purposes of this paper means an email and any associated attachments.

Reviewers

Seven reviewer groups were provided with access to the data for assessment.
The review was conducted by groups of attorneys employed by five different legal document review providers and by groups of litigators at two different law firms. Each group had between six and seventeen attorneys who were provided access to the data.

Training

Each reviewer group received approximately three hours of subject matter training by the first law firm and the client. They were also provided with a review protocol, a coding manual, and an hour of training on the review platform. Each reviewer also received a binder with the review protocol, the official complaint, a list of acronyms and other subject matter materials necessary for document assessment. All but one team used the same hosted review platform, which they accessed in a controlled environment during business hours. One group, group F, performed the review on its own platform, although there is no evidence to suggest that this influenced the document coding decisions.

The Task

The documents were arranged into batches of approximately 100 (keeping family units together). The batches were made up of randomly selected document families from the data set. The task involved reviewing and coding each document in the batch before the next batch could be requested. The coding tags included assessments for responsiveness, privilege, issue, and "hot" (significant) document designations. The assessments were made at the family-unit level rather than by the individual component of a message unit. For example, if any member of the family was considered responsive, the entire family was coded responsive. Similarly, if any member of a responsive family was considered privileged, the entire family was tagged privileged. Each review team performed quality control checks according to its standard practice before providing the coded documents to the client. Reviewers also had an option to tag documents for any technical problems, such as difficulty in viewing or errors in processing. Some of these errors prevented reviewers from making assessments for responsiveness and privilege. Consequently, due to the absence of coding for responsiveness, 205 documents were excluded from the overall agreement comparisons.

For purposes of analysis, responsiveness determinations were the sole focus. Unlike issue coding, these assessments are binary, and all documents must be coded either responsive or non-responsive. Privilege determinations were not included because the privilege rates were very low, less than 1%, and were dependent on the responsiveness assessment (i.e., if a document was coded non-responsive, no determination would be made as to whether or not it was privileged).

Coding Results

The responsiveness rates among the seven review groups range from 23% to 54% of the total families. The difference spans 31 percentage points, with a standard deviation of 0.11. The coding of each review group is presented in Table 3 below.

Tag count per family    A       B       C       D       E       F       G
Non-Responsive          8279    5560    7641    9331    8842    6054    7316
Responsive              3851    6570    4489    2799    3288    6076    4814
Total                   12130   12130   12130   12130   12130   12130   12130
Responsive Rate         31.75%  54.16%  37.01%  23.08%  27.11%  50.09%  39.69%

TABLE 3 – CODING COUNTS FOR EACH REVIEW TEAM
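The family-level coding rule described in "The Task" and the responsive rates in Table 3 reduce to a few lines of code. The sketch below is illustrative only (it is not the review platform's logic); the tuple format and tag labels are assumptions made for the example.

```python
# Minimal sketch: roll document-level tags up to the family level (a family is
# responsive if any member is responsive) and compute a team's responsive rate
# as reported in Table 3. Input format and tag labels are illustrative.

from collections import defaultdict

def family_level_tags(doc_tags):
    """doc_tags: iterable of (family_id, tag) pairs, tag is 'R' or 'NR'."""
    families = defaultdict(lambda: "NR")
    for family_id, tag in doc_tags:
        if tag == "R":
            families[family_id] = "R"   # one responsive member flips the family
    return dict(families)

def responsive_rate(family_tags):
    """family_tags: dict family_id -> 'R' or 'NR' for one review team."""
    responsive = sum(1 for tag in family_tags.values() if tag == "R")
    return responsive / len(family_tags)

# Toy example: two teams coding the same three families.
team_a = family_level_tags([(1, "R"), (1, "NR"), (2, "NR"), (3, "NR")])
team_b = family_level_tags([(1, "R"), (1, "R"), (2, "R"), (3, "NR")])
print(responsive_rate(team_a), responsive_rate(team_b))  # 0.333..., 0.666...
```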
By definition, the global inter-reviewer agreement (the percentage of document coding that all groups agree on) cannot exceed the lowest responsiveness rates found among the seven groups. In other words, the maximum rate of agreement cannot be higher than the sum of the lowest proportion of responsive tags among all the teams and the lowest proportion of non-responsive tags among all the teams (i.e., 23.08% + 45.84% = 68.92%).

Analysis

Two types of analyses were conducted: a global analysis of agreement and a pair-wise agreement analysis. In the global analysis, the level of agreement between all reviewer groups was the focus. Sets of documents on which different teams agreed were identified: a set of documents for which all seven groups agreed (7/7), sets of documents for which six out of the seven agreed (6/7), five out of seven (5/7), and four out of seven (4/7). The remaining combinations are the inverse of these four. The pair-wise analysis was performed in two ways: agreement expressed as a percent overlap between a pair of review teams, and agreement expressed as Cohen's Kappa coefficient.

Global Agreement Analysis

The analyzed document set had 12,130 family units with a total of 27,654 documents. The set of families for which all seven groups agreed on responsiveness (either the responsive or the non-responsive tag) is 5,233 family units, or 43.14% of the data set. Six groups agreed on 2,482 family units, or 20.46% of the data. Five groups agreed on 2,120 family units, or 17.48% of the data, and four groups agreed on 2,295 family units, or 18.92% of the data. The agreement results are shown in Figure 3.

Number of teams that agree   7        6        5        4
Families                     5233     2482     2120     2295
Ratio                        43.14%   20.46%   17.48%   18.92%

FIGURE 3 – REVIEWER AGREEMENT ON RESPONSIVENESS – The chart shows the number of document families against the number of teams that tagged the documents the same way. For example, all seven teams coded 5,233 families the same way.

The data in Figure 3 include agreements on both responsive and non-responsive determinations. Breaking this agreement down into its constituent parts and considering only the responsive tag (the non-responsive tag is a mirror image of the responsive tag) shows that the reviewers agreed more often on the non-responsive than on the responsive tags. The distribution of the responsive tag agreement is provided in Figure 4.

Teams coding responsive   7/7     6/7     5/7     4/7     3/7    2/7    1/7     0/7
Families                  1063    1239    1129    1257    1038   991    1243    4170
Ratio                     8.76%   10.21%  9.31%   10.36%  8.56%  8.17%  10.25%  34.38%

FIGURE 4 – DISTRIBUTION OF THE RESPONSIVE TAG ACROSS REVIEW GROUPS – The chart shows how many document families a given number of teams coded responsive. For example, seven review teams agreed on 1,063 families being responsive; six review teams agreed on 1,239 families being responsive; and so on. No group agreed that 4,170 families were responsive, i.e., all seven groups coded this set as non-responsive.
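Given each team's family-level tags, the distributions in Figures 3 and 4 reduce to a per-family count of responsive votes. The sketch below is an illustrative reconstruction of that tabulation, not the authors' code; the dict-of-dicts input format is an assumption.

```python
# Minimal sketch: count how many teams coded each family responsive and
# tabulate the 7/7 ... 0/7 distribution of Figure 4.

from collections import Counter

def responsive_vote_distribution(team_tags):
    """team_tags: dict team_name -> dict family_id -> 'R' or 'NR'."""
    teams = list(team_tags.values())
    families = teams[0].keys()
    votes = Counter()
    for fam in families:
        n_responsive = sum(1 for team in teams if team[fam] == "R")
        votes[n_responsive] += 1
    return votes  # votes[7] = families all teams coded responsive, etc.

def unanimous_agreement(votes, n_teams, n_families):
    """Figure 3's 7/7 share: families every team coded responsive plus
    families no team coded responsive, as a fraction of all families."""
    return (votes[n_teams] + votes[0]) / n_families
```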
The all-groups agreement of 43.14%, shown in Figure 3, is the sum of the 8.76% document pool for which all seven teams said the document family was responsive and the 34.38% document pool for which all seven teams said the document family was non-responsive. The higher non-responsive agreement can be viewed as a result of the low responsive rate found in most groups' coding. Two out of the seven reviewer groups had responsive rates of less than 30% (groups D and E, see Table 3).3 With fewer documents coded responsive by two groups, the overall agreement for responsive would not be expected to be higher than the lowest responsive rate (Group D). Similar reasoning can be applied to the non-responsive rates.

If the seven groups applied the coding simply by guessing, what level of agreement would be expected? With seven review teams, a pure guessing approach would be the equivalent of seven coin tosses for each family of documents. Each coin toss is a binary decision, heads or tails, similar to document responsiveness tagging. The probability of a chance agreement arising from reviewers guessing rather than applying analysis and reasoning among the seven groups is 1.56%, or 2/128.4 The achieved 43% agreement is thus evidence of a decision and not a mere guess. However, the decision would require more specificity if it were to be executed in perfect accordance at a higher frequency than 43%. The next consideration was pair-wise comparisons5 and the level of correlation among the group pairs.

3 This is evidence that the actual, though unknown, number of responsive families is much lower than the actual, also unknown, number of non-responsive families.
4 The probability of seven out of seven teams agreeing by chance on either the responsive or the non-responsive tag is (1/2)^7 = 1/128. Since aggregate agreement was computed, i.e., including agreement on both responsive and non-responsive, those two probabilities are added to arrive at 2/128.
5 Pair-wise comparison is a common scientific method for calculating a relationship between a pair of results to determine which member of the pair is better or has a greater level of the property under discussion.

Pair-wise Analysis

This analysis presents calculations of percent overlap (or agreement) between any two groups. The results are given in Table 4. Overlap is defined as the sum of all document families where two review teams agreed on responsiveness (responsive and non-responsive tag agreement) divided by the total number of document families they reviewed. The raw agreement values are shown in Table 9 in the Appendix.6

        A        B        C        D        E        F
B       75.06%
C       83.05%   75.01%
D       74.51%   65.53%   72.20%
E       79.91%   71.95%   76.69%   80.32%
F       76.94%   84.90%   75.21%   68.17%   74.26%
G       76.94%   75.23%   74.11%   67.39%   73.08%   77.20%

TABLE 4 – PAIR-WISE AGREEMENTS – The table presents the percent overlap of tagging assessments between each pair of review teams. For example, teams A and B tagged the same way 75% of the time.

6 The overlap presented in Table 4 was calculated from the values provided in Table 9. Teams A and B agreed on 3,698 document families being responsive and 5,407 document families being non-responsive. Their coding therefore overlapped 75.06% ((3698+5407)/12130 = 0.7506).
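The percent overlap defined above and illustrated in footnote 6 is a one-line computation once the two teams' family-level tags are available. The sketch below is illustrative, not the authors' code, and assumes the same dictionary representation of tags used in the earlier sketches.

```python
# Minimal sketch: percent overlap between two teams' family-level tags
# (agreements on either tag divided by the total number of families).

def pairwise_overlap(tags_1, tags_2):
    """tags_*: dict family_id -> 'R' or 'NR' for each team."""
    agreed = sum(1 for fam in tags_1 if tags_1[fam] == tags_2[fam])
    return agreed / len(tags_1)

# Footnote 6's example: teams A and B agreed on 3,698 responsive and 5,407
# non-responsive families out of 12,130, so the overlap is
# (3698 + 5407) / 12130 = 0.7506, i.e., 75.06%.
```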
The highest overlap was achieved by groups A&C (83%) and B&F (85%); the lowest overlap was between groups B&D (66%). The average overlap between group pairs is 75%. The group average aligns very closely with the results of a recent study by Roitblat et al. (2010) that compared the agreement of pairs of manual review teams. Their comparison of manual review indicated that two different human review teams agreed with the original assessment at remarkably similar levels to the ones presented here: their Team A agreed with the original review 75.58% of the time, and their Team B agreed with the original review 72.00% of the time. So, the results presented here replicate and reinforce the results presented in Roitblat et al. (2010). However, an earlier TREC study (Voorhees 2000) reported much lower agreement levels: three different pairs of manual review teams had overlaps of 42.1%, 49.4% and 42.6%. It is not clear how the roughly 30-percentage-point difference in agreement between the more recent studies and Voorhees' might be accounted for.

The average 75% coding overlap between two review teams suggests that, even among professional reviewers, one in every four documents is not agreed upon. This result challenges the common assumption that there is a discernible right or wrong determination for every document and that such a determination will be reached uniformly by different human reviewers.

Kappa

To further examine the level of agreement on responsiveness tagging between reviewer groups, Cohen's Kappa coefficient was computed. The Kappa coefficient is a measure of the level of agreement between two judges sorting any number of items into a defined number of mutually exclusive categories. In our scenario, each review team is a judge, and responsiveness tagging is a sorting into two mutually exclusive categories (responsive and non-responsive). Kappa values can range from 1 (complete agreement, or far more than expected by chance) to -1 (complete disagreement, or far less than expected by chance), with 0 being the neutral case, i.e., what one would expect by pure chance. This coefficient is regarded as a better measure of agreement than percent overlap because it removes the level of chance agreement from its value. Landis and Koch (1977) propose the following interpretation of Kappa scores:

0.01-0.20 – Slight agreement
0.21-0.40 – Fair agreement
0.41-0.60 – Moderate agreement
0.61-0.80 – Substantial agreement
0.81-0.99 – Almost perfect agreement

Kappa values for the seven review groups are presented in Table 5. Using the Landis and Koch interpretation scale, most of the team pairs, 13 of them, show moderate agreement; their Kappa values range from 0.45 to 0.54. Two team pairs show substantial agreement, and six team pairs show fair agreement. The lowest score is 0.3402 (groups B & D), and 0.6979 is the highest (groups B & F). The Kappa values confirm the pair-wise analysis of percent overlap for the groups: B&F exhibit the highest agreement and B&D the lowest on both analyses.

        A        B        C        D        E        F
B       0.5159
C       0.6255   0.5108
D       0.3655   0.3402   0.3536
E       0.5175   0.4597   0.4709   0.4776
F       0.4940   0.6979   0.5044   0.3640   0.4857
G       0.5013   0.5131   0.4528   0.4053   0.4053   0.5441

TABLE 5 – KAPPA COEFFICIENTS FOR EACH PAIR OF REVIEW TEAMS
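For a binary task, Cohen's Kappa can be computed directly from the observed overlap and each team's marginal responsive rate. The sketch below is illustrative (not the authors' code) and assumes the same tag dictionaries as the earlier sketches.

```python
# Minimal sketch: Cohen's Kappa for two teams' binary responsiveness tags,
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).

def cohens_kappa(tags_1, tags_2):
    """tags_*: dict family_id -> 'R' or 'NR' for each team."""
    n = len(tags_1)
    observed = sum(1 for fam in tags_1 if tags_1[fam] == tags_2[fam]) / n
    # Chance agreement from each team's marginal rates of 'R' and 'NR'.
    p1_r = sum(1 for t in tags_1.values() if t == "R") / n
    p2_r = sum(1 for t in tags_2.values() if t == "R") / n
    chance = p1_r * p2_r + (1 - p1_r) * (1 - p2_r)
    return (observed - chance) / (1 - chance)

# Sanity check against the paper's figures: teams A and B have responsive
# rates of 31.75% and 54.16% and an observed overlap of 75.06%, which yields
# a kappa of about 0.516, matching the A-B entry of Table 5.
```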
The range of Kappa scores [0.3402 – 0.6979] is similar to the one found by Wang and Soergel (2010) in their study of inter-rater agreement between two groups of human reviewers. Their experiment involved four law students as the LAW team and four library and information studies students as the LIS team. The goal of their experiment was to test whether a legal background affects the quality of document review. The Kappa mean scores within the LAW team, within the LIS team and across the two teams show remarkably similar ranges: (a) within LAW [0.38 – 0.69], (b) within LIS [0.30 – 0.54] and (c) across LAW and LIS [0.47 – 0.61]. The range of the Kappa coefficient for Wang and Soergel's LAW group closely parallels the range reported here for the seven review teams. The Kappa analysis further confirms that humans reviewing the same documents frequently disagree. As discussed below, this fact suggests that greater focus on quality control is warranted.

Other Industry Standards

In order to put the results presented here into a broader context, a short overview of similar tasks in other domains is presented. There are a variety of applications that require translation of natural language into other systems, whether other natural languages or man-made systems. Document review coding is an example of a man-made system that requires a translation from document text into review codes. Tasks of this nature could theoretically be automated if explicit sets of rules could accurately be defined in advance. For tasks that involve natural language, however, the rules are too numerous to be defined in advance. One solution to this problem is machine learning, which substitutes for a pre-defined set of rules: in the absence of explicit rules, a machine learning program uses input from a training set and "learns" how to apply it in situations that are similar to the ones in the training set. Machine learning is used in search engines, natural language processing, credit card fraud detection, stock market analysis, handwriting recognition, game playing, medicine, and many other areas.

The training phase of machine learning requires high-quality human input, where the high level of accuracy is confirmed through agreement with multiple human experts on the same task. The medical industry has been faced with the challenge of coding millions of records for medical diagnosis, billing and insurance purposes, among others. In the domain of patient records, a medical diagnosis is required to be translated into a billing code. The billing codes are based on the classification provided by the World Health Organization in the International Classification of Diseases (ICD). The process of human coding of medical diagnoses is challenged by the existence of thousands of possible codes, which makes the task both time-consuming and error-prone. To alleviate the burden and improve the consistency of human coding, a number of machine learning systems for classifying text using natural language processing have been designed and implemented in the medical industry. The "training" of these systems, using a set of documents coded by highly trained human ICD coding experts, is critical to the accuracy of all of the ICD automated coding systems.

The application of ICD codes for medical diagnosis is in many ways similar to legal document review. Both involve reading and understanding natural language texts (or listening to audio files) and applying a code as the output of the process. The ICD codes directly parallel issue coding in legal document review in that a number of possibilities per document are open for assignment. The interpretation of natural language (the verbal encoding of someone else's intentions) is at the core of the process in both tasks. Responsiveness and privilege are binary distinctions and thus a simpler form of coding than relevance to a specific issue in a lawsuit, as the number of possibilities is reduced to two.
So, the agreement results achieved in responsiveness tagging are expected to be higher than agreements on issue tagging in legal review or on ICD coding in medical review, due to the smaller number of choices a reviewer or coder is faced with.

The literature on training and automation of ICD code assignment and other systems for classification of medical information, such as SNOMED (Systematized Nomenclature of Medicine), is vast. Kappa is often used as a measure of inter-reviewer agreement, and for comparison of an automated system against human review the most commonly used metric is the harmonic mean of precision and recall, or the F-score.7 This score can only be computed if precision and recall can be computed. Having a gold standard is the key to all machine learning systems as well as to the evaluation metrics. If the "true" answer is unavailable, the system is unable to learn.8 Some examples of results in the medical domain are provided below.

Uzuner et al. (2008) measured inter-annotator agreement on patients' smoking status based on hospital discharge summaries. The annotators were two pulmonologists who provided annotations relying on the explicit text in the summary as well as on their understanding of the same text. The metric shown in Table 6 is the Kappa coefficient. The intuitive judgment values are the most directly comparable to the document review assessments, as they rely on the human ability for interpretation. These scores are similar to the ones reported here for the attorney teams; the overall range is wider, with the highest score in the "almost perfect" category.

7 The F-score is computed as 2*P*R/(P+R), where P is precision and R is recall. Precision is a metric that quantifies how many of the retrieved documents are correct, and recall is a metric that quantifies how many of the correct documents were found (and, conversely, how many were missed). In order to calculate these values, the number of correct documents must be known. The set of correct documents is what is referred to as the "gold standard."
8 Human intelligence, although incomparably more flexible and dynamic than a machine, is also dependent on "system updates," or a feedback loop, for arriving at the truth. Quality control checks of a sample of documents being reviewed often serve to provide feedback to the reviewers on the accuracy of their coding choices so that they can make course corrections going forward. This process is an important calibration tool in manual review.

                            Textual Judgment   Intuitive Judgment
Agreement Observed          0.93               0.73
Specific (Past Smoker)      0.85               0.56
Specific (Current Smoker)   0.72               0.44
Specific (Smoker)           0.40               0.30
Specific (Non-Smoker)       0.95               0.60
Specific (Unknown)          0.98               0.84

TABLE 6 – KAPPA COEFFICIENTS FOR INTER-ANNOTATOR AGREEMENT ON PATIENT SMOKING STATUS – From the Uzuner et al. (2008) study of patient smoking status in medical discharge summaries. The table shows the Kappa scores for assessments based on explicit text and for interpretive judgments based on human understanding.

Table 7, below, shows a pair-wise comparison of inter-reviewer agreement, using the F-measure, for three human annotators applying ICD-9-CM codes to radiology reports in a test set (unseen data). The F-scores on the training set were approximately 2 points higher in each case. This higher measure is as one would expect, since the training set is the set the annotators had seen prior to evaluation.

        A1       A2       A3
A1      –        73.97    65.61
A2      73.97    –        70.89
A3      65.61    70.89    –

TABLE 7 – INTER-ANNOTATOR AGREEMENT (F-MEASURE) ON ICD-9-CM CODING OF RADIOLOGY REPORTS (Farkas and Szarvas, 2008)
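The metrics in footnote 7 are straightforward to compute once a gold standard exists. The sketch below is illustrative; the set-based representation of the production set and the gold standard is an assumption made for the example.

```python
# Minimal sketch: precision, recall and F-score of a review (or classifier)
# against a gold standard, per footnote 7 above.

def precision_recall_f(produced, gold):
    """produced: set of doc ids coded responsive; gold: set of truly responsive ids."""
    true_positives = len(produced & gold)
    precision = true_positives / len(produced) if produced else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Example: if 80 of 100 produced documents are truly responsive, out of 120
# responsive documents overall, precision is 0.80, recall is 0.667 and the
# F-score is about 0.727.
```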
Crammer et al. (2007), in a study of inter-annotator agreement for ICD-9-CM coding of free-text radiology reports, also using three human coders, report an average F-measure of 74.85 (with a standard deviation of 0.06). Resnik et al. (2006) provide measures of inter-annotator agreement on a task involving code assignment for ICD-9-CM and CPT (Current Procedural Terminology) on a random sample of 720 radiology notes from a single week at a large teaching hospital. Their evaluations show averages for all annotators; they used a proportion measure for ICD. Their results are provided in Table 8.

        Intra-coder agreement   Inter-coder agreement
ICD     64%                     47%

TABLE 8 – INTER- AND INTRA-CODER AGREEMENT ON ICD CODE ASSIGNMENTS (Resnik et al. 2006)

In all of the radiology coding tasks presented above, the inter-reviewer agreement is not dramatically different from the agreement found in this study of legal document review. Given the similarity of the tasks, this suggests that manual (human) review of discovery documents should not be expected to improve significantly unless additional means are used to better allocate human review time to the more complex documents that need to be assessed with more attention. One common thread of the medical studies and the studies on legal document review referenced here, whether by humans or machines, is that none of them show results approaching full agreement or high retrieval (as measured by the F-score). Both fields appear to be at the same level of advancement when it comes to coping with the inherent ambiguity of human language.

Discussion

Results and their implication

The global agreement calculations show that the reviewers unanimously agreed on nearly half the documents, or 43%. This set of documents can be termed a high-certainty set. On the other roughly half of the documents, the reviewers had varying degrees of certainty (6/7, 5/7, and 4/7). This distribution of varying degrees of collective uncertainty can be viewed as a consequence of the "translation" reviewers had to make in order to force a simple yes/no determination onto intrinsically subjective, nonlinear data. In other words, the perspective of multiple review groups reviewing the same set of documents, rather than a single review team, provides support for the intuitive understanding that documents have varying degrees of relevance. When reviewers are asked to code documents either responsive or non-responsive, they are essentially being asked to translate a continuum of degrees of responsiveness into a threshold that creates a single artificial boundary for a yes/no determination. Where this boundary lies is subject to interpretation. The subject matter training the reviewers receive at the beginning of a review is supposed to train them to find this boundary uniformly, in the same place, every time. In reality, however, each reviewer (and consequently each group) arrives at a different threshold that defines that boundary. Quality control is needed to moderate the understanding of the boundary placement throughout the review. The level of QC needed to guarantee that this boundary is perfectly calibrated and aligned for all reviewers is not practical in terms of time and cost in the context of legal document review.

Part of the quality control process in the context of document review is evaluation of performance.
The most effective means of evaluating quality of performance is to use a quantifiable system. Often-used steps for the quantifiable evaluation of language-based tasks are:

a) comparison to a gold standard
b) inter-coder agreement (consistency across multiple reviewers)
c) intra-coder agreement (consistency within the same reviewer)

This study focused only on inter-coder agreement. Access to a gold standard was unavailable, and a finer-grained consistency analysis, at either the group level or the individual reviewer level, would require more complex computations, such as creating document sub-groupings based on content similarity and assessing the consistency of coding within each subgroup within a reviewer, within a reviewer team, and across all reviewer teams.

Comparison of the inter-reviewer agreement from these seven groups to the radiology annotators quoted above shows that the legal review groups are on a par with the medical profession. The ICD proportion for inter-coder agreement was 47% (Table 8). This value is directly comparable to the average pair-wise overlap of 75% calculated in Table 4 for legal document review. The comparison of these two values gives legal review the superior grade. The comparison, however, must acknowledge that the ICD coders use thousands of codes rather than just two (i.e., responsive or non-responsive), as the legal document reviewers do, and thus the probability of agreement is reduced by the larger number of possible choices.

If it is assumed that the sets with varying degrees of certainty (the sets where agreement was 6/7, 5/7, and 4/7) and the sets outside of agreement (intersections) in the pair-wise comparisons are the sets that contain errors, the nature of these errors and the cost associated with them needs to be considered.

Error types and their cost

Errors are divided into two types:

- False positives (Type I errors) – documents coded responsive that are actually non-responsive.
- False negatives (Type II errors) – documents coded non-responsive that are actually responsive.

False positives are typically caught by QC and/or additional review passes. This is because the set of responsive documents is usually reviewed further, either for assessment or confirmation of privilege, privilege type, or redaction. Errors of this type, Type I, are usually more costly for the client in the field of legal document review, because they may result in waiver of privilege or in revealing potentially damaging information to the opposing side.

False negatives, and the degree of their presence in the non-responsive set, usually remain undiscovered unless active measures are taken to identify them, such as re-review or inferential statistics through sampling. This type of error is often neglected, as it is less costly from the perspective of the risk of unintentional information exposure. However, if detected by the opposing side, it could lead to sanctions for withholding relevant information.

In this study, an assumption was made that a gold standard for this set was not available. However, if the set of 7/7 agreements for responsive and non-responsive were used as the gold standard, a calculation based on it would be biased in favor of the groups who made conservative judgments on responsiveness. So, this evaluation cannot be used as a measure of the quality of the review groups, although it could be used as a way of measuring the cost of error for the client.
Recommendations

Sharing the work

The distribution of partial agreements, viewed as a continuum of degrees of certainty, is analogous to predictive coding systems whose output is a probability score for each document rather than a binary decision on category membership. If human review manifests a continuum of certainty levels with respect to relevancy judgments anyway, why not share the task of review with predictive coding systems, which automatically output degrees of certainty? Sharing the task does not mean fully delegating it, but rather incorporating predictive coding technologies to aid human document review: computer software segregates the high-certainty sets (the high and the low probabilities for category membership, or the 7/7 and 0/7 agreements in this study), allowing human experts to focus on the middle-range probabilities (the 6/7, 5/7, and 4/7 in this study). The high-certainty sets are the easy calls to make, as they are more clear-cut, and so they should be delegated to the low-cost (computer) labor. The difficult decisions are the ones that require human intelligence for disambiguation as well as strong subject matter expertise. The generated probabilities can also speed up the review of the middle-of-the-scale sets. Resnik et al. (2006) show that a computer-assisted workflow improves human scores by 6% in ICD coding. This improvement in speed may come with a bias, however, and so it should be considered carefully. They note that "Post hoc reviews can overestimate levels of agreement when complex or subjective judgments are involved, since it is more likely that a reviewer will approve of a choice than it is that they would have made exactly the same choice independently." Whether predictive coding scores should be revealed to the reviewers of the middle-of-the-scale sets is a decision that will require determination on a case-by-case basis.

Feedback

Feedback is essential for any learning environment. Legal document review is a business process that starts anew with each case. The task typically begins after no more than a day of training, if that. Due to the high costs of document review by attorneys, the learning phase is becoming shorter and shorter, and the expectation is that even very complex subject matters can be absorbed in short time frames. Unfortunately, that assumption works to the detriment of the depth of expertise reviewers can attain and, consequently, the quality of the review. The actual subject matter experts rarely review documents, and thus the true gold standard is an illusion. To improve the quality of review, continuous, dynamic updates of expert judgments provided to the reviewers are critical. If reviewers receive feedback about the accuracy of their work promptly, fewer errors will ensue, which will minimize the need for recoding after quality control checks are performed.

Statistical QC

Current legal document review practices rely more often on judgmental sampling as a QC procedure than on statistical sampling. Although judgmental sampling has value in the QC process, it also has deficiencies. The key detriment is the inability to draw inferences about the larger set. So, while judgmental sampling may reveal errors, there is no way of estimating whether the types of errors the QC team did not consider are present, or the degree to which they may be present in the population as a whole. For example, because judgmental sampling deals with known risks, the searches target known "keywords" to create samples for QC. The end result is that unanticipated uses of language to describe the high-risk activities at the core of the review will remain undetected. Implementing statistical sampling in the QC process would allow document review to provide quantifiable metrics on the quality of the output, and it would also create a higher chance of finding unanticipated references that may inform new searches and require document recoding.
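As a concrete illustration of the kind of quantifiable metric statistical sampling makes possible, the sketch below estimates a review's error rate from a simple random sample of re-reviewed documents and reports a normal-approximation confidence interval. It is a minimal illustration of the idea rather than a prescribed workflow; the sample size, confidence level and error counts are hypothetical.

```python
# Minimal sketch of statistical QC: estimate a review's error rate from a
# simple random sample of re-reviewed documents, with an approximate 95%
# confidence interval. Illustrative only, not a prescribed protocol.

import math
import random

def draw_sample(doc_ids, sample_size, seed=0):
    """Simple random sample of document ids to send for expert re-review."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), sample_size)

def error_rate_estimate(n_sampled, n_errors, z=1.96):
    """Point estimate and normal-approximation 95% CI for the error rate."""
    p = n_errors / n_sampled
    margin = z * math.sqrt(p * (1 - p) / n_sampled)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical example: 400 randomly sampled families are re-reviewed and 28
# are found to be miscoded -> estimated error rate 7.0%, roughly 4.5%-9.5%.
print(error_rate_estimate(400, 28))
```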
As predictive coding becomes a more widely available offering in the practice of legal document review, it is essential to address the double standard that seems to be applied to this programmatic approach as compared to the standards for human review. Clients uniformly require that predictive coding come with 95%-99% accuracy. This level of accuracy is expected of the machine because of the assumption that human review is in the 100% range of accuracy (for a similar discussion see Grossman and Cormack 2011). There are at least two problems with this reasoning. First, no research was found suggesting that human accuracy ever approaches 100%. Second, this unsupported assumption also seems to be tacitly known to be false. Either way, predictive coding should be welcomed by the legal community and judged by the same, not higher, standards as manual review. In order to provide the grounds for comparison and equivalent standards of quality, manual review should incorporate statistical QC into its workflow, as only with this type of quality check can measures of accuracy, such as precision and recall, be calculated.

Conclusion

Document review for litigation discovery is demanding, time-consuming, expensive and risky. It requires both the ability to perform routine, repetitive tasks in an accurate and timely manner and the ability to apply human judgment and reasoning and to make fine distinctions about complex matters. And faulty decisions can have tremendous legal and financial consequences. Neither humans nor computers are perfectly suited to accomplish all of these diverse tasks. The recommended approach to achieving greater accuracy and efficiency is to allocate tasks between humans and computers in a way that plays to their respective strengths rather than to their respective weaknesses. Computers perform high-speed, repetitive tasks far more efficiently than humans. But computers have no ability to use reason, creativity or judgment beyond the predefined rule sets used to program them. Large sets of documents subject to review in litigation contain a continuum of responsiveness: some documents are clearly responsive, some are clearly non-responsive, and the remainder are somewhere in between. Efficiency and accuracy in legal document review can be improved by allocating computer-assisted sorting and categorization to the high-certainty ends of the continuum, while human reviewers focus their time and attention, using their uniquely human analytical and inferential abilities, on classifying the ambiguous documents.

References

Richard Farkas and Gyorgy Szarvas, "Automatic construction of rule-based ICD-9-CM coding systems", BMC Bioinformatics 2008, 9.
Koby Crammer, Mark Dredze, Kuzman Ganchev and Partha Pratim Talukdar, "Automatic Code Assignment to Medical Text", BioNLP '07: Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, 2007.

Maura R. Grossman and Gordon V. Cormack, "Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review", XVII Rich. J.L. & Tech. 11 (2011), http://jolt.richmond.edu/v17i3/article11.pdf

J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data", Biometrics 1977, 33.

Philip Resnik, Michael Niv, Michael Nossal, Gregory Schnitzer, Jean Stoner, Andrew Kapit and Richard Toren, "Using Intrinsic and Extrinsic Metrics to Evaluate Accuracy and Facilitation in Computer-assisted Coding", Perspectives in Health Information Management, Computer Assisted Coding Conference Proceedings, Fall 2006.

H. L. Roitblat, A. Kershaw and P. Oot, "Document categorization in legal electronic discovery: Computer classification vs. manual review", Journal of the American Society for Information Science and Technology 61 (2010), 70-80.

Özlem Uzuner, Ira Goldstein, Yuan Luo and Isaac Kohane, "Identifying Patient Smoking Status from Medical Discharge Records", Journal of the American Medical Informatics Association 2008 Jan-Feb; 15(1): 14-24.

Ellen M. Voorhees, "Variations in relevance judgments and the measurement of retrieval effectiveness", Information Processing & Management 36, 5 (2000), 697-716.

Jianqiang Wang and Dagobert Soergel, "A User Study of Relevance Judgments for E-Discovery", ASIST 2010, October 22-27, 2010, Pittsburgh, PA.
APPENDIX

TABLE 9 – CONTINGENCY TABLES OF RESPONSIVENESS CODING FOR THE SEVEN REVIEW TEAMS – For each of the 21 pairs of review teams, a 2x2 table of family counts (responsive and non-responsive by each team); the pair-wise overlaps in Table 4 were computed from these counts. For example, teams A and B both coded 3,698 families responsive and 5,407 families non-responsive, while 2,872 families were coded responsive only by B and 153 only by A.

Research paper at DESI IV: The ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings, June 6, 2011, University of Pittsburgh, Pittsburgh, PA, USA. http://www.umiacs.umd.edu/~oard/desi4/

Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?

Maura R. Grossman, J.D., Ph.D., Wachtell, Lipton, Rosen & Katz1
Gordon V. Cormack, Ph.D., University of Waterloo

1 Introduction

In responding to a request for production in civil litigation, the goal is generally to produce, as nearly as practicable, all and only the non-privileged documents that are responsive to the request.2 Recall – the proportion of responsive documents that are produced – and precision – the proportion of produced documents that are responsive – quantify how nearly all of and only such responsive, non-privileged documents are produced [2, pp 67-68]. The traditional approach to measuring recall and precision consists of constructing a gold standard that identifies the set of documents that are responsive to the request. If the gold standard is complete and correct, it is a simple matter to compute recall and precision by comparing the production set to the gold standard. Construction of the gold standard typically relies on human assessment, where a reviewer or team of reviewers examines each document and codes it as responsive or not [2, pp 73-75]. It is well known that any two reviewers will often disagree as to the responsiveness of particular documents; that is, one will code a document as responsive, while the other will code the same document as non-responsive [1, 3, 5, 8, 9, 10]. Does such disagreement indicate that responsiveness is ill-defined, or does it indicate that reviewers are sometimes mistaken in their assessments?
It is well known that any two reviewers will often disagree as to the responsiveness of particular documents; that is, one will code a document as responsive, while the other will code the same document as non-responsive [1, 3, 5, 8, 9, 10]. Does such disagreement indicate that responsiveness is ill-defined, or does it indicate that reviewers are sometimes mistaken in their assessments? If responsiveness is ill-defined, can there be such a thing as an accurate gold standard, or accurate measurements of recall and precision? Answering this question in the negative might call into question the ability to measure, and thus certify, the accuracy of a response to a production request. If, on the other hand, responsiveness is well-defined, might there be ways to measure and thereby correct for reviewer error, yielding a better gold standard, and therefore, more accurate measurements of recall and precision? This study provides a qualitative analysis of the cases of disagreement on responsiveness determinations rendered during the course of constructing the gold standard for the TREC 2009 Legal Track Interactive task ("TREC 2009") [7]. For each disagreement, we examined the document in question, and made our own determination of whether the document was "clearly responsive," "clearly non-responsive," or "arguable," meaning that it could reasonably be construed as either responsive or not, given the production request and operative assessment guidelines.

1 The views expressed herein are solely those of the author and should not be attributed to her firm or its clients.
2 See Fed. R. Civ. P. 26(b) & (g), 34(a), and 37(a)(4).

2 Prediction

Our objective was to test two competing hypotheses:

Hypothesis 1: Assessor disagreement is largely due to ambiguity or inconsistency in applying the criteria for responsiveness to particular documents.
Hypothesis 2: Assessor disagreement is largely due to human error.

Hypothesis 1 and Hypothesis 2 are mutually incompatible; evidence refuting Hypothesis 1 supports Hypothesis 2, and vice versa. To test the validity of the two hypotheses, we constructed an experiment in which, prior to the experiment, the two hypotheses were used to predict the outcome. An observed result consistent with one hypothesis and inconsistent with the other would provide evidence supporting the former and refuting the latter. In particular, Hypothesis 1 predicted that if we examined a document about whose responsiveness assessors disagreed, it would generally be difficult to determine whether or not the document was responsive; that is, it would usually be possible to construct a reasonable argument that the document was either responsive or non-responsive. On the other hand, Hypothesis 2 predicted that it would generally be clear whether or not the document was responsive; it would usually be possible to construct a reasonable argument that the document was responsive, or that the document was non-responsive, but not both. At the outset, we conjectured that the results of our experiment would more likely support Hypothesis 1.

3 TREC Adjudicated Assessments

The TREC 2009 Legal Track Interactive Task used a two-pass adjudicated review process to construct the gold standard [7]. In the first pass, law students or contract attorneys assessed a sample of documents for each of seven production requests – "topics," in TREC parlance – coding each document in the sample as responsive or not. TREC 2009 participants were invited to appeal any of the assessor coding decisions with which they disagreed, and the Topic Authority (or "TA") – a senior lawyer tasked with defining responsiveness – was asked to make a final determination as to whether the appealed document was responsive or not.
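The two-pass process just described determines a gold-standard label for each assessed document; a minimal sketch of that logic, using hypothetical field names, and consistent with the rules spelled out in detail below:

    # Sketch of the two-pass labeling logic just described (field names are hypothetical).
    # An unappealed first-pass decision stands; an appealed decision is replaced by the
    # Topic Authority's ruling, whether that ruling upholds or overturns the first pass.
    def gold_label(first_pass, appealed, ta_ruling=None):
        # first_pass and ta_ruling take the values "responsive" or "non-responsive"
        if not appealed:
            return first_pass
        return ta_ruling

    # A non-responsive first-pass call, appealed and overturned, is responsive in the gold standard.
    assert gold_label("non-responsive", appealed=True, ta_ruling="responsive") == "responsive"
    assert gold_label("responsive", appealed=False) == "responsive"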
The gold standard considered a document to be responsive if the first-pass assessor coded it as responsive and that decision was not appealed, the first-pass assessor coded it as responsive and that decision was upheld by the Topic Authority, or the first-pass assessor coded it as non-responsive and that decision was overturned by the Topic Authority. The gold standard considered a document to be non-responsive if the first-pass assessor coded it as non-responsive and that decision was not appealed, the first-pass assessor coded it as non-responsive and that decision was upheld by the Topic Authority, or the first-pass assessor coded it as responsive and the decision was overturned by the Topic Authority. A gold standard was created for each of the seven topics.3

A total of 49,285 documents – about 7,000 per topic – were assessed for the first-pass review. A total of 2,976 documents (5%) were appealed and therefore adjudicated by the Topic Authority. Of those appeals, 2,652 (89%) were successful; that is, the Topic Authority disagreed with the first-pass assessment 89% of the time. A breakdown of the number of documents appealed per topic, and the outcome of those appeals, appears in Table 1.4

Topic  First-Pass Assessment  Assessed  Appealed  Success  % Success
201    Responsive                  603       374      363        97%
201    Non-responsive            5,605       123      101        82%
202    Responsive                1,743       167      115        68%
202    Non-responsive            5,462       541      469        86%
203    Responsive                  131        74       69        93%
203    Non-responsive            5,296       209      186        88%
204    Responsive                  105        59       50        84%
204    Non-responsive            7,024       207      169        81%
205    Responsive                1,631       889      882        99%
205    Non-responsive            4,289        78       50        64%
206    Responsive                  235        52       50        96%
206    Non-responsive            6,860         0        0          –
207    Responsive                  938        43       23        53%
207    Non-responsive            7,377       154      125        81%
All    Responsive                5,386     1,658    1,552        93%
All    Non-responsive           41,913     1,312    1,100        83%

Table 1: Number of documents assessed, appealed, and the success rates of appeals for the TREC 2009 Legal Track Interactive Task, categorized by topic and first-pass assessment.

3 The gold standard and evaluation tools are available at http://trec.nist.gov/data/legal09.html.
4 The pertinent documents may be identified by comparing files qrels_doc_pre_all.txt and qrels_doc_post_all.txt in http://trec.nist.gov/data/legal/09/evalInt09.zip.

4 Post-Hoc Assessment

We performed a qualitative, post-hoc assessment on a sample of the successfully appealed documents from each category represented in Table 1; that is, the documents where the TREC 2009 first-pass assessor and Topic Authority disagreed. Where 50 or more documents were successfully appealed, we selected a random sample of 50.
Where fewer than 50 documents were successfully appealed, we selected all of the appealed documents. We used the plain-text version of the TREC 2009 Legal Track Interactive Track corpus, downloaded by one of the authors while participating in TREC 2009 [4], and redistributed for use at TREC 2010.5 One of the authors of this study examined every document, in every sample, and coded each as "responsive," "non-responsive," or "arguable," based on the content of the document, the production request, and the written assessment guidelines composed for TREC 2009 by each Topic Authority. We coded a document as "responsive" if we believed there was no reasonable argument that the document fell outside the definition of responsiveness dictated by the production request and guidelines. Similarly, we coded a document as "non-responsive" if we believed there was no reasonable argument that the document should have been identified as responsive to the production request. Finally, we coded the document as "arguable" if we believed that informed, reasonable people might disagree about whether or not the document met the criteria specified by the production request and guidelines.

5 Available at http://plg1.uwaterloo.ca/~gvcormac/treclegal09/.

Table 2 shows the agreement of our post-hoc assessment with the TREC 2009 Topic Authority's assessment on appeal, categorized by topic and by the TA's assessment of responsiveness. Each row shows the TA opinion (which is necessarily the opposite of the first-pass opinion), the percentage of post-hoc assessments for which we believe that the only reasonable coding was that rendered by the TA, the percentage of post-hoc assessments for which we believe that either coding would be reasonable, and the percentage of post-hoc assessments for which we believe that the only reasonable coding contradicts the one that was made by the TA. The final two rows give these proportions over all topics, with 95% binomial confidence intervals.

Topic  TA Opinion        TA Correct      Arguable      TA Incorrect
201    Responsive                 74%           20%              6%
201    Non-responsive             94%            2%              4%
202    Responsive                 96%            2%              2%
202    Non-responsive             96%            0%              4%
203    Responsive                 94%            2%              4%
203    Non-responsive             82%            4%             14%
204    Responsive                 90%           10%              0%
204    Non-responsive             90%            8%              2%
205    Responsive                100%            0%              0%
205    Non-responsive             82%            4%             14%
206    Responsive                   –             –               –
206    Non-responsive             96%            2%              2%
207    Responsive                 74%           12%             14%
207    Non-responsive             70%            0%             28%
All    Responsive        88% (84–91%)    8% (5–11%)      4% (2–7%)
All    Non-responsive    89% (85–92%)    3% (2–6%)      8% (5–12%)

Table 2: Post-hoc assessment of documents whose first-pass responsiveness assessment was overturned by the Topic Authority in the TREC 2009 Legal Track Interactive Task. The columns indicate the topic number, the TA's assessment, the proportion of documents for which the authors believe the TA was clearly correct, the proportion of documents for which the authors believe the correct assessment is arguable, and the proportion of documents for which the authors believe the TA was clearly incorrect.
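The 95% binomial confidence intervals in the "All" rows of Table 2 can be approximated with a short calculation; the pooled counts below are illustrative assumptions (the table reports only percentages), and the normal approximation shown here is one common choice, not necessarily the exact method used:

    import math

    # Normal-approximation 95% interval for a proportion, in the spirit of the binomial
    # confidence intervals in the "All" rows of Table 2. The pooled counts below are
    # illustrative assumptions, since the table reports only percentages.
    def binomial_ci(successes, n, z=1.96):
        p = successes / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        return p, max(0.0, p - half_width), min(1.0, p + half_width)

    p, low, high = binomial_ci(264, 300)   # e.g., 264 of 300 pooled post-hoc calls agreeing with the TA
    print(f"{p:.0%} ({low:.0%}-{high:.0%})")   # roughly 88% (84%-92%)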
5 Topic Authority Reconsideration

One of the authors (Grossman) was the Topic Authority for Topic 204 at TREC 2009. The other author (Cormack) conducted the post-hoc assessment for Topic 204. The post-hoc assessment clearly disagreed with the Topic Authority in only one case, and was "arguable" in nine other cases. The ten documents were presented to the TA for de novo reconsideration, in random order, with no indication as to how they had been previously coded. For this reconsideration effort, the TA used the same three categories as for the post-hoc assessment: "responsive," "non-responsive," or "arguable."6 Table 3 shows the results of the TA's reconsideration of the ten documents.

Doc. Id.            TA Opinion       Post-Hoc Assessment   TA Reconsideration
0.7.47.1151420      Responsive       Arguable              TA Incorrect
0.7.47.1310694      Responsive       Arguable              TA Incorrect
0.7.47.272751       Responsive       TA Incorrect          Arguable
0.7.6.180557        Responsive       Arguable              TA Correct
0.7.6.252211        Responsive       Arguable              TA Incorrect
0.7.47.1082536.1    Non-responsive   Arguable              TA Correct
0.7.47.14687.1      Non-responsive   Arguable              Arguable
0.7.47.758281       Non-responsive   Arguable              TA Correct
0.7.6.707917.2      Non-responsive   Arguable              TA Correct
0.7.6.731168        Non-responsive   Arguable              TA Correct

Table 3: Blind reconsideration of adjudication decisions for Topic 204 by the Topic Authority (Grossman) that were contradicted or deemed arguable by the post-hoc reviewer (Cormack). The columns represent the TREC document identifier for each of the ten documents, the opinion rendered by the TA during the TREC 2009 adjudication process, the opinion rendered by the post-hoc reviewer, and the de novo opinion of the same Topic Authority for the purposes of this study.

6 Note that when the TA adjudicated documents as part of TREC 2009, she was constrained to the categories of "responsive" and "non-responsive"; there was no category for "arguable" documents. Therefore, we cannot consider a post-hoc determination of "arguable" as necessarily contradicting the TA's original adjudication at TREC 2009.

6 Document Exemplars

Table 4 lists the production requests for the seven TREC topics. Based on the production request and his or her legal judgement, each Topic Authority prepared a set of assessment guidelines.7 We illustrate our post-hoc analysis using exemplar documents that were successfully appealed as responsive to topics 204 and 207. We chose these topics because they were the least technical and, therefore, the most accessible to readers lacking subject-matter expertise.

Topic 201: All documents or communications that describe, discuss, refer to, report on, or relate to the Company's engagement in structured commodity transactions known as "prepay transactions."
Topic 202: All documents or communications that describe, discuss, refer to, report on, or relate to the Company's engagement in transactions that the Company characterized as compliant with FAS 140 (or its predecessor FAS 125).
Topic 203: All documents or communications that describe, discuss, refer to, report on, or relate to whether the Company had met, or could, would, or might meet its financial forecasts, models, projections, or plans at any time after January 1, 1999.
Topic 204: All documents or communications that describe, discuss, refer to, report on, or relate to any intentions, plans, efforts, or activities involving the alteration, destruction, retention, lack of retention, deletion, or shredding of documents or other evidence, whether in hard-copy or electronic form.
Topic 205: All documents or communications that describe, discuss, refer to, report on, or relate to energy schedules and bids, including but not limited to, estimates, forecasts, descriptions, characterizations, analyses, evaluations, projections, plans, and reports on the volume(s) or geographic location(s) of energy loads.
Topic 206: All documents or communications that describe, discuss, refer to, report on, or relate to any discussion(s), communication(s), or contact(s) with financial analyst(s), or with the firm(s) that employ them, regarding (i) the Company's financial condition, (ii) analysts' coverage of the Company and/or its financial condition, (iii) analysts' rating of the Company's stock, or (iv) the impact of an analyst's coverage of the Company on the business relationship between the Company and the firm that employs the analyst.
Topic 207: All documents or communications that describe, discuss, refer to, report on, or relate to fantasy football, gambling on football, and related activities, including but not limited to, football teams, football players, football games, football statistics, and football performance.

Table 4: Mock production requests ("Topics") composed for the TREC 2009 Legal Track Interactive Task.

Figures 1 and 2 provide examples of documents that are clearly responsive to Topics 204 and 207, but were coded as non-responsive by the first-pass assessors. The first document concerns shredding, while the second concerns payment of a Fantasy Football8 debt. We assert that the only reasonable assessment for both of these documents is "responsive."

Date: Tuesday, January 22, 2002 11:31:39 GMT
Subject: I'm in.
I'll be shredding 'till 11am so I should have plenty of time to make it.

Figure 1: A clearly responsive document to Topic 204. This document was coded as non-responsive by a contract attorney, although it clearly pertains to document shredding, as specified in the production request.

From: Bass, Eric
Sent: Thursday, January 17, 2002 11:19 AM
To: Lenhart, Matthew
Subject: FFL Dues
You owe $80 for fantasy football. When can you pay?

Figure 2: A clearly responsive document to Topic 207. This document was coded as non-responsive by a contract attorney, although it clearly pertains to fantasy football, as specified in the production request.

Figures 3 and 4, on the other hand, illustrate documents for which the responsiveness to Topics 204 and 207, respectively, is arguable. Reasonable, informed assessors might disagree, or find it difficult to determine, whether or not these documents met the criteria spelled out in the production requests and assessment guidelines.

Subject: Original Guarantees
Just a followup note: We are still unclear as to whether we should continue to send original incoming and outgoing guarantees to Global Contracts (which is what we have been doing for about 4 years, since the Corp. Secretary kicked us out of using their vault on 48 for originals because we had too many documents). I think it would be good practice if Legal and Credit sent the originals to the same place, so we will be able to find them when we want them. So my question to y'all is, do you think we should send them to Global Contracts, to you, or directly the the 48th floor vault (if they let us!).

Figure 3: A document of arguable responsiveness to Topic 204. This message concerns where to store particular documents, not specifically their destruction or retention. Reasonable, informed assessors might disagree as to its responsiveness, based on the TA's conception of relevance.

Subject: RE: How good is Temptation Island 2
They have some cute guy lawyers this year-but I bet you probably watch that manly Monday night Football.

Figure 4: A document of arguable responsiveness to Topic 207. This message mentions football whimsically and in passing, but does not reference a specific football team, player, or game. Reasonable, informed assessors might disagree about whether or not it is responsive according to the TA's conception of relevance.

7 The guidelines, along with the complaint, production requests, and exemplar documents, may be found at http://plg1.cs.uwaterloo.ca/trec-assess/.
8 "Fantasy football is an interactive, virtual competition in which people manage professional football players versus one another." http://en.wikipedia.org/wiki/Fantasy_football_(American).

7 Discussion

Our evidence supports the conclusion that responsiveness – at least as characterized by the production requests and assessment guidelines used at TREC 2009 – is fairly well defined, and that disagreements among assessors are largely attributable to human error.
As a threshold matter, only 5% of the first-pass assessments were appealed. Since participating teams had the opportunity and incentive to appeal the assessments with which they disagreed, we may assume that, for the most part, they agreed with the first-pass assessments of the documents they chose not to appeal. That is, the first-pass assessments were on the order of 95% accurate. Second, we observe that 89% of the appeals were upheld, suggesting that they had, for the most part, a reasonable basis. Our study considers only those appealed documents for which the appeals were upheld – about 89% of the appealed documents, or 4.5% of all assessed documents. Are these documents arguably on the borderline of responsiveness, as one might suspect? At the TREC 2009 Workshop, many participants, including the authors, voiced opinions to this effect. An earlier study by the authors preliminarily examined this question and found that, for two topics,9 the majority of non-responsive assessments that were overturned were the result of human error, rather than questionable responsiveness [6]. The aim of the present study was to further test this hypothesis, by considering the other five topics, and also responsive assessments that were overturned (i.e., adjudicated to be non-responsive). To our surprise, we found that we judged nearly 90% of the overturned documents to be clearly responsive, or clearly non-responsive, in agreement with the Topic Authority. We found another 5% or so of the documents to be clearly responsive or clearly non-responsive, contradicting the Topic Authority. Only 5% did we find to be arguable, indicating a borderline or questionable decision. Accordingly, we conclude that the vast majority of disagreements arise due to simple human error; error that can be identified by careful reconsideration of the documents using the production requests and assessment guidelines.

Our results also suggest that the TA assessments, while quite reliable, are not infallible. We confirmed this directly for Topic 204 by having the same TA reconsider ten documents that she had previously assessed as part of TREC 2009. For three of the ten documents, the TA contradicted her earlier assessment; for two of the ten, the TA coded the documents as arguable. For only half of the documents did the TA unequivocally reprise her previous assessment. While we did not have the TAs for the other topics reconsider their assessments, we are confident from our own analysis of the documents that some of their assessments were incorrect.

All in all, the total proportion of documents that are borderline, or for which the adjudication process yielded the wrong result, appears to be quite low. Five percent of the assessed documents were appealed; 90% of those appeals were upheld; and of those, perhaps 10% were borderline – that is, only about 0.45% of the assessed documents were "arguable." It stands to reason that there may be some borderline documents that our study did not consider. In particular, we did not consider documents that the first-pass assessor and the TREC 2009 participants agreed on, and which were therefore not appealed. We also did not consider documents that were appealed, but for which the TA upheld the first-pass assessment.

9 Topics 204 and 207, which were chosen because they were the least esoteric of the seven topics.
We have little reason to believe that the number of such borderline documents would be large in either case; however, a more extensive study would be necessary to quantify this number. In any event, we are concerned here specifically with the cause of assessor disagreement that was observed, and since there is no assessor disagreement on these particular documents, this quantity has no bearing on the hypotheses we were testing.

We characterize our study as qualitative rather than quantitative for several reasons. The documents we examined were not randomly selected from the document collection; they were selected in several phases, each of which identified a disproportionate number of controversial documents:
1. The stratified sampling approach used by TREC 2009 to identify documents for the first-pass assessment emphasized documents for which the participating teams had submitted contradictory results;
2. The appeals process selected from these documents those for which the teams disagreed with the first-pass assessment;
3. For our post-hoc assessment, we considered only appealed documents for which the Topic Authority disagreed with the first-pass assessor; and
4. For our TA reconsideration, we considered only ten percent of the documents from our post-hoc assessment – those for which the post-hoc assessment disagreed with the decision rendered by the TA at TREC 2009.
All of these phases tended to focus on controversial documents, consistent with our purpose of determining whether disagreement arises due to ambiguity concerning responsiveness, or human error. Therefore, it would be inappropriate to use these results to estimate the error rate of either the first-pass assessor or the Topic Authority on the collection as a whole.

Finally, neither of the authors is at arm's length from the TREC 2009 effort; our characterization of responsiveness reflects our informed analysis and as such, is amenable to debate. Accordingly, we invite others in the research community to examine the documents themselves and to let us know their results. Towards this end, we have made publicly available the text rendering of the documents we reviewed for this study.10

10 See http://plg1.cs.uwaterloo.ca/~gvcormac/maura1/.

8 Conclusion

It has been posited by some that it is impossible to derive accurate measures of recall and precision for the results of any document review process because large numbers of documents in the review set are "arguable," meaning that two informed, reasonable reviewers could disagree on whether the documents are responsive or not. The results of our study support the hypothesis that the vast majority of cases of disagreement are a product of human error rather than documents that fall in some "gray area" of responsiveness. Our results also show that while Topic Authorities – like all human assessors – make coding errors, adjudication of cases of disagreement in coding using a senior attorney can nonetheless yield a reasonable gold standard that may be improved by systematic correction of the estimated TA error rate.

References

[1] Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A., and Yilmaz, E. Relevance assessment: are judges exchangeable and does it matter. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (2008), ACM, pp. 667–674.
[2] Büttcher, S., Clarke, C., and Cormack, G. Information retrieval: Implementing and evaluating search engines. MIT Press, 2010.
[3] Chu, H.
Factors affecting relevance judgment: a report from TREC Legal track. Journal of Documentation 67, 2 (2011), 264–278.
[4] Cormack, G., and Mojdeh, M. Machine learning for information retrieval: TREC 2009 web, relevance feedback and legal tracks. In The Eighteenth Text REtrieval Conference proceedings (TREC 2009), Gaithersburg, MD (2009).
[5] Efthimiadis, E., and Hotchkiss, M. Legal discovery: Does domain expertise matter? Proceedings of the American Society for Information Science and Technology 45, 1 (2008), 1–2.
[6] Grossman, M. R., and Cormack, G. V. Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review. Richmond Journal of Law and Technology XVII, 3 (2011).
[7] Hedin, B., Tomlinson, S., Baron, J. R., and Oard, D. W. Overview of the TREC 2009 Legal Track. In The Eighteenth Text REtrieval Conference (TREC 2009) (2010). To appear.
[8] Roitblat, H. L., Kershaw, A., and Oot, P. Document categorization in legal electronic discovery: Computer classification vs. manual review. Journal of the American Society for Information Science and Technology 61 (2010), 70–80.
[9] Voorhees, E. M. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management 36, 5 (2000), 697–716.
[10] Wang, J., and Soergel, D. A user study of relevance judgments for E-Discovery. Proceedings of the American Society for Information Science and Technology 47, 1 (2010), 1–10.

Retrospective and Prospective Statistical Sampling in Legal Discovery1

Richard T. Oehrle, Cataphora Legal, a division of Cataphora, Inc. (rto@cataphora.com)
DESI IV Workshop

Statistical sampling can play an essential double role in defining document sets in response to legal discovery requests for production. Retrospectively, looking backward at existing results, statistical sampling provides a way to measure quantitatively the quality of a proposed production set. Prospectively, looking forward, statistical sampling (properly interpreted) shows how the quality of a proposed production set can be improved. The proposed improvements depend on transparency of classification: in order to correct clashes between human judgments of sampled data and a proposed hypothesis, one must know the source of the misclassification. And in correcting such misclassifications, one must take care to avoid standard dangers like overfitting. Section 1 below sets the stage by presenting the most basic material on statistical sampling and introducing the concept of data profiles. Section 2 argues that statistical sampling in retrospective mode is the only practical way to assess production quality. Section 3 offers several reasons why statistical sampling assessment has failed to become the standard practice it deserves to be in the field of legal discovery. Section 4 focuses on the use of statistical sampling in prospective, forward-looking mode, to drive iterative improvement. Section 5 describes at a high level some of our practical experience at Cataphora using iterative statistical sampling to force rapid convergence of an evolving responsiveness hypothesis with very high quality standards of review assessment. (In fact, the intrinsic role that statistical sampling plays in this process described in this section—a process that has consistently yielded measurably high quality—is one reason to consider the issues that arise in the preceding sections.) Along the way, we offer a variety of questions for discussion in a series of footnotes.
1 1.1 Background statistical sampling, confidence intervals, confidence levels Statistical sampling starts with a sample drawn from a data set. We cannot be certain that the sample is representative of the data as a whole. But we can estimate the likelihood that it is. This estimate takes the form of two hedges—a confidence interval and a confidence level. The confidence interval pads the particular results derived from the sample with room for error on both sides (say +/- 5%). The confidence level states how probable it is that any sample drawn from the data will fit within this interval. Intuitively, think of any distribution as being roughly like a bell curve, which prototypically has a central axis (the vertical line that goes through the top of the bell), with distribution falling off symmetrically on either side. Then think of a sample as a subset of the region between the horizontal x-axis and the bell curve. If the sample is randomly selected, because of the shape of the bell curve, most of the items in the sample fall within a relatively small interval flanking the central axis symmetrically on either side. This interval is represented by the confidence interval, and when the bell curve is relatively normal—not too flat—most of the points are not far from the central axis. Most is not the same as all, of course. And the confidence level is added to 1 I’d like to thank the anonymous DESI IV referees for their constructive comments. Of course, any errors in this paper are my responsibility, not theirs. 1 deal with the outliers on either side that don’t make it into the interval. (These outliers form the tails on either side of the bell that trail off on either side getting closer and closer to the x-axis the further away they are from the central axis.) If we claim a 95% confidence level, the claim is roughly that at least 95% of the points under the bell curve fall within the window and less then 5% of the points under the bell curve fall within the outlying tail on either side outside the window. This is why we need both a confidence interval and a confidence level. 1.2 data profiles Many discussions of information retrieval and such basic concepts as recall and precision assume a binary distinction between responsive (or relevant) and non-responsive (or non-relevant). As anyone with any practical experience in this area knows, this is quite an idealization. To get a grasp on the range of possibilities that emerges from combining a responsiveness criterion with a dataset, it is useful to introduce the concept of a data profile.2 Suppose we are given a dataset D and some omniscient being or oracle has the quick-wittedness and charity to rank every document on a scale from 0 (the least responsive a document could possibly be) to 10 (the most responsive a document could possibly be), and to provide us with a list of documents ranked so that no document is followed by a document with a higher rank. Here are some illustrative pictures, where documents are represented as points on the x-axis (with document d1 corresponding to a point to the left of the point corresponding to document d2 if document d1 precedes document d2 in the oracle’s list), and the degree of responsiveness of a document represented by points on the y axis from 0 (least responsive) to 10 (most responsive). Fig. 1: all-or-nothing Fig. 2: semi-categorical 2 The intuitions behind this concept have affinities with the AUC Game described by Gordon Cormack & Maura Grossman (2010). 2 Fig. 3:constant decay Fig. 
4: fall-and-decline Technically, the fundamental property of these graphical displays is that the relations they depict are weakly decreasing: if a point (x1 , y1 ) is to the left of a point (x2 , y2 ) (so that x1 < x2 )), then y2 cannot be greater than y1 . The point of introducing them is simple: they make it possible to apprehend and explore a landscape of theoretical possibilities which illuminates the practical questions that practitioners face.3 2 Statistical Sampling is the only practical way to assess production quality Legal Discovery requires a specification (at some level of detail) of what is regarded as Responsive and what is regarded as Non-Responsive with respect to a particular document collection. A production set or potential production set drawn from this collection can be regarded as a hypothesis about which documents satisfy the Responsive specification. When both the underlying dataset and the proposed production set are relatively large,4 there is only one practical way to assess the quality of such an hypothesis: statistical sampling. Statistical sampling relies on a solid and well-understood mathematical foundation. It has been employed extensively across a broad range of subject matters. It is quantitative, amazingly efficient, replicable, and informative. 3 Question: Given a data profile associated with a responsive criterion R associated with a request for production and a dataset D, what portion of the dataset should be produced? 4 A referee has noted the potential and importance of hot-document searches, whose relative rarity may insulate them from the representative sweep of statistical sampling. We will come back to this point briefly in the final section. 3 2.1 human categorization does not in and of itself entail quality results It is sometimes assumed that a quantitative assessment of production quality is unnecessary, on the grounds that the method used to define the candidate production entails its high quality. But assessment is completely independent of this process of definition. If we define a candidate production set by flipping a fair coin, we should still be able to assess the quality of the result. Historically, human manual review of the entire document collection has served as a benchmark of sorts, based on the assumption that human manual review must be correct. But there have always been skeptics who have doubted the efficacy of human manual review. And empirical investigations, which are not easy to arrange in practice, are beginning to show this assumption is in fact incorrect: defining a potential production set by human manual review does not guarantee a high quality result. (See, for example, Roitblat, Kershaw, and Oot (2010).) Recently, there has been another version of this argument applied to automated methods of review. This version takes a form like the following: if a method can be shown to be consistently accurate across a diverse population of document collections, then we can assume that it will be consistently accurate when applied to a new collection that it has never been tested on. This formulation involves some delicate conditions concerning the properties of the document collections that form the provisional testing set. How could one be sure in practice that these same conditions actually hold when we move to a new document collection? The simplest way is to measure the quality of the results by statistical sampling. 
But in this case, it isn’t necessary to rely on the delicate conditions at all: the quantitative assessment will provide the information needed. 2.2 human re-review: expensive, inefficient One conceivable way to test quality is to re-review the entire document collection manually. This approach would be expensive and time-consuming. Furthermore, recent empirical research (such as Roitblat, Kershaw, and Oot (2010)) shows that multiple human reviews of the same document set yield astonishingly large disagreements in judgments.5 In other words, apart from its expense and inefficiency, this approach is unlikely to provide a true assessment of the quality of a proposed production set. Similarly, suppose the candidate production set was defined by a fully automated process. We can’t test the process by re-running the fully automated process. If the process is consistent, then a second run will replicate the results of the earlier run, without providing any information about quality. Again, the remedy is to submit the results to statistical sampling. 2.3 informal QC vs. statistical sampling Statistical sampling is sometimes replaced by informal browsing through a candidate production set. This differs from the statistical approach in a number of ways. For example, the sample set is not always selected appropriately. Moreover, quantitative results are not always tabulated. While the results of this seat-of-the-pants QC can be better than nothing, they do not provide the detailed insights available from statistical sampling. 5 Question: if we consider a data profile associated with the responsive criterion and data set of the Verizon study, what is the corresponding error profile: that is, are clashes in judgment randomly distributed across the x-axis values? are they concentrated at the extremes of responsiveness / nonresponsiveness (represented by the left end and right end, respectively? are they concentrated in the middle? . . . 4 2.4 if human review is fallible in general, why is it effective in sampling? There are two practical reasons to distinguish the general properties of human review in large linear reviews and the general properties of human review in sampling reviews. First, because sampling review is remarkably efficient, it makes sense to employ senior attorneys with knowledge of both the details of the case at hand and the underlying law, rather than junior associates or contract attorneys or paralegals. (Compare the role of the Topic Authority in recent TREC Legal rounds.) In other words, the population is different, in a way that should (in principle) tilt the balance toward improved results. Second, our knowledge of the inconsistencies of multiple human reviews is based on large datasets with thousands of judgments. Since sampling reviews involve a much smaller dataset, clashes between a given reasonable hypothesis concerning responsiveness and actual expert reviewer judgments tend in practice to be even smaller. In fact, they are small enough to be subjected to individual examination, which sometimes confirms the expert reviewer, but at other times confirms the given hypothesis. This kind of detailed examination provides a highly valuable constraint on the quality of information provided by human reviewers, a constraint absent in the large scale multiple reviews whose differences have been studied. Finally, sampling review occurs over a time-span that lessens the risks of fatigue and other vicissitudes. 
2.5 summary In summary, if you want to know how good your proposed production set is, statistical sampling provides a quantitative, replicable, efficient, informative, practical, defensible answer. No other method known to us comes even close.6 3 Why isn’t statistical sampling the de facto standard in legal discovery? Properly conducted statistical sampling answers basic questions about the quality of legal discovery productions (up to approximations represented by confidence interval and confidence level). Why doesn’t statistical sampling play a more central role when issues concerning discovery arise? Why isn’t it regarded as reasonable and customary? 3.1 is no quantitative check needed? One possible answer to this question (already discussed above) is that no quantitative check on quality is needed and if it isn’t needed, it poses an additional and unnecessary burden. The primary justification for this answer is that the particular method chosen (manual, technology assisted, or fully automatic) serves as a guarantee of production quality. But empirical studies of manual review consistently show that manual review does not support this justification. And there is little reason to think that automated forms of review fare better. Moral: skepticism is called for. 6 Question: where in the E-Discovery process should statistical sampling be employed? Example: if one side proposes to use keyword culling to reduce the size of the data and the associated costs of discovery, should the other side be provided with quantitative measures of the impact of this procedure on the responsive and non-responsive populations before and after culling? 5 3.2 what is a practical standard? A related myth is that on the assumption that human review is perfect (100% recall and 100% precision), revealing actual sampling results will introduce quantitative figures that can never meet this perfect standard. It’s true that statistical results always introduce intervals and confidence levels. And while such results can approach 100%, sampling can never guarantee 100% effectiveness. The practical impact of these facts is that some may feel that introducing statistical sampling results can only serve to illuminate defects of production. But if the introduction of statistical sampling results were the accepted practice, whether for manual or automated forms of review and production, it would very quickly become clear what the acceptable numbers for review quality actually are, what numbers require additional work, and what numbers are of high enough quality that further improvements would require increasing amounts of work for decreasing rewards.7 3.3 ignorance may be preferable to the consequences of knowledge There is another possible factor which may have contributed to the failure of statistical sampling to be regarded as a reasonable and customary part of discovery. This factor has nothing to do with disclosing such results to the court or to other parties. Rather, it involves fear that the results will not satisfy one’s own standards. Weighed in the balance, fear and ignorance trump knowledge and its consequences. Suppose you conduct a traditional linear review on a large document set. At the end of the review, you sample appropriately across the dataset as a whole to estimate the recall and precision of your candidate production. What if you were aiming for 90% at a minimum (with a confidence interval of 5% and a confidence level of 95%), but your sampling review shows that the recall is 75%. 
What choices do you face? Do you certify in some way a review that is plainly deficient in its results (even though it may have been conducted flawlessly)? Do you launch the manual review again from scratch, with all the attendant costs in time, effort, and money—and no guarantee in advance that the results of the second round of review will outperform the unsatisfactory results of the first review? One way to avoid this dilemma is to refrain from a quantitative estimate of production quality. The resulting shroud of ignorance obscures the painful choice. This situation is not restricted to cases involving human manual review. For example, a similar dilemma would arise in circumstances in which the initial results depended on a black-box algorithm—that is, an automated approach that offers a hypothesis about how documents are to be sorted in a Responsive set and a Non-Responsive set, but does not reveal the details of how the hypothesis treats individual documents. For example, think of clustering algorithms that can be adjusted to bring back smaller or larger clusters (by strengthening or relaxing the similarity parameters). In the face of unsatisfactory recall results, one might be able to adjust the algorithm to drive recall numbers. Typically, however, this very adjustment adversely affects precision numbers, because additional documents that are in reality non-responsive may be classified as Responsive. 7 Question: who should have access to sampling numbers? Example: does the counsel for the sampling side want to know that the sampling numbers are not perfect—they never are, for reasons discussed above—even they may far exceed contemporary standards? Question: what role do sampling measures play in defending production quality before a judge? Example: can the opposing side reasonably demand statistical measures of recall and precision when the quality of a production to them is in question? 6 3.4 making reality your friend Not every method of defining a production set faces this dilemma. In the next section, we discuss the conditions needed to leverage statistical sampling results to improve review quality. And subsequently, because the necessary conditions are somewhat abstract, we discuss our experience at Cataphora using this iterative review model over the past seven years. 4 Leveraging statistical sampling results prospectively for hypothesis improvement If the results of statistical sampling can be used to improve a hypothesis about a potential production set—that is, improve recall and improve precision—then a system based on successive rounds of sampling and hypothesis refinement can return better and and better results.8 Before considering how quickly this convergence takes place in practice in the next section, we focus first on exactly how sampling can be leveraged for hypothesis improvement. Suppose you review 800 documents selected to test the recall of your current hypothesis. This is a test of completeness, whose goal is to ascertain whether the current hypothesis mischaracterizes Responsive documents as Non-Responsive. Suppose that the resulting review judgments are as follows: judged Resp judged NR hypothesized Responsive 80 40 hypothesized NonResponsive 40 640 This sampling review thus confirms that the current hypothesis is underperforming with respect to recall: 40 documents that were hypothesized to be NonResponsive were judged Responsive. We might think that 40 is a relatively small number: only 5% of the 800 document sample. 
But this represents a third of all the documents judged Responsive in the sample. Suppose that the document set as a whole contains 800,000 items. If the sample is representative, 120,000 of them are responsive and our current hypothesis only identifies 80,000 of them. This is clearly unacceptable. The hypothesis needs to be revised and substantially improved. Let’s assume that all the clashes between review judgments and the current categorization hypothesis are settled in favor of the review. (In practice, it happens that further scrutiny of clashes can lead to a resolution that favors the categorization hypothesis rather than the human review.) Having determined the identity of 40 false negatives, the least we can do is to ensure that these 40 are re-categorized in some way so that they are categorized as Responsive. But this is obviously insufficient: the 40 false negatives are representative of a much larger class. It’s this larger class that we must be concerned with and we want to use information extractable from the 40 false negatives to improve our hypothesis. What we seek is a way of categorizing these 40 false negatives that generalizes appropriately over the data as a whole. Two basic cases arise. In the first case, the 40 false negatives are categorized by the current hypothesis, but the combination of the categorization components involved are incorrectly associated with NonResponsiveness, rather than Responsiveness. By adjusting the way in which such combinations of categorization 8 See the work by Grossman & Cormack, cited earlier, and Büttcher, Clarke, and Cormack (2010), especially section §8.6 Relevance Feedback. What we describe here is a form of iterated supervised feedback involving both relevant and non-relevant information. 7 components determine Responsiveness or NonResponsiveness, the performance of the current categorization hypothesis can be improved in a suitably general way. (This is the very situation that Cataphora’s patented Query Fitting Tool is designed to address, particularly when the number of categorization components is so high that finding a near-optimal combination of them manually is challenging.) In the second case, the 40 false negatives are not categorized at all or are categorized in a way that is overly specific and not suitable for generalization. In this case, we seek to divide the 40 documents into a number of groups whose members are related by categorization properties (topics, subject line information, actors, time, document type, etc.). We next add categorization components sensitive to these properties (and independent of documents known to be NonResponsive) and assign documents satisfying them to the Responsive class. The result of these two cases is a revised categorization hypothesis. It can be tested and tuned informally during the course of development. But to determine how well performance has improved, it is useful to review an additional sample drawn in the same way. If the results confirm that the revised categorization hypothesis is acceptable, the phase of hypothesis improvement (for recall, at least) can be regarded as closed. (When all such phases are closed, a final validation round of sampling is useful to ensure overall quality.) On the other hand, if the results suggest that further improvements are indicated, we repeat the tuning of the categorization hypothesis as just outline and test a subsequent sample. In this way, we get closer and closer to an ideal categorization hypothesis. 
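The arithmetic behind the 800-document recall sample above can be recapped in a few lines; the counts are the hypothetical ones from the example, and the extrapolation to 800,000 documents assumes the sample is representative:

    # Recap of the 800-document sampling arithmetic above; the counts come from the
    # hypothetical sample in the text, and the extrapolation assumes the sample is
    # representative of the full 800,000-document collection.
    hyp_resp_judged_resp = 80      # hypothesized Responsive, judged Responsive
    hyp_resp_judged_nr = 40        # hypothesized Responsive, judged Non-Responsive
    hyp_nr_judged_resp = 40        # false negatives: hypothesized Non-Responsive, judged Responsive
    hyp_nr_judged_nr = 640         # hypothesized Non-Responsive, judged Non-Responsive

    sample_size = hyp_resp_judged_resp + hyp_resp_judged_nr + hyp_nr_judged_resp + hyp_nr_judged_nr  # 800
    judged_responsive = hyp_resp_judged_resp + hyp_nr_judged_resp                                    # 120
    recall = hyp_resp_judged_resp / judged_responsive
    print(f"estimated recall: {recall:.0%}")                                  # about 67%

    collection_size = 800_000
    est_responsive = judged_responsive / sample_size * collection_size        # about 120,000 responsive
    est_identified = hyp_resp_judged_resp / sample_size * collection_size     # about 80,000 identified
    print(int(est_responsive), int(est_identified))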
In practice, we get a lot closer with each round of improvement. (Details in the next section.) 4.1 transparency There is one critical point to note about this iterative process: it depends critically on the transparency of categorization. In order to improve a categorization hypothesis, we need to know how particular documents are categorized on the current hypothesis and we need to know how these documents will be categorized on a revised hypothesis. If we cannot trace the causal chain from categorization hypothesis to the categorization of particular documents, we cannot use information about the review of particular documents to institute revisions to the categorization hypothesis that will improve results not only for the particular documents in question but also for more general sets of documents containing them. 5 Cataphora’s practical experience: empirical observation on the success of iterative sampling We’ve shown above how statistical sampling results can be used to both measure performance and drive hypothesis improvement. If we follow such a strategy, results should improve with each iteration. But how much better? And how many iterations are required to reach high-quality results? In this section, we address these questions from a practical, empirically-oriented perspective. Cataphora Legal has been successfully using statistical sampling to measure performance and to improve it for almost a decade. In what follows, we draw on this experience, in a high-level way (since the quantitative details involve proprietary information). Our goal is not to advertise Cataphora’s methods or results, but to document the effectiveness of statistical sampling review in the development of high-quality hypotheses for responsiveness categorization. 8 Before discussing project details, it is worth pointing out that statistical sampling can be integrated with other forms of review in many ways. As an example, it may be desirable and prudent to review the intersection of a responsive set of documents and a set of potentially privileged documents manually, because of the legal importance of the surrounding issues. As an other example, it may be desirable to isolate a subpopulation of the dataset as a whole to concentrate for manual hot-document searches. In other words, different techniques are often appropriate to different subpopulations. Overall, such mixed methods are perfectly compatible. In a recent project, we developed a hypothesis concerning responsiveness for a document set of over 3 million items. This work took approximately 2 person-weeks, spread out over 3 months (not related to our internal time-table). Attorneys from the external counsel reviewed a randomly selected sample of the dataset. The recall of our hypothesis exceeded 95%. Precision exceeded 70%. Not long after this sampling review occurred, we received additional data, from different custodians, as well as some modifications in the responsiveness specification. After processing the new data, we arranged a sampling review, using the previously developed responsiveness hypothesis. In this second sample, involving significant changes to the population, the performance of the original responsiveness hypothesis declined considerably: recall dropped to about 50% from the previous high on the original data. We spent days, not weeks, revising and expanding the responsiveness hypothesis in ways that generalized the responsive review judgments in the sampling results. 
At the end of this process, the attorneys reviewed a fresh sample. Results: the recall of the revised hypothesis exceeded 93%; precision exceeded 79%. These numbers compare favorably with publicly available estimates of human manual review performance. The dataset involved was large. The overall process was efficient. In the present context, what is most notable is that the convergence on high quality results was extremely quick and the role played in this convergence by statistical sampling was significant. References Büttcher, Stefan, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, Cambridge: The MIT Press, 2010. Gordon V. Cormack and Maura R. Grossman, TREC Legal Track—Learning Task Draft Guidelines, http://plg.uwaterloo.ca/∼gvcormac/legal10/legal10.pdf, (2010). Roitblat, H., A. Kershaw, and P. Oot, ‘Document categorization in legal electronic discovery: computer classification vs. manual review’, Journal of the American Society for Information Science and Technology, 61.1, pp. 70-80, 2010. 9 An Intelligent Approach to E-discovery Steve Akers CTO/Founder Digital Reef Inc. Boxborough, Ma Jennifer Keadle Mason, Esq. Partner Mintzer, Sarowitz, Zeris, Ledva & Meyers, LLP Pittsburgh, Pa Peter L. Mansmann CEO Precise Litigation Inc. Pittsburgh, PA 1 Introduction Legal Discovery and assessment is an expensive proposition for corporations and organizations of all types. Last year (2010) it is estimated that $1 billion – $3 billion was spent on legal discovery processing alone1. This cost is large and growing; finding more intelligent methods to assess Electronically Stored Information (ESI) and understand what is contained within it is a goal of not just corporate personnel but also lawyers and legal service providers (companies providing legal discovery services). This paper outlines a proposed “standard” methodology and defines “ideal” tools and technology methods that combined are suggested as a “standard” for search. It focuses on 1. a standard methodology to identify potentially relevant data, and 2. tools and technology that can aid this process. It discusses important technological aspects of Ediscovery and how products either address or fall short of perfection in certain areas. Using the best process of identification in combination with the proper technologies for specific data types in order to have resulting cost-effective Ediscovery is the focus of this paper. One of the quandaries facing attorneys is how best to approach any particular data set to identify potentially relevant information either for their own use or use in responding to discovery. Increasing variety of data types and sources along with expanding volumes of unstructured data has made the decision of how to search the data more imperative than ever. Analytic search tools have blossomed in this environment and certainly provide some of the best options for searching. However analytic searching has many flavors in and of itself. Understanding the pros and cons to each approach is important in deciding which route to go. In addition attorneys cannot ignore “traditional” search methods as they can be an effective supplement to analytic searching or in some cases may be the best primary method for running a search. The decisions about which route to take is largely driven by the types of data being searched, the relative organization of the data being searched, the particularity of the case facts, and the attorneys familiarity with the case facts and client. 
The application of keywords has long been the standard for searching data sets. Keyword searching in its basic form is identifying any documents that contain particular terms. Ideally the parties discuss the keywords to be run, review a report of the initial search results to discuss any necessary adjustments, apply a privilege filter, review, and produce. These steps may be repeated numerous times to allow the parties to apply new search terms based upon the knowledge gained in reviewing the records. The problems with keyword searching are several and include: The parties must have sufficient knowledge of the case facts and industry/party parlance; straight keyword searching will not find misspellings; natural language usage has the problem of synonymy (multiple words with the same meaning – kitten, cat, feline) and polysemy (same word having different meanings – strike); finding variations of people’s names can be difficult (Dr. Jones, Indiana Jones, Indiana J. ). Because of these difficulties in running straight keyword searches, variants on the searching were developed to work around some of the deficiencies. Attorneys began running keyword searches in conjunction with metadata searches. Star searching allows the user to find root words to account for variations (interp* - would find interpret & interpretation). Fuzzy searching allowed users to find words within a certain percentage similarity of the word being searched. Proximity searching allowed 1 Source: <Marketing to supply reference to report by Gartner or Forrester> 2 users to search for words within a certain distance of other words of each other. These variants on the keyword search alleviated some of the issues discussed above, but still didn’t overcome the obstacles of synonymy and polysemy. This is where analytic searching has come to the forefront. Analytic searching in, its most rudimentary explanation, is a method of finding or grouping documents based upon the content of the documents themselves not solely on a keyword(s) being used. This is commonly employed by internet search engines that allow the user to type in a basic subject inquiry and retrieve search results that aren’t solely driven by the words entered into the search box. The basis for this search technology is the conversion of a document’s contents into numeric values that allows the computer to compare differing document’s values in order to determine similarity of content. By approaching document comparison in this way, specific terms (or even language) of a record becomes irrelevant to the determination of similarity. Alex Thomo an Associate Professor in the Department of Computer Science at the University of Victoria offers the following example to explain the basis for how analytic searching (and in particular a Latent Semantic Analysis) operates in determining documents responsive to a search request: Suppose there is a set of five documents containing the following language: Document 1: “Romeo and Juliet” Document 2: “Juliet: O happy dagger!” Document 3: ”Romeo died by dagger.” Document 4: “Live free or die - New Hampshire’s motto” Document 5: “Did you know, New Hampshire is in New England?” A search is conducted for: dies, dagger A classical IR system (for our purposes keyword searching) would rank d3 to be the top of the list since it contains both dies, dagger. Then, d2 and d4 would follow, each containing a word of the query. However, what about d1 and d5? Should they be returned as possibly interesting results to this query? 
Alex Thomo, an Associate Professor in the Department of Computer Science at the University of Victoria, offers the following example to explain how analytic searching (and in particular Latent Semantic Analysis, or LSA) operates in determining documents responsive to a search request. Suppose there is a set of five documents containing the following language:
Document 1: "Romeo and Juliet"
Document 2: "Juliet: O happy dagger!"
Document 3: "Romeo died by dagger."
Document 4: "Live free or die - New Hampshire's motto"
Document 5: "Did you know, New Hampshire is in New England?"
A search is conducted for: dies, dagger. A classical IR system (for our purposes, keyword searching) would rank d3 at the top of the list since it contains both dies and dagger. Then d2 and d4 would follow, each containing one word of the query. However, what about d1 and d5? Should they be returned as possibly interesting results to this query? A classical IR system will not return them at all. However, as humans we know that d1 is quite related to the query, while d5 is not so much related to it. Thus, we would like d1 but not d5; differently said, we want d1 to be ranked higher than d5. The question is: can the machine deduce this? The answer is yes, LSA does exactly that. In this example, LSA will be able to see that the term dagger is related to d1 because it occurs together with d1's terms Romeo and Juliet in d2 and d3, respectively. Also, the term dies is related to d1 and d5 because it occurs together with d1's term Romeo and d5's term New Hampshire in d3 and d4, respectively. LSA will also weigh the discovered connections properly: d1 is more related to the query than d5, since d1 is "doubly" connected to dagger through Romeo and Juliet and also connected to die through Romeo, whereas d5 has only a single connection to the query through New Hampshire.

Using the above as an example, it is apparent that analytic search engines can have a significant role in searching by obviating some of the problems of straight keyword searching. However, this does not mean that analytic searching alone will always be the most defensible method of searching. In addition, when running analytic searching, it is important to understand the different analytic engines and the limitations of each.

With all that said, in an attempt to identify a standard search process, this paper will first identify the problem, that is, determining what data you have and who has it. Next, the paper will set out the characteristics of a proposed standard search process. These will include identification methods and search and process methodologies to identify potentially relevant data in an effective, efficient and repeatable manner. Finally, the paper will discuss why those search and process methodologies are suggested for the search standardization model proffered. (The reasons for the identification methods proffered have been discussed in many articles, blogs and cases. Therefore, they are not discussed herein.)

Deciding What You Have (where you have it and how much)
The first problem with Ediscovery projects is assessing the magnitude and characteristics of the data in common knowledge repositories (email archives, SharePoint repositories, etc.). IT or Litigation Support professionals know they have a lot of data but are not sure where they have it and what these repositories contain. Not understanding the locations of data in an organization may seem like an odd statement, but, for example, departments put SharePoint servers into production and users copy data to shared drives on networked file systems without knowing what they have copied. In other circumstances, administrators may not have knowledge of what systems are used for what data. These types of problems are ubiquitous and growing. The first step to effective assessment of a potential legal matter is to know what exists in various repositories within an organization. This is often the biggest problem in Ediscovery: identifying how much data exists and where it exists.

Legal Problem: Identification of Potentially Relevant Data
When litigation is filed and/or is reasonably anticipated, parties and counsel are required to identify and preserve data that is potentially relevant to the claims or defenses of the parties. In order to meet this obligation, parties have utilized numerous methodologies with varying levels of success.
Success is often dependent upon the participants' knowledge of the location of data as well as their understanding of the legal requirements and the technology involved. However, there has been no standard method to accomplish the task of searching for and reliably and efficiently locating that data.

Deciding who has it
Another big problem is knowing who owns (or is responsible for) the information that is stored in various repositories. When a custodian (or potential custodian) is identified, it is important to know where their data might reside. Many organizations don't have any idea who owns what data or how often the data is accessed (if ever).

Technological Problem: Lack of Ownership Insight
Historically, data indexing and search solutions have not supported a quick scan of file ownership to show what data requires further, deeper analysis. Historically, data has required full content indexing and analysis to provide insight into what it contains. Often the first level of analysis should be just a "who owns what" look at available data. In this case not all content needs full indexing. An intelligent approach is to perform a first-level analysis with just Meta data indexing and then to identify what content needs full content indexing. These are different "representations" of the data: one in Meta data form and one with all the content in the documents represented within the index. Systems with the ability to "represent" data in various ways let users (reviewers) decide what to analyze deeply. This saves time, storage space and a great deal of money.

Deciding What to Look For
A legal complaint will contain key facts about the case that will get the lawyers started on what they should ask the legal counsel representing an organization to identify and produce. A set of analytics that can assess the main complaint language or other "known key terms" and use these data to help build a set of "similar" documents would be very valuable to legal staff working on a case. Analytic processes that can expose "terms of interest" within a document, to help lawyers involved with the case decide what to look for in other documents, would be of great assistance to legal reviewers. Analytics that identify content "similar" to known example content are also very valuable.

Legal Problem: Identification of Potentially Relevant Claims/Defenses/Data
Upon identification of potential litigation and/or receipt of an action that has been filed, counsel must identify potentially relevant data and preserve it. How to accomplish this goal most efficiently and effectively is the problem faced by counsel and vendors alike. The proposed standard search methodology would begin with a litigation hold, which involves identification of the "relevant topics" for the case. Relevant topics include but are not limited to the claims or defenses. Relevant topics might also include particular areas of interest for the litigation, for example profits/losses, prior claims, prior knowledge, investigations, testing, etc. in products liability cases. Once the relevant topics are known, the next area of inquiry is to identify the key players who might have possession of and/or who have created potentially relevant data. Key player questionnaires should be sent to these individuals.
The questionnaire seeks information from the key player about “basic” data of which they are aware, why they were named, what position they hold, time frames of relevance, what documents they create in that position that might be relevant to the known relevant topics and where they store that data. It also should contain basic questions about the media on which they store information and where it is mapped and/or backed up. After this information is identified, a data map for the litigation should be drafted and key player interviews held. The interviews are usually more productive in person 5 where information sources located in the office, but often forgotten, can be identified. The interviews should be a much more detailed analysis of the way the corporation works, where data is stored, to whom it is copied, the purpose for which it is created, etc. The locations of the known potentially relevant data should also be discussed and, if possible, the path followed to locate specific server, drive, file names, etc. The data map for the litigation should be updated with this information and the client should verify the information contained therein by signature. Once the specific known locations are identified and all relevant topics have been discussed, known relevant documents can be pulled for use in creating better search parameters for further collection of data. In addition, once additional known relevant documents are located through the analytical search processes, the information from those documents can be utilized to search for other potentially relevant documents. Further, the terms/phrases from these new documents can be compared to the search results, i.e. clustering, to more efficiently identify potentially relevant data. In other words, the process should be iterative. In the meantime, an IT key player questionnaire should be sent to the person responsible for IT to determine the data architecture of the entity and backup/legacy information. The identification of mapping should also be sought along with information as to third party entities who maintain data and/or website information. Finally, IT and all key players should be asked to discontinue any document destruction, turn off auto delete and auto archive, and identify backup rotation. Potentially relevant data should be properly preserved depending upon data type and business capability until further decisions are made. Technological Problem: lack of an “analytics toolkit” and Lack of Flexibility and Scale Vendors have historically pushed one approach or solution on customers for Ediscovery. Every solution requires a search capability; when the solutions begin to contain analytics the vendor approach has been to offer a single type of analysis. One type of analysis does not always product the best results with all data sets. Sometimes email is the only source of data pertinent to a matter. One set of tools for email analysis may work fine for such a case. With other data pertinent to the same case, key evidence may exist in MS Word documents and email analysis techniques are not appropriate. This fact of life in Ediscovery has caused legal reviewers to turn to multiple solutions that are stand-alone applications. Moving data into and out of these applications introduces complexity and potential for error (human and otherwise). One platform providing a number of analytic tools that are appropriate at various times throughout the lifecycle of a case would be the most efficient approach to take for legal discovery. 
In addition, historically data indexing and search solutions lack the flexibility and scale to analyze the amount of data that may exist within a typical organization. A platform that could analyze large volumes of data efficiently would be helpful. Deciding who shared what (and with whom) Conversational analytics are very important to an Ediscovery solution. Knowing who spoke with whom about certain topics is often the cornerstone to legal analysis. Technological Problem: lack of capability or full-featured capability for conversations Some solutions use email header analysis, others use Meta data analysis and header analysis, others rely on message content. A solution that can identify content and header similarity is often the best solution. Providing this capability at scale is a challenge in many solutions. 6 Solution: A “Standard” Ediscovery Process The solution to many of these problems with Ediscovery would be contained within an “standard ediscovery system” that connects to many sources of local data (behind the corporate firewall), to help litigation support personnel generate reports about data that may prove relevant to a case matter. This software would also interface with collection tools for desktop or laptop collection and process data at great scale in large data center environments. The ideal discovery system would also perform a number of functions that would allow collection processing and analysis of data regardless of file format. The system would support a system of “describing” data without moving it into a separate repository; reducing the required storage space to use the system and making collection efforts more targeted and specific; let alone more cost effective (take just what you need for legal hold for example). This “system” would be coupled with additional standard processes and best practices to form the “standard” Ediscovery process. Such a system would also provide specific access to certain data items but not others based on user credentials and group membership of users (multi-tenancy; or the ability of multiple groups to use the system but only see specific documents depending on their role in the organization or on the review team). Please see Figure One and Figure Two (below) for an illustration of these concepts and how the system is deployed within an organization. At the present time, it is not believed that any one platform on the market has all of the capabilities mentioned herein and certainly does not account for capabilities not yet developed. Counsel should always keep abreast of technological advances and incorporate the same into any standard process. Depending upon the case and the data set, you may want to consider one or more platforms that best fit your needs. The choice of platform may be driven by which of the following options, beyond standard keyword-Boolean options, are available and/or needed for your data set: 1. Platform capability to allow unprecedented scale of indexing, search and analytics a. OCR conversion to text capabilities to ensure that content is captured even in image files b. Exception processing of certain file types that may need special processing like forensic video or audio analysis c. Processing with “flexible attribute selection” i. Indexing with Regular Expression matching turned “on” or “off” ii. Numerical content turned “on” or “off” 2. Multiple representations of data corpora a. File system- level Meta data only b. Application-level Meta data c. Full content d. 
Analytic attribute structures for semantic analysis e. Analytic Meta data structures for user-supplied attributes “tagging” f. Analytic Meta data structures for machine generated attributes i. Cluster associations for similar documents 7 ii. Near-duplicate associations for similar documents iii. Group views for search associations 3. Analysis Capabilities a. To help identify keyword criteria – figure out which words are contained within the data universe and subsequently determine which are most relevant b. To identify relationships in the content that is in need of scrutiny or discovery (clustering) c. To organize documents and relate keyword searches to content that is in an analytic folder d. To remove duplicate content from responsive document sets e. To identify versions of content within sets of documents (versions of contracts or emails) f. To identify language characteristics of documents (language identification) g. To identify email conversations and conversation “groups” h. Linguistic analysis (categorization in terms of meaning) i. Sampling to pull data from known locations to use for additional searching j. Supervised classification or categorization (using known relevant documents to form search queries to find other potentially relevant documents k. Lexical analysis (entity extraction or analysis) 4. Validation Capabilities (Whether in the platform or extraneous) a. To validate the search (pulling random sample of all documents to validate search methodology b. To validate the review for: i. Privilege ii. Confidential Information (i.e. other products, social security numbers) iii. Tagged/relevant topics (pulling random sample of reviewed data to validate the review process) Definitions of Key Terms Key terms relevant to understanding an ideal Ediscovery system are: Representation of Data In the ideal system, it is important to represent documents so that they can be identified, retrieved, analyzed and produced for attorney review. Documents can be represented within the system by some sort of index or by certain kinds of data structures (covered in detail in a later section of this document). Different types of analysis require different types of indices or data structures. It is ideal to build the appropriate data structures to support the kind of data analysis that is required at a certain stage of the ediscovery process. In an ideal system document representations can be constructed to include certain kinds of information but not other types. This is valuable as it keeps the space required for an index as small as possible and maximizes the speed of indexing or other data representation. 8 Meta data Categories and Use Cases There are three main types of Meta data that are important in electronic discovery. The first two are attributes of file systems and applications and help identify who created, copied or modified documents. This capability helps to identify custody or ownership criteria for documents important to a case. The third type of Meta data is supplied by either human reviewers or analytic software processes. File System or Repository Meta data For example, the file system where documents are found has Meta data about who copied a file to the file system or when a file was created on a specific file repository. 
This category would include SharePoint Meta data, NTFS (Windows) file system Meta data and any kind of Meta data that is relevant to the repository storing a data item (when it was placed into the repository, how large it is, what Access Control Lists (ACLs) apply to control the viewing of the item, etc.). If a litigation support person was looking for files that were created on a file system during a specific time period, they would be interested in file-level Meta data. An ideal discovery solution always indexes the import path of any document it represents along with as many file system attribute fields as possible. Application-level Meta data The application (MS Word for example) that creates a document stores certain Meta data fields inside any documents it creates. This presents an additional type of Meta data that can be indexed and analyzed to identify documents with certain characteristics. Application Meta data contains fields like who the author of a document may be, when they created the file with the application (MS Word in this instance) or when the file was modified (inside the application). The ideal discovery solution would capture as many of these document-specific Meta data fields as possible to determine everything from authorship of the document to when it was last printed (depending on what application created the document). User-supplied or “Analytic” Meta data The last type of Meta data that the system can store for a user is “Analytic” Meta data. This is user or machine supplied Meta data. Even though final document tagging is done by an attorney within the final review stage of a legal discovery operation, other support personnel will mark or tag documents to indicate their status. Legal support personnel may need to mark or “tag” documents with labels identifying certain documents as “important” for some specific reason (the documents may qualify for “expert” review by a professional in a certain field of expertise for example). They may want to tag them so that a supervisor can review their work and decide that they meet certain criteria that qualify them to “move along” in the discovery process. In addition to human review, a software analytic process can be run against a document collection and identify documents that are duplicate copies of one another in a large collection. An automatic process could generate tags (within the Analytic Meta data) indicating that certain documents are duplicates of a “master” document. If the master document was described as document “DOC000002345” then a tag such as “DUP_DOC1000002345” could describe all the documents that are duplicates of the master. These documents could then be identified quickly as redundant and they would not be passed along to attorneys for review. The system could retain the original copy of a duplicate document and mark or 9 remove the others so that attorneys would not have to read duplicates unnecessarily. The ideal discovery solution can run near-duplicate analysis and determine that certain documents meet a threshold of “similarity” to other documents, qualifying them as “versions” of an original document. Tags can then be automatically applied to the documents exhibiting these relationships so that they are identified for in-house counsel who may want to pass them along as data that outside counsel should review. 
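The exact-duplicate portion of this workflow can be sketched with nothing more than a content hash; the following illustrative sketch (in Python, with hypothetical document IDs and tag labels) shows the idea, though near-duplicate and "version" detection requires the similarity techniques discussed later rather than hashing:

    import hashlib

    def content_hash(text):
        # Identical text yields an identical digest, so exact copies collide.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def tag_duplicates(documents):
        # documents: {doc_id: extracted text}. Returns analytic Meta data tags.
        masters, tags = {}, {}
        for doc_id in sorted(documents):
            digest = content_hash(documents[doc_id])
            if digest in masters:
                # Mark the copy so only the master moves on to attorney review.
                tags[doc_id] = "DUP_" + masters[digest]
            else:
                masters[digest] = doc_id
                tags[doc_id] = "MASTER"
        return tags

    docs = {
        "DOC000002345": "Q3 forecast attached.",
        "DOC000002399": "Q3 forecast attached.",
        "DOC000002400": "Revised Q3 forecast attached.",
    }
    print(tag_duplicates(docs))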
Analytic Meta data is the repository where an ideal platform can conveniently place both human and machine-assisted codes or tags that will streamline or aid review of documents in a later part of the process. Given that human review is very expensive machine-assisted “culling” of information can reduce costs dramatically. Many experts in the industry term this process as part of “assisted coding” or “predictive coding” of documents. Analytic Processes For purposes of this paper, “analytic processes” will refer to the following main functions within the ideal discovery solution: 1. Unsupervised Classification – some refer to this as “clustering” where documents are organized together into lists or folders with members exhibiting some level of semantic similarity to one another. The term unsupervised refers to the technique’s ability to perform this semantic matching with no human supervision. 2. Supervised Classification – this refers to a capability where the product can take example content from a user and organize documents into lists using these examples as starting points or “seed” documents. The “best matches” are taken from among the candidate population of documents that are to be classified. The user can assign meaning to the seed clusters as they see fit; assign labels, etc. In the ideal solution a user can pick a number of documents as seeds, and specify an ordinal indicator of similarity that is a number between 0-1 that indicates a “threshold” of similarity that must be met for the candidate document to be placed on a seed list. Another form of the supervised classification is “search by document” where a user can select a single document as a “seed” and have it attract the most likely matches from the candidate list. 3. Near-duplicate analysis – this is very similar to supervised classification except that the system can take one “pivot” (example) document and compute all others within a relative “similarity distance” of it. Instead of organizing the document into a list of other semantically similar documents; candidates are marked as “near-duplicate” neighbors of a pivot should they fall within a range of similarity specified by a user. The documents are marked with “near-duplicate association” markers in the analytic Meta data repository as indicated above. 4. Email conversation analysis – this is where the ideal system identifies the email and instant messaging conversations that occur between parties. The parties and who sees a message is discernible through this type of analysis. 5. Different types of searching – simple keyword search, Boolean search, fuzzy search, proximate search are other types of search that are sometimes referred to as analytics within a product. An emerging technology that is more and more important to legal discovery is conceptual 10 searching, where concepts are computed among the members of documents and presented with the keyword results. Often conceptual searching is referred to in the context of conceptual mining which means a process that identifies concepts in documents that transcend keywords. Conceptual mining is often used to identify “latent” or immediately “unseen” words that are significant among a population of documents. These can often help a human reviewer identify what keywords should be included in a case and also to identify documents that the initial keyword searches did not include. 
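The "search by document" form of supervised classification described in item 2 above can be made concrete with off-the-shelf components; this is a minimal sketch assuming scikit-learn, with made-up seed and candidate text and an arbitrarily chosen threshold, intended only to illustrate the zero-to-one similarity cut-off:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    candidates = [
        "Board discussion of the Q3 revenue forecast and reserves.",
        "Lunch menu for the annual company picnic.",
        "Draft memo on revenue recognition for the Q3 forecast.",
    ]
    seed = "Email thread about the Q3 forecast and revenue reserves."

    # Represent the seed and candidates as tf-idf vectors and score each
    # candidate by its cosine similarity to the seed document.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(candidates + [seed])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

    threshold = 0.2  # user-chosen similarity cut-off between 0 and 1
    for text, score in zip(candidates, scores):
        status = "attract to seed list" if score >= threshold else "leave behind"
        print(f"{score:.2f}  {status}  {text}")

In a real matter the threshold and the seed documents would be chosen and validated by the review team; the point here is only that the candidates closest to the seed are pulled onto its list.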
Virtual Index For legal discovery purposes, a system needs to support building and searching the aforementioned three types of Meta data and must include support for analyzing and searching full document content as well. For analytics of certain kinds documents must be represented by special data structures that allow analysis (duplicate analysis, near-duplicate analysis, similarity comparisons to example content, etc.) to be undertaken. The system has to account for these at great scale. This entire set of capabilities should appear (to a user of the system) to be possible across one “index”. In the ideal system, these capabilities are encapsulated in one entity that will be referred to as: “the virtual index”. It is referred to in this way because it supports various operations on multiple data representations and encapsulates these operations transparently to the user. The user should not know or care about the different repository or representations of the documents within the ideal system. The user should simply issue searches or ask for “similar documents” and get the results. The virtual index will abstract all of these details for a user. Multi-site Support The ideal system should support use cases “behind the corporate firewall” for analyzing and collecting data within local enterprise or client environments, and also support large data center deployments. The indices built within the enterprise environment should be “portable” so that they can be built in the enterprise environment and then be transported to the larger data center environment where all aspects of the case can be evaluated in one “virtual place”. The idea of a virtual index supports this vision, as it allows local data sources to be analyzed at various remote locations and then any relevant files moved to a legal hold location at a central data center. The indices can be added to the central location along with any data that is copied for legal hold purposes. In all cases it is ideal to have a platform that “connects” to data sources, reads in a copy of the documents stored within them, but leaves the original in place at its source location. Instead of moving the original document into the ideal system and duplicating the document and the storage required to maintain or analyze it, the documents can be represented by an index or some data structure that is generally more compact. The original documents do not have to be resident within the ideal system to be analyzed and referenced. Please see Figure Two (below) for an illustration of the ideal system in relation to data sources it represents. It is important that documents do not have to be loaded and analyzed in “batches” and that the ideal system has the scale to represent vast numbers of documents within one single system. A system that 11 supports a set of analytic operations and scalable search is also an important feature of such a discovery platform. Having the ability to analyze new documents by comparing them analytically with examples already represented within the ideal discovery system is extremely important to solid ediscovery practices. Key Architectural Attributes of an All-Inclusive Platform An all-inclusive platform approach presents all of the capabilities shown above to the IT or legal review professional. The user can index data from locations within their data center or from sources as diverse as their SharePoint server “farm” their Exchange email server, NT file servers or large-scale NAS devices. 
The user can pick various levels of data representation based on the level of insight required for the task and the computational and storage burden acceptable to the reviewers. The user can then search for data that is relevant, select those results to "pass on" to other analytic processes (such as de-duplication and near-duplicate identification or email analysis) and then tag or otherwise mark the results. All of these capabilities should be available from a single console without the need for moving the data from one tool to another. Once the data is in the platform it can be identified, analyzed and marked according to the needs of the case. The important thing is that it can be managed with these processes at unprecedented scale. Please see Figure One (below) for an illustration of the ideal platform. The reader can quickly recognize that this is a product with a full suite of analytic and legal production capabilities. It is far beyond a single-function product like a search engine. Please see Figure Two below for an illustration of how this platform could operate in the IT infrastructure among various repositories of data.

Figure One: Ideal Discovery Platform. [Diagram: a software product providing secure multi-tenant (multi-organization) access to scalable electronic data processing, analysis and data management services, with layers for processing/indexing (OCR, file ID, indexing), searching/analytics (search, near-duplicates, clustering), supervised and unsupervised classification with profile (model-based) matching, tagging/view management (user and administrator tags), data conversion (PDF/HTML), data management (migration/legal export, EDRM XML, Concord.dat, DB(3)) and policy enforcement (BPEL).]

The ideal discovery platform will perform all of the functions in the illustration above. The power of having all these capabilities in one platform is undeniable. Being able to process content (OCR, REGEX), index it, "cull" it down to a smaller size and then analyze it (remove duplicate material, perform NIST analysis, identify near-duplicate content, calculate email conversation "threads") all in one platform, without having to move the content from one system to another, eliminates labor and potential human error. Promoting efficiency in electronic discovery is a key component of success in legal review matters.

Figure Two: Intelligent File Analysis and Management. [Diagram: the ideal discovery platform connecting to intranet file/email servers, SharePoint, NAS storage and private or public cloud repositories; a virtual index/repository manages content, retains history, and supports analysis, de-duplication, encryption and movement of files, documents and images under business process execution logic.]

Scale of Indexing and Representation
An ideal discovery solution must have unprecedented scale. Scale is provided through superior use of physical computing resources but also through the segmenting of the various data resources into the virtual index components described previously.

Scale of Indexing, Search and Analytics
With the correct architecture, hundreds of millions to billions of documents can be indexed and managed in a fraction of the time required for other solutions, and with a fraction of the hardware they require. One vendor, utilizing a unique grid-based (multi-server) architecture, has demonstrated the indexing and preparation of a given 17.3 TB data set in less than a twenty-four hour period.
This is possible due to two factors:
1. The platform's unique "Grid" architecture (see Figure Three)
2. The platform's unique "Virtual Indexing" architecture and technology (see Figure Five)
This platform can be deployed as a single-server solution or in the large data center configurations shown in Figure Three below. The ability to expand as the customer needs to index and analyze more data in a given amount of time is made possible by the architecture. Certain software components of the architecture schedule activities on the analytic engine components shown in the diagram. These analytic engines "perform intensive work" (indexing, searching) and the controlling software requests them to perform the work to produce results for users. The controlling software is the "intelligence or brains" of the system and the analytic engines are the "brawn" of the system. As the user needs more processing power, more analytic engines can be employed within the "grid" to provide more processing and analytic power (the reader is again referred to Figure Three).

Scale of Representation
This architecture also supports the representation of content in multiple ways so that the search, classification and other analytic operations available from the analytic engines can "work on" the data that has been processed. This means that the index is really a set of "managed components" which include:
1. Meta data indices
2. Content indices
3. Analytic data structures
4. Analytic Meta data (tags, cluster groups, other associations)
All of these things are what is meant by "scale of representation": the platform can represent content in multiple ways so that the appropriate level of search or analytics can be undertaken on the documents that are within a collection or corpus.

Speed and Scale of Indexing
A second aspect of scale is the speed with which data can be processed and made available for assessment. With a superior architecture an index can be presented for searching within hours. Other solutions require days if not months to build the content for a case into a searchable representation. The ability to build an index and get results in one or two days, and to have it done reliably, allows case matters to be investigated rapidly and with fewer errors. The sooner a reviewer can determine what is relevant within the scope of discovery, the sooner lawyers can begin making intelligent decisions about the case. This leads to better outcomes because the reviewers are not as rushed and because they have better analysis options than they would have with traditional methods. With the data prepared faster, organizations have time to perform search operations and then perform more complex analysis of data that will aid the reviewer later in the case.

Figure Three: Scalable Grid Architecture. [Diagram: a three-tiered architecture with a user access tier (AAA via AD/LDAP), a service control tier (policy engine and services managers providing grid scheduling, fail-over, high availability, load scheduling and job management/monitoring) and an analytics tier of analytics engines backed by high-speed index storage and high-capacity file storage (file identification, archive management (PST), three levels of indexing, analytic environment, search operations, near-duplicate analysis, duplicate detection and threading analysis); each tier scales from 1 to n.]
As one can see from the illustration above, the data processing and search workload can be distributed over various machines in the "grid". The customer simply has to install and provision more "engines" in the grid, and the intelligent management layer of software will use these resources for processing, indexing and search operations. This allows the product to scale to handle unprecedented levels of documents and to process them in unprecedented timeframes. In Figure Eight (below) the search operation is illustrated as being distributed over the available grid processing power.

Virtual Indexing: a Key to Large Corpus Management
In addition to a distributed "grid-like" architecture, another key to managing large data sets is using the proper constructs to represent the data. As mentioned, this platform builds different representations of the data based on the needs of the analysis tasks that will be required for specific discovery activities. It ties them together in a logical set of components that is referred to as a "Virtual Index". This is necessary because the Meta data from files, the user-supplied Meta data from other reviewers, and analytically generated Meta data all must be searched as a single logical entity to make decisions about a given case. A virtual index stores the various pieces of the logical index separately so that the Meta data can be built and searched separately, for efficiency reasons but also for scale purposes. A virtual index can be grown to an unprecedented size because it is "built up" from smaller, more efficient components. Further, it can be transported from one location to another, and then "added in" to a matter as the user sees fit. Earlier in this document, the example was given of a remote-office local collection in which the data is transported, along with the appropriate indices, to a data center; this is possible because of the virtual index. Such an index can also grow arbitrarily large. The virtual index software component can "open" the parts of a virtual index that matter to a case at a certain point in time. This makes searching more efficient and also allows the virtual index to grow or shrink as necessary. Also, the "pieces" of a virtual index can be repaired if they become corrupt for some reason. The ideal system retains "manifests" of the documents that comprise a given portion of the virtual index, and from these the component indices can be rebuilt if necessary.

The user may want to look at just file system Meta data and characteristics of content stored within an enterprise. For that, a straightforward file system Meta data index (basically POSIX-level Meta data) will satisfy the need. This type of index requires only about 4% of the original data size for storage. A full content index (on average) consumes between 25% and 30% of the original data size. The full-content index will require more storage than the Meta data variety of index, and it will take longer to build. If the user needs to understand the application (MS Word, PDF) Meta data, or that and the full content of documents for keyword search, they will be willing to wait for the extra processing (full content indexing) to complete and are likely willing to consume extra storage.
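A Meta data-only first pass of this kind can be approximated with standard file-system calls; the following is a minimal, illustrative sketch assuming Python on a POSIX-style file system and a hypothetical share path (a real connector would also capture repository-specific Meta data such as SharePoint fields or NTFS ACLs):

    import os
    import pwd
    import time

    def metadata_scan(root):
        # First-pass, Meta data-only inventory: owner, size and last-modified
        # time for every file, without reading or indexing any file content.
        inventory = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                inventory.append({
                    "path": path,
                    "owner": pwd.getpwuid(st.st_uid).pw_name,
                    "bytes": st.st_size,
                    "modified": time.strftime("%Y-%m-%d", time.localtime(st.st_mtime)),
                })
        return inventory

    # Hypothetical share; only the custodians and locations flagged here
    # would be promoted to full content indexing.
    for record in metadata_scan("/mnt/shared_drive"):
        print(record)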
If the user is not sure whether all the available content meets the criteria that their search may require, they may want to use the POSIX Meta data indexing technique initially to identify what content should be fully indexed (before committing to extra time and storage resources). One key aspect of the ideal system is that the Meta data index is separate and stands alone from the content index that supports it. The system presents one index for a given corpus of documents, but beneath this construct is at least a Meta data index. If a corpus is represented as a full content index, the corpus has a Meta data and a full content index component. The two indices are logically connected but physically separate; the virtual index software layer "binds" them together. Please see Figure Five for an illustration of a virtual index and its components. This virtual index approach makes the index more scalable, allows it to be searched more rapidly and makes it resilient against potential corruption. In addition to the full content inverted index construct, the corpus of documents can be further represented by analytic feature descriptors (where each "word" or "token" is represented as a feature of the document). These feature descriptors for single documents can be combined into "models" where complex relationships between the words or tokens are stored. These analytic descriptors are separate data structures that support document similarity operations, clustering and near-duplicate analysis. They do not depend upon the inverted index that is used for keyword searching; they are separate data structures and are used independently of the index.

Figure Four: Analytic Meta Data. [Diagram: analytic Meta data supporting user-specific tagging, classification and management; through the application access layer a user can kick off combinations of search operations, classification and tag labeling, and the results of tag actions and analytic (classification) operations are retained with the data descriptions, as tag groups and class groups, within the virtual index.]

Figure Five: Virtual Index Illustration. [Diagram: job requests are handled by an Analytic Operations Agent (AOA); the virtual index software layer contains an index/query abstraction layer over the Meta data and content inverted indices and an analytic representation abstraction layer over the analytic Meta data (user tags, class views such as clusters) and the analytic repository (analytic model and document structures, statistical frequency structures and models).]
Figure Six: Virtual versus Monolithic Indices. [Diagram: current products rely on a monolithic index that is effective to only around 7 TB of content and whose searchable view is capped and cannot grow beyond that point, whereas the ideal product builds an incrementally sized virtual index whose searchable view grows as needed, with pieces added as required to extend the view over a set of document collections.]

Figure Seven: Virtual Indexing in Action. [Diagram: a query arrives as a job request at the Analytic Operations Agent (AOA); the virtual index software layer dispatches it through the index/query abstraction layer over the Meta data and content inverted indices and through the analytic representation abstraction layer over the analytic Meta data (user tags, class views such as clusters) and the analytic repository, and the combined results are returned through the view handle.]

Figure Eight: Virtual Indexing plus Grid Architecture. [Diagram: within a secure AD/LDAP environment, a robust framework of "intelligent connectors" reaches data sources such as NTFS/CIFS file systems, MS Exchange or Lotus Notes, and MS SharePoint; the service tier/policy engine and service manager distribute queries across multiple analytics engines (each with an AOA and a relevance adjustment layer), the index is built incrementally, and multiple virtual indices are searched as one, with results returned to the user.]

Summary of Architectural Concepts
Now that we understand how the unique architecture of the ideal system solves several issues around discovery, we can talk about some important types of processes, which we will refer to as "analytics", and their importance to the overall process. The prior sections of this document explained how:
1. The grid architecture allows very large collections to be represented as indices in record time. Before this architecture became available, the Ediscovery process could not analyze the extremely large collections of documents that have become common in legal matters. These collections were either not analyzed, or they were analyzed in pieces, leading to human error and inconsistency of results.
2. The virtual index constructs let the user select various levels of index representation:
a. File Meta data only
b. Application Meta data
c. Full content
d. Analytic descriptions (document feature attributes)
e. Models and profiles (example content in feature-attribute form)
3. The virtual index also lets the document collections that are represented grow to unprecedented size and still remain usable and efficient:
a. The virtual index lets the user add to the collection at any time, as the virtual index is really a multi-index construct within the product
4. Monolithic indices can be problematic:
a. Monolithic indices can grow in size and become very inefficient to search and manage
b. Monolithic indices can become corrupt at a certain size and become unusable
c. Monolithic indices can take long periods of time to construct in the first place
5. Virtual indices supply several key advantages:
a. In a virtual index, Meta data-only searches work on smaller absolute indices and complete more rapidly
b. In a virtual index, Meta data and full content searches execute in parallel, increasing efficiency and scale while providing results more rapidly than classical monolithic indices
c. Virtual indices can support similarity operations like "search by document" that expose relevant documents that are meaningful to a human reviewer
d. Virtual indices can be repaired efficiently without requiring entire document collections to be re-processed and re-represented

Analytics
Analytic processes in Ediscovery present distinct advantages to human reviewers. In this section, analytic processes are described that can aid the legal review process, along with discussion of why they can be of help during that process. In addition, aspects of these approaches are presented and compared, and the advantages and disadvantages of each are explored. This is to give the reader a sense of how competing products in the discovery space compare, to help the reader value each, and to determine when one technique needs to be applied with others to be effective in a legal review process.

Major Categories of Analytic Processing
It is easy to become confused by all of the techniques that are available to aid human reviewers of electronic documents. These techniques fall into three main categories:
1. Unsupervised classification or categorization (often referred to as "clustering")
2. Supervised classification or categorization
3. Specific types of analysis
a. Near-duplicate analysis
b. Duplicate identification or analysis
c. Conversational analysis (who spoke to whom)

Unsupervised Classification
This is often referred to as "clustering" because one wants to form document groups that "belong together" because they "mean the same things". The main idea behind this kind of analysis is that the human reviewer does not have to know anything about the data in advance. The reviewer can just "push the button", find out what belongs where in a dataset, and see folders or ordered lists of documents that are related somehow. See Figure Twenty Two (below) for a screenshot of a product that identifies documents according to their similarity to one another. This particular system uses a series of algorithms to perform its work, but the end result is folders of documents that are related to one another. Close inspection of the diagram will show that the foreign language documents end up in the same containers or folders. The predominantly foreign language documents get grouped together in a folder that is labeled with foreign language "concepts" to make the review process more efficient. Other advantages of this technique will be explained in later sections of this document. This is an example of a multi-level algorithm that accounts for language differences. Some unsupervised classification algorithms do not account for the language differences of documents, and they can produce results that appear "confusing" in some circumstances. This phenomenon will be discussed later in the document. Most of these unsupervised techniques culminate in some kind of "conceptual" clustering or conceptual mining and analysis of the data they represent. As each specific technique is described in later sections of this document, the reader will be informed about how the technique relates to conceptual analysis of the documents being analyzed.
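As a purely illustrative sketch of the clustering idea (assuming scikit-learn, with made-up snippets and a simple tf-idf representation rather than whatever multi-level algorithms a given product actually uses):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    documents = [
        "Quarterly revenue forecast and budget review.",
        "Budget review meeting moved to Friday.",
        "Cat and kitten adoption day at the shelter.",
        "Cat vaccination clinic this weekend at the shelter.",
    ]

    # Represent each document as a tf-idf vector, then group documents whose
    # vectors are close together; no labels or example documents are supplied.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for label, text in sorted(zip(labels, documents)):
        print(label, text)

Note that this lexical sketch groups documents only by shared vocabulary; it does not perform the conceptual or language-aware grouping described above.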
Supervised Classification
Supervised classification means that a user "supervises" the process by providing at least some examples of documents that are similar to what they want the algorithm to find for them in a larger population of documents. These documents are usually put into what is called a "model" and they "attract" other documents that belong "closely" to them. Please see Figure Twenty Four for an illustration of supervised classification. Examples of supervised approaches:
1. Seed-model "nearest neighbor example" type clustering.
2. Support Vector Machines (see reference [6]): the user must supply "known positive" and "known negative" examples of documents that the system can use to "compute the differences" between, for purposes of classifying new documents.
3. Bayesian Classifiers (see section below and reference [3]): the user must supply "good" and "bad" examples so that the algorithm can compute a "prior" distribution that allows it to mark documents one way or the other.
4. Statistical Concept Identifiers that arrange documents based on the characteristics of words and topics in a set of "training data" (documents that have been selected from a larger population of documents but that have not been reviewed by a user).
5. Linguistic Part of Speech (POS) models, where certain patterns of a specific language are noted within a linear classification model and new documents are "matched" against it based on their linguistic characteristics.

Specialized Analysis
There are specialized analytic techniques such as:
1. Near-duplicate detection (finding things that should be considered versions of other documents)
2. Email conversational analysis (threads of conversations between specific parties)

Mathematical Framework for Data Analysis
In the preceding discussion of the ideal architecture, the concept of representing data as an index for searching or as a mathematical model for analysis was presented. This section contains a description of how data is represented mathematically. There are many ways to represent data for mathematical analysis; the technique described below is one way. This is not intended to be an exhaustive review of all available text representation techniques; it is offered to help the reader visualize methods, other than an inverted index, that can be used as a basis for document analysis.

Basic Overview of Document Analysis Techniques
The basic approach to comparing documents to one another relies on representing them as "units of information" with associated "information unit counts". This is intended to give the reader context to understand some of the terminology that follows. The goal of this is to support a mathematical process that can analyze document contents: the "vector space model" [7]. The vector space model was developed by a team at Cornell University in the 1960s [8] and implemented as a system for information retrieval (an early search engine). The "pros" and "cons" of the vector space model are discussed in the references, but since it is a good way to understand how to think about documents in an abstract and mathematical way, it is explained first. When we refer to this in general, we mean the document-term representation model in which documents can be thought of as vectors. The term Vector Space Model would imply that in all cases the vectors are compared with cosine angular measurements after their "term frequency-inverse document frequency" (tf-idf) attributes are computed.
In the context below I discuss how that is possible, but I do not explain "tf-idf" in detail. The reader can consult [7] and [8] for information on computing similarity with tf-idf techniques. Furthermore, I am offering the model as an example of how documents can be represented for comparison. Techniques other than tf-idf that are used for cosine similarity comparisons also use the vector concept, so I want to make sure the reader understands the context in which this discussion is offered. The Vector Space Model (VSM) is often referred to not just as a data representation technique but as a method of analysis. Some of the techniques mentioned in the sections that follow utilize this vector-type model in some way (for representation, though their mathematical approaches are different). Not all of the techniques discussed use the vector space model, but it is presented to give the reader a grasp of how a document can be analyzed mathematically. Some form of vector is used in many cases to describe the document content. Some of the techniques just need some data structure that represents the words in a document and how often they occur. This is often constructed as a vector even if the vector space calculations are not used to analyze the data the vector represents.

Representing Text Documents for Mathematical Analysis – Vector Space Model
The "Vector Space Model" is a very well known structure within the field of information theory and analysis. It allows documents and their words or "tokens" to be represented along with their frequencies of occurrence. Documents are represented by a "document identifier" that the system uses to refer to them during analytic operations or so that they can be retrieved for a user. The overall combination of the document identifier and the token frequency information is referred to as a "document descriptor" because it represents the information within the document and provides a "handle" to use to grab the document when necessary.

Figure Nine: Vector Document-Term Frequency Structures. [Diagram: a document descriptor pairs a document ID with the frequency of each term t1 ... tn; stacking the descriptors for documents 1 ... m gives a document-term matrix. Notation: tfn is the frequency of term tn, N is the total number of documents, and dfn is the number of documents containing tn.]

Figure Ten: Documents in a Matrix/Vector Space. [Diagram: eleven documents D1 ... D11 and a query Q are represented over three terms t1, t2, t3 as a small document-term matrix of term frequencies; a retrieval status value RSV = Q · Di is computed for each document, and the documents are plotted as points in the three-term space.]

Notice that when documents are represented as document-term structures, the documents are like "rows" in a matrix. The "columns" of the matrix are the terms, and the column entries can be the frequencies with which the given terms occur in the documents. The term positions can be fixed (per term) and labeled with an integer, with the actual string of the word/token kept in a separate dictionary, or some other means can be used to "keep track" of what the terms mean. The representation of a document is then an entire row of the matrix, with the columns giving the frequency of occurrence of each term within that document.

Comparing Documents to One Another or Queries
With the vector-space model of vector comparison, each document is treated as a "vector" in a dimensional space. The number of dimensions equals the number of distinct terms in the collection.
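For reference, and deferring to [7] and [8] for the derivations, one common formulation of the tf-idf weight and the cosine comparison mentioned above, written in the notation of Figure Nine, is:

w_{t,d} = tf_{t,d} \cdot \log\frac{N}{df_t}, \qquad \cos\theta(d_i, d_j) = \frac{\sum_t w_{t,d_i}\, w_{t,d_j}}{\sqrt{\sum_t w_{t,d_i}^2}\,\sqrt{\sum_t w_{t,d_j}^2}}

Here w_{t,d} is the weight of term t in document d; a query is weighted in the same way and compared against each document vector.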
If a document has a given term, the value in that document's row of the matrix is equal to the document's frequency for the given term. If the document does not have that term, the corresponding column in the document's row is zero. To compare two documents, a "similarity calculation" is undertaken and a "score" is computed between the documents. The score represents the cosine of the angle between the two documents, or their "distance apart" in the n-dimensional vector space. This can be visualized in two dimensions below. A query can be represented as a document, so a query entered by a user can be compared to documents and the closest ones can be retrieved as results.

Figure Eleven: Cosine Similarity – Computing Similarity Scores. [Diagram: a two-dimensional plot of the vectors D1 = (0.8, 0.3), D2 = (0.2, 0.7) and the query Q = (0.4, 0.8); the cosine of the angle between Q and D1 is approximately 0.74, and between Q and D2 approximately 0.98, so D2 is the closer match to Q.]

The basics of the vector space model are that the cosine of the angle can be computed between any two vectors in the document-term matrix [7]. This number is guaranteed to be between zero and one: a document is identical to another document (score equals one), has nothing in common with the reference document (score equals zero), or falls somewhere in between. The closer the score is to one, the more similar two documents are in the "vector space" or "semantic space" of the documents. This model gives the reviewer some idea of how similar two documents are. It is useful in this respect, but it has some limitations. The reader can review the references for more detail on the mathematics, but the basic idea is that documents are: 1) the same; 2) totally unrelated; or 3) somewhere in between.

Problems with the Vector Space Model (VSM)
The problems that arose using the vector space model included:
• synonymy: many ways to refer to the same object, e.g. car and automobile
• polysemy: most words have more than one distinct meaning, e.g. model, python, chip
• vectors are still sparse, and there is a lot of extra computation involved in analyzing them

Figure Twelve – Illustrative Behavior: Vector Space Model Comparisons. [Diagram: three example term sets, "auto engine bonnet tyres lorry boot", "car emissions hood make model trunk" and "make hidden Markov model emissions normalize". The first two share no literal terms and will have a small cosine even though they are related (synonymy); the last two share terms such as "make", "model" and "emissions" and will have a large cosine even though they are not truly related (polysemy).]

As can be seen above, the VSM puts things together that have the same (literal) words. It is efficient to compute and gives an intuitive basis for understanding what is literally similar. What it does not help with is finding documents that "mean" the same things. In the example above, one document with "car" and another with "auto" would not be grouped together using this technique. This is because the technique cannot account for synonyms or polysemes (these are explained in [5] within the references); polysemes are words that have more than one meaning ("Java" meaning coffee, "Java" meaning the island of Java, or "Java" the programming language). There are ways to improve the behavior of the vector-space model, and it is still used and proves very useful for many operations in data analysis. For conceptual analysis of data, however, there are other methods that can be used that don't suffer from these drawbacks.
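One family of such methods is latent semantic analysis (LSA), which the Romeo and Juliet example earlier in this paper described informally. The following is a minimal, illustrative sketch of the idea, assuming scikit-learn and a crude conflation of die/died/dies in place of real stemming; it is offered only to show the mechanics, not as a prescription for a review platform:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # The five short documents from the Romeo and Juliet example, plus the query.
    docs = [
        "Romeo and Juliet",
        "Juliet: O happy dagger!",
        "Romeo died by dagger.",
        "Live free or die - New Hampshire's motto",
        "Did you know, New Hampshire is in New England?",
    ]
    query = "dies dagger"

    def normalize(text):
        # Crude stand-in for stemming: conflate die/died/dies into one token.
        return text.lower().replace("died", "die").replace("dies", "die")

    # Build a term-document count matrix, then project it into a small
    # "latent" space; terms that co-occur end up near one another there.
    counts = CountVectorizer(preprocessor=normalize).fit_transform(docs + [query])
    latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts)

    scores = cosine_similarity(latent[-1:], latent[:-1]).ravel()
    for i in np.argsort(-scores):
        print(f"{scores[i]:+.2f}  {docs[i]}")

Whether d1 ends up ranked above d5 in such a sketch depends on tokenization and on how many latent dimensions are kept; those choices are exactly the kind of design decisions a review team would need to document.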
It is not perfect, but it represents a lot of value to folks trying to find similar documents and it paved the way for researchers to use a common model for thinking about document analysis. The techniques that were subsequently put forward used this representation method but applied very different mathematical techniques to the matrix model of viewing document collections. There are ways to improve the behavior of the vector-space model and it is still used and proves very useful for many operations in data analysis. For conceptual analysis of data however, there are other methods that can be used that don’t suffer from the drawbacks of VSM. These techniques perform dimensionality reduction on the data sets at hand, and expose “latent relationships” in data that are hard to find otherwise. Before moving on to techniques, some discussion of reducing the dimensionality of data sets is presented. Some More Basics Before we discuss the various techniques at our disposal for legal discovery analytics; we have to discuss a general concept in language theory: “dimensionality”. Dimensionality is the number of words or the number of things that we have to account for in a language model. Analyzing Text Documents – “The Curse of Dimensionality” The problem with text is that with most languages there are so many words to choose from. Documents vary in their vocabulary so much that it is hard to build one mathematical model that contains all the possibilities of what a document might “mean”. Language researchers and computer scientists refer to this problem as “the curse of dimensionality” and many information analysis approaches seek to reduce the number of dimensions (words) that their models have to contain. In the matrix above, if this represented a “real” collection of documents, the columns of the matrix would be much more numerous and for many of the documents the columns would not have an entry. This is what is meant by “sparse data” within a “document-term matrix”. Many approaches are aimed at identifying what words in a document or set of documents represent “enough” of the total so that others can be ignored. Information theorists refer to this as removing “noisy data” from the document model. This concept revolves around choosing enough of the attributes (words) within the documents to yield a meaningful representation of their content. Other techniques are used to actually remove words from the documents before they are analyzed. Stop Word Removal Certain words (such as “a”, “and”, “of”) that are not deemed “descriptive” in the English language can be removed from a document to eliminate the number of dimensions that an algorithm needs to consider. This may be helpful in some contexts and with some algorithms; it does reduce the numbers of dimensions that need to be considered by an analysis algorithm. When these are removed it can be hard to determine specific phrases that might carry meaning to a legal reviewer however. A search engine that can find phrases may not consider the difference between two documents with similar phrases: 28 Document #1:“We agree on the specific language outlined below…” And: Document #2: “We agree to pursue a process where we agree on a specific language to describe….” In these two documents many search engines would produce both documents; each with clearly different meanings in response to a phrase search of: “agree on the specific language”. 
This may not be a problem to a reviewer because both documents are likely to be returned; but the reviewer will have to read both documents and discard the one that is not specific enough for the case at hand. In this instance, stop word removal would yield results that are less specific than a reviewer might want. The searches with this type of index may produce more documents, but the cost will be that they may not be as specific to the topic at hand. Stemming of Language With most languages, there are ways to find the “stems” or “root meanings” of many words through an algorithm pioneered by Martin Porter [9] that has been named: “Porter Stemming”. This technique has been used widely and is often referred to simply as: “stemming”. Any serious language theorist recognizes the term “Porter Stemming”. The algorithms were initially released for English language analysis but have been extended for many other languages. See the reference (again [9]) for more discussion of the techniques and the languages supported. The idea with porter stemming is to reduce words with suffix morphologies to their “root” meaning. The root of “choosing” is “choose” and would show up in some stemmers as: “choos”. The roots of many common words can change after stemming to common “roots”: alter alteration altered become: alter alter alter This reduces the number of tokens that a document has to account for and the argument for using this technique is that the meaning of the words is “about the same” so the corresponding behavior this induces in the mathematics will not be deleterious to any given analysis technique. The theory behind using stemming for search is that more documents of “about the same meaning” will be produced for a given query. In a legal review context documents that are not specific to a query could be returned with stemmed collections. This is similar to the situation that could exist when stop-words are removed from a collection. For analytic approaches, the same thing can occur. The algorithms that group documents together could produce results that are less specific than might be desired by a human reviewer. 29 For analytic approaches, the designer of an algorithm must consider this trade-off between precision and recall. The benefits of having fewer things to keep track of in the model may outweigh any lack of clarity around usage that the token suffixes may have conveyed. In a legal discovery “clustering” context this may not be true (as we will discuss), but stemming is an important attribute of a collection of documents that should be considered when preparing documents for legal review purposes. It can help immensely and it can make other things less specific (which it was designed to do) than one might want for a legal discovery application. The designer of the analysis system should consider how the documents need to be prepared for the optimal performance inside the algorithms the system will implement. Higher-Order Mathematical Techniques To this point, we have seen how a legal discovery platform must include many different pieces of functionality and respect both Meta data and full content search at great scale. We have also seen how analytics can be important to legal discovery professionals. We have set the framework for how to represent documents in a way that allows mathematical operations to be defined on abstract representations of their content. 
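Before turning to the higher-order techniques, the two preparation steps discussed above, stop word removal and Porter stemming, can be sketched in a few lines of Python. This is a minimal illustration that assumes the NLTK library is installed for its Porter stemmer; the stop word list is a tiny sample chosen for the example, not a recommended production list.

```python
# Minimal sketch of stop word removal and Porter stemming, assuming the
# NLTK library is available. The stop word list is a small illustrative
# sample; production systems use larger, configurable lists.
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "and", "of", "on", "the", "to", "we"}
stemmer = PorterStemmer()

def prepare(text, remove_stop_words=True, stem=True):
    """Return the token stream that would be indexed for analysis."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

# "agree" and "agreeing" collapse to the same root and the stop words
# drop out: fewer dimensions for the model, but the exact phrase is no
# longer recoverable from the prepared tokens.
print(prepare("We agree on the specific language outlined below"))
print(prepare("agreeing on specific language"))
```

The trade-off the sketch makes visible is exactly the one described above: the prepared stream is smaller and more general, and the original phrasing is gone.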
We reviewed the vector space model and how document matrices have many “dimensions” that impact analytical performance for legal review purposes. We have discussed how to reduce dimensions by removing certain words from a collection or by reducing certain words to their “root forms” so that they can be considered more generally with fewer burdens being placed on the modeling technique. These techniques can reduce the specificity of results returned by the system. This may or may not be acceptable for a legal review application. For conceptual analysis of data, there are other methods that can be used that perform dimensionality reduction on the data sets at hand in a different manner. One of the first solutions to this problem that was proposed was Latent Semantic Indexing (or Analysis) by a team or researchers at Bell Laboratories in 1988. Before these are explored, some of the commonly used techniques are listed and discussed. This is not intended to be an exhaustive review of every technique available for language analysis. It is not a critique of any technique or vendor implementation. This is a discussion of some common techniques that have been used within legal discovery products and presents a “pro” and “con” set of considerations for the reader. Commonly Used NLP Techniques Within legal discovery, there are some NLP techniques that have become commonly known within the industry. These have been championed by vendors who have had success in providing them as pieces of various edicovery products. These are: 1. Latent Semantic Analysis/ Indexing (LSA/LSI) – this is an unsupervised classification technique used in a popular review platform and some other products 2. Probabilistic Latent Semantic Indexing or Analysis (PLSI; sometimes referred to as PLSA) – this is a supervised learning technique that has been implemented within search engine products 3. Bayesian Modeling (this is described below; the term “Bayesian” is commonly understood for SPAM filtering and other knowledge based products) 30 4. Discrete Finite Language Models (companies with these technologies have used linguists to build a “rules based” engine of some sort based on the “Parts of Speech” found in a text collection) these are included as “linguistic models and algorithms” that they use to help find keywords for search and to “understand“ collections. These probably are useful in some contexts; generally these are specific to a given language and will not provide much value to other languages without tuning by the authors of the model. Techniques Discussed/Analyzed Each of these techniques will be discussed briefly in the context of their use within legal review. All of these are of course useful in the appropriate context. Their usefulness in certain situations and what needs to be added to them to make them an integral part of the legal review process is noted below. Their behavior at a certain scale can become problematic for each technique; this will be discussed below. Latent Semantic Analysis Latent Semantic Analysis (sometimes referred to as Latent Semantic Indexing or “LSI”) was invented by a team of researchers at Bell Laboratories in the late 1980’s. It uses principles of linear algebra to find the “Singular Value Decomposition” (see reference *10+) of a matrix which represents the sets of independent vectors within the matrix that exhibit the best correlations between term members of the documents it represents. 
Notice that documents represented as “vectors” in a matrix make this technique available in the same way that vector space calculations are (as described earlier) available for VSM similarity operations. With LSI/LSA the terms that emerge from the document-term matrix are considered “topics” that relate to the documents within the matrix. These topics are referred to as the “k” most prevalent “topics” or words in the matrix of document terms. LSA – “The Math” From reference [11] (Wikipedia page on LSI): “A rank-reduced, Singular Value Decomposition is performed on the matrix to determine patterns in the relationships between the terms and concepts contained in the text. The SVD forms the foundation for LSI.[15] It computes the term and document vector spaces by transforming the single term-frequency matrix, A, into three other matrices— a term-concept vector matrix, T, a singular values matrix, S, and a concept-document vector matrix, D, which satisfy the following relations: A = TSDT In the formula, A, is the supplied m by n weighted matrix of term frequencies in a collection of text where m is the number of unique terms, and n is the number of documents. T is a computed m by r matrix of term vectors where r is the rank of A—a measure of its unique dimensions ≤ 31 min(m,n). S is a computed r by r diagonal matrix of decreasing singular values, and D is a computed n by r matrix of document vectors. The LSI modification to a standard SVD is to reduce the rank or truncate the singular value matrix S to size k « r, typically on the order of a k in the range of 100 to 300 dimensions, effectively reducing the term and document vector matrix sizes to m by k and n by k respectively. The SVD operation, along with this reduction, has the effect of preserving the most important semantic information in the text while reducing noise and other undesirable artifacts of the original space of A. This reduced set of matrices is often denoted with a modified formula such as: A ≈ Ak = Tk Sk DkT Efficient LSI algorithms only compute the first k singular values and term and document vectors as opposed to computing a full SVD and then truncating it.” This technique lets the algorithm designer select a default number of topics which will be “of interest” to them (the default value is usually between 150-300 topics). There is research to indicate that around 100-150 topics is the “best” or “optimum” value to configure when using LSA. This sparks debate among language theorists but has been discussed in other documents (see reference [4] and [11]). The topics generated via LSA SVD decomposition are referred to as the “k-dimensional topic space” within the new matrix. This is because there are k (100-150-300) topics or terms that now “matter” (instead of the thousands of individual terms in a set of documents before the dimensionality reduction has occurred). So the original matrix that could have contained thousands of unique terms is now represented by a much smaller matrix with terms that are highly correlated with one another. Figure Thirteen outlines some of the mathematical concepts that apply with Latent Semantic Analysis. 32 Figure Thirteen: LSA Matrix Illustrations LSA: One of the First Solution • Singular Value Decomposition {A}={U}{S}{V}T • V and U forms an orthonormal basis for input and output space: A*A, AA* In the diagram it can be seen that the large matrix has been “reduced” to the smaller dimensional area and “important” terms are represented in the matrix. 
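The rank-k truncation described in the quoted formula can be sketched directly with a general-purpose linear algebra library. The toy Python example below uses NumPy on a tiny dense matrix; the counts, the term labels and k = 2 are arbitrary choices for the illustration, and real LSI implementations compute only the first k singular triplets of a large sparse matrix rather than a full decomposition.

```python
# Toy sketch of the LSI rank reduction A ~= T_k S_k D_k^T using NumPy.
# A is a tiny dense m x n term-document matrix (m terms, n documents);
# real collections are large and sparse, and k is typically 100-300.
import numpy as np

A = np.array([
    [2, 1, 0, 0],   # "contract"
    [1, 1, 1, 0],   # "agreement"   (illustrative counts only)
    [0, 1, 2, 0],   # "breach"
    [0, 0, 1, 2],   # "forecast"
    [0, 0, 0, 1],   # "rain"
], dtype=float)

k = 2
T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

A_k = T_k @ S_k @ D_k.T          # rank-k approximation of A

# A query (or a newly collected document) can be projected into the same
# k-dimensional topic space and compared to the document vectors by cosine.
def project(term_counts):
    return np.linalg.inv(S_k) @ T_k.T @ term_counts

q = project(np.array([1, 0, 1, 0, 0], dtype=float))   # "contract", "breach"
doc_vectors = D_k                                      # one row per document
scores = doc_vectors @ q / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-12)
print(np.round(scores, 3))
```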
What the “mathematics removed” were topics or terms that did not appear strongly in relation to the terms that “survived” the operations that reduced the larger matrix. So it seems like this is a great idea (it was; it just is not perfect). LSA Practical Benefits To help the reader see the benefits of LSI, and how it can find correlations in data, an actual example is provided from a blog maintained by Alex Thomo [mailto:thomo@cs.uvic.ca]. This example was used in an earlier section of the paper to illustrate how LSA as a technique is very valuable. Here we delve into it a bit more and explain its “pros” and “cons”: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > An Example > > Suppose we have the following set of five documents > > d1 : Romeo and Juliet. > d2 : Juliet: O happy dagger! > d3 : Romeo died by dagger. 33 > d4 : "Live free or die", that's the New-Hampshire's motto. > d5 : Did you know, New-Hampshire is in New-England. > > and search query: dies, dagger. > > A classical IR system would rank d3 to be the top of the list since it > contains both dies, dagger. Then, d2 and d4 would follow, each containing > a word of the query. > > However, what about d1 and d5? Should they be returned as possibly > interesting results to this query? A classical IR system will not return > them at all. However (as humans) we know that d1 is quite related to the > query. On the other hand, d5 is not so much related to the query. Thus, we > would like d1 but not d5, or differently said, we want d1 to be ranked > higher than d5. > > The question is: Can the machine deduce this? The answer is yes, LSA does > exactly that. In this example, LSA will be able to see that term dagger is > related to d1 because it occurs together with the d1's terms Romeo and > Juliet, in d2 and d3, respectively. > > Also, term dies is related to d1 and d5 because it occurs together with > the d1's term Romeo and d5's term New-Hampshire in d3 and d4, > respectively. > > LSA will also weigh properly the discovered connections; d1 more is > related to the query than d5 since d1 is "doubly" connected to dagger > through Romeo and Juliet, and also connected to die through Romeo, whereas > d5 has only a single connection to the query through New-Hampshire. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >End Example from Blog The example shows how the LSA technique can “link” together the concepts across the document collection. Even though the query has nothing to do with New Hampshire; the state motto: “live free or die” associates the query with the state. The power of the latent technique is obvious; it also can lead to obfuscation as we will also see in the next section. The Power LSA provides a set of concepts that were “latent” or unobserved in the data from the original document matrix. Due to the mathematical technique of computing linearly independent vectors that have pronounced term correlations; things that “belong together” show up because they relate to common linking words in certain ways. In a document collection, terms such as “astronaut” will be paired with “rocket” and “space” and “travel” or “expeditions”. Terms such as “cosmonaut” will be related to the same terms. The astronaut and cosmonaut term ascendancies are good examples of the benefits with 34 LSA. 
A human reviewer may not have thought about cosmonaut as a possibility for a keyword search along with astronaut but after LSA reveals it as a latent concept the reviewer can include it in the keyword list of a given matter. For legal reviewers it is valuable to find latent terms that are non-obvious that can be included in with obvious keyword selections. Other relationships in the data can be seen so that the legal reviewer can consider unseen aspects of document collections for building their legal strategies. If the document set is appropriately sized, the lawyer can receive “good ideas” from LSA operations that are run on data. Another benefit to the technique is that new documents that arrive after LSA has been performed can by “folded in” to an existing reduced matrix (with a vector multiplication operation). This is often necessary as documents for a given case are often found as part of an iterative process where prior review leads to a widening scope of collection. The technique can also have applicability across languages as it may identify correlations in documents that are similar regardless of language [11]. This is not universally true for all languages, but it can be seen in some instances. There can be drawbacks to the LSA approach however. The Problems: LSA Limitations For large document collections, the computational time to reduce the matrix to its k important dimensions is very great. On large document collections (100,000 – 200,000 documents) computing SVD’s can take a month or more; even with large capable servers equipped with large memory configurations. The technique does not distribute well over a large number of computers to allow the computational burden to be shared. Even if the computational burden is acceptable, the technique is difficult to use after a certain number of documents because the data seems to put too many things in too few buckets. The terms that it seems to correlate don’t always seem to make sense to human reviewers. This is due to a problem statisticians call “over-fitting”. Too many wide-ranging topics begin to show up in the reduced matrix. They are related somehow, but it is not clear why. Some good examples: good correlations occur where the terms “astronaut” and “cosmonaut” are paired with “rocket” and “space” and “travel”. This all makes sense, cosmonauts and astronauts engage in space travel. But also included in the matrix are documents containing travel documentary reviews of the Sahara desert, “camels”, “Bedouins” and “Lawrence” and “Arabia”. These don’t seem at all related to the documents about space travel. This occurs because the correlation of these topics with the ones about space travel relates to long journeys over harsh dry regions with little water, harsh temperatures and environments forbidding or deadly to humans. This over-fitting occurs as more documents with more topics exhibit these correlative effects. Soon there are too many associations to be crisp and logical to a human reviewer. Even in the example with just a few documents, it does not make sense that “New Hampshire” was introduced into the search results as “relevant” when the terms of the query were: “dies” and “dagger”. If this were a murder case it would not have made sense to drag in documents that are about the state of New Hampshire. 35 So the dimensionality of the matrix was reduced with LSA and there are fewer things for a human to consider, but its discerning power was reduced as well. 
The results that emerge from the technique can be confusing and do not lead to crisp conclusions about what the document population "represents". The technique is a purely mathematical one; there is no syntactic knowledge imparted to the model to check for consistency with language usage. So it is clear that LSA is important and helpful under certain circumstances, but also that it can be a bit confusing. Across a large population of documents it can take a long time to compute the relationships between documents and the terms they contain, and the results of all that computation can end up being confusing.

Probabilistic Latent Semantic Analysis

Because of the discernment issue with LSA, and as a result of other researchers looking at the problem of conceptual mining in a new way, Probabilistic Latent Semantic Analysis (PLSA) was invented. The reader is referred to [2] in the references section for a full discussion of the technique, but it is built on a foundation from statistics where co-occurrences of words and documents are modeled as a mixture of conditionally independent multinomial distributions. Instead of using linear algebra to reduce a matrix, the PLSA technique looks at how often a certain topic occurs along with a certain word. The following formula is from [2]; it says that the joint probability of a word "w" appearing in a document "d" is a mixture over latent classes (the topics referred to as "z" above). The mixture is found by iterating over a training set of documents and finding the strongest correlations in the documents that contain both the word and the topic. This is what is meant by a "multinomial distribution" within the documents of the collection. The technique was invented by Thomas Hofmann (then at Brown University) and others, and is referenced in [13].

P(w, d) = Σc P(c) P(d | c) P(w | c) = P(d) Σc P(c | d) P(w | c)

This may be easier to visualize with an illustration. The user picks a representative set of documents and the PLSA software then finds the highest-valued topics for each word. This is accomplished with what is called an "Expectation-Maximization" (EM) algorithm that maximizes the logarithmic likelihood that topic z occurs with word w for a given word-document combination (a toy numerical sketch of these updates is given below, after the drawbacks of the technique are discussed).

Figure Fourteen: PLSA Illustrated (the pLSI graphical model). For each word of document d in the training set, a topic z is chosen according to a multinomial conditioned on the document index d, and the word is then generated by drawing from a multinomial conditioned on z. In pLSI, documents can have multiple topics.

Benefits of PLSA

The benefit of PLSA is that correlations can be found with a statistical basis from a document population. One can say that there is a definite and quantified "likelihood" that certain documents contain certain topics. Each document can contain multiple topics as well.

Drawbacks of PLSA

This technique represents the topics among a set of training documents, but unlike LSA it does not have a natural way of fitting new documents into an existing model. It is also computationally intensive and must be run on a population of documents selected by the user. Unlike LSA, it is a supervised classification method; it relies on a set of documents identified by a user. If the population of documents selected by the user is not representative of the entire collection of documents, then comparisons to documents that have been analyzed previously are not necessarily valid.
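As referenced above, here is a toy Python sketch of the PLSA EM updates on a tiny count matrix. It is a bare-bones illustration of the update equations only: real implementations add tempering, convergence checks and sparse data structures, and every number here (the counts, two topics, fifty iterations) is an arbitrary choice for the example.

```python
# Toy EM sketch for PLSA on a tiny document-word count matrix.
# Rows are documents, columns are words; K is the number of latent topics.
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([[2, 1, 0, 0],
                   [1, 2, 1, 0],
                   [0, 0, 2, 2]], dtype=float)   # n(d, w)
n_docs, n_words = counts.shape
K = 2

# Random initial parameters P(z|d) and P(w|z), normalized to sum to one.
p_z_given_d = rng.random((n_docs, K))
p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
p_w_given_z = rng.random((K, n_words))
p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: responsibility P(z|d,w), proportional to P(z|d) * P(w|z).
    joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]   # (d, z, w)
    posterior = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)

    # M-step: re-estimate P(w|z) and P(z|d) from the expected counts.
    expected = counts[:, None, :] * posterior                    # n(d,w) * P(z|d,w)
    p_w_given_z = expected.sum(axis=0)
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
    p_z_given_d = expected.sum(axis=2)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12

print(np.round(p_w_given_z, 2))   # per-topic word distributions
print(np.round(p_z_given_d, 2))   # per-document topic mixtures
```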
One cannot take into account any prior knowledge of the statistics underlying a new unclassified data set. PLSA (like LSA) also suffers from over-fitting. With PLSA several hand-selected parameters have to be configured to allow it to perform acceptably. If these “tempering factors” are not set correctly, the same issues with ambiguous topic identification can be seen with PLSA (as with LSA). Given that most users of data analysis products don’t understand the impacts of hand-tuning parameters, let alone the techniques being used (the mathematics involved) this concept of setting parameters in a product is impractical at best. Therefore, PLSA is a statistically based and mathematically defensible solution for concept discovery and search within a legal discovery product, but it can be quite complex to tune and 37 maintain. It is likely a very difficult technique to explain to a judge when a lawyer has to explain how PLSA might have been used to select terms for search purposes. Problems with Both LSA and PLSA With both PLSA and LSA, a product incorporating these must still provide an inverted index and basic keyword search capability. Therefore, if one just implemented PLSA or LSA, the problem of providing a scalable keyword indexing and search capability would still exist for legal discovery users. All of the problems that were presented in the first part of this document still exist with platforms supporting these two analytic techniques. Bayesian Classifiers Bayesian Classifiers are well known for their work in the area of SPAM detection and elimination. Basically they find “good” examples and “bad” examples of data and compare new messages to these to determine if a new message should be classified as one or the other. The mathematics behind this kind of technology is discussed in *3+ and is based on “Bayes Theorem” of conditional probability: “Bayes Theorem basically says that a document probability of belonging to a certain class (“C” in the equation below) is conditional on certain features within the document. Bayesian theory relies on the fact that these are all independent of one another. This “prior” probability is learned from training data. From [3]: “in plain English the above equation can be written as”: These can be used in a legal discovery context and some companies employ these types of technologies in their products. These kinds of classifiers are useful, they just have to be trained to accomplish their work, and this requires a human to perform this prior classification. Bayesian Benefits When properly trained, they work quite well. They are surprisingly efficient in certain cases. Like all tools, they are good at certain “jobs”. Bayesian Classifier Drawbacks/Limitations They sometimes have no idea what to do with unseen data; if there is no example to guide them, they can make “bad choices”. They can take skilled humans to collect the data for the “models” that they need to be effective. A lot of times this is not possible and can lead to unproductive behavior. 38 Natural Language Models: AKA Language Modeling There are products that claim to have “proprietary algorithms” where “linguists” construct classifiers based on part of speech tagging (POS tagging), or specific dictionary based approaches that they feel “model” language. These often require professional services from the same companies that sell the software implementing the models. This is because the linguist who constructed the model often has to explain it to users. 
In a legal setting these approaches often require the linguist to become an expert witness if the model results come under scrutiny. These are not stand-alone software tools that one can run at the outset of a legal matter to “get some ideas” about the electronic information available for a case. These models often require hand-tuning of the models given an initial keyword set produced by attorneys who have some initial knowledge about a case and are therefore not tools to expose meaning in language innately. They are more like “downstream” language classifiers that help identify documents in a large collection that meet some well understood semantic criteria established by the keyword analysis. There are other products that use a combination of dictionaries and language heuristics to suggest synonyms and polyesters [5] that an attorney could use for keyword searches given a well-understood topic or initial list of keywords. These also may require that an expert explain some of the results if there is a dispute over the keywords it may suggest. Drawbacks to Linguistic Language Models The drawbacks to these methods include: 1. They often require hand-tuning and are not general software packages that can organize and classify data 2. They often require professional services and expert witness defense 3. They are very language specific (English, French, etc.) and do not scale across multi-lingual data sets Other Specialized Techniques Near-duplicate Analysis Often versions of documents that are measured to be within “some similarity measure” of reference or example documents are very useful to identify. Knowing that a certain document has been edited in a certain place and in a certain way can be very useful to a legal reviewer. Knowing when this is done on a timeline basis is again a very crucial piece of many legal cases. Near-duplicate data identification products perform this kind of analysis. For two documents: • A near-duplicate identification product would build efficient data structures to compare two documents: – 39 “Mary had a little lamb” – document #1 – • “Mary had a little white lamb” – document #2 Yielding a “difference” of one word between the two documents; after “little” and before “lamb”. So the difference would be defined as an “insertion” between “little” and “lamb”. Mixing this kind of capability with the timeline that could be derived for when the edits occurred (by analyzing the Meta data that should be stored with the documents) both can be used to expose evidentiary facts about a case. Email Conversational Analysis Analyzing conversations within email and other message types is a very important part of legal discovery. A full email conversational analysis tool must include the ability to see what individuals sent messages on certain topics to others. In addition, it is important to have a tool that can display the full list of email “domains” that a person used in a given time period. This explains the main sites or companies contacted by a given individual over a specific period of time. Figure Fifteen: Email Analysis Email Conversational Graph Walking Email thread starting with Mary@sk.com Msg 1 Conversation: Mary Mary@sk.co m John@sk.com Joe@tl.com CEO@sk.com John@insidertrade.com Time 47 April 25, 2011 There are several approaches to email conversational analysis. The important aspect of this is allowing the correct attributes (both Meta data (header) information and content) to be included in the algorithm that constructs or follows the conversations. 
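Returning to the near-duplicate example above, the word-level difference between the two "little lamb" documents can be computed with Python's standard difflib module. This is only a minimal sketch; production near-duplicate engines use far more efficient fingerprint or shingle structures, and the similarity cut-off shown is an arbitrary illustrative value.

```python
# Minimal near-duplicate sketch using Python's standard difflib module.
# SequenceMatcher is run over token sequences, so insertions and
# deletions are reported at the word level.
from difflib import SequenceMatcher

doc1 = "Mary had a little lamb".split()
doc2 = "Mary had a little white lamb".split()

matcher = SequenceMatcher(None, doc1, doc2)
print("similarity ratio:", round(matcher.ratio(), 3))

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, "doc1", doc1[i1:i2], "-> doc2", doc2[j1:j2])
# The opcodes report a single insertion of "white" between "little" and "lamb".

# A simple (illustrative) near-duplicate test: flag pairs above a cut-off.
def near_duplicates(a_tokens, b_tokens, threshold=0.8):
    return SequenceMatcher(None, a_tokens, b_tokens).ratio() >= threshold

print(near_duplicates(doc1, doc2))
```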
Legal Discovery Processing Versus “Pure” Natural Language Processing As we saw in the previous sections, there are a number of techniques that one can use to find conceptual relationships across document collections. In a desire to use the latest computer science 40 techniques to discover information users of legal review technology have turned to Natural Language Processing (NLP) and information analysis approaches that have been used in search engines and ecommerce applications. Unfortunately legal review professionals have often turned to vendors who confuse general NLP techniques with sound legal discovery practice. The notion that a single mathematical technique will identify everything within a data set that is relevant to a case and help produce all relevant documents as a result is incomplete thinking. NLP techniques when applied correctly can be very helpful and powerful; but like any tool they can only be used in the correct circumstances. Also, at certain magnitudes of scale, the techniques break down, or experience limitations in their feasibility to produce relevant results. Over-fitting can be a problem that obfuscates results and makes the burden of computation a luxury for what the techniques provide in benefits to the review process. This is why this paper started off with the full explanation of what a platform for legal discovery needs to contain. If the user understands that multiple operations need to be supported to find all aspects of the information relevant to a case (keyword data, Meta data, user-supplied Meta data, analytic associations) then NLP techniques can be one of those operations and the user will have great results. If the system a user selects relies on some specific NLP technique alone, the results it produces will not be complete enough for legal review purposes. Data Preparation is Important to Obtaining Appropriate Results We saw in previous sections that the way documents are prepared and represented within a legal discovery system is very important to obtaining good results. If data is not prepared correctly, certain techniques will break down (such as phrase searching performance). With analytics, stemming may reduce the dimensions an algorithm must analyze, but that may yield less specific results than one would envision. In legal discovery, it can be very important to find documents that say: “by taking the following action; the party is choosing to violate the contract”. If the documents in a collection are prepared for “NLP” approaches, more documents than one really wants will be returned when looking for the phrase shown above or documents may be missed in the review phase. The near-duplicate mechanisms shown can find too many items that are not true “near-duplicates” if stemming is utilized. So one-step approaches must be carefully scrutinized if the most relevant results are to be obtained. This can require extra human review and perhaps lead to human error during review. Many products prepare their collections of documents one way (to support both NLP and keyword search approaches). It is important to prepare documents specifically for the type of analysis (NLP or straight keywordphrase searching) they will undergo. For legal discovery, it is important to prepare collections that can return results specific enough to save time in the initial collection reduction and the eventual legal review portions of the process. 
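A minimal sketch of the point just made: the same document can be kept in two prepared forms, one token stream preserving exact wording for phrase and keyword search, and one normalized stream (stop words removed, suffixes stripped) for statistical analysis. The normalize() helper and the dictionary named virtual_index below are hypothetical stand-ins introduced only for illustration.

```python
# Sketch: keep two prepared representations of the same document, one
# exact (for phrase search) and one normalized (for analytics). The
# normalize() helper is a crude stand-in for the stop word removal and
# stemming steps sketched earlier; both forms share the same document ID.
def normalize(tokens, stop_words=frozenset({"the", "a", "is", "on", "to"})):
    out = []
    for t in tokens:
        t = t.lower()
        if t in stop_words:
            continue
        for suffix in ("ing", "ed", "s"):           # crude suffix stripping
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

def prepare_document(doc_id, text, index):
    tokens = text.split()
    index.setdefault("exact", {})[doc_id] = tokens                 # phrase searching
    index.setdefault("analytic", {})[doc_id] = normalize(tokens)   # clustering, LSA, etc.

virtual_index = {}
prepare_document("doc-1", "the party is choosing to violate the contract", virtual_index)
print(virtual_index["exact"]["doc-1"])
print(virtual_index["analytic"]["doc-1"])
```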
The total platform approach (with the virtual index) lets one prepare data for the analytic operations that are important to each stage of a legal discovery process. This is possible because the virtual index 41 can represent the same data in multiple ways. Along with this, it is important to realize the benefits that analytics can provide. Aspects of Analysis Algorithms for Legal Discovery Another aspect of processing data for legal discovery and utilizing NLP techniques is that language characteristics of documents must be taken into account. Most NLP techniques use the statistical nature of data (via token frequency or occurrence) to derive some sort of model that describes data of certain types. If documents containing multi-lingual characteristics are combined with English-only documents, the predictive power of the model will decrease. If language is not properly accounted for, the predictive power can become even less precise than it would be otherwise. Data preparation is very important to analytic performance in these types of systems. Legal review requires more precision than other applications so it is especially important to be precise with the preparation of data sets. Benefits of Analytic Methods in Ediscovery In an Ediscovery context, analytics are very important. They help the reviewer in several ways: 1. They can expose information about a collection of documents that is non-obvious but that can help one understand the meaning of the information they contain. This can help a lawyer understand what keywords would be relevant to a matter, and to select the ones that ultimately get used to discover information about a legal matter. 2. They can identify relationships in data that reveal what information was available to the parties to a lawsuit at certain points in time. 3. They can identify versions of documents and relate these to a timeline to make a reviewer aware of how knowledge related to a lawsuit or regulatory matter has evolved over time. 4. They can be used to find the documents that relate to known example documents within a collection. This helps a reviewer find all documents that are relevant and can also help a reviewer find other relevant concepts that may not have been in an initial keyword search list. Problems with Analytic Procedures in Ediscovery As stated above in the introduction to this section of the document, a major problem with analytic procedures in Ediscovery is that one technique is not appropriate in all circumstances. As vendors have tended to champion one technology for their analytics, they tend to promote the over-use of one technique that is available through the use of their particular technology. In their desire to find “the holy grail” or identify the “magic bullet” for legal review users often grab on to technology pushed forward from a certain vendor and then find that it is not the panacea that it was supposed to be. Once they realize that this is an issue, some customers buy what they perceive as best of breed products. For legal discovery this has historically meant multiple ones; some for analytics and others for keyword search; perhaps a third or fourth for legal processing. Users typically try to use them separately. Outside of a single platform these technologies lose some of their value because loading data into and unloading data from various products introduces the chances of human and other error. 
42 The introduction of one platform that can handle multiple analytic approaches is how one confronts the fact that there is no single analytic technique that masters all problems with electronic discovery. Related to this issue of multiple products is that the products on the market do not run at the scale necessary to add value in even a medium sized legal matter. Because of this, analytic procedures are (practically) run after a data set has been reduced in size. This can be appropriate, but it can also reduce the useful scope and overall usefulness of the analytic technique in question. If some analytic techniques are run on a very large data set they can take an inordinately long time to run, making their value questionable. In addition, some techniques “break-down” after a certain scale and their results become less useful than they are at lower document counts. An Ideal Platform Approach To combat these issues with analytics, the correct platform with a scalable architecture and the appropriate “mix” of analytics is proposed as the answer. In the following sections a set of techniques that have been developed to ameliorate most of the issues with well-known analytic approaches will be shown. The platform approach includes a “two-tier” ordering algorithm that first “sorts” data into related categories so that deeper analysis can be undertaken on groups of documents that belong together (at least in some sense). This helps the second-level algorithm run at appropriate scale and even avoid “bad choices” when sampling documents for running analysis that can identify conceptual information within documents. This is possible because of the grid architecture explained above and the correct mix of analytic techniques. Analytic Techniques in Context of a Legal Discovery “Ideal Platform” So given the assertion that no single analytic technique is adequate on its own to provide legal discovery analysis, this section discusses how a single platform using a combination of different analytic techniques could be valuable. In addition, it shows how a platform implementing several techniques allows the overall system to provide better results than if it had been implemented with one single analytic technique. The ideal discovery platform: 1. Uses a specific and powerful initial unsupervised classification (clustering) technique to organize data into meaningful groups, and identifies key terms within the data groups to aid the human reviewer. Other analytic processes can take advantage of this first order classification of documents as appropriate 2. Uses a powerful multi-step algorithm and the grid architecture to organize data which has semantic similarity; conceptual cluster groups are formed after accounting for language differences in documents 3. Allows the user to select other analytic operations to run on the classification groups (folders) built in the first unsupervised classification step. This allows other analytic algorithms to be run at appropriate scale and with appropriate precision within the previously classified data folders 43 4. 5. 6. 7. (LSA or PLSA for example) the benefit would be that the ideal platform could break the collection down and then allow PLSA or LSA to run at an appropriate scale if a judge ordered such an action Allows the user to select documents from within the folders that have been created a. Using keyword search b. Using visual inspection of automatically applied document tags c. 
Via the unsupervised conceptual clustering techniques Allows the user to select documents from folders and use them as example documents a. “Search by document” examples where the entire document is used as a “model” and compared to other documents b. Examples that can be used as “seed” examples for further supervised classification operations Allows the user to tag and otherwise classify documents identified from the stage one classification or from separate search operations Allows the user to identify predominant “language groups” within large collections of documents so that they can be addressed appropriately and cost effectively (translation, etc.) Conceptual Classification in the Ideal Platform As we learned in an earlier section of this document, this is an analytic technique that answers the question: “what is in my data”? It is designed to help a human reviewer see the key aspects of a large document collection without having to read all the documents individually and rank them. In some sense it also helps a reviewer deduce what the data “means”. In the context of this discussion, it should be noted that the user of this functionality does not have any idea about what the data set contains and does not have to supply any example documents or “training sets”. Conceptual classification supports a number of uses within the ideal discovery product. These include: 1. Organizing the data into “folders” of related material so that a user can see what documents are semantically related; also it builds a set of statistically relevant terms that describe the topics in the documents 2. Presenting these folders so that search results can be “tracked back” to them. This allows a user to use keyword search and then select a document in the user interface and subsequently see how that document relates to other documents the unsupervised classification algorithm placed with the one found from keyword search. This is possible because the virtual index contains the document identifiers and the classification tags that show what related information exists for a given document. 3. Allows other “learning algorithms” to use the classification folders to identify where to “sample” documents for conceptually relevant information (explained below). This means that a firstorder unsupervised classification algorithm orders the data so that other analytic processes can select documents for further levels of analysis from the most fruitful places in the document group. This allows higher-order language models (LSA, PLSA or n-gram analysis) to be run on them with a finer-grained knowledge of what the data set contains and to avoid sampling documents and adding their content to a model of the data that might make it less powerful or 44 predictive. This allows the system to identify the best examples of information where higherlevel analysis can reveal more meaningful relationships within document content. Building models of similar documents from a previously unseen set of data is a powerful function of a system that contains analysis tools. Unsupervised Conceptual Classification Explained This technique solves the problems that were seen above with the single-technique approach (LSA/PLSA) where over-fitting can become an issue and specificity of results is lost. This technique: 1. Orders the data initially into folders of related material using a linearly interpolated statistical co-occurrence calculation which considers: a. Semantic relationships of absolute co-occurrence b. 
Language set occurrence and frequency c. This stage of the algorithm does NOT attempt to consider polysemy or synonymy relationships in documents; this is considered in the second stage of the algorithm 2. Performs a second-level conceptual “clustering” on the data where concepts are identified within the scope of the first-level “Clusterings”. A latent generative technique is used to calculate the concepts that occur in the first-level cluster groups. This portion of the algorithm is where synonymy and polysemy are introduced to the analysis; “lists” of concepts are computed per each first-level or first-order cluster group; these may be left alone or “merged” depending on the results of the stage three of the algorithm 3. The “lists” of second-level conceptual cluster groups are compared; concepts from one folder are compared to those computed from another. If they are conceptually similar (in crossentropy terms) they are combined into a “super-cluster”. If they are not similar, the cluster groups are left separate and they represent different cluster groups within the product 4. The algorithm completes when all first-order clusters have been compared and all possible super-clusters have been formed Algorithm Justification This algorithm allows the data to be fairly well organized into cluster groups after the first-level organization. More importantly, it removes documents from clusters where they have nothing in common, such as documents primarily formed from foreign language (different character set) data. This is important because most latent semantic algorithms will consider information that can be irrelevant (on a language basis) thus obfuscating the results of the concept calculations. This also localizes the analysis of the conceptual computations. Secondly, the over-fitting problem is reduced because the latent concept calculations are undertaken on smaller groups of documents. Since conceptual relationships can exist across the first-level folder groups, the concept lists can be similar; denoting the information in two folders is conceptually related. The third step of the algorithm allows these similarities to be identified and the folders “merged” into a “super-folder” or “super-cluster” as appropriate. Therefore the result is a set of data that is conceptually organized without undue overfitting and dilution of conceptual meaning. 45 The trick to unsupervised learning or classification is in knowing where to start. The algorithm for unsupervised classification automatically finds the documents that belong together. The algorithm starts by electing a given document from the corpus as the “master” seed. All subsequent seeds are selected relative to this one. This saves vast amounts of processing time as the technique builds lists that belong together and “elects” the next list’s seed automatically as part of the data ordering process. Other algorithms randomly pick seeds and then try to fit data items to the best seed. Poor seed selection can lead to laborious optimization times and computational complexity. With the ideal platform’s linear ordering algorithm, the seeds are selected as a natural course of selecting similar members of the data set for group membership with the current seed. There will naturally be a next seed of another list which will form (until all documents have been ordered). First-Order Classification This seed-selection happens during the first-order organization of the data. 
The aim of this part of the algorithm is to build “lists” of documents that belong together. These lists are presented to the users as “folders” that contain “similar” items. The system sets a default “length” of each list to 50 documents per list. The lists of documents may grow or shrink or disappear altogether (the list can “lose” all of its members in an optimization pass); initially the list is started with 50 members however. Please see Figure Sixteen for an illustration of the clustering technique. Figure Sixteen: Unsupervised Classification Similarity Algorithm Illustration Bounded number of cluster locations (m) Candidate List of documents (CL) l1 l2 k Exception List of documents (N or smaller) lp 34 Proprietary and Confidential Do Not Duplicate or Disclose 21-Apr-11 The algorithm starts with a random document; this becomes the first example document or “seed” to which other documents are compared. The documents are picked from the candidate documents in the collection being classified (candidate list; or every document in the corpus initially). There are “N” of these documents in the collection. The lists are of length “m” as shown in the illustration (as stated the 46 default value of m is 50); the number of lists is initially estimated at k=N/m. The first document is a seed and all other documents are compared to this document; the similarity calculation determines what documents are ordered into the initial list (l1). One key aspect of this technique is that initially all documents are available for selection for the initial list. Each subsequent list only selects documents that remain on the candidate list however. This is a “linear reduction” algorithm where the list selections take decreasing amounts of time as each list is built. There is a second optimization step to allow seeds that did not have a chance to select items that were put on preceding lists to select items that “belong” (are more similar to) them and their members. Each document is compared to the initial seed. The similarity algorithm returns a value between 0 and 1; a document with exact similarity to a seed will have a value of 1, a document with no similarity (nothing in common) will have a value of zero. The candidate document with the highest similarity to the seed is chosen as the next list member in l1. When the list has grown to “m” members the next document found to be most similar to the seed S1 is chosen as the seed for the next list (l2). The list l2 is then built from the remaining documents in the candidate list. Note that there are N-m members of the candidate list after l1 has been constructed. This causes (under ideal conditions) a set of lists, with members that are related to the seeds that represent each list, and with seeds that have something in common with one another. Similarity Calculation This similarity calculation takes into account how often tokens in one document occur in another and account for language type (documents that contain English and Chinese are “scored” differently than documents that contain only English text). This is a linearly interpolated similarity model involving both the distance calculation between data items and the fixed factors that denote language type. 
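The exact formulation of this interpolated score is internal to any given product. Purely to make the idea concrete, the sketch below combines a token-level cosine with a crude script-based language overlap using an arbitrary weight; the weight lam, the detect_languages() stub and the example texts are all assumptions made for the illustration, not the actual model.

```python
# Rough illustration of a linearly interpolated, language-aware
# similarity score. The weight and the trivial script-based language
# detector are assumptions for this sketch only.
import math
from collections import Counter

def detect_languages(text):
    """Crude stand-in: report the scripts present rather than true languages."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            if ord(ch) < 0x250:
                scripts.add("latin")
            elif 0x0600 <= ord(ch) <= 0x06FF:
                scripts.add("arabic")
            elif 0x4E00 <= ord(ch) <= 0x9FFF:
                scripts.add("cjk")
            else:
                scripts.add("other")
    return scripts or {"unknown"}

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(doc_text, seed_text, lam=0.7):
    """lam * token similarity + (1 - lam) * language-profile agreement."""
    token_sim = cosine(Counter(doc_text.lower().split()),
                       Counter(seed_text.lower().split()))
    doc_langs, seed_langs = detect_languages(doc_text), detect_languages(seed_text)
    lang_sim = len(doc_langs & seed_langs) / len(doc_langs | seed_langs)
    return lam * token_sim + (1 - lam) * lang_sim

print(round(similarity("the contract was breached", "breach of the contract terms"), 3))
```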
A document that would have a similarity score of “0.8” relative to its seed (which has only English text), based on its English text content alone, but that has a combination of English, Chinese and Russian text will have a score that is “lower” than 0.8 because the semantic similarity score will be reduced by the added attributes of the document having all three languages. This way the system can discern documents that have very similar semantics but that have different languages represented within them. The three language types are viewed as three independent statistical “events” (in addition to the language co-occurrence events of the tokens in the two documents). All events that occur within the document influence its overall probability of similarity with the seed document. With some “language-blind” statistical scoring algorithms it is possible to have document scores which represent a lot of commonality in one language (English) and where the presence of Chinese text does not influence this much at all. If specific language types are not added into the calculation of similarity, documents with three different languages will appear to be as significantly similar to a seed as those which have only one language represented within their content. Please see Figure Seventeen for an illustration of the first stage of the algorithm. Please see Figures Seventeen through Figure Twenty Two for other aspects of the algorithm. 47 Figure Seventeen: “Normal Operation” (Step One) Algorithm “Step One” Candidate List of documents (N) Grab a document from candidate list of items for the initial seed l1 Si l2 Calculate m given k there will be a fixed number of clusters and this defines m; conversely; one can start with a bounded size (m) and compute k; in either event; m=(N/k) is the defining relation l4 Exception List of documents (N or smaller) initially empty l5 lp Proprietary and Confidential Do Not Duplicate or Disclose 35 21-Apr-11 Figure Eighteen: Normal Operation Normal “Operation” Full list of items (m items) no more than m l1 Si l2 Si+1 Candidate List (CL) is empty Si+1 Chain members Each (k) List of documents (cluster) has “m” documents; everything is wonderful l4 Si+4 Chain members l4 Si+5 38 48 Proprietary and Confidential Do Not Duplicate or Disclose Exception List of documents (N or smaller) (empty) 21-Apr-11 Figure Nineteen: List Formation “List-building Behavior” Item “least similar” to Seed Full list of items (m items) no more than m Tail Item l1 Candidate List of documents (CL; N in number) Replace l2 S1 l4 Select the most similar documents from the Candidate List of documents (CL); select up to “m” documents and replace any on the list with those from CL more similar to the ones found first in the list. This builds a list of documents that are “most similar” to the seed from the one on CL. Replace “end of list” item as necessary; pushing items off the list and back to CL l5 Exception List of documents (N or smaller) lp Proprietary and Confidential Do Not Duplicate or Disclose 36 21-Apr-11 Figure Twenty: Linear Set Reduction Linear Set Reduction Property Candidate List of documents is reduced by the list length (m) + 1 (or more) Full list of items (approx. m items) can be more than m l1 Si l2 Si+1 Si+1 Select the most similar documents from the Candidate List of documents (CL); select up to “m” documents and replace any on the list with those from CL more similar to the ones found first in the list. 
This builds a list of documents that are “most similar” to the seed from the one on CL. The first cluster list requires N-1 comparisons. The last document on the list after all N-1 items have been compared is the “candidate Seed” for the next cluster. Move the last item from l1 to “head” of list l2; this is the candidate seed for list l2 l4 l5 Exception List of documents (N or smaller) lp 37 49 Proprietary and Confidential Do Not Duplicate or Disclose 21-Apr-11 The lists which form under normal operation where there are documents with some similarity to a given seed for a given list are as shown in Figure Eighteen. The seed of each new list is related to the seed of a prior list and therefore has a transitive similarity relationship with prior seeds. In this respect each list that has a seed related to a prior seed forms a “chain” of similarity within the corpus. The interesting thing that occurs is when a seed cannot find any documents that are similar to it. This occurs when a chain of similarity “breaks” and in many cases, a “new chain” forms. This is when a seed cannot find a relationship in common with any remaining item on the candidate list. The document similarity of all remaining members of the candidate list, relative to the current seed is zero. This causes the chain to break and the seed selection process to begin again. Please see Figure Twenty One for an illustration of this behavior. When a seed in a prior list does not find any items on the candidate list which is “similar” to it the algorithm selects an item from the candidate list as the next seed and the process starts over again. If items that are similar to this newly selected seed document exist on the candidate list, they are selected for membership in a new list (headed up by the newly selected seed) and the new list forms the head of a new chain. Figure Twenty One: Broken Chains Broken“Chains” Full list of items (m items) no more than m l1 Si Si+1 Chain members l2 Documents similar to one another for some reason Si+1 l4 S1a l4 Si5a Chain members Documents that have nothing in common with the “first chain group” but that do have something in common with this chain group ALL ARABIC for example 39 Proprietary and Confidential Do Not Duplicate or Disclose 21-Apr-11 In Figure Twenty One it is shown that the documents in the first chain group have no commonality with documents in the second chain group. The documents in the second chain group do have some commonality among themselves however. The second chain group is formed by selecting a new seed at random from the candidate list when a seed from the first chain group finds no documents in the candidate list with which it has attributes in common. This phenomenon indicates a major change in the nature of the data set. This major change is often related to the document corpus having a set of 50 documents from a totally different language group than that represented in the first chain of lists and documents. This occurs when the language of the documents in a given chain group are English and the next chain group is Arabic for example. Figure Twenty Two is an actual screen shot from a product that shows a “cluster” of documents that are composed of Arabic text. This same behavior can occur within the same language group, but this is the most common reason that it occurs. 
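Pulling the pieces of the description together, the list-building and chain-breaking behavior can be sketched schematically in Python. This is a simplified reading of the algorithm as described (greedy selection of the m most similar candidates, the next most similar remaining document taken as the next seed, and a fresh random seed whenever nothing similar remains); the product's second optimization pass and scoring details are not reproduced here, and the similarity function and example data are placeholders.

```python
# Schematic sketch of first-order list building with chain breaks.
# similarity() is any [0, 1] measure (see the earlier sketches); a new
# "chain" starts whenever the current seed has nothing in common with
# the remaining candidates.
import random

def build_lists(documents, similarity, m=50, seed_rng=random.Random(0)):
    candidates = dict(documents)               # doc_id -> text, still unassigned
    lists, next_seed_id = [], None
    while candidates:
        if next_seed_id is None:               # start (or restart) a chain
            next_seed_id = seed_rng.choice(sorted(candidates))
        seed_id = next_seed_id
        seed_text = candidates.pop(seed_id)

        scored = sorted(((similarity(text, seed_text), doc_id)
                         for doc_id, text in candidates.items()), reverse=True)
        members = [doc_id for score, doc_id in scored[:m] if score > 0]
        for doc_id in members:
            candidates.pop(doc_id)
        lists.append({"seed": seed_id, "members": members})

        # Next seed: the most similar document still on the candidate list;
        # if nothing remaining is similar at all, the chain is broken.
        remaining = [(score, doc_id) for score, doc_id in scored[len(members):]
                     if doc_id in candidates and score > 0]
        next_seed_id = remaining[0][1] if remaining else None
    return lists

docs = {"d1": "breach of contract", "d2": "contract breach damages",
        "d3": "weather forecast", "d4": "forecast rain"}
overlap = lambda a, b: (len(set(a.split()) & set(b.split()))
                        / max(len(set(a.split()) | set(b.split())), 1))
print(build_lists(docs, overlap, m=2))
```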
Figure Twenty Two: Chain Behavior Displayed in Classification Groups

In this example the folders labeled "Case-Data 6", "Case-Data 7" and "Case-Data 8" contain Arabic text documents. This occurred because the textual similarity of these documents had little to do with English; they had a great deal in common because of the Arabic text they contain. The similarity algorithm put the Arabic documents together because the product evaluates each "token" of text as it is interpreted in Arabic. The frequency of Arabic tokens in these documents showed no similarity with any given English seed document, so an Arabic "chain" formed and attracted Arabic documents into these particular folders. Similar assignments happened in these data for Spanish and French documents.

Second-Order Classification

It was explained above, but with the benefit of the illustration it is clear, that the conceptual calculations are undertaken on the folders consecutively. They benefit from the fact that the overall calculation has been broken into groups that bear some relationship to the seed document that leads each cluster. Even if conceptual similarity spans two clusters, the third and final stage of the algorithm will "re-arrange" the cluster membership to order the documents conceptually. Computing concepts on each individual cluster from the first stage of the algorithm reduces the number of documents in the calculation and thus reduces over-fitting. The technique used in the ideal platform at this stage works with conditional probabilities and prior statistical distributions: it draws an initial "guess" of how the terms in the documents are distributed statistically by gathering information about the document classifications produced in the first-order classification step, computes the likelihood of certain terms being "topics" within particular documents and within the overall collection, and orders topics and documents so that the topics can serve as "topic labels" for the documents contained within the folders.

Third-Order Classification

This stage of the algorithm re-orders documents into the final folder groups according to the conceptual similarity of the concept lists computed in stage two. Documents that "belong" to a "super cluster" are merged with the documents most similar to the concept list computed for some number of second-stage clusters. The concept lists of the second-stage clusters are compared; if their members are similar, the documents forming those lists are merged into a final super cluster, and otherwise the documents are left in the cluster they inhabit. This allows conceptually similar folders to be merged together: the documents comprising folders with similar concept lists are re-organized into a super-cluster because they "belong together".

First-Order Classification Importance

As stated previously, when documents contain different languages, the algorithms that compute the labels for document groups can lose precision. These algorithms look at the probabilities of certain terms occurring with other terms, and when multiple languages are involved their results can become skewed. The first-order classification algorithm puts documents with similar first-order language characteristics together, which aids the performance of the second-order topic generation algorithm.
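The second- and third-order stages might be approximated as follows, with simple term frequencies standing in for the conditional-probability topic model and Jaccard overlap of folder label sets deciding which folders merge into a super-cluster. The threshold and helper names are illustrative assumptions, not the platform's actual algorithm.

```python
from collections import Counter

def concept_list(folder_docs, top_n=10):
    # Second-order stand-in: label a folder with its most frequent terms.
    counts = Counter(t for doc in folder_docs for t in doc)
    return [term for term, _ in counts.most_common(top_n)]

def merge_similar_folders(folders, threshold=0.5):
    """Third-order stand-in: merge folders whose concept lists overlap enough
    (Jaccard similarity over label sets); other folders keep their own cluster."""
    labels = [set(concept_list(f)) for f in folders]
    merged, used = [], set()
    for i, folder in enumerate(folders):
        if i in used:
            continue
        group = list(folder)
        for j in range(i + 1, len(folders)):
            overlap = len(labels[i] & labels[j]) / (len(labels[i] | labels[j]) or 1)
            if j not in used and overlap >= threshold:
                group.extend(folders[j])   # conceptually similar folders merge
                used.add(j)
        merged.append(group)
    return merged
```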
The two algorithms together are more powerful than either one alone. This also helps the third-order classification algorithm, because folders that have similar characteristics tend to be "near" one another in the folder list. Even when they are not, the concept clustering algorithm will still find the conceptually related information that "goes together", but merging is often possible early in the "walking" of the folder concept lists because of the first-order classification operation.

Second-Order and Final Classification Importance

With this technique, the latent topics are generated by the algorithm running on the folders of pre-ordered documents themselves. This provides the benefits of LSA or PLSA on the pre-ordered sets of data but with much less computation, by taking advantage of the classification from the first pass of the algorithm. Without the first-order technique, more computation would be required to arrive at the optimal generation of the topics, placing an unnecessary computational burden on the system. The first-order classification of the documents assists the second algorithm and makes further analysis much clearer as well. When the final check is done on the semantic label lists of the folders in the collection, it allows documents that belong together conceptually to be re-clustered as necessary.

Multi-lingual Documents

It is important to note that the algorithm still handles multi-language documents, and the concept generation algorithm can find cross-correlations of terms in multiple languages. Terms that occur in a document containing English and Arabic text will have concept lists that contain both Arabic and English members. The first-order grouping of primarily Arabic or primarily English documents will still allow single-language correlations to predominate, but it will not prohibit the generation of concepts from documents containing both Arabic and English text.

Value of Classification

The value of classification like this is that a human reviewer can quickly identify document groups that may be of interest. The "top reasons" or "top terms" of a folder (the predominant terms in the documents it contains) are shown at the top of the folder in the screen shot in Figure Twenty Two. The user of the product can quickly determine whether the documents are of any interest by reading the labels on each folder. Further, the reviewer does not have to open and read documents that may be Arabic or Chinese (unless they want to read them). This folder-based ordering of documents allows a reviewer to avoid obviously irrelevant information, such as documents in a language that is of no interest to them. More importantly, however, this pre-ordering first-stage classification technique makes higher-order analysis of the data in the folders accurate, predictable and more computationally efficient. The end-stage classification yields strong language semantics for the members of the final classification groups.

Other Benefits of Document Classification

Another benefit is that any document found with a keyword search can be related back to the classification groups. Since the documents in these classification folders are related to their seed, they are very likely to be related to one another. This allows a reviewer to see what other documents are similar to a given document found via keyword search, where the reviewer knows that the reference document contains at least the agreed-upon key terms.
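A minimal sketch of how a keyword-search hit could be related back to its classification folder so that a reviewer sees the hit's likely-related siblings; the folder_of and folders mappings are hypothetical structures, not the product's API.

```python
def related_by_classification(hit_doc_id, folder_of, folders):
    """Given a document found by keyword search, return the other members of
    its classification folder; because every member is similar to the same
    seed, they are likely related to the hit as well.
    `folder_of` maps doc id -> folder id; `folders` maps folder id -> doc ids."""
    folder_id = folder_of.get(hit_doc_id)
    if folder_id is None:
        return []
    return [d for d in folders[folder_id] if d != hit_doc_id]

# Illustrative usage with made-up ids:
folders = {"Case-Data 6": ["a1", "a2", "a3"]}
folder_of = {d: f for f, docs in folders.items() for d in docs}
print(related_by_classification("a2", folder_of, folders))  # ['a1', 'a3']
```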
The conceptual organization beyond this step allows the final clusters to include semantic, conceptual clustering relationships that can be related back to keyword searches. The ideal system also allows the pre-ordering classification to be used for other analysis techniques such as PLSA or LSA; either would benefit from operating on pre-ordered data that is smaller than the entire collection.

N-gram or Other Statistical Learning Models

For building language-specific tools on top of the base classification engine, the pre-ordering technique is especially useful. After the classification algorithm has ordered a collection, higher-order language models can be built from samples within the larger collection. This may be beneficial for building n-gram models for language-specific functions like part-of-speech (POS) tagging. Other tools that could benefit would be learning models that support functions such as sentence completion for search query operations. In these cases, knowing that a cluster group contains primarily Arabic text allows the n-gram model to select samples from an appropriate set of documents, which can be important when building a model for such functions. The first-order algorithm will "mark" the analytic Meta data for a given cluster group to show that it represents a predominant language. For a POS tagger this is important, as many taggers are highly sensitive to the input model data, and training them with appropriate samples matters; it would be counterproductive to train an English-language POS tagger with German training data, for example. The first-order algorithm allows one to select documents for specific purposes from the appropriate places within the larger collection. See Figure Twenty Three for an illustration of this behavior.

Figure Twenty Three: Selecting from Pre-Classified Data for Higher-Order Models. Second-level analysis is performed on sub-sets: a sampling function selects training documents from each class (Class #1, Class #2) to feed a language model with n-gram smoothing and an LDA/LSI concept calculation.

How Pre-Ordering Can Make Other Techniques Better

This first-order classification capability can reduce the number of documents that any second-order algorithm has to consider, which can help other, less efficient algorithms run more effectively. As with the concept calculation example above, it may be desirable to run LSA or other tools on data that the platform has processed. By utilizing the folder-building classification algorithm within the platform, the large population of documents for a given case can be reduced to more manageably sized increments that an algorithm like LSA can handle. If opposing counsel were to insist on running LSA, PLSA or some other tool from a particular vendor on a data set, the ideal platform could order and organize smaller folders of data that LSA or PLSA could handle. Pre-ordering the data in this way generates smaller, related folders that these other techniques can process at their more limited scale. The reduced size of the data set would also help focus the results of LSA, because it would have fewer documents and find fewer concepts to fit into the "buckets"; the platform could therefore help reduce the LSA over-fitting issue, and it would help PLSA in this regard as well.
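The sampling function of Figure Twenty Three might look like the sketch below, which draws training documents only from clusters whose analytic metadata marks the requested predominant language (for example, to train a POS tagger or an n-gram model). The cluster dictionary fields are hypothetical assumptions for illustration.

```python
import random

def training_sample(clusters, language="ar", docs_per_cluster=50, seed=0):
    """Pull training documents only from clusters marked with the requested
    predominant language, so a language-specific model is not polluted by
    documents in other languages.
    `clusters` is a list of dicts like {"language": "ar", "docs": [...]}."""
    rng = random.Random(seed)
    sample = []
    for cluster in clusters:
        if cluster.get("language") == language:
            docs = cluster["docs"]
            sample.extend(rng.sample(docs, min(docs_per_cluster, len(docs))))
    return sample
```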
As mentioned in other sections of this document, LSA was envisioned for, and works best on, collections of documents that are smaller than those routinely found in current legal discovery matters. In [4] the authors of the LSA algorithm state that they envision "reasonable sized" data sets of "five thousand or so" documents. In today's cases it is routine to see 100,000 to 200,000 or more documents (millions are not uncommon).

With the combined platform approach, a technique like LSA could be run on the documents from the most likely "clusters" produced by the ideal platform so that the computation remains tractable. If a judge felt comfortable with LSA as a technique because of prior experience with the algorithm, he or she could see better results by reducing the number of documents the algorithm has to address at any one time. The ideal platform would thus benefit a legal review team by making a previously used product that implements LSA more effective at what it does.

Supervised Classification

Supervised classification requires the user to provide some example data to which the system can compare unclassified documents. If a user has documents that they know belong to a certain classification group, the system can compare documents to those examples and build folders of documents that lie within a certain "distance" from the seeds (in terms of similarity).

Figure Twenty Four: Supervised (Example-Based) Classification. The user picks their own seeds, or selects documents for a model that is used to start the seeds in the profiled case; each of the k lists (clusters) holds m documents, the candidate list is empty when classification is "done", and the exception list is empty if everything fits.

If the system has already pre-ordered much of the data (using the unsupervised techniques described before), finding examples is simplified for the user: they have some idea where to look for examples they can use to classify documents that enter the case as new documents. Second, search results can be related to existing document clusters, examples of the "strongest, most similar" documents to the search results can be located within a cluster folder, and the supervised classification technique can then identify other documents that belong with the pre-selected "seeds". Again, documents added to a case can be classified against examples using the supervised technique shown above. This continuous classification of documents can help reviewers find the most relevant documents rapidly by more or less automated means.

Email Conversational Analysis

Email conversational analysis is an important aspect of any legal review platform; seeing what conversations transpired between parties is important, as discussed previously. With the ideal platform approach of providing classification along with email threading, these two techniques can be used simultaneously to identify documents that are in a conversation thread and that have similar documents existing outside that thread. The existence of documents similar to those in a thread leads to the identification of email addresses that were perhaps outside the custodian list but that should be included. Having the conversation analysis and the classification capability all within one platform makes this "analytic cross-check" capability possible.
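A minimal sketch of the supervised, example-based classification shown in Figure Twenty Four, assuming a generic similarity() function and a single distance threshold; documents that match no example group well enough go to the exception list. The names and the threshold value are illustrative, not the product's interface.

```python
def classify_by_example(doc, labeled_seeds, similarity, threshold=0.6):
    """Supervised (example-based) classification: compare a new document with
    user-supplied example seeds and assign it to the best-matching group if it
    lies within the similarity "distance" threshold; otherwise send it to the
    exception list. `labeled_seeds` maps a label to a list of example docs."""
    best_label, best_score = None, 0.0
    for label, examples in labeled_seeds.items():
        for example in examples:
            s = similarity(doc, example)
            if s > best_score:
                best_label, best_score = label, s
    return best_label if best_score >= threshold else "exception"
```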
Near-Duplicate or Version Analysis The version analysis mentioned above, combined with supervised clustering and Meta data search can identify what documents were edited at certain times and by whom. Using near duplicate analysis can allow the system to “tag” all members of a “near-dupe group” within a collection of documents (autotagging of analytic Meta data). Using supervised clustering with a known seed (from the near dupe group) a user can identify other versions within the collection of documents comprising a case. Using the Meta data attributes to identify owners and import locations of documents that have been collected exposes information about who owned or copied version of files at any time during the life cycle of the case. This is a very powerful attribute of a platform that handles these combined sets of analytic processes at large scale. Summary This paper attempted to display the value behind a comprehensive platform that handles various levels of indexing for Meta data content and analytic structures. It exposed new concepts behind analysis and storage of these constructs that implement high-speed indexing and analysis of data items within the context of legal discovery. Further, it explored and discussed several aspects of large scale legal discovery processing and analysis and how the correct architecture combined with indexing and search capabilities can make legal discovery effective and productive relative to current single product approaches. The “ideal platform” approach (as it was called) presented both architecture and a set of capabilities that remove risk of error from legal discovery projects. Examples of how this combination would reduce cost and risk of error in legal discovery engagements were presented. Finally, analytic approaches that are available from electronic discovery products today, how they work, where they are effective and where they are not effective were presented. These were compared and contrasted with one another and were discussed in relation to the ideal platform approach. The ideal platform and its ability to pre-order and classify data at great scale, and then perform generative concept and label generation to identify the “meaning” of content and assign it to “folders” of 56 documents within large cases was discussed. It was shown how the platform approach of pre-classifying data and using a hierarchical model of classification algorithms could aid other products and techniques such as those that utilize LSA and PLSA. References [1] Wikipedia description of Latent Semantic Analysis: http://en.wikipedia.org/wiki/Latent_semantic_analysis [2] Wikipedia description of Probabilistic Latent Semantic Analysis: http://en.wikipedia.org/wiki/PLSA [3] Wikipedia description of Bayesian Classifiers: http://en.wikipedia.org/wiki/Bayesian_classification [4] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman (1990). 
"Indexing by Latent Semantic Analysis" [5] Wikipedia page on Polysemy: http://en.wikipedia.org/wiki/Polysemy [6] Wikipedia page on Support Vector Machines: http://en.wikipedia.org/wiki/Support_vector_machine [7] Wikipedia page on Vector Space Model: http://en.wikipedia.org/wiki/Vector_space_model [8] Wikipedia page on SMART system: http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System [9] Web Page of Martin Porter: http://tartarus.org/~martin/PorterStemmer/ [10] Wiki entry on mathematical treatment of SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition [11] Wiki entry on LSA/LSI: http://en.wikipedia.org/wiki/Latent_semantic_indexing [12] Adam Thomo Blog: mailto:thomo@cs.uvic.ca] Blog entry for LSI example [13] ^ Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR99), 1999 57 58 bit-x-bit computer forensic investigation and e-discovery consulting services The Frick Building 437 Grant Street, Suite 1501 Pittsburgh, Pennsylvania 15219 Tel: 412-325-4033 www.bit-x-bit.com BIT-X-BIT, LLC DESI IV POSITION PAPER APRIL 18, 2011 This Position Paper describes the interests of bit-x-bit, LLC (“bit-x-bit”) in the Fourth DESI Workshop. bit-x-bit is computer forensics and e-discovery consulting firm with offices in the Frick Building in downtown Pittsburgh, Pennsylvania. We work with companies, law firms, and other clients in Pittsburgh and throughout the country, providing e-discovery services such as collection of electronically stored information (“ESI”) for litigation, ESI processing (deduplification, de-NISTing, indexing), key word searching, testing of key words, hosted review tools, ESI document preparation and production. We also perform computer forensic investigations for use in civil and criminal cases. We are exclusively endorsed by the Allegheny County Bar Association as the provider of e-discovery and computer forensic services to its more than 6,500 members. Recently, we planned and prepared the materials, assembled a panel, and presented the four-hour Orientation Program for the E-discovery Special Masters (“EDSM”) admitted to the EDSM Program in the U.S. District Court for the Western District of Pennsylvania. We are interested in participating in the Workshop because we work with a wide variety of lawyers, from very small firms to very large firms. Virtually all of our e-discovery clients use key word searching as a means of identifying potentially relevant and responsive documents for litigation. bit-x-bit assists its client-litigants in selecting and refining key words in order to maximize the effectiveness of search terms in locating and recalling potentially relevant documents. We are interested in state of the art techniques to make key words and related search technology as precise and efficient as possible, so as to minimize review time and money spent on the review of irrelevant documents. As “predictive coding” techniques and software are developed, we are interested in considering such tools for our use, and the use of our clients, in setting standards for the case and project management techniques necessary to use effectively such software, and in the research, testing and data which supports the use of such tools and the defensibility of such tools. 1 Respectfully submitted, Susan A. Ardisson, Esq., CEO W. 
Scott Ardisson, President Joseph Decker, Esq., Vice President and General Counsel 2 A Perfect Storm for Pessimism: Converging Technologies, Cost and Standardization Topics: Information Retrieval, Data Mining, Machine Learning (Supervised and Unsupervised), eDiscovery, Information Governance, Information Management, Risk Assessment, Computational Intelligence Author: Cody Bennett Company: TCDI - http://www.tcdi.com If some vendors’ proclamations are to be believed, realizations of next generation self-learning technologies geared towards text retrieval, language understanding and real time risk assessments are being fulfilled. But the knowledge experts in charge of assisting these platforms need to be aware of exiguous claims. If standardization is going to occur on a matrix of such complex systems, they need to occur on reality, not hype. The amount of information will grow vastly while storage costs become subdued increasing the need for computational technologies to offset the very large costs associated with knowledge workers. This paradigm shift signals a mandatory call for smarter information systems, both automated and semiautomated. But imperfections in technology systems (arguably lower than human mistakes) require critical focus on workflows, modularity and auditing. And although linear systems have improved efficiencies 1 through the use of virtualization, they still do not approach a lateral learning mechanism . While they have begun to break into multi-tenant super-computing capacity, software on these systems is still statistical and rules-based, hardly approaching anything “thinking” – encompassing decades-old algorithms stigmatized by “Artificial Intelligence”. Further, the cost prohibitive business model reliant upon a single technology becomes a landmine. Technology in the space of Information Retrieval and Machine Learning are moving targets, and formal standardizations may be quickly outmoded. While there are attempts to use previous and ongoing research alongside industrial search studies performed to classify and understand the limitations of each search 2 model , use of hybridization and the underlying platforms / architectures facilitating multiple types of search techniques should be a target for which Information Management systems strive. For eDiscovery, vendors should be prepared to harness multiple search capabilities as courtrooms over time mold what is accepted as “standard”. Focusing on a single methodology when coupled with automated systems hampers recall – IBM’s Watson and Intelligence organizations prove that hybridized multimodal search and brute force NLP based directed probabilistic query expansion are interesting because of combinations in Information Retrieval, Data Mining and Machine Learning. How do you standardize upon the algorithms entrenched in systems that are constantly in flux? Do only systems with little or no entropy deserve standardization? Use of multimodal search is becoming fashionably effective in tandem with automation. Machine Learning methods utilizing hybrid approaches to maximize historically divergent search paradigms are capable of producing multiple viewpoints based on different algorithms, maximizing return on implementations such as predictive coding, active “tripwire” systems and next-generation risk assessment. In eDiscovery, multiple modeling viewpoints can help augment linguistic spread of features necessary to defensibly identify varying degrees of responsiveness. 
An example would be the improvement for the eDiscovery process using active learning when conducting initial discovery and query expansion / extrapolation in the beginning phases of 3 Request for Production . With both Information Retrieval and Machine Learning, transparency in the methods and a heavy breakdown of the algorithms used will be required. This transparency assists Information Governance, defensible methods for legal, and quality assurance for knowledge workers. This prognostication may be similar to the inevitability of eDiscovery certification in bar exams. While it may not be necessary for legal to understand the full complexities of the underlying search technology or automated algorithm, it should be required to ascertain and request certifiable tests meeting standardized thresholds on retrieval software and learning systems especially in comparison with human counterparts. These standards not only directly affect industry and researches in academia, but legal teams who may view such technology as foreign. Legal in the realm of Information Governance will become the centrality for delivering the dos and don’ts of information in the corporation, in partnership with the CIO / IT, and possibly as oversight. More robust search algorithms and sophistication in automated apparatuses allow more document discovery to be performed. While it could be argued by legacy eDiscovery review shops that such systems displace 1 See Lateral learning and the differentiators of Artificial and Computational Intelligence 2 NIST TREC, Precision, Recall, F-measure, ROC 3 http://trec.nist.gov/pubs/trec19/papers/bennett.cody.LEGAL.rev.pdf workers, the resulting outcome will be more time for their expertise to focus on larger data sets and cases. The technology tools also allow new forms of discovery. During litigation, if both counsels are using automated methods, expect different forms of data mining and statistical modeling to look for fringe data; Information Governance becomes critically important because signposts to documents that were not produced may become evident. It also puts the onus on the automated systems. Though, even while precision, speed and capacity may massively increase, the chance of sanctions should increase less dynamically dependant upon the unknowns of the output. In review, knowing that automated coding will always make the same calls if the parameters and data remain the same may be comforting. But the hive instinct of a group of humans making judgments on the fly is tempered when replaced by the efficiency. Are vendors willing to champion their products against comparisons of human reviewers in standardized sessions? Are they willing to “open up the hood” for transparency? Along with the many previous buzzwords, possibly the biggest is “Cloud”. Information Management, Cloud and semi / automated eDiscovery provide historically high potential for low cost, immediate, real-time view into the information cycle. Which means, not only will businesses entertain cloud services, but because of lower cost, less worry about infrastructure, and touted uptime, they will be able to search and store more information as they adhere to rules for retention and preservation. Whether a public or private cloud or some hybrid, this growth of searchable data will necessitate further automation of processes in Information Governance and solidification of the underlying framework – policies, procedures and standards beyond search of information. 
The standardization for Clouds may be best lead by the Government and related agencies. Cost of Government is under heavy scrutiny and current endeavors are occurring to facilitate the movement of Government into the Cloud. Cloud infrastructure, believing the hype, will structurally allow the computing 4 capacity needed for today’s brute force systems and experimental Computational Intelligence et al . This intriguing ability to perform massive calculations per second with elasticity is a lowly feature compared to the perceived cost savings which currently drives the interest for mid to large sized entities; public clouds like Microsoft, Amazon and Salesforce.com currently among the most popular. Although, for eDiscovery, the cost of demanding and actually acquiring documents from geographically disparate locations may produce a haven for sanctions. More ominously, if mission critical systems become cloud based, could critical 5 infrastructure (industry, state, and government) become even more exposed ? This architecture triangulation (Cloud + [Enterprise] Information Retrieval + Machine Learning) is either a Nirvāṇa or the Perfect Storm. Whatever viewpoint, the criticality is security. Providing a one stop shop for data leaks and loss, hack attacks, whistle blowing and thievery across geographically massive data sets of multitudes of business verticals combined with hybridized, highly effective automated systems designed to quickly gather precise information with very little input at the lowest possible cost is one CIO’s wish and one 6 Information Manager’s nightmare . Next generation systems will need to work hand in hand with sophisticated intrusion detection, new demands for data security and regulators across state and international boundaries – and hope for costs’ sake, that’s enough. Standardized security for different types of clouds was bluntly an afterthought to cost savings. 7 Finally, technology growth and acceptance while cyclic is probably more spiral ; it takes multiple iterations to conquer very complicated issues and for such iterations to stabilize. Standardizing Artificial, Computational and Hybrid Intelligence Systems is no different. The processes underneath these umbrella terms will require multiple standardization iterations to flesh out the bleeding edge into leading edge. It is possible that the entropy of such systems is so high that standardization is just not feasible. Where standardization can occur in the triangular contexts described above, expect it to follow similar structure as RFCs from the Internet 8 Engineering Task Force . Though, this will likely require heavy concessions and the potential unwillingness from industry on interoperability and transparency. 4 Pattern analysis, Neural Nets 5 A next generation Stuxnet, for example… 6 Not to mention, lawyers holding their breath in the background… 7 This type of cyclical information gain when graphed appears similar to Fibonacci (2D) and Lorentz (3D) spirals. 8 This makes sense due to the fact that data access and search has been spinning wildly into the foray of Internet dependence. May 2011 EDIG: E-Discovery & Information Governance The Williams Mullen Edge Why Document Review is Broken BY BENNETT B. BORDEN, MONICA MCCARROLL, MARK CORDOVER & SAM STRICKLAND1 The review of documents for responsiveness and privilege is widely perceived as the most expensive aspect of conducting litigation in the information age. Over the last several years, we have focused on determining why that is and how to fix it. 
We have found there are several factors that drive the costs of document review, all of which can be addressed with significant results. In this article, we move beyond costs and get to the real heart of the matter: document review is a “necessary evil” in the service of litigation, but its true value is rarely understood or realized in modern litigation. It was not always so. When the Federal Rules of Civil Procedure were first promulgated in 1938, they established a framework from the common law with respect to which discovery took place. But there was no fundamental change in how one conducted discovery of the comparatively few paper documents that comprised the evidence in most civil cases. There was no Facebook or even email at the time. Only later, when the sheer number of paper documents grew to a point where litigators needed help to get through them, and only later still when the electronic creation of documents became possible and then ubiquitous, did the “problem” of information inflation convert document review into a separate aspect of litigation, and one that accounted for a significant portion of the cost of litigation. There are three primary factors that drive the cost of document review: the volume of documents to be reviewed, the quality of the documents, and the review process itself. The volume of documents to be reviewed will vary from case to case, but can be reduced significantly by experienced counsel who understands the sources of potentially relevant documents and how to target them narrowly. This requires the technological ability to navigate computer systems and data repositories as well as the legal ability to obtain agreement with opposing counsel, the court or the regulator to establish proportional, targeted, iterative discovery protocols that meet the needs of the case. Because of the important work of The Sedona Conference® and other similar organizations, these techniques are better understood, if not always widely practiced.2 At some point, however, a corpus of documents will be identified that requires careful analysis, and how that “review” is conducted is largely an issue of combining skillful technique with powerful technology. In order to take advantage of all of the benefits this technology can provide, the format of the documents, the data and metadata, must be of sufficient quality. When the format of production is “dirty” (i.e., inconsistent, incomplete, etc.), you face a situation of “garbage in/garbage out.” For several reasons, “garbage” in this sense no longer suffices. As we discuss more fully below, the most advanced technology we have found uses all of the aspects of data and metadata to improve the efficiency (and thus reduce the cost) of the review process – and more. This means that the ESI must be obtained, whether from the client for its own review or from opposing counsel for the review of the opposing party’s documents, with sufficient metadata in sufficiently structured form to capitalize on the power of the technology. This requires counsel with technological and legal knowhow to obtain ESI in the proper format. Many negotiated ESI protocols have become long and complex, but they rarely include sufficiently detailed requirements concerning the format of documents, including sufficiently clean data and metadata such that the most powerful technologies can be properly leveraged. Without this, a great deal of efficiency is sacrificed. Williams Mullen EDIG Team Bennett B. 
Borden Co-Chair 804.420.6563 bborden@williamsmullen.com Monica McCarroll Co-Chair 804.420.6444 mmccarroll@williamsmullen.com Stephen E. Anthony 757.629.0631 santhony@williamsmullen.com Jonathan R. Bumgarner 919.981.4070 jbumgarner@williamsmullen.com W. Michael Holm 703.760.5225 mholm@williamsmullen.com William R. Poynter 757.473.5334 wpoynter@williamsmullen.com Brian C. Vick 919.981.4023 bvick@williamsmullen.com Lauren M. Wheeling 804.420.6590 lwheeling@williamsmullen.com Ada K. Wilson 919.325.4870 awilson@williamsmullen.com May 2011 • E-Discovery Once a corpus of documents has been identified and obtained in the proper format, the document review commences. This is where we have found the greatest inefficiency, and this is the primary area in which the most significant gains are possible. Our analysis of the typical review process leads us to conclude that the process is broken. By this we mean that, typically, document review is terribly inefficient and has been divorced from its primary purpose, to marshal the facts specific to a matter to prove a party’s claims or defenses and to lead to the just, speedy and inexpensive resolution of the matter. This disheartening conclusion led us to question whether document review could be completed efficiently and effectively within days or even hours so that a party could almost immediately know its position with respect to any claim or defense. That kind of document review could become an integral part of the overall litigation as well as the primary driver of its resolution. But document review has become an end unto itself, largely divorced from the rest of litigation. The typical review is structured so that either contract attorneys or low-level associates conduct a first level review, coding documents as responsive, non-responsive or privileged. Sometimes the responsive documents are further divided and coded into a few categories. But this sub-dividing is usually very basic and provides only the most general outline as to the subjects of the documents. Typically, a second level review is conducted by more senior associates to derive and organize the most important facts. Thus, every responsive document is reviewed at least twice, and usually several more times as the second level reviewers distill facts from the documents to organize them into case themes or deposition outlines that are finally presented to the decision makers (usually partners). This typical tiered review process is inherently inefficient and requires a great deal of time and effort. The most pressing question that arises in the beginning of a matter, “what happened?”, prompts the answer, “We’ll tell you in two (or three or six) months.” This multiplicitous review process leads to lost information in transfer, lost time, and the attendant increase in cost. The three standard categories (responsive, non-responsive and privileged) result in oversimplification because not all responsive documents are equally responsive. Add to inefficiency, then, simple misinformation. Is this avoidable? Document review became separated from the litigation process because of the increase in the volume of potentially relevant documents. With thousands or even millions of documents to review, law firms or clients typically threw bodies at the problem, hiring armies of contract attorneys to slog through the documents one by one in an inefficient linear process. The goal was simply to get through the corpus to meet production deadlines. 
But, if the whole point of document review is to discover, understand, marshal and present facts about what happened and why, then it is the facts derived from document review that drive the resolution of the matter. Thus, the entire discovery process should be tailored to this fundamental purpose. Part of this, as we have noted, must be accomplished through experienced counsel who understands what the case is about, what facts are needed, and how to narrowly and 2 proportionally get at them. The other key is to derive facts from the reviewed documents as quickly and efficiently as possible, and transfer the knowledge distilled from those facts to the decision makers in the most effective and efficient way. In short, document review should be returned to its rightful place as fact development in the service of litigation. In October 2010, we released an article entitled: The Demise of Linear Review, wherein we discussed the use of advanced review tools to create a more efficient review process.3 There, we showed that by using one of these advanced tools, and employing a nonlinear, topic-driven review approach, we were able to get through several reviews between four and six times faster than would be the case with less advanced tools using a typical linear review approach. Since then, we have focused on perfecting both the review application and our review processes. Our results follow below. The application used in the reviews described in The Demise of Linear Review was created by IT.com and is called IT-Discovery (ITD). The non-linear, topic-driven reviews were conducted by a team of attorneys led by Sam Strickland, who has since created a company called Strickland e-Review (SER), and the reviews were overseen by the Williams Mullen Electronic Discovery and Information Governance Section. The ITD application uses advanced machine learning technology to assist our SER reviewers in finding responsive documents quickly and efficiently, as we showed in The Demise of Linear Review. But we wanted to show not only that our topic-driven review process was faster, but also that it was qualitatively better than a typical linear review. Here we move into the area of whether humans doing linear review are in fact better than humans using computer-assisted techniques - not only for cost reduction but for improving the quality of results. We tested this and concluded that humans doing linear review produce significantly inferior results compared to computer-assisted review. To prove this, we obtained a corpus of 20,933 documents from an actual litigation matter. This corpus had been identified from a larger corpus using search terms. The documents were divided into batches and farmed out to attorneys who reviewed them in a linear process. That review took about 180 hours at a rate of about 116 documents per hour. The typical rate of review is about 50 documents per hour, so even this review was more efficient than is typical. Our analysis showed that this was because the corpus was identified by fairly targeted search terms, so the documents were more likely to be responsive. Also, the document requests were very broad, and there were no sub-codes applied to the responsive documents. Both of these factors led to a more efficient linear review. We then loaded the same 20,933 documents into the ITD application and reviewed them using our topic-driven processes with SER. This review took 18.5 hours at a rate of 1,131 documents per hour, almost ten times faster than the linear review. 
Obviously it is impossible for a reviewer to have seen every document at that rate of review, so we must question whether this method is defensible. To answer that question, it is important to distinguish between reviewing a document and reading it. May 2011 • E-Discovery Reviewing a document for responsiveness means, at the most fundamental level, that the document is recognized, through whatever means, as responsive or not. But this does not mean that the document has to be read by a human reviewer, if its responsiveness can otherwise be determined with certainty. The ITD application uses advanced machine learning technology to group documents into topics based upon content, metadata and the “social aspects” of the documents (who authored them, among whom were they distributed and so forth), as well as the more traditional co-occurrence of various tokens and forms of matrix reductions that constitute modern machine learning techniques to data mine text. Because of the granularity and cohesiveness of the topics created by ITD, the reviewers were able to make coding decisions on groups of documents. But more interestingly, these unsupervised-learning-derived topics aid in intelligent groupings of all sorts, so that a reviewer can “recognize” with certainty a large number of documents as a cohesive group. One can then code them uniformly. Does this mean that some documents were not “reviewed” in the sense that a reviewer actually viewed them and made an individual decision regarding their responsiveness? No. To understand this by analogy, think of identifying in a corpus all Google news alerts by the sender, “Google alert,” from almost any advanced search page in almost any review or ECA platform, in a context where none of these documents could be responsive. Every document was looked at as a group, in this case a group determined by the sender, and was coded as “non-responsive.” This technique is perfectly defensible and is done in nearly every review today. What we can do is extend this technique much deeper to apply it to all sorts of such groups and voilá: a non-linear review on steroids. Isn’t there some risk, however, that if every document isn’t read, privileged information is missed inadvertently and thus produced? Not if the non-linear review is conducted properly. Privileged communications occur only between or among specific individuals. Work product only can be produced by certain individuals. Part of skillful fact development is knowing who these individuals are and how to identify their data. The same holds true for trade secrets, personal information, or other sensitive data. The key is to craft non-linear review strategies that not only identify responsive information, but also protect sensitive and privileged information. We showed in our non-linear review that the SER reviewers, using the advanced technology of the ITD application, coded 20,933 documents in 1/10th of the time that it took the linear reviewers to do so. The question then becomes, how accurate were those coding decisions? To answer this question, we solicited the assistance of Maura R. Grossman and Gordon V. Cormack, Coordinators of 2010 TREC Legal Track, an international, interdisciplinary research effort aimed at objectively modeling the e-discovery review process to evaluate the efficacy of various search methodologies, sponsored by the National Institute of Standards and Technology. With their input, we designed a comparative analysis of the results of both the linear and non-linear reviews. 
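Before turning to the accuracy results, here is a minimal sketch of the group-level coding illustrated by the Google-alert example above: documents are grouped by sender and one coding decision is applied to an entire group whenever a rule exists, with everything else left for topic-driven review. The field names and rule table are hypothetical, not ITD's actual interface.

```python
from collections import defaultdict

def bulk_code_by_sender(documents, sender_rules):
    """Group documents by sender and code each whole group at once when a rule
    exists (e.g. every 'Google Alert' message is non-responsive).
    `documents` is a list of dicts with 'id' and 'sender' keys (hypothetical)."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["sender"]].append(doc["id"])
    coding, remaining = {}, []
    for sender, ids in groups.items():
        if sender in sender_rules:
            for doc_id in ids:
                coding[doc_id] = sender_rules[sender]   # one decision, many documents
        else:
            remaining.extend(ids)
    return coding, remaining

coding, remaining = bulk_code_by_sender(
    [{"id": 1, "sender": "Google Alert"}, {"id": 2, "sender": "ceo@example.com"}],
    {"Google Alert": "non-responsive"})
```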
First, we compared the coding of the two reviews and identified 2,235 instances where the coding of the documents conflicted between the two reviews. Those documents were then examined by a topic authority to determine what the correct coding should have been, without knowing how the documents were originally coded. Results: The topic authority agreed with the ITD/SER review coding 2,195 times out of 2,235, or 98.2% of the time. Not only was the ITD/SER review ten times faster, it resulted in the correct coding decision 99.8% of the time: the 40 disputed decisions that went against ITD/SER, measured over the full corpus of 20,933 documents and treating the undisputed codings as correct, amount to an error rate of roughly 0.2%. In nearly every instance where there was a dispute between the "read every document" approach of the linear review and our computer-assisted non-linear review, the non-linear review won out. Could this just be coincidence? Could it be that the SER reviewers are just smarter than the "traditional" reviewers? Or perhaps, as we believe, is the fundamental approach of human linear review using the most common review applications of today simply worse? The latter position has been well documented by Maura R. Grossman and Gordon V. Cormack, among others.4 The implication of this specific review, as well as those discussed in The Demise of Linear Review, is that with our ITD/SER review process we can get through a corpus of documents faster, cheaper and more accurately than with traditional linear review models. But, as we have noted, document review is not an end unto itself. Its purpose is to help identify, marshal and present facts that lead to the resolution of a matter. The following is a real-world example of how our better review process resulted in the resolution of a matter. We should point out that case results depend upon a variety of factors unique to each case, and that case results do not guarantee or predict a similar result in any future case. We represented a client who was being sued by a former employee in a whistleblower qui tam action. The client was a government contractor who, because of the False Claims Act allegations in the complaint, faced a bet-the-company situation. As soon as the complaint was unsealed, we identified about 60 custodians and placed them on litigation hold, along with appropriate noncustodial data sources. We then identified about 10 key custodians and focused on their data first. Time was of the essence because this case was on the Eastern District of Virginia's "Rocket Docket." We loaded the data related to the key custodians into the ITD platform and SER began its review before discovery was issued. We gained efficiency through the advanced technology of the ITD platform. We also gained efficiency by eliminating the need to review documents more than once to distill facts from them. Our review process includes capturing enough information about a document when it is first reviewed so that its facts are evident through the organizing and outlining features in ITD. This eliminated the need for a typical second- and even third-level review. Within four days, the SER reviewers could answer "what happened?". Soon thereafter, the nine reviewers completed the review of about 675,500 documents at a rate of 185 documents per hour. More importantly, within a very short time we knew precisely what the client's position was with respect to the claims made and had marshaled the facts in such a way as to use them in our negotiations with the opposing party, all before formal document requests had been served.
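The figures reported above can be reproduced with simple arithmetic; the only assumption is that documents whose coding did not conflict between the two reviews are counted as correctly coded.

```python
# Agreement on the 2,235 disputed documents and the implied corpus-wide rate.
corpus = 20_933
disputed = 2_235
agreed_with_itd = 2_195

dispute_agreement = agreed_with_itd / disputed      # ~0.982 -> 98.2%
errors = disputed - agreed_with_itd                 # 40 decisions went the other way
corpus_accuracy = (corpus - errors) / corpus        # ~0.998 -> 99.8%

# Review-speed comparison reported for the same corpus.
linear_hours, nonlinear_hours = 180, 18.5
speedup = linear_hours / nonlinear_hours            # ~9.7x, "almost ten times"

print(f"{dispute_agreement:.1%}, {corpus_accuracy:.1%}, {speedup:.1f}x")
```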
For more information about this topic, please contact the author or any member of the Williams Mullen EDiscovery Team. Knowing our position, we approached opposing counsel and began negotiating a settlement. We made a voluntary production of about 12,500 documents that laid out the parties’ positions, and walked opposing counsel through the documents, laying out all the facts. We were able to settle the case. All of this occurred after the production of only a small fraction of all the documents, without a single deposition taken, and at a small fraction of the cost that we had budgeted to take the case through trial. 1 This real-world example demonstrates the true power of “document review” when understood and executed properly. Fundamentally, nearly every litigation matter comes down to the questions of “what happened?” and “why?”. In this information age, the answers to those questions almost invariably reside in a company’s ESI, where its employees’ actions and decisions are evidenced and by which they are effectuated. The key to finding those answers is knowing how to narrowly target the necessary facts within the ESI. You then can use those facts to drive the resolution of the litigation. This requires the ability to reasonably and proportionally limit discovery to those sources of ESI most likely to contain key facts and the technological know-how to efficiently distill the key facts out of the vast volume of ESI. The typical linear document review process is broken. It no longer fulfills its key purpose: to identify, marshal and present the facts needed to resolve a matter. Its failure is legacy to the nature of how it came into being as the volume of documents became overwhelming. We believe we have found the right combination of technique and technology to return the process to its roots, resolving litigation. 4 Bennett B. Borden and Monica McCarroll are Chairs of Williams Mullen’s Electronic Discovery and Information Governance Section. Mark Cordover is CEO of IT.com. Sam Strickland is President of Strickland e-Review. 2 See, Bennett B. Borden, Monica McCarroll, Brian C. Vick & Lauren M. Wheeling, Four Years Later: How the 2006 Amendments to the Federal Rules Have Reshaped the E-Discovery Landscape and are Revitalizing the Civil Justice System, XVII RICH. J.L. & TECH. 10 (2011), http://jolt.richmond.edu/v17i3/article10.pdf. 3 See, The Demise of Linear Review, October 2010, http://www.williamsmullen.com/the-demise-oflinear-review-10-01-2010/ 4 Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), http://jolt.richmond.edu/v17i3/article11.pdf. Williams Mullen E-Discovery Alert. Copyright 2011. Williams Mullen. Editorial inquiries should be directed to Bennett B. Borden at 804.420.6563, bborden@williamsmullen.co m or Monica McCarroll at 804.420.6444, mmccarroll@williamsmullen. com. Williams Mullen E-Discovery Alert is provided as an educational service and is not meant to be and should not be construed as legal advice. Readers with particular needs on specific issues should retain the services of competent counsel. Planning for Variation and E-Discovery Costs By Macyl A. Burke President Eisenhower was fond of the quote, “In preparing for a battle I have found plans are useless, but planning is indispensable”. 
The same logic, creating a foundation and framework, can be applied effectually to the complex world of e-discovery, where every case brings its own uniqueness and quirks. That said, there are some check points that should be examined in triangulating the factors of cost, risk, and time that loom large in the world of complex litigation. Discovery is the most expensive piece of legal spend, with review as the most expensive element of discovery; discovery and review are estimated at over 80% of the cost by some sources. It is therefore logical to be thoughtful about the discovery process in general and review in particular.

Cost Calculations

We suggest the following points be analyzed and considered in the decision-making process for legal discovery costs:

• HOW DO THEY CHARGE? Examine the basis for your economic model: Look for cost models that are document or gigabyte based. Explore alternative cost models that are fixed and transparent. Integrated costing models that combine the cost of the law firm and the vendor on a fixed-price basis can be advantageous. The critical path is to achieve the lowest all-in cost, which includes the supervision of the law firm and their certification of the process; the line-item costs taken in and of themselves are not the critical path.

• WHAT DO THEY MEASURE? Know your enemy: You should know your blended hourly rate. The blended rate is the cost of the contract attorneys combined with the cost of the law firm to supervise and certify the review. This is usually the single largest cost in the discovery process and the largest available cost-saving opportunity in most projects. If you are not paying for the review by the document or gigabyte in an integrated cost model, be sure to understand the blended rate.

• WHAT TOOLS DO THEY HAVE? Practice MTCO: "Measure Twice; Cut Once" is a best practice. There are numerous methods of measurement that can be applied to your project to cut cost and improve results; just a few are blended rate, sampling, review rates, error rates, richness of the population, recall, precision, page counts, document counts, pages per document, etc. A vendor should have a tool box of metrics to explain and explore the best way forward. These metrics can be instrumental in reducing the size of the population to be reviewed in a reasonable, good-faith way and result in large direct savings in terms of all-in cost. Good metrics help foster innovation and are a catalyst for change.

• WHAT DO THEY REPORT? Use sampling over inspection: By all means necessary, avoid inspection as a quality assurance practice by the law firm in the review stage of discovery. Sampling is materially less expensive and produces a far superior result. Sampling will reduce the cost of your blended rate geometrically and provide more current and better results as to quality and how well the knowledge transfer has taken place. It allows for quick course correction and continuous improvement. Inspection is expensive rework that is not necessary.

• CAN THEY DO IT FOR LESS? Quality processes should lower cost: Understand the quality control and quality assurance processes being employed by the vendor and law firm. If they do not lower cost, they are not quality applications; they are cost increases. The lowered costs should be measurable and produce the desired results.

• CAN THEY DO MORE?
Select an integrated provider: The largest savings comes from reducing the discovery population to be reviewed as much as possible, in a reasonable and good-faith manner, and reviewing it with a high-quality, low-cost application. An integrated provider who offers processing, culling, hosting and first-cut review is in the best position to achieve these efficiencies. Using an integrated provider also allows for the growth and development of a strong partnership with the participating law firm.

• HOW DO I LEARN MORE? Take advantage of third-party information: Use proven quality processes from other sources. The Sedona Conference paper Achieving Quality in the e-Discovery Process is an excellent source, and it references numerous other sources rich in information. Ralph Losey's blog is highly useful, with a diverse set of contributors.

• Keep up to date on current developments: E-discovery is a fluid and dynamic field. The TREC 2009 Legal Track Interactive Task offers new insights into technology-assisted review. Additionally, proportionality and cooperation are emerging as important factors in the discovery process.

• Construct an audit trail and flow chart: Document carefully the processes and results across the whole of the EDRM. You want a record showing that your process was reasonable, in good faith, and proportional. Be sure to document ongoing changes; even your best planning will need to be adapted to emerging realities that could not have been anticipated or known.

Risk Evaluation

Variation in complexity, scale, cost, and risk is present in any system or process. Good quality is about reducing variation to acceptable and predictable levels that are confirmed by metrics. This is particularly true in the discovery process of litigation, which represents the bulk of the legal spend. In general there has been some price compression on discovery activities, but there is still a large amount of cost variation in the discovery process. The standard is that the discovery burden must be reasonable, in good faith, and proportional; by that standard, measuring and understanding the variation ensures compliance with the requirement. A few specific examples show the scope of the problem that variation presents in the discovery process.

In an actual case, we audited one firm using contract reviewers at a cost of $53 an hour for 1st Tier review, with law firm attorneys doing supervision and 2nd Tier review at $300-plus an hour, which achieved an all-in cost of $7 a document. In the same case, we found another firm using associates for 1st Tier review at a cost of $250 an hour and 2nd Tier review and supervision at $300-plus an hour, for an all-in cost of $20 a document. These were documents being reviewed in the same case with the same issues; the only difference was the cost and the process. In neither case were the results measured. Evaluation: There is significant variation between the two firms, and the costs do not include processing, hosting, or production. We participate in an alliance with a law firm that would offer an all-in price (collection, processing, hosting, 1st and 2nd Tier review, certification, and production), or total cost, of approximately $1.63 a document.

In another example, we audited a 1st Tier review at a cost of $0.05 per page. There was not an hourly rate or per-document rate involved.
There were approximately 50,000 documents with a page count of 7,400,000 pages, all of which were billed by the page. This works out to an average of around 150 pages per document. Evaluation: On a per-page basis at $0.05 per page, that turns out to be around $370,000. A more common per-document charge in the range of $0.70 to $1.10 a document works out to between $35,000 and $55,000.

Looking at various per-gigabyte all-in cost numbers, we find the variation is enormous. The average all-in cost (collection, processing, hosting, 1st Tier and 2nd Tier review, certification, and production) would be approximately $70,000 a gigabyte, plus or minus $35,000. In our view, with a good integrated process the all-in cost should be approximately $24,000 a gigabyte, plus or minus $2,000.

All of the pricing examples above offer large variations in outcomes. In planning, it is a good practice to measure from two different approaches and compare how close the results are to one another. We would recommend you examine several economic models for each project. Ask: how are we measuring results? What are the metrics? What is the cost per document, per gigabyte, and so on? The difference between pricing by the page and pricing by the document can be extreme.

Time Frame

The examples show the order of magnitude, variation, and the potential savings available in the cost of discovery. Eisenhower was also fond of the statement, "What is urgent is seldom important, and what is important is seldom urgent." In the hair-on-fire world of high-stakes litigation this is not the conventional wisdom. However, the spiraling cost of discovery, fueled by ever-increasing volumes of ESI (electronically stored information), should give us pause and prompt a hard look at process, variation, and measurement. Planning can be derailed or left incomplete by the drama of law and the press of events. The temptation to shortcut the planning activity should be avoided even if the urgency is great. Planning is too important to be co-opted by urgency. If a vendor offers a magic plan through the e-discovery maze, be wary. Only planning can prepare for variations, create awareness of alternatives, help discover pitfalls, and refine our goals.

The checkpoints given here are general concepts that can be expanded into more granular applications. The approach should be emergent, based on circumstance and need. It is basically a read-and-react scenario using concepts that may or may not be appropriate in a given circumstance. The suggestions are not new or radical. We are offering them as touch points to reduce costs, lower risk, and improve time. By no means are they a panacea or magic bullet for spiraling legal costs. We do suggest they are reasonable and good-faith questions that should be asked and answered.
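To make the pricing comparison above concrete, the following is a minimal back-of-the-envelope sketch in Python using the illustrative figures from the per-page example. It simply restates the arithmetic discussed in the text and is not a pricing model.

# Minimal sketch of the per-page vs. per-document pricing comparison
# discussed above, using the paper's illustrative figures.
documents = 50_000
pages = 7_400_000

price_per_page = 0.05                # audited per-page rate
price_per_doc_range = (0.70, 1.10)   # more common per-document range

pages_per_doc = pages / documents
per_page_total = pages * price_per_page
per_doc_totals = tuple(documents * p for p in price_per_doc_range)

print(f"Average pages per document: {pages_per_doc:.0f}")
print(f"Per-page billing total:     ${per_page_total:,.0f}")
print(f"Per-document billing total: ${per_doc_totals[0]:,.0f} to ${per_doc_totals[1]:,.0f}")

Running the sketch reproduces the roughly $370,000 per-page total against the $35,000 to $55,000 per-document range discussed above.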
April 2011

Semantic Search in E-Discovery
Research on the application of text mining and information retrieval for fact finding in regulatory investigations

David van Dijk, Hans Henseler
CREATE-IT Applied Research, Amsterdam University of Applied Sciences
Amsterdam, the Netherlands
d.van.dijk@hva.nl, j.henseler@hva.nl

Maarten de Rijke
Intelligent Systems Lab Amsterdam, University of Amsterdam
Amsterdam, the Netherlands
derijke@uva.nl

Abstract— For forensic accountants and lawyers, E-discovery is essential to support findings in order to prosecute organizations that are in violation of US, EU or national regulations. For instance, the EU aims to reduce all forms of corruption at every level, in all EU countries and institutions and even outside the EU. It also aims to prevent fraud by setting up EU anti-fraud offices, and it actively investigates and prosecutes violations of competition regulations. This position paper proposes to address the application of intelligent language processing to the field of e-discovery to improve the quality of review and discovery. The focus will be on semantic search, combining data-driven search technology with explicit structured knowledge through the extraction of aspects, topics, entities, events and relationships from unstructured information based on email messages and postings on discussion forums.

Keywords: E-Discovery, Semantic Search, Information Retrieval, Entity Extraction, Fact Extraction, EDRM

I. Introduction

Since the ICT revolution took off around 50 years ago, the storage of digital data has grown exponentially and is expected to double every 18 months [16]. Digital data became of crucial importance for the management of organizations. This data also turned out to be of significant value within the justice system. Nowadays digital forensic evidence is increasingly being used in court. The Socha-Gelbmann Report from 2006 shows that this kind of evidence was used in 60% of court cases [31]. The process of retrieving and securing digital forensic evidence is called electronic data discovery (E-Discovery). The E-Discovery Reference Model [8] gives an overview of the steps in the e-discovery process. The retrieval of information from large amounts of digital data is an important part of this process. Currently this step still involves a large amount of manual work done by experts, e.g., a number of lawyers searching for evidence in all e-mails of a company, which may include millions of documents [30]. This makes the retrieval of digital forensic evidence a very expensive and inefficient endeavor [24]. Digital data in E-Discovery processes can be either structured or unstructured. Structured data is typically stored in a relational database, and unstructured data in text documents, emails or multimedia files. Corporate Counsel [6] indicates that at least 50% of the material in the contemporary electronic discovery environment is in the form of e-mail or forum and collaboration platforms. Finding evidence in unstructured information is difficult, particularly when one does not know exactly what to look for. The need for better search tools and methods within the area is reflected in the rapid growth of the E-Discovery market [32,10], as well as in the growing research interest [34,15,29].
This paper positions the research that is carried out through joint work by CREATE-IT Applied Research at the Amsterdam University of Applied Sciences [7] and the Intelligent Systems Lab Amsterdam at the University of Amsterdam [17]. It focuses on the application of text mining and information retrieval to E-Discovery problems.

II. Text Mining and Information Retrieval

Information retrieval (IR) can be defined as the application of computer technology to acquire, organize, store, retrieve and distribute information [19]. Manning defines IR as finding material (usually documents) of an unstructured nature (usually text) from large collections (usually stored on computers) that satisfies an information need [23]. Text mining (TM), also called text analytics, is used to extract information from data through identification and exploration of interesting patterns [9]. In TM, the emphasis lies on recognizing patterns. TM and IR have a considerable overlap, and both make use of knowledge from fields such as Machine Learning, Natural Language Processing and Computational Linguistics.

Both TM and IR provide techniques useful in finding digital forensic evidence in large amounts of unstructured data in an automated way. The techniques can be used, for example, to extract entities, uncover aspects of and relationships between entities, and discover events related to these entities. The extracted information can be used as metadata to provide additional guidance in the processing and review steps in E-Discovery. Without such guidance and proper relevance ranking, plain full-text search in large volumes of data is of little use. Metadata can be used to support interactive drill-down search that is better suited to discovering new facts. Furthermore, information about entities and aspects makes it possible to retrieve facts about a person, such as what position he currently holds, what positions he has previously held and what is important about him. Information about relationships can be used to identify persons closely connected with each other, but also to identify which persons are strongly connected to specific locations or (trans)actions. Events related to an entity can help one extract temporal patterns.

III. Applications

The above techniques can be useful in many areas, both within and outside the domain of E-Discovery. Opportunities can be found in the areas of fraud, crime detection, sentiment mining (e.g., marketing), business intelligence, compliance, bankruptcies and, as one of the largest areas, e-discovery [27,12]. Large regulatory compliance investigations in the areas of anti-corruption and anti-trust offer excellent opportunities for text mining and information retrieval. Known techniques can be optimized and further developed to extract facts related to corruption and competition and to identify privileged and private information that should be excluded from the investigation. For the detection of competition law infringements one can look at how prices develop [4]. For finding corruption one could search for suspicious patterns in transactions between entities, e.g., clients and business partners. In determining confidential data one can think of social security numbers, correspondence between client and attorney, medical records, confidential business information, etc. But often it is not clear beforehand what is sought, and therefore techniques that make the information accessible and provide insights, so that a user can easily interact with it, are of interest.
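As a concrete illustration of the entity and relationship extraction described above, the sketch below runs an off-the-shelf named entity recognizer over a few invented message bodies and links co-occurring people and organizations into a small graph. It is purely illustrative: it assumes the spaCy and NetworkX libraries with spaCy's small English model installed, and it is not the authors' system.

# Illustrative sketch (not the authors' system): extract person/organization
# entities from message bodies with spaCy and build a simple co-occurrence
# graph with NetworkX, the kind of metadata that could guide review.
# Assumes: pip install spacy networkx && python -m spacy download en_core_web_sm
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

messages = [  # hypothetical e-mail bodies
    "John Smith asked Acme Corp to adjust the December pricing schedule.",
    "Maria Lopez forwarded the Acme Corp proposal to John Smith.",
]

graph = nx.Graph()
for text in messages:
    doc = nlp(text)
    ents = [e.text for e in doc.ents if e.label_ in ("PERSON", "ORG")]
    # store entities as document metadata; here we simply link co-occurring ones
    for i in range(len(ents)):
        for j in range(i + 1, len(ents)):
            u, v = ents[i], ents[j]
            w = graph.get_edge_data(u, v, {}).get("weight", 0)
            graph.add_edge(u, v, weight=w + 1)

# Degree centrality as a rough indicator of "key" entities in the collection.
if len(graph) > 1:
    ranked = sorted(nx.degree_centrality(graph).items(), key=lambda kv: -kv[1])
    print(ranked)

Metadata of this kind (entity lists per document, plus centrality scores over the collection) is the sort of structured guidance the paper proposes to combine with full-text search.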
The entities and relations retrieved by the aforementioned techniques can be made accessible to the user in various ways: either as additional metadata attached to documents, to be combined with full-text search, or as relational data in a separate system that can process questions in natural language (a question answering system). The former returns a list of documents in response; the latter can answer in natural language.

IV. Objective

Our research will focus on review and in particular on the search process. Generic search technology is not the answer. It focuses on high-precision results, where the top-ranked elements are of prime importance, whereas in forensic analysis and reconstruction all relevant traces should be found. In e-discovery, both recall and precision must be simultaneously optimized [26]. As a consequence, in e-discovery, the search process is typically iterative: queries are refined through multiple interactions with a search engine after inspection of intermediate results [5]. Analysts often formulate fairly specific theories about the documents that would be relevant, and they express those criteria in terms of more-or-less specific hypotheses about who communicated what to whom, where, and, to the extent possible, why [2]. Representing, either implicitly or explicitly, knowledge associated with analysts' relevance hypotheses so that an automated system can use it is of primary importance in addressing the key issue in e-discovery of how to identify relevant material [14]. Our research is aimed at providing analysts with more expressive tools for formulating exactly what they are looking for. In particular, our research questions are as follows:

RQ1: At least 50% of the material in today's e-discovery environment is in the form of e-mail or forum and collaboration platforms [6]. How can the context (such as thread structure or the participants' history) of email messages and forum postings be captured and effectively used for culling entire sets of messages and postings that do not answer the question posed?

RQ2: How can the diversity of issues that relate to the question posed be captured in a data-driven manner and presented to analysts so as to enable them to focus on specific aspects of the question?

RQ3: Social networks, graphs representing probable interactions and relations among a group of people, can enable analysts to infer which individuals most likely communicated information or had knowledge relevant to a query [28,13]. How can we effectively extract entities from e-mail messages and forum postings to automatically generate networks that help analysts identify key individuals?

RQ4: How can we semi-automatically identify the principal issues around the question posed? Creating an "information map" in the form of a domain-specific and context-specific lexicon will help improve the effectiveness of the iterative nature of the e-discovery process [36].

Based on typical user needs encountered in E-Discovery best practices, these research questions are situated at the interface of information retrieval and language technology. Answering them requires a combination of theoretical work (mainly algorithm development), experimental work (aimed at assessing the effectiveness of the algorithms developed) and applications (implementations of the algorithms will be released as open source).

V. Innovation

In recent years the field of information retrieval has diversified, bringing new challenges beyond the traditional text-based search problem.
Among these new paradigms is the field of semantic search, in which structured knowledge is used as a complement to text retrieval [25]. We intend to start a research project which pursues semantic search along two subprojects:

Subproject 1: integrating structured knowledge (discussion structure, topical structure as well as entities and relations) into information retrieval models;

Subproject 2: extracting structured knowledge from user generated content: entities, relations and lexical information.

We have requested funding for two PhD students, one for each of the two subprojects. Subproject 1 will primarily address RQ1 and RQ2. Subproject 2 will focus on RQ3 and RQ4. Work on RQ1 will start from earlier work at ISLA [35] and extend the models there with ranking principles based on thread structure and (language) models of the experience of participants in email exchanges and collaborative discussions. Work on RQ2 will take the query-specific diversity ranking method of [11], adapt it to (noisy) social media and complement it with labels to make the aspects identified interpretable for human consumption and usable for iterative query formulation. Work on RQ3 will focus on normalization, anchoring entities and relations to real-world counterparts as captured in structured information sources. This has proven to be a surprisingly hard problem [20]. So far, mostly rule-based approaches have been used in this setting; the project will break the problem down into a cascade of more fine-grained steps, some of which will be dealt with in a data-driven manner and some in a rule-based step, following the methodology laid down in [1]. Finally, in work on RQ4, principal issues in result sets of documents will be identified through semi-automatic lexicon creation based on bootstrapping, using the initial queries as seeds [21].

For all the questions described above we plan to conduct experiments in which we will implement our newly designed techniques and evaluate them by measuring commonly used metrics like precision and recall. By experimenting with different designs and evaluating them, we expect to reach the level of quality expected from these techniques. Evaluation will take place by participating in benchmarking events like TREC [33], CLEF [3] and INEX [18] and by cooperating with business organizations within the stated areas. As the aim of the TREC Legal Track [34] is to evaluate search tools and methods as they are used in the context of e-discovery, participating in this track seems an attractive way to start off our project. We will join the 2011 track with our first implementation, for which we will use the Lemur Language Modeling Toolkit [22], complemented with implementations of the lessons learned at ISLA in the work referenced above. The track will provide us with workable data, focus and a deadline, and it will give us a first evaluation of our work.

VI. Relevance for Quality in E-Discovery

This research derives its relevance for quality in E-Discovery from three factors. First, the research connects with the present (and growing) need for trained E-Discovery practitioners. Both national and international regulators and prosecutors are facing a large increase in the amount of digital information that needs to be processed as part of their investigations. Second, the research is relevant for legal processes, as it directly addresses evidential search.
The proceedings of their investigations impact in-house and outside legal counsel who are acting on behalf of companies that are under investigation. Intelligent language processing techniques can be a solution to effectively discover relevant information and to filter legally privileged information at the same time. This is not only a Dutch problem but also extends to international cases with US and EU regulators. Third, the research will result in (open source) web services that can be exploited in E-Discovery settings. For testing and development purposes, open sources and/or existing data sets are available.

The active involvement of E-Discovery practitioners will be realized through their participation in use case development, data selection and evaluation. We expect that this combination will increase the effectiveness and the quality of E-Discovery even as information volumes continue to explode.

REFERENCES

[1] Ahn, D., van Rantwijk, J., de Rijke, M. (2007) A Cascaded Machine Learning Approach to Interpreting Temporal Expressions. In: Proceedings NAACL-HLT 2007.
[2] Ashley, K.D., Bridewell, W. (2010) Emerging AI & Law approaches to automating analysis and retrieval of electronically stored information in discovery proceedings. Artificial Intelligence and Law, 18(4):311-320.
[3] CLEF: The Cross-Language Evaluation Forum, http://www.clefcampaign.org/
[4] Connor, John M. (2004) How high do cartels raise prices? Implications for reform of the antitrust sentencing guidelines. American Antitrust Institute, Working Paper.
[5] Conrad, J. (2010) E-Discovery revisited: the need for artificial intelligence beyond information retrieval. Artificial Intelligence and Law, 18(4):321-345.
[6] Corporate Counsel (2006) The American Bar Association (ABA), Section of Litigation, Committee on Corporate Counsel. http://www.abanet.org/litigation/committees/corporate/
[7] CREATE-IT Applied Research - Onderzoek/lectoren, http://www.createit.hva.nl/content/create-it-applied-research/onderzoeksprogrammas/
[8] EDRM: Electronic Discovery Reference Model, http://www.edrm.net/
[9] Feldman, R., Sanger, J. (2006) The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
[10] Gartner (2009) MarketScope for E-Discovery Software Product Vendors report 2009, http://www.gartner.com/DisplayDocument?id=1262421
[11] He, J., Meij, E., de Rijke, M. (2011) Result Diversification Based on Query-Specific Cluster Ranking. Journal of the American Society for Information Science and Technology, to appear.
[12] Henseler, J. (2010A) Openbare les E-Discovery: Op zoek naar de digitale waarheid. Amsterdam University of Applied Sciences.
[13] Henseler, J. (2010B) Network-based filtering for large email collections in E-Discovery. Artificial Intelligence and Law, 18(4):413-430.
[14] Hogan, C., Bauer, R., Brassil, D. (2010) Automation of legal sensemaking in e-discovery. Artificial Intelligence and Law, 18(4):321-345.
[15] ICAIL 2011: The Thirteenth International Conference on Artificial Intelligence and Law, http://www.law.pitt.edu/events/2011/06/icail2011-the-thirteenth-international-conference-on-artificial-intelligenceand-law/
[16] IDC (2007) Research report on the Information Explosion.
[17] ISLA: Intelligent Systems Lab Amsterdam, http://isla.science.uva.nl/
[18] INEX: Initiative for the Evaluation of XML Retrieval, http://www.inex.otago.ac.nz/
[19] Jackson, P., Moulinier, I. (2002) Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. Amsterdam: John Benjamins Publishing Company.
[20] Jijkoun, V., Khalid, M., Marx, M., de Rijke, M. (2008) Named Entity Normalization in User Generated Content. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (AND 2008), pages 23-30, ACM.
[21] Jijkoun, V., de Rijke, M., Weerkamp, W. (2010) Generating Focused Topic-specific Sentiment Lexicons. In: 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010).
[22] LEMUR: Language Modeling Toolkit, http://www.lemurproject.org/lemur/
[23] Manning, C.D., Raghavan, P., Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.
[24] Oard, D.W., Baron, J.R., Hedin, B., Lewis, D.D., Tomlinson, S. (2010) Evaluation of information retrieval for E-discovery. Artificial Intelligence and Law, 18(4):347-386.
[25] Pound, J., Mika, P., Zaragoza, H. (2010) Ad-hoc object retrieval in the web of data. In: WWW 2010, pp. 771-780.
[26] Rosenfeld, L., Morville, P. (2002) Information Architecture for the World Wide Web, 2nd edn. O'Reilly Media, Sebastopol.
[27] Scholtes, J.C. (2009) Text mining: de volgende stap in zoektechnologie. Inauguratie. Maastricht University.
[28] Schwartz, M.F., Wood, D.C.M. (1993) Discovering shared interests using graph analysis. Communications of the ACM 36:78-89.
[29] Sedona: The Sedona Conference, http://www.thesedonaconference.org/
[30] The Sedona Conference® Best Practices Commentary on Search & Retrieval Methods (2007). http://www.thesedonaconference.org/publications_html
[31] The 2006 Socha-Gelbmann Electronic Discovery Survey Report (2006). http://www.sochaconsulting.com/2006survey.htm/
[32] The 2008 Socha-Gelbmann Electronic Survey Report (2008). http://www.sochaconsulting.com/2008survey.php/
[33] TREC: Text REtrieval Conference, http://trec.nist.gov/
[34] TREC Legal, http://trec-legal.umiacs.umd.edu/
[35] Weerkamp, W., Balog, K., de Rijke, M. (2009) Using Contextual Information to Improve Search in Email Archives. In: 31st European Conference on Information Retrieval (ECIR 2009), LNCS 5478, pages 400-411.
[36] Zhao, F.C., Oard, D.W., Baron, J.R. (2009) Improving search effectiveness in the legal e-discovery process using relevance feedback. In: Proceedings of the Global E-Discovery/E-Disclosure Workshop on Electronically Stored Information in Discovery at the 12th International Conference on Artificial Intelligence and Law (ICAIL09 DESI Workshop). DESI Press, Barcelona.

Best Practices in Managed Document Review
February 2011

The key to minimizing the risks and maximizing the efficiency and effectiveness of the document review process is to construct, document and follow a defensible process based on best practices that reflect sound project management disciplines, good legal judgment, and counsel's specifications.

A Note on Legal Jurisdiction

This broad guide is intended to have specific application in discovery exercises in US jurisdictions, but also to inform the practice surrounding disclosure exercises in the UK and other former Commonwealth jurisdictions. Specific reference is made to recent revisions to the CPR regarding e-disclosure practice in the UK, in particular the new Practice Direction 31B (effective date October 1, 2010), which expands best-practice guidance for counsel engaged in litigation requiring electronic disclosure.
Pending developments in electronic discovery/disclosure rules and procedures in Australia should be expected to align all three jurisdictions in respect of several elements, the key one being the requirement of maintaining defensibility. In several respects, the UK Practice Direction incorporates principles of litigation readiness and e-disclosure best practice that have for far longer been the rule in the US under the e-discovery amendments to the Federal Rules of Civil Procedure (FRCP), adopted in December 2006, and state law analogues. Without going into detail, under both the Federal Rules and PD 31B, parties and their counsel are obligated to confer regarding disclosure of electronic documents, and to agree on the scope of discovery, the tools and techniques to be employed, and the specifications for production (exchange) of documents, all with an eye to ensuring cost-efficient and competent disclosure of relevant electronically stored information. And every process employed must be fully transparent and documented in order to contribute to a discovery exercise that is defensible in total.

Introduction

If there is a lot riding on the outcome of litigation, there is a lot riding on the manner in which discovery, and by extension document review, is conducted. Often we, both clients and counsel, think about conducting discovery and managing document review as necessary yet secondary concerns, a couple of steps in priority and glory beneath the higher calling of designing and implementing case strategy. Moreover, in the past several years, there has been an increasing focus on cost-containment in this phase of discovery, leading to growing interest in simple and expedient solutions. But we should not lose sight of the stakes involved. Defensibility must remain the governing principle; we should want an efficient and effective process that meets a reasonable standard and that can be defended. In the wake of Zubulake and its progeny, specifically, we recognize that a defensible process, well-conceived and executed, is imperative and minimizes risk. Accordingly, counsel should guard against undervaluing discovery as a process. Best practice principles must be extended to the context of discovery and document review. This paper outlines recommended best practices for managing document review – a basic best practice guide – having as its goals the design of an efficient, cost-effective and defensible workflow yielding consistent and correct work product.

As in the US, the adoption in the UK of a new definition of competent practice in the domain of e-disclosure can be daunting at first blush to many litigators, as it suggests a need for the lawyer to master a technological discipline somewhat alien to the traditional practice of law. Counsel is well-advised to seek competent e-discovery providers/partners to help navigate the e-disclosure landscape and to recommend processes and tools that have been proven in e-discovery practice.

Managed Document Review

There are, in basic construct, two standing industry models for outsourcing e-discovery document review projects – the managed review model and the staffing model. Engaging a managed review provider is readily distinguishable from having a staffing agency supply temporary labor, such as attorneys or paralegals, to perform review. In the latter case, a law firm or client specifies reviewer qualifications and the staffing agency locates and vets the reviewers and assembles the team.
But the staffing agency typically does little more than provide the raw workforce, which must be trained, monitored, and supervised by counsel. In some instances, the law firm or company also provides the necessary infrastructure – physical space, technology systems, security controls, etc. – to support the review. Because staffing agencies don't assume responsibility for managing or governing the process, the law firm or client is solely responsible for planning, review design, assignment workflow, training, process documentation, reporting, and validation of results.

A managed review provider typically provides a review team, facilities, technical support, and project management, and shares with counsel responsibility for managing an efficient and defensible process. In the best examples, the review provider, whether a full-service e-discovery vendor or a stand-alone review operation, collaborates with counsel in recommending an optimal project workflow. In addition, the review provider offers proven operational features, including complete metrics reporting, to assist counsel in overseeing and ensuring an efficient and effective discovery exercise, from kick-off through post-production.

Whatever choice counsel makes in selecting a review solution – whether review is conducted by associates, by a temporary staff of contract agency attorneys, or outsourced to a managed review provider – the solution should reflect an approach steeped in an understanding of applied best practices. The following sets forth a minimal, standardized framework which can and should be adapted to meet the needs of specific cases.

Planning and Project Management
• Ensure a project plan is tailored to the specifications of counsel and consistent with best practices
• Deliver a key set of documents that govern the execution and project management of the review process

Team Selection and Training
• Develop specific job descriptions and define a detailed protocol for recruiting, testing, and selection
• Conduct reference and background checks, and a conflicts check, where necessary
• Employ team members previously used on similar projects
• Ensure the review team receives comprehensive substantive and platform training

Workflow
• Design processes, assignments and quality assurance steps specifically geared to the project's requirements
• Demonstrate compliance with key security and quality standards while maintaining acceptable pace

Quality Control
• Develop quality control processes to achieve key project goals
• Implement controls to manage privilege designation and preparation/validation of results for production
• Test first review work product using sampling, targeted re-review, and validation searches
• Employ formal statistics to ensure the highest quality end result
• Maintain performance tracking for all reviewers

Communication
• Develop a formal schedule of communications with counsel
• Calibrate initial review results, seeking counsel's guidance to confirm or correct results and to conform review protocol and training materials to insights gained

Reporting
• Deliver regular, comprehensive reports to monitor progress and quality and to assist counsel in managing the review process

Productions and Privilege Logs
• Isolate and validate producible documents for counsel's imprimatur
• Prepare privilege logs in accordance with specifications set by counsel

Post-Case
• Determine the need for documents to be placed in a repository for future or related litigation
• Document the process from collection through production and
assemble a comprehensive defensibility record

Best Practices

An effective document review team serves as a "force multiplier" that attempts, as closely as possible, to approximate the decisions that senior lawyers intimately familiar with the underlying case would themselves make if they had the time and opportunity to review each of the documents. A best-practices review establishes a construct in which counsel's guidance can be – and should be – assimilated into many discrete decisions across a team of reviewing attorneys, and those many discrete decisions can be calibrated to deliver consistent results.

Planning and Project Management

There are two key objectives to the discovery process. The first is to identify documents – the production set – relevant to the matter at hand and responsive to the discovery request(s), with privileged documents held apart. The second is to recognize and bring to the attention of counsel the subset of documents that warrant particular attention, either because they support the case or because they are likely to be used by opposing counsel and therefore merit a prepared response. Achieving these goals requires sound planning and project management tailored to the directives from counsel.

The outcome of the planning process should be a set of documents that govern the execution and project management of the review process. These documents ensure that all the key elements of the project have been discussed and specify all decisions, tasks and approaches. The planning stage documents should include:

• Protocol plan
• Comprehensive project management manual
• Privilege review guidance notes
• Sample reports

The protocol plan documents the background and procedures for reviewing documents in connection with the specified litigation – it is a roadmap for the review team. A protocol plan typically includes a backgrounder to provide context for the review exercise (with information regarding the underlying litigation and a high-level statement of the objectives of the review). Additionally, it includes detailed document review guidance, including a description of and examples of what constitutes relevance or responsiveness; how broadly or narrowly privilege is to be defined for purposes of review; what information or content is to be designated "confidential;" a primer on substantive issues that are required to be identified; and how other materials are to be treated, including documents that cannot be reviewed ("technical defects") and foreign language documents. The protocol also lays out the schematic or "decision tree" and procedures for how issues or questions are to be raised among the team members and with counsel.

The project management manual includes the review protocol and also lays out operational elements for the project, including: review scope; timeline; deliverables; staffing, including team structure and job responsibilities, training, and work schedules; productivity plan; workflow; identification and features of the review application/platform; a quality control plan; feedback loops; query resolution process; communication plans, including reporting, validation methodology and final privilege review; and key project contacts, project closing procedures, and security protocols.
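The protocol plan's "decision tree" mentioned above can be thought of as a small set of explicit coding rules. The sketch below is purely illustrative – the tags, criteria and field names are invented, not drawn from any actual protocol – but it shows how such a schematic can be written down unambiguously.

# Illustrative only: a review protocol's "decision tree" expressed as explicit
# coding rules. Tag names and criteria are invented examples, not an actual protocol.
from dataclasses import dataclass

@dataclass
class Document:
    matches_request: bool      # hypothetical determinations a reviewer (or tool) supplies
    mentions_attorney: bool
    cannot_be_opened: bool

def code_document(doc: Document) -> list:
    tags = []
    if doc.cannot_be_opened:
        return ["TECHNICAL_DEFECT"]            # route to exception handling
    if not doc.matches_request:
        return ["NON_RESPONSIVE"]
    tags.append("RESPONSIVE")
    if doc.mentions_attorney:
        tags.append("POTENTIALLY_PRIVILEGED")  # escalate to privilege review
    return tags

print(code_document(Document(matches_request=True, mentions_attorney=True, cannot_be_opened=False)))

Writing the rules down in this form makes it easier to train reviewers consistently and to test coding decisions against the protocol later.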
Privilege review guidance notes summarize guidance for the privilege review process and should cover the following areas: an overview of reviewer roles, guidance on categories and the scope of privilege, guidance on accurate coding, and privilege logging of documents.

Sample reports provide counsel with examples of the reports that will be routinely delivered throughout the project. This is important to ensure up-front agreement on all reporting requirements.

Team Selection and Training

Assembling a review team entails formulating specific job descriptions, identifying the associated skill sets based on the parameters of the engagement, and defining a protocol for recruiting, testing, and selection. The process should reflect relevant regulatory requirements and guidelines such as those set forth in ABA Formal Opinion 08-451, which states: "At a minimum, a lawyer outsourcing services … should consider conducting reference checks and investigating the background of the lawyer or nonlawyer providing the services … The lawyer also might consider interviewing the principal lawyers, if any, involved in the project, among other things assessing their educational background."

The level of training and experience of the review team is contingent upon the described task set. For example, a review for the purpose of redacting personal or confidential information may require limited legal training, and may be delegated to teams of paralegals under a lawyer's supervision. Other reviews require an exercise of judgment or discretion wisely entrusted to teams of qualified junior lawyers, or even elements of substantive legal knowledge within the purview of the most highly trained and experienced attorneys. It is expected that all team members will receive thorough substantive training from counsel and an orientation or re-orientation to the selected review platform (application) prior to the commencement of each review.

Early review results should be reported in detail to counsel and detailed feedback sought. By proactively soliciting counsel's guidance on any reviewed documents on which reviewers raised a question, and by confirming or correcting coding decisions made early in the review, the team becomes progressively more closely aligned with counsel's instructions. Review protocols should be fine-tuned or expanded, as necessary, as additional guidance is received from counsel.

Workflow

Workflow design is a synthesis of science and art. How a reviewable population of documents is approached in review will determine both the efficiency and pace of review. Workflow on linear review platforms – using conceptual searching tools and clustering or similar technology – can be optimized by applying screens to a given document population, sorting into queues those documents having similar content or format from specified custodians, or isolating discussion threads. This can aid reviewers in making consistent calls more quickly. Additional techniques can be integrated into the process to speed review, including highlighting specific search terms within documents and segregating potentially privileged documents for focused review. Other techniques can be applied to ensure accuracy of review, such as employing a mix of sampling and validation searches and targeted re-review of reviewed documents, sampling of results by counsel, and employing a formalized query resolution process that requires counsel to formulate specific answers to questions in writing.

Workflow design includes the review tagging structure, incorporating desired behaviors and constraints for individual tags. Consideration must also be given to the preferred treatment of document families, confidential or personal information, and whether redactions need to be applied. Related issues include attention to data integrity and security protocols to be followed during review and on the review platform.

Quality Control

Any endeavor involving human effort, employing tools designed by humans, is inherently prone to error. Therefore, the standard for discovery, or indeed for execution of any legal service, is not perfection. Rather, work product is expected to be correct within tolerances defined at times as consistent with diligent effort and the exercise of reasonable care. The dividing line between inadvertent error and culpable error or wanton carelessness lies in whether reasonable care was exercised in avoiding and detecting such errors. For a specific example, Federal Rule of Evidence 502(b) provides that inadvertent disclosure of privileged material will not result in waiver where the holder of the privilege (through counsel) "took reasonable steps to prevent disclosure" and "promptly took reasonable steps to rectify the error." So one question is, how do we define reasonable steps [to prevent disclosure of privileged material] and, more broadly, reasonable care, in the context of document review? Reasonable care, in this context, equates to what we call "defensible" – and requires, at a minimum, an intelligently designed suite of quality control measures matched to rigorous training, performance measurement, and reporting. A very capable quality control regime includes:

• Intelligent validation of results to ensure the set of reviewable data has been reviewed in its entirety by the appropriate reviewers
• Targeted review to detect potential errors and to identify materials requiring further review
• Targeted review to isolate from the production set all privileged documents

Quality controls should be implemented in at least two key areas: privilege designation and validation of presumptive production sets. Review results should be "tested" and determined to be:

• Consistent across the entire data set and team, across multiple phases of a project, and with protocol treatment for families, duplicates, etc.
• Correct, in that the work product meets parameters for relevance, privilege, confidentiality, and issue coding, and that all potential privilege has been identified

There are significant challenges in designing and executing a rigorous and effective quality control regime. Where sampling is relied upon, there may be reason to employ statistical methods in order to identify statistically sound and representative random samples of a document population for re-review (a minimal sampling sketch appears after the Conclusion below). The most effective and, arguably, more defensible approaches combine sampling with intelligently targeted quality control elements to identify documents meriting a second level of review, and also solicit continuous input from counsel to calibrate the review team. All quality control elements should be designed with counsel's input and documented.

Communication

Best practices mandate developing a formal and regular schedule of reporting and communications among the review team, its managers, and supervising counsel throughout the process.
During ramp-up, communications should be geared to ensure that supervising counsel is available to help confirm review guidelines and answer reviewer questions. A schedule of regular calls should be established to review progress and any issues. A best-practices communication plan will also document points of contact, escalation processes, and appropriate means of communication.

Reporting

Reporting is a key element of the review process and is the primary means by which counsel is presented with information necessary to assess, in real time, whether a review is on track and on pace, how accurate the results are, the breakout of designations made for documents reviewed thus far, and the number of interesting ("hot") or problematic documents. Review reports, issued at agreed-upon intervals, deliver invaluable information on productivity, accuracy, operational issues, technical issues, team structure, folders released, and other requested metrics. Good systems can now generate reports containing these and other data points automatically. Best practice requires, of course, that the review vendor and counsel actually read the reports and act on the information gained.

Production and Privilege Logs

Where production is to be made to an adversary or requesting agency, best practices require counsel and vendor to agree well ahead of time on production specifications (principally, format and included fields) and procedures. The provider handling processing and hosting of reviewable documents should provide to counsel a comprehensive production log, cross-referencing production ID numbers (Bates numbers) to document ID numbers on the review platform and correlated to the original data collection. Privileged and redacted or withheld documents ordinarily would be logged by the review team or its managers, with the format and content of each log also having been agreed upon ahead of time. Final logs (and final privilege determinations) should be reviewed by counsel prior to production.

Post-Case/Documenting the Process

Counsel should determine early in the process whether some or all documents should be maintained in a repository for future or related litigation, and the necessary arrangements should be made with the responsible vendor. An advantage that can be gained through using a repository is that, once made, final privilege designations can be preserved if the same dataset is subject to future or related litigation discovery. As a final element of best practices, counsel and the vendors involved in all aspects of a discovery exercise, specifically including review, should assemble a complete documentary record of the discovery process, including specifications of the collection, processing, review, and production(s). Such a record, which we refer to as a "defensibility binder," is a valuable tool for counsel as a historical record to answer questions raised at a later date and as a means of demonstrating that discovery was undertaken with diligence and reasonable care.

Conclusion

Document review is a critical, resource-intensive component of the e-discovery process that, in order to be successful, requires active and competent project management, following a suite of well-designed processes that reflect relevant and agreed-upon best practices. The result is the timely and cost-effective delivery of defensible work product that facilitates the overall litigation process and enhances the favorability of its outcome.
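Referring back to the Quality Control discussion above, the following is a minimal sketch of drawing a statistically grounded QC sample. It uses the standard normal-approximation sample-size formula with a finite population correction; the population size, confidence level and margin of error are illustrative assumptions, not a prescribed standard.

# Minimal sketch of statistically sampling reviewed documents for QC re-review.
# The population size, confidence level and margin of error are illustrative.
import math
import random

def sample_size(population: int, confidence_z: float = 1.96,
                margin_of_error: float = 0.05, p: float = 0.5) -> int:
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)     # finite population correction
    return math.ceil(n)

reviewed_doc_ids = list(range(1, 120_001))   # hypothetical 120,000 reviewed documents
n = sample_size(len(reviewed_doc_ids))       # 383 at 95% confidence, +/-5% margin
qc_sample = random.sample(reviewed_doc_ids, n)

print(f"Re-review {n} of {len(reviewed_doc_ids)} documents, e.g. IDs {qc_sample[:5]} ...")

In practice the confidence level, margin of error and any stratification (for example, by reviewer or by tag) would be agreed with counsel and documented as part of the quality control plan.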
ABOUT INTEGREON

Integreon is the largest and most trusted provider of integrated e-discovery, legal, research and business solutions to law firms and corporations. We offer a best-in-class managed review solution designed to deliver defensible work product at reasonable cost by designing cost-efficient and effective methods and applying intelligent processes that define best practice. Our review capability is global and our domain experience is substantial. Learn more at www.integreon.com

For more information contact: Foster Gibbons (foster.gibbons@integreon.com), Eric Feistel (eric.feistel@integreon.com), www.integreon.com

Copyright © 2011 by Integreon. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means — electronic, mechanical, photocopying, recording, or otherwise — without the permission of Integreon.

Searches Without Borders
By Curtis Heckman, eDiscovery Associate, Orrick, Herrington & Sutcliffe LLP

Even when working with native language documents, collection and production searches can be time-consuming and challenging. A multi-lingual environment introduces additional complexity to searching and therefore exposes the litigant to a greater risk of human and technology-based errors. This paper discusses the challenges of multi-lingual searches and provides simple recommendations for crafting defensible searches.

Computers were originally developed on an English-based system. Thus, the original assignment of alphabetic characters for computers was made in favor of Latin letters. Soon after, engineers developed computers using non-Latin alphabets. For a computer's purposes, a number is assigned to each alphabetic character; these character-to-number mappings are referred to as code pages. Communication between computers "speaking" languages based in different alphabets can become muddled beneath the surface as a result of their different code pages. Even different versions of the same software can vary in the interpretation and translation of non-Latin characters. Keyword searches that work in the language environment in which they were created may, despite looking identical on screen, miss documents created using a different alphabet.

Accordingly, a comprehensive and accurate search may require that a litigant consider the original language environment of potentially relevant documents. Each custodian's software and hardware, and even the alphabet used to type a keyword, significantly impacts the accuracy of the search. If a litigant knows or anticipates that its universe of potentially relevant documents contains information created in more than one alphabetic system, that party should consider using keyword variations in its search protocol. The best way to ensure a defensible search is to have a quality assurance system in place to test the results of the search and collection methods at the outset.

• The Sedona Conference recommends that parties evaluate the outcome of each search: "key metrics, such as the number of included and excluded documents by keyword or filtering criteria, can be used to evaluate the outcome."1

• Ask questions during the initial custodian interviews to identify the language(s) used for both formal and informal communications. Also identify alternative alphabetic characters on the custodian's keyboard, and what, if any, type of software the custodian used to type in the computer's non-primary language.
• Question the Information Technology department and any discovery vendor regarding the ability of proposed search engines to account for different character sets and code pages.

• If the production or review is to involve translations, determine whether the translation is to be a machine-based translation and whether the translator has the capability to differentiate intermingled character sets.

• Document the entire process underlying the determination and deployment of key words.

A receiving party should also be fully prepared to discuss the producing party's search obligations if multiple alphabetic systems are anticipated:

• Consider meeting with a consultant or expert who understands multi-language searches and productions prior to the initial meet and confer.

• Negotiate the parameters of the search at the meet and confer. Clearly identify and convey your production expectations.

• If no agreement is reached and the litigation moves to motions practice, have an expert provide technical affidavits regarding searching and production in multi-system environments.

Litigation imposes significant difficulties on international corporations, not the least of which is multi-lingual searches. Knowing the challenges before tackling cross-border searching and documenting each step establishes reasonableness and defensibility.

1 Jason R. Baron et al., The Sedona Conference: Commentary on Achieving Quality in the E-Discovery Process (May 2009) at 15.

Submission by: Curtis Heckman of Orrick, Herrington & Sutcliffe, LLP. Mr. Heckman is a member of Orrick's eDiscovery working group, which is Orrick's practice group devoted to eDiscovery. He is resident at Orrick's Global Operations Center ("GOC") in Wheeling, WV and can be reached at (304) 231-2645 or checkman@orrick.com. The GOC is Orrick's on-shore outsourcing center and home of Orrick's Document Review Services and Data Management Group. Full biographies and a description of the practice group are available at http://www.orrick.com/practices/ediscovery.

Position: The discovery process should account for iterative search strategy.
By Logan Herlinger and Jennifer Fiorentino, eDiscovery Associates, Orrick, Herrington & Sutcliffe LLP

Developing an informed search strategy in litigation, whether it involves key terms, technology, data sources, document types or date ranges, is an iterative process. Unfortunately, many judges and attorneys still approach discovery with outdated perceptions of search methodology. Litigants should not be permitted to submit a keyword list of fifty terms for the opposing party to use in their document review and production, and expect that it will define the scope of discovery. Nor should they be allowed to dictate which data sources the other side should search. Members of the legal community need more education on recent advancements in search strategy to appropriately define the scope of relevant information. An informed search strategy leads to more efficient and accurate responses in discovery, but the parties must meet and confer often and in a timely fashion to take advantage of these advancements.

For years, members of the legal profession have called attention to the problems inherent in the use of keyword search terms. Attorneys may draft search terms too narrowly or too broadly. Some terms may unexpectedly yield a large number of hits, such as a term that is automatically generated in every company email signature.
These problems exist because there is significant variance in the way individuals use language, and it is impossible to fully predict that use when first drafting search terms. Similarly, attorneys must focus their keyword searches on the appropriate data sources. Otherwise, they waste time and money on irrelevant information. To address these issues, parties need to test the search terms' performance repeatedly in an iterative process to determine how they interact with the data set at issue. As data sets increase in size, search term problems are exacerbated, and more time must be spent testing the terms to determine their effectiveness. Cooperation between the parties is essential to make this process as efficient and effective as possible. Attorneys must take the time to become familiar with their client's data, conducting custodian interviews to learn about unique uses of language and to determine where relevant data is most likely to reside. Each side should bring this knowledge to the meet and confer. The parties should discuss and agree upon an initial search strategy. Thereafter, the parties should test the chosen methodology and meet often and in a timely fashion to discuss the results. Because of the disparity in understanding of search methodology amongst judges and attorneys, it is often appropriate to include a third-party data analytics provider to assist in the process and, if needed, defend its use before the court.

In order to effectively implement an iterative search strategy, the courts and parties must account for the time involved in testing, meeting, and applying the search terms in the document review process. It is widely recognized that standard linear document review is the number one driver of discovery cost and time. While an iterative search strategy takes time upfront, its use should yield a focused and accurate data set for attorney review, which significantly limits the review costs and increases efficiency in responding to discovery requests.

There is prior literature that begins to address this position. In 2008, The Sedona Conference published the "Cooperation Proclamation," which addresses the increasing burden that pre-trial discovery places on the American judicial system and champions a paradigm shift in the way the legal community approaches discovery. See The Sedona Conference, The Sedona Conference Cooperation Proclamation (2008), http://www.thesedonaconference.org. Although still zealously advocating for their clients, attorneys should cooperate as much as possible with the opposing party on discovery issues, particularly those that relate to ESI. This reduces costs for the clients and allows the attorneys to focus on the substantive legal matter(s) at issue, which in turn advances the best interests of the client.

Along the same lines, there has recently been a push for proportionality and phased approaches to eDiscovery. The Sedona Conference Commentary on Proportionality in Electronic Discovery says that it may be "appropriate to conduct discovery in phases, starting with discovery of clearly relevant information located in the most accessible and least expensive sources." See The Sedona Conference, The Sedona Conference Commentary on Proportionality in Electronic Discovery at 297 (2010), http://www.thesedonaconference.org. This work supports the value of an iterative search strategy and of parties meeting and conferring often and in a timely fashion on the issue.
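The iterative testing described above is straightforward to operationalize: run the draft terms against the collection (or a sample of it), examine the hit counts, and revise before the next meet and confer. The sketch below is purely illustrative; the terms, documents and crude wildcard handling are invented for the example.

# Minimal sketch of iteratively testing draft keyword terms against a document
# sample; the terms and documents are invented for illustration.
import re

documents = {
    "doc-001": "Pricing update attached. Confidentiality notice: this email ...",
    "doc-002": "Lunch on Friday? Confidentiality notice: this email ...",
    "doc-003": "The rebate schedule for Q3 pricing is final.",
}

draft_terms = ["pricing", "rebate", "confidential*"]   # wildcard-style draft terms

def hits(term: str, text: str) -> bool:
    pattern = re.escape(term).replace(r"\*", r"\w*")   # crude wildcard handling
    return re.search(rf"\b{pattern}\b", text, re.IGNORECASE) is not None

for term in draft_terms:
    matched = [doc_id for doc_id, text in documents.items() if hits(term, text)]
    print(f"{term:15s} {len(matched)} of {len(documents)} docs: {matched}")
# A term like "confidential*" that hits nearly every document (boilerplate
# signatures) is a candidate for revision at the next meet and confer.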
Attorneys must be prepared to educate the judiciary on the importance of a fully developed search strategy. They have to explain that while the iterative search process and the subsequent meet-and-confer sessions take time, the cost savings and increased efficiency in pre-trial discovery are significant and well worth the effort.

Submission by: Jennifer Fiorentino and Logan Herlinger of Orrick, Herrington & Sutcliffe LLP. Ms. Fiorentino and Mr. Herlinger are members of Orrick's eDiscovery Working Group. Ms. Fiorentino is resident in Orrick's Washington, D.C. office, and can be reached at (202) 339-8608 or jfiorentino@orrick.com; Mr. Herlinger is resident at Orrick's Global Operations Center ("GOC") in Wheeling, W.V. and can be reached at (304) 234-3447 or lherlinger@orrick.com. The GOC is Orrick's on-shore outsourcing center and home of Orrick's Document Review Services and Data Management Group. Full biographies and a description of the practice group are available at http://www.orrick.com/practices/ediscovery.

Adaptable Search Standards for Optimal Search Solutions
Amanda Jones, Xerox Litigation Services, Amanda.Jones@xls.xerox.com
ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings – DESI IV Workshop – June 6, 2011, University of Pittsburgh, PA, USA

The goal of this year's DESI IV workshop is to explore setting standards for search in e-discovery. Xerox Litigation Services (XLS) strongly supports the effort to establish a clear consensus regarding essential attributes for any "quality process" in search or automated document classification. We believe in the principles of iterative development, statistical sampling and performance measurement, and the utilization of interdisciplinary teams to craft sound information retrieval strategies. These will strengthen virtually any search process. Still, XLS also recognizes that there is no single approach to search in e-discovery that will optimally address the needs and challenges of every case. Consequently, there cannot be a single set of quantitative performance measurements or prescribed search protocols that can reasonably be applied in every case. Instead, we agree with the authors of "Evaluation of Information Retrieval" (Oard et al. 2011) that the discussion of standards for search should concentrate on articulating adaptable principles, clear and concrete enough to guide e-discovery practitioners in designing search solutions that are well-motivated, thoroughly documented and appropriately quality-controlled, with the flexibility to allow creative workflows tailored to the goals and circumstances of each matter.

Because of the unique and complex challenges ever-present in search in e-discovery, XLS would contend that the key to designing successful search strategies is the ability to explore multiple perspectives and experiment with a variety of tactics. Countless factors influence the quality of automated search outcomes. Therefore, it will be vital to the advancement of search techniques to adopt standards that encourage research on the sources of variability in search performance and create the latitude that is needed for ongoing hypothesis-testing and midstream course correction.

One source of variability in text-based search performance that XLS has already identified and addressed is data type. Relevance is manifested in markedly different linguistic patterns across various types of documents.
So, XLS has elected to utilize distinct classification models for spreadsheet data, email data, and other text-based data for most projects. Developing and implementing distinct models for these three classes of data requires an additional investment of time and resources, but has consistently translated into significant performance gains for the population as a whole. So, it is the approach that we currently use to mitigate this source of performance variation and ensure the highest possible quality in our automated search results. Our research into this is continuing, though, and we are open to adopting a new equally effective and less labor-intensive tactic for managing linguistic variation acrossdata types. Xerox Litigation Services Both within and outside Xerox, research in machine learning, information retrieval, and statistical datamining is progressing rapidly. Thus, it is important to not only to devise creative solutions to known sources of variation in search performance, but also to have the freedom to explore the full potential of emerging automated search technologies. XLS is currently experimenting with ways to optimize search results by utilizing multiple techniques and technologies simultaneously, incorporating input from all sources that enhance the final results. In our observations, combining search tactics often leads to significantly higher performance metrics than can be achieved by any of the individual tactics alone. In one preliminary investigation across several matters, for example, we found that combining scores from one statistical algorithm applied to the metadata of a population with scores from a completely different statistical algorithm applied to the full text of the population consistently increased both precision and recall. Similarly, we have also found it constructive to treat certain responsive topics or data types within a project with one search technique while using alternative approaches for other topics or data sources. For example, by analyzing patterns of error generated by our statistical algorithms, it has been possible for us to identify opportunities to use highly targeted linguistic models to correct those errors in the final result set. In general, our experimentation with hybridized search strategies has proven extremely fruitful and there are many avenues of investigation left to pursue in this area. This is a major motivating factor behind XLS’s support of standards that would promote the novel application of any combination of available search resources, provided the efficacy of these applications were adequately demonstrated. Obtaining a better understanding of the limitations of various search techniques is just as important as exploring the potential of new search technologies, because the limitations will also engender adaptive search strategies. Any text-based automated classification system will be subject to certain dependencies and limitations. For example, achieving comprehensive coverage with a high degree of accuracy is often challenging for search systems that rely on linguistic patterns to identify responsive material when responsive documents are “rare events” in the data population – primarily because there are simply fewer examples of the language of interest available to generalize. So, each and every responsive document is more noticeably impactful in the final results and performance metrics. In a case like this, more data is generally needed to achieve high precision and recall. 
It is sometimes possible, though, to mitigate the need for additional data by utilizing linguistic and/or statistical approaches to increase the density of responsive material in a subset of the data population, thereby increasing access to responsive linguistic material for generalization (a brief illustrative sketch of this tactic follows this passage). Even then, though, it may require significant extra effort and ingenuity to ensure accurate and comprehensive coverage of the topic. Further, the rate of responsiveness in a population interacts in a complex way with the definition of the responsive topic itself to influence the level of difficulty that can be anticipated in the development of a successful search strategy and the extent to which special tactics will need to be pursued. While it is not often discussed in great detail, it is extremely important to consider the subject matter target for a case when assessing options for search strategy. The way in which responsiveness is articulated in requests for production can have a profound impact on search efficacy. For example, all of the following subject matter attributes will play a role in shaping the inherent level of difficulty in using automated search techniques to evaluate a population for a given topic:
• Degree of subjectivity – e.g., a request for production may specify that all "high level marketing strategy" documents should be produced, but an automated search approach will likely struggle to differentiate between documents that constitute "high level" discussions and those that represent "routine" marketing conversations
• Conditions on modality – e.g., a request for production may specify that all "non-public discussions of pricing" should be produced, but linguistic distinctions between private and public conversations often prove unreliable, causing automated approaches to confuse pricing discussions between corporate employees with similar discussions appearing in the media, etc.
• Linguistic variability – e.g., a request for production may specify that all "consumer product feedback" should be produced, but consumer feedback may touch upon any number of product features, may be positive or negative, may appear in formal reports or informal emails, and may be expressed in any number of unpredictable ways that could prove challenging for automated search systems to capture comprehensively
• Linguistic generalizability – e.g., a request for production may specify that all "negotiations with retailers" should be produced, but if the corporate entity routinely deals with thousands of retailers, it would be difficult, if not impossible, for an automated search system to successfully recognize the complete set of potentially relevant retailers and differentiate them from entities such as wholesalers or suppliers, etc.
• Conceptual coherence – e.g., a request for production may specify that all "discussions of product testing" should be produced, but if this is intended to include R&D testing, Quality Control testing and Market Research testing, then there will actually be three distinct concepts to capture, each with its own community of expert speakers with unique jargon and communication patterns, such that capturing all of these sub-topics equally successfully may challenge automated search systems
These factors interact not only with rate of responsiveness but also with one another to shape the target of the search effort.
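The density-boosting tactic mentioned at the start of this passage can be made concrete with a small sketch. The paper does not describe a specific implementation; the following is one hedged illustration, assuming documents are available as extracted text and that seed phrases (for example, drawn from custodian interviews) are used to oversample likely-responsive material. All function names, sizes, and ratios are hypothetical.

```python
import random

def build_enriched_training_sample(documents, seed_terms, target_size=2000,
                                    enriched_share=0.5, rng_seed=3):
    """Assemble a training sample with a higher density of likely-responsive text.

    documents: list of (doc_id, text) pairs already extracted to plain text.
    seed_terms: phrases thought to signal responsiveness for the rare topic.
    Documents hitting any seed term are oversampled so that a classifier
    trained on the sample sees more examples of otherwise rare language.
    """
    random.seed(rng_seed)
    hits, misses = [], []
    for doc_id, text in documents:
        lowered = text.lower()
        if any(term.lower() in lowered for term in seed_terms):
            hits.append((doc_id, text))
        else:
            misses.append((doc_id, text))
    n_hits = min(len(hits), int(target_size * enriched_share))
    n_misses = min(len(misses), target_size - n_hits)
    return random.sample(hits, n_hits) + random.sample(misses, n_misses)
```

In practice the enriched sample would still need to be reviewed and labeled, and any resulting model validated against an unenriched sample so that performance estimates are not inflated by the oversampling.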
Analyzing the subject matter of a case to identify attributes that may introduce difficulties for automated search will make it possible to devise methods for overcoming the challenges. There are, in fact, numerous options for coping with the various situations highlighted above. Sometimes the solution will be as simple as choosing one search technique over another. At other times, it may be most effective to collaborate with the attorney team to operationalize the definition of responsiveness to minimize the need for subjective interpretation or fine-grained subject matter distinctions. At other times, the best choice may be to create distinct models for the most critical subtopics in an especially wide-ranging request for production to ensure that they will receive ample effort and attention, reducing the risk of having their performance obscured by the search results for other more prevalent topics. Undertaking a preliminary subject matter analysis and consultation with the case team, along with early sampling and testing in the corpus, will typically enable the formulation of a Xerox Litigation Services project proposal that will provide value for the client while accommodating the realities of the search situation. Finally, while much of the above discussion has centered on the use of in-depth analysis and a multitude of search tactics to achieve the highest possible quality results, XLS acknowledges this level of analysis and investment of expert resources is not always feasible. In fact, it may simply be unreasonable given the practical constraints of the case or its proportional value to the primary stakeholders. Open and frequent communication with the attorney team and client for the matter will not only enhance the quality of the subject matter input for the project, but also afford them opportunities to contribute their invaluable expert opinions regarding the reasonableness of the search for the matter at hand. In sum, XLS adopts the position that search results in e-discovery should be judged relative to the goals that were established for the project and that the search process, rather than the technology alone, should be scrutinized. We recognize it would be advantageous to have a single concretely defined protocol and technology applicable to every matter to achieve high-quality results quickly, cheaply, and defensibly. However, it would be naïve to suggest the unique topics, timelines, resources, parties, data sources and budgetary constraints associated with each matter could all be treated successfully using the same search strategy or the same quantitative measures, especially when current technologies are in a state of growth and evolution. It does a disservice to both the complexity of the problem and to the value of human insight and innovation in tailoring custom solutions to adapt to specific needs. ISO 9001: A Foundation for E-Discovery As the e-discovery industry strives for common standards and practices, an ideal solution exists: ISO 9001. by Chris Knox, Chief Information Officer, IE Discovery and Scott Dawson, President of Core Business Solutions Executive Summary A lack of standards and common practices hampers the processing of Electronically Stored Information (ESI) for litigation. Industry thought leaders, as evidenced by recent work by the Sedona Conference, Text Retrieval Conference (TREC) Legal Track, as well as this DESI meeting, are actively seeking standards to define and manage the process of discovery. 
This is necessary for both competitive differentiation in the marketplace, as well as to satisfy a growing demand for transparency and documentation of the Discovery process from the judiciary. The legal profession has focused much of its energies seeking benchmarks and standards in the search process, but there is a need to be able to certify repeatable, defensible, and consistent business processes through the entire e-discovery process. In many of the ongoing industry discussions, the ISO 9000 standards family arises as one of the ideal models for how to provide certification and standardization. In fact, we believe that the ISO 9000 family of standards is not just a model, but is ready today to provide a common standard of quality for e-discovery. In addition, the model provides a framework for an industry-specific solution that can emerge to solve the growing complexity and difficulties found in e-discovery. Introduction The Discovery Management industry does not have a defined baseline for quality. Complicating matters, the discovery of evidence for litigation was, until relatively recently, a paper-based process. As such, any existing industry standards and quality expectations are still primarily paper-based or focused on standards of quality control for scanning paper documents to digital formats. As the industry has adapted to process the exploding universe of digital media, and new products and processes are introduced, no new set of quality standards has emerged specifically to govern the discovery of Electronically Stored Information (ESI). In other industries, standards help inform buying decisions, provide a common language and point of reference to communicate quality. When purchasing in a manufacturing vertical (such as automotive and pharmaceuticals), buyers can expect a baseline of quality based on certifications. The e-discovery services industry is largely cost driven, with buyers purchasing e-discovery services as if it were a commodity but without the means to ascertain the level of quality they can expect. However, the purchasers of e-discovery legal services cannot expect quality service at every price point because of the lack of accepted industry practices. Industry standards are not simply a marketing tool to sell services; standardization of processes is an explicit requirement from the judiciary.1 Primarily in the area of search technology, courts have confirmed that standards are necessary for establishing defensible e-discovery practices. In addition, the Federal Rule of Civil Procedure 26(g)(1) requires attorneys to certify “to the best of the person’s knowledge, information, and belief formed after a reasonable inquiry” that disclosures are “complete and correct.” 1 William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Co., 256 F.R.D. 134, 134 (S.D.N.Y. 2009) (“This Opinion should serve as a wake-up call to the Bar in this District about the need for careful thought, quality control, testing, and cooperation with opposing counsel in designing search terms or “keywords” to be used to produce emails or other electronically stored information”) www.iediscovery.com 1.800.656.8444 ISO 9001: A Foundation for E-Discovery We believe these requirements can be satisfied with the adoption of quality management and documentation processes in the e-discovery industry. 
The discussion at this conference2 and others like it underscores the growing consensus that quality standards are necessary in creating the common baseline for defensible and standardized e-discovery practices. However, the industry is just beginning to grapple with issues such as the criteria used to search electronic records for responsive documents. For example, the Text Retrieval Conference (TREC) Legal Track is evaluating the effectiveness of various methods of information retrieval technology to find a baseline quality expectation3 Similarly, the Sedona Commentary on Achieving Quality in E-Discovery calls for development of standards and best practices in processing electronic evidence.4 These efforts are feeding a larger effort to create defensible standard practices for the industry. Most professionals in the e-discovery industry understand that there will never be a comprehensive e-discovery process; the demands of searching, reviewing, and producing evidence from the complex, diverse, and ever-expanding universe of discoverable data ensures that standardization will likely be impossible. An even bigger obstacle is the rapid changes in technology. For example, the use of advanced information retrieval technology to augment the human review process is constantly evolving.5 Also, protocols are case-based; what may be a perfect solution in one situation may not be appropriate for the next. ISO 9001 is an ideal solution for this state of affairs because it is designed to deliver the best solution for different situations. Because ISO 9001 is a baseline standard, it is flexible enough to address this complex challenge as few other approaches can. ISO 9001 has been held up as a standard that is a useful example of the type of standard the e-discovery industry can hope to develop. We believe that ISO 9001 is in fact not just an example, but a workable, real-world solution that provides a solid foundation for the e-discovery industry today. What is ISO 9001? The ISO 9000 family of standards is an internationally accepted consensus on good quality management practices. ISO 9001 is an international quality certification that defines minimum requirements for a company’s Quality Management System (QMS). A company’s QMS includes the organization’s policies, procedures and other internal requirements that ensure customer requests are met with consistency and result in customer satisfaction. Some of the areas of an organization within the scope of ISO 9001 include: • • • • • Customer contracts Hiring and employee training Design and development of products and services Production and delivery of products and services Selection and managing of suppliers 2 In Search of Quality: Is It Time for E-Discovery Search Process Quality Standards? Baron, Jason E-Discovery Team blog. (http://e-discoveryteam.com/2011/03/13/ in-search-of-quality-is-it-time-for-e-discovery-search-process-quality-standards/) 3 J. Krause, Human-Computer Assisted Search in EDD, Law Technology News, December 20 (2010). and Oard, et. al. Evaluation of information retrieval for E-discovery, Artificial Intelligence and Law, December 22 (2010). 4 The Sedona Commentary on Achieving Quality in E-Discovery, May 2009. Principle 3. Implementing a well thought out e-discovery “process” should seek to enhance the overall quality of the production in the form of: (a) reducing the time from request to response; (b) reducing cost; and (c) improving the accuracy and completeness of responses to requests. 
The type of quality process that this Commentary endorses is one aimed at adding value while lowering cost and effort.
5 Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011).
To maintain the certification, an organization must implement:
• Management responsibilities
• Internal quality audits
• Monitoring and measuring
• Continual improvement
• Corrective and preventive actions
To receive ISO 9001 certification, a company must put the required QMS processes and controls in place, monitor the performance of its processes and demonstrate continual improvement. Many companies hire an experienced consulting firm to assist with these preparations. Once the QMS is in place, a registrar (or certification body) is hired to audit the company's compliance with ISO 9001 requirements. If discrepancies are found during the audit, they must be corrected before the ISO 9001 certificate is issued. One of the most demanding aspects of the ISO 9001 certification is that it must be maintained through regular audits (bi-annual or annual) conducted by the selected registrar. To maintain certification, organizations must provide measurable targets for improvement and data to show current and past performance. This information is kept in a quality manual, a general description of how a company operates and how it meets ISO 9001 requirements. An organization provides specific procedures or work instructions determined by management as needed to ensure processes meet the stated quality objectives. In addition, an organization must maintain historical records that demonstrate compliance with company procedures and the ISO 9001 standard, and must train employees and management in the required responsibilities, ISO awareness and understanding of the quality policy, administrative procedures, and the audit process. Customer feedback is another essential component, which demands tracking customer complaints, compliments, and overall satisfaction. A management representative is assigned to coordinate the ISO program, and a regular management review meeting should assess progress and initiate improvements as needed. In addition, a team of employees trained to conduct an audit similar to the registrar's audit must conduct a formal internal audit, on top of the annual outside auditor review. Through these requirements, organizations will likely find that they must ensure rigorous documentation of processes is in place. And because the program demands continual review and process improvement, the certification makes certain that an organization's services, documentation, and processes are consistently updated and streamlined.
The Benefits of ISO
From an organizational standpoint, adopting and adhering to an ISO 9001 compliant QMS creates a more organized operating environment, attracts new customers, and generally leads to a higher level of satisfaction among those customers. For e-discovery practices, the certification process would demand that documentation and policies be put in place that are available as a reference, or even as supporting materials that attest to an e-discovery vendor's good-faith efforts to provide the highest standard of care in litigation. From a practical standpoint, the certification forces an organization to continually upgrade and reconsider its processes.
In outlining any current operations, organizations must add the requirements of the ISO 9001 standard and optimize processes, meaning internal operations can be quickly enhanced and streamlined. And, as noted, after achieving certification, the process mandates continual process improvements. A recent survey of 100 registered firms reported the average improvement in operating margin at 5 percent of sales. These firms also reported faster turnaround times, and a reduction in scrap and overtime. In addition, the ISO process facilitates increased quality awareness. During implementation, quality awareness will increase, since all staff must be trained on ISO 9001. The QMS will also demand built-in systems to report on key quality indicators, which will significantly reduce the reoccurrence of problems. This helps develop a strong quality culture, where the staff recognizes problems such as systems or process issues and work on fixing them, rather than placing blame with an individual. And with ISO 9001 www.iediscovery.com 1.800.656.8444 ISO 9001: A Foundation for E-Discovery certification, employees learn processes more quickly and reduce misunderstandings with customers. If a problem does occur, it is traced to its root cause and fixed. While there is no accepted ISO certification requirement in the e-discovery industry, ISO 9001 certification is becoming a requirement to do business in many markets. We believe that as sophisticated business enterprises bring the e-discovery process inhouse and away from law firms, the expectation of ISO certifications will increase. A recent survey of ISO 9001 certified companies shows that 41 percent were asked to achieve certification by a client. Considering that it can take 6 months or longer for some organizations to achieve certification, already having a compliant QMS in place can be a distinct advantage. E-discovery vendors that do adopt the standard now, ahead of any possible requirement to do so, have a distinct marketing advantage, as they are able to declare their processes conform to an internationally recognized standard that few competitors can claim. The ISO Organization Perhaps the most important benefit of ISO certification is the broad, international acceptance the standard has achieved. The International Standards Organization is a combination of the national standards institutes in roughly 157 countries. The ISO 9000 family of international quality management system standards is perhaps the best known example of the organization’s output, but it is only one of the many standards produced. The ISO 9000 standards provide a basis for certifying compliance by individual organizations with standards in the family. An e-discovery company may qualify for basic ISO 9001 certification, or, more optimally, an industry-specific standard could be created to provide applicable certification to their operations in this field. And when a company or organization is independently audited and certified to be in conformance with any ISO 9001 standard in the family,that organization may also claim to be “ISO 9001 certified.” What ISO Does and Does Not Do The ISO 9001 certification is distinct because it demands patience and an ongoing process of improvements. Other standards offer a regimen of self-help and implement more advanced management techniques in an organization. But as management and staff turnover naturally occurs, organizations lose interest and forget what they are working on without the ongoing commitment to the ISO 9001 audit process. 
ISO mandates that an organization's management maintain a defined quality standard and meet its goals. Compared to the ISO model, other certification processes are often static. Once certified, an organization can claim to have achieved the standard, but there is no required maintenance. For example, the Capability Maturity Model Integration (CMMI) in software engineering makes similar demands, but without demanding process improvement or providing a point of reference for appraising current processes. Of course, no certification guarantees quality service; rather, certifications can only attest to potential customers that formal processes for measuring and controlling quality are being applied.
The ISO 9001 Family
Figure 1: ISO 9001 (ISO 9001:2008) and its industry variants
Industry-specific standards:
• AS9001 – Aerospace Industry Standard
• ISO/TS 16949 – Automotive Industry Standard
• TL 9000 – Telecom Industry Standard
• ISO 13485 – Medical Industry Standard
• ISO/TS 29001 – Petroleum, Petrochemical and Natural Gas Industries Standard
• ISO 17025 – Calibration and Test Laboratories
• ISO 22000 – Food Safety
Related management-system standards:
• ISO/IEC 27001 – Information security management
• ISO/IEC 20000 – IT service management
• ISO 14001 – Environmental management standards
• ISO 26000 – Social responsibility
• OHSAS 18001 – Occupational Health and Safety
As Figure 1 indicates, a number of industries have created ISO 9000 variants with specific requirements. Most are in manufacturing fields, although the model can certainly be adapted to create a standard specific to the processing of ESI. In addition, related management-system standards implement international standards for social responsibility, as in the ISO 26000 model, and for environmental safety, as defined by ISO 14001. Of particular interest to e-discovery service providers, the ISO 27000 standard is designed to identify and manage risks posed to business information by data theft or accidental loss. It provides guidelines for putting a secure infrastructure in place and implementing a risk management process and corporate policy to minimize data loss. This is the one existing ISO standard e-discovery vendors can and should actively consider adopting in addition to ISO 9001. E-discovery service providers can have their processing centers certified under ISO/IEC 27001 as an assurance to customers that any ESI handling and processing is done with a commitment to security and data integrity. Litigation support suppliers will certainly benefit from the adoption of the general ISO 9001 standard. However, many industries have benefited from the adoption of an industry-defined subset of ISO 9001. These subsets were all proposed and developed by professional organizations and industry experts with the intent of addressing perceived weaknesses in ISO 9001 relative to that specific industry. Because a number of initiatives and projects are underway that attempt to define and create a framework for acceptable e-discovery practices, these efforts could certainly be used to jump-start an effort to define an e-discovery ISO 9001 model.
ISO 9001 and the E-discovery Industry
Industry organizations have begun to make some initial attempts at creating standards and best practices.
The Sedona Conference has a number of guides and best practices recommendations available for e-discovery topics, including search protocol and choosing an e-discovery vendor.6 The Electronic Discovery Reference Model (EDRM) has led an effort to create a standard, generally accepted XML model to allow vendors and systems to more easily share electronically stored information (ESI).
6 The Sedona Conference Publications (http://www.thesedonaconference.org/publications_html)
However, industry best practices are currently only recommendations, and technical standards such as the proposed XML schema are most useful in creating consistent standards and attributes for products. ISO focuses on how processes work and how work product is produced; it is not a technical or product-related standard. Technical or product-related standard certifications are generally hard to come by and are most useful only for technology vendors, not service providers. As noted, the ISO 9001 standard is a general baseline and provides only high-level guidance. A number of industry sectors have created standardized interpretations of the ISO guidelines for processes in industries as diverse as aerospace manufacturing and medical devices. Industry-specific versions of ISO 9000 allow for industry-specific requirements, and allow for the training and development of a base of industry auditors who are properly qualified to assess these industries. Standardizing processes could standardize pricing as well – or at least create a common language for pricing e-discovery services. Courts have repeatedly found that a failure to adequately document the steps taken to sample, test, inspect, reconcile, or verify e-discovery processes is unacceptable and can result in court-imposed sanctions.7 The profession may resist applying to litigation the kinds of metrics applied in manufacturing and other industries, but the discovery phase of litigation is a business process, and a quantifiable one. There will always be questions of law in the discovery process that require a lawyer's judgment and discretion, but within the process, service providers can and should apply some of the same rigor and standardization of service as seen in other industries. For example, some of the possible quality metrics that can be measured are:
• Defects per reviewed document delivered
• Search expectations - how many images, graphics, or embedded documents were successfully indexed
• The error rate for files loaded to a repository, as a proportion of the total number of files received
• Deadlines met or missed
• A measure of data collected which was ultimately deemed non-relevant
• Search accuracy and recall
• The number of corrupted files loaded to a repository prior to review
In order to implement such standards, definitions must be agreed upon; for example, such foundational issues as what constitutes a document and what constitutes a container file must be settled. The ongoing research by TREC and other technical studies can continue to develop baseline measures for successful e-discovery search and document review. These measures and metrics should then be considered within the ISO 9001 framework to provide a baseline for quality of services.
Moving Forward
Organizations such as the Sedona Conference and the EDRM are two obvious candidates for promoting further efforts in this area.
The ISO 9001 standard would in fact be an ideal vehicle for implementing the work these and other organizations done into search methodology and information handling across the industry. And together with the more detailed efforts to define and create best practices for the industry, perhaps an ISO 9001 standard for the management and handling of ESI can be formulated. The primary driver for an e-discovery-specific ISO standard will be to ensure that when a customer purchases services from a certified source, they can have a level of assurance that the vendor has basic quality control practices in place. Most importantly, the ISO 9001 certification standard provides a third-party independent auditor who reviews the company’s standard against the certification. Buyers do not want to trust a vendor with their data sets only find out a vendor does not have basic quality control measures in place. ISO 9001 is a standard that may become necessary just to compete. The e-discovery industry can only stay fragmented for so long. In order to mature, the e-discovery industry needs a common language to both satisfy the demands of its customers as well as the growing chorus of judges and legal scholars looking for measurable quality standards. 7 The Pension Committee of the University of Montreal Pension Plan, et al. v. Banc of America Securities LLC, et al., No. 05 Civ. 9016 (S.D.N.Y. Jan. 15, 2010) www.iediscovery.com 1.800.656.8444 ISO 9001: A Foundation for E-Discovery About the Authors Chris Knox, Chief Information Officer of IE Discovery Chris Knox has more than 16 years of project and resource management experience. He is responsible for the implementation of strategic initiatives and IT expenditures, as well as the development of company-wide operating procedures. Chris graduated from the University of Texas at Austin with a degree in Engineering and received his MBA from Syracuse University. Prior to IE Discovery, Chris designed and implemented data collection networks for software developers and Fortune 500 corporations. Chris also previously created Geographic Information Systems for large municipalities, specializing in the development of algorithms for hydraulic models. Scott Dawson, President of Core Business Solutions Scott has more than 20 years’ experience in manufacturing with the past 10 years consulting with organisations seeking IS0 9001 certification. Scott is also an active voting member of the US Technical Advisory Group (TAG) to ISO Technical Committee 176 (TC 176), which is responsible for drafting ISO 9001 and ISO 9004 on quality management systems. www.iediscovery.com 1.800.656.8444 A Call for Processing and Search Standards in E-Discovery Sean M. McNee, Steve Antoch Eddie O`Brien FTI Consulting, Inc. 925 Fourth Ave, Suite 1700 Seattle, WA 98104 USA FTI Consulting, Inc. 50 Bridge Street, Level 34 Sydney, Australia 3000 {sean.mcnee, steve.antoch}@fticonsulting.com eddie.obrien@ftiringtail.com ABSTRACT We discuss the need for standardization regarding document processing and keyword searching for e-discovery. We propose three areas to consider for standards: search query syntax, document encoding, and finally document metadata and context extraction. We would look to encourage search engine vendors to adopt these standards as an optional setup for the application of ediscovery keyword searches. We would encourage search engine users to apply these standards for e-discovery keyword searching. Examples of some difficulties worth noting: - Wildcard operators. 
Should such operators match on 0 characters or not? For example, would (Super*FunBall) hit on both the SuperFunBall and SuperHappyFunBall, or only the latter? - Stemming and Fuzzy Searching. Different IR systems provide support for different algorithms for term stemming and fuzzy searching (e.g. Porter stemming or Levenshtein distance). Attempting to standardize them might be too difficult in a standard. This would be an example of a value-add that a particular vendor could offer, but only of the lawyer understand and approve it. - Morphology and Word-breaking. Concepts and word breaks are hard to determine in some languages. For example, Arabic has many ways to express a single term; Chinese and Japanese have ambiguous word boundaries. Keywords E-Discovery, search, engine, keyword, standards. 1. INTRODUCTION E-Discovery document analysis and review continues to consume the bulk of the cost and time during litigation. As the e-discovery market matures, clients will have increased expectations about the quality and consistency of how their documents are collected, processed, and analyzed. It is also our assumption that e-discovery vendors will compete based on the quality and breadth of their review and analytic services offerings. Seeing this as the changing landscape of e-discovery, we propose in this paper that the vendors of e-discovery software and services are encouraged to create and apply a set of shared e-discovery standards for document processing and keyword search. We hope that these standards would be organized and maintained by a standards committee such as the Sedona Conference [1] or follow the example of the EDRM XML standard. 2. AREAS FOR STANDARDS We think there are several areas where consistency, speed, and quality could be improved by having an open and agreed to set of standards. 2.1 Search Query Syntax Different information retrieval/search engine systems use different and often incompatible syntax to express complex searches. This can cause confusion for attorneys, for example, when they are negotiating search terms during Meet and Confer, or when they are trying to express a complex query to an e-discovery vendor. A position paper at ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information on Discovery (DESI IV Workshop) Copyright © 2011 FTI Consulting, Inc. These are only a few examples of the potential problems encountered when standardizing query syntax. Our goal here is not to suggest that any given syntax is better than another. Nor is it to “dumb down” syntax by removing extremely complex operators. Rather, we see it as a chance to set a high bar as to what lawyers can expect from search engine systems in an ediscovery context. It is quite possible that some systems simply will not have enough functionality to support a standardized syntax. In this case, the lawyers are better off knowing of this limitation before e-discovery begins! While the syntax varies by vendor, many complex expressions have direct correlations—there should be a mapping between them. Ideally mappings would make it possible to start with a standard syntax and have each vendor map the query to their equivalent native syntax. The standard syntax should be vendorneutral; perhaps XML or some other formal expression language should be used to define it. 2.2 Encodings and Special Characters Textual characters are encoded in documents through the use of various character sets. 
The first and most well-known character set is the ASCII character set describing 127 characters (letters, numbers, and punctuation) used in English. Lawsuits, however, are language agnostic. Unicode [2] is the preferred standard from the ISO to represent a universal character set. To state that Unicode should be used as the standard encoding for all documents in e-discovery seems obvious—so, we should do it. What is not as obvious is the need for standardized set of test documents to validate the conversion to Unicode from a variety of data formats common to e-discovery. Finally, the standardized search query syntax discussed above needs to be able to express searches for all Unicode characters, including symbols such as the Unicode symbol for skull-andcrossbones (0x2620): ☠. 2.3 Metadata and Content Extraction A very small minority of documents in litigation are raw text documents. Most are semi-structured documents, such as emails, Microsoft Office documents, Adobe PDF documents, etc. These documents contain raw textual data, metadata, and embedded objects, including charts, images, audio/video, and potentially other semi-structured documents (e.g. a Microsoft Excel spreadsheet embedded in a Microsoft Word document). We have an opportunity now to extend what has already been done in the EDRM XML standard to define what metadata should be considered standard extractable metadata for various file types. If we know in advance what is required, then we can ensure higher quality. For example, it will be easier to detect corrupt files. By standardizing, we also make meet-and-confer meetings smoother, as metadata no longer becomes a point of contention— both sides assume the standard is available. 2.3.1 Known Document Types For known document types, such as Microsoft Office documents, there are several generally accepted ways of extracting content and metadata. These generally rely on proprietary technology, some of which are free (Microsoft’s iFilters [3]) and some are not (Oracle’s Outside In Technology [5]). Several open source alternatives also exist, such as Apache POI for Microsoft Office documents. Relying on any one technology, whether free, paid, or open source, is dangerous. Yet, because of the complexity of these file formats, it remains a necessary requirement. By enforcing standards of what metadata and content is to be expected from this extraction technology, we can provide for a more consistent ediscovery experience. 2.3.2 The Need for Open File Formats An important distinction for these document types is whether the file format is an open standard (email), proprietary yet fully documented (Microsoft Office [4]), or not public information. By specifying the differences between formats, a standard could enforce all data be represented in an open or documented formats. This way, open source solutions, such as Apache Tika [7], can fully participate in e-discovery without fear of reprisal. As a side effect, this could influence holders of closed proprietary formats to open them to the community at large. One important point, however, deals with the conversion from closed to open formats. As long the standard specifies what content and metadata needs are, the conversion needs to guarantee all data comes across faithfully. 2.3.3 Information in the Cloud For information residing in the cloud, such as documents in Google Docs, Facebook posts, Twitter updates, etc., determining what is a document can be difficult. Google Docs, for example, saves updates of documents every few seconds. 
Legally, how can you determine what is a user’s intended save point containing a ‘coherent’ document? Standardization is even more important here than for known document types—we need to define what a document even means before we can extract metadata and content. Further, all of the metadata we need might not be attached to the content but rather will need to be accessed programmatically. 3. ACKNOWLEDGEMENTS Thanks to everyone who provided comments and insights on previous drafts of this paper. 4. CONCLUSIONS In this paper we discussed the need for standards in e-discovery surrounding search query syntax, document encoding, and content extraction. We hope this starts a conversation among e-discovery practioners, search engine vendors, and corporations facing lawsuits with the goal of increasing search quality and consistency during E-Discovery. 5. REFERENCES [1] Sedona Conference. “The Sedona Conference Homepage” http://www.thesedonaconference.org/, Last accessed 21 April 2011. [2] Unicode Consortium. “The Unicode Standard”, http://www.unicode.org/standard/standard.html. Last accessed 21 April 2011. [3] Microsoft. “Microsoft Office 2010 Filter Packs”. https://www.microsoft.com/downloads/en/details.aspx?Famil yID=5cd4dcd7-d3e6-4970-875e-aba93459fbee, Last accessed 21 April 2011. [4] Microsoft. “Microsoft Office File Formats”, http://msdn.microsoft.com/enus/library/cc313118%28v=office.12%29.aspx. Last accessed: 21 April 2011. [5] Oracle. “Oracle Outside In Technology”, http://www.oracle.com/us/technologies/embedded/025613.ht m. Last accessed: 21 April 2011. [6] The Apache Foundation. “The Apache POI Project”, http://poi.apache.org/. Last accessed: 21 April 2011. [7] The Apache Foundation. “The Apache Tika Project”, http://tika.apache.org/. Last accessed: 21 April 2011. DESI IV POSITION PAPER The False Dichotomy of Relevance: The Difficulty of Evaluating the Accuracy of Discovery Review Methods Using Binary Notions of Relevance BACKGROUND Manual review of documents by attorneys has been the de facto standard for discovery review in modern litigation. There are many reasons for this, including the inherent authoritativeness of lawyer judgment and presumptions of reliability, consistency, and discerning judgment. In the past couple of decades, growth in the volume of electronic business records has strained the capacity of the legal industry to adapt, while also creating huge burdens in cost and logistics. (Paul and Baron, 2007). The straightforward business of legal document review has become so expensive and complex that an entire industry has arisen to meet its very particular needs. Continually rising costs and complexity have, in turn, sparked an interest in pursuing alternative means of solving the problem of legal discovery. Classic studies in the field of Information Retrieval which outline the perils and inherent accuracy of manual review processes have found new audiences. (see, e.g. Blair & Maron, 1985) Many newer studies have emerged to support the same proposition, such as the work of the E-Discovery Institute and many who work in the vendor space touting technology-based solutions. Even more recently, cross-pollination from information analytics fields such as Business Intelligence / Business Analytics, Social Networking, and Records Management have begun generating significant “buzz” about how math and technology can solve the problem of human review. 
Clients and counsel alike are looking toward different solutions for a very big problem – how to deal with massive amounts of data to find what is important and discharge discovery obligations better and more cost-effectively. The tools available to streamline this job are growing in number and type. Ever more sophisticated search term usage, concept grouping and coding techniques, next generation data visualization techniques, and machine learning approaches are all making inroads into the discovery space. There is ample evidence that the allure of “black box” methods is having an impact on how we believe the problem of large-scale discovery can be resolved. POSITION Because math is hard, lawyers have become enamored with notional “process” with its implicit suggestion that there is some metaphysically ideal assembly line approach that can be invoked for each case. All you have to do is make certain tweaks based on case type, complexity, etc. and you will generate a reproducible, defensible product. The approach is analogous to the “lodestar” computation used in assessing the reasonableness of contingency fees in complex cases. This process-focused approach rests on the faulty premise that relevance is an objective, consistently measurable quality, and by extension, that it is susceptible to some objectively measurable endpoint in document review. Deterministic formulas, no matter how sophisticated, can only accounts for discrete variables in the review, such as size, scope, complexity, and the like. The foundational variable, relevance, is anything but discrete, and without a reproducible, consistent definition of relevance, the input into any formula for review accuracy or success will be unreliable. The False Dichotomy of Relevance How do we determine if a document is relevant or not? Disagreement among similarly situated assessors in Information Retrieval studies is a known issue. (Voorhees, 2000). The issue of translating the imperfect, analog world of information to a binary standard of true/false is a difficult one to study. When you compound the confusion by blurring the distinction between relevance, which is something you want, and responsiveness, which is something that may lead to something you want, the difficulty only increases. In practice, this author has participated in side by side testing of learning tools and seen very capable expert trainers develop quite different interpretations of both responsiveness and relevance. Anyone who has been involved in document review understands that where responsiveness or relevance are concerned, reasonable minds can, and often do, disagree. Who is right and who is wrong? Is anyone really right or wrong? Take note of an actual request by the Federal Trade Commission in antitrust review. 
It calls for “all documents relating to the company’s or any other person’s plans relating to any relevant product, including, but not limited to…” The governing guidance for civil discovery can be found in Federal Rules of Civil Procedure 26(b)(1): “Parties may obtain discovery regarding any nonprivileged matter that is relevant to any party’s claim or defense – including the existence, description, nature, custody, condition, and location of any documents or other tangible things and the identity and location of persons who know of any discoverable matter… Relevant information need not be admissible at the trial if the discovery appears reasonably calculated to lead to the discovery of admissible evidence.” Very broad requests blended with highly inclusive interpretive guidance give rise to great variability in interpreting both relevance and responsiveness. What is relevant gets confused with what is responsive, and in both events, a wide range of possible thresholds can be established, depending on who is making the decisions. As an illustration using the above request, if a reviewer is presented with a document that is a calendar reminder to self concerning a product development meeting that mentions the product by name and a date, but no other information, would it be relevant or responsive? If it mentions other people expected to be in attendance, would that change things? If it also stated the meeting’s agenda, what would happen? Depending on the relevant issues of the particular matter, the answers might vary, and this author would disagree that there is any bright line response that covers every use case. In the bulk of litigation, a large proportion of documents fall into the kind of gray area like the calendar entry example above. There is rarely a hard and fast rule for what is relevant or responsive when context, vernacular, and intent are unknown. Forcing such determinations is a 2 necessary evil, but distorts conceptions of relevance and responsiveness, particularly when rules and guidance are inferred from prior judgments. The effort of doing so is akin to pushing a round peg through a square hole, and the results are analogous to trying to define obscenity instead of saying “you know it when you see it.” When forcing documents to live in a yes/no world, a marginal yes will be considered the same as an obvious, smoking gun yes for all follow-on evaluations. This creates a problem similar to significant figures calculations in scientific measurement – the incorporation and propagation of uncertainty into further calculations simply yields greater uncertainties. Attempting to adopt objective standards (e.g. F1 measure thresholds) based on a flawed presumption of binary relevance/responsiveness will by extension also be suspect. Comparing different information retrieval and review systems is difficult and often misleading enough without internalizing the uncertainty generated by enforced binary classification of relevance. Worse yet, the seduction of clean numerical endpoints belies the complexities in deriving them. We would love to say that System A is 90% accurate and System B is 80% accurate, so System A is superior. The truth, however, is that data are different, reviewers are different, assessors are different, and methods of comparing results are different. In the most straightforward matters, there are few documents that are 100% relevant or irrelevant to a given request. 
Moreover, actual relevance often changes over time and as case issues are defined more narrowly through discovery. After all, if both sides knew everything they needed to know about the case issues at the outset, why bother with discovery? As recently as the last Sedona annual meeting, there was talk of developing a benchmark F1 measure that could be used as an objectively reasonable baseline for accuracy in a review. This is troubling because even in the most knowledgeable community addressing electronic discovery issues, the notion of an objectively definable standard of relevance/responsiveness is entertained. The legal industry must not succumb to the temptation of easy numbers.1 Proposed Solution Before traveling too far down the road of setting accuracy standards or comparing different review systems, we should question our current conception of notional relevance in legal discovery review and advocate a meaningful, practical approach to benchmarking the accuracy of legal review in the future. We cannot faithfully ascribe a priori standards of relevance without the benefit of full knowledge that a real world case will not permit, and we cannot even do a legitimate analysis ex post facto unless all stakeholders can agree about what passes muster. 1 This kind of approach also ignores the fact that statistical measures will not work equally well across different likelihood of responsiveness (e.g. a recall projection for a corpus of 1 million in which 50 docs are truly responsive and 30 are returned would undoubtedly look very different from a projection based on 300,000 found out of 500,000 true responsive). Furthermore, such standard setting does not the fact that different cases call for different standards – a second request “substantial compliance” standard is, in practice, very different from a “leave no stone unturned” standard that one might employ in a criminal matter. 3 The best we can aim for is to make sure that everyone agrees that what is produced is “good enough.” “Good enough” is a fuzzy equation that balances the integrity of the results with the cost of obtaining them, and is evaluated by all concerned parties using their own criteria. Integrity is a utilitarian measure. As a consumer of discovery, a practitioner would want to know that everything that they would be interested in is contained therein. The guidance of the Federal Rules notwithstanding, this does not mean that a recipient of discovery wants to know that everything that is arguably responsive is contained in the production corpus, but rather everything that they would deem necessary to flesh out their story and understand / respond to the other side’s story is produced. In other words, and at the risk of over-simplification, the consumer of discovery wants some degree of certainty they have received all clearly relevant material. While discovery rules and requests are fashioned to yield the production of documents “tending to lead to the discovery of admissible evidence,” this is largely a safety net to ensure no under-production. Analyzing the accuracy of discovery as a function of whether all documents “tending to lead to the discovery of admissible evidence” is a slippery slope. 
The inquiry quickly turns to determining whether all documents that tend to lead to the discovery of documents that tend to lead to the discovery of potentially admissible evidence, which militates strongly in favor of severe over-production, at considerable cost to both producing and receiving party and also the very system of achieving justice, since it is so fraught with high and avoidable costs. Relevance within a case is highly volatile, subjective, and particular to that matter. Furthermore, the only parties that care are the ones involved (excluding for the sake of argument those who are interested in broader legal issues at bar). Accordingly, the best way to approach relevance is to adopt some relevance standard that relies on consensus, whether actual, modeled, or imputed. Actual consensus would involve use of representatives of both parties to agree that particular documents are relevant. Modeled consensus would involve using learning systems or predictive algorithms to rank documents according to a descending likelihood of relevance. Imputed consensus would involve the use of a disinterested third party, such as an agreed-upon arbiter or a special master. The question to be answered by any consensus-based standard should be slightly different than the rather unhelpful “whether this document tends to lead to the discovery of admissible evidence.” It should instead focus on actual utility. In terms of defining relevance, perhaps we could articulate the standard as a function of likelihood of being interesting, perhaps “would a recipient of discovery reasonably find this document potentially interesting?” Expressed in the inverse, a non-produced document would be classified as accurately reviewed UNLESS it was clearly interesting. No one really cares about marginally responsive documents, whether they are or are not produced. By extension, we should disregard marginal documents when determining the accuracy of a given review. As far as applying the standard, there are no objective criteria, so some subjective standard must be applied. This removes the business of assessing review accuracy from the myriad of manufacturing QA/QC processes available, since using objective metrics like load tolerances to measure subjective accuracy is like using word counts to rank the quality of Shakespearean plays. In practice, only the receiving party generally has standing to determine whether or not they are harmed by over or under production, so the most rational approach to determining review quality should begin and end with the use of the receiving party or a reasonable proxy for 4 them. One possible way of doing this is to assign an internal resource to stand in the shoes of the receiving party and make an independent assessment of samples of production (whether by sampling at different levels of ranked responsiveness, stratified sampling using other dimensions, such as custodian, date, or perhaps search term), and then analyze the results for “clear misses.” These clear misses could be converted to a rate of review required to include these (or other metric that demonstrates the diminishing returns associated with pushing the production threshold back), which can then be converted to man-hours and cost to produce such additional documents. If predictive categorization is being employed, it is also possible to use multiple trainers and then overlay their results. 
Overlapping results in relevance are a de facto consensus determination, and can be used to ascribe overall responsiveness to a given document. The benefit of this approach is that it also serves a useful QC function. There are, of course, a number of other possible approaches, but the overriding theme should be that evaluations of effectiveness and accuracy should redraw the lines used to assess review quality, steering away from hard-and-fast standards and moving toward more consensus-based, matter-specific metrics.

CONCLUSION
Attorneys and the electronic discovery industry should eschew the easy path of arbitrarily derived objective standards to measure quality and accuracy, but at the same time, they cannot expect to develop rigorous, objective criteria for comparing or evaluating search and review methods. Any evaluation of systems that purport to identify legally relevant or discoverable information rests on a definition of relevance, and relevance is a matter-specific, highly subjective, consensual determination. As a community, we should work toward developing assessment standards that mirror this reality.

Eli Nelson
Cleary, Gottlieb, Steen & Hamilton
2000 Pennsylvania Ave., Washington, DC 20006
(202) 974-1874
enelson@cgsh.com

Sampling – The Key to Process Validation
Christopher H. Paskach, Partner, KPMG LLP
Michael J. Carter, Manager, Six Sigma Black Belt, KPMG LLP
May 13, 2011

Abstract
The increasing volume and complexity of electronically stored information and the cost of its review continue to drive the need for development of sophisticated, high-speed processing, indexing and categorization -- or “predictive coding” -- software in response to litigation and regulatory proceedings. Since the majority of these tools rely on sophisticated, proprietary algorithms that are frequently referred to as “black box” technologies, there has been a reluctance to exploit their expected productivity gains for fear that the results they produce may be challenged and rejected as not meeting the required standard of “reasonableness.” Effective use of sampling can overcome this concern by demonstrating, with a stated level of confidence, that the system has produced results at least as consistent and reliable as those obtained by having attorneys review the documents without sophisticated technology support. Through testing based on statistical sampling, the quality improvements and cost savings promised by “predictive coding” technology can be realized.

Current State
Determining the reasonableness of a document search and review process has been based on whether there was sufficient “attorney review” of the documents to ensure that the results would be reliable. While “attorney review” has been accepted as the “gold standard” for the adequacy of a document review process, the consistency and reliability of the results produced by the attorneys have rarely been questioned or tested. The presumed effectiveness of “attorney review” is generally accepted to meet the reasonableness standard, so sophisticated sampling and testing of the attorney review is rarely performed. The sheer volumes and unforgiving production deadlines of today’s e-discovery efforts demand ever-increasing review capacity and throughput. Simply scaling up the review process with more people to handle these demands is clearly at odds with the cost control initiatives that are of utmost importance to corporate law departments.
New Technologies Recent technology development efforts have focused primarily on helping review teams manage the efficiency and cost of large scale reviews. Of particular interest are the tools that help categorize and cluster documents based on document content, or identifying near-duplicate documents and grouping them for review. These “predictive coding” tools can help reviewers speed through non-responsive or similar sets of documents by bulk tagging and more quickly isolating relevant material that has to be more carefully reviewed for privilege before being produced. The very latest technologies aim to automate the review process by minimizing the need for human reviewers in a first-pass review for relevance. But regardless of where an organization falls on the automation continuum in its adoption of technology -- from traditional linear review to concept-based clustering, leveraging technology or human review -- the goal of a faster, more consistent, more predictable and less costly review requires 1 more than basic efficiency gains. A cost-effective document review project requires more sophisticated technology and proven Quality Control (QC) processes to demonstrate its effectiveness. Technology vs. Human Review While an argument for cost effectiveness of technology-based processes has been largely established, the consistency of the results “versus” human review remains a topic of ongoing discussion. For many, the validation of review quality is often subordinate to the review itself and consists of informal or casual observations that lack the scientific rigor and quantifiable measures necessary to defend the quality process. More sophisticated quality methods that rely on sampling can provide the much needed assurance that the results are at least as good as human review and when used appropriately can result in significantly improved consistency and productivity. Over the past three years, KPMG has conducted four test projects that compared the results of an “attorney review” process with results obtained by reprocessing the same document collection with a predictive-coding software tool. The tool used in these tests uses a series of randomly selected sample batches of 40 documents that are reviewed by a subject matter expert (SME) to train the software. Based on the SME’s decisions on the training batches, the software calculates the relevance of the remaining documents in the collection. In all four test cases, the software was more consistent in categorizing documents than were the human review teams. Although the software produced more consistent results than the review attorneys, the proprietary algorithm used to produce the relevance ranking is not publicly available. However, the results it produces can be effectively tested with sampling to determine the efficacy of the automated relevance ranking process. Assuring Process Quality Assuring process capability, explicitly or implicitly, is a requirement for defensibility. Having a defendable, and therefore accepted, process is a matter of sound design, transparency and predictable results. Process sampling delivers all three requirements. Sampling is a well proven, scientifically rigorous method that can give the Project Manager much needed flexibility to demonstrate effectively the quality of the review process. Carefully selecting a sample, and from it inferring the condition of the larger population with high confidence in the reliability of the inference, is a powerful tool with tremendous eDiscovery utility. 
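To make the nature of that inference concrete, a minimal sketch is given below. It illustrates only the standard statistics of estimating a proportion from a random sample; it is not a description of KPMG's methodology, and the sample sizes and error counts are invented for the example.

```python
import math

def sample_size(confidence_z=1.96, margin=0.05, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin
    at the given confidence (z=1.96 ~ 95%); p=0.5 is the worst case."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin**2)

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for the true error rate, given `errors`
    miscoded documents observed in a random sample of size n."""
    phat = errors / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Example: how many documents to sample, and what 7 coding errors
# found in a 400-document sample would imply about the whole review.
print(sample_size())              # 385 documents for +/-5% at 95% confidence
print(wilson_interval(7, 400))    # roughly (0.0085, 0.0357)
```

The same arithmetic underlies the sample-size, risk-quantification and acceptance-criteria steps discussed next.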
The process of establishing review QC using statistical sampling enables the review team to determine appropriate sample size, quantify the process risks, and determine process acceptance and rejection criteria. Then, should questions arise concerning the quality of the results, a meaningful discussion of the QC methodology can take place without the need to explain, justify or alter unproven judgmental QC practices. Objections to Statistical Sampling If statistical sampling can provide all of these benefits to QC in discovery review, why isn’t it more widely used? There are several possible reasons, including a lack of familiarity with the method or its perceived complexity and the anticipated time investment required to understand and achieve proficiency in it. Another concern may be that a small error found in sampling could render the entire review results unacceptable. Likewise, in the discovery review process there is no clear legal precedent that confirms 2 the acceptability of statistical sampling methods for eDiscovery. Whatever the reasons, although sampling is widely accepted as a basic QC methodology in numerous other product and service industries to manage and quantify quality risk, it has not been widely adopted in eDiscovery review projects. Overcoming the Objections How can the issues that prevent wider use of statistical sampling be addressed? Overcoming the lack of familiarity with sampling can be addressed through training and the use of experts. Involving those who understand QC sampling in the process of eDiscovery can be a very effective approach to achieving the benefits and overcoming project managers’ unfamiliarity. These sampling experts can assist with data stratification, determining sample sizes and calculating confidence levels for statistical inferences. One objection to this approach would be the added cost of these sampling experts. This can be addressed with a straight-forward cost-benefit calculation comparing the cost of the experts to the avoided costs of more extensive testing with non-statistical approaches. Another objection would be the risk of the supervising attorneys not being sufficiently knowledgeable to assess the quality of the sampling experts’ work. This can be addressed through careful questioning and review of the experts’ approach and results. Another option to support using statistical sampling would be to programmatically integrate generally accepted QC sampling methods into widely-used eDiscovery applications. Carefully designed user interfaces for selecting samples, testing them and reporting the results could guide users through the sampling process, thereby minimizing, if not eliminating, most common sampling mistakes. Increased consistency, repeatability and reproducibility of the QC process would result. Additionally, the sampling methodology could include periodic batch sampling throughout the review process with a mechanism for dealing with review error as soon as it is detected to reduce the need to re-perform a significant portion of the review process. Likewise, sampling error could be addressed with a set of tools that would enable sample results to be adjusted and reinterpreted in light of sampling error to reduce the risk of having to significantly expand the sample or restart the sampling process. 
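One way such periodic batch sampling might be expressed as an explicit accept/reject rule is sketched below. The batch sample size, tolerable error rate and risk level are assumptions chosen purely for illustration; this is not a description of any existing eDiscovery application.

```python
import math

def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def acceptance_number(n, p_reject=0.05, consumer_risk=0.10):
    """Largest acceptance number c such that a batch whose true error rate
    is p_reject would pass the check with probability at most consumer_risk.
    (Falls back to zero tolerance when the sample is too small.)"""
    c = -1
    while binom_cdf(c + 1, n, p_reject) <= consumer_risk:
        c += 1
    return max(c, 0)

def accept_batch(observed_errors, n, p_reject=0.05, consumer_risk=0.10):
    """Accept the batch if observed errors do not exceed the acceptance
    number; otherwise route the batch back for re-review."""
    return observed_errors <= acceptance_number(n, p_reject, consumer_risk)

# Example: a 200-document sample per review batch, screening out batches
# whose true error rate is 5% or worse with roughly 90% probability.
print(acceptance_number(200))   # acceptance number c
print(accept_batch(3, 200))     # True: 3 observed errors is within tolerance
```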
The final objection regarding a lack of a clear legal precedent is likely to be addressed soon by the courts, which are becoming increasingly aware of the benefits of statistical sampling in dealing with the challenges posed by very large populations of documents. Without clear legal precedent there is some additional risk in applying new technologies and relying on statistical sampling to demonstrate their efficacy. However, the quality and cost benefits of QC sampling of the results from these new technologies can more than offset these risks until the legal precedents supporting their use are clearly established.

Note: The preceding commentary relates solely to process control sampling as applied in the performance of document review in connection with electronic discovery and is NOT a commentary on the maturity of sampling techniques relative to financial statement auditing.

Process Evaluation in eDiscovery as Awareness of Alternatives
Jeremy Pickens, John Tredennick, Bruce Kiefer
Catalyst Repository Systems
1860 Blake Street, 7th Floor
Denver, Colorado
303.824.0900
{jpickens, jtredennick, bkiefer}@catalystsecure.com

ABSTRACT
With a growing willingness in the legal community to accept various forms of algorithmic augmentation of the eDiscovery process, a better understanding of the quality of these machine-enhanced approaches is needed. Our view in this position paper is that one of the more important ways to understand quality is not in terms of absolute metrics on the algorithm, but in terms of an understanding of the effectiveness of the alternative choices a user could have made while interacting with the system. The user of an eDiscovery platform needs to know not only how well an information seeking process is running, but how well the alternatives to that process could have run.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Search process

General Terms
Measurement, Experimentation, Standardization.

Keywords
Iterative Information Seeking, Interactive Information Seeking, eDiscovery, Process Evaluation

1. INTRODUCTION
Unlike traditional ad hoc search (such as web search), in which the information seeking process is typically single-shot, eDiscovery has both the potential and the necessity to be iterative. Information needs in eDiscovery-oriented information seeking are changing and ongoing, and often cannot be met in a single round of interaction or by a single query. Evaluation of eDiscovery platform quality must take this into account. There are many metrics for single-shot, non-interactive retrieval, such as precision, recall, mean average precision, and PRES [1]. Our goal is not to propose a new single-shot metric. Instead, we declare that what is needed is an approach in which any or all of these metrics are used in an interactive context. Furthermore, we take a user-centric view in that we are not concerned with comparison between eDiscovery platforms but instead are concerned with helping the user understand where he or she is within a larger information seeking task on a single platform. A quality process should be one in which the user is able to both (1) affect system behavior by making conscious choices, and (2) explicitly obtain an understanding of the consequences of those choices, so as to adapt and make better choices in the future.

2. THE “WHAT IF” OF EDISCOVERY
2.1 Choices
Interactive information seeking in general and eDiscovery in particular are characterized by choices. Even with machine augmentation of the search process there is still a human in the loop, considering alternatives and making decisions. Examples of choices, not all of which are independent of each other, include:
1. Does one continue traversing the results list for an existing query, or does one issue a new query instead?
2. If complete queries are offered as suggestions, which of the alternatives does one pick?
3. If individual terms are offered as query expansion options, which of the alternatives does one pick, and when does one stop adding additional terms?
4. If the collection is clustered in some manner, which cluster does one choose to examine, and when does one stop examining that cluster?
5. If multiple sources (e.g. custodians) or document types (e.g. PDF, PPT, Word, email) are available, how does one choose which sources or types to pay the most attention to?
6. When the document volumes go beyond what is feasible to review, how does one determine when to stop reviewing?
7. At what point does one produce documents which haven’t been personally reviewed?

2.2 Consequences
In the previous section we outlined a few examples of the types of choices that an information seeker has to make. Each of those choices has consequences. The choice to dedicate time and resources to investigating information coming from one custodian means that less time and fewer resources will be dedicated to a different custodian. More time spent traversing the result set of one query means less time spent on the results of a different query, or perhaps fewer queries executed overall. Adding some terms to an existing query (during query expansion) means not adding others. Deciding that a particular point would be a good one at which to stop reviewing, and then continuing to review anyway, might yield diverging expectations as new pockets or rich veins of information are discovered. In order to understand the quality of a search process, knowing the effectiveness of such choices is not enough. A user has to be able to come to know and understand the opportunity costs of the choices not taken. Does an eDiscovery platform make it possible for a user to understand the consequences of his or her choices? Does the system give a user a working awareness of the alternatives? Is it possible for the user to return to a previous choice at a later point in time and obtain feedback on the question of “what if” that path had been chosen? A quality search process should be able to answer, or at least give insight into, these questions.

3. PRINCIPLES AND EXAMPLES
Giving an information seeker an awareness of alternatives is not an approach tied to any one particular algorithmically-enhanced methodology. The manner in which a machine (algorithm) learns from the human and applies that learning to the improvement of future choices is a separate issue from whether or not the user is able to garner insight into the efficacy of alternative choices. Granted, some algorithmic approaches might be more penetrable, more conducive to proffering the needed awareness. But the feedback on choices taken versus not taken is going to depend heavily on the nature of the choices themselves. That said, we offer a few principles which might aid in the design of consequence-aware systems:
1. If there is overlap between the multiple choices (i.e. if the consequences of certain choices are not mutually exclusive), then information garnered while following one choice could be used to make inferences about another choice.
2. If there is overlap between the consequences (results) of a single choice, then the efficacy of that choice can be more quickly assessed by examining fewer, perhaps more “canonical” results.

For example, a clustering algorithm might not partition a set of documents, but instead place a few of the same documents in multiple clusters. Or the same (duplicate or near-duplicate) documents might be found in the collections from more than one custodian. Or two different query expansion term choices (e.g. “bees” and “apiary”) might retrieve many of the same documents. In such cases, judgments (coding) on these shared documents can be used to assess multiple choices. Naturally the assessment is done within the context of whatever metric is most important to the user, whether that metric is precision, recall, or something else entirely. But the principle of using overlap to estimate and make inferences on that metric remains. The way in which this could be made to work would be to implement a process-monitoring subsystem that keeps track of choices both taken and not taken, and then uses information such as the ongoing manual coding of responsiveness and privilege to assess the validity of those choices. The differential between expectation at one point in time and reality at a future point in time should yield more insight into the information seeking eDiscovery process than just knowing the precision or recall effectiveness at any given point in time.

4. ISSUES
The largest issue that needs to be resolved for alternative-aware approaches is that of ever-expanding choice. At every round of interaction, at every point in the human-machine information seeking loop at which the human has the ability to make a choice, a number of options become available. Every choice then gives rise to another set of choices, in an exponentially-branching set of alternatives. Naturally this exponential set needs to be pruned into a manageable set of the most realistic, or possibly the most diverse, alternatives. The consequences of every possible choice or path not taken probably do not need to be tracked and monitored; a subset should be fine. However, there needs to be enough awareness of alternatives that the user can get an overall sense of how well he or she is doing, and how much progress is or is not being made with respect to the other choices that were available at various stages. The user needs to be able to get a sense of how well a choice at one point in time matches reality, as the consequences of that and other, hypothetically-followed choices become clearer at later points in time.

5. SUMMARY
Information retrieval has a long history of using user interaction (in the form of relevance feedback and query expansion, for example) to improve the information seeking process in an iterative manner. User behavior alters the algorithm. However, it is also true that the algorithm alters the user. The more choices a user makes, the more potential exists that some of these choices are sub-optimal. Therefore, awareness of alternative choices is needed to help the user orient himself or herself within complex information seeking tasks such as eDiscovery. This paper proposes an approach to the evaluation of quality not in terms of system comparison, but in terms of alternative, path-not-taken choice comparison and awareness.

6. REFERENCES
[1] Magdy, Walid and Jones, Gareth. PRES: A Score Metric for Evaluating Recall-Oriented Information Retrieval Applications. In Proceedings of the 33rd Annual SIGIR Conference.
Geneva, Switzerland. August 2010. Discovery of Related Terms in a corpus using Reflective Random Indexing Venkat Rangan Clearwell Systems, Inc. venkat.rangan@clearwellsystems.com ABSTRACT A significant challenge in electronic discovery is the ability to retrieve relevant documents from a corpus of unstructured text containing emails and other written forms of human-to-human communications. For such tasks, recall suffers greatly since it is difficult to anticipate all variations of a traditional keyword search that an individual may employ to describe an event, entity or item of interest. In these situations, being able to automatically identify conceptually related terms, with the goal of augmenting an initial search, has significant value. We describe a methodology that identifies related terms using a novel approach that utilizes Reflective Random Indexing and present parameters that impact its effectiveness in addressing information retrieval needs for the TREC 2010 Enron corpus. 1. Introduction This paper examines reflective random indexing as a way to automatically identify terms that co-occur in a corpus, with a view to offering the co-occurring terms as potential candidates for query expansion. Expanding a user’s query with related terms either by interactive query expansion [1, 5] or by automatic query expansion [2] is an effective way to improve search recall. While several automatic query expansion techniques exist, they rely on usage of a linguistic aid such as thesaurus [3] or concept-based interactive query expansion [4]. Also, methods such as ad-hoc or blind relevance feedback techniques rely on an initial keyword search producing a top-n results which can then be used for query expansion. In contrast, we explored building a semantic space using Reflective Random Indexing [6, 7] and using the semantic space as a way to identify related terms. This would then form the basis for either an interactive query expansion or an automatic query expansion phase. Semantic space model utilizing reflective random indexing has several advantages compared to other models of building such spaces. In particular, for the specific workflows typically seen in electronic discovery context, this method offers a very practical solution. 2. Problem Description Electronic discovery almost always involves searching for relevant and/or responsive documents. Given the importance of ediscovery search, it is imperative that the best technologies are applied for the task. Keyword based search has been the bread and butter method of searching, but its limitations have been well understood and documented in a seminal study by Blair & Moran [8]. At its most basic level, concept search technologies are designed to overcome some limitations of keyword search. When applied to document discovery, traditional Boolean keyword search often results in sets of documents that include non-relevant items (false positives) or that exclude relevant terms (false negatives). This is primarily due to the effects of synonymy (different words with similar meanings) or polysemy (same word with multiple meanings). For polysemes, an important characteristic requirement is that they share the same etymology but their usage has evolved it into different meanings. In addition, there are also situations where words that do not share the same etymology have different meanings (e.g., river bank vs. financial bank), in which case they are classified as homonyms. 
In addition to the above word forms, unstructured text content, and especially written text in emails and instant messages contain user-created code words, proper name equivalents, contextually defined substitutes, and prepositional references etc., that mask the document from being indentified using Boolean keyword search. Even simple misspellings, typos and OCR scanning errors can make it difficult to locate relevant documents. Also common is an inherent desire of speakers to use a language that is most suited from the perspective of the speaker. The Blair Moran study illustrates this using an event which the victim’s side called the event in question an “accident” or a “disaster” while the plaintiff’s side called it an “event”, “situation”, “incident”, “problem”, “difficulty”, etc. The combination of human emotion, language variation, and assumed context makes the challenge of retrieving these documents purely on the basis of Boolean keyword searches an inadequate approach. Concept based searching is a very different type of search when compared to Boolean keyword search. The input to concept searching is one or more words that allow the investigator or user to express a concept. The search system is then responsible for identifying other documents that belong to the same concept. All concept searching technologies attempt to retrieve documents that belong to a concept (reduce false negatives and improve recall) while at the same time not retrieve irrelevant documents (reduce false positives and increase precision). 3. Concept Search approaches Concept search, as applied to electronic discovery, is a search using meaning or semantics. While it is very intuitive in evoking a human reaction, expressing meaning as input to a system and applying that as a search that retrieves relevant documents is something that requires a formal model. Technologies that attempt to do this formalize both the input request and the model of storing and retrieving potentially relevant documents in a mathematical form. There are several technologies available for such treatment, with two broad overall approaches: unsupervised learning and supervised learning. We examine these briefly in the following sections. 3.1 Unsupervised learning These systems convert input text into a semantic model, typically by employing a mathematical analysis technique over a representation called vector space model. This model captures a statistical signature of a document through its terms and their occurrences. A matrix derived from the corpus is then analyzed using a Matrix decomposition technique. The system is unsupervised in the sense that it does not require a training set where data is pre-classified into concepts or topics. Also, such systems do not use ontology or any classification hierarchy and rely purely on the statistical patterns of terms in documents. These systems derive their semantics through a representation of co-occurrence of terms. A primary consideration is maintaining this co-occurrence in a form that reduces impact of noise terms while capturing the essential elements of a document. For example, a document about an automobile launch may contain terms about automobiles, their marketing activity, public relations etc., but may have a few terms related to the month, location and attendees, along with frequently occurring terms such as pronouns and prepositions. Such terms do not define the concept automobile, so their impact in the definition must be reduced. 
To achieve such end result, unsupervised learning systems represent the matrix of document-terms and perform a mathematical transformation called dimensionality reduction. We examine these techniques in greater detail in subsequent sections. 3.2 Supervised learning In the supervised learning model, an entirely different approach is taken. A main requirement in this model is supplying a previously established collection of documents that constitutes a training set. The training set contains several examples of documents belonging to specific concepts. The learning algorithm analyzes these documents and builds a model, which can then be applied to other documents to see if they belong to one of the several concepts that is present in the original training set. Thus, concept searching task becomes a concept learning task. It is a machine learning task with one of the following techniques. a) b) c) Decision Trees Naïve Bayesian Classifier Support Vector Machines While supervised learning is an effective approach during document review, its usage in the context of searching has significant limitations. In many situations, a training set that covers all possible outcomes is unavailable and it is difficult to locate exemplar documents. Also, when the number of outcomes is very large and unknown, such methods are known to produce inferior results. For further discussion, we focus on the unsupervised models, as they are more relevant for the particular use cases of concept search. 3.3 Unsupervised Classification Explored As noted earlier, concept searching techniques are most applicable when they can reveal semantic meanings of a corpus without a supervised learning phase. To further characterize this technology, we examine various mathematical methods that are available. 3.4 Latent Semantic Indexing Latent Semantic Indexing is one of the most well-known approaches to semantic evaluation of documents. This was first advanced in Bell Labs (1985), and later advanced by Susan Dumais and Landauer and further developed by many information retrieval researchers. The essence of the approach is to build a complete term-document matrix, which captures all the documents and the words present in each document. Typical representation is to build an N x M matrix where the N rows are the documents, and M columns are the terms in the corpus. Each cell in this matrix represents the frequency of occurrence of the term at the “column” in the document “row”. Such a matrix is often very large – document collections in the millions and terms reaching tens of millions are not uncommon. Once such a matrix is built, mathematical technique known as Singular Value Decomposition (SVD) reduces the dimensionality of the matrix into a smaller size. This process reduces the size of the matrix and captures the essence of each document by the most important terms that co-occur in a document. In the process, the dimensionally reduced space represents the “concepts” that reflect the conceptual contexts in which the terms appear. 3.5 Principal Component Analysis This method is very similar to latent semantic analysis in that a set of highly correlated artifacts of words and documents in which they appear, is translated into a combination of the smallest set of uncorrelated factors. These factors are the principal items of interest in defining the documents, and are determined using a singular value decomposition (SVD) technique. The mathematical treatment, application and results are similar to Latent Semantic Indexing. 
A variation on this, called independent component analysis is a technique that works well with data of limited variability. However, in the context of electronic discovery documents where data varies widely, this results in poor performance. 3.6 Non-negative matrix factorization Non-negative matrix factorization (NMF) is another technique most useful for classification and text clustering where a large collection of documents are forced into a small set of clusters. NMF constructs a document-term matrix similar to LSA and includes the word frequency of each term. This is factored into a term-feature and feature-document matrix, with the features automatically derived from the document collection. The process also constructs data clusters of related documents as part of the mathematical reduction. An example of this research is available at [2] which takes the Enron email corpus and classifies the data using NMF into 50 clusters. 3.7 Latent Dirichlet Allocation Latent Dirichlet Allocation is a technique that combines elements of Bayesian learning and probabilistic latent semantic indexing. In this sense, it relies on a subset of documents pre-classified into a training set, and unclassified documents are classified into concepts based on a combination of models from the training set [15]. 3.8 Comparison of the above technologies Although theoretically attractive and experimentally successful, word space models are plagued with efficiency and scalability problems. This is especially true when the models are faced with real-world applications and large scale data sets. The source of these problems is the high dimensionality of the context vectors, which is a direct function of the size of the data. If we use document-based co-occurrences, the dimensionality equals the number of documents in the collection, and if we use word-based co-occurrences, the dimensionality equals the vocabulary, which tends to be even bigger than the number of documents. This means that the co-occurrence matrix will soon become computationally intractable when the vocabulary and the document collections grow. Nearly all the technologies build a word space by building a word-document matrix with each row representing a document and column representing a word. Each cell in such a matrix represents the frequency of occurrence of the word in that document. All these technologies suffer from a memory space challenge, as these matrices grow to very large sizes. Although many cells are sparse, the initial matrix is so large that it is not possible to accommodate the computational needs of large electronic discovery collections. Any attempt to reduce this size to a manageable size is likely to inadvertently drop potentially responsive documents. Another problem with all of these methods is that they require the entire semantic space to be constructed ahead of time, and are unable to accommodate new data that would be brought in for analysis. In most electronic discovery situations, it is routine that some part of the data is brought in as a first loading batch, and once review is started, additional batches are processed. 4. Reflective Random Indexing Reflective random indexing (RRI) [6, 7, 11] is a new breed of algorithms that has the potential to overcome the scalability and workflow limitations of other methods. RRI builds a semantic space that incorporates a concise description of term-document co-occurrences. 
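Before turning to that approach, the following minimal sketch illustrates the term-document matrix and dimensionality-reduction idea common to the methods of Sections 3.4 and 3.5, using a tiny invented corpus and a truncated SVD. It is a toy illustration of the general technique, not of any production system.

```python
import numpy as np

# Toy corpus: each "document" is a short string (hypothetical data).
docs = [
    "offshore drilling rig exploration",
    "offshore oil exploration",
    "power outage brownout grid",
    "brownout power grid failure",
]

# Build the term-document count matrix (terms x documents).
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# Truncated SVD: keep k latent dimensions ("concepts").
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * S[:k]        # terms projected into the concept space

def related_terms(query, top_n=3):
    """Rank terms by cosine similarity to the query term in concept space."""
    q = term_vecs[vocab.index(query)]
    sims = term_vecs @ q / (np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:top_n]

print(related_terms("drilling"))    # e.g. exploration, offshore, rig near the top
```

At realistic corpus sizes the matrix A and the SVD itself become the bottleneck, which is exactly the scalability concern raised above.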
The basic idea of the RRI and the semantic vector space model is to achieve the same dimensionality reduction espoused by latent semantic indexing, without requiring the mathematically complex and intensive singular value decomposition and related matrix methods. RRI builds a set of semantic vectors, in one of several variations – term-term, termdocument and term-locality. For this study, we built an RRI space using term-document projections, with a set of term vectors and a set of document vectors. These vectors are built using a scan of the document and term space with several data normalization steps. The algorithm offers many parameters for controlling the generation of semantic space to suit the needs of specific accuracy and performance targets. In the following sections, we examine the elements of this algorithm, its characteristics and various parameters that govern the outcome of the algorithm. 4.1 Semantic Space Construction As noted earlier, the core technology is the construction of semantic space. A primary characteristic of the semantic space is a term-document matrix. Each row in this matrix represents all documents a term appears in. Each column in that matrix represents all terms a document contains. Such a representation is an initial formulation of the problem for vector-space models. Semantic relatedness is expressed in the connectedness of each matrix cell. Two documents that share the same set of terms are connected through a direct connection. It is also possible for two documents to be connected using an indirect reference. In most cases, term-document matrix is a very sparse matrix and can grow to very large sizes for most document analysis cases. Dimensionality reduction reduces the sparse matrix into a manageable size. This achieves two purposes. First, it enables large cases to be processed in currently available computing platforms. Second, and more importantly, it captures the semantic relatedness through a mathematical model. The RRI algorithm begins by assigning a vector of a certain dimension to each document in the corpus. These assignments are chosen essentially at random. For example, the diagram below has assigned a five-dimensional vector to each document, with specific randomly chosen numbers at each position. These numbers are not important – just selecting a unique pattern for each document is sufficient. Figure 1: Document Vectors From document vectors, we construct term vectors by iterating through all terms in the corpus, and for each term, we identify the documents that term appears in. In cases where the term appears multiple times in the same document, that term is given a higher weight by using its term frequency. Each term k’s frequency in the document nk weighs in for each document vector’s position. Thus, this operation projects all the documents that a term appears in, and condenses it into the dimensions allocated for that term. As is evident, this operation is a fast scan of all terms and their document positions. Using Lucene API TermEnum and TermDocs, a collection of term vectors can be derived very easily. Once the term vectors are computed, these term vectors are projected back on to document vectors. We start afresh with a new set of document vectors, where each vector is a sum of the term vectors for all the terms that appear in that document. Once again, this operation is merely an addition of floating point numbers of each term vector, adjusting for its term frequency in that document. 
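A minimal sketch of the projection steps just described is shown below. It assumes, for illustration only, random ±1 document signatures, a small dimension, and simple term-frequency weighting; the corpus, postings structure and variable names are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 200                     # dimension of the semantic vectors

# Hypothetical postings: term -> {doc_id: term frequency in that doc}
postings = {
    "drilling": {0: 3, 2: 1},
    "offshore": {0: 1, 1: 2},
    "brownout": {3: 4},
}
n_docs = 4

# Step 1: assign an essentially random signature vector to every document.
doc_vecs = rng.choice([-1.0, 1.0], size=(n_docs, DIM))

# Step 2: project documents onto terms -- each term vector is the
# tf-weighted sum of the signature vectors of the documents it occurs in.
term_vecs = {
    term: sum(tf * doc_vecs[d] for d, tf in doc_tfs.items())
    for term, doc_tfs in postings.items()
}

# Step 3: project terms back onto documents -- each new document vector is
# the tf-weighted sum of the vectors of the terms it contains.
new_doc_vecs = np.zeros((n_docs, DIM))
for term, doc_tfs in postings.items():
    for d, tf in doc_tfs.items():
        new_doc_vecs[d] += tf * term_vecs[term]

# Repeating steps 2-3 with new_doc_vecs in place of the random signatures
# corresponds to running additional training cycles.
```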
A single sweep of document vectors to term vector projection followed by term vectors to document vector constitutes a training cycle. Depending on needs of accuracy in the construction of semantic vectors, one may choose to run the training cycle multiple times. Upon completion of the configured number of training cycles, document and term vector spaces are persisted in a form that enables fast searching of documents during early data exploration, search, and document review. It is evident that by constructing the semantic vector space, the output space captures the essential co-occurrence patterns embodied in the corpus. Each term vector represents a condensed version all the documents the term appears in, and each document vector captures a summary of the significant terms present in the document. Together, the collection of vectors represents the semantic nature of related terms and documents. Once a semantic space is constructed, a search for related terms of a given query term is merely a task of locating the nearest neighbors of the term. Identifying such terms involves using the query vector to retrieve other terms in the term vector stores which are closest to it by cosine measurement. Retrieving matching documents for a query term is by identifying the closest documents to the query term’s vector in document vector space, again by way of cosine similarity. An important consideration for searching vector spaces is the performance of locating documents that are cosine-similar, without requiring a complete scan of the vector space. To facilitate this, the semantic vector space is organized in the form of clusters, with sets of the closest vectors characterized by both its centroid and the Euclidean distance of the farthest data point in the cluster. These are then used to perform a directed search eliminating the examination of a large number of clusters. 4.2 Benefits of Semantic Vector Space From the study the semantic vector space algorithm, one can immediately notice the simplicity in realizing the semantic space. A linear scan of terms, followed by a scan of documents is sufficient to build a vector space. This simplicity in construction offers the following benefits. a) b) In contrast to LSA and other dimensionality reduction techniques the semantic space construction requires much less memory and CPU resources. This is primarily because matrix operations such as singular value decomposition (SVD) are computationally intensive, and requires both the initial term-document matrix and intermediate matrices to be manipulated in memory. In contrast, semantic vectors can be built for a portion of the term space, with a portion of the index. It is also possible to scale the solution simply by employing persistence to disk at appropriate batching levels, thus scaling to unlimited term and document collections. The semantic vector space building problem is more easily parallelizable and distributable across multiple systems. This allows parallel computation of the space, allowing for a distributed algorithm to work on multiple term-document spaces simultaneously. This can dramatically increase the availability of concept search capabilities to very large matters, and within time constraints that are typically associated with large electronic discovery projects.. c) d) Semantic space can be built incrementally, as new batches of data are received, without having to build the entire space from scratch. 
This is a very common scenario in electronic discovery, as an initial batch of document review needs to proceed before all batches are collected. It is also fairly common for the scope of electronic discovery to increase after early case assessment. Semantic space can be tuned using parameter selection such as dimension selection, similarity function selection and selection of term-term vs. term-document projections. These capabilities allow electronic discovery project teams to weigh the costs of computational resources against the scope of documents to be retrieved by the search. If a matter requires a very narrow interpretation of relevance, the concept search algorithm can be tuned and iterated rapidly. Like other statistical methods, semantic spaces retain their ability to work with a corpus containing documents from multiple languages, multiple data types and encoding types etc., which is a key requirement for e-discovery. This is because the system does not rely on linguistic priming or linguistic rules for its operation.

5. Performance Analysis
Resource requirements for building a semantic vector space are an important consideration. We evaluated the time and space complexity of semantic space algorithms as a function of corpus size, both for the initial construction phase and for follow-on search and retrievals. Performance measurements for both aspects are characterized for the corpora indicated below.

Corpus | Reuters | EDRM Enron Corpus | TREC Tobacco Collection
PST Files | - | 171 | -
No. of Emails | - | 428072 | -
No. of Attachments | 21578 | 305508 | 6,270,345
No. of Term Vectors (email) | - | 251110 | -
No. of Document Vectors (email) | - | 402607 | -
No. of Term Vectors (attachments) | 63210 | 189911 | 3,276,880
No. of Doc Vectors (attachments) | 21578 | 305508 | 6,134,210
No. of Clusters (email) | - | 3996 | -
No. of Clusters (attachments) | 134 | 2856 | 210,789

Table 1: Data Corpus and Semantic Vectors

As can be observed, term vectors and document vectors vary based on the characteristics of the data. While the number of document vectors closely tracks the number of documents, the number of term vectors grows more slowly. This is the case even for OCR-error prone ESI collections, where the term vector growth moderated as new documents were added to the corpus.

5.1 Performance of semantic space building phase
Space complexity of the semantic space model is linear with respect to the input size. Also, our implementation partitions the problem across certain term boundaries and persists the term and document vectors for increased scalability. The algorithm requires memory space for tracking one million term and document vectors, which is about 2GB, for a semantic vector dimension of 200. Time for semantic space construction is linear in the number of terms and documents. For a very large corpus, the space construction requires periodic persistence of partially constructed term and document vectors and their clusters. A typical configuration persists term vectors for each million terms, and document vectors for each million documents. As an example, the TREC tobacco corpus would require 4 term sub-space constructions, with six document partitions, yielding 24 data persistence invocations. If we consider the number of training cycles, each training cycle repeats the same processes. As an example, the TREC tobacco corpus with two training cycles involves 48 persistence invocations. For a corpus of this size, persistence adds about 30 seconds for each invocation.
Performance Item | Vector Construction (minutes) | Cluster Construction (minutes)
Reuters-21578 dataset | 1 | 1
EDRM Enron dataset | 40 | 15
TREC Tobacco Corpus | 490 | 380

Table 2: Time for space construction, two training cycles (default)

These measurements were taken on a commodity Dell PowerEdge R710 system, with two Quad Xeon 5500 processors at 2.1GHz and 32GB of memory.

5.2 Performance of exploration and search
Retrieval times for concept searches and for semantic space exploration are also characterized for various corpus sizes and complexities of queries. To facilitate fast access to term and document vectors, our implementation employs a purpose-built object store. The object store offers the following:
a) Predictable and consistent access to a term or document semantic vector. Given a term or document, the object store provides random access and retrieval of its semantic vector within 10 to 30 milliseconds.
b) Predictable and consistent access to all nearest neighbors (using cosine similarity and Euclidean distance measures) of a term or document vector. The object store has built-in hierarchical k-means based clustering. The search algorithm implements a cluster exploration technique that algorithmically chooses the smallest number of clusters to examine for distance comparisons. A cluster of 1000 entries is typically examined in 100 milliseconds or less.

Given the above object store and retrieval paths, retrieval times for searches range from 2 seconds to 10 seconds, depending in large part on the number of nearest neighbors of a term, the number of document vectors to retrieve, and the size of the corpus. The following table illustrates observed performance for the Enron corpus, using the cluster-directed search described above.

Term vector search | Average | Stdev
Clusters Examined | 417.84 | 274.72
Clusters Skipped | 1001.25 | 478.98
Terms Compared | 24830.38 | 16079.72
Terms Matched | 21510.29 | 15930.2
Total Cluster Read Time (ms) | 129.39 | 88.23
Total Cluster Read Count | 417.84 | 274.72
Average Cluster Read Time (ms) | 0.29 | 0.18
Total Search Time (ms) | 274.56 | 187.27

Table 3: Search Performance Measurements

As is apparent from the above time measurements, as well as the number of clusters examined and skipped, identifying related terms can be offered to users with predictability and consistency, thereby making it possible to use the capability as an interactive, exploratory tool during the early data analysis, culling, analysis and review phases of electronic discovery.

6. Search Effectiveness
An important analysis is to evaluate the effectiveness of retrieval of related terms from the perspective of the search meeting the information retrieval needs of the e-discovery investigator. We begin with a qualitative feel for search results by examining the related terms and identifying their relevance. We then analyze search effectiveness using the standard measures, Precision and Recall. We also examine search effectiveness using Discounted Cumulative Gain (DCG).

6.1 Qualitative Assessment
To obtain a qualitative assessment, we consider the related terms retrieved, examine their nearness measurements, and validate the closest top terms. The nearness measure we use for this analysis is the cosine measure of the initial query vector when compared with the reported result. It is a well-understood measure of judgment of quality in that a cosine measure reflects the alignment of the two vectors, and closeness to the highest value of cosine, which is 1.0, means perfect alignment.
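For reference, the cosine measure used in this assessment, and the related-term lookup it supports, reduce to a few lines. The sketch below ranks terms held in an in-memory dictionary of vectors, a deliberate simplification of the clustered, disk-backed object store described above; the variable names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two semantic vectors; 1.0 means
    perfect alignment, values near 0 mean little shared co-occurrence."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def nearest_terms(query_term, term_vecs, top_n=20):
    """Return the top_n terms whose vectors are closest to the query term's
    vector by cosine similarity -- a brute-force stand-in for the
    cluster-directed search characterized in Section 5.2."""
    q = term_vecs[query_term]
    scored = [(t, cosine(q, v)) for t, v in term_vecs.items() if t != query_term]
    return sorted(scored, key=lambda kv: -kv[1])[:top_n]

# Usage with hypothetical vectors, e.g. the term_vecs dictionary built
# in the earlier RRI sketch:
# print(nearest_terms("drilling", term_vecs))
```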
Table 4 shows alignment measures for two concept query terms for the EDRM Enron Dataset [12]. It is quite clear that several of the related terms are in fact logically related. In cases where the relationship is suspect, it is indeed the case that co-occurrence is properly represented. E.g., the terms offshore and mainly appear in enough documents together to make it into the top 20 related terms. Similarly, we have offshore and foreign co-occurring to define the concept of offshore on the basis of the identified related terms.

Query: drilling | | Query: offshore |
Related Term | Similarity | Related Term | Similarity
refuge | 0.15213 | interests | 0.13212
Arctic | 0.12295 | foreign | 0.13207
wildlife | 0.12229 | securing | 0.12597
exploration | 0.11902 | viable | 0.12422
Rigs | 0.11172 | involves | 0.12345
Rig | 0.11079 | associated | 0.12320
supplies | 0.11032 | mainly | 0.12266
Oil | 0.11017 | principle | 0.12248
refineries | 0.10943 | based | 0.12241
Environmentalists | 0.10933 | achieved | 0.12220

Table 4: Illustration of two query terms and their term neighbors

We can further establish the validity of our qualitative assessment using individual document pairs and their document co-occurrence patterns. As an example, Table 5 shows the cosine similarity, the number of documents the two terms appear in, and the common set of documents both terms appear in, again in the EDRM Enron Dataset.

Term1 | Term2 | Cosine | Docs1 | Docs2 | CDocs
offshore | drilling | 0.2825 | 1685 | 1348 | 572
governor | Davis | 0.3669 | 2023 | 2877 | 943
brownout | power | 0.0431 | 13 | 30686 | 13
brownout | ziemianek | 0.5971 | 13 | 2 | 1

Table 5: Cosine similarity comparison for select terms from EDRM Enron corpus

An observation from the above data is that when the two terms compared appear in a large number of documents with large overlap, the similarity is greater. In contrast, if one term is dominant in its presence in a large number of documents and the other term is not, the similarity is lower even when the two terms appear together in all of the rarer term's documents (brownout and power). Also noteworthy is that if the documents each term appears in are small in number (brownout and ziemianek), the similarity measure is significantly higher.

6.2 Precision and Recall Measures
Precision and recall are two widely used metrics for evaluating the correctness of a search algorithm [8]. Precision refers to the ratio of relevant results compared to the full retrieved set, and represents the number of false positives in the result. Recall, on the other hand, measures the ratio of relevant results compared to the number of relevant results actually present in the collection, i.e. the number of false negatives. Usually, recall is a harder measure to determine since it would require reviewing the entire collection to identify all the relevant items, and sample-based estimation is a substitute. For our purposes, two critical information retrieval needs should be evaluated: a) the ability of the system to satisfy information retrieval needs for the related concept terms, and b) the ability of the system to provide the same for documents in a concept. We evaluated both for several specific searches using the EDRM Enron dataset, and we present our results below.

6.3 Precision and Recall for Related Concept Terms
Note that Precision and Recall are defined for related concept terms using a combination of automated procedures and manual assessment.
As an example, we supplied a list of queries and their related concept terms and asked human reviewers to rate each related term result as either strongly correlated or related to the initial query, or if it is not related. This gives us an indication of precision for our results, for a given cutoff point. Evaluating recall is harder, but we utilized a combination of sampling methodology and a deeper probe into related term result. As an example of this, we evaluated precision for a cutoff at 20 terms and recall by examining 200 terms and constructing relevance graphs. 6.4 Impact of dimensions Given that the semantic vector space performs a dimensionality reduction, we were interested in understanding the impact of dimension choice for our semantic vectors. For the default implementation, we have a vector dimension of 200, which means that each term and document has a vector of 200 floating point numbers. To study this, we performed a study of precision and recall for the EDRM Enron dataset and tracked the precision-recall graph for four choices of dimensions. The results are indicated in Figure 2 below. As can be observed, we did not gain significant improvement on precision and recall characteristics with a higher choice of dimension. However, for a large corpus, we expect that precisionrecall graph would indicate a significantly steeper fall-off. We also evaluated search performance relative to dimensions. As expected, there is a direct correlation between the two, which can be explained by the additional disk seeks to retrieve both cluster objects as well as semantic vectors for comparison to the query vector. This is illustrated in Figure 3 below. 6.6 Impact of Training Cycles We studied the impact of training cycles on our results. A training cycle captures the co-occurrence vectors computed in one cycle to feed into the next cycle as input vectors. As noted earlier, the document vectors for each training cycle start with randomly assigned signatures, and each successive training cycle utilizes the learned term semantic vectors and feeds it into the final document vectors for that phase. This new set of document vectors forms the input (instead of the random signatures) for the next iteration of the training cycle. Figure 2: Precision and Recall graphs for the EDRM Enron Dataset Figure 4: Normalized DCG vs. dimensions of semantic vector space In our model, we note that term has a direct reference to another discovered term when they both appear in the same document. If they do not appear in the same document but are connected by one or more other common terms between the two documents, we categorize that as an indirect reference. Figure 3: Characterizing Search time and dimensions for 20 random searches A significant observation is that overall resource consumption increases substantially with increase in dimensions. Additionally, vector-based retrieval also times increase significantly. We need to consider these resource needs in the context of improvements in search recall and precision quality measures. 6.5 Discounted Cumulative Gain In addition to Precision and Recall, we evaluated the Discounted Cumulative Gain (DCG), which is a measure of how effective the concept search related terms are [14]. It measures the relative usefulness of a concept search related term, based on its position in the result list. 
Given that Concept Search query produces a set of related terms and that a typical user would focus more on the higher-ranked entries, the relative position of related terms is a very significant metric. Figure 4 illustrates the DCG measured for the EDRM Enron Dataset for a set of 20 representative searches, for four dimension choices indicated. We evaluated the retrieval quality improvements in the context of increases in resource needs and conclude that acceptable quality is achievable even with a dimension of 200. Adding training cycles has the effect of discovering new indirect references from one term to another term, while also boosting the impact of common co-occurrence. As an example, Table 6 illustrates training cycle 1 and training cycle 4 results for the term drilling. Notice that new terms appear whose co-occurrence is reinforced by several indirect references. Another view into the relative changes to term-term similarity across training cycles is shown below. Table 7 illustrates the progression of term similarity as we increase the number of training cycles. Based on our observations, the term-term similarity settles into a reasonable range in just two cycles, and additional cycles do not offer any significant benefit. Also noteworthy is that although the initial assignments are random, the discovered terms settle into a predictable collection of co-occurrence relationship, reinforcing the notion that initial random assignment of document vectors get subsumed by real corpus-based co-occurrence effects. [2] Myaeng, S. H., & Li, M. (1992). Building Term Clusters by Acquiring Lexical Semantics from a Corpus. In Y. Yesha (Ed.), CIKM-92, (pp. 130-137). Baltimore, MD: ISMM. Query: drilling Training Cycle 1 Training Cycle 4 Related Term Similarity Related Term Similarity Wells 0.164588 rigs 0.25300 Rigs 0.151399 wells 0.23867 viking 0.133421 offshore 0.22940 Rig 0.130347 rig 0.21610 buckeye 0.128801 exploration 0.21397 Drill 0.124669 geo 0.20181 exploration 0.123967 mcn 0.19312 richner 0.122284 producing 0.18966 producing 0.121846 ctg 0.18904 alpine 0.116825 gulf 0.17324 [3] Susan Gauch and Meng Kam Chong, Automatic Word Similarity Detection for TREC 4 Query Expansion, Electrical Engineering and Computer Science, University of Kansas [4] Yonggang Qiu, H.P.Frei, Concept-Based Query Expansion, Swiss Federal Institute of Technology, Zurich, Switzerland [5] Ian Ruthven, Re-examining the Potential Effectiveness of Interactive Query Expansion, Department of Computer and Information Sciences, University of Strathclyde, Glasgow [6] An Introduction to Random Indexing, Magnus Sahlgren, SICS, Swedish Institute of Computer Science. [7] Trevor Cohen (Center for Cognitive Informatics and Decision Making, School of Health Information Sciences, University of Texas), Roger Schvaneveldt (Applied Psychology Unit, Arizona State University), Dominic Widdows (Google Inc., USA) [8] Blair, D.C. & Moran M.E. (1985). An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, 28, 298-299 Table 6: Training Cycle Comparisons Term1 Term2 TC-1 TC-2 TC-3 TC-4 offshore drilling 0.2825 0.9453 0.9931 0.9981 governor davis 0.3669 0.9395 0.9758 0.9905 brownout power 0.0431 0.7255 0.9123 0.9648 brownout ziemianek 0.5971 0.9715 0.9985 0.9995 Table 7: Term Similarity of training cycles (TC) for four cycles 7. CONCLUSIONS Our empirical study of Reflective Random Indexing indicates that it is suitable for constructing a semantic space for analyzing large text corpora. 
7. CONCLUSIONS
Our empirical study of Reflective Random Indexing indicates that it is suitable for constructing a semantic space for analyzing large text corpora. Such a semantic space can augment traditional keyword-based searching with related terms as part of query expansion. Co-occurrence patterns of terms within documents are captured in a way that makes query construction and usage straightforward. We also observed the presence of many direct and indirect co-occurrence associations, which is useful for concept-based retrieval of text documents in the context of electronic discovery. We studied the impact of dimensions and training cycles, and our validations indicate that a choice of 200 dimensions and two training cycles produces acceptable results.

8. REFERENCES
[1] Donna Harman, Towards Interactive Query Expansion, Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland.
[2] Myaeng, S. H., and Li, M. (1992). Building Term Clusters by Acquiring Lexical Semantics from a Corpus. In Y. Yesha (Ed.), CIKM-92 (pp. 130-137). Baltimore, MD: ISMM.
[3] Susan Gauch and Meng Kam Chong, Automatic Word Similarity Detection for TREC 4 Query Expansion, Electrical Engineering and Computer Science, University of Kansas.
[4] Yonggang Qiu and H. P. Frei, Concept-Based Query Expansion, Swiss Federal Institute of Technology, Zurich, Switzerland.
[5] Ian Ruthven, Re-examining the Potential Effectiveness of Interactive Query Expansion, Department of Computer and Information Sciences, University of Strathclyde, Glasgow.
[6] Magnus Sahlgren, An Introduction to Random Indexing, SICS, Swedish Institute of Computer Science.
[7] Trevor Cohen, Roger Schvaneveldt and Dominic Widdows, Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections, Journal of Biomedical Informatics, 2010.
[8] Blair, D. C., and Maron, M. E. (1985). An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, 28(3), 289-299.
[9] Berry, Michael W., and Browne, M. (October 2005). Email Surveillance Using Non-negative Matrix Factorization. Computational & Mathematical Organization Theory 11(3): 249-264. doi:10.1007/s10588-005-5380-5.
[10] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6): 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. http://lsi.research.telcordia.com/lsi/papers/JASIS90.pdf
[11] Widdows, D., and Ferraro, K. Semantic Vectors: a scalable open source package and online technology management application. In: 6th International Conference on Language Resources and Evaluation (LREC), 2008.
[12] EDRM Enron Dataset, http://edrm.net/resources/datasets/enron-data-set-files
[13] Precision and recall, http://en.wikipedia.org/wiki/Precision_and_recall
[14] Discounted cumulative gain, http://en.wikipedia.org/wiki/Discounted_cumulative_gain
[15] Latent Dirichlet allocation, http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Introduction

Linear document review, in which individual reviewers manually review and "code" documents ordered by date, keyword, custodian or some other simple scheme, has been the accepted standard within the legal industry for decades. Time has proven this method to be notoriously inaccurate and very costly. In a business environment where the sea of information, and therefore of potentially relevant electronically stored information (ESI), is ever expanding, technology-enhanced methods for increasing the efficiency and accuracy of review are becoming an ever more important piece of the eDiscovery puzzle. Courts have begun to push litigants to expedite the long-overdue paradigm shift from linear manual review to computer-expedited approaches, including Predictive Coding™. Judge Grimm framed this shift to computer-expedited review perfectly in a recent webinar1: "I don't know how it can legitimately be said that manual review of certain data sets…can be accomplished in the world in which we live. There are certain data sets which I would say cannot be done in the time that we have as simply as a matter of arithmetic. So, the question then becomes what is the best methodology to do this. And this methodology is so much more preferable than keyword searching. I don't know what kind of an argument could be made by the person who would say keyword searching would suffice as opposed to this sophisticated analysis. That's just comparing two things that can't legitimately be compared.
Because one is a bold guess as to what the significance of a particular word, while the other is a scientific analysis that is accompanied by a methodology..."

The volume of ESI continues to grow at alarming rates, and despite improved culling and early case assessment strategies2, linear review remains too expensive and too time consuming and is, as Judge Grimm articulates, simply not feasible in many cases3. An AmLaw 50 law firm recently estimated that document review costs account for roughly one-half of a typical proceeding's budget4. However, new computer-expedited review techniques like Predictive Coding can slash that number5 and provide a methodology that not only keeps budgets in check but also speeds the review process in a reasonable and defensible manner.

Predictive Coding addresses the core shortcomings of linear document review by automating the majority of the review process. Starting with a small number of documents identified by a knowledgeable person (typically a lawyer, occasionally a paralegal) as a representative "seed set", Predictive Coding uses machine-learning technology to identify and prioritize similar documents across an entire corpus, in the process "reviewing" every document in the corpus, whether 10 megabytes or 10 terabytes. The result? A more thorough, more accurate, more defensible and far more cost-effective document review, regardless of corpus size. Unlike other computer-expedited offerings, however, Predictive Coding is not a "black box" technology in which case teams are left trying to explain the algorithms of an advanced search application to a judge. Instead, Predictive Coding uses a workflow with built-in statistical sampling methodology that provides complete transparency and verifiability of review results, one that not only satisfies the Federal Rules' requirements for "reasonableness" of the review process6 but greatly exceeds linear review in overall quality control and consistency of coding decisions.

The Process

The Predictive Coding process starts with a person knowledgeable about the matter, typically a lawyer, developing an understanding of the corpus while identifying a small number of documents that are representative of the category(ies) to be reviewed and coded (e.g. relevance, responsiveness, privilege, issue relation). This case manager uses sophisticated search and analytical tools, including keyword, Boolean and concept search, concept grouping and more than 40 other automatically populated filters, collectively referred to as Predictive Analytics™, to identify probative documents for each category to be reviewed and coded. The case manager then drops each small seed set of documents into its relevant category and starts the "training" process, whereby the system uses each seed set to identify and prioritize all substantively similar documents across the complete corpus.7 The case manager and review team (if any) then review and code all "computer-suggested" documents to confirm their proper categorization and to further calibrate the system. This iterative step is repeated until no further computer-suggested documents are returned, meaning no additional substantively similar documents remain in the "unreviewed" portion of the corpus. The final step in the process employs the Predictive Sampling™ methodology to verify the accuracy and completeness of the Predictive Coding process (i.e. precision and recall) to a specified confidence level, typically 95% or 99%.
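To make the workflow concrete, the iterative train-suggest-review loop can be sketched with generic, off-the-shelf components. The following Python sketch uses TF-IDF features and logistic regression from scikit-learn purely as an illustration of the kind of supervised ranking involved; it is not Recommind's patented implementation, and the function name, threshold and example documents are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def suggest(seed_texts, seed_labels, unreviewed_texts, threshold=0.5):
    """One 'training' pass: fit on the coded seed set, then rank the
    unreviewed corpus by predicted probability of responsiveness and return
    the indices of the computer-suggested documents."""
    vec = TfidfVectorizer(stop_words="english")
    model = LogisticRegression(max_iter=1000).fit(vec.fit_transform(seed_texts),
                                                  seed_labels)
    scores = model.predict_proba(vec.transform(unreviewed_texts))[:, 1]
    ranked = sorted(zip(scores, range(len(unreviewed_texts))), reverse=True)
    return [i for score, i in ranked if score >= threshold]

# Hypothetical seed set (1 = responsive, 0 = not responsive) and unreviewed docs.
seeds = ["drilling contract renewal", "offshore rig maintenance invoice",
         "holiday party planning", "fantasy football picks"]
labels = [1, 1, 0, 0]
corpus = ["rig maintenance schedule attached", "lunch menu for friday"]
print(suggest(seeds, labels, corpus))

In practice the suggested documents would be routed to reviewers, their decisions added to the seed set, and the loop repeated until no new documents clear the threshold, which is the stopping condition described above.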
The result in most cases is a highly accurate and completely verifiable review process in which as little as 10% of a corpus is reviewed and coded by human reviewers, generating dramatic cost and time savings. Predictive Coding is based on three core workflow steps:

1. Predictive Analytics: Predictive Analytics includes the use of keyword, Boolean and concept search, and data-mining techniques, including over 40 automatically populated filters, to help a case management team develop an understanding of a matter and quickly identify sets (batches) of key documents for review. These sets are reviewed by the case team and establish the seed documents to be trained upon during Predictive Coding's Adaptive ID Cycles (iterations).

2. Adaptive ID Cycles: Adaptive ID Cycles, also called iterations, are repeated rounds of category training that identify additional documents that are "more like" the seed documents. In this process, documents identified as probative of a category during human review and Predictive Analytics are trained upon, with the application retrieving and prioritizing additional documents that it considers relevant to that category (i.e. substantively similar to the seed set). The cycle is as follows:
a. Relevant seed documents are "trained" upon.
b. The system suggests documents that are substantively similar to the seed set for that category.
c. The case team reviews and codes the suggested documents, providing further calibration for the system.
d. All relevant seed documents are "trained" upon again, and the iterations continue.

3. Predictive Sampling: Predictive Sampling is the use of statistical sampling as a quality control process to test the results of a Predictive Coding review. It provides quantifiable validation that the process used was reasonable and, as a result, defensible. Predictive Sampling is applied once Adaptive ID Cycles yield few or no additional responsive documents, meaning no substantively similar documents remain unreviewed and uncoded. The process entails pulling a random sample of the documents that have not been reviewed and having humans evaluate them for responsiveness. The review can be deemed complete once quality control sampling provides statistical confidence in the completeness of the review.

Predictive Sampling Examined

Quality control in the document review process has long been identified as something that is at best unevenly applied and at worst nonexistent.8 Of particular concern, and of criticism by no less than the Sedona Conference9, has been the reliance on inaccurate tools such as keyword search. Landmark eDiscovery cases, including the Victor Stanley10 and Mt. Hawley Insurance Co.11 decisions, have pushed parties not only to embrace more advanced technology but have gone so far as to identify sampling as the only prudent way to test the reliability of search, document review and productions, irrespective of the technology or approach used. In keeping with this emerging judicial mandate, the Predictive Coding workflow automates the sampling process in the form of Predictive Sampling, which provides statistically sound certainty rates for responsiveness, issue relation, etc.
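The random-sample check described in step 3 above can be sketched in a few lines. This is a rough, generic illustration (sometimes called an elusion test), not the vendor's actual Predictive Sampling implementation; the sample size, confidence constant and the code_responsive stand-in for human reviewers are illustrative assumptions.

import math
import random

def elusion_check(unreviewed_ids, code_responsive, sample_size=2400, z=1.96, seed=1):
    """Draw a simple random sample of the unreviewed documents, have each one
    coded, and report the estimated missed-responsive rate with a
    normal-approximation margin of error (z = 1.96 corresponds to roughly 95%)."""
    rng = random.Random(seed)
    sample = rng.sample(list(unreviewed_ids), min(sample_size, len(unreviewed_ids)))
    hits = sum(1 for doc_id in sample if code_responsive(doc_id))
    p = hits / len(sample)
    margin = z * math.sqrt(p * (1 - p) / len(sample))
    return p, margin

# Hypothetical use: 'code_responsive' stands in for the human reviewers' call.
unreviewed = range(100_000)
rate, moe = elusion_check(unreviewed, code_responsive=lambda d: d % 500 == 0)
print(f"estimated missed-responsive rate {rate:.3%} +/- {moe:.3%}")

A low estimated rate with a tight margin is the kind of quantitative evidence of completeness the paper describes.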
The soundness of this approach has been corroborated by eDiscovery industry commentators, including Brian Babineau, Vice President of Research and Analyst Services with Enterprise Strategy Group: "Predictive Sampling assesses the thoroughness and quality of automated document review, helping to fortify the defensibility of Predictive Coding. Leading jurists have already written that the superiority of human, eyes-on review is a myth, so law firms continue to work with technology vendors to fill in much of this gap. Predictive Coding with Predictive Sampling enables users to comfortably leverage technology to attain a level of speed and accuracy that is not achievable with traditional linear review processes."

The Predictive Sampling process is relatively straightforward. A statistically significant number of documents (typically 2,000 to 10,000) are randomly set aside by the system before the review or analysis begins; this set of documents is the "control set" against which the review, both by the review team and by the Predictive Coding system, will be measured to validate the accuracy and error rate of all coding decisions. The control set is reviewed by the case team for all relevant categories (relevance, responsiveness, privilege and/or issue relation), with the positive/negative rates for each category automatically tracked by the system. Once the Adaptive ID Cycle step is completed, a small selection of the remaining, unreviewed corpus is randomly selected by the system for review by the review team (again, typically 2,000 to 10,000 documents for statistical significance). This latter set is then reviewed and coded to see whether any probative-yet-unidentified documents (false negatives) can be found. The results of this review are then compared against the results from the review of the initial control set, from which a statistically significant and verifiable measurement of the Predictive Coding process's accuracy and completeness (i.e. precision and recall) is obtained. Incidentally, while beyond the scope of this paper, the above process has a significant benefit beyond validating the Predictive Coding process: the ability to use quality control in the review process as an offensive weapon.

Unparalleled Review Speed, Accuracy, Cost Savings and Defensibility

The most immediate benefit of Predictive Coding is a dramatic reduction in the review time required, which decreases review costs significantly while simultaneously improving review quality. Predictive Coding has been shown to speed up the review process by a factor of 2-5x, yielding 50-90% savings in the cost of review. Time and cost improvements include:
- Predictive Analytics provides early insight into the substance of a corpus and its key documents before review has begun, allowing a targeted approach to creating the seed documents used for category training.
- More relevant documents are in front of reviewers, more often and more quickly, so reviewers see fewer non-relevant documents, further expediting the review process.
- The process provides a pre-populated (predictive) coding form to the reviewer, so human review is largely a confirmation of computer-suggested coding, which saves review time and improves coding consistency.
- The process provides highlighting hints within the document to guide the reviewer's decisions and to quickly focus his or her attention on the most important parts of the document, which is particularly helpful with longer documents.
- Category training provides a self-assessment of quality in the form of a confidence score, allowing the reviewer to focus on the most critical parts of the review.

Additional improvements in review quality with Predictive Coding enhance the coding decisions made by case teams:
- The predictive suggestion in the coding form leads to a significantly more consistent review across different reviewers.
- The human reviewer is typically very precise when making a positive decision, but the completeness of the reviewer's coding is often lacking; for example, reviewers may miss certain issue codes or fail to notice sections of a document that lead to a privilege classification. Predictive Coding not only provides a predictive check for reviewers to investigate but also highlights critical concepts identified in the document, alerting reviewers to critical aspects of each document.
- Category training is typically run in a mode that is overly complete, i.e. it errs on the side of recall. As a result, overall review quality typically improves significantly while maintaining a 2-5x speed improvement.
- Predictive Sampling used as a quality control process can give case teams 95-99% certainty that relevant documents have been identified, confidence that is unmatched by any linear review or keyword search method.

Conclusion

In an era when escalating costs and increasing volumes demand a better way to manage the document review process, more and more legal teams are turning to new methodologies to address client needs and concerns. The question is no longer whether legal teams must reduce the time and cost of review but which method they will implement that is both effective and defensible. In response to this acute need, Predictive Coding with Predictive Sampling achieves the "holy grail" of document review: the judgment and intelligence of human decision-making, the speed and cost-effectiveness of computer-assisted review, and the reasonableness and defensibility of statistical sampling. This patented methodology facilitates a fully defensible review while dramatically reducing review costs and timelines and improving the accuracy and consistency of document review.

1 Webinar found at http://www.esibytes.com/?p=1572. Cited reference at 40:10. Last accessed April 15, 2011.
2 Jason Robman: The power of automated early case assessment in response to litigation and regulatory inquiries. The Metropolitan Corporate Counsel, p. 33, March 2009.
3 Craig Carpenter: Document review 2.0: Leverage technology for faster and more accurate review. The Metropolitan Corporate Counsel, February 2008.
4 Anonymous AmLaw 100 Recommind customer, January 2011.
5 Robert W. Trenchard and Steven Berrent: Hope for Reversing Commoditization of Document Review? New York Law Journal, http://www.nylj.com, p. 3, April 18, 2011.
6 Robert W. Trenchard and Steven Berrent: The Defensibility of Non-Human Document Review. Digital Discovery & e-Evidence, 11 DDEE 03, 02/03/2011.
7 Craig Carpenter: E-Discovery: Use Predictive Tagging to Reduce Cost and Error. The Metropolitan Corporate Counsel, April 2009.
8 Craig Carpenter: Predictive Coding Explained. INFOcus blog post, March 10, 2010.
9 See Practice Point 1 from The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery.
10 Victor Stanley Inc. v. Creative Pipe Inc., --F. Supp. 2d--, 2008 WL 2221841, *3 (D. Md. May 29, 2008).
11 Mt. Hawley Ins. Co. v. Felman Prod. Inc., 2010 WL 1990555 (S.D. W. Va. May 18, 2010).

Copyright © Recommind, Inc. 2000-2011. Recommind, Inc.'s name and logo are registered trademarks of Recommind, Inc. MindServer, Decisiv Email, Auto-File, SmartFiltering, Axcelerate, Insite Legal Hold, QwikFind, One-Click Coding and Predictive Coding are trademarks or registered trademarks of Recommind, Inc. or its subsidiaries in the United States and other countries. Other brands and names are the property of their respective owners.

Manuscript submitted to the Organizing Committee of the Fourth DESI Workshop on Setting Standards for Electronically Stored Information in Discovery Proceedings on April 20, 2011. Updated May 18, 2011.

Application of Simple Random Sampling1 (SRS) in eDiscovery
Doug Stewart
Daegis

Abstract

eDiscovery thought leadership organizations advocate the use of sampling throughout much of the EDRM process, and judging from the numerous and frequent references to "sampling" in the eDiscovery literature and online content, there appears to be wide acceptance of these techniques for validating eDiscovery efforts. At the same time, there are lingering questions and concerns about the appropriateness of applying random sampling techniques to eDiscovery data sets. This paper offers evidence that random sampling of eDiscovery data sets yields results consistent with well-established statistical principles. It shows that Simple Random Sampling (SRS) can be used to make accurate predictions about the composition of eDiscovery data sets and thus to validate eDiscovery processes.

Introduction

Sampling is often mentioned as the principal method of validating many eDiscovery activities and decisions. Thought leadership organizations such as The Sedona Conference, EDRM and the TREC Legal Track have published guides, protocols and reports that explicitly call for the use of sampling techniques in various eDiscovery processes2. "Sampling" is also frequently mentioned in the literature, at conferences and in various forms of online content3 as a key tool for validating the results of collection, search, document review and other technology-assisted eDiscovery activities. Further, the courts have called for the use of sampling in the eDiscovery process4. Despite these strong endorsements, there appears to be some reluctance or inertia toward the adoption and integration of sampling methods into the eDiscovery workflow. To some extent this reluctance may reflect a lack of understanding, as most lawyers do not receive training in statistical principles. Lack of understanding may also contribute to lingering doubts about the suitability of Simple Random Sampling (SRS) techniques in the eDiscovery process. Additional education and training focused on applying sampling techniques in eDiscovery should drive adoption and acceptance of these methods. The Sedona Conference, EDRM and others5 recognize this need and have provided leadership and advocacy in this area. Additionally, simple demonstrations that these techniques work may prove to be one of the best ways to dispel some of the concerns.
1 A sampling technique where every document in the population has an equal chance of being selected.
2 See http://www.thesedonaconference.org/content/miscFiles/Achieving_Quality.pdf; http://edrm.net/resources/guides/edrm-search-guide; and http://trec-legal.umiacs.umd.edu/LegalOverview09.pdf.
3 For example, "Using Predictive Coding – What's in the Black Box?" K. Schieneman et al., http://www.esibytes.com/?p=1649.
4 Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008).
5 "Sampling for Dummies: Applying Measurement Techniques in eDiscovery," webinar by M. Grossman and G. Cormack, 01/27/2011.

This study sets out to test the efficacy and applicability of SRS techniques in the eDiscovery process. In doing so, it guides the reader through the process of applying sampling methods to eDiscovery data sets. Several sampling methods are described and tested, and the key parameters, including sample size, confidence level and confidence interval, are discussed and measured.

Methods and Material

The metadata of six inactive eDiscovery databases was searched and sampled for the purposes of this study. The databases ranged in size from a few thousand to more than a million records. Various fields, including author, custodian, date, file type and responsive, were searched and sampled using the following four sampling techniques:
1. Simple Random Sampling: Random sample sets created by randomly selecting records from the specified population, using the Microsoft .NET 3.5 Random class to generate random record sets. Required sample size was one of the input parameters.
2. Systematic Sampling: Random sample sets created by selecting every nth record from the specified population using a T-SQL script. A calculation was performed to determine the value of n required to produce the appropriate sample size.
3. MD5 Hash Value Sampling: Random sample sets created by running an MS SQL Server query to select all records with MD5 hash values beginning with two designated characters (e.g., AF or 4A). This method was used to produce a random sample of 1/256th of the population.
4. Non-Random Sampling: Non-random sample sets created by running a search for documents that fell within a certain date range. Not to be confused with a weighted sample.

The key parameters used to create the random samples for this study included:
1. Confidence Interval: Also called the "margin of error", the confidence interval indicates the precision of the sample's estimate by providing upper and lower limits on the estimate (e.g., plus or minus 2%).
2. Confidence Level: An indication of how certain one can be about the results. A 95% confidence level means that 95 times out of 100 the estimate will reflect the population's composition within the margin of error given by the confidence interval.
3. Sample Size: Determined using a sample size calculator. Required inputs include the desired confidence level and the desired confidence interval. The sample size is related to the population size but does not scale linearly. For example, the sample size required to achieve a 95% confidence level with a +/-2% confidence interval is shown below for a variety of population sizes:

Population     1,000    10,000    100,000    1,000,000    10,000,000
Sample Size      706     1,936      2,345        2,395         2,400

4. Population or Population Size: The total number of documents in the source data set.
5. Percentage or Prevalence: The percentage of documents in the population that have the property being measured (e.g., the percentage of documents that are responsive). If the value is known, it can be used to fine-tune the confidence interval; if it is not known, 50% must be used, which yields the most conservative sample size.

Sample sizes, confidence levels and confidence intervals were calculated using the sample size calculator found at http://www.surveysystem.com/sscalc.htm. All analysis work was done using Microsoft Excel 2007.

Results

Graph #1: This graph shows the relative precision of each sampling method based on a single iteration of each, i.e., how well the sampling techniques performed relative to one another. Precision is represented by the ratio of the absolute value of the sample's variance from the overall population, for the property under investigation, divided by the sample's confidence interval (margin of error) as determined using the sample size calculator. For instance, if the property under investigation were "ABC = Yes", the precision ratio would be calculated as follows:

Precision = abs((% of ABC = Yes in sample) - (% of ABC = Yes in population)) / sample's confidence interval

A result of 1 or less indicates that the sample fell within the confidence interval and thus conforms to the principles of SRS and accurately characterizes the entire population; a result greater than 1 indicates a sample that does not accurately estimate the population. For example, a precision score of 0.50 indicates that the sample estimate varied from the actual population value by half of the margin of error, while a score of 5.0 indicates that the sample estimate exceeded the margin of error by a factor of 5.

Graph #1: Variance from Population / Confidence Interval for the Non-Random, MD5 Hash, Systematic and SRS sampling methods

Graph #2: This graph shows the variance of the SRS-derived sample from the population for six different eDiscovery databases. The sample size calculator was used to determine sample sizes based on a 95% confidence level and a +/-2% confidence interval. The property analyzed was the responsive (yes/no) value that had been assigned in the review phase of each project's lifecycle. The variance was calculated as follows:

Variance = abs((% of Responsive = Yes in sample) - (% of Responsive = Yes in population))

The data sets (S1 to S6) ranged in size from approximately 4,000 to 1,400,000 records, and the prevalence of the property under investigation ranged from approximately 2% to over 85% of the population. The experimental data easily fit within the allowable margin of error.

Graph #2: Variance from Population (95% +/-2%) for data sets S1 through S6
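The sample sizes listed in the table under Methods and Material, and Graph #1's precision ratio, can be reproduced with standard formulas. The following Python sketch uses the usual proportion-estimation formula with a finite population correction; the rounding convention of the online calculator is assumed, and the precision-ratio inputs shown are hypothetical.

import math

def sample_size(population, confidence_interval=0.02, z=1.96, prevalence=0.5):
    """Sample size needed to estimate a proportion to within +/- confidence_interval
    at the confidence level implied by z (1.96 for 95%), applying a finite
    population correction to the infinite-population size n0."""
    n0 = (z ** 2) * prevalence * (1 - prevalence) / confidence_interval ** 2
    return round(n0 / (1 + (n0 - 1) / population))

def precision_ratio(sample_pct, population_pct, confidence_interval):
    """Graph #1's precision measure: |sample estimate - population value| divided
    by the sample's margin of error; a value of 1 or less falls within the interval."""
    return abs(sample_pct - population_pct) / confidence_interval

for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    print(n, sample_size(n))   # 706, 1936, 2345, 2395, 2400, matching the table above

print(precision_ratio(0.315, 0.305, 0.02))   # hypothetical inputs, roughly 0.5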
Graph #3: This graph shows the results of running 10,000 iterations of SRS on a single database, twice, and counting the number of samples that exceeded the confidence interval. The sample size calculator was used to calculate the confidence interval based on a specified sample size, confidence level and the known prevalence (percentage) of the record property under investigation. A confidence level of 95% predicts that 9,500 of the 10,000 samples analyzed would produce an estimated prevalence matching that of the population within the confidence interval, and that 500 (5%) would fall outside it. A confidence level of 99% predicts that 9,900 of the 10,000 samples would match the population within the confidence interval, and that 100 (1%) would fall outside it. The experimental data match the SRS predictions with extraordinary accuracy.

Graph #3: 10,000 SRS iterations at the noted confidence levels (data set S4: 510 actual vs. 500 expected samples exceeding the confidence interval at 95%; 103 actual vs. 100 expected at 99%)

Graph #4: This graph shows the results of running 10,000 iterations of SRS on a single database and plotting the frequency distribution of each sample's percentage variance from the population. The sample size calculator was used to calculate the sample size based on the desired confidence level and confidence interval. The data reveal that the distribution of all the sample estimates centers on the actual prevalence percentage found in the population and trails off as one moves away from the center, as predicted by SRS. As a result, the graph conforms to a normal distribution.

Graph #4: Frequency distribution of each sample's variance from the population (bins ranging from approximately -2.6% to +2.7%)

Discussion

The data represented in Graph #1 agree with established statistical principles and support the common assumption that random sampling techniques produce samples that make more precise estimates or predictions about the population as a whole than non-random sampling techniques. In this study the non-random sample varied from the population by nearly six times the expected confidence interval (margin of error), while the randomly generated samples all fell within the expected confidence interval.

Graph #2 demonstrates that SRS methods can be used across a variety of eDiscovery data sets to make predictions about the full population that fall within the calculated confidence intervals. The results indicate that, regardless of population size, the SRS techniques were able to estimate the population to within roughly 0.5 percent. The consistency of the estimates is all the more striking given that the prevalence of the property in question ranged from just over 2% to over 85% across the six data sets, and the data sets themselves ranged in size from approximately 4,000 to 1,400,000 documents.

Graph #3 indicates that SRS of eDiscovery databases produces results that fall within the calculated confidence levels and confidence intervals. The confidence levels are supported by the iteration data with remarkable accuracy: out of 10,000 iterations, the results varied from the SRS predictions by only 10 samples at the 95% level and three samples at the 99% level.

The normal distribution seen in Graph #4 strongly suggests that SRS of eDiscovery data sets produces results that adhere to well-established statistical principles. Specifically, the variance from the population across the 10,000 samples follows the distribution predicted by the Central Limit Theorem6.
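Readers without access to the original databases can approximate the Graph #3 experiment with a short simulation. The Python sketch below models each SRS draw as a set of independent Bernoulli samples, which closely approximates simple random sampling when the population is much larger than the sample; the 10% prevalence and other parameters are illustrative assumptions, not the actual values of data set S4.

import math
import random

def simulate(prevalence=0.10, sample_size=2400, z=1.96, iterations=10_000, seed=42):
    """Repeat SRS-style draws and count how often the sample's estimated
    prevalence falls outside the calculated margin of error."""
    rng = random.Random(seed)
    margin = z * math.sqrt(prevalence * (1 - prevalence) / sample_size)
    exceed = 0
    for _ in range(iterations):
        hits = sum(1 for _ in range(sample_size) if rng.random() < prevalence)
        if abs(hits / sample_size - prevalence) > margin:
            exceed += 1
    return exceed

print(simulate())           # roughly 500 of 10,000 expected to exceed at the 95% level
print(simulate(z=2.576))    # roughly 100 of 10,000 expected to exceed at the 99% level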
Conclusions

The prevailing assumption that SRS, when applied to eDiscovery data sets, produces results in line with accepted statistical principles is supported. This study provides compelling empirical evidence for the widely held belief that SRS is one of the best means of validating search and other eDiscovery activities. The fact that a sample of fewer than 2,400 records from a population of one million can be used to accurately estimate the population as a whole may defy intuition; the best way to get comfortable with SRS is to employ the techniques and test them. Firsthand experience seems to be the best teacher.

Future work should include the creation of protocols and standards for further incorporating SRS methods into the eDiscovery workflow. This effort should also include standardized protocols for reporting on the sampling methods employed and the results obtained, to ensure transparency in the process. Standardized protocols for the use of sampling techniques may also serve to educate and familiarize those who have gaps in their understanding of these established techniques. Sampling will play an increasingly important role in the eDiscovery process as the industry continues to mature, as data volumes continue to rise and as technology continues to advance. As such, the eDiscovery industry and its thought leaders should continue their educational and training efforts to ensure that the relevant segment of the legal community is comfortable with the application of these techniques. Transparency in process, standardization, further training and practical demonstrations of how well sampling techniques work will go a long way toward achieving this goal.

Doug Stewart has over 25 years of IT, security and management expertise in the field of electronic discovery and litigation support. As Daegis' Director of Technology, Doug has been instrumental in the development and deployment of Daegis' eDiscovery Platform, which includes functionality for hosted review, on-site deployment, iterative search and much more. In 2009, Doug oversaw Daegis' ISO 27001 certification for information security management, which includes a rigorous annual audit process. In addition, Doug manages several departments at Daegis, including IT, data collection and information security.

6 The Central Limit Theorem states that as the sample size increases, the sample means tend to follow a normal distribution.