
Understanding and Evaluating Automated Document Encoding Solutions for Litigation Support

Introduction

This Valora white paper discusses the common use of bibliographic coding to provide a logical structure and retrieval mechanism for the volume of paper (and electronic) documents involved in litigation matters. After exploring what document coding involves, its benefits and its drawbacks, this paper outlines promising new approaches to the same problem using automated encoding
technology.

Coded Databases

Many litigators have come to realize the value of a fully “encoded” database for managing their discovery documents. For those unfamiliar with the practice, it involves reading each document and filling out a list of database fields that pertain to it. The fields describe information about the document, such as its title, author, date and intended recipient. With a fully fielded database of this “meta” information, an attorney can quickly locate the kind of documents she might need for a given purpose, such as all memos written by Party X to Party Y concerning Topic Z within a certain date range. Without the coded database, this is a slow, tedious process that involves physically culling a great many documents. The coded database supports a nearly unlimited variety and complexity of searching and retrieval tasks.

Creating Coded Databases

Traditionally, these databases have been created by hand, by trained paralegals and associates who read each document and make the appropriate entries in the database. This is largely how the process is still accomplished today. Such trained personnel, however, come at great cost. With billing rates upwards of $85-$100 per hour, per-document (per database record) charges tend to be in the $1.50-$2.25 range.

It is no wonder, then, that enterprising individuals with ties to English-speaking developing nations have established “overseas coding houses” that use relatively cheap labor to perform the same data entry tasks. With the rise of the Internet, email and document imaging (scanned files of the physical paper documents), coding can happen anywhere there is electricity. Because of their much lower labor rates, overseas operations typically charge between $0.90 and $1.30 per document for a coding service similar to that performed by the paralegals and associates within the firm.
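
To make the idea of a coded database concrete, the following is a minimal sketch of the kind of fielded search described above (memos from Party X to Party Y concerning Topic Z within a date range). The record layout, the field names, the `find_documents` helper and the sample data are all assumptions made for illustration; they do not describe any particular litigation support product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CodedRecord:
    """One database record describing a single document (hypothetical schema)."""
    doc_id: str
    doc_type: str
    title: str
    author: str
    recipient: str
    doc_date: date


def find_documents(records, doc_type, author, recipient, topic, start, end):
    """Return records matching type, author, recipient, a title keyword and a date range."""
    topic_lower = topic.lower()
    return [
        r for r in records
        if r.doc_type == doc_type
        and r.author == author
        and r.recipient == recipient
        and topic_lower in r.title.lower()
        and start <= r.doc_date <= end
    ]


# Made-up sample records, for illustration only.
RECORDS = [
    CodedRecord("DOC-001", "memo", "Topic Z pricing review", "Party X", "Party Y", date(2001, 3, 14)),
    CodedRecord("DOC-002", "letter", "Topic Z pricing review", "Party X", "Party Y", date(2001, 4, 2)),
    CodedRecord("DOC-003", "memo", "Topic Z follow-up", "Party X", "Party Y", date(2002, 1, 9)),
    CodedRecord("DOC-004", "memo", "Holiday schedule", "Party X", "Party Y", date(2001, 5, 20)),
]

matches = find_documents(
    RECORDS, "memo", "Party X", "Party Y", "Topic Z",
    date(2001, 1, 1), date(2001, 12, 31),
)
print([r.doc_id for r in matches])  # prints ['DOC-001']
```

Only the first record satisfies every condition: the second is a letter, the third falls outside the date range, and the fourth does not mention the topic. Without coded fields, each of those checks would require someone to read the document itself.
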

Automated Coding

While various countries battle it out for the cheapest cost of labor, another revolution in database creation has been quietly occurring here in the United States. Technology has begun to replace manual labor in many data entry situations. Calling into an automated touch-tone menu (“Press 1 for Customer Support...”) is an example of technology replacing human data entry. Other examples include credit card swipe systems, fill-in-the-circle forms and automated highway toll devices. While some industries (insurance, banking, etc.) have been using automated data entry techniques for many years, others (health care, law, pharmaceuticals) are only now seeing such techniques become mainstream.

For litigation support, the rise of automated coding means the use of computer software to automatically create the document databases described above. The advantages of computerized coding are many:

1. The cost is significantly lower than with even the cheapest human labor.
2. The time to code is orders of magnitude lower.
3. The consistency is greatly improved over human output.

To give you an idea of the scope of these improvements, consider the following generally accepted figures. Most trained paralegals can code (enter field data into a database) at a rate of about 30 documents per hour, assuming they have the proper tools and training. A single PC running auto-coding software can typically code 5,000 documents per hour. Furthermore, the PC does not go home at night or on weekends, does not get paid overtime (or at all!), has no benefits, needs no training and always performs its job as expected.

In addition, automation can accomplish things that human beings cannot. An automated system will always perform the exact same job the exact same way, no matter how many times it is repeated. Human beings, by contrast, get bored, get distracted and make honest mistakes, even under the best of conditions.
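
The throughput figures quoted above can be turned into a rough elapsed-time comparison. The sketch below uses the two rates from the text (30 and 5,000 documents per hour); the collection size of 100,000 documents and the `coding_hours` helper are assumptions chosen for illustration, and the calculation covers raw coding time only, not quality control or human review.

```python
def coding_hours(num_documents, docs_per_hour):
    """Hours of raw coding time needed at a constant rate."""
    return num_documents / docs_per_hour


PARALEGAL_DOCS_PER_HOUR = 30       # rate quoted in the text for a trained paralegal
AUTO_CODER_DOCS_PER_HOUR = 5_000   # rate quoted in the text for a single PC

num_documents = 100_000            # hypothetical collection size

manual_hours = coding_hours(num_documents, PARALEGAL_DOCS_PER_HOUR)
auto_hours = coding_hours(num_documents, AUTO_CODER_DOCS_PER_HOUR)
speedup = AUTO_CODER_DOCS_PER_HOUR / PARALEGAL_DOCS_PER_HOUR

print(f"Manual coding: {manual_hours:,.0f} hours")     # Manual coding: 3,333 hours
print(f"Automated coding: {auto_hours:,.0f} hours")    # Automated coding: 20 hours
print(f"Raw throughput ratio: about {speedup:.0f}x")   # Raw throughput ratio: about 167x
```

The ratio applies to the machine's portion of the work alone. A production that also includes human review steps will finish more slowly than this raw figure suggests.
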

The more repetitive the task, the more likely a human being is to make a mistake. This unfailing precision is exactly what suits the automated system to the painstaking and monotonous chore of coding.

Resolving the Drawbacks to Auto-coding

There is one thing, however, that the automated system lacks, and that is judgement, the human being’s specialty. Humans can look at a document and interpret text that is backwards, upside down or even “inside out,” as can sometimes happen with electronic document processing. Machines have a very hard time with misaligned text. This is why auto-coding has come to involve a combination of machine-oriented and human-oriented tasks.

In auto-coding systems such as Linkify, from Valora Technologies, the machine does the “heavy lifting” of sorting through vast quantities of information and producing the fielded data, while a person provides the uniquely human judgement when the machine requests it. A human task in auto-coding, for example, is to disambiguate between two similar names (Jane Smith and Janet Smith) that might belong to the same person, but might equally not. The person provides her judgement and instructs the machine how to proceed. This symbiotic relationship between man and machine results in a highly effective process that is both fast and accurate.

What to Expect With Auto-coding

Many people are unsure what to expect with auto-coding and are highly skeptical of the output. This is only natural, as human beings have accomplished the task for many, many years. Our experience has shown that the machine output and the human output are often indistinguishable from one another. Each field is filled in at its proper place in the database.
The output looks just like hand-coded output; the only difference is that the man-machine combination described above produced it.

You can expect your auto-coded database to be completed in roughly one-half to one-third of the time it typically takes to create the database manually. Costs typically run in the $0.50-$0.75 per document range. Accuracy of any output is always a tricky thing to measure, and there are varying claims in the marketplace. Typically, output that is generated solely by a machine and not subsequently QC’ed (manually checked) is in the 50-75% accuracy range. That means more than one error per document (or database record). Valora QCs all of its output and guarantees at least 97% accuracy.

When to Use Auto-coding

Auto-coding is a viable alternative (or supplement) to hand-coding in many cases. It performs best on large document sets that contain a variety of document formats. Valora’s auto-coding performs best on correspondence-type documents (letters, email, faxes, memos, etc.) and on “formulaic” documents (pleadings, agendas, reports, etc.).

Auto-coding makes use of optical character recognition (OCR) and is somewhat at the mercy of the OCR’s output quality. In general, the better the OCR, the better the auto-coded output. The more sophisticated auto-coding solutions on the market work within the limitations of OCR and have techniques for “compensating” for some of its shortcomings (misread characters, lost characters, etc.). Valora’s Linkify system employs several sophisticated, patent-pending approaches for supplementing, correcting and interpolating typical OCR output, so that in most circumstances it can make quite accurate “guesses” (probabilistic assumptions) as to the correct field information to report.

When Not to Use Auto-coding

Not all documents are auto-codable, unfortunately.
Documents that are handwritten (notebook pages, order forms, fax cover sheets), in columnar format (newspaper articles, spreadsheets, accounting ledgers) or diagrammatic in nature (photos, graphics, blueprints) do not usually OCR into meaningful text and typically cannot be interpreted by an auto-coding engine. Document sets that contain an unusually high proportion of such document types are a poor fit for today’s auto-coding capabilities, although great strides are being made all the time.

Because most document collections have at least some small percentage of non-automatable documents, a certain amount of traditional hand-coding takes place on most auto-coding productions. While the per-document cost is almost always higher on the hand-coded documents, the overall cost of the production is usually significantly less than if the entire set had been hand-coded.

Where to Get More Information

Valora Technologies, Inc. is one of the world’s leading authorities on automated encoding of documents. Our skilled team of engineers has been working on these technologies for many years. Our technology and most of our technical staff hail from the Massachusetts Institute of Technology (MIT). Our research and development efforts cover topics in computational linguistics, probabilistic systems and enterprise software. Valora offers its auto-encoding service to law firms, litigation support vendors and corporate law departments. Please contact us at 781-642-8806 or www.valoratechnologies.com.

Reprinted with permission of the author, Valora Technologies, Inc. © Valora Technologies, Inc., 2002.