Understanding and Evaluating Automated Document Encoding Solutions for Litigation Support
Introduction
This Valora white paper discusses the common use of
bibliographic coding to provide a logical structure and retrieval mechanism for
the volume of paper (and electronic) documents involved in litigation
matters. After exploring what
document coding involves, its benefits and its drawbacks, this paper outlines
promising new approaches to the same problem using automated encoding
technology.
Coded Databases
Many litigators have come to
realize the value of a fully “encoded” database for managing their discovery
documents. For those who are
unfamiliar with this practice, it involves reading and preparing each
document so that a list of database fields pertaining to it can be filled
out. The fields describe
information about the document, such as its title, author, date and intended
recipient. With a fully fielded
database of this “meta” information, an attorney can quickly locate the kind of
documents she might need for a given purpose, such as all memos written by Party
X to Party Y concerning Topic Z within a certain date range. Without the coded database, this is a
slow, tedious process that involves the physical culling of many, many
documents. The coded database
allows for an infinite variety and complexity of searching and retrieval
tasks.
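
To make the retrieval idea concrete, the search described above (memos from Party X to Party Y concerning Topic Z within a date range) can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names, the `find_documents` helper and the sample records are hypothetical and do not represent any particular litigation support product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CodedRecord:
    """One coded database record; the field names are illustrative assumptions."""
    doc_id: str
    doc_type: str
    title: str
    author: str
    recipient: str
    doc_date: date
    topics: tuple


def find_documents(records, doc_type, author, recipient, topic, start, end):
    """Return the records that match every fielded criterion."""
    return [
        r for r in records
        if r.doc_type == doc_type
        and r.author == author
        and r.recipient == recipient
        and topic in r.topics
        and start <= r.doc_date <= end
    ]


# Hypothetical sample data, invented purely for this sketch.
records = [
    CodedRecord("DOC-0001", "memo", "Pricing update", "Party X", "Party Y",
                date(2001, 3, 15), ("Topic Z",)),
    CodedRecord("DOC-0002", "letter", "Pricing update", "Party X", "Party Y",
                date(2001, 3, 16), ("Topic Z",)),
    CodedRecord("DOC-0003", "memo", "Staffing", "Party X", "Party Y",
                date(2001, 4, 2), ("Topic Q",)),
    CodedRecord("DOC-0004", "memo", "Pricing follow-up", "Party X", "Party Y",
                date(2002, 1, 9), ("Topic Z",)),
]

matches = find_documents(records, "memo", "Party X", "Party Y", "Topic Z",
                         date(2001, 1, 1), date(2001, 12, 31))
print([r.doc_id for r in matches])
```

Each additional coded field becomes one more condition that can be combined with the others, which is where the flexibility of a coded database comes from.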
Creating Coded Databases
Until recently, these databases were created by hand, by
trained paralegals and associates who would read each document and make the
appropriate data entries into the database. This is largely how this process is
still accomplished today. Such
trained personnel, however, come at great cost. With billing rates of $85-$100
per hour and up, per-document (per database record) charges tend to fall in the
$1.50-$2.25 range. It is no wonder,
then, that enterprising individuals with ties to English-speaking, developing
nations have established “overseas coding houses” to exploit the relatively
cheap cost of labor to perform the same data entry tasks. With the rise of the Internet, email
communication and document imaging (scanned files of the physical paper
documents), coding can happen anywhere there is electricity. Because of their much lower labor rates,
overseas operations typically charge between $0.90 and $1.30 per document for
a coding service similar to that performed by a firm's own paralegals and
associates.
Automated Coding
While various countries battle it out for the cheapest cost
of labor, another revolution in database creation has been quietly occurring
here in the United States.
Technology has begun to replace the manual labor in many data entry
situations. Calling into an
automated touch-tone menu (“Press 1 for Customer Support...”) is an example of
technology replacing human data entry.
Other examples include credit card swipe systems, fill-in-the-circle forms
and automated highway toll devices. While some
industries have been using automated data entry techniques for many years
(insurance, banking, etc.), other industries are only now seeing such techniques
become mainstream (health care, law, pharmaceuticals).
For litigation support, the rise of automated coding means
the use of computer software to automatically create the document databases
described above. The advantages to
computerized coding are many:
1. The cost is significantly lower than with even the cheapest human labor.
2. The time to code is orders of magnitude lower.
3. The consistency is greatly improved over human output.
To give you an idea of the scope of these improvements,
consider the following generally accepted figures. Most trained paralegals can code (enter
field data into a database) at a rate of about 30 documents per hour, assuming
they have the proper tools and training.
A single PC running auto-coding software can typically code 5,000
documents per hour. Furthermore,
the PC does not go home at night or on weekends, does not get paid overtime (or
at all!), has no benefits, needs no training and always performs its job as
expected.
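
The throughput figures above can be turned into a quick back-of-the-envelope comparison. In the sketch below, the two rates come from this paper; the collection size of 100,000 documents is a hypothetical example, and the calculation covers raw coding throughput only, not any review or quality-control time.

```python
# Rates quoted in this paper.
MANUAL_DOCS_PER_HOUR = 30      # one trained paralegal
AUTO_DOCS_PER_HOUR = 5000      # one PC running auto-coding software


def hours_to_code(doc_count, docs_per_hour):
    """Hours needed to code doc_count documents at a steady rate."""
    return doc_count / docs_per_hour


# Hypothetical collection size, chosen only for illustration.
collection_size = 100_000

manual_hours = hours_to_code(collection_size, MANUAL_DOCS_PER_HOUR)
auto_hours = hours_to_code(collection_size, AUTO_DOCS_PER_HOUR)
speedup = AUTO_DOCS_PER_HOUR / MANUAL_DOCS_PER_HOUR

print(f"Manual coding: {manual_hours:,.0f} paralegal-hours")
print(f"Auto-coding:   {auto_hours:,.0f} machine-hours")
print(f"Raw speedup:   about {speedup:.0f}x")
```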
In addition, automation can accomplish things that human
beings cannot. An automated system
will always perform the exact same job the exact same way, even if it is
performed over and over and over.
Human beings, by contrast, get bored, distracted and make honest
mistakes, even under the best of conditions. The more repetitive the task, the more
likely a human being is to make a mistake. This unfailing precision is
exactly what makes an automated system so well suited to the painstaking and
monotonous chore of coding.
Resolving the Drawbacks to Auto-coding
There is one thing, however, that the automated system lacks,
and that is judgement, which is the human being's specialty. Humans can look at a document and
interpret text that is backwards, upside down or even “inside out,” as can
sometimes happen with electronic document processing. Machines have a very hard time with
misaligned text. This is why
auto-coding has come to involve a combination of machine-oriented and
human-oriented tasks.
In auto-coding systems, such as Linkify, from Valora
Technologies, there is a mixture of human and machine involvement in the coding
process. The machine does the
“heavy lifting” portions of sorting through vast quantities of information and
producing the fielded data, while the person provides the uniquely human
judgement, when the machine requests it.
A human task in auto-coding, for example, is to disambiguate
two similar names (Jane Smith and Janet Smith) that might refer to
the same person, but might equally not. The person provides her judgement and
instructs the machine how to proceed.
This symbiotic relationship between man and machine results in a highly
effective process that is both fast and accurate.
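
The division of labor described above can be sketched as follows: the machine compares every pair of coded names and queues only the near-matches for a person to resolve. This is a minimal illustration, not a description of how Linkify works. It uses the standard-library `difflib` similarity ratio, and the 0.85 threshold, the `needs_human_review` helper and the sample names are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical cut-off: pairs at or above this similarity go to a person.
REVIEW_THRESHOLD = 0.85


def similarity(name_a, name_b):
    """Case-insensitive similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def needs_human_review(names, threshold=REVIEW_THRESHOLD):
    """Return pairs of distinct names that are similar enough to be ambiguous."""
    flagged = []
    for a, b in combinations(sorted(set(names)), 2):
        if similarity(a, b) >= threshold:
            flagged.append((a, b))
    return flagged


author_names = ["Jane Smith", "Janet Smith", "Robert Jones", "Jane Smith"]
for pair in needs_human_review(author_names):
    print("Ask a reviewer whether these are the same person:", pair)
```

The reviewer's answer can then be recorded and applied to every record containing either name, so the question is asked once rather than once per document.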
What to Expect With Auto-coding
Many people are unsure what to expect with auto-coding and
are highly skeptical of the output.
This is only natural, as human beings have accomplished the task for
many, many years. Our experience
has shown that the machine output and the human output are often
indistinguishable from one another.
The auto-coded output looks very much like hand-coded output: each field is filled in at its proper
place in the database. The only difference is that
the man-machine combination described above
produced it.
You can expect your auto-coded database to be completed in
roughly one-half to one-third of the time it typically takes to manually create the
database. Costs typically run in
the $0.50-$0.75 per document range.
Accuracy of any output is always a tricky thing to measure,
and there are varying claims in the marketplace. Typically, output that is solely
generated by a machine and not subsequently QC’ed (manually checked) is in the
50-75% accuracy range. That means
more than 1 error per document (or database record). Valora QCs all of its output and
guarantees at least 97% accuracy.
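
One way to read these percentages is as per-field accuracy, in which case the expected number of errors in a record is the number of coded fields multiplied by the error rate. The sketch below works through that arithmetic. Both the per-field interpretation and the figure of five fields per record are assumptions made for illustration; the paper does not state how accuracy is measured or how many fields a record contains.

```python
def expected_errors_per_record(field_count, field_accuracy):
    """Expected number of wrong fields in one record, assuming each field
    is independently correct with probability field_accuracy."""
    return field_count * (1.0 - field_accuracy)


# Hypothetical record layout: title, author, recipient, date, document type.
FIELDS_PER_RECORD = 5

for accuracy in (0.50, 0.75, 0.97):
    errors = expected_errors_per_record(FIELDS_PER_RECORD, accuracy)
    print(f"{accuracy:.0%} field accuracy: about {errors:.2f} errors per record")
```

Under these assumptions, machine-only output at 50-75% accuracy averages more than one wrong field per record, while output at 97% accuracy averages well under one.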
When to Use Auto-coding
Auto-coding is a viable alternative (or supplement) to
hand-coding in many cases. It
performs best on large document sets that contain a variety of document
formats. Valora’s auto-coding
performs best on correspondence type documents (letters, email, faxes, memos,
etc.) and on “formulaic” documents (pleadings, agendas, reports, etc.).
Auto-coding makes use of optical character recognition (OCR)
and is somewhat at the mercy of the OCR’s output quality. In general, the better the OCR, the
better the auto-coded output. The
more sophisticated auto-coding solutions on the market work within the
limitations of OCR and have techniques for “compensating” for some of its
shortcomings (misread characters, lost characters, etc.). Valora’s Linkify system employs several
sophisticated, patent-pending approaches for supplementing, correcting and
interpolating typical OCR output, such that it is able to make quite accurate
“guesses” (probabilistic assumptions) as to the correct field information to
report in most circumstances.
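
As a simple illustration of what compensating for misread characters can look like, the sketch below repairs digit look-alikes (such as the letter O read in place of zero) inside tokens that are shaped like dates. It is a generic technique shown under stated assumptions and is not Valora's patent-pending method: the substitution table, the date pattern, the month/day/year format and the `recover_dates` helper are all hypothetical.

```python
import re
from datetime import datetime

# Letters that OCR engines commonly produce in place of digits (assumed list).
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Tokens shaped like MM/DD/YYYY, allowing the look-alike letters above.
DATE_SHAPED = re.compile(
    r"\b[0-9OolISB]{1,2}/[0-9OolISB]{1,2}/[0-9OolISB]{4}\b"
)


def recover_dates(ocr_text):
    """Return the valid dates found in OCR text after repairing look-alikes."""
    dates = []
    for token in DATE_SHAPED.findall(ocr_text):
        candidate = token.translate(OCR_DIGIT_FIXES)
        try:
            dates.append(datetime.strptime(candidate, "%m/%d/%Y").date())
        except ValueError:
            # Still not a real calendar date; leave it for a human reviewer.
            continue
    return dates


print(recover_dates("Dated O3/l5/2OO1, re: Smith"))
```

Restricting the substitutions to date-shaped tokens matters: applying them to the whole page would corrupt ordinary words, so the correction is made only where the expected field format makes the intended characters reasonably clear.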
When Not to Use Auto-coding
Not all documents are auto-codable, unfortunately. Documents that are handwritten (notebook
pages, order forms, fax cover sheets), in columnar format (newspaper articles,
spreadsheets, accounting ledgers) or diagrammatic in nature (photos, graphics,
blueprints) do not usually OCR into meaningful text and typically cannot be
interpreted by an auto-coding engine.
Document sets that contain an unusually high proportion of such document
types are a poor fit for today’s auto-coding capabilities, although great
strides are being made all the time.
Because most document collections have at least some small
percentage of non-automatable documents, there is a certain degree of
traditional hand-coding that takes place on most auto-coding
productions. While the per-document
cost is almost always higher on the hand-coded documents, the overall cost of
the production is usually significantly less than if the entire set had been
hand-coded.
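
The blended-cost argument above can be checked with simple arithmetic. The sketch below uses per-document rates at the upper end of the ranges quoted in this paper ($0.75 auto-coded, $2.25 hand-coded); the collection size of 100,000 documents and the 10% share of non-automatable documents are hypothetical values chosen for illustration. Amounts are kept in whole cents to avoid rounding surprises.

```python
# Per-document rates in cents, taken from the upper end of the ranges above.
AUTO_RATE_CENTS = 75
HAND_RATE_CENTS = 225


def production_cost_cents(total_docs, hand_coded_docs):
    """Total cost when hand_coded_docs are coded by hand and the rest by machine."""
    auto_coded_docs = total_docs - hand_coded_docs
    return auto_coded_docs * AUTO_RATE_CENTS + hand_coded_docs * HAND_RATE_CENTS


# Hypothetical production: 100,000 documents, 10% of them not auto-codable.
total_docs = 100_000
hand_coded_docs = 10_000

blended = production_cost_cents(total_docs, hand_coded_docs)
all_hand = production_cost_cents(total_docs, total_docs)

print(f"Blended production: ${blended / 100:,.2f}")
print(f"Entirely hand-coded: ${all_hand / 100:,.2f}")
```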
Where to Get More Information
Valora Technologies, Inc. is one of the world’s leading authorities
on automated encoding of documents.
Our skilled team of engineers has been working on these technologies for
many years. Our technology and most
of our technical staff hail from the Massachusetts Institute of Technology
(MIT). Our research and development
efforts cover topics in computational linguistics, probabilistic systems and
enterprise software. Valora offers
its auto-encoding service to law firms, litigation support vendors and corporate
law departments. Please contact us
at 781-642-8806 or visit www.valoratechnologies.com.
Reprinted with permission of the author, Valora
Technologies, Inc. © Valora Technologies, Inc.,
2002.