Getting Started Guide Classification and Separation
5.2
Copyright
© 1997 - 2008 Kofax Image Products, Inc., 16245 Laguna Canyon Road, Irvine,
California 92618, U.S.A. All rights reserved. Use is subject to license terms.
Kofax Image Products, Kofax and the Kofax logo are trademarks or registered
trademarks of Kofax Image Products, Inc. in the U.S. and other countries. All other
trademarks are the trademarks or registered trademarks of their respective owners.
U.S. Government Rights Commercial software. Government users are subject to the
standard license agreement for this product and all applicable provisions of the FAR
and its supplements.
You agree that you do not intend to and will not, directly or indirectly, export or
transmit the Software or related documentation and technical data to any country to
which such export or transmission is restricted by any applicable U.S. regulation or
statute, without the prior written consent, if required, of the Bureau of Export
Administration of the U.S. Department of Commerce, or such other governmental
entity as may have jurisdiction over such export or transmission. You represent and
warrant that you are not located in, under the control of, or a national or resident of
any such country.
Overview
Introduction
What is ADR and What does it Add to Capture?
Features of ADR
What is Classification?
What is Separation?
Classification and Separation of Documents in Production
Recognition
Document Review
Assembly
Typical System Architecture for an ADR Solution
Configuring a Classification and Separation Solution
The Tutorial
The Example Documents
Processing
Introduction
The Mortgage Applications Example
Introduction
Installing the Mortgage Applications Example
Document Classification
Running the Mortgage Applications Example
Tutorial: Running the Capture Path
Tutorial: Running in Dedicated Mode
Page Classification and Separation
Running the Mortgage Applications Example
Tutorial: Running the Capture Path
Configuration
Overview
Introduction
Sample Documents
Accuracy
Representative Documents
Document Set Management Steps
Create Configuration
Recognition
Document Review
Integrate the Configuration with Capture
Integration Steps
Document Classification Tutorial
Document Set Management
Step 1: Create Project
Step 2: Import Documents
Step 3: Assign Document Types
Introduction
This guide introduces IBM FileNet Capture Advanced Document Recognition (ADR)
and describes how it is used to automatically separate pages into documents and to
classify documents. It starts with brief installation instructions, which are followed by
a tutorial that guides you through processing batches with the pre-installed
Mortgage Applications example. The guide then describes how each of the
modules is configured. A second tutorial steps through creating a new set of
configuration files to classify the mortgage application documents (and shows how to
assign this configuration to Capture). The final tutorial steps through creating a new
configuration, this time including automatic separation of pages into documents.
This guide assumes that you have a thorough understanding of Windows standards,
applications, and interfaces.
This guide is for people who need an introduction to ADR, specifically automatically
classifying and separating documents. It will be beneficial to people who will be:
Configuring ADR for a specific settings collection
Administering or supporting an ADR solution
If you need more detailed information on configuring a module, open the ADR Help
and read the relevant “How to configure…” book. Additional details of all the
documentation provided with ADR are included in the section Related
Documentation.
Each PDF guide can be opened by clicking Start on the taskbar to display the menu
and selecting All Programs | IBM FileNet Capture Professional | ADR
Documentation.
The ADR Help can be opened from the same menu, but can also be opened from the
Help menu within the tools. Pressing F1 within Definer and Script Editor will open
the topic for the feature being used.
The Getting Started Guide (Free-Form) (.pdf) focuses on configuring a solution to extract
data from free-form (semi-structured or unstructured) documents.
ADR Help
The ADR Help is written for those configuring a solution and for system
administrators, and assumes those reading it have read the Getting Started Guides or
attended an ADR training course. This assumption is made so that the ADR Help can
provide the most accurate and detailed information across every aspect of the
product.
Overview
Introduction
This chapter introduces some of the concepts of data capture and key points of ADR.
Capture scans paper-based documents or imports images from file, creating a series
of scanned image files. Capture then routes the files through ADR, a set of modules
that (along with Assembly) separate pages into documents, classify documents and
extract information, creating understandable electronic data. Capture then
automatically transfers the data to index fields and commits the data and images to
the repository.
Features of ADR
This guide covers two key features of ADR: classification and separation.
What is Classification?
Classification Methods
What is Separation?
Separation Methods
Document separation is determined from the page classification results using either
of the following methods:
Rules-based document separation: One or more rules specify when new
documents are created; for example, if a page of type A is seen, create a
document of type X.
Advanced document separation: A probabilistic method that ascertains the
most likely document structure from the page classifications and their
confidence scores. Because it is probabilistic, this method is robust to variation in
documents and to misclassifications. For example, a six-page document
of type X has been specified to expect pages of type A, B, C, D, E and F. Five
of the pages are classified as expected, but one page is classified as type Y. Because
five out of six consecutive pages in the batch have been classified as
pages of document type X, it is highly probable that a document of this type has
been found.
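The rules-based method can be illustrated with a minimal Python sketch. This is not ADR's implementation: the function name, the rule table and the "Unknown" fallback are hypothetical, chosen only to mirror the example rule above (a page of type A starts a document of type X).

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: a page type that starts a new document,
# mapped to the document type that is created.
START_RULES: Dict[str, str] = {"A": "X"}


def separate_by_rules(page_types: List[str],
                      rules: Dict[str, str]) -> List[Tuple[str, List[str]]]:
    """Group classified pages into (document type, pages) tuples.

    A page whose type appears in `rules` opens a new document; every
    other page is appended to the document that is currently open.
    Pages seen before any rule fires go into an "Unknown" document
    (an assumption made for this sketch).
    """
    documents: List[Tuple[str, List[str]]] = []
    for page_type in page_types:
        if page_type in rules:
            documents.append((rules[page_type], [page_type]))
        elif documents:
            documents[-1][1].append(page_type)
        else:
            documents.append(("Unknown", [page_type]))
    return documents
```

For example, calling separate_by_rules(["A", "B", "C", "A", "B"], START_RULES) yields two documents of type X, each starting at a page of type A. The advanced method differs in that it weighs confidence scores across the whole page sequence rather than reacting to single pages.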
Recognition
Classification and separation are done in the same processing step, in an instance of
the Recognition module. A single solution would do one of the following:
Document Classification
Page Classification and Separation (resulting in document classification)
If extraction is also being done as part of the Capture Path, an additional instance of
the Recognition module named ADR Recognition (Classification and Separation) is
used for classification and separation.
Note Data extraction is generally done in the standard instance of Recognition, once
all document types have been determined (and manually reviewed if needed).
Document Review
Assembly
Assembly is typically run after the ADR modules to restructure the batch for the
repository, and if separation is done in ADR, to set the appropriate document class
for each document.
Note It is only after Assembly has run that Capture Professional will display the
document structure determined in ADR.
The system architecture is determined by the capacity of the system: high volume or
low volume.
In lower volume environments it is typical to run batches through all the modules on
a single station. ADR uses the standard Capture behavior: the user will normally
open Capture Professional and start the capture path.
The primary ADR tool used for configuration is Transformation Studio. The tutorial
in this guide will step you through the configuration process.
The Tutorial
This guide includes a tutorial on processing documents using the classification and
separation functionality in ADR.
The tutorial uses a set of example mortgage application documents with the
following document types:
Appraisal Report
Header
Funding Transmittal
Redemption
Initial Escrow
Request for Tax Form
Tax Escrow
Truth In Lending
Loan Application
Installation
Introduction
This chapter provides instructions for installing ADR using the installation wizard.
Note ADR is installed to a subfolder of the Capture installation path. By default this
location, referred to as <Installation Path>, is:
C:\Program Files\FileNet\CaptureADR
Standard Installation
Standard installation is done from the Capture installation wizard. These instructions
do not describe every installation screen, but focus on the key points.
To install ADR
1 Run the Capture installer and select the “Advanced Document Recognition
(ADR)” package.
2 When prompted by the Capture installer, place the ADR installation CD into
the CD-ROM drive.
The ADR installer will start automatically.
3 Follow the on-screen instructions.
Silent Installation
Silent installation of ADR is possible during a silent install of Capture.
Licensing
The following licenses are available for ADR:
Fixed-form Recognition
Free-form Recognition
Fixed-form and Free-form Recognition
Fixed-form and Free-form Recognition and Classification and Separation
During installation you will be prompted to enter a software license key which
specifies which of the above features are licensed.
In order to run the tutorial in this guide, you will need the last license: Fixed-form
and Free-form Recognition and Classification and Separation.
If you try to use the features without a suitable license, an error message will display.
Processing
Introduction
This chapter will introduce you to the ADR modules as they are used in production.
You will process a pre-configured example solution to experience how the modules
run.
Introduction
The ADR installation includes an example configuration that demonstrates some of
the processing features of classification and separation in ADR. The example uses a
pre-defined capture path, settings collection and template, as well as configuration
files for the ADR modules. The example has been configured twice to demonstrate
two different methods for using ADR: once to show document classification and
once to show page classification and separation (resulting in document
classification).
Document Classification
In the document classification example, ADR Recognition (Classification and
Separation) and ADR Document Review are used to classify mortgage applications.
Document boundaries are established prior to ADR Recognition (Classification and
Separation).
The complete set of Capture components and ADR modules used is:
File Import
Zonal OCR
Event Activator
ADR Recognition (Classification and Separation)
ADR Document Review
ADR Recognition
ADR Completion
ADR Scripted Export
Create a Batch
Create a batch based on the “Mortgage Applications Template” template.
Note It is not possible to see the document structure in Capture Professional until
Assembly has run.
To import images and establish document boundaries, make sure the name
of the new batch is highlighted and select Tools | Start | Capture Path.
File Import will import the example images from the following location:
Zonal OCR will read a single index field on each image. If the word
DOCUMENT is found in the field, Event Activator will set an attribute marking
this image as a separator sheet.
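The separator-sheet logic described here can be sketched in Python as follows. The Image class, its attribute names and both functions are hypothetical stand-ins, not Capture or ADR objects. The sketch also assumes a case-sensitive match on the word DOCUMENT and keeps each separator sheet as the first page of the document it starts.

```python
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical stand-in for a scanned image in a batch."""
    index_field_text: str        # text read from the single zonal OCR field
    is_separator: bool = False   # attribute set when the sheet is a separator


def mark_separators(images, keyword="DOCUMENT"):
    """Flag each image whose OCR field contains the keyword."""
    for image in images:
        image.is_separator = keyword in image.index_field_text
    return images


def split_on_separators(images):
    """Start a new document at every separator sheet.

    Images that appear before the first separator form a document
    of their own.
    """
    documents = []
    for image in images:
        if image.is_separator or not documents:
            documents.append([])
        documents[-1].append(image)
    return documents
```

Keeping the OCR read and the attribute-setting step separate mirrors the division of work between Zonal OCR and Event Activator in the capture path.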
The capture path will automatically continue onto the next stage, classifying
documents.
A message will display stating that there are no more problem documents to
display in Document Classification view so the batch will now be shown in
Review.
In Review, you can see all the documents in the batch. The document with
the two problems looks to be a very poorly scanned Truth In Lending
document. The next document in the batch is also a Truth In Lending.
4 Click the + buttons to the left of the two documents to expand them and
display the thumbnails
Note If this document could not be deleted or the problem fixed, the batch
can be suspended. It is then available to be re-opened in Capture
Professional ad-hoc mode by an administrator who could investigate the
problem.
Recognition will automatically extract the data from the eleven documents.
This is a “Header” document. No data has been extracted, but the document
type is displayed as a read only field.
2 Press F12 to move to the next document.
Again, no data has been extracted for the “Tax Escrow” document type.
3 Press F12 to move to the next document
4 Use Tab/Shift+Tab to navigate around the fields on Document 3.
Figure 3-7. Completion Window displaying Data Extracted for Loan Application
Documents
These fields have been specifically extracted for the “Loan Application”
document type. Such conditional extraction would not be possible without
the document classification that ran before.
5 Press F12 after viewing this document and after viewing each of the
remaining documents.
Notice the data that has been conditionally extracted (or deliberately not
extracted) for each of the document types in the document classification
solution.
When all the fields have been completed an End of Batch window is
displayed.
6 Click Exit Completion
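Based on the document type, Scripted Export sets attributes to specify how the batch
is split into documents and the settings collection for each document. This settings
collection is linked to a document class in the repository, which includes the index
fields to be populated.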
To export data, Scripted Export will automatically process the batch and close
once complete.
In a production system, Assembly, Index and Commit would be run after Scripted
Export. Using the attributes set by Scripted Export, the Assembly component would
split the original batch into a batch per document and assign the specified settings
collection to the single-document batch.
The Index component (configured for each document type's settings collection)
would then copy the field attributes to the index fields on the document class.
Commit would then commit the data to the repository and delete the batch.
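The hand-off between Scripted Export, Assembly and Index can be pictured with a short Python sketch. All class, field and function names below are hypothetical and exist only for illustration; the sketch models the data flow described above, not the actual Capture components.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """A classified document, with attributes set by the export step."""
    doc_type: str
    settings_collection: str
    field_attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class SingleDocumentBatch:
    """A batch that holds exactly one document after splitting."""
    settings_collection: str
    document: Document
    index_fields: Dict[str, str] = field(default_factory=dict)


def assemble(batch: List[Document]) -> List[SingleDocumentBatch]:
    """Split a multi-document batch into one batch per document,
    assigning each new batch the document's settings collection."""
    return [SingleDocumentBatch(d.settings_collection, d) for d in batch]


def index(batch: SingleDocumentBatch) -> SingleDocumentBatch:
    """Copy the document's field attributes to the batch index fields."""
    batch.index_fields = dict(batch.document.field_attributes)
    return batch
```

The commit step is omitted because it depends entirely on the repository.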
Create a Batch
Create a batch based on the “Mortgage Applications Template (Dedicated)”
template.
Classify Documents
Use the Classification and Separation instance of the Recognition module to
automatically classify the documents. Any documents that cannot be recognized
confidently will be displayed in the next stage, in Document Review.
2 Select Session | Select Batch… and select the batch created in Capture
Professional from the list
3 Click Open.
Recognition will automatically classify the documents; no user interaction is
required.
4 When processing is complete the status bar will display “Status: Idle”
5 Select Session | Exit to close Recognition
Note Rather than selecting a single batch in Recognition, the module would
normally be started in Wait for any Batch mode to automatically process batches as
they become available. Alternatively, Recognition would be installed as a Windows
service and would process batches automatically.
Based on the document type, Scripted Export sets attributes to specify how the batch
is split into documents and the settings collection for each document. This settings
collection is linked to a document class in the repository, which includes the index
fields to be populated.
To export data
1 Open Scripted Export by clicking Start on the taskbar to display the menu,
and selecting:
All Programs | IBM FileNet Capture Professional | ADR Examples | ADR
Scripted Export – ADR Examples
This will only process batches from the ADR Examples local repository.
2 Select Session | Select Batch… and select the batch created in Capture
Professional from the list
3 Click Open.
Scripted Export will automatically process the batch; no user
interaction is required.
When processing is complete the status bar will display “Status: Idle”.
4 Select Session | Exit to close Scripted Export.
Note Rather than selecting a single batch in Scripted Export, the module
would normally be started in Wait for any Batch mode to automatically
process batches as they become available. Alternatively, Scripted Export
would be installed as a Windows service and would process batches
automatically.
In a production system, Assembly, Index and Commit would be run after Scripted
Export. Using the attributes set by Scripted Export, the Assembly component would
split the original batch into a batch per document and assign the specified settings
collection to the single-document batch.
The Index component (configured for each document type’s settings collection)
would then copy the field attributes to the index fields on the document class.
Commit would then commit the data to the repository and delete the batch.
The complete set of Capture components and ADR modules used is:
File Import
ADR Recognition (Classification and Separation)
ADR Document Review
ADR Recognition
ADR Completion
ADR Scripted Export
Create a Batch
Create a batch based on the template called “Mortgage Applications with Separation
Template”.
Import Images
Import the example images.
To import images, make sure the name of the new batch is highlighted and
select Tools | Start | Capture Path.
File Import will import the example images from the following location:
The Capture Path will automatically continue onto the next stage, classifying and
separating documents.
Note It is not possible to see the document structure in Capture Professional until
Assembly has run.
You can see that the document is, however, correctly classified. A simple
confirmation is required.
2 Press Enter to confirm the document type
3 Click OK on the message to open the Review view
4 Expand the documents to check the automatic separation has been successful
5 Select Session | Close Batch and click Yes to close the batch
Recognition will automatically extract the data from the fourteen documents.
To export data, Scripted Export will automatically process the batch and close
once complete.
In a production system, Assembly, Index and Commit would be run after Scripted
Export. Using the attributes set by Scripted Export, the Assembly component would
split the original batch into a batch per document and assign the specified settings
collection to the single-document batch.
The Index component (configured for each document type's settings collection)
would then copy the field attributes to the index fields on the document class.
Commit would then commit the data to the repository and delete the batch.
Configuration
Overview
Introduction
To create an ADR solution, you first need to configure the ADR modules using the
ADR configuration tools and a set of sample documents. Once you have created and
tested this configuration, you need to assign it to a settings collection using Capture
Professional and create a batch template and capture path.
In these tutorials you will replicate the classification and separation elements of the
Mortgage Applications example configuration.
Sample Documents
In order to build your configuration, you need a set of sample documents that
accurately represent the documents you will process in the final solution. Typically
these documents will be exported from a current archive or repository system or
collected from current incoming documents.
Accuracy
The first step when configuring a solution is to use the Document Set Management
features in Transformation Studio to ensure the accuracy of the sample documents.
This is particularly important when building a configuration automatically – if the
input to the training process for classifiers and separators isn't accurate, the output
won't be accurate. The Document Set Management steps are also useful to ensure a
good understanding of the structure of the document set (and that no document
types are missing).
Representative Documents
It is important that the sample documents are scanned using the production scanner
and represent the variations that are seen in production, for example faxes and
photocopies. If extraction (indexing) is being implemented as well as classification
and separation, it is recommended that the documents are scanned at 300 dpi.
Step 1: Create Project
Open Transformation Studio and create a new project.
Step 3: Assign Document Types
If document types have not been imported, assign a few document types manually
and run automatic classification.
Step 6: Read Page Content
Read (OCR) all the pages in the documents selected for configuration.
Transformation Studio will use the reads in the next step.
Step 7: Cleanup Documents
Within this step you will analyze your document set, clean up the documents and
add more samples until the set is ready to be used for configuration.
Step 8: Select Documents for Testing
From the clean document set, select a set of documents to use for testing. These
documents must not be used for configuration.
Create Configuration
Recognition
The Recognition module uses classifiers and separators to determine how a batch of
pages is split into documents, and to determine the type of each document.
Classifiers may be based on image or text content and are created from a set of
sample documents. These learn-by-example classifiers can be supplemented with
manually configured templated (including barcode) or rules-based classification
methods.
The advanced document separator is created automatically from the document types
assigned to a set of sample documents and, when run in production, takes into
account the confidence of the page classification results. The rules-based separator is
manually defined using a set of separation rules.
Transformation Studio is used to create all these classifiers and separators. For those
created automatically, it is particularly important that the sample documents are
accurately defined using the Document Set Management Steps.
Step 2: Configure Text Classification
Create a document text classifier, integrate it into the configuration and test.
Step 4: Test
Test the full configuration, analyzing the classification results and looking for
areas to improve.
Document Review
The Document Review project file is configured using the Document Review Project
Editor.
Step 1: Configure a Document Review Project File
Create and configure a project file to include validation rules and interface options.
The page classification and separation tutorial will use the Document Review
configuration you created for the document classification tutorial.
Integration Steps
The following steps are used to integrate the configuration with Capture and run a
batch through the solution.
To create a project
1 Open Transformation Studio by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Transformation Studio
2 Click New… to open the New Project window
3 In the window enter the name “Tutorial” for your new project.
4 Click Create to create the project and open the Import Documents tab
Transformation Studio includes a wizard for importing documents into the current
project. Documents can be imported by selecting files and/or folders containing files.
Files on disk can be mapped into various document structures using the file/folder
structure of the image. In addition, data stored in the text of the filename (for
example document types) can be used for a first attempt at classification. However, if
document types are not known they can be assigned after import. Modifications to
document structure can also be done after import, but this is a manual process.
Import of documents can be done multiple times, though within a single import each
file can only be imported once.
Note To launch the Import Documents Wizard manually, select File | Import
Documents, press CTRL+SHIFT+I or click the corresponding toolbar button.
The Example
The example mortgage applications have been exported from an archive system, and
have the following folder architecture:
Each document type is in a folder, named with the document type
Each document is in a folder
Each page is a single file, with the last part of the filename indicating the page
number
During import, you will use values from the filename and path to specify how the
files should be imported into documents and to set document properties:
Each folder will indicate a new document
The last part of the filename will indicate page order
The folder name will be imported as the document type
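The mapping from folder structure to documents can be sketched in Python as follows. This is only an illustration of the idea, not how Transformation Studio performs the import. The function name is hypothetical, and the sketch assumes forward-slash paths of the form document type/document folder/file name, with the page number as the trailing digits of the file name.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def group_pages(paths):
    """Map file paths to documents using the folder layout described above.

    Returns {(document type, document folder): [paths in page order]}.
    """
    documents = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        doc_folder = path.parent.name         # each folder is one document
        doc_type = path.parent.parent.name    # enclosing folder names the type
        match = re.search(r"(\d+)$", path.stem)   # trailing digits = page number
        page_no = int(match.group(1)) if match else 0
        documents[(doc_type, doc_folder)].append((page_no, raw))
    # Sort numerically so that page 10 follows page 9, not page 1.
    return {
        key: [raw for _, raw in sorted(pages)]
        for key, pages in documents.items()
    }
```

Sorting on the parsed integer rather than on the file name avoids the common problem of page 10 being ordered before page 2.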
Note During import, Transformation Studio displays status messages at the bottom
of the window.
c Select all the folders (there is one for each document type)
d Click Open
Note If you have not installed to the default location, the number may be
different. Ensure you select the last item in the list.
Note If you have not installed to the default location, the number may be
different. Ensure you select the item with the value “Appraisal Report” in
the list.
a Select “Copy document files into project folder”, rather than referencing
the images in their current location.
This will copy them into the project folder, making it easier to move your
project at a later time and ensuring no dependency on the images
remaining in their current location.
Note Using this option will slow the import process and require more
disk space.
Note It is possible to move the project when the images have not been
copied into it, but the images must be accessible from wherever the
project is moved to.
If you had imported documents without document types, you would need to perform
this step. Refer to the ADR Help for more information.
Once you have imported your documents, the Overview tab displays the
composition of your document set.
You can see how many document types you have and the distribution of documents
across those types. From the Overview you may realize that some types occur rarely
and don’t need configuration, or that you have more or fewer document types or
documents than you expected.
Note The Tax Escrow has more documents than any other type and the
Header has very few documents.
2 Review the Header documents to see whether you should get more
examples.
a Double-click the Header bar in the chart to open Browse Documents with
a filter to show just the Header documents.
b Scroll through the documents, looking at the amount of variation between
each example.
In fact, all of these documents contain barcodes that will be used to
classify the documents. Configuration of templated (barcode)
classification is simple and requires only a few examples, so no additional
documents are required.
3 Select the Overview tab to return to the chart view
Note To display more information on a specific document type, hold the mouse over
the bar in the chart and review the information in the tool tip and in the summary
statistics below the chart.
Note To change the chart display, use the toolbar buttons above the chart.
In this step you will select a subset of the documents in your project to add into the
Sample Documents set, on which your configuration will be based. Any documents
not selected are put into the Unused Documents set. From this set it is easy to add
more samples later, without accidentally selecting duplicates.
Document Sets
When working in a project it is recommended that you always use standard
document sets to ensure maximum accuracy and efficiency when setting up a
solution. If you have a large number of documents it may be time-consuming and
unnecessary to work on all the documents in your project. It is also important to
separate out a set of documents to use for testing and to ensure these are not used
during configuration. Standard document sets and the tools to move documents into
them support both these options. In addition, you may wish to create subsets of
documents in your project according to your own criteria, in which case you can use
the custom document sets.
Each of the three standard document sets has a specific role.
Sample Documents Documents to use when developing the configuration
(for text and image classification these will be used to train the classifiers)
Test Documents Documents to use when testing the configuration (not used
for training)
Unused Documents Documents that are not currently being used. These may
be additional documents that are not required for configuration or documents
that have not yet been classified.
Table 4-1 gives guidelines for the number of documents required for the different
classification methods (per document type). The figures take into account that some
documents may be misclassified or of poor quality and therefore may be discarded
before starting the configuration process. Although you can use more than the
suggested number of sample documents, this will slow down the configuration
process and may not improve accuracy. However, if your initial document set is poor
you should start with a higher number.
Table 4-1. Guideline Number of Documents per Document Type
Documents in standard or custom document sets are shared with those in the overall
project, that is, a document in a standard or custom document set is the same as that
document in All Documents.
Note The actual image files contained in a document are not duplicated into each
set, only the names of the files are duplicated.
When documents are “added” to a set, they are members of the original set and the
set they have been added to.
When documents are “moved” to a set, they are members of the new set, but not the
original set.
Changes to properties or the structure of a document in any document set will affect
that document in all document sets to which it belongs. Similarly, if pages and
documents are deleted or reordered in one document set they will be deleted or
reordered in all document sets to which they belong.
Note All documents are always present in the All Documents set. Documents can be
“added” to another set from All Documents, but cannot be “moved” to another set
from All Documents.
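The difference between "adding" and "moving" can be pictured with a small model. The following Python sketch is illustrative only: the class `ProjectSets` and its methods are hypothetical names, not part of Transformation Studio, and each set is assumed to hold document identifiers rather than image files.

```python
class ProjectSets:
    """Toy model: each document set is a set of document IDs (hypothetical)."""

    ALL = "All Documents"

    def __init__(self, doc_ids):
        # Every document always belongs to All Documents.
        self.sets = {
            self.ALL: set(doc_ids),
            "Sample Documents": set(),
            "Test Documents": set(),
            "Unused Documents": set(),
        }

    def add(self, doc_id, source, target):
        """Adding keeps the document in the source set as well."""
        if doc_id not in self.sets[source]:
            raise KeyError(f"{doc_id!r} is not in {source!r}")
        self.sets[target].add(doc_id)

    def move(self, doc_id, source, target):
        """Moving removes the document from the source set."""
        if source == self.ALL:
            raise ValueError("documents cannot be moved out of All Documents")
        self.add(doc_id, source, target)
        self.sets[source].discard(doc_id)


project = ProjectSets(["doc-1", "doc-2"])
project.add("doc-1", "All Documents", "Sample Documents")
project.move("doc-1", "Sample Documents", "Test Documents")
```

After these two calls, "doc-1" is a member of All Documents and Test Documents, but no longer of Sample Documents.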
Note By default, 150 documents of each type will be added to the Sample
Documents set. If fewer than the specified number of documents exist in a
document type, a warning will display and all the documents in that type
will be added to the Sample Documents set.
2 Click OK.
Once the documents have been successfully added, a message will display.
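The selection rule described in the note can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name `select_samples` is hypothetical, and picking documents at random is an assumption, as the guide does not state how the documents are chosen.

```python
import random
from collections import defaultdict


def select_samples(doc_types, per_type=150, seed=0):
    """Pick up to `per_type` documents of each type for the sample set.

    doc_types maps a document ID to its document type. Returns the sample
    set, the unused set, and the types that had fewer documents than requested.
    Random choice is an assumption made for this sketch.
    """
    by_type = defaultdict(list)
    for doc_id, doc_type in doc_types.items():
        by_type[doc_type].append(doc_id)

    rng = random.Random(seed)
    sample, unused, short_types = set(), set(), []
    for doc_type, ids in sorted(by_type.items()):
        ids.sort()
        if len(ids) < per_type:
            # Too few documents: warn and take all of them.
            short_types.append(doc_type)
            chosen = ids
        else:
            chosen = rng.sample(ids, per_type)
        sample.update(chosen)
        unused.update(set(ids) - set(chosen))
    return sample, unused, short_types
```

Documents that are not chosen end up in the unused set, which mirrors how unselected documents are placed in Unused Documents so that more samples can be added later without duplicates.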
At this point you need to read (OCR) each page of the documents in your sample set.
Using these reads, Transformation Studio can help you analyze the documents with
the aim of finding any that are misclassified or poor quality. These reads will also be
used when you build text classifiers and configure additional classification methods.
Although all the documents in a project could be read, this is time-consuming and
often unnecessary. Reading just the documents in the Sample Documents set is
sufficient (as these documents will be used for configuration and testing).
During the read, the status bar will display the number of the page being read and
the estimated time remaining. The read can be stopped at any point; no data will be
lost but you will need to read the remaining pages in order to continue to the next
step.
The read parameters used in the production configuration should match the
parameters used when reading the page content in Transformation Studio. When
you create a new configuration the default parameters will automatically match (the
parameters in the configuration resources folder are the same as those used by
default on the Read Page Content tab). However, if you are updating a production
configuration in which you have customized the full page read, you should use these
customized full page read parameters when reading page content in Transformation
Studio.
You only need to read a small section of each page (which will speed up
processing time)
You want to use the read for extraction as well as classification and need a
higher read accuracy
Note For information on setting custom read parameters refer to the ADR Help.
2 Click Read.
Once the read has finished, the Stop button will be renamed to Finish.
3 Click Finish
Important Reading all the pages in a document set may take a long time.
Within this step you will analyze your document set, clean up the documents and
add more samples until the set is ready to be used for configuration. These three
steps may have knock-on effects on each other, requiring one or more steps to be
done multiple times.
The following sections describe each of the individual steps. The tutorial then ties
these together, showing how each step may need to be done more than once.
Green (Clean): The document type does not have very much variation and
needs little or no work in Cleanup Documents.
Orange (Poor): The document type has some variation and will need some
work in Cleanup Documents.
Red (Very Poor): The document type has a lot of variation in the text content. It
may need a lot of attention within Cleanup Documents or may
not be suitable for text classification.
Gray (Unknown): No information is available as the document type has not been
read or is “(Unknown)”, that is, no type is assigned to the
documents.
Note This data is also visible by displaying the tool tip for a document type in the
chart (hover the mouse over the column).
Note The analysis of the documents is based on the page content (text) reads. This
means that occasionally a document type will appear to be poor, when it is actually
clean but only suitable for a classification method other than text (for example, image
or templated classification).
Transformation Studio analyzes the document set and identifies pages it suspects are
extra. These may be blank pages, fax cover sheets, pages with text that isn't found on
other documents in the type or pages that cannot be read properly.
Cleaning up Extra Pages: Within this step you will confirm whether or not
each of the marked pages (those suspected of being extra) is an extra page.
Only confirm that a page is an Extra Page if it is not representative of the type
and will not occur in production. Extra Pages may mislead the process when
building classifiers, reducing the accuracy of the overall solution.
Cleaning up Document Types: When Transformation Studio analyzes the
document set it assigns a confidence state to each document: confident,
unconfident or misclassified. It also identifies documents that will help define
each document type. In this step you will confirm or remove the document
type for each of these identified documents until all the documents in the set
are confident. As you work, Transformation Studio will continually
re-analyze the set and adjust the confidence states for other documents.
Note As you confirm pages and documents, you may find that other documents are
affected. Therefore Transformation Studio may cycle you through the Cleanup
Documents process until cleanup is complete, that is, until there are no extra
pages, unconfident documents or misclassified documents.
Note Whenever you add or move documents to the Sample Documents set, it is
recommended you repeat cleanup. For an indication of the additional work required,
review the Overview chart.
Note To see more information on a specific document type, hover the mouse
over the bar in the chart.
Only documents that will affect the confidence of the documents in the set
will be displayed. These documents are continually reassessed as you
confirm or remove document types.
The documents are color-coded as described in Table 4-3.
Table 4-3. Color Coding of Documents
Note The message above the document (and the color coding in the title
bar) indicates how confident Transformation Studio is about the
document.
Tax Escrow: Approximately half of the documents in the Tax Escrow type are
actually Initial Escrow. Make sure you do not confirm these Initial Escrow
documents: when you see an Initial Escrow, click No to the question “Is the
displayed document a ‘Tax Escrow’?”.
Request for Tax Form: These documents are all 2 pages long. If you see a
4-page document, right click on page 3 and select “Split Document” from the
context menu. You may also see a document with a very skewed second page.
Select the second page and click the Display Text button at the top of the
Image Viewer. You will see that the page read is very poor. This document
should not be used for configuration and should be deleted from the
document set. Right click on the document and select “Delete document
from project” from the context menu.
Loan Application: Some of the loan applications have many pages. To wrap
the pages so they all display in the thumbnail viewer without scrolling, click
the Wrap Pages button above the thumbnail viewer.
Note As you confirm each document the status bar will update the
proportion of documents of each state within the type. When this bar is
completely blue and green (that is, all documents are confirmed or confident)
the type is clean and a message will display.
Note When working on the Cleanup Documents tab, you may wish to close
the Project Explorer and Document Types panels. If you have multiple
monitors, you may find it beneficial to drag the Image Viewer to a separate
monitor.
Extra Page Cleanup: If the document shows a page highlighted in pink and
the question at the bottom of the tab is “Is the selected page an Extra Page?”,
Transformation Studio is displaying a document that it believes contains
one or more extra pages.
a Look at the currently marked page using the Thumbnail Viewer and the
Image Viewer and decide whether or not it is an extra page
b Confirm or clear the extra page mark:
Click Yes (or press ENTER or Y) to confirm the page is extra to the
document and will not occur in production
Click No (or press N) to clear the suspected extra page
Tip The only extra page in the mortgage applications set is a separator sheet
containing the text “NEW DOCUMENT”. All other suspected extra pages
may occur in production.
11 Read the new pages using the Read Page Content tool.
a Select Tools | Read Page Content.
On the Read Page Content tab, Sample Documents will be selected in the
Document Set list, the “Read only pages that are missing content” option
will be selected in Page Options and the “Use default read parameters”
option will be selected in Read Parameters.
b Click Read
c When the Stop button is renamed to Finish, click Finish
12 The Overview will now display Tax Escrow as orange. This is because the
Tax Escrow includes some (misclassified) Initial Escrow documents. Use
Auto Classify to try to reclassify these documents.
a On the Document Types panel, right click on Tax Escrow
b Select Auto Classify documents… from the context menu
c Click Close
13 Having added more documents to the set, run cleanup again.
a Select the Cleanup Documents tab
b Follow the instructions, confirming documents until a message states that
there is no more work to do
c Select the Overview tab
14 Review the chart.
There should now be at least 100 documents for each document type (except
for Header) and each bar should be green.
Note It is possible to review the documents that have been automatically classified,
using Browse Documents. For more information refer to the ADR Help.
The Test Documents set is used to store a subset of the clean documents for use in
testing. These are not used during the configuration process and therefore form an
unseen set of documents to use for testing. As the test documents have been cleaned
up, a comparison between the data in the project and the results of running the
configuration on the documents will provide an accurate indication of performance.
The Test Documents set is populated by moving documents from the Sample
Documents set.
Rules-based classification
2 Read the warning message; optionally click Show Warnings to see more
details
3 As the percentage of documents to move is already at 30%, click OK to move
the documents.
Once the documents have been successfully moved, a message will display.
4 Click No to remain in the Sample Documents set rather than opening the
Test Documents set.
The number of documents in each set will be updated in Project Explorer.
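Holding back 30% of the cleaned samples as unseen test data can be sketched as below. This is an illustration under stated assumptions: the function `split_test_set` is a hypothetical name, and taking the percentage separately for each document type, choosing at random, and rounding to the nearest whole document are all assumptions rather than documented behavior.

```python
import random


def split_test_set(sample_by_type, fraction=0.30, seed=0):
    """Move a fraction of each type's sample documents into a test set.

    sample_by_type maps a document type to a set of document IDs.
    Returns (remaining_samples, test_documents), both keyed by type.
    """
    rng = random.Random(seed)
    remaining, test = {}, {}
    for doc_type, ids in sample_by_type.items():
        ordered = sorted(ids)
        count = round(len(ordered) * fraction)
        moved = set(rng.sample(ordered, count))
        # "Moved" documents leave the sample set, so they stay unseen
        # during configuration.
        test[doc_type] = moved
        remaining[doc_type] = set(ordered) - moved
    return remaining, test
```

Because the documents are moved rather than added, no test document remains in the sample set, which is what makes the later comparison a fair indication of performance.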
Note The resources created will vary depending on the type of template selected.
Integrate Classifier
In production, Recognition runs a Recognition script, which in turn calls the
classifier. The Recognition script (called Document Classification.ifv) is created
automatically when the configuration is created. Two changes may be needed in this
script:
The name of the classifier
The pages to be used by the classifier (and therefore that need to be read)
The script will, by default, call a classifier called “Document text classifier.ibc”. This
is the default name of the classifier created using the Build Document Text Classifier
tab. If the name is left unchanged, no modification is needed to the script. For
information on changing the classifier name in the script, refer to the ADR Help.
The script will, by default, run the classification on all pages. For information on
changing the pages to be used, refer to the ADR Help.
To integrate the classifier, no modifications to the script are required for this
tutorial.
Test Classification
You will test the configuration on the Test Documents set, that is, the documents that
were not used to build the classifier. These documents require exporting from
Transformation Studio so they can be loaded into the Recognition Test Tool. You will
need to export these documents in the correct file structure for testing document
classification (a multi-page image file for each document).
You will then assign the configuration to a project in Recognition Test Tool, where it
is run on the test documents.
Note Although all testing could be done once the configuration is finished, it is
recommended that testing is done as each classification method is implemented,
ensuring any issues are quickly found and fixed.
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document
Classification\Resources\Document Classification.ifv
d Select the Test Properties tab
e Select the “Display document tree after test” option
f Click OK
g Select Documents | Select Batch File… to open the Select Batch File
window
h Select the batch file you exported from Transformation Studio:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Document Classification)\All
Document Types.ibf
i Click Open
j Press F8 or click the Run Test button to test the configuration.
The batch file will not be altered during this process.
k Select the Summary tab to view the Test Documents set with document
types assigned.
The documents have been sorted by document type, and each set of
documents can be viewed by selecting the tab with a name corresponding
to their document type.
l Select File | Save Project and save the project as:
<Installation Path>\Test Projects\Document Classification.rtp
m Select File | Exit to close Recognition Test Tool
For more information on these methods, refer to Classification Methods or the ADR
Help.
Note Image classification is not available when processing documents (it can only be
used to classify individual pages).
Note By adding the documents rather than moving them, the documents
still exist in the Sample Documents set.
Note The Exported File Structure options are equivalent for single-page
documents, so either could be selected.
12 Click Export.
The Export button will be renamed to Abort.
Once the documents have been exported (along with a batch file containing
the document set structure), the Abort button will be renamed to Finish.
13 Click Finish to close the Export Documents tab
2 Select Image | Open Image… to open the Open Sample Image window
3 Select the first Header image in the location you exported the sample Header
documents to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Sample Header Documents\Header\
4 Click Open
7 On the Properties panel on the right, select the Name property and replace
the default value by entering “Barcode” for the field name
8 Select File | Save Definition to open the Save As window
9 Navigate to the location of your Recognition configuration:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Classification\Resources
10 Enter “Header” as the file name.
Note The name of the definition file will be the document type assigned if
classification is successful.
11 Click Save
12 Select Tools | Test Definition to open Test Mode. If prompted, click Yes to
save the configuration file each time you run a test.
14 Select the other Header images in the location you exported the sample
Header documents to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Sample Header Documents\Header\
15 Click Open
16 Click Select All to select all the documents in the list
17 Click Process Document so you can check the result of each document
during the test
18 Click Auto Exit at End of Test so it is not selected
19 Click Run to start the test
20 Check the field shows the message “Barcode found at …” and that the data
matches the value above the barcode
21 Click Run Step to test the next document
22 Repeat the last two steps until the Run Step button is disabled
23 Click Close to exit the Test Mode window.
The barcode should have been found on every document. If needed, resize
the field and retest until all the barcodes are found.
24 In the main Definer view, select the Definition File tab below the image
25 Press Enter to make space for a new line of code after the line CORRECT NEVER
and before the line END
26 To register the document on finding a barcode, enter the following lines:
REGREGEXP .+
REGFORMID 1 -1
BEGIN FIELD
COORDS 427 489 1429 652
FORMID 1
NAME Barcode
TYPE CODE39
CORRECT NEVER
REGREGEXP .+
REGFORMID 1 -1
END
Note For information on the two parameters used, press F1 to open the ADR
Help.
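To see why the pattern `.+` registers the document whenever any barcode value is read, the field block can be examined with a short Python sketch. The parser below is hypothetical and only splits each line into a keyword and a value. It assumes that REGREGEXP is a regular expression that the whole field result must match; the ADR Help remains the authority on the actual semantics of both parameters.

```python
import re

FIELD_BLOCK = """\
BEGIN FIELD
COORDS 427 489 1429 652
FORMID 1
NAME Barcode
TYPE CODE39
CORRECT NEVER
REGREGEXP .+
REGFORMID 1 -1
END
"""


def parse_field(block):
    """Split each 'KEYWORD value' line between BEGIN FIELD and END into a dict."""
    field = {}
    inside = False
    for line in block.splitlines():
        line = line.strip()
        if line == "BEGIN FIELD":
            inside = True
        elif line == "END":
            break
        elif inside and line:
            keyword, _, value = line.partition(" ")
            field[keyword] = value
    return field


def would_register(field, barcode_value):
    """Assumed rule: register only if the whole value matches REGREGEXP."""
    pattern = field.get("REGREGEXP")
    return pattern is not None and re.fullmatch(pattern, barcode_value) is not None


field = parse_field(FIELD_BLOCK)
```

Under that assumption, any non-empty barcode value matches `.+`, while an empty result (no barcode found) does not.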
To:
Const CLASSIFY_BY_DEFINITION_FILE = True
4 Set the name of the definition file by changing the following line at the top of
the script:
Const DEFINITION_FILE_FILENAME = "Template.idf"
To:
Const DEFINITION_FILE_FILENAME = "Header.idf"
Test Classification
Note You can open the project from the recent projects on the File menu. By
default it will be:
You have already tested the configuration when you added each classification
method. Those tests primarily checked that each classification method was called and
no errors occurred when running a test. In this step, you will analyze the
performance of the classification in detail, checking that the classification methods
implemented are picking up the documents as you expect and looking to see where
configuration could be improved (for example, by adding in a new classification
method).
This displays the number of documents classified as each type rather than
the percentage
4 Check that the number of documents for each type matches the number of
documents in each type in the Test Documents set in Transformation Studio.
a Open the Tutorial project in Transformation Studio
b Double-click Test Documents to open the set
c Check the number of examples of each document type on the Document
Types panel and compare with the values in the Results Analysis table in
Recognition Test Tool
Note The numbers may not be exactly the same, but the closer they are, the
more effective the configuration.
In this step you will create and configure a Document Review project file using
Document Review Project Editor. In a document classification solution, Document
Review is used to ensure document types are correctly assigned.
8 Click Save
9 Select the “Use the Document Classification view” option on the General
Options tab
10 Select the Types tab
11 Click in the top left cell of the Document Types table
12 Enter the text “Appraisal Report” and press Enter to go to the next row
13 Enter the following types in the table:
Funding Transmittal
Header
Initial Escrow
Loan Application
Redemption
Request for Tax Form
Tax Escrow
Truth In Lending
Note It is important that the spelling and case of the document types is
exactly as written here, so that the types match those assigned in
Transformation Studio.
You will now specify a validation rule that states that all documents in the
batch must have a type specified in the list you just created. If this rule is
broken, a problem will display in the Document Review module.
14 Select the Validation tab
15 Click Add… to display the Select Validation Rule window
16 Select the rule “Every document must have a type specified in the list”
17 Click OK
18 Click Add… again
19 Select the rule “Every document must have a confident type”
20 Click OK
21 Select the Review tab
22 In the Review Options panel, select the property “Automatically go to next
problem”
23 In the right column, click the arrow on the right to display the list and select
True
24 Select File | Save to save the project file
25 Select File | Exit to close Document Review Project Editor
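The two validation rules selected above can be pictured with the following Python sketch. It is not Document Review's implementation: the function `review_problems` and the dictionary keys are hypothetical, and the exact, case-sensitive comparison reflects the note about spelling and case.

```python
ALLOWED_TYPES = [
    "Appraisal Report",
    "Funding Transmittal",
    "Header",
    "Initial Escrow",
    "Loan Application",
    "Redemption",
    "Request for Tax Form",
    "Tax Escrow",
    "Truth In Lending",
]


def review_problems(documents, allowed_types=ALLOWED_TYPES):
    """Return (document name, problem) pairs for a batch.

    Each document is a dict with 'name', 'type' and 'confident' keys
    (a hypothetical structure used only for this sketch).
    """
    problems = []
    for doc in documents:
        # Rule 1: the type must be in the list (exact spelling and case).
        if doc["type"] not in allowed_types:
            problems.append((doc["name"], "type not in list"))
        # Rule 2: the type must have been assigned confidently.
        if not doc["confident"]:
            problems.append((doc["name"], "type not confident"))
    return problems


batch = [
    {"name": "d1", "type": "Tax Escrow", "confident": True},
    {"name": "d2", "type": "tax escrow", "confident": True},
    {"name": "d3", "type": "Header", "confident": False},
]
problems = review_problems(batch)
```

In this example "d2" is reported because "tax escrow" differs in case from the listed type, and "d3" is reported because its type is not confident.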
This section summarizes the steps for configuring a capture path, settings collection
and template for an ADR solution running on a single computer. For full details refer
to the Capture documentation.
Note The sample image would normally be moved from an existing batch.
For more information refer to the Capture documentation.
b In the “Recognition Script File” section, click Clear to remove the existing
script
c Click Select Script File…
d Browse to your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Classification\Resources\
e Select the script file Document Classification.ifv
f Click Open
g Click OK
h Select Tools | Configure | ADR Document Review… to open the
Document Review setup dialog
i Click Select…
j Browse to your Document Review configuration folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Review
k Select the project file Review.drp
l Click Open
m Click OK
n Select Tools | Configure | ADR Recognition | ADR Recognition… to
open the setup dialog for this instance of Recognition.
The configuration for the standard instance of Recognition is installed
with the example, but does not require any modification. For more
information on configuring conditional extraction, refer to the ADR Help.
o Click OK
p Select Tools | Configure | ADR Completion… to open the Completion
setup dialog.
The configuration for Completion is installed with the example, but does
not require any modification. For more information on creating templates
for multiple document types, refer to the ADR Help.
q Click OK
r Select Tools | Configure | ADR Scripted Export… to open the Scripted
Export setup dialog.
The configuration for Scripted Export is installed with the example, but
does not require any modification. For more information on writing ADR
data to index fields, refer to the ADR Help.
s Click OK
To create a template
1 Select File | New | Template…
2 Following the wizard, create a template in the “ADR Examples” repository
with the name “My Mortgage Applications Template”
3 Select the settings collection “My Mortgage Applications Settings”
4 Select the capture path “My Mortgage Applications”
5 Click Next and Finish to close the wizard
Summary
In this section, you will modify the current solution to use automatic document
separation. Automatic document separation can save significant time and cost, for
example by removing the need for separator sheets.
Changes needed to make the current solution into a classification and separation
solution are:
Change the Recognition configuration to run classification at page level (as it
will run before document boundaries are established) and then to call
separation.
Create a page classification and separation configuration for Recognition
Configure page classification methods
Add in advanced document separation (which will use the page
classification results to determine the document boundaries and document
types)
Change the Capture Path to remove Zonal OCR and Event Activator. These
are used in the Document Classification solution to establish document
boundaries prior to ADR.
5 Click Add.
The configuration will be added to the Configurations list on the Project
Explorer panel.
Note In order for a page type to be trained successfully, at least 50 examples of that
page type are required. A warning will display in the table if there are fewer than 50
examples of a page type within the document set.
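The 50-example check can be approximated outside the tool when preparing a document set. The sketch below is illustrative: the function name `undertrained_page_types` is hypothetical, the page type names in the example are invented, and the input is assumed to be a flat list with one page type label per page.

```python
from collections import Counter

MIN_EXAMPLES = 50  # minimum stated in the note above


def undertrained_page_types(page_types, minimum=MIN_EXAMPLES):
    """Return {page type: count} for types with fewer than `minimum` examples."""
    counts = Counter(page_types)
    return {name: count for name, count in counts.items() if count < minimum}


# Hypothetical page type labels, one per page in the document set.
labels = ["Loan Application_first"] * 60 + ["Loan Application_last"] * 12
warnings = undertrained_page_types(labels)
```

Here only the page type with 12 examples would be flagged, which corresponds to the case where more examples need to be requested.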
Note Some warning triangles may display. Hover the mouse over a specific
triangle to display the warning. The warnings in the tutorial are due to not
having enough examples of some page types. If these warnings were seen on
a project, you would need to ask the customer for more examples of these
page types.
2 Within the table, select “None” in the “Train Using” column for the Header
document type as this will be classified by templated (barcode) classification
3 Click Build.
Once the classifier has been built the Build button will be renamed to Finish
and the page text classifier will display on the Project Explorer panel.
4 Click Finish
Integrate Classifier
As for the document classification solution, Recognition calls a Recognition script
which in turn calls the classifier. The Recognition script (called Page Classification
and Separation.ifv) is created automatically when the configuration is created. One
change may be needed in this script:
The name of the classifier
The script will, by default, call a classifier named “Page text classifier.mod”. This is
the default name of the classifier created using the Build Page Text Classifier tab. If
the name is left unchanged, no modification is needed to the script. For information
on changing the classifier name in the script, refer to the ADR Help.
To integrate the classifier, no modifications to the script are required for this
tutorial.
Test Classification
You will test the configuration on the Test Documents set, that is, the documents that
were not used to build the classifier. These documents require exporting from
Transformation Studio so they can be loaded into the Recognition Test Tool. You will
need to export these documents in the correct file structure for testing page
classification and separation (an image file for each page).
You will then assign the configuration to a project in Recognition Test Tool, where it
is run on the test documents.
Note Although all testing could be done once the configuration is finished, it is
recommended that testing is done as each classification method is implemented,
ensuring any issues are quickly found and fixed.
l Select File | Save Project and save the project in the following folder:
<Installation Path>\Test Projects\Page Classification and Separation.rtp
m Select File | Exit to close Recognition Test Tool
4 Click Build.
When the separator has been built, the Build button will be renamed to
Finish and the separator will display on the Project Explorer panel.
5 Click Finish
Editing thresholds is not covered in this guide. For further information refer to the
ADR Help.
1 Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Recognition Test Tool
2 Open the project used to test page text classification
Note You can open the project from the recent projects on the File menu. By
default it will be:
4 At the bottom of the Configuration tab, click the button next to the
Document Review Project File box
5 Select the separation project in your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources\Separation.drp
6 Click Open
7 Click OK
8 Click Run Test
9 Once the test has finished, select the Summary tab
10 If the tree view shows documents of various lengths, the separation has run.
Note More detailed analysis of performance will be done in Step 5: Test and
Evaluate Performance.
Text classification
Image classification
Templated classification (including barcodes)
Rules-based classification
For more information on these methods, refer to Classification Methods or the ADR
Help.
The definition file that was used for classifying the Header document type in the
document classification solution can be used again. However, one modification is
needed: the definition file needs to set the page type rather than the document type
(the separator will determine the document type when it does the separation, and
will not instantly classify it from the barcode registration).
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources
2 Double-click the file “Page Classification and Separation.ifv” to open it in
Script Editor
3 Turn on templated classification by changing the following line at the top of
the script:
Const CLASSIFY_BY_DEFINITION_FILE = False
To:
Const CLASSIFY_BY_DEFINITION_FILE = True
4 Specify the name of the definition file by changing the following line at the
top of the script:
Const DEFINITION_FILE_FILENAME = "Template.idf"
To:
Const DEFINITION_FILE_FILENAME = "Header_start.idf"
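The two edits above are made by hand in Script Editor. For readers who prefer to script the change, the following Python sketch rewrites a `Const` line in the text of a script. It is a hedged illustration only: the helper `set_const` is hypothetical, and it assumes each constant is declared once, on its own line, in the form shown above.

```python
import re


def set_const(script_text, name, value_literal):
    """Replace the value of a single 'Const NAME = ...' line.

    Raises ValueError unless exactly one matching declaration is found,
    so an unexpected script layout is not silently modified.
    """
    pattern = re.compile(
        rf"^([ \t]*Const[ \t]+{re.escape(name)}[ \t]*=[ \t]*).*$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value_literal, script_text)
    if count != 1:
        raise ValueError(f"expected one declaration of {name}, found {count}")
    return new_text


script = (
    'Const CLASSIFY_BY_DEFINITION_FILE = False\n'
    'Const DEFINITION_FILE_FILENAME = "Template.idf"\n'
)
script = set_const(script, "CLASSIFY_BY_DEFINITION_FILE", "True")
script = set_const(script, "DEFINITION_FILE_FILENAME", '"Header_start.idf"')
```

Keeping a backup of the original script before applying any automated edit is advisable, since the script is generated with the configuration.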
Note You can open the project from the recent projects on the File menu. By
default it will be:
Note More detailed analysis of performance will be done in Step 5: Test and
Evaluate Performance.
Note Some warning triangles may display. Hover the mouse over a specific
triangle to display the warning. Two types of warning display in the tutorial:
one for too few examples of a page type and the other for too many. In a
project, if you have too few examples you need to go back to the customer to
ask for more. If you have too many examples, you need to look at the
variation within the page type. If there is a lot of variation, you would keep
all the pages or consider using a classification method other than image. If
there is only a little variation, you would remove some of the examples.
3 Click Build.
Once the classifier has been built, the Build button will be renamed to Finish
and the classifier will display on the Project Explorer panel.
4 Click Finish
Integrate Classifier
The Recognition script now needs updating to call page image classification. The
template script will not call a page image classifier by default, so you need to
integrate the file by:
Specifying that you are using image classification
Specifying the name of your definition file (not required for this tutorial as the
classifier you built has the default name)
To:
Const CLASSIFY_BY_IMAGE = True
Test Classification
The configuration is again tested on the Test Documents set in Recognition Test Tool.
Note You can open the project from the recent projects on the File menu. By
default it will be:
Note More detailed analysis of performance will be done in Step 5: Test and
Evaluate Performance.
This step includes analyzing test results in Recognition Test Tool and more detailed
performance evaluation using the BatchCompare utility.
Testing has been done throughout the tutorial using the Recognition Test Tool. In
this step, the results of the test are analyzed in a little more detail, with the aim of
spotting any significant problems before going into the more detailed analysis using
BatchCompare.
Note The numbers may not be exactly the same, but the closer they are, the
more effective the configuration.
During the evaluation process the classification and separation results are analyzed
to determine if there are incorrect classifications or splits/joins in documents, where
these are most commonly occurring and therefore which areas in the configuration
need to be improved.
The first, a “reference batch”, is created during export of the Test Documents
from Transformation Studio. It contains accurate document types and
structure for all the Test Documents.
The second, a “comparison batch”, contains the same documents. However,
the document types and structure are exported from Recognition Test Tool,
after a test has been run using the new configuration.
Once the two batch files have been generated, they are compared using the
BatchCompare utility and the results of the comparison are output to an MS Excel
workbook.
Within Excel, additional statistics can be generated from the raw data using built-in
macros.
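The idea of comparing document structure between the reference batch and the comparison batch can be sketched in Python. This is not BatchCompare's algorithm: the functions below are hypothetical, and each batch is assumed to be described simply by the page count of each document in order.

```python
def split_positions(doc_lengths):
    """Page indexes at which a new document starts (the first page is excluded)."""
    positions = set()
    page = 0
    for length in doc_lengths[:-1]:
        page += length
        positions.add(page)
    return positions


def compare_splits(reference_lengths, comparison_lengths):
    """Count correct, missed and additional splits against the reference."""
    if sum(reference_lengths) != sum(comparison_lengths):
        raise ValueError("both batches must contain the same number of pages")
    reference = split_positions(reference_lengths)
    comparison = split_positions(comparison_lengths)
    return {
        "correct": len(reference & comparison),
        "missed": len(reference - comparison),
        "additional": len(comparison - reference),
    }


# Reference: documents of 2, 3 and 1 pages. Test result: three 2-page documents.
result = compare_splits([2, 3, 1], [2, 2, 2])
```

In this example the split after page 2 is correct, the split after page 5 is missed, and the split after page 4 is additional.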
Statistics
It is very important to consider the complete set of statistics when analyzing the
performance, as no single value indicates good or bad performance. For example, if
the separation statistics show several missed or additional splits, the classification
statistics will not accurately represent the performance.
For information on all the statistics, refer to the BatchCompare Reference in the ADR
Help.
Once the separation statistics have been taken into account, the key statistics are the
accuracy and classification rate for each document type.
Accuracy should be as high as possible for each document type
Classification rate should be as high as possible for each document type
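As a rough illustration of how these two values relate, the following Python sketch computes them for each document type. It assumes the common definitions: classification rate is the share of documents of a type that received any classification, and accuracy is the share of those classified documents that received the correct type. These definitions, the function name, and the sample data are assumptions made for the example. The BatchCompare Reference in the ADR Help remains the authority on how the workbook calculates its statistics.

```python
from collections import defaultdict


def per_type_statistics(results):
    """Compute classification rate and accuracy per actual document type.

    `results` is a list of (actual_type, assigned_type) tuples, where
    assigned_type is None when the document was left unclassified.
    The definitions used here are assumptions, not BatchCompare's own.
    """
    counts = defaultdict(lambda: {"total": 0, "classified": 0, "correct": 0})
    for actual, assigned in results:
        entry = counts[actual]
        entry["total"] += 1
        if assigned is not None:
            entry["classified"] += 1
            if assigned == actual:
                entry["correct"] += 1

    stats = {}
    for doc_type, entry in counts.items():
        rate = entry["classified"] / entry["total"]
        # Accuracy is undefined when nothing was classified.
        accuracy = (
            entry["correct"] / entry["classified"]
            if entry["classified"]
            else None
        )
        stats[doc_type] = {"classification_rate": rate, "accuracy": accuracy}
    return stats


# Hypothetical results for two document types.
sample_results = [
    ("Application", "Application"),
    ("Application", "Application"),
    ("Application", None),
    ("Application", None),
    ("Payslip", "Payslip"),
    ("Payslip", "Application"),
]

type_stats = per_type_statistics(sample_results)
for doc_type, values in type_stats.items():
    print(doc_type, values)
```

Here "Application" has perfect accuracy but only half of its documents are classified, while "Payslip" is always classified but is wrong half of the time. This is why both values need to be read together.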
Note In order to use the BatchCompare utility you need to have Microsoft Office
Excel installed.
In the new window, navigate to the location of the offline batch you
imported into Recognition Test Tool (the reference batch):
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Page Classification and Separation)
b For the File name, enter “All Document Types Automatic Results”
c Click Save
d Select File | Exit to close Recognition Test Tool
2 Use the BatchCompare utility.
a Open the command prompt by clicking Start on the taskbar to display the
menu, and selecting All Programs | Accessories | Command Prompt
b Enter the command:
cd <full path to folder containing offline batches>
To enter the folder path quickly, first navigate to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\
Then, drag and drop the Test Documents (Page Classification and
Separation) folder into the command prompt window.
Note The Excel workbook will be saved in the location from which
BatchCompare is run (the path specified at the command prompt).
Note You will need to close and re-open the workbook for new security
settings to take effect.
4 Evaluate Performance.
a On the Summary Statistics worksheet, review the Separation Statistics
table.
Correct splits should be as high as possible with missed and additional
splits as low as possible.
b Review the Accuracy and Classification Rate statistics for the Per-
Document Results in the Classification Statistics table.
It is likely that Accuracy is close to 100% but Classification Rate is much
lower.
c Review the Confusion Matrix.
Ideally, only the blue shaded cells will contain numbers. A number in any
other (unshaded) cell means that documents of the type named in the row (on
the left) have been classified as the type named in the column.
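To make the reading of a confusion matrix concrete, here is a minimal Python sketch that builds one from pairs of actual and assigned document types. Rows are actual types and columns are assigned types, so the diagonal corresponds to the shaded cells. The function names and sample data are hypothetical, and the layout of the real worksheet may differ.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Build a confusion matrix from (actual_type, assigned_type) pairs.

    Returns the sorted list of types and a matrix in which
    matrix[row][col] is the number of documents of type types[row]
    that were classified as types[col].
    """
    counts = Counter(pairs)
    types = sorted({doc_type for pair in pairs for doc_type in pair})
    matrix = [[counts[(row, col)] for col in types] for row in types]
    return types, matrix


def confusions(types, matrix):
    """List the non-zero off-diagonal cells as (actual, assigned, count)."""
    return [
        (types[r], types[c], matrix[r][c])
        for r in range(len(types))
        for c in range(len(types))
        if r != c and matrix[r][c] > 0
    ]


# Hypothetical classification results.
pairs = [
    ("Application", "Application"),
    ("Application", "Payslip"),
    ("Payslip", "Payslip"),
    ("Payslip", "Payslip"),
]

types, matrix = confusion_matrix(pairs)
errors = confusions(types, matrix)
print(types)
print(matrix)
print(errors)
```

The single off-diagonal entry shows that one "Application" document was classified as "Payslip", which points to the pair of document types that the configuration has trouble telling apart.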
5 Improve accuracy.
If your accuracy is less than 90% or your confusion matrix has a lot of data in
the unshaded cells, refer to the following book of the ADR Help for
information on how to troubleshoot and improve performance:
How to configure… | Recognition | Step 4: Test | Page Classification and
Separation | Step 5: Evaluate and Improve Performance
6 If your accuracy Summary is at least 90%, but your classification rate
Summary is lower than 80%, determine your confidence thresholds by
modifying the Confidence Threshold values and recalculating the statistics.
a For the Per-Document Results, change the confidence threshold for each
document type in the table on the far right.
For any document type with a classification rate of less than 95% (but at
least 60%), change its confidence threshold to 85%
For any document type with a classification rate of less than 60%,
change its confidence threshold to 80%
d Check that the Accuracy has not dropped. If it has, increase the
Confidence Threshold in 1% increments until the Accuracy recovers or
you are satisfied with the trade-off.
Once you have determined the optimum thresholds, you need to set them
on the separator.
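The trade-off behind this adjustment can be sketched in Python. The example assumes that a document is classified only when its confidence is greater than or equal to the threshold, and it uses the same assumed definitions of classification rate and accuracy as commonly applied to classifiers. The function name, the comparison rule, and the sample confidences are all hypothetical and are not taken from the workbook macros.

```python
def stats_at_threshold(results, threshold):
    """Return (classification_rate, accuracy) for one confidence threshold.

    `results` is a list of (actual_type, best_guess_type, confidence_percent).
    Assumption: a document counts as classified when its confidence is
    greater than or equal to the threshold.
    """
    classified = [
        (actual, guess)
        for actual, guess, confidence in results
        if confidence >= threshold
    ]
    rate = len(classified) / len(results)
    if not classified:
        return rate, None  # accuracy is undefined when nothing is classified
    correct = sum(1 for actual, guess in classified if actual == guess)
    return rate, correct / len(classified)


# Hypothetical results for a single document type.
payslip_results = [
    ("Payslip", "Payslip", 97),
    ("Payslip", "Payslip", 88),
    ("Payslip", "Payslip", 83),
    ("Payslip", "Application", 81),
]

# Compare a high threshold with the two lower values used in the step above.
sweep = {t: stats_at_threshold(payslip_results, t) for t in (95, 85, 80)}
for threshold, (rate, accuracy) in sweep.items():
    print(threshold, rate, accuracy)
```

With this data, lowering the threshold from 95% to 85% doubles the classification rate without affecting accuracy, while lowering it to 80% classifies every document but lets one incorrect result through. Raising the threshold again in small steps, as described in step d, is a search for the lowest value that keeps accuracy acceptable.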
e Leaving Excel open (so you can see the thresholds), navigate to your
configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources
The configuration for Completion is installed with the example, but does
not require any modification. For more information on creating templates
for multiple document types, refer to the ADR Help.
v Click OK
w Select Tools | Configure | ADR Scripted Export… to open the Scripted
Export setup dialog.
The configuration for Scripted Export is installed with the example, but
does not require any modification. For more information on writing ADR
data to index fields, refer to the ADR Help.
x Click OK
To create a template
1 Select File | New | Template…
2 Following the wizard, create a template in the “ADR Examples” repository
with the name “My Mortgage Applications with Separation Template”
3 Select the settings collection “My Mortgage Applications with Separation
Settings”
4 Select the capture path “My Mortgage Applications with Separation”
5 Click Next and Finish to close the wizard