
Getting Started Guide

(Classification and Separation)

5.2
Copyright
© 1997 - 2008 Kofax Image Products, Inc., 16245 Laguna Canyon Road, Irvine,
California 92618, U.S.A. All rights reserved. Use is subject to license terms.

Portions, copyright 2006 - 2008 FileNet. Portions, copyright 1997-2008 Neurascript
Ltd. All Rights Reserved.

Third-party software is copyrighted and licensed from Kofax’s suppliers. For
information on third-party software included in this product, see the application
About boxes.

THIS SOFTWARE CONTAINS CONFIDENTIAL INFORMATION AND TRADE
SECRETS OF KOFAX IMAGE PRODUCTS, INC. USE, DISCLOSURE OR
REPRODUCTION IS PROHIBITED WITHOUT THE PRIOR EXPRESS WRITTEN
PERMISSION OF KOFAX IMAGE PRODUCTS, INC.

Kofax Image Products, Kofax and the Kofax logo are trademarks or registered
trademarks of Kofax Image Products, Inc. in the U.S. and other countries. All other
trademarks are the trademarks or registered trademarks of their respective owners.

U.S. Government Rights Commercial software. Government users are subject to the
standard license agreement for this product and all applicable provisions of the FAR
and its supplements.

You agree that you do not intend to and will not, directly or indirectly, export or
transmit the Software or related documentation and technical data to any country to
which such export or transmission is restricted by any applicable U.S. regulation or
statute, without the prior written consent, if required, of the Bureau of Export
Administration of the U.S. Department of Commerce, or such other governmental
entity as may have jurisdiction over such export or transmission. You represent and
warrant that you are not located in, under the control of, or a national or resident of
any such country.

DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED
CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY
IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE
EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Contents

How to Use This Guide
  Introduction
  How do I Use This Guide?
  Related Documentation
    Installation Guide (.pdf)
    User's Guide (.pdf)
    Getting Started Guides
      Getting Started (Fixed-Form) (.pdf)
      Getting Started Guide (Free-Form) (.pdf)
    ADR Help
    Visual Basic Scripting Help (.chm)

Overview
  Introduction
  What is ADR and What does it Add to Capture?
    Features of ADR
      What is Classification?
      What is Separation?
    Classification and Separation of Documents in Production
      Recognition
      Document Review
      Assembly
      Typical System Architecture for an ADR Solution
    Configuring a Classification and Separation Solution
  The Tutorial
      The Example Documents

Installation
  Introduction
  Installing ADR for the First Time
    Standard Installation
    Silent Installation
  Licensing

Processing
  Introduction
  The Mortgage Applications Example
    Introduction
    Installing the Mortgage Applications Example
  Document Classification
    Running the Mortgage Applications Example
      Tutorial: Running the Capture Path
      Tutorial: Running in Dedicated Mode
  Page Classification and Separation
    Running the Mortgage Applications Example
      Tutorial: Running the Capture Path

Configuration
  Overview
    Introduction
    Sample Documents
      Accuracy
      Representative Documents
      Document Set Management Steps
    Create Configuration
      Recognition
      Document Review
    Integrate the Configuration with Capture
      Integration Steps
  Document Classification Tutorial
    Document Set Management
      Step 1: Create Project
      Step 2: Import Documents
      Step 3: Assign Document Types
      Step 4: Initial Analysis
      Step 5: Select Sample Documents for Configuration
      Step 6: Read Page Content
      Step 7: Cleanup Documents
      Step 8: Select Documents for Testing
    Create Recognition Configuration
      Step 1: Create Configuration
      Step 2: Configure Text Classification
      Step 3: Add in Additional Classification Methods
      Step 4: Test Performance
    Create Document Review Configuration
      Step 1: Configure a Document Review Project File
    Integrate the Configuration with Capture
      Step 1: Create Capture Path
      Step 2: Create Settings Collection
      Step 3: Assign Configuration to Settings Collection
      Step 4: Create Batch Template
      Step 5: Process a Batch
  Page Classification and Separation Tutorial
    Summary
    Create Recognition Configuration
      Step 1: Create Configuration
      Step 2: Configure Text Classification
      Step 3: Configure Document Separation
      Step 4: Add in Additional Classification Methods
      Step 5: Test and Evaluate Performance
    Integrate the Configuration with Capture
      Step 1: Create Capture Path
      Step 2: Create Settings Collection
      Step 3: Assign Configuration to Settings Collection
      Step 4: Create Batch Template
      Step 5: Process a Batch

How to Use This Guide

Introduction
This guide introduces IBM FileNet Capture Advanced Document Recognition (ADR)
and describes how it is used to automatically separate pages into documents and to
classify documents. It starts with brief installation instructions, followed by
a tutorial that guides you through processing batches with the pre-installed
Mortgage Applications example. The guide then describes how each of the
modules is configured. A second tutorial steps through creating a new set of
configuration files to classify the mortgage application documents (and assigning
this configuration to Capture). The final tutorial steps through creating a new
configuration, this time including automatic separation of pages into documents.

This guide assumes that you have a thorough understanding of Windows standards,
applications, and interfaces.

This guide is for people who need an introduction to ADR, specifically to the
automatic classification and separation of documents. It is most useful to people who will be:
• Configuring ADR for a specific settings collection
• Administering or supporting an ADR solution

How do I Use This Guide?
Read the entire guide sequentially. It includes several tutorials that must be
completed in order. The tutorials require ADR to be installed, including the ADR
examples.

If you need more detailed information on configuring a module, open the ADR Help
and read the relevant “How to configure…” book. Additional details of all the
documentation provided with ADR are included in the section Related
Documentation.

Related Documentation
The following documentation is included with ADR.

Each PDF guide can be opened by clicking Start on the taskbar to display the menu
and selecting All Programs | IBM FileNet Capture Professional | ADR
Documentation.

The ADR Help can be opened from the same menu, but can also be opened from the
Help menu within the tools. Pressing F1 within Definer and Script Editor will open
the topic for the feature being used.

Installation Guide (.pdf)
This guide is written for those installing ADR, either on a development computer
(where a solution is configured or tested) or on a production computer.

The guide explains:
• Licensing requirements
• The procedure for installing ADR
• How to customize modules running as dedicated applications
• How to install the unattended modules as Windows services

System requirements are stated in the FileNet Capture-Print-RCS Products
Dependency Matrix, which is available on the IBM support web site:
1 Browse to www.ibm.com/support/documentation
2 On the Support & downloads page, under Choose support type, select
  Information Management
3 Under Choose a product, select FileNet Capture
4 On the FileNet Capture Product support page, click Product documentation

User's Guide (.pdf)
The User's Guide (.pdf) is written for keyboard operators (keyers) who will be using
the attended modules on a production computer, and for those using all of the
modules on a development computer.

The guide explains:
• What each ADR module is used for
• How to operate each module

Getting Started Guides
These guides are written for people who need an introduction to ADR. The guides
are useful as a starting point for those who will be configuring or administering
ADR, or those using the keying modules. The guides are self-contained; however,
each focuses on configuring a different document processing solution.

Getting Started (Fixed-Form) (.pdf)

The Getting Started Guide (Fixed-Form) (.pdf) focuses on configuring a solution to
extract data from fixed-form (structured) documents.

The guide explains:
• How to extract data from single-page documents of a known document type,
  using the installed Order Forms example
• The tools, concepts and configuration files and how they relate to the setup in
  Capture
• How to replicate the Order Forms configuration by following detailed
  procedures

Getting Started Guide (Free-Form) (.pdf)

The Getting Started Guide (Free-Form) (.pdf) focuses on configuring a solution to extract
data from free-form (semi-structured or unstructured) documents.

The guide explains:
• How to extract data from single-page documents of a known document type,
  using the installed Solicitors Letters example
• The tools, concepts and configuration files and how they relate to the setup in
  Capture
• How to replicate the Solicitors Letters configuration by following detailed
  procedures

ADR Help
The ADR Help is written for those configuring a solution and for system
administrators, and assumes those reading it have read the Getting Started Guides or
attended an ADR training course. This assumption is made so that the ADR Help can
provide the most accurate and detailed information across every aspect of the
product.

The ADR Help explains:
• How to configure the ADR modules to process a specific document set
• How to use the module setup dialogs to assign a configuration to a settings
  collection
• The integration of ADR within the FileNet Capture platform
• How to set up and monitor an efficient production environment
  (Administration Help)

The ADR Help also contains a reference section which includes:
• Definition file parameters used by the Recognition and Correction modules
• Script objects, hooks, methods and properties used by all of the modules

Visual Basic Scripting Help (.chm)


The Visual Basic Scripting Help (.chm) is provided for further information on VB
scripting.

Chapter 1

Overview

Introduction
This chapter introduces some of the concepts of data capture and key points of ADR.

What is ADR and What does it Add to Capture?


ADR is a set of modules that provide additional automatic recognition (classification,
separation and extraction) and advanced keying (indexing and validation)
functionality to Capture.

Capture scans paper-based documents or imports images from file, creating a series
of scanned image files. Capture then routes the files through ADR, a set of modules
that (along with Assembly) separate pages into documents, classify documents and
extract information, creating understandable electronic data. Capture then
automatically transfers the data to index fields and commits the data and images to
the repository.

Features of ADR
This guide covers two key features of ADR: classification and separation.

What is Classification?

Classification is the process of assigning a type to each document, either to export to
the final repository or to use during extraction. ADR can be configured to classify
documents directly or as a result of page classification and document separation.

Classification Methods

Classification can be done using one or more of the following methods:
• Image Classification: Classification based on the overall layout and structure
  of a page, including lines, boxes, logos and placement of text.
• Text Classification: Classification based on detailed analysis of the text
  content of a page or document.
• Rules-Based Classification: Classification performed by searching for specific
  data or keywords, independent of layout.
• Templated Classification: Classification determined by the presence of one or
  more marks, barcodes or items of text in pre-defined locations.
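
The idea behind rules-based classification can be illustrated with a short Python sketch. This is not ADR code or an ADR API: the rule table, the keyword phrases and the function name classify_by_rules are all hypothetical, chosen only to show a keyword search that ignores page layout.

```python
from typing import Optional

# Hypothetical keyword rules: document type -> phrases that identify it.
# The phrases are invented for illustration; they are not taken from ADR.
RULES = {
    "Truth In Lending": ["truth in lending", "annual percentage rate"],
    "Appraisal Report": ["appraisal report", "appraised value"],
    "Loan Application": ["loan application", "borrower information"],
}


def classify_by_rules(text: str, rules: dict = RULES) -> Optional[str]:
    """Return the first document type whose keywords appear in the text.

    The search ignores case and position, so it is independent of layout.
    Returns None when no rule matches.
    """
    lowered = text.lower()
    for doc_type, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return None


print(classify_by_rules("UNIFORM RESIDENTIAL APPRAISAL REPORT"))  # Appraisal Report
print(classify_by_rules("Page 3 of 7"))                           # None
```

A page that matches no rule returns None, which corresponds to a document that another classification method, or a reviewer, would need to resolve.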

What is Separation?

Document separation methods provide an automated approach to identifying the
boundaries between multiple documents in a single batch.

Separation Methods

Document separation is determined from the page classification results using either
of the following methods:
• Rules-based document separation: One or more rules specify when new
  documents are created; for example, if a page of type A is seen, create a
  document of type X.
• Advanced document separation: A probabilistic method that ascertains the
  most likely document structure from the page classifications and their
  confidence scores. Because it is probabilistic, this method is robust to
  variation in documents and to misclassifications. For example, suppose a 6-page
  document of type X is specified to expect pages of type A, B, C, D, E and F. Five
  of the pages are classified as expected, but one page is classified as type Y.
  Because five out of six consecutive pages in the batch have been classified as
  pages of document type X, it is highly probable that a document of this type
  has been found.
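
The two separation methods described above can be sketched in Python. This is an illustration only, under stated assumptions: the rule table START_RULES, the page type labels and both function names are hypothetical, and match_score is a crude stand-in that counts agreeing pages and ignores confidence scores, whereas the advanced method in ADR is described as probabilistic and confidence-aware.

```python
# Hypothetical first-page rules: seeing this page type starts a new document.
START_RULES = {"A": "X", "P": "Q"}


def separate_by_rules(page_types, start_rules=START_RULES):
    """Rules-based separation: start a new document whenever a rule fires."""
    documents = []
    for page_type in page_types:
        if page_type in start_rules:
            documents.append({"type": start_rules[page_type], "pages": []})
        if documents:  # pages seen before any rule fires are left unassigned
            documents[-1]["pages"].append(page_type)
    return documents


def match_score(page_types, expected):
    """Fraction of pages whose classification agrees with the expected page
    type at the same position. Confidence scores are ignored in this sketch."""
    if len(page_types) != len(expected):
        return 0.0
    hits = sum(1 for got, want in zip(page_types, expected) if got == want)
    return hits / len(expected)


EXPECTED_X = ["A", "B", "C", "D", "E", "F"]   # pages expected in document type X
batch = ["A", "B", "Y", "D", "E", "F"]        # third page misclassified as Y

print(separate_by_rules(["A", "B", "P", "R"]))
print(round(match_score(batch, EXPECTED_X), 2))  # 0.83
```

With five of six pages agreeing, the run of pages scores 0.83 against document type X, which mirrors the reasoning in the example above: one misclassified page does not prevent the document from being recognized.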

Classification and Separation of Documents in Production


The Recognition and Document Review modules, along with the Assembly
component, are used to classify and separate documents.

Recognition

Classification and separation are done in the same processing step, in an instance of
the Recognition module. A single solution would do one of the following:
• Document Classification
• Page Classification and Separation (resulting in document classification)

If extraction is also being done as part of the Capture Path, an additional instance of
the Recognition module named ADR Recognition (Classification and Separation) is
used for classification and separation.

This leaves the standard instance of Recognition available for extraction.

Note Data extraction is generally done in the standard instance of Recognition, once
all document types have been determined (and manually reviewed if needed).

Document Review

Document Review is usually used after Recognition (Classification and Separation)
to review the automatic classification results. Within Document Review, a user can
confirm any types that Recognition is uncertain about, fix any validation failures (by
changing document type) or review the batch.

Assembly

Assembly is typically run after the ADR modules to restructure the batch for the
repository, and if separation is done in ADR, to set the appropriate document class
for each document.

If documents are determined prior to ADR (for example using fixed-length
separation or separator sheets), Assembly may be run prior to the ADR modules.
However, in this case you will only be able to use a single document class for all the
documents in your solution.

Note It is only after Assembly has run that Capture Professional will display the
document structure determined in ADR.

Typical System Architecture for an ADR Solution

The system architecture is determined by the capacity of the system: high volume or
low volume.

High Volume, Distributed Environment

In high volume environments it is typical to have multiple stations processing
batches, with each station running a specific module. As with the standard Capture
components, ADR modules may be used in this distributed environment, in one of
two ways:
• Standard: The user opens Capture Professional and starts polling. The
  administrator must have previously configured a polling filter.
• Dedicated: The modules are started outside of Capture Professional, as
  standalone applications or Windows services. Automatic polling occurs for
  batches that are ready for the current module. To start a module in
  dedicated mode, click Start on the taskbar to display the
  menu, and select “All Programs | IBM FileNet Capture Professional |
  <Module Name>”. Alternatively, an administrator can install the Recognition
  and Scripted Export modules as Windows services (which start
  automatically).
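
The polling behavior described for dedicated mode can be pictured as a filter over a queue of batches. The following Python sketch is a conceptual model only: the Batch class, its next_module field and the poll_ready_batches function are hypothetical and do not correspond to Capture or ADR interfaces.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    name: str
    next_module: str  # module the batch is waiting for (hypothetical field)


def poll_ready_batches(queue, module_name):
    """Return the batches in the queue that are ready for the given module."""
    return [batch for batch in queue if batch.next_module == module_name]


queue = [
    Batch("Mortgage Applications 1", "ADR Recognition"),
    Batch("Mortgage Applications 2", "ADR Document Review"),
    Batch("Mortgage Applications 3", "ADR Recognition"),
]

# A station dedicated to Recognition would pick up only its own batches.
for batch in poll_ready_batches(queue, "ADR Recognition"):
    print(batch.name)
```

Each dedicated station applies the same filter with its own module name, so batches move from station to station without a user selecting them.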

Low Volume, Single Station Environment

In lower volume environments it is typical to run batches through all the modules on
a single station. ADR uses the standard Capture behavior: the user will normally
open Capture Professional and start the capture path.

Note Ad hoc processing is supported by the ADR modules. However, it is not
recommended for general use. The ADR modules depend on batches being run
through the modules in a particular order. If this order isn’t followed, the solution
may not be applied in the most efficient way. For details of ad hoc processing refer to
the ADR Help.

Configuring a Classification and Separation Solution


Configuration (that is, setting up the ADR modules to process particular documents)
is a two-step process:
• Configure the ADR modules using the ADR configuration tools and a set of
  sample documents
• Assign the configuration to a settings collection using Capture Professional
  and create a batch template and capture path

The primary ADR tool used for configuration is Transformation Studio. The tutorial
in this guide will step you through the configuration process.

The Tutorial
This guide includes a tutorial on processing documents using the classification and
separation functionality in ADR.

The tutorial works through processing and configuring two solutions:
• Document Classification
• Page Classification and Separation

The Example Documents

The tutorial uses a set of example mortgage application documents with the
following document types:
• Appraisal Report
• Header
• Funding Transmittal
• Redemption
• Initial Escrow
• Request for Tax Form
• Tax Escrow
• Truth In Lending
• Loan Application

Chapter 2

Installation

Introduction
This chapter provides instructions for installing ADR using the installation wizard.

To install ADR the following items are required:
1 A computer satisfying the system requirements as stated in the FileNet
  Capture-Print-RCS Products Dependency Matrix.
  The FileNet Capture-Print-RCS Products Dependency Matrix is available
  on the IBM support web site.
  a Browse to www.ibm.com/support/documentation
  b On the Support & downloads page, under Choose support type, select
    Information Management
  c Under Choose a product, select FileNet Capture
  d On the FileNet Capture Product support page, click Product
    documentation
2 A Capture installation CD
3 An ADR installation CD
4 An ADR license key

Note ADR is installed to a subfolder of the Capture installation path. By default this
location, referred to as <Installation Path>, is:

C:\Program Files\FileNet\CaptureADR

Installing ADR for the First Time

Standard Installation
Standard installation is done from the Capture installation wizard. These instructions
do not describe every installation screen, but focus on the key points.

To install ADR:
1 Run the Capture installer and select the “Advanced Document Recognition
  (ADR)” package.
2 When prompted by the Capture installer, place the ADR installation CD into
  the CD-ROM drive.
  The ADR installer will start automatically.
3 Follow the on-screen instructions.

Silent Installation
Silent installation of ADR is possible during a silent install of Capture.

For more information, refer to the Capture documentation.

Licensing
The following licenses are available for ADR:
• Fixed-form Recognition
• Free-form Recognition
• Fixed-form and Free-form Recognition
• Fixed-form and Free-form Recognition and Classification and Separation

During installation you will be prompted to enter a software license key, which
specifies which of the above features are licensed.

To run the tutorial in this guide, you will need the last license: Fixed-form
and Free-form Recognition and Classification and Separation.

If you try to use a feature without a suitable license, an error message is displayed.

For more information on licensing, refer to the Installation Guide (.pdf).
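
The relationship between the four licenses and the features they enable can be written down as a small lookup table. The Python sketch below is illustrative: the feature labels and function names are hypothetical, and the feature sets are inferred only from the license names listed above, not from ADR license files.

```python
# Feature sets implied by the four license names listed in this section.
# The feature labels are illustrative, not ADR identifiers.
LICENSES = {
    "Fixed-form Recognition": {"fixed-form"},
    "Free-form Recognition": {"free-form"},
    "Fixed-form and Free-form Recognition": {"fixed-form", "free-form"},
    "Fixed-form and Free-form Recognition and Classification and Separation": {
        "fixed-form", "free-form", "classification-and-separation",
    },
}


def is_licensed(license_name: str, feature: str) -> bool:
    """Return True if the named license includes the feature."""
    return feature in LICENSES.get(license_name, set())


def licenses_for(feature: str) -> list:
    """List the licenses that include a feature."""
    return [name for name, features in LICENSES.items() if feature in features]


print(licenses_for("classification-and-separation"))
```

Only the last license includes classification and separation, which is why it is the one required for the tutorial.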

Chapter 3

Processing

Introduction
This chapter will introduce you to the ADR modules as they are used in production.
You will process a pre-configured example solution to experience how the modules
run.

The Mortgage Applications Example

Introduction
The ADR installation includes an example configuration that demonstrates some of
the processing features of classification and separation in ADR. The example uses a
pre-defined capture path, settings collection and template, as well as configuration
files for the ADR modules. The example has been configured twice to demonstrate
two different methods for using ADR: once to show document classification, and
once to show page classification and separation (resulting in document
classification).

The example is installed in a local repository named “ADR Examples” and is
configured to capture data from a set of (installed) example images.

Installing the Mortgage Applications Example


The example is installed by default during the ADR installation.

Document Classification
In the document classification example, ADR Recognition (Classification and
Separation) and ADR Document Review are used to classify mortgage applications.
Document boundaries are established prior to ADR Recognition (Classification and
Separation).

The complete set of Capture components and ADR modules used is:
• File Import
• Zonal OCR
• Event Activator
• ADR Recognition (Classification and Separation)
• ADR Document Review
• ADR Recognition
• ADR Completion
• ADR Scripted Export

Running the Mortgage Applications Example


This section steps you through classifying documents automatically using the
example. It takes you through the processing twice: first demonstrating
typical processing in a low volume scenario using Capture Professional, and then
demonstrating processing in a high volume environment using dedicated mode.

Tutorial: Running the Capture Path

Create a Batch
Create a batch based on the “Mortgage Applications Template” template.

X To create an example batch in Capture Professional


1 Open Capture Professional by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
Capture Professional 5.2
2 Select the “ADR Examples” repository
3 Expand the repository tree structure to display the folder names
4 Select the “Batches” folder.
Folders are optional, but are useful for organizing your work.


5 Select Edit | New Batch…
The New Batch window is displayed.
6 Enter a name for the new batch in the “Batch name” field, for example
“Mortgage Applications 1”
7 Select the “Mortgage Applications Template” template to use as the basis for
the batch.
The “Mortgage Applications Settings” settings collection will automatically
be selected.
8 Click OK

Import Images and Establish Document Boundaries


Import the example images using the File Import component. Then use Zonal OCR to
find the separator sheets (single pages inserted before scanning that indicate the start
of a new document) and Event Activator to establish document boundaries.

Note It is not possible to see the document structure in Capture Professional until
Assembly has run.

To import images and establish document boundaries, make sure the name
of the new batch is highlighted and select Tools | Start | Capture Path.

File Import will import the example images from the following location:

<Installation Path>\Examples\Mortgage Applications\Images\Document Images

Zonal OCR will read a single index field on each image. If the word
DOCUMENT is found in the field, Event Activator will set an attribute marking
this image as a separator sheet.
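
The separator logic described above can be pictured with a minimal Python sketch. It is not the Capture implementation: the Page class, the function names and the choice to drop the separator sheet from the resulting documents are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    ocr_field: str              # text read by Zonal OCR from the single index field
    is_separator: bool = False  # attribute set in the Event Activator role


def mark_separators(pages, keyword="DOCUMENT"):
    """Flag each page whose OCR field contains the keyword."""
    for page in pages:
        page.is_separator = keyword in page.ocr_field
    return pages


def split_into_documents(pages):
    """Start a new document at every separator sheet (assumption: the sheet is dropped)."""
    documents = []
    for page in pages:
        if page.is_separator:
            documents.append([])
        elif documents:
            documents[-1].append(page.name)
        else:
            documents.append([page.name])  # pages before the first separator
    return documents


# Hypothetical file names, not the installed example images.
pages = mark_separators([
    Page("sep1.tif", "DOCUMENT"),
    Page("a1.tif", ""),
    Page("a2.tif", ""),
    Page("sep2.tif", "DOCUMENT"),
    Page("b1.tif", ""),
])
print(split_into_documents(pages))
```

In the example itself the boundaries only become visible as documents once Assembly has run, as the note above states.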

The capture path will automatically continue onto the next stage, classifying
documents.

Classify the Documents


Use the Classification and Separation instance of the Recognition module to
automatically classify the documents. Any documents that cannot be recognized
confidently will be displayed in the next stage, in Document Review.
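
The hand-off from automatic classification to manual review can be sketched as a simple confidence check. This is an illustration under stated assumptions, not ADR's actual decision logic: the 0.80 threshold, the needs_review function and the sample scores are all hypothetical.

```python
REVIEW_THRESHOLD = 0.80  # hypothetical cut-off; not a documented ADR default


def needs_review(document_type, confidence, threshold=REVIEW_THRESHOLD):
    """Return True when a classification result should be shown to an operator."""
    return document_type is None or confidence < threshold


# Invented results: (document type, classification confidence)
results = [
    ("Loan Application", 0.97),
    ("Truth In Lending", 0.41),
    (None, 0.0),
]
for doc_type, confidence in results:
    route = "Document Review" if needs_review(doc_type, confidence) else "accepted"
    print(doc_type, confidence, "->", route)
```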


To classify the documents, no user interaction is required: Recognition will
classify them automatically.

Review the Classification Results


Use ADR Document Review to confirm any document types that Recognition is
unsure about or that fail validation rules.

To review the automatic classification results


1 Wait for the batch to be automatically loaded into Document Review.
Document Review will launch automatically with the batch open in the
module’s Document Classification view. This is used to quickly set any
missing document types or confirm any that are uncertain. If nothing can be
quickly fixed in this view, the problems can be overridden and will then
display in the Review view where you can see all the documents in the batch.

Figure 3-1. The Batch Loaded in Document Review

2 Select Document Classification | Override Problem to ignore the problem for
now.


A message will display stating that there are no more problem documents to
display in Document Classification view so the batch will now be shown in
Review.

Note You can also press F7 to override a problem.

Figure 3-2. Transition from Document Classification view to Review

3 Click OK on the message.


The batch will open in Review, with the overridden document displayed. It
has failed two validation rules (shown in the yellow message) which must be
fixed before the batch can be closed.


Figure 3-3. The Batch in Review

In Review, you can see all the documents in the batch. The document with
the two problems looks to be a very poorly scanned Truth In Lending
document. The next document in the batch is also a Truth In Lending.
4 Click the + buttons to the left of the two documents to expand them and
display the thumbnails


Figure 3-4. The Two Documents with Thumbnail Images

5 Compare the two documents.


They appear to be the same document (they have the same loan number in
the top left). The first of the two documents can therefore be deleted.

Note If this document could not be deleted or the problem fixed, the batch
can be suspended. It is then available to be re-opened in Capture
Professional ad-hoc mode by an administrator who could investigate the
problem.

6 Right click on the problem document to display the context menu


Figure 3-5. Deleting a Document

7 Select Delete and click Yes.


As there are no further problems in the batch you will be prompted to close
the batch.
8 Click Yes

Conditionally Extract Data


Having classified the documents and reviewed the results of the classification in
Document Review, extraction can now take place.
The standard instance of the ADR Recognition module is used to extract data from
the documents, resulting in a different set of ADR fields for each ADR document
type.
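
Conditional extraction amounts to choosing a field set from the document type. The sketch below assumes a hypothetical mapping: the field names LoanNumber and BorrowerName are invented for illustration, and the real field definitions live in the ADR configuration files, not in code like this.

```python
# Hypothetical field lists; only the document type names come from the example.
FIELDS_BY_TYPE = {
    "Header": [],        # no data extracted for this type
    "Tax Escrow": [],    # no data extracted for this type
    "Loan Application": ["LoanNumber", "BorrowerName"],
}


def fields_to_extract(document_type):
    """Return the fields to read for a classified document (none if the type is unknown)."""
    return FIELDS_BY_TYPE.get(document_type, [])


for doc_type in ("Header", "Loan Application", "Unclassified"):
    print(doc_type, "->", fields_to_extract(doc_type))
```

The lookup only works once every document carries a type, which is why classification and review run before extraction.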

To extract the data, no user interaction is required.

Recognition will automatically extract the data from the eleven documents.


Review the Extraction Results


Use ADR Completion to review the fields extracted for each document.

To review the data


1 Wait for the batch to be automatically loaded into Completion.
Every document is being displayed for this example. In production, only
documents with missing or invalid data would be displayed.

Figure 3-6. Completion Window

This is a “Header” document. No data has been extracted, but the document
type is displayed as a read only field.
2 Press F12 to move to the next document.
Again, no data has been extracted for the “Tax Escrow” document type.
3 Press F12 to move to the next document
4 Use Tab/Shift+Tab to navigate around the fields on Document 3.


Figure 3-7. Completion Window displaying Data Extracted for Loan Application
Documents

These fields have been specifically extracted for the “Loan Application”
document type. Such conditional extraction would not be possible without
the document classification that ran before.
5 Press F12 after viewing this document and after viewing each of the
remaining documents.
Notice the data that has been conditionally extracted (or deliberately not
extracted) for each of the document types in the document classification
solution.
When all the fields have been completed an End of Batch window is
displayed.
6 Click Exit Completion

Export the Data


Based on the document type, Scripted Export sets attributes to specify how the batch
is split into documents and the settings collection for each document. This settings
collection is linked to a document class in the repository, which includes the index
fields to be populated.

To export data, Scripted Export will automatically process the batch and close
once complete.

In a production system, Assembly, Index and Commit would be run after Scripted
Export. Using the attributes set by Scripted Export, the Assembly component would
split the original batch into a batch per document and assign the specified settings
collection to the single-document batch.

The Index component (configured for each document type's settings collection)
would then copy the field attributes to the index fields on the document class.
Commit would then commit the data to the repository and delete the batch.
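
The split that Assembly would perform in production can be sketched as follows. This is a hedged illustration only: plan_assembly, the batch naming scheme and the per-type settings collection names are assumptions, and the real split is driven by the attributes that Scripted Export sets on the batch.

```python
def plan_assembly(batch_name, documents):
    """Plan one single-document batch per document.

    documents: list of (document_type, settings_collection, pages) tuples.
    """
    return [
        {
            "batch": f"{batch_name}-{index:03d}",   # hypothetical naming scheme
            "settings_collection": settings,
            "document_type": doc_type,
            "pages": list(pages),
        }
        for index, (doc_type, settings, pages) in enumerate(documents, start=1)
    ]


# Hypothetical settings collection and page names.
plan = plan_assembly("Mortgage Applications 1", [
    ("Loan Application", "Loan Application Settings", ["p1.tif", "p2.tif"]),
    ("Tax Escrow", "Tax Escrow Settings", ["p3.tif"]),
])
for entry in plan:
    print(entry["batch"], entry["settings_collection"], entry["pages"])
```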

Tutorial: Running in Dedicated Mode


In dedicated mode, the ADR applications are run outside of Capture Professional.
Recognition and Scripted Export could be installed and run as Windows services, but
in this case all the applications are being run as dedicated applications.

Create a Batch
Create a batch based on the “Mortgage Applications Template (Dedicated)”
template.

To create an example batch in Capture Professional


1 Open Capture Professional by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
Capture Professional 5.2
2 Select the “ADR Examples” repository
3 Expand the repository tree structure to display the folder names
4 Select the “Batches” folder.
Folders are optional, but are useful for organizing your work.
5 Select Edit | New Batch…
The New Batch window is displayed.
6 Enter a name for the new batch in the “Batch name” field, for example
“Mortgage Applications 2”


7 Select the “Mortgage Applications Template (Dedicated)” template to use as
the basis for the batch.
The “Mortgage Applications Settings” settings collection will automatically
be selected.
8 Click OK

Import Images and Establish Document Boundaries


Zonal OCR and Event Activator are again used to determine the start and end of
each document.

To import images and establish document boundaries


1 Make sure the name of the new batch is highlighted and select Tools | Start
| Capture Path.
File Import will import the example images from the following location:
<Installation Path>\Examples\Mortgage Applications\Images\Document
Images
Zonal OCR will read a single index field on each image. If the word
DOCUMENT is found in the field, Event Activator will set the attributes
marking this image as a separator sheet.
No further processing will occur due to the Save/Stop stage in the capture
path.
2 Select File | Return to Ad Hoc Mode to return to ad hoc processing
3 Select File | Exit to close Capture Professional

Classify Documents
Use the Classification and Separation instance of the Recognition module to
automatically classify the documents. Any documents that cannot be recognized
confidently will be displayed in the next stage, in Document Review.

To classify the documents


1 Open Recognition (Classification and Separation) by clicking Start on the
taskbar to display the menu, and selecting:
All Programs | IBM FileNet Capture Professional | ADR Examples | ADR
Recognition (Classification and Separation) – ADR Examples
This will only process batches from the ADR Examples local repository.


2 Select Session | Select Batch… and select the batch created in Capture
Professional from the list
3 Click Open.
Recognition will automatically classify the documents; no user interaction is
required.
4 When processing is complete the status bar will display “Status: Idle”
5 Select Session | Exit to close Recognition

Note Rather than selecting a single batch in Recognition, the module would
normally be started in Wait for any Batch mode to automatically process batches as
they become available. Alternatively, Recognition would be installed as a Windows
service and would process batches automatically.

Review the Classification Results


Use ADR Document Review to confirm any document types that Recognition is
unsure about or that fail validation rules.

To review the automatic classification results


1 Open Document Review by clicking Start on the taskbar to display the menu,
and selecting:
All Programs | IBM FileNet Capture Professional | ADR Examples | ADR
Document Review – ADR Examples
This will only process batches from the ADR Examples local repository.
2 Select Session | Select Batch… and select the batch created in Capture
Professional from the list
3 Click Open
4 Override the problem in the Document Classification view
5 Click OK on the message box to open the Review view
6 Delete the poorly scanned document
7 Click Yes on the message box to close the batch
8 Click Cancel on the Select Batch window
9 Select Session | Exit to close Document Review


Conditionally Extract Data


Having classified the documents and reviewed the results of the classification in
Document Review, extraction can now take place.
The standard instance of the ADR Recognition module is used to extract data from
the documents, resulting in a different set of ADR fields for each ADR document
type.

To extract the data


1 Open Recognition by clicking Start on the taskbar to display the menu, and
selecting:
All Programs | IBM FileNet Capture Professional | ADR Examples | ADR
Recognition – ADR Examples
This will only process batches from the ADR Examples local repository.
2 Select Session | Select Batch… and select the batch created in Capture
Professional from the list
3 Click Open.
Recognition will automatically read the data from the eleven documents; no
user interaction is required.
When processing is complete the status bar will display “Status: Idle”.
4 Select Session | Exit to close Recognition

Review the Extraction Results


Use ADR Completion to review the fields extracted for each document.

To review the data


1 Open Completion by clicking Start on the taskbar to display the menu, and
selecting:
All Programs | IBM FileNet Capture Professional | ADR Examples | ADR
Completion – ADR Examples
This will only process batches from the ADR Examples local repository.
2 Select Session | Select Batch… and select the batch created in Capture
Professional from the list
3 Click Open
4 Review the data as you did when running the capture path


5 When the end of batch window displays, click Exit Completion

Export the Data

Based on the document type, Scripted Export sets attributes to specify how the batch
is split into documents and the settings collection for each document. This settings
collection is linked to a document class in the repository, which includes the index
fields to be populated.

To export data
1 Open Scripted Export by clicking Start on the taskbar to display the menu,
and selecting:
All Programs | IBM FileNet Capture Professional | ADR Examples | ADR
Scripted Export – ADR Examples
This will only process batches from the ADR Examples local repository.
2 Select Session | Select Batch… and select the batch created in Capture
Professional from the list
3 Click Open.
Scripted Export will automatically process the batch; no user
interaction is required.
When processing is complete the status bar will display “Status: Idle”.
4 Select Session | Exit to close Scripted Export.

Note Rather than selecting a single batch in Scripted Export, the module
would normally be started in Wait for any Batch mode to automatically
process batches as they become available. Alternatively, Scripted Export
would be installed as a Windows service and would process batches
automatically.

In a production system, Assembly, Index and Commit would be run after Scripted
Export. Using the attributes set by Scripted Export, the Assembly component would
split the original batch into a batch per document and assign the specified settings
collection to the single-document batch.

The Index component (configured for each document type’s settings collection)
would then copy the field attributes to the index fields on the document class.
Commit would then commit the data to the repository and delete the batch.


Page Classification and Separation


In the page classification and separation solution, ADR Recognition (Classification
and Separation) and ADR Document Review are used to classify and separate
documents. Document boundaries are established from the classification of pages in
ADR Recognition (Classification and Separation) and used by the later ADR
modules.

The complete set of Capture components and ADR modules used is:
• File Import
• ADR Recognition (Classification and Separation)
• ADR Document Review
• ADR Recognition
• ADR Completion
• ADR Scripted Export

Running the Mortgage Applications Example


This section is similar to the last tutorial, but will demonstrate automatic separation.
The tutorial only uses the low volume scenario (using Capture Professional), though
either could be used.

Tutorial: Running the Capture Path

Create a Batch
Create a batch based on the template called “Mortgage Applications with Separation
Template”.

To create a batch in Capture Professional


1 Open Capture Professional by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
Capture Professional 5.2
2 Select the “ADR Examples” repository
3 Expand the repository tree structure to display the folder names
4 Select the “Batches” folder.
Folders are optional, but are useful for organizing your work.
5 Select Edit | New Batch…


The New Batch window is displayed.


6 Enter a name for the new batch in the “Batch name” field, for example
“Mortgage Applications 3”
7 Select the “Mortgage Applications with Separation Template” template to
use as the basis for the batch.
The “Mortgage Applications with Separation Settings” settings collection
will automatically be selected.
8 Click OK

Import Images
Import the example images.

To import images, make sure the name of the new batch is highlighted and
select Tools | Start | Capture Path.

File Import will import the example images from the following location:

<Installation Path>\Examples\Mortgage Applications\Images\Page Images

The Capture Path will automatically continue onto the next stage, classifying and
separating documents.

Classify and Separate the Documents


Use the Classification and Separation instance of ADR Recognition to automatically
classify and separate the documents. Any documents that cannot be classified or
separated confidently will be displayed in the next stage, in Document Review.

Note It is not possible to see the document structure in Capture Professional until
Assembly has run.

To classify and separate the documents, no user interaction is required:
Recognition will classify and separate them automatically.

Review the Classification and Separation Results


Use ADR Document Review to confirm any document types that Recognition is
unsure about or that fail validation rules.

To review the automatic classification and separation results


1 Wait for the batch to be automatically loaded into Document Review.


A problem is displayed in the Document Classification view, since the
separation confidence score for a Funding Transmittal document was too
low.

Figure 3-8. Problem in the Document Classification View

You can see that the document is, however, correctly classified. A simple
confirmation is required.
2 Press Enter to confirm the document type
3 Click OK on the message to open the Review view
4 Expand the documents to check the automatic separation has been successful
5 Select Session | Close Batch and click Yes to close the batch

Conditionally Extract Data


Having classified the documents and reviewed the results of the classification in
Document Review, extraction can now take place.
The standard instance of the ADR Recognition module is used to extract data from
the documents, resulting in a different set of ADR fields for each ADR document
type.


To extract the data, no user interaction is required.

Recognition will automatically extract the data from the fourteen documents.

Review the Extraction Results


Use ADR Completion to review the fields extracted for each document.

To review the data


1 Wait for the batch to be automatically loaded into Completion
2 Review the data that has been conditionally extracted for each of the
document types in the page classification and separation solution, pressing
F12 to move on to the next document
3 When you reach the final document and have verified that all the fields are
complete, press F12.
An End of Batch window is displayed.
4 Click Exit Completion

Export the Data


Based on the document type, Scripted Export sets attributes to specify how the batch
is split into documents and the settings collection for each document. This settings
collection is linked to a document class in the repository, which includes the index
fields to be populated.

To export data, Scripted Export will automatically process the batch and close
once complete.

In a production system, Assembly, Index and Commit would be run after Scripted
Export. Using the attributes set by Scripted Export, the Assembly component would
split the original batch into a batch per document and assign the specified settings
collection to the single-document batch.

The Index component (configured for each document type's settings collection)
would then copy the field attributes to the index fields on the document class.
Commit would then commit the data to the repository and delete the batch.



Chapter 4

Configuration

Overview

Introduction
To create an ADR solution, you first need to configure the ADR modules using the
ADR configuration tools and a set of sample documents. Once you have created and
tested this configuration, you need to assign it to a settings collection using Capture
Professional and create a batch template and capture path.

In these tutorials you will replicate the classification and separation elements of the
Mortgage Applications example configuration.

Sample Documents
In order to build your configuration, you need a set of sample documents that
accurately represent the documents you will process in the final solution. Typically
these documents will be exported from a current archive or repository system or
collected from current incoming documents.

Accuracy

The first step when configuring a solution is to use the Document Set Management
features in Transformation Studio to ensure the accuracy of the sample documents.
This is particularly important when building a configuration automatically – if the
input to the training process for classifiers and separators isn't accurate, the output
won't be accurate. The Document Set Management steps are also useful to ensure a
good understanding of the structure of the document set (and that no document
types are missing).


Representative Documents

It is important that the sample documents are scanned using the production scanner
and represent the variations that are seen in production, for example faxes and
photocopies. If extraction (indexing) is being implemented as well as classification
and separation, it is recommended that the documents are scanned at 300 dpi.

Document Set Management Steps


The following steps are used to create two accurate document sets, which are then
used to configure and test a solution.

Step 1: Create Project Open Transformation Studio and create a new project.

Step 2: Import Documents Import documents, optionally with document properties.

Step 3: Assign Document Types If document types have not been imported, assign a
few document types manually and run automatic classification.

Step 4: Initial Analysis Get an overview of your document set.

Step 5: Select Sample Documents for Configuration Select a subset of documents to
clean up and use for configuration.

Step 6: Read Page Content Read (OCR) all the pages in the documents selected for
configuration. Transformation Studio will use the reads in the next step.

Step 7: Cleanup Documents Within this step you will analyze your document set,
clean up the documents and add more samples until the set is ready to be used for
configuration.

Step 8: Select Documents for Testing From the clean document set, select a set of
documents to use for testing. These documents must not be used for configuration.


Figure 4-9. Document Set Management Process


Create Configuration

Recognition

The Recognition module uses classifiers and separators to determine how a batch of
pages is split into documents, and to determine the type of each document.

Classifiers may be based on image or text content and are created from a set of
sample documents. These learn-by-example classifiers can be supplemented with
manually configured templated (including barcode) or rules-based classification
methods.

The advanced document separator is created automatically from the document types
assigned to a set of sample documents and, when run in production, takes into
account the confidence of the page classification results. The rules-based separator is
manually defined using a set of separation rules.
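
To give a feel for how page classification results and their confidence can drive separation, here is a deliberately simplified sketch. It is not the advanced document separator, which is trained from sample documents: the rule used here (start a new document when the page type changes, keep low-confidence pages with the current document and flag them) and the 0.6 threshold are assumptions for illustration.

```python
def separate(pages, min_confidence=0.6):
    """Group classified pages into documents.

    pages: list of (page_type, confidence) tuples in batch order.
    Returns (documents, uncertain_indexes).
    """
    documents, uncertain = [], []
    for index, (page_type, confidence) in enumerate(pages):
        if confidence < min_confidence:
            uncertain.append(index)  # candidate for manual review
            if documents:
                documents[-1]["pages"].append(index)  # keep with current document
                continue
        if not documents or documents[-1]["type"] != page_type:
            documents.append({"type": page_type, "pages": [index]})
        else:
            documents[-1]["pages"].append(index)
    return documents, uncertain


# Invented page results for illustration.
batch = [
    ("Loan Application", 0.90),
    ("Loan Application", 0.80),
    ("Tax Escrow", 0.30),
    ("Tax Escrow", 0.95),
]
documents, uncertain = separate(batch)
print(documents)
print("pages to review:", uncertain)
```

Note that this toy rule would merge two consecutive documents of the same type, which is one reason a real separator needs more than a type-change test.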

Transformation Studio is used to create all these classifiers and separators. For those
created automatically, it is particularly important that the sample documents are
accurately defined using the Document Set Management Steps.

Document Classification Configuration Steps


The following steps are used to create and test a Recognition document classification
configuration using the two accurate document sets.

Step 1: Create Configuration Create a default configuration from the Document
Classification template.

Step 2: Configure Text Classification Create a document text classifier, integrate into
the configuration and test.

Step 3: Add in Additional Classification Methods Optionally, configure templated
and rules-based classification, integrate into the configuration and test.

Step 4: Test Test the full configuration, analyzing the classification results and
looking for areas to improve.


Figure 4-10. Recognition Configuration Steps (Document Classification)

Page Classification and Separation Configuration Steps


The following steps are used to create and test a Recognition page classification and
separation configuration using the two accurate document sets.
Step 1: Create Configuration Create a default configuration from the Page
Classification and Separation template.
Step 2: Configure Text Classification Create a page text classifier, integrate into the
configuration and test.
Step 3: Configure Document Separation Create a separator, integrate into the
configuration and test.
Step 4: Add in Additional Classification Methods Optionally, configure image,
templated and rules-based classification, integrate into the configuration and test.
Step 5: Test and Evaluate Performance Test the full configuration and evaluate the
classification and separation performance.


Figure 4-11. Recognition Configuration Steps (Page Classification and Separation)

Document Review

The Document Review module is configured using a single Document Review
project file. This project file contains reasons for displaying documents, validation
rules and window options (for example shortcut keys and text labels).

The Document Review project file is configured using the Document Review Project
Editor.

Document Classification Configuration Steps


There is just one step when configuring Document Review:

Step 1: Configure a Document Review Project File Create and configure a project file
to include validation rules and interface options.


Page Classification and Separation Configuration Steps

The page classification and separation tutorial will use the Document Review
configuration you created for the document classification tutorial.

Integrate the Configuration with Capture


Having built the configuration, a capture path, settings collection and batch template
need to be created in Capture Professional. The configuration can then be assigned to
each settings collection and a batch can be processed.

Integration Steps
The following steps are used to integrate the configuration with Capture and run a
batch through the solution.

Document Classification Integration Steps

Step 1: Create Capture Path

Step 2: Create Settings Collection

Step 3: Assign Configuration to Settings Collection

Step 4: Create Batch Template

Step 5: Process a Batch

Page Classification and Separation Integration Steps

Step 1: Create Capture Path

Step 2: Create Settings Collection

Step 3: Assign Configuration to Settings Collection

Step 4: Create Batch Template

Step 5: Process a Batch


Document Classification Tutorial

Document Set Management

Step 1: Create Project


When using Transformation Studio, you will work in a project. Within this project
you can import and organize your sample documents and create one or more
configurations.

To create a project
1 Open Transformation Studio by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Transformation Studio
2 Click New… to open the New Project window
3 In the window enter the name “Tutorial” for your new project.

Note It is recommended that the project is saved in the default location.
However, you may change the location by clicking Browse and selecting a
new location from the Project Location window.

4 Click Create to create the project and open the Import Documents tab


Figure 4-12. Transformation Studio after Create New Project


1 Project Explorer showing current document sets and configurations
2 Document Types panel displaying the document types in the current document set
3 Status bar showing the current state of Transformation Studio
4 Tab area, currently showing the Import Documents tab

Step 2: Import Documents

Transformation Studio includes a wizard for importing documents into the current
project. Documents can be imported by selecting files and/or folders containing files.

Files on disk can be mapped into various document structures using the file/folder
structure of the image. In addition, data stored in the text of the filename (for
example document types) can be used for a first attempt at classification. However, if
document types are not known they can be assigned after import. Modifications to
document structure can also be done after import, but this is a manual process.


Import of documents can be done multiple times, though within a single import each
file can only be imported once.

The Import Documents Wizard is launched automatically when a new project is
created.

Note To launch the Import Documents Wizard manually, select File | Import
Documents, press CTRL+SHIFT+I or click the corresponding toolbar button.

The Example

The example mortgage applications have been exported from an archive system, and
have the following folder architecture:
• Each document type is in a folder, named with the document type
• Each document is in a folder
• Each page is a single file, with the last part of the filename indicating the page
number

During import, you will use values from the filename and path to specify how the
files should be imported into documents and to set document properties:
• Each folder will indicate a new document
• The last part of the filename will indicate page order
• The folder name will be imported as the document type
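
The mapping from folders and filenames to documents can be sketched in a few lines of Python. This only illustrates the three rules listed here and is not how the Import Documents Wizard is implemented: the sample paths, the group_pages helper and the "_NN" page-number suffix are assumptions, since the exact filename layout of the example images is not given in this guide.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def group_pages(file_paths):
    """Return {(document_type, document_folder): [page files in page order]}."""
    documents = defaultdict(list)
    for raw in file_paths:
        path = PureWindowsPath(raw)
        document_folder = path.parent.name       # each folder is one document
        document_type = path.parent.parent.name  # type folder above the document
        # Assumption: the page number is the part after the last underscore.
        page_number = int(path.stem.rsplit("_", 1)[-1])
        documents[(document_type, document_folder)].append((page_number, path.name))
    return {
        key: [name for _, name in sorted(pages)]
        for key, pages in documents.items()
    }


# Hypothetical paths; the installed sample images may be named differently.
files = [
    r"C:\Samples\Appraisal Report\Doc0001\Doc0001_02.tif",
    r"C:\Samples\Appraisal Report\Doc0001\Doc0001_01.tif",
    r"C:\Samples\Tax Escrow\Doc0002\Doc0002_01.tif",
]
for key, pages in group_pages(files).items():
    print(key, pages)
```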

Note During import, Transformation Studio displays status messages at the bottom
of the window.

To import the example mortgage documents


1 Select the images to import.
a Click Select Folders… at the top of the Import Documents tab
b In the Select Folders to Import window, navigate to:
<Installation Path>\Examples\Mortgage Applications\Sample Images


Figure 4-13. Folders to Select for Import

c Select all the folders (there is one for each document type)
d Click Open


Figure 4-14. Folders to be Imported

e Click Next to display Step 2


2 Specify the document structure and values to import.
Transformation Studio has already split (parsed) the filenames and paths
into values, as displayed using the example at the top of the tab. For this
example, there is no need to modify the parsing options.
a On the Structure panel, select “Every imported folder is a document”.
The preview on the right will update to show how the files will be
imported into documents.
b From the “Page sequence indicator” list, select the last value “11 - 01”.
This specifies that the eleventh value in the filename/path indicates the
order of pages in each document. For the currently selected example
document, the data in this eleventh value is “01”.


Note If you have not installed to the default location, the number may be
different. Ensure you select the last item in the list.

c On the Document Properties panel, select “7 – Appraisal Report” from
the “Document type” list.
This specifies that the seventh value in the filename/path is to be
imported as the document type. For the currently selected example
document, the value of this seventh property is “Appraisal Report”.

Note If you have not installed to the default location, the number may be
different. Ensure you select the item with the value “Appraisal Report” in
the list.

Figure 4-15. Options Specifying Document Structure and Properties

d Click Next to go to step 3


3 Specify import options


a Select “Copy document files into project folder”, rather than referencing
the images in their current location.
This will copy them into the project folder, making it easier to move your
project at a later time and ensuring no dependency on the images
remaining in their current location.

Note Using this option will slow the import process and require more
disk space.

Note It is possible to move the project when the images have not been
copied into it, but the images must be accessible from wherever the
project is moved to.

Figure 4-16. Import Options

b Click Import to import the files into the project.


The Cancel button will be renamed to Abort.
The import is complete when the Abort button is renamed to Finish.


4 Click Finish to exit the Import Documents tab
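
The numbered values offered in Step 2 of the wizard (for example “7 – Appraisal Report” and “11 - 01”) can be pictured with the following sketch. It is only an approximation: the underscore delimiter, the sample path and the function name `parse_values` are assumptions, and the product's actual parsing options may differ.

```python
import re


def parse_values(path, filename_delimiters="_"):
    """Split a Windows path into numbered values: folders first, then filename parts."""
    *folders, filename = re.split(r"[\\/]", path)
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    # Assumed: the filename is split on the given delimiter characters.
    name_parts = re.split("[" + re.escape(filename_delimiters) + "]", stem)
    values = [v for v in folders + name_parts if v]
    return {index: value for index, value in enumerate(values, start=1)}


# Hypothetical path, chosen so that the indexes match those used in the tutorial.
example = (r"C:\Program Files\Vendor\Examples\Mortgage Applications"
           r"\Sample Images\Appraisal Report\Doc0001\Appraisal_Report_01.tif")
values = parse_values(example)
document_type = values[7]           # "Appraisal Report"
page_sequence = values[len(values)]  # last value, here "01"
```

Because the index of each value depends on the depth of the installation path, selecting the last value for the page sequence (as the notes in Step 2 advise) is more robust than relying on a fixed number.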

Step 3: Assign Document Types

Transformation Studio includes a feature for automatically classifying documents.
Auto Classify compares the text content of each unclassified document to the text
content of documents that already have a type assigned. The most probable document
type is assigned based on the similarities and differences between the texts.
In the example, the imported documents already have document types, so no work is
required here.

If you did import documents without document types, you would need to do this
step. Refer to the ADR Help for more information.
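
The idea of assigning the most probable type by text similarity can be sketched as follows. This is a deliberately simple stand-in (bag-of-words cosine similarity against each classified document) and is not the algorithm Auto Classify actually uses; the function names and sample texts are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def auto_classify(unknown_text, labelled):
    """Return the type of the most similar labelled text, or None if nothing matches.

    labelled is a list of (document_type, text) pairs.
    """
    target = bag_of_words(unknown_text)
    best_type, best_score = None, 0.0
    for doc_type, text in labelled:
        score = cosine(target, bag_of_words(text))
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type
```

A document whose text shares no words with any classified document stays unclassified (`None`) in this sketch, which mirrors the need for some already-typed documents before automatic classification can help.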

Step 4: Initial Analysis

Once you have imported your documents, the Overview tab displays the
composition of your document set.


Figure 4-17. Overview

You can see how many document types you have and the distribution of documents
across those types. From the Overview you may realize that some types occur rarely
and don’t need configuration, or that you have more or fewer document types or
documents than you expected.
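
The composition shown on the Overview tab is essentially a count of documents per type. A minimal sketch of that summary, assuming the document types are available as a plain list (with `None` for documents that have no type), might look like this; `type_distribution` is a hypothetical helper, not part of the product.

```python
from collections import Counter


def type_distribution(doc_types):
    """Return (type, count, percentage) tuples, most frequent type first."""
    counts = Counter(t if t else "(Unknown)" for t in doc_types)
    total = sum(counts.values())
    return [(t, n, round(100 * n / total, 1)) for t, n in counts.most_common()]
```

Types with a very small share stand out immediately in such a listing, which is the same signal the chart gives for the Header type in the example.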

To review the mortgage applications


1 Review the chart for the number of document types (x-axis) and the number
of documents (y-axis).

Note The Tax Escrow has more documents than any other type and the
Header has very few documents.


2 Review the Header documents to see whether you should get more
examples.
a Double-click the Header bar in the chart to open Browse Documents with
a filter to show just the Header documents.
b Scroll through the documents, looking at the amount of variation between
each example.
In fact, all of these documents contain barcodes that will be used to
classify the documents. Configuration of templated (barcode)
classification is simple and requires only a few examples, so no additional
documents are required.
3 Select the Overview tab to return to the chart view

Note To display more information on a specific document type, hold the mouse over
the bar in the chart and review the information in the tool tip and in the summary
statistics below the chart.

Note To change the chart display, use the toolbar buttons above the chart.

Step 5: Select Sample Documents for Configuration

In this step you will select a subset of the documents in your project to add into the
Sample Documents set, on which your configuration will be based. Any documents
not selected are put into the Unused Documents set. From this set it is easy to add
more samples later, without accidentally selecting duplicates.
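
The selection can be pictured as drawing a random subset of each document type and leaving the remainder in Unused Documents. The sketch below assumes a default of 150 documents per type (the figure used later in this step) and a simple list of (id, type) pairs; the function name and data layout are illustrative only.

```python
import random
from collections import defaultdict


def select_samples(documents, per_type=150, seed=None):
    """Split documents into sample and unused sets.

    documents is a list of (doc_id, doc_type) pairs. Returns
    (sample_ids, unused_ids, short_types), where short_types lists the
    types that had fewer than per_type documents.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for doc_id, doc_type in documents:
        by_type[doc_type].append(doc_id)

    sample, unused, short_types = set(), set(), []
    for doc_type, ids in by_type.items():
        if len(ids) < per_type:
            short_types.append(doc_type)  # the tool would display a warning here
        chosen = set(rng.sample(ids, min(per_type, len(ids))))
        sample |= chosen
        unused |= set(ids) - chosen
    return sample, unused, sorted(short_types)
```

Because every document ends up in exactly one of the two sets, adding more samples later from the unused set cannot select a document twice.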

Document Sets
When working in a project it is recommended that you always use standard
document sets to ensure maximum accuracy and efficiency when setting up a
solution. If you have a large number of documents it may be time-consuming and
unnecessary to work on all the documents in your project. It is also important to
separate out a set of documents to use for testing and to ensure these are not used
during configuration. Standard document sets and the tools to move documents into
them support both these options. In addition, you may wish to create subsets of
documents in your project according to your own criteria, in which case you can use
the custom document sets.
Each of the three standard document sets has a specific role.
• Sample Documents: Documents to use when developing the configuration
  (for text and image classification these will be used to train the classifiers)
• Test Documents: Documents to use when testing the configuration (not used
  for training)
• Unused Documents: Documents that are not currently being used. These may
  be additional documents that are not required for configuration or documents
  that have not yet been classified.

Table 4-1 gives guidelines for the number of documents required for the different
classification methods (per document type). The figures take into account that some
documents may be misclassified or of poor quality and therefore may be discarded
before starting the configuration process. Although you can use more than the
suggested number of sample documents, this will slow down the configuration
process and may not improve accuracy. However, if your initial document set is poor
you should start with a higher number.
Table 4-1. Guideline Number of Documents per Document Type

Method                                                            Number of Documents
Text classification (or a combination of classification methods)  150
Image, templated or rules-based classification                    10

Note For information on the suitability of documents/pages for a particular
classification method, refer to the ADR Help.

Documents in Multiple Document Sets

Documents in standard or custom document sets are shared with those in the overall
project, that is, a document in a standard or custom document set is the same as that
document in All Documents.

Note The actual image files contained in a document are not duplicated into each
set, only the names of the files are duplicated.

When documents are “added” to a set, they are members of the original set and the
set they have been added to.

When documents are “moved” to a set, they are members of the new set, but not the
original set.

Changes to properties or the structure of a document in any document set will affect
that document in all document sets to which it belongs. Similarly, if pages and
documents are deleted or reordered in one document set they will be deleted or
reordered in all document sets to which they belong.

Note All documents are always present in the All Documents set. Documents can be
“added” to another set from All Documents, but cannot be “moved” to another set
from All Documents.
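
The difference between adding and moving can be modelled with sets of document identifiers that all refer to the same underlying documents. The class below is a hypothetical model written for illustration; it is not how Transformation Studio stores its projects.

```python
class Project:
    """Toy model of document-set membership (shared documents, no copies)."""

    ALL = "All Documents"

    def __init__(self, doc_ids):
        self.sets = {
            self.ALL: set(doc_ids),
            "Sample Documents": set(),
            "Test Documents": set(),
            "Unused Documents": set(),
        }

    def add(self, doc_id, source, target):
        """After an add, the document belongs to both source and target."""
        if doc_id not in self.sets[source]:
            raise KeyError(f"{doc_id} is not in {source}")
        self.sets[target].add(doc_id)

    def move(self, doc_id, source, target):
        """After a move, the document belongs to target but no longer to source."""
        if source == self.ALL:
            raise ValueError("documents cannot be moved out of All Documents")
        if source == target:
            return
        self.add(doc_id, source, target)
        self.sets[source].discard(doc_id)
```

Since each set holds only identifiers, a change made to a document through one set is automatically visible through every other set that contains it.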

To select sample documents to use for configuration


1 Select Document Sets | Select Sample Documents… or click the corresponding
toolbar icon to display the Select Sample Documents window.

Figure 4-18. Select Sample Documents window

Note By default, 150 documents of each type will be added to the Sample
Documents set. If fewer than the specified number of documents exist in a
document type, a warning will display and all the documents in that type
will be added to the Sample Documents set.

2 Click OK.
Once the documents have been successfully added, a message will display.

Figure 4-19. Message After Selecting Documents

3 Click Yes to open the Sample Documents set


Figure 4-20. Project Explorer after Select Sample Documents

Step 6: Read Page Content

At this point you need to read (OCR) each page of the documents in your sample set.
Using these reads, Transformation Studio can help you analyze the documents with
the aim of finding any that are misclassified or poor quality. These reads will also be
used when you build text classifiers and configure additional classification methods.

Although all the documents in a project could be read, this is time consuming and
often unnecessary. Reading just the documents in the Sample Documents set is
sufficient (as these documents will be used for configuration and testing).

During the read, the status bar will display the number of the page being read and
the estimated time remaining. The read can be stopped at any point; no data will be
lost but you will need to read the remaining pages in order to continue to the next
step.
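
The stop-and-resume behavior described above amounts to reading only the pages that have no stored content yet. A minimal sketch, assuming the page texts are held in a dictionary and that `ocr` is a caller-supplied function (both are assumptions for the example):

```python
def read_missing_pages(pages, ocr, should_stop=lambda: False):
    """Read pages that have no content yet; return the ids still unread.

    pages maps page_id -> text, or None if the page has not been read.
    ocr is a callable taking a page_id and returning its text.
    should_stop is polled before each page so the read can be interrupted.
    """
    for page_id in [p for p, text in pages.items() if text is None]:
        if should_stop():
            break
        pages[page_id] = ocr(page_id)  # results so far are kept
    return [p for p, text in pages.items() if text is None]
```

Running the function a second time picks up where the first run stopped, because pages that already have content are skipped.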

The read parameters used in the production configuration should match the
parameters used when reading the page content in Transformation Studio. When
you create a new configuration the default parameters will automatically match (the
parameters in the configuration resources folder are the same as those used by
default on the Read Page Content tab). However, if you are updating a production
configuration in which you have customized the full page read, you should use these
customized full page read parameters when reading page content in Transformation
Studio.

In addition, you should use custom read parameters if:
• You have non-English language documents
• You only need to read a small section of each page (which will reduce
  processing time)
• You want to use the read for extraction as well as classification and need a
  higher read accuracy

Note For information on setting custom read parameters refer to the ADR Help.

To read the pages in the sample document set


1 Select Tools | Read Page Content.

Note As it is the currently open document set, “Sample Documents” will
automatically be selected in the Document Set list.

2 Click Read.
Once the read has finished, the Stop button will be renamed to Finish.
3 Click Finish

Important Reading all the pages in a document set may take a long time.

Step 7: Cleanup Documents

Within this step you will analyze your document set, clean up the documents and
add more samples until the set is ready to be used for configuration. These three
steps may have knock-on effects on each other, requiring one or more steps to be
done multiple times.

The aim is to:
• Require no more work in Cleanup Documents
• Have at least 100 clean samples of each document type (this is critical if
  configuring page text classification)

The following sections describe each of the individual steps. The tutorial then ties
these together, showing how each step may need to be done more than once.

Step 7.1: Analysis using the Overview tab


The Overview tab was first used in Step 4: Initial Analysis and displays statistical
information on a document set. Once the pages have been read in Step 6: Read Page
Content, the Overview chart is updated to indicate how clean (accurate)
Transformation Studio judges the set to be. Each document type in the chart is
color-coded according to the following criteria:
Table 4-2. Color Coding of Document Types in Overview Chart

Color   Label      Description
Green   Clean      The document type does not have very much variation and
                   needs little or no work in Cleanup Documents.
Orange  Poor       The document type has some variation and will need some
                   work in Cleanup Documents.
Red     Very Poor  The document type has a lot of variation in the text content. It
                   may need a lot of attention within Cleanup Documents or may
                   not be suitable for text classification.
Gray    Unknown    No information is available as the document type has not been
                   read or is “(Unknown)”, that is, no type is assigned to the
                   documents.

Note This data is also visible by displaying the tool tip for a document type in the
chart (hover the mouse over the column).

Note The analysis of the documents is based on the page content (text) reads. This
means that occasionally a document type will appear to be poor, when it is actually
clean but only suitable for a classification method other than text (for example, image
or templated classification).

Step 7.2: Cleanup


Using the Cleanup Documents tab is an efficient way to clean up your document set
with assistance from Transformation Studio. Possible problem documents are
identified automatically and displayed for manual confirmation. In addition,
documents that will help Transformation Studio to refine its analysis of the
document type in the most efficient way are displayed. These documents are
continually updated based on the confirmation (or re-classification) of the last
document.

Within Cleanup Documents there are two steps:
• Cleaning up Extra Pages
• Cleaning up Document Types

Transformation Studio analyzes the document set and identifies pages it suspects are
extra. These may be blank pages, fax cover sheets, pages with text that isn't found on
other documents in the type or pages that cannot be read properly.
• Cleaning up Extra Pages: Within this step you will confirm whether or not
  each of the marked pages (those suspected of being extra) is an extra page.
  Only confirm that a page is an Extra Page if it is not representative of the type
  and will not occur in production. Extra Pages may mislead the process when
  building classifiers, reducing the accuracy of the overall solution.
• Cleaning up Document Types: When Transformation Studio analyzes the
  document set it assigns a confidence state to each document: confident,
  unconfident or misclassified. It also identifies documents that will help define
  each document type. In this step you will confirm or remove the document
  type for each of these identified documents until all the documents in the set
  are confident. As you work, Transformation Studio will continually
  re-analyze the set and adjust the confidence states for other documents.

Note As you confirm pages and documents, you may find that other documents are
affected. Therefore Transformation Studio may cycle you through the Cleanup
Documents process until cleanup is complete, that is there are no extra pages,
unconfident documents or misclassified documents.

Note As you work in Cleanup Documents, Transformation Studio will continually
re-analyze the documents, adjusting the confidence states based on the information
you provide. Only documents that will affect the states will be shown, reducing the
amount of work you need to do.
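
The completion condition for cleanup (no extra pages, unconfident documents or misclassified documents) can be expressed as a simple check over the confidence states. The state names below follow this guide; representing them as lower-case strings, and the function names themselves, are assumptions made for the sketch.

```python
from collections import Counter

# States that need no further attention: manually confirmed, or judged confident.
CLEAN_STATES = {"confirmed", "confident"}


def state_proportions(states):
    """Return the share of documents in each confidence state."""
    total = len(states)
    if total == 0:
        return {}
    return {state: n / total for state, n in Counter(states).items()}


def is_clean(states, suspected_extra_pages=0):
    """A type is clean when no suspected extra pages remain and every
    document is either confirmed or confident."""
    return suspected_extra_pages == 0 and all(s in CLEAN_STATES for s in states)
```

A single unconfident or misclassified document, or one unresolved extra page, is enough for the check to fail, which is why confirming one document can bring others back for review.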

Step 7.3: Add More Samples


Having completed cleanup and analyzed your document set in Overview, you may
have determined that you need more samples in order to create an accurate
configuration. There are three methods for adding more classified documents into
Sample Documents:
• Automatically classify documents of unknown type (that is, documents that
  have never been classified or had their type removed during cleanup)
• Move more documents from the Unused Documents set (documents were
  moved into the Unused Documents set during Step 5: Select Sample
  Documents for Configuration)
• Import more documents


Note Whenever you add or move documents to the Sample Documents set, it is
recommended you repeat cleanup. For an indication of the additional work required,
review the Overview chart.

To clean up your document set


1 Review the Overview chart.
You will see that two document types are red while the others are green.
This indicates that the Redemption and Tax Escrow types will need more
work to clean up than the others.

Note To see more information on a specific document type, hover the mouse
over the bar in the chart.

2 Select the Cleanup Documents tab


3 Clean up the documents that are displayed by following the on-screen
instructions and answering the questions. This step will vary depending on
which documents were randomly selected by Transformation Studio as
Sample Documents. However, the same process is always used:
• Suspected extra pages are displayed for each document type in the set
• Documents needing their type confirmed are displayed for each
  document type in the set
• Additional suspected extra pages are displayed
• Additional documents needing their type confirmed are displayed
Further documents may then be identified that need attention. If this is the
case a message will be displayed.
Between each of the above steps (and each change of document type) a
message is displayed.
Document Type Cleanup If the document displays with a colored title bar
and the question reads “Is the displayed document a <document type>?”, the
document type needs to be confirmed.


Figure 4-21. Confirming Document Types

Only documents that will affect the confidence of the documents in the set
will be displayed. These documents are continually reassessed as you
confirm or remove document types.
The documents are color-coded as described in Table 4-3.
Table 4-3. Color Coding of Documents

Color  Confidence State  Description
Green  Confident         Transformation Studio is confident that this document
                         is correctly classified, that is, it has the correct
                         document type.
Gray   Unconfident       Transformation Studio is not confident that this
                         document is correctly classified.
Red    Misclassified     Transformation Studio believes this document is
                         incorrectly classified.
Blue   Confirmed         The document type has been manually confirmed.

a Look at the currently displayed document using the thumbnails and
Image Viewer and decide whether it has the correct document type.

Note The message above the document (and the color coding in the title
bar) indicates how confident Transformation Studio is about the
document.

Note Click a thumbnail to display that page in the Image Viewer

b Confirm or remove the document type:
• Click Yes (or press ENTER or Y) to confirm the document type is
  correct
• Click No (or press N) to remove the document type

Tips Some of the documents in the Mortgage Applications set are
misclassified or incorrect.

Tax Escrow Approximately half of the documents in the Tax Escrow type are
actually Initial Escrow. Make sure you do not confirm these Initial Escrow
documents: when you see an Initial Escrow, click No to the question “Is the
displayed document a ‘Tax Escrow’?”.

Request for Tax Form These documents are all 2 pages long. If you see a 4
page document, right click on page 3 and select “Split Document” from the
context menu. You may see a document with a very skewed second page. If so,
select the second page and click the Display Text button at the top of the
Image Viewer. You will see that the page read is very poor. This document
should not be used for configuration and should be deleted from the
document set. Right click on the document and select “Delete document
from project” from the context menu.

Redemption Each of the unstructured letters is a Redemption document.
Two of these have second pages which may be seen in production.


Loan Application Some of the loan applications have lots of pages. To wrap
the pages so they all display in the thumbnail viewer without scrolling, click
the Wrap Pages button above the thumbnail viewer.

Note Documents without a type become “Unknown” and can be
automatically classified later.

Note As you confirm each document the status bar will update to show the
proportion of documents in each state within the type. When this bar is
completely blue and green (that is, all documents are confirmed or confident)
the type is clean and a message will display.

Note When working on the Cleanup Documents tab, you may wish to close
the Project Explorer and Document Types panels. If you have multiple
monitors, you may find it beneficial to drag the Image Viewer to a separate
monitor.

Extra Page Cleanup If the document shows a page highlighted in pink and
the question at the bottom of the tab is “Is the selected page an Extra Page?”,
Transformation Studio is displaying a document which it believes contains
one or more extra pages.


Figure 4-22. Cleaning Up Suspected Extra Pages

a Look at the currently marked page using the Thumbnail Viewer and the
Image Viewer and decide whether or not it is an extra page
b Confirm or clear the extra page mark:
• Click Yes (or press ENTER or Y) to confirm the page is extra to the
  document and will not occur in production
• Click No (or press N) to clear the suspected extra page

Tip The only extra page in the mortgage applications set is a separator sheet
containing the text “NEW DOCUMENT”. All other suspected extra pages
may occur in production.

Continue working in Cleanup Documents until a message displays telling
you that no further work is required.
4 Having cleaned up the documents, select the Overview tab.


Each document type in the graph (except Unknown Documents) will be
green as it will contain only confirmed and confident documents.
5 Review the number of documents in each type.
You need at least 100 documents of each type (except the Header) in order to
create a configuration. You will not have enough documents in the Tax
Escrow (as the Initial Escrow documents were mixed into this type but were
reclassified as Unknown during cleanup).
6 Classify the Unknown documents. The majority of these documents are
Initial Escrows that were originally imported with the Tax Escrow type.
a Double-click the bar for Unknown documents in the Overview chart.
Browse Documents will open with just the Unknown documents visible.
b Browse through the thumbnails until you find an Initial Escrow
document
c Right click on the document to display the context menu
d Select “Change Document Type…” to display the Change Document
Type window
e In the Document Type box, enter “Initial Escrow”
f Click OK
g Click OK to the message box.
The document will now have the type “Initial Escrow” assigned.
h Find the next Initial Escrow document
i Right click on the document and select “Change Document Type…” from
the context menu
j Select “Initial Escrow” from the list of document types
k Click OK
l Assign the “Initial Escrow” type to three more documents, including a
two page Initial Escrow
m In the list of documents, select all the documents with the type “Initial
Escrow” (hold down CTRL and use the mouse to select multiple
documents at once)
n Right click on the selection to display the context menu
o Select Confirm Document Type
p On the Document Types panel in the bottom left of Transformation
Studio’s view, select (Unknown)


q Right-click on the selection to display the context menu
r Select Auto Classify documents…
Transformation Studio will process the Unknown documents using
information it has learned about the document types in this set. A
window will display the results of the automatic classification, and you
should see that most of the documents were classified.
s Click Close
7 Having classified the Unknown documents, select the Cleanup Documents
tab to repeat the cleanup process.
Only a few documents should need to be confirmed.
8 Select the Overview tab and check the number of documents in each type.
Tax Escrow and Initial Escrow are both likely to have fewer than 100
documents each (ignore the Header as this can be configured using
just 10 documents).
The Initial Escrow and Tax Escrow documents were all originally part of the
Tax Escrow type.
9 Check whether there are more documents of this type in the project that are
not currently being used. If there are, move them into the Sample Documents set
so they can be used.
a Double-click the Unused Documents set on the Project Explorer panel.
The Overview chart and the Document Types panel will be updated to
show the composition of this set. There are documents in the Tax Escrow
type that are not currently being used.
b Right click on Tax Escrow in the Document Types panel to display the
context menu
c Select “Move documents to another document set…” to open the Move
Documents to Document Set window
d From the Move documents to list, select Sample Documents
e Select the third Move option, “Maximum number of selected documents
per document type”
f Click OK to move 100 documents into the Sample Documents set
10 Double-click Sample Documents on the Project Explorer panel.
The Overview chart is no longer color coded as not all the documents in the
set have been read.


11 Read the new pages using the Read Page Content tool.
a Select Tools | Read Page Content.
On the Read Page Content tab, Sample Documents will be selected in the
Document Set list, the “Read only pages that are missing content” option
will be selected in Page Options and the “Use default read parameters”
option will be selected in Read Parameters.
b Click Read
c When the Stop button is renamed to Finish, click Finish
12 The Overview will now display Tax Escrow as orange. This is because the
Tax Escrow includes some (misclassified) Initial Escrow documents. Use
Auto Classify to try to reclassify these documents.
a On the Document Types panel, right click on Tax Escrow
b Select Auto Classify documents… from the context menu
c Click Close
13 Having added more documents to the set, run cleanup again.
a Select the Cleanup Documents tab
b Follow the instructions, confirming documents until a message states that
there is no more work to do
c Select the Overview tab
14 Review the chart.
There should now be at least 100 documents for each document type (except
for Header) and each bar should be green.

Note It is possible to review the documents that have been automatically classified,
using Browse Documents. For more information refer to the ADR Help.
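
The final check in this procedure (at least 100 documents for each type, with a lower requirement for Header) can be sketched as a shortfall report. The helper name `shortfalls` and the counts used in the usage lines are hypothetical; the thresholds of 100 and 10 come from this guide.

```python
from collections import Counter


def shortfalls(doc_types, minimum=100, overrides=None):
    """Return {document type: number of documents still needed}.

    overrides maps a type to its own minimum, for example {"Header": 10}
    for a type that will use templated (barcode) classification.
    """
    overrides = overrides or {}
    missing = {}
    for doc_type, count in Counter(doc_types).items():
        required = overrides.get(doc_type, minimum)
        if count < required:
            missing[doc_type] = required - count
    return missing


# Hypothetical counts for illustration only.
types = ["Tax Escrow"] * 60 + ["Initial Escrow"] * 45 + ["Header"] * 12
needed = shortfalls(types, overrides={"Header": 10})
```

An empty result means every type meets its minimum; any entry indicates a type for which more documents should be classified, moved from Unused Documents or imported.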

Step 8: Select Documents for Testing

The Test Documents set is used to store a subset of the clean documents for use in
testing. These are not used during the configuration process and therefore form an
unseen set of documents to use for testing. As the test documents have been cleaned
up, a comparison between the data in the project and the results of running the
configuration on the documents will provide an accurate indication of performance.


The Test Documents set is populated by moving documents from the Sample
Documents set.

Guidelines for Selecting Test Documents


When selecting test documents you can specify the percentage of documents to move
from the Sample Documents set. You can also specify whether documents that have
had their type manually confirmed may be moved into the test set, or whether they
must remain in the Sample Documents set.
The following table shows guidelines for selecting test documents.
Table 4-4. Test Document Selection Guidelines

Method                                        Number of Documents  Keep Confirmed Documents
                                              in Test Set          in Sample Set
Page text classification                      30%                  Yes
Multiple page level classification methods
Document text classification                  90%                  Yes
Page image classification
Templated (including barcode) classification
Rules-based classification
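
The two selection options (the percentage of documents to move and whether confirmed documents stay in the sample set) can be illustrated with the sketch below. It assumes the percentage is applied per document type and that documents are chosen at random; neither detail is stated in this guide, and the function name is hypothetical.

```python
import random


def select_test_documents(sample, percent=30, keep_confirmed=True, seed=None):
    """Choose the ids to move from Sample Documents into Test Documents.

    sample is a list of (doc_id, doc_type, confirmed) tuples.
    """
    rng = random.Random(seed)
    by_type = {}
    for doc_id, doc_type, confirmed in sample:
        by_type.setdefault(doc_type, []).append((doc_id, confirmed))

    to_move = []
    for docs in by_type.values():
        quota = len(docs) * percent // 100  # assumed: percentage applies per type
        eligible = [doc_id for doc_id, confirmed in docs
                    if not (keep_confirmed and confirmed)]
        to_move += rng.sample(eligible, min(quota, len(eligible)))
    return to_move
```

With `keep_confirmed=True`, manually confirmed documents never leave the sample set, so the effort spent confirming them continues to benefit training.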

To select documents for testing


1 Select Document Sets | Select Test Documents… or click the corresponding
toolbar icon to display the Select Test Documents window

Figure 4-23. Select Test Documents window


2 Read the warning message; optionally click Show Warnings to see more
details
3 As the percentage of documents to move is already at 30%, click OK to move
the documents.
Once the documents have been successfully moved, a message will display.

Figure 4-24. Message After Selecting Test Documents

4 Click No to remain in the Sample Documents set rather than opening the
Test Documents set.
The number of documents in each set will be updated in Project Explorer.

Figure 4-25. Project Explorer after Selecting Test Documents

Create Recognition Configuration

Step 1: Create Configuration


A Recognition configuration is a set of files containing information that specifies how
Recognition processes documents for a specific solution. Once a configuration has
been created, these files - known as resources - are accessible from the Project
Explorer panel within Transformation Studio, and are stored in the following folder
on your computer:


<Project Location>\<Project Name>\Configurations\<Configuration Name>\Resources

Note By default, Project Location is My Documents\Transformation Studio Projects\
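
For scripting around a project (for example, to back up its resources), the folder pattern above can be assembled as shown in this sketch. The helper name and the example user profile path are assumptions; only the `Configurations\<Configuration Name>\Resources` layout comes from this guide.

```python
from pathlib import PureWindowsPath


def resources_folder(project_location, project_name, configuration_name):
    """Build <Project Location>\\<Project Name>\\Configurations\\<Configuration Name>\\Resources."""
    return (PureWindowsPath(project_location) / project_name
            / "Configurations" / configuration_name / "Resources")


# Hypothetical project location and names, for illustration only.
folder = resources_folder(
    r"C:\Users\me\My Documents\Transformation Studio Projects",
    "Mortgage Applications",
    "Document Classification",
)
```

`PureWindowsPath` only manipulates the path text, so the sketch can be run on any operating system without the folder having to exist.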

Each Recognition configuration configures one instance of the Recognition module.
Separate Recognition instances (and hence Recognition configurations) are used for:
• Page classification and separation
• Document classification
• Extraction
A Recognition configuration is always based on a configuration template. A template
is a set of resources that form the foundation of your configuration.

Note The resources created will vary depending on the type of template selected.

To create a document classification configuration


1 Select Configuration | Create Configuration... to display the New
Configuration window
2 Select “Document Classification”.
The Name box will automatically be updated with the default name
“Document Classification”.
3 Click Add.
The configuration will be added into the Configurations list in Project
Explorer.

Figure 4-26. Project Explorer with a Configuration


4 Click the + beside Document Classification to expand the configuration and
view the resources

Figure 4-27. Project Explorer showing Configuration Resources

Step 2: Configure Text Classification

Document Text Classifier


The classifier is created using the Build Document Text Classifier tab. Typically the
text classifier is trained on the documents in the Sample Documents set (after it has
been cleaned during document set management). Training options are selected
before the build process is started.

It is possible to specify whether training is restricted to documents that have been
confirmed, whether extra pages are trained on, and whether to further limit which
pages within a document are used in the training. Typically the first two options are
not selected (that is, all documents are used while training but not extra pages). The
pages to be used are limited only to save processing time (as the unused pages won't
need to be read in production), and only if the document type can robustly be
identified from a subset of pages.

X To build the document text classifier


1 Select Configuration | Build Document Text Classifier into... | Configuration
“Document Classification” to display the Build Document Text Classifier tab.

Sample Documents will already be selected in the “Training Document Set”
list and the document types within the set will be listed in the table.
2 Within the table, clear the “Include” check box for the Header document
type, so these documents are not used in training the classifier.
This document type will be accounted for later by configuring templated
(barcode) classification.
3 Click Build.
Once the classifier has been built, the Build button will be renamed to Finish.
4 Click Finish

Integrate Classifier
In production, Recognition runs a Recognition script, which in turn calls the
classifier. The Recognition script (called Document Classification.ifv) is created
automatically when the configuration is created. Two changes may be needed in this
script:
• The name of the classifier
• The pages to be used by the classifier (and therefore that need to be read)
The script will, by default, call a classifier called “Document text classifier.ibc”. This
is the default name of the classifier created using the Build Document Text Classifier
tab. If the name is left unchanged, no modification is needed to the script. For
information on changing the classifier name in the script, refer to the ADR Help.

The script will, by default, run the classification on all pages. For information on
changing the pages to be used, refer to the ADR Help.

X To integrate the classifier, no modifications to the script are required for this
tutorial.

Test Classification
You will test the configuration on the Test Documents set, that is, the documents that
were not used to build the classifier. These documents require exporting from
Transformation Studio so they can be loaded into the Recognition Test Tool. You will
need to export these documents in the correct file structure for testing document
classification (a multi-page image file for each document).

You will then assign the configuration to a project in Recognition Test Tool, where it
is run on the test documents.

Note Although all testing could be done once the configuration is finished, it is
recommended that testing is done as each classification method is implemented,
ensuring any issues are quickly found and fixed.

X To test the configuration


1 Export the Test Documents set from Transformation Studio.
a Select File | Export Documents to display the Export Documents tab
b Select Test Documents from the “Document Set” list
c Click Browse… and navigate to the following location:
My Documents\Transformation Studio Projects\Tutorial
d Create a new folder called “Exported Document Sets”
e In the new folder, create a new subfolder called “Test Documents
(Document Classification)”
f Select the folder Test Documents (Document Classification) and click
Open
g Make sure the “Create one image file for each document” option is
selected
h Clear the “Export text files” option (but leave “Export recognition output
files” selected)
i Click Export.
The Export button will be renamed to Abort.
Once the documents have been exported (along with a batch file
containing the document set structure), the Abort button will be renamed
to Finish.
j Click Finish to close the Export Documents tab
2 Create a project in Recognition Test Tool and run the test.
a Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Recognition Test Tool
b Select File | New Project... to display the New Project window

c On the Configuration tab, use the “Recognition Script File” button to
assign the template script file in your configuration:

My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document
Classification\Resources\Document Classification.ifv
d Select the Test Properties tab
e Select the “Display document tree after test” option
f Click OK
g Select Documents | Select Batch File… to open the Select Batch File
window
h Select the batch file you exported from Transformation Studio:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Document Classification)\All
Document Types.ibf
i Click Open
j Press F8 or click the Run Test button to test the configuration.
The batch file will not be altered during this process.
k Select the Summary tab to view the Test Documents set with document
types assigned.
The documents have been sorted by document type, and each set of
documents can be viewed by selecting the tab with a name corresponding
to their document type.
l Select File | Save Project and save the project as:
<Installation Path>\Test Projects\Document Classification.rtp
m Select File | Exit to close Recognition Test Tool

Step 3: Add in Additional Classification Methods


Multiple classification methods can be used together to ensure an accurate and
efficient configuration. For example, the Header document type would be classified
more robustly by reading the barcode than by reading all the text on the page (as
there is very little text and it varies significantly). When processing documents, there
are three classification methods that can be used:
• Text classification
• Templated classification (including barcodes)
• Rules-based classification

For more information on these methods, refer to Classification Methods or the ADR
Help.

Note Image classification is not available when processing documents (it can only be
used to classify individual pages).
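
One way to picture several classification methods working together is as a cascade in which each method is tried in turn until one returns a result. The Python sketch below is an illustration only: the order of the methods, the function names, the keyword lookup standing in for a trained text classifier, and the 0.9 confidence value are all assumptions, not the logic of the Recognition script. The 100% confidence for a templated match follows the behavior described for registered documents in this guide.

```python
from typing import Callable, List, Optional, Tuple

# A classification result: (document type, confidence between 0.0 and 1.0).
Result = Tuple[str, float]
Method = Callable[[dict], Optional[Result]]

def classify_by_template(document: dict) -> Optional[Result]:
    # Hypothetical stand-in: any readable barcode identifies a Header document.
    if document.get("barcode"):
        return ("Header", 1.0)
    return None

def classify_by_text(document: dict) -> Optional[Result]:
    # Hypothetical stand-in for a trained text classifier: a keyword lookup.
    if "appraisal" in document.get("text", "").lower():
        return ("Appraisal Report", 0.9)
    return None

def classify(document: dict, methods: List[Method]) -> Optional[Result]:
    # Try each method in order and keep the first result that is returned.
    for method in methods:
        result = method(document)
        if result is not None:
            return result
    return None

methods = [classify_by_template, classify_by_text]
print(classify({"barcode": "12345", "text": ""}, methods))
print(classify({"text": "Uniform Residential Appraisal Report"}, methods))
print(classify({"text": "unrecognized content"}, methods))
```

A document that no method recognizes is left unclassified (`None` in the sketch), which is the kind of case a review step would then need to resolve.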

Export Documents for Use in Other Tools


Templated and rules-based classification are configured in Definer. As with text
classification, the configuration is based on the Sample Documents. In order to use
these sample documents easily in Definer, they need to be exported from
Transformation Studio. You can export the whole Sample Documents set or, by
creating additional custom sets, just the samples for the document types that you
need to use in Definer.

X To export the Header documents in the Sample Documents set


1 In Transformation Studio, if the Sample Documents set isn't open, double-
click “Sample Documents” in Project Explorer
2 On the Document Types panel, right click on the type “Header”
3 From the context menu, select “Add documents to another document set…”

Figure 4-28. Add Documents to Document Set

Note By adding the documents rather than moving them, the documents
still exist in the Sample Documents set.

4 In the “Add documents to” box, enter “Sample Header Documents”


5 Click OK
6 Select File | Export Documents to display the Export Documents tab
7 Select “Sample Header Documents” from the “Document Set” list
8 Click Browse… and navigate to the following folder:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets
9 Create a new folder called “Sample Header Documents”
10 Select the folder Sample Header Documents and click Open
11 Make sure the “Export text files” option is clear (but leave “Export
recognition output files” selected).

Note The Exported File Structure options are equivalent for single-page
documents, so either could be selected.

12 Click Export.
The Export button will be renamed to Abort.
Once the documents have been exported (along with a batch file containing
the document set structure), the Abort button will be renamed to Finish.
13 Click Finish to close the Export Documents tab

Definition File for Templated Classification

Templated classification is configured in a definition file. The definition file stores
registration marks, fields and barcodes which are used in production to classify
documents of the associated document type. The definition file is created in Definer.

X To create a definition file to classify the Header type by barcode


1 Open Definer by clicking Start on the taskbar to display the menu, and
selecting All Programs | IBM FileNet Capture Professional | ADR Tools |
Definer

2 Select Image | Open Image… to open the Open Sample Image window
3 Select the first Header image in the location you exported the sample Header
documents to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Sample Header Documents\Header\
4 Click Open

5 On the toolbar, click the Barcode button


6 Draw a rectangle around the top barcode on the image by clicking and
dragging while holding the mouse button down.
Allow plenty of space around the barcode so it looks like this:

Figure 4-29. Barcode Field

7 On the Properties panel on the right, select the Name property and replace
the default value by entering “Barcode” for the field name
8 Select File | Save Definition to open the Save As window
9 Navigate to the location of your Recognition configuration:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Classification\Resources
10 Enter “Header” as the file name.

Note The name of the definition file will be the document type assigned if
classification is successful.

11 Click Save
12 Select Tools | Test Definition to open Test Mode. If prompted, click Yes to
save the configuration file each time you run a test.

13 Click Add Image To List to display the Select Images window

14 Select the other Header images in the location you exported the sample
Header documents to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Sample Header Documents\Header\
15 Click Open
16 Click Select All to select all the documents in the list
17 Click Process Document so you can check the result of each document
during the test
18 Click Auto Exit at End of Test so it is not selected
19 Click Run to start the test
20 Check that the field shows the message “Barcode found at …” and that the
data matches the value above the barcode
21 Click Run Step to test the next document
22 Repeat the last two steps until the Run Step button is disabled
23 Click Close to exit the Test Mode window.
The barcode should have been found on every document. If needed, resize
the field and retest until all the barcodes are found.
24 In the main Definer view, select the Definition File tab below the image
25 Press Enter to make space for a new line of code after the line CORRECT NEVER
and before the line END
26 To register the document on finding a barcode, enter the following lines:

REGREGEXP .+
REGFORMID 1 -1

The complete field will then be:

BEGIN FIELD
COORDS 427 489 1429 652
FORMID 1
NAME Barcode
TYPE CODE39
CORRECT NEVER
REGREGEXP .+
REGFORMID 1 -1
END

When Recognition runs, a successfully registered document is classified with
100% confidence, and assigned the document type given by the name of the
definition file.

Note For information on the two parameters used, press F1 to open the ADR
Help.

27 Select File | Save Definition


28 Select File | Exit to close Definer
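
The pattern `.+` entered on the REGREGEXP line matches any value of at least one character. Assuming REGREGEXP follows conventional regular-expression semantics (an assumption; the ADR Help is the authority on the parameter), the effect can be sketched in Python: any non-empty barcode value registers the document, while an empty read does not. The function name and the sample value are hypothetical.

```python
import re

# Same pattern as on the REGREGEXP line of the Barcode field.
REG_PATTERN = re.compile(r".+")

def would_register(barcode_value: str) -> bool:
    # True when the whole value read from the field matches the pattern.
    return REG_PATTERN.fullmatch(barcode_value) is not None

print(would_register("HEADER0001"))  # hypothetical barcode value
print(would_register(""))            # nothing was read from the field
```

A stricter pattern would restrict registration to barcodes in a particular format, at the cost of rejecting any Header sheet whose barcode differs.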

Integrate Definition File


As with the text classifier, the definition file is called by the Recognition script in
production. The script will not call a definition file by default, but this can easily be
modified. The name of the definition file must also be updated.

X To integrate the definition file into the script


1 In Windows Explorer, navigate to your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Classification\Resources
2 Double-click the file Document Classification.ifv to open it in Script Editor
3 Turn on classification by definition file by changing the following line at the
top of the script:
Const CLASSIFY_BY_DEFINITION_FILE = False

To:
Const CLASSIFY_BY_DEFINITION_FILE = True

4 Set the name of the definition file by changing the following line at the top of
the script:
Const DEFINITION_FILE_FILENAME = "Template.idf"

To:
Const DEFINITION_FILE_FILENAME = "Header.idf"

5 Select File | Save File


6 Select File | Exit
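
The two edits above amount to changing the right-hand side of two `Const` lines. The Python sketch below shows the same change as a text substitution, which can be useful for checking a script's settings outside Script Editor. It assumes the script can be read as plain text and that each constant is defined exactly once; the helper name `set_constant` is hypothetical, and the sketch works on an in-memory string rather than on the actual .ifv file.

```python
import re

def set_constant(script_text: str, name: str, value: str) -> str:
    # Replace the right-hand side of a single "Const NAME = value" line.
    pattern = re.compile(
        rf"^([ \t]*Const[ \t]+{re.escape(name)}[ \t]*=[ \t]*).*$",
        re.MULTILINE,
    )
    updated, count = pattern.subn(lambda m: m.group(1) + value, script_text)
    if count != 1:
        raise ValueError(f"expected exactly one definition of {name}, found {count}")
    return updated

# The two lines as they appear at the top of the script before editing.
original = (
    'Const CLASSIFY_BY_DEFINITION_FILE = False\n'
    'Const DEFINITION_FILE_FILENAME = "Template.idf"\n'
)

edited = set_constant(original, "CLASSIFY_BY_DEFINITION_FILE", "True")
edited = set_constant(edited, "DEFINITION_FILE_FILENAME", '"Header.idf"')
print(edited)
```

Raising an error when a constant is missing or duplicated avoids silently leaving the script unchanged.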

Test Classification

X To test document text and document templated classification together


1 Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Recognition Test Tool
2 Open the project you used to test document text classification.

Note You can open the project from the recent projects on the File menu. By
default it will be:

<Installation Path>\Test Projects\Document Classification.rtp

3 Click Run Test


4 Once the test has finished, select the Header tab
5 Select one of the documents and click Script Messages in the bottom left
panel.
If templated classification has run, the message will read “Document
<number> classified by template as Header”.
Do not close Recognition Test Tool.

Step 4: Test Performance

You have already tested the configuration when you added each classification
method. Those tests primarily checked that each classification method was called and
no errors occurred when running a test. In this step, you will analyze the
performance of the classification in detail, checking that the classification methods
implemented are picking up the documents as you expect and looking to see where
configuration could be improved (for example, by adding in a new classification
method).

X To analyze the test results

1 In Recognition Test Tool, having run a test, click Analyse Results


2 Select the Document Classification tab to display a table of the percentage of
documents that have been confidently classified into each type
3 Click the % button so it is no longer selected.

This displays the number of documents classified as each type rather than
the percentage
4 Check that the number of documents for each type matches the number of
documents in each type in the Test Documents set in Transformation Studio.
a Open the Tutorial project in Transformation Studio
b Double-click Test Documents to open the set
c Check the number of examples of each document type on the Document
Types panel and compare with the values in the Results Analysis table in
Recognition Test Tool

Note The numbers may not be exactly the same, but the closer they are, the
more effective the configuration.

5 Select File | Exit to close the Results Analysis window


6 Select File | Exit to close Recognition Test Tool
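
The comparison in step 4 can be sketched as a per-type difference between the counts in the Test Documents set and the counts in the Results Analysis table. The Python example below is an illustration only; the function name and all of the counts are hypothetical and would be replaced by the values read from the two tools.

```python
def compare_counts(expected: dict, classified: dict) -> dict:
    # Difference per document type: classified count minus expected count.
    types = sorted(set(expected) | set(classified))
    return {t: classified.get(t, 0) - expected.get(t, 0) for t in types}

# Hypothetical counts, for illustration only.
expected = {"Header": 10, "Loan Application": 12, "Tax Escrow": 8}
classified = {"Header": 10, "Loan Application": 11, "Tax Escrow": 8}

differences = compare_counts(expected, classified)
mismatches = {t: d for t, d in differences.items() if d != 0}
print(mismatches)
```

A negative difference means fewer documents were confidently classified as that type than exist in the test set, which points to the types where an additional classification method may help.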

Create Document Review Configuration

Step 1: Configure a Document Review Project File

In this step you will create and configure a Document Review project file using
Document Review Project Editor. In a document classification solution, Document
Review is used to ensure document types are correctly assigned.

X To create a Document Review project file


1 Open Document Review Project Editor by clicking Start on the taskbar to
display the menu, and selecting All Programs | IBM FileNet Capture
Professional | ADR Tools | Document Review Project Editor
2 Select File | New
3 Select File | Save to display the Save Project File window
4 Navigate to the following folder within your Transformation Studio project:
My Documents\Transformation Studio Projects\Tutorial\Configurations
5 Create a new folder named “Document Review”
6 Open the folder
7 Enter the file name “Review”

8 Click Save
9 Select the “Use the Document Classification view” option on the General
Options tab
10 Select the Types tab
11 Click in the top-left cell of the Document Types table
12 Enter the text “Appraisal Report” and press Enter to go to the next row
13 Enter the following types in the table:
• Funding Transmittal
• Header
• Initial Escrow
• Loan Application
• Redemption
• Request for Tax Form
• Tax Escrow
• Truth In Lending

Note It is important that the spelling and case of the document types are
exactly as written here, so that the types match those assigned in
Transformation Studio.

You will now specify validation rules stating that every document in the
batch must have a type from the list you just created, and that the type
must be assigned with confidence. If a rule is broken, a problem will
display in the Document Review module.
14 Select the Validation tab
15 Click Add… to display the Select Validation Rule window
16 Select the rule “Every document must have a type specified in the list”
17 Click OK
18 Click Add… again
19 Select the rule “Every document must have a confident type”
20 Click OK
21 Select the Review tab
22 In the Review Options panel, select the property “Automatically go to next
problem”

23 In the right column, click the arrow on the right to display the list and select
True
24 Select File | Save to save the project file
25 Select File | Exit to close Document Review Project Editor
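
The effect of the two validation rules added in this procedure can be sketched as a check over the documents in a batch. The Python example below is an illustration only: how Document Review decides that a type is "confident" is not described in this guide, so the numeric threshold, the function name, and the sample batch are assumptions. The type list is the one entered on the Types tab, and the comparison is case-sensitive to mirror the note about spelling and case.

```python
DOCUMENT_TYPES = {
    "Appraisal Report", "Funding Transmittal", "Header", "Initial Escrow",
    "Loan Application", "Redemption", "Request for Tax Form", "Tax Escrow",
    "Truth In Lending",
}

# Hypothetical threshold; the real definition of a confident type may differ.
CONFIDENCE_THRESHOLD = 0.8

def find_problems(documents):
    # documents: list of (document type, confidence) pairs, in batch order.
    problems = []
    for number, (doc_type, confidence) in enumerate(documents, start=1):
        if doc_type not in DOCUMENT_TYPES:
            problems.append((number, "type not in list"))
        elif confidence < CONFIDENCE_THRESHOLD:
            problems.append((number, "type not confident"))
    return problems

# Hypothetical batch: the second type differs only by case.
batch = [("Header", 1.0), ("header", 0.95), ("Tax Escrow", 0.4)]
print(find_problems(batch))
```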

Integrate the Configuration with Capture


Once the configuration has been created and tested, it needs to be assigned to a
settings collection in Capture Professional. Once a batch template and capture path
have been created, a batch can then be processed – the final test of the configuration.

This section summarizes the steps for configuring a capture path, settings collection
and template for an ADR solution running on a single computer. For full details refer
to the Capture documentation.

Step 1: Create Capture Path


For this tutorial, base the capture path on the example installed. For information on
creating a new capture path, refer to the Capture documentation.

X To create a capture path


1 Open Capture Professional by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
Capture Professional 5.2
2 Select File | New | Capture Path…
3 Following the wizard, create a capture path in the “ADR Examples”
repository with the name “My Mortgage Applications”, based on the
existing definition “Mortgage Applications”
4 Click Next to view the selected components:
• File Import
• Zonal OCR
• Event Activator
• ADR Recognition (Classification and Separation)
• ADR Document Review
• ADR Recognition
• ADR Completion
• ADR Scripted Export

5 Click Next to view the capture path layout


6 Click Next
7 Click Finish to close the wizard

Step 2: Create Settings Collection


For this tutorial, base the settings collection on the example installed. For information
on creating a new settings collection, refer to the Capture documentation.

X To create a settings collection


1 Select File | New | Settings…
2 Following the wizard, create a settings collection in the “ADR Examples”
repository with the name “My Mortgage Applications Settings”, based on
the existing Settings Collection “Mortgage Applications Settings”
3 Click Next.
The “DocClass” box should contain the value “Indexless”. This indicates that
the settings collection for the batch has a document class with no index
fields. This is correct for a configuration with multiple document types. For
more information, refer to the ADR Help.
4 Click Next to display the components (but don't configure them here)
5 Click Next
6 Click Finish to close the wizard

Step 3: Assign Configuration to Settings Collection


As the settings collection is based on the example settings collection, most of the
settings are already configured. In this step, you will review the existing settings on
the Capture components and replace the installed configuration files for the ADR
modules with those you just created.

X To configure the settings collection


1 Copy the sample image from the original settings collection to the new
settings collection.

Note The sample image would normally be moved from an existing batch.
For more information refer to the Capture documentation.

a Select the original settings collection “Mortgage Applications Settings”


b Expand the settings collection by clicking the + to the left of the settings
name
c Right click on the file 000.tif
d Select Copy from the context menu

Note To do this, the “Allow Cut/Copy/Paste/Delete/Drag/Drop”
permission must be enabled. See the Capture documentation for details.

e Select the new settings collection “My Mortgage Applications Settings”


f Right click on the settings name
g Select Paste from the context menu
2 Select the new settings collection in the repository tree view
3 Review the existing settings.
a Select Tools | Configure | File Import… to view the existing import
settings:
• “File Specification” is “*.tif”
• “Path” is <Installation Path>\Examples\Mortgage
Applications\Images\Document Images
b Click OK
c Select Tools | Configure | Zonal OCR… to display the existing settings
d Click the OCR Zones tab to see the Separator zone
e Click OK
f Select Tools | Configure | Event Activator… to display the existing
settings:
• Document Separator (Delete Page after) =>
UserDefined.Separator equals DOCUMENT
If the text detected by Zonal OCR is equal to DOCUMENT, the page is a
separator sheet and a new document boundary is created.
g Click OK
4 Configure ADR modules to use the configuration you created during the
tutorial.
a Select Tools | Configure | ADR Recognition | Classification and
Separation… to open the setup dialog for this instance of Recognition

b In the “Recognition Script File” section, click Clear to remove the existing
script
c Click Select Script File…
d Browse to your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Classification\Resources\
e Select the script file Document Classification.ifv
f Click Open
g Click OK
h Select Tools | Configure | ADR Document Review… to open the
Document Review setup dialog
i Click Select…
j Browse to your Document Review configuration folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Review
k Select the project file Review.drp
l Click Open
m Click OK
n Select Tools | Configure | ADR Recognition | ADR Recognition… to
open the setup dialog for this instance of Recognition.
The configuration for the standard instance of Recognition is installed
with the example, but does not require any modification. For more
information on configuring conditional extraction, refer to the ADR Help.
o Click OK
p Select Tools | Configure | ADR Completion… to open the Completion
setup dialog.
The configuration for Completion is installed with the example, but does
not require any modification. For more information on creating templates
for multiple document types, refer to the ADR Help.
q Click OK
r Select Tools | Configure | ADR Scripted Export… to open the Scripted
Export setup dialog.

The configuration for Scripted Export is installed with the example, but
does not require any modification. For more information on writing ADR
data to index fields, refer to the ADR Help.
s Click OK
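
The Event Activator rule reviewed in step 3f (a page whose Separator zone reads DOCUMENT is a separator sheet that starts a new document and is then deleted) can be sketched as follows. This Python example only illustrates the rule's effect on a sequence of pages; the function name, the page names, and the way zone text is passed in are hypothetical and do not reflect how Capture represents batches.

```python
def split_on_separators(pages, separator_text="DOCUMENT"):
    # pages: list of (page name, text read from the Separator zone), in scan order.
    documents = []
    start_new = True
    for name, zone_text in pages:
        if zone_text == separator_text:
            # Separator sheet: the next page starts a new document,
            # and the sheet itself is dropped ("Delete Page after").
            start_new = True
            continue
        if start_new:
            documents.append([])
            start_new = False
        documents[-1].append(name)
    return documents

# Hypothetical batch of five scanned pages, two of them separator sheets.
scanned = [
    ("001.tif", "DOCUMENT"),
    ("002.tif", ""),
    ("003.tif", ""),
    ("004.tif", "DOCUMENT"),
    ("005.tif", ""),
]
print(split_on_separators(scanned))
```

Because the comparison is an exact match, a separator sheet whose zone is misread would not create a boundary, and its pages would be merged into the preceding document.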

Step 4: Create Batch Template

X To create a template
1 Select File | New | Template…
2 Following the wizard, create a template in the “ADR Examples” repository
with the name “My Mortgage Applications Template”
3 Select the settings collection “My Mortgage Applications Settings”
4 Select the capture path “My Mortgage Applications”
5 Click Next and Finish to close the wizard

Step 5: Process a Batch


Now that the configuration is complete, process a batch by following the instructions
in Tutorial: Running the Capture Path but using the new template, settings collection
and capture path.

Page Classification and Separation Tutorial

Summary
In this section, you will modify the current solution to use automatic document
separation. Automatic document separation can save significant time and cost, for
example by removing the need for separator sheets.

Changes needed to make the current solution into a classification and separation
solution are:
• Change the Recognition configuration to run classification at page level (as it
will run before document boundaries are established) and then to call
separation.
• Create a page classification and separation configuration for Recognition
• Configure page classification methods
• Add in advanced document separation (which will use the page
classification results to determine the document boundaries and document
types)
• Change the Capture Path to remove Zonal OCR and Event Activator. These
are used in the Document Classification solution to establish document
boundaries prior to ADR.

Create Recognition Configuration

Step 1: Create Configuration


As for document classification, the first step is to create a configuration based on a
template.

X To create a page classification and separation configuration


1 Open Transformation Studio
2 Open your project
3 Select Configuration | Create Configuration... to display the New
Configuration window
4 Select “Page Classification and Separation”.
The Name box will automatically be updated with the default name “Page
Classification and Separation”.

5 Click Add.
The configuration will be added to the Configurations list on the Project
Explorer panel.

Step 2: Configure Text Classification

Build Page Text Classifier


The classifier is created on the Build Page Text Classifier tab, where training options
are selected before the build process is started. Typically the text classifier is trained
on the documents in the Sample Documents set (after it has been cleaned during
document set management).

It is possible to specify whether training is restricted to pages within documents that
have been confirmed, whether extra pages are trained on, and whether to further
limit which page types are trained within each document type. Any page types that
are not used for training page text classification need to be classified using an
alternative page classification method (for example image or templated).

Note In order for a page type to be trained successfully, at least 50 examples of that
page type are required. A warning will display in the table if there are fewer than 50
examples of a page type within the document set.
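
The 50-example requirement can be checked ahead of time with a simple count. The Python sketch below is an illustration only; the function name, the page type labels, and the example counts are hypothetical, and the check stands in for the warning shown in the table rather than reproducing it.

```python
from collections import Counter

MIN_EXAMPLES = 50  # minimum number of examples per page type, as stated in the note

def page_types_needing_examples(page_type_labels):
    # Count the examples of each page type and keep those below the minimum.
    counts = Counter(page_type_labels)
    return {page_type: n for page_type, n in counts.items() if n < MIN_EXAMPLES}

# Hypothetical labels for the pages in a training set.
labels = ["Loan Application page 1"] * 60 + ["Tax Escrow page 1"] * 12
print(page_types_needing_examples(labels))
```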

X To build a page text classifier


1 Select Configuration | Build Page Text Classifier into… | Configuration
“Page Classification and Separation” to display the Build Page Text Classifier
tab.
Sample Documents will be selected by default in the “Training Document
Set” list and the document types within the set will be listed in the table.

Note Some warning triangles may display. Hover the mouse over a specific
triangle to display the warning. The warnings in the tutorial are due to not
having enough examples of some page types. If these warnings were seen on
a project, you would need to ask the customer for more examples of these
page types.

2 Within the table, select “None” in the “Train Using” column for the Header
document type as this will be classified by templated (barcode) classification

Figure 4-30. Building the Page Text Classifier

3 Click Build.
Once the classifier has been built the Build button will be renamed to Finish
and the page text classifier will display on the Project Explorer panel.

Figure 4-31. Project Explorer Displaying the New Classifier

4 Click Finish

Integrate Classifier
As for the document classification solution, Recognition calls a Recognition script
which in turn calls the classifier. The Recognition script (called Page Classification
and Separation.ifv) is created automatically when the configuration is created. One
change may be needed in this script:
• The name of the classifier
The script will, by default, call a classifier named “Page text classifier.mod”. This is
the default name of the classifier created using the Build Page Text Classifier tab. If
the name is left unchanged, no modification is needed to the script. For information
on changing the classifier name in the script, refer to the ADR Help.

X To integrate the classifier, no modifications to the script are required for this
tutorial.

Test Classification
You will test the configuration on the Test Documents set, that is, the documents that
were not used to build the classifier. These documents require exporting from
Transformation Studio so they can be loaded into the Recognition Test Tool. You will
need to export these documents in the correct file structure for testing page
classification and separation (an image file for each page).

You will then assign the configuration to a project in Recognition Test Tool, where it
is run on the test documents.

Note Although all testing could be done once the configuration is finished, it is
recommended that testing is done as each classification method is implemented,
ensuring any issues are quickly found and fixed.

X To test the configuration


1 Export the Test Documents set from Transformation Studio.
a Select File | Export Documents to display the Export Documents tab
b Select Test Documents from the “Document Set” list
c Click Browse… and navigate to the following location:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets

d Create a new subfolder called “Test Documents (Page Classification and
Separation)”
e Select the folder Test Documents (Page Classification and Separation) and
click Open
f Make sure the “Create one image file for each page” option is selected
g Make sure the “Export recognition output files” option is selected and the
“Export text files” option is cleared
h Click Export.
The Export button will be renamed to Abort.
Once the documents have been exported (along with a batch file
containing the document set structure), the Abort button will be renamed
to Finish.
i Click Finish to close the Export Documents tab
2 Create a project in Recognition Test Tool and run the test.
a Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Recognition Test Tool
b Select File | New Project... to display the New Project window

c On the Configuration tab, use the “Recognition Script File” button to
assign the template script file in your configuration:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources\Page Classification and Separation.ifv
d Select the Test Properties tab
e Clear the “Display document tree after test” option
f Click OK
g Select Documents | Select Batch File… to open the Select Batch File
window
h Select the batch file you exported from Transformation Studio:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Page Classification and Separation)\All
Document Types.ibf
i Click Open
j Press F8 or click the Run Test button to test the configuration.

The batch file will not be altered during this process.


k Review the results by checking the page types assigned on the Summary
tab or reviewing the pages as they are sorted into tabs by page type.

Note More detailed analysis of performance will be done in Step 5: Test
and Evaluate Performance.

l Select File | Save Project and save the project in the following folder:
<Installation Path>\Test Projects\Page Classification and Separation.rtp
m Select File | Exit to close Recognition Test Tool

Step 3: Configure Document Separation

Build Advanced Document Separator


The separator is built using the Build Advanced Document Separator tab. It is based
on all the document types in the project (unlike the classifiers which are trained
using individual documents within a set). However, the documents are analyzed in
order to display the observed length of the documents for each type. This
information can then be used to set limits to restrict the number of pages allowed in
each document during separation.
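
The effect of a page limit can be illustrated with a toy example. The sketch below is not the ADR separation algorithm; it assumes a much simpler rule (a new document starts when the page type changes or when the limit for the current type is reached), and the type names "Letter" and "Form" are hypothetical.

```python
from typing import Dict, List


def split_pages(page_types: List[str], max_pages: Dict[str, int]) -> List[List[str]]:
    """Group a flat page sequence into documents (toy rule, not the ADR separator).

    A new document starts when the page type changes, or when the current
    document has already reached the page limit set for its type.
    """
    documents: List[List[str]] = []
    for page_type in page_types:
        current = documents[-1] if documents else None
        limit = max_pages.get(page_type)  # None means no limit is set
        same_type = current is not None and current[0] == page_type
        has_room = current is not None and (limit is None or len(current) < limit)
        if same_type and has_room:
            current.append(page_type)
        else:
            documents.append([page_type])
    return documents


# Hypothetical page stream: three "Letter" pages in a row with a limit of 2.
pages = ["Header", "Letter", "Letter", "Letter", "Form"]
limits = {"Header": 1, "Letter": 2, "Form": 1}
print(split_pages(pages, limits))
# [['Header'], ['Letter', 'Letter'], ['Letter'], ['Form']]
```

With a limit of 2, the third consecutive "Letter" page is forced into a new document instead of being appended to the previous one.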

X To build the advanced document separator


1 In your Transformation Studio project, select Configuration | Build
Advanced Document Separator into… | Configuration “Page Classification
and Separation” to display the Build Advanced Document Separator tab
2 For each document type with an Observed Document Length of 1 – 1 pages,
set the Maximum Page Limit to 1
3 For each document type with an Observed Document Length of 1 – 2 pages,
set the Maximum Page Limit to 2

Figure 4-32. Building the Advanced Document Separator

4 Click Build.
When the separator has been built, the Build button will be renamed to
Finish and the separator will display on the Project Explorer panel.

Figure 4-33. Project Explorer Displaying the New Separator

5 Click Finish
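
The rule applied in steps 2 and 3 can be sketched as follows. This is an illustration only: the page counts are invented, the type names other than "Header" are hypothetical, and the `cap` parameter is an assumption that mirrors the tutorial setting limits only for types observed at 1 or 2 pages.

```python
from typing import Dict, List, Tuple


def observed_lengths(doc_lengths: Dict[str, List[int]]) -> Dict[str, Tuple[int, int]]:
    """Return the (shortest, longest) page count seen for each document type."""
    return {doc_type: (min(lengths), max(lengths))
            for doc_type, lengths in doc_lengths.items() if lengths}


def page_limits(observed: Dict[str, Tuple[int, int]], cap: int = 2) -> Dict[str, int]:
    """Suggest a Maximum Page Limit equal to the observed maximum.

    Types whose observed maximum exceeds `cap` are left without a limit,
    as in the tutorial, where only 1 - 1 and 1 - 2 page types are limited.
    """
    return {doc_type: longest
            for doc_type, (_, longest) in observed.items() if longest <= cap}


# Invented page counts per document in a sample set.
lengths = {"Header": [1, 1, 1], "Letter": [1, 2, 1], "Statement": [2, 5, 3]}
observed = observed_lengths(lengths)
print(observed)               # {'Header': (1, 1), 'Letter': (1, 2), 'Statement': (2, 5)}
print(page_limits(observed))  # {'Header': 1, 'Letter': 2}
```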

Integrate Advanced Document Separator


In production, Recognition first calls the script file (which calls the page classification
methods) and then runs a separation project file, which in turn calls the separator.
The separation project file (called Separation.drp) is created automatically when the
configuration is created. Two changes may be needed to this project:
ƒ The name of the separator
ƒ Confidence thresholds that specify for each document type the score that a
document must reach in order for it to be confidently classified and separated

The project will, by default, call a separator named “Advanced document
separator.ads”. This is the default name of the separator created using the Build
Advanced Document Separator tab. If the name is left unchanged, no modification is
needed. For information on changing the separator name in the project file, refer to
the ADR Help.

Editing thresholds is not covered in this guide. For further information refer to the
ADR Help.

X To integrate the separator, no modifications to the separation project are
required for this tutorial.

Test Advanced Document Separator


The advanced document separator is tested in Recognition Test Tool, along with the
classification configuration.

1 Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Recognition Test Tool
2 Open the project used to test page text classification

Note You can open the project from the recent projects on the File menu. By
default it will be:

<Installation Path>\Test Projects\Page Classification and Separation.rtp

3 Select File | Project Properties… to open the Project Properties window

4 At the bottom of the Configuration tab, click the button next to the
Document Review Project File box
5 Select the separation project in your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources\Separation.drp
6 Click Open
7 Click OK
8 Click Run Test
9 Once the test has finished, select the Summary tab
10 If the tree view shows documents of various lengths, the separation has run.

Note More detailed analysis of performance will be done in Step 5: Test and
Evaluate Performance.

11 Select File | Save Project


12 Select File | Exit to close Recognition Test Tool

Step 4: Add in Additional Classification Methods


As with document classification solutions, multiple classification methods can be used
together to ensure an accurate and efficient configuration. When processing pages,
there are four classification methods that can be used:

ƒ Text classification
ƒ Image classification
ƒ Templated classification (including barcodes)
ƒ Rules-based classification

For more information on these methods, refer to Classification Methods or the ADR
Help.

Definition File for Templated Classification

The definition file that was used for classifying the Header document type in the
document classification solution can be used again. However, one modification is
needed: the definition file needs to set the page type rather than the document type
(the separator will determine the document type when it does the separation, and
will not instantly classify it from the barcode registration).

X To re-use the definition file to classify the Header type by barcode


1 In Windows Explorer, navigate to the following folder in your
Transformation Studio project:
My Documents\Transformation Studio Projects\Tutorial\Configurations\
2 Copy the file “Header.idf” from:
…\Document Classification\Resources
To
…\Page Classification and Separation\Resources
3 Rename the new file “Header_start.idf”
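
If you prefer to script the copy and rename, the steps above can be sketched as follows. The folder and file names are taken from this procedure; the function name and the `configurations` parameter (the path to the Configurations folder) are illustrative assumptions.

```python
import shutil
from pathlib import Path


def reuse_definition_file(configurations: Path) -> Path:
    """Copy Header.idf into the page classification configuration as Header_start.idf.

    `configurations` is the project's Configurations folder. The original
    file in the Document Classification configuration is left untouched.
    """
    source = configurations / "Document Classification" / "Resources" / "Header.idf"
    target = (configurations / "Page Classification and Separation"
              / "Resources" / "Header_start.idf")
    shutil.copyfile(source, target)  # raises FileNotFoundError if the source is missing
    return target
```

Call the function with the full path to the Configurations folder of your Tutorial project. Both Resources folders must already exist.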

Integrate Definition File


As with the document text classifier, the definition file is called by the Recognition
script when the Recognition module runs. The template script will not call a
definition file by default, so you need to integrate the file by:
ƒ Specifying that you are using templated classification
ƒ Specifying the name of your definition file

X To integrate the definition file into the script


1 In Windows Explorer, navigate to your configuration’s resources folder:

My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources
2 Double-click the file “Page Classification and Separation.ifv” to open it in
Script Editor
3 Turn on templated classification by changing the following line at the top of
the script:
Const CLASSIFY_BY_DEFINITION_FILE = False

To:
Const CLASSIFY_BY_DEFINITION_FILE = True

4 Specify the name of the definition file by changing the following line at the
top of the script:
Const DEFINITION_FILE_FILENAME = "Template.idf"

To:
Const DEFINITION_FILE_FILENAME = "Header_start.idf"

5 Select File | Save File


6 Select File | Exit
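
The two edits made in Script Editor amount to changing the value on two Const lines. As an illustration only, the sketch below applies the same change to the script text in Python. It assumes the script is plain text and uses only the two constant names shown above; the helper name is hypothetical.

```python
import re


def set_script_constant(script_text: str, name: str, value: str) -> str:
    """Replace the value on the first `Const NAME = ...` line; fail if it is absent."""
    pattern = re.compile(
        rf"^([ \t]*Const[ \t]+{re.escape(name)}[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    # A function replacement avoids any backslash handling in `value`.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, script_text, count=1)
    if count == 0:
        raise KeyError(f"constant {name} not found in script")
    return new_text


# The two lines as they appear in the template script before editing.
script = (
    'Const CLASSIFY_BY_DEFINITION_FILE = False\n'
    'Const DEFINITION_FILE_FILENAME = "Template.idf"\n'
)
script = set_script_constant(script, "CLASSIFY_BY_DEFINITION_FILE", "True")
script = set_script_constant(script, "DEFINITION_FILE_FILENAME", '"Header_start.idf"')
print(script)
```

Raising an error when a constant is missing guards against silently leaving templated classification switched off.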

Test Text and Template Classification


The configuration is again tested on the Test Documents set in Recognition Test Tool.

X To test the configuration


1 Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
ADR Tools | Recognition Test Tool
2 Open the project used to test page text classification and separation.

Note You can open the project from the recent projects on the File menu. By
default it will be:

<Installation Path>\Test Projects\Page Classification and Separation.rtp

3 Click Run Test


4 Once the test has finished, select the Header tab
5 Select one of the documents

6 Click Script Messages in the bottom left panel.
If the first message is “Running classification by template (definition file)…”,
page templated classification has run.

Note More detailed analysis of performance will be done in Step 5: Test and
Evaluate Performance.

7 Select File | Exit to close Recognition Test Tool

Build Page Image Classifier


The classifier is created on the Build Page Image Classifier tab, where training
options are selected before the build process is started. Typically the image classifier
is trained on the documents in the Sample Documents set (after the set has been
cleaned during document set management).

As with page text classification, it is possible to specify whether training is restricted
to pages within documents that have been confirmed, whether extra pages are
trained on, and whether to further limit which pages are trained within each
document type. Any page types that are not trained on using page image
classification need to be classified using an alternative page classification method (for
example text or templated).

X To build a page image classifier


1 Open your Transformation Studio project
2 Select Configuration | Build Page Image Classifier into… | Configuration
“Page Classification and Separation” to display the Build Page Image
Classifier tab.
Sample Documents will be selected by default in the “Training Document
Set” list and the document types within the set will be listed in the table.

Figure 4-34. Building the Page Image Classifier

Note Some warning triangles may display. Hover the mouse over a specific
triangle to display the warning. Two types of warning display in the tutorial:
one for too few examples of a page type and the other for too many. In a
project, if you have too few examples you need to go back to the customer to
ask for more. If you have too many examples, you need to look at the
variation within the page type. If there is a lot of variation, you would keep
all the pages or consider using a classification method other than image. If
there is only a little variation, you would remove some of the examples.

3 Click Build.
Once the classifier has been built, the Build button will be renamed to Finish
and the classifier will display on the Project Explorer panel.

Figure 4-35. Project Explorer Displaying the New Classifier

4 Click Finish

Integrate Classifier
The Recognition script now needs to be updated to call page image classification. The
template script will not call a page image classifier by default, so you need to
integrate the classifier by:
ƒ Specifying that you are using image classification
ƒ Specifying the name of your classifier file (not required for this tutorial as the
classifier you built has the default name)

X To call image classification


1 In Windows Explorer, navigate to your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources\
2 Double-click Page Classification and Separation.ifv to open the file in Script
Editor
3 Turn on image classification by changing the following line at the top of the
script:
Const CLASSIFY_BY_IMAGE = False

To:
Const CLASSIFY_BY_IMAGE = True

4 Select File | Save File

5 Select File | Exit to close Script Editor
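
Before testing, it can help to confirm which classification switches are turned on in the script. The sketch below is a hypothetical check, assuming the script is plain text and that each switch is a `Const CLASSIFY_BY_... = True|False` line like the two shown in this guide. Any other constant names in your script are simply reported as found.

```python
import re
from typing import Dict

# Matches lines such as: Const CLASSIFY_BY_IMAGE = True
FLAG_LINE = re.compile(
    r"^[ \t]*Const[ \t]+(CLASSIFY_BY_\w+)[ \t]*=[ \t]*(True|False)\b",
    re.MULTILINE | re.IGNORECASE,
)


def enabled_methods(script_text: str) -> Dict[str, bool]:
    """Map each CLASSIFY_BY_* constant found in the script to its Boolean value."""
    return {name: value.lower() == "true"
            for name, value in FLAG_LINE.findall(script_text)}


sample = (
    'Const CLASSIFY_BY_DEFINITION_FILE = True\n'
    'Const DEFINITION_FILE_FILENAME = "Header_start.idf"\n'
    'Const CLASSIFY_BY_IMAGE = False\n'
)
print(enabled_methods(sample))
# {'CLASSIFY_BY_DEFINITION_FILE': True, 'CLASSIFY_BY_IMAGE': False}
```

In this sample, image classification is still off, so step 3 above has not yet been applied.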

Test Classification
The configuration is again tested on the Test Documents set in Recognition Test Tool.

X To test the configuration


1 Open Recognition Test Tool
2 Open the project used to test page classification and separation.

Note You can open the project from the recent projects on the File menu. By
default it will be:

<Installation Path>\Test Projects\Page Classification and Separation.rtp

3 Click Run Test


4 Once the test has finished, select one of the document type tabs (but do not
select the Header tab)
5 Select one of the documents
6 Click Script Messages in the bottom left panel.
If the message “Running classification by image…” is displayed, page image
classification has run.
7 Select the Summary tab.
You can see from the lengths of the documents that the addition of page
image classification has improved separation.

Note More detailed analysis of performance will be done in Step 5: Test and
Evaluate Performance.

Do not close Recognition Test Tool.

Step 5: Test and Evaluate Performance

This step includes analyzing test results in Recognition Test Tool and more detailed
performance evaluation using the BatchCompare utility.

Testing in Recognition Test Tool

Testing has been done throughout the tutorial using the Recognition Test Tool. In
this step, the results of the test are analyzed in a little more detail, with the aim of
spotting any significant problems before going into the more detailed analysis using
BatchCompare.

X To analyze the test results

1 In Recognition Test Tool, after running a test, click the Analyse Results button


2 Select the Document Classification tab to display a table of the percentage of
documents that have been confidently classified into each type
3 Click the % button so it is no longer selected.
This displays the number of documents classified as each type rather than
the percentage
4 Check that the number of documents for each type matches the number of
documents in each type in the Test Documents set in Transformation Studio.
a Open the Tutorial project in Transformation Studio
b Double-click Test Documents to open the set
c Check the number of examples of each document type on the Document
Types panel and compare with the values in the Results Analysis table in
Recognition Test Tool

Note The numbers may not be exactly the same, but the closer they are, the
more effective the configuration.

5 Select File | Exit to close the Results Analysis window.


Do not close Recognition Test Tool.
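
The comparison in step 4 is a per-type count difference. A minimal sketch, assuming you have noted the expected counts from the Document Types panel and the assigned type of each document from the test; the type names and numbers are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def count_differences(expected: Dict[str, int], classified: Iterable[str]) -> Dict[str, int]:
    """Return classified-minus-expected counts per document type (0 means a match)."""
    actual = Counter(classified)
    types = sorted(set(expected) | set(actual))
    return {doc_type: actual.get(doc_type, 0) - expected.get(doc_type, 0)
            for doc_type in types}


expected = {"Header": 3, "Letter": 4}  # counts in the Test Documents set
classified = ["Header"] * 3 + ["Letter"] * 3  # types assigned by the test
print(count_differences(expected, classified))
# {'Header': 0, 'Letter': -1}
```

A negative value means fewer documents were classified as that type than the set contains; a positive value means more. Matching counts alone do not prove that each individual document was classified correctly.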

Evaluating and Improving Performance

During the evaluation process the classification and separation results are analyzed
to determine if there are incorrect classifications or splits/joins in documents, where
these are most commonly occurring and therefore which areas in the configuration
need to be improved.

The BatchCompare utility is used to compare two offline batches:

ƒ The first, a “reference batch”, is created during export of the Test Documents
from Transformation Studio. It contains accurate document types and
structure for all the Test Documents.
ƒ The second, a “comparison batch”, contains the same documents. However,
the document types and structure are exported from Recognition Test Tool,
after a test has been run using the new configuration.

Once the two batch files have been generated, they are compared using the
BatchCompare utility and the results of the comparison are output to an MS Excel
workbook.

Within Excel, additional statistics can be generated from the raw data using built-in
macros.
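
To illustrate the kind of comparison involved, the sketch below tallies reference types against assigned types for documents that line up one to one. It is not how BatchCompare works internally, and it ignores differences in document structure (splits and joins); the type names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def confusion_matrix(reference: List[str], comparison: List[str]) -> Dict[str, Dict[str, int]]:
    """Count, for each reference type (row), how often each type was assigned (column)."""
    if len(reference) != len(comparison):
        raise ValueError("both batches must contain the same number of documents")
    matrix: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for ref_type, assigned_type in zip(reference, comparison):
        matrix[ref_type][assigned_type] += 1
    return {row: dict(columns) for row, columns in matrix.items()}


reference = ["Header", "Letter", "Letter", "Form"]
comparison = ["Header", "Letter", "Form", "Form"]
print(confusion_matrix(reference, comparison))
# {'Header': {'Header': 1}, 'Letter': {'Letter': 1, 'Form': 1}, 'Form': {'Form': 1}}
```

Here one "Letter" document was assigned the type "Form", which shows up as an off-diagonal count.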

Statistics

It is very important to consider the complete set of statistics when analyzing the
performance, as no single value indicates good or bad performance. For example, if
the separation statistics show several missed or additional splits, the classification
statistics will not accurately represent the performance.

For information on all the statistics, refer to the BatchCompare Reference in the ADR
Help.

Once the separation statistics have been taken into account, the key statistics are the
accuracy and classification rate for each document type.
ƒ Accuracy should be as high as possible for each document type
ƒ Classification rate should be as high as possible for each document type

A compromise must be found between the two values, since:


ƒ In order to improve the classification rate the accuracy may drop
ƒ In order to improve the accuracy the classification rate may drop
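
The trade-off can be seen in a small worked example. The definitions used here are assumptions for illustration (refer to the BatchCompare Reference for the exact ones): classification rate is taken as the share of documents whose confidence reaches the threshold, and accuracy as the share of those confident documents that have the correct type. The results are invented.

```python
from typing import List, Tuple

# (true type, assigned type, confidence between 0 and 1)
Result = Tuple[str, str, float]


def rates(results: List[Result], threshold: float) -> Tuple[float, float]:
    """Return (accuracy, classification rate) at the given confidence threshold."""
    confident = [r for r in results if r[2] >= threshold]
    classification_rate = len(confident) / len(results) if results else 0.0
    correct = sum(1 for true_type, assigned, _ in confident if true_type == assigned)
    accuracy = correct / len(confident) if confident else 0.0
    return accuracy, classification_rate


sample = [
    ("Header", "Header", 0.95),
    ("Letter", "Letter", 0.90),
    ("Letter", "Form", 0.82),   # a wrong classification with middling confidence
    ("Form", "Form", 0.70),
]
for threshold in (0.80, 0.85):
    accuracy, rate = rates(sample, threshold)
    print(f"threshold {threshold:.2f}: accuracy {accuracy:.2f}, classification rate {rate:.2f}")
```

Raising the threshold from 0.80 to 0.85 excludes the wrong classification, so accuracy rises to 1.00 while the classification rate falls from 0.75 to 0.50.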

Note In order to use the BatchCompare utility you need to have Microsoft Office
Excel installed.

X To evaluate and improve performance


1 Export the comparison batch from Recognition Test Tool.
a Select Results | Export Results to Batch File…

In the new window, navigate to the location of the offline batch you
imported into Recognition Test Tool (the reference batch):
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Page Classification and Separation)
b For the File name, enter “All Document Types Automatic Results”
c Click Save
d Select File | Exit to close Recognition Test Tool
2 Use the BatchCompare utility.
a Open the command prompt by clicking Start on the taskbar to display the
menu, and selecting All Programs | Accessories | Command Prompt
b Enter the command:
cd <full path to folder containing offline batches>
To enter the folder path quickly, first navigate to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\
Then, drag and drop the Test Documents (Page Classification and
Separation) folder into the command prompt window.

Note The Excel workbook will be saved in the location from which
BatchCompare is run (the path specified at the command prompt).

c Enter the command:
BatchCompare -R <full path to reference batch> -C <full path to
comparison batch>
To enter the full paths quickly, navigate to the offline batches in Windows
Explorer and drag and drop the files into the command prompt window.
The offline batches are found in the following location:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Page Classification and Separation)
d Press Enter to run the batch comparison
e When the utility has finished, close the command prompt by clicking the
cross in the top right
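
The command in step c can also be assembled in a script, which avoids quoting problems with the spaces in the folder names. This is a sketch under assumptions: that BatchCompare can be found on the system path, that the R and C options are typed with plain hyphens, and that the working folder determines where the workbook is saved (as the Note above states). The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def batchcompare_command(reference: Path, comparison: Path) -> List[str]:
    """Build the argument list; list form keeps paths with spaces intact."""
    return ["BatchCompare", "-R", str(reference), "-C", str(comparison)]


def run_batchcompare(folder: Path, reference_name: str, comparison_name: str) -> int:
    """Run the comparison from `folder`, so the workbook is written there."""
    command = batchcompare_command(folder / reference_name, folder / comparison_name)
    return subprocess.run(command, cwd=folder).returncode


print(batchcompare_command(Path("reference.ibf"), Path("comparison.ibf")))
# ['BatchCompare', '-R', 'reference.ibf', '-C', 'comparison.ibf']
```

The file names in the printed example are placeholders; pass the names of the two offline batches in your exported document set folder.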
3 Use MS Excel to evaluate the comparison results
a In Windows Explorer, double-click on the following file to open it in
Microsoft Office Excel:

My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Page Classification and
Separation)\Unknown Batch Statistics.xls
This workbook contains raw comparison data and macros to generate
statistics from the data. For detailed information on the data, refer to the
ADR Help.
b On the Control Sheet, click Recalculate All

Note Macro security in Excel must be set to medium or low in order to
generate statistics. Select Tools | Options to access the security settings.

Note You will need to close and re-open the workbook for new security
settings to take effect.

4 Evaluate Performance.
a On the Summary Statistics worksheet, review the Separation Statistics
table.
Correct splits should be as high as possible with missed and additional
splits as low as possible.
b Review the Accuracy and Classification Rate statistics for the Per-
Document Results in the Classification Statistics table.
It is likely that Accuracy is close to 100% but Classification Rate is much
lower.
c Review the Confusion Matrix.
Ideally only the blue shaded cells will contain numbers. If there are any
numbers in the other (unshaded) cells, examples of the document type on
the left have been classified as the type in the column.
5 Improve accuracy.
If your accuracy is less than 90% or your confusion matrix has a lot of data in
the unshaded cells, refer to the following book of the ADR Help for
information on how to troubleshoot and improve performance:
How to configure… | Recognition | Step 4: Test | Page Classification and
Separation | Step 5: Evaluate and Improve Performance
6 If your accuracy Summary is at least 90%, but your classification rate
Summary is lower than 80%, determine your confidence thresholds by
modifying the Confidence Threshold values and recalculating the statistics.

a For the Per-Document Results, change the confidence threshold for each
document type in the table on the far right.
ƒ For any document type with a classification rate of less than 95%,
change its confidence threshold to 85%
ƒ For any document type with a classification rate of less than 60%,
change its confidence threshold to 80%

Figure 4-36. Thresholds Set

b On the Control Sheet, select “Use user-defined category thresholds”


c Click Recalculate All

Figure 4-37. New Statistics

d Check that the Accuracy has not dropped. If it has, increase the
Confidence Threshold in 1% increments until the accuracy improves or
you are happy with the compromise.
Once you have determined the optimum thresholds, you need to set them
on the separator.
e Leaving Excel open (so you can see the thresholds), navigate to your
configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources

f Double-click the separation project file Separation.drp to open it in
Document Review Project Editor
g Select the Document Separation tab
h Click Edit thresholds… next to the advanced document separator
i Change the thresholds so they match those determined in Excel

Figure 4-38. Edit Thresholds window in Project Editor

j Click Save Thresholds


k Click Close
l Select File | Save to save the separation project file
m Select File | Exit to close Document Review Project Editor
n Close Excel
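
The threshold rules in step 6 can be summarized in a short sketch. The starting values come from step 6a; the `accuracy_at` callable is a stand-in for clicking Recalculate All in Excel and reading the new accuracy, and the accuracy figures in the example are invented.

```python
from typing import Callable, Optional


def starting_threshold(classification_rate: float) -> Optional[int]:
    """Starting confidence threshold (%) for a document type, per step 6a."""
    if classification_rate < 60:
        return 80
    if classification_rate < 95:
        return 85
    return None  # rate is already high: leave the existing threshold unchanged


def raise_until_accurate(accuracy_at: Callable[[int], float], start: int,
                         target: float, stop: int = 100) -> int:
    """Increase the threshold in 1% steps until accuracy reaches `target`."""
    threshold = start
    while threshold < stop and accuracy_at(threshold) < target:
        threshold += 1
    return threshold


print(starting_threshold(55.0))   # 80
print(starting_threshold(72.0))   # 85

# Invented accuracies read back after each recalculation.
recalculated = {80: 88.0, 81: 89.5, 82: 91.0}
print(raise_until_accurate(lambda t: recalculated.get(t, 92.0), start=80, target=90.0))  # 82
```

In practice the stopping point is a judgment call: you may accept a slightly lower accuracy if the gain in classification rate is worth the compromise.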

Integrate the Configuration with Capture


This section summarizes the steps for configuring a capture path, settings collection,
and template for an ADR solution running on a single computer. For full details, refer
to the Capture documentation.

Step 1: Create Capture Path


For this tutorial, base the Capture Path on the example installed. For information on
creating a new capture path, refer to the Capture documentation.

X To create a capture path


1 Open Capture Professional by clicking Start on the taskbar to display the
menu, and selecting All Programs | IBM FileNet Capture Professional |
Capture Professional 5.2
2 Select File | New | Capture Path…
3 Following the wizard, create a capture path in the “ADR Examples”
repository with the name “My Mortgage Applications with Separation”,
based on the existing definition “Mortgage Applications with Separation”
4 Click Next to view the selected components:
ƒ File Import
ƒ ADR Recognition (Classification and Separation)
ƒ ADR Document Review
ƒ ADR Recognition
ƒ ADR Completion
ƒ ADR Scripted Export
5 Click Next to view the capture path layout
6 Click Next
7 Click Finish to close the wizard

Step 2: Create Settings Collection


For this tutorial, base the settings collection on the example installed. For information
on creating a new settings collection, refer to the Capture documentation.

X To create a settings collection


1 Select File | New | Settings…

2 Following the wizard, create a settings collection in the “ADR Examples”
repository with the name “My Mortgage Applications with Separation
Settings”, based on the existing Settings Collection “Mortgage Applications
with Separation Settings”
3 Click Next.
The “DocClass” box should contain the value “Indexless”. This indicates that
the settings collection for the batch has a document class with no index
fields. This is correct for a configuration with multiple document types. For
more information, refer to the ADR Help.
4 Click Next to display the components (but don't configure them here)
5 Click Next
6 Click Finish to close the wizard

Step 3: Assign Configuration to Settings Collection


As the settings collection is based on the example settings collection, most of the
settings are already configured. In this step, you will review the existing settings on
the Capture components and replace the configuration files for the ADR modules
with those you just created.

X To configure the settings collection


1 Select the new settings collection in the repository tree view
2 Review the existing settings
a Select Tools | Configure | File Import… to view the existing import
settings:
ƒ “File Specification” is “*.tif”
ƒ “Path” is <Installation Path>\Examples\Mortgage
Applications\Images\Page Images
b Click OK
3 Configure ADR modules to use the configuration you created during the
tutorial.
a Select Tools | Configure | ADR Recognition | Classification and
Separation… to open the setup dialog for this instance of Recognition
b In the “Recognition Script File” section, click Clear to remove the existing
script
c Click Select Script File…

d Browse to your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources\
e Select the script file Page Classification and Separation.ifv
f Click Open
g On the “Document Review Project File” panel, click Clear to remove the
existing separation project
h Click Select…
i Browse to your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Page Classification and
Separation\Resources\
j Select the project file Separation.drp
k Click Open
l Click OK
m Select Tools | Configure | ADR Document Review… to open the
Document Review setup dialog
n Click Select…
o Browse to your Document Review configuration folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Review
p Select the Document Review project file Review.drp
q Click Open
r Click OK
s Select Tools | Configure | ADR Recognition | ADR Recognition… to
open the setup dialog for this instance of Recognition.
The configuration for the standard instance of Recognition is installed
with the example, but does not require any modification. For more
information on configuring conditional extraction, refer to the ADR Help.
t Click OK
u Select Tools | Configure | ADR Completion… to open the Completion
setup dialog.

The configuration for Completion is installed with the example, but does
not require any modification. For more information on creating templates
for multiple document types, refer to the ADR Help.
v Click OK
w Select Tools | Configure | ADR Scripted Export… to open the Scripted
Export setup dialog.
The configuration for Scripted Export is installed with the example, but
does not require any modification. For more information on writing ADR
data to index fields, refer to the ADR Help.
x Click OK

Step 4: Create Batch Template

X To create a template
1 Select File | New | Template…
2 Following the wizard, create a template in the “ADR Examples” repository
with the name “My Mortgage Applications with Separation Template”
3 Select the settings collection “My Mortgage Applications with Separation
Settings”
4 Select the capture path “My Mortgage Applications with Separation”
5 Click Next and Finish to close the wizard

Step 5: Process a Batch


Now that the configuration is complete, process a batch by following the instructions
in Tutorial: Running the Capture Path but using the new template, settings collection
and capture path.
