WordStat 7
Provalis Research
This software and the disk on which it is contained are licensed to you, for your own use. This is copyrighted software
owned by Provalis Research. By purchasing this software, you are not obtaining title to the software or any copyright
rights. You may not sublicense, rent, lease, convey, modify, translate, convert to another programming language,
decompile, or disassemble the software for any purpose. You may make as many copies of this software as you need for
backup purposes. You may use this software on up to two computers, provided there is no chance it will be used
simultaneously on more than one computer. If you need to use the software on more than one computer simultaneously,
please contact us for information about site licenses.
WARRANTY
The WORDSTAT product is licensed "as is" without any warranty of merchantability or fitness for a particular purpose,
performance, or otherwise. All warranties are expressly disclaimed. By using the WORDSTAT product, you agree that
neither Provalis Research nor anyone else who has been involved in the creation, production, or delivery of this
software shall be liable to you or any third party for any use of (or inability to use) or performance of this product or for
any indirect, consequential, or incidental damages whatsoever, whether based on contract, tort, or otherwise even if we
are notified of such possibility in advance. (Some states do not allow the exclusion or limitation of incidental or
consequential damages, so the foregoing limitation may not apply to you). In no event shall Provalis Research's liability
for any damages ever exceed the price paid for the license to use the software, regardless of the form of claim. This
agreement shall be governed by the laws of the province of Quebec (Canada) and shall inure to the benefit of Provalis
Research and any successors, administrators, heirs, and assigns. Any action or proceeding brought by either party
against the other arising out of or related to this agreement shall be brought only in a PROVINCIAL or FEDERAL
COURT of competent jurisdiction located in Montréal, Québec. The parties hereby consent to in personam jurisdiction
of said courts.
COPYRIGHT
Copyright 1998-2015 Provalis Research. All rights reserved. No part of this publication may be reproduced or
distributed without the prior written permission of Provalis Research, 1255 University Avenue, Suite #1255, Montreal,
QC, CANADA, H3B 3W9.
TABLE OF CONTENTS
Introduction to WordStat ... 5
Program Capabilities ... 7
The Content Analysis & Categorization Process ... 11
A Quick Tour: Performing Your First Content Analysis ... 14
Common Tasks
  Creating and Maintaining Dictionaries ... 114
  Working with Rules ... 124
  Using Lexical Tools for Dictionary-Building ... 127
  Monitoring and Customizing Substitutions ... 135
  Configuring External Preprocessing Routines ... 139
  Viewing and Editing Text ... 142
  Displaying Keyword Distribution Using Bar Charts or Pie Charts ... 145
  Creating and Using Norm Files ... 148
  Performing Text Retrieval Using Keywords ... 150
  Creating Bubble Charts ... 155
  Using Heatmap Plots ... 158
  Performing Correspondence Analysis ... 162
  Editing the Case Descriptor ... 166
  Filtering Cases ... 167
    Expression Operators and Rules ... 170
    Supported xBase Functions ... 171
  Performing Analysis on Manually Entered Codes ... 177
  Computing Inter-rater Agreement Statistics ... 178
  Exporting Frequency Data ... 181
  Exporting Categorization Models ... 183
  Using the WordStat Document Classifier ... 184
  WordStat Software Developers Kit (SDK) ... 188
  Performing Multivariate Analysis on Words or Categories ... 189
  Managing Outputs with the Report Manager ... 191
References ... 198
Technical Support ... 199
Introduction to WordStat
WordStat is a text analysis module specifically designed to study textual information such as responses to
open-ended questions, interviews, titles, journal articles, public speeches, electronic communications, etc.
WordStat may be used for automatic categorization of text using a dictionary approach or various text
mining methods. WordStat can apply existing categorization dictionaries to a new text corpus. It also may
be used in the development and validation of new categorization dictionaries. When used in conjunction
with manual coding, this module can provide assistance for a more systematic application of coding rules,
help uncover differences in word usage between subgroups of individuals and assist in the revision of
existing coding using KWIC (Keyword-In-Context) tables.
WordStat includes numerous exploratory data analysis and graphical tools that may be used to explore the
relationship between the content of documents and information stored in categorical or numeric variables
such as the gender or the age of the respondent, year of publication, etc. Relationships among words or
categories as well as document similarity may be identified using hierarchical clustering and
multidimensional scaling analysis. Correspondence analysis and heatmap plots may be used to explore the
relationship between keywords and different groups of individuals.
WordStat is a module that must be run from either of the following base products:
SimStat - This statistical software provides a wide range of statistical procedures for the analysis of
quantitative data. It offers advanced data file management tools such as the ability to merge data files,
aggregate cases, perform complex computation of new variables and transformation of existing ones.
When used with SimStat, WordStat can analyze textual information stored in any alphanumeric, plain
text and rich text memo variable (or field). It includes various tools to explore the relationship
between any numeric variable of a data file and the content of alphanumeric ones. Its close integration
with SimStat facilitates further quantitative analysis on numerical results obtained from the content
analysis (e.g., factor analysis or correspondence analysis on keyword frequencies, multiple regression,
etc.).
QDA Miner - This text management and qualitative analysis program allows one to create and edit
data files, import documents, and perform manual coding of those documents. Several analysis tools
are also available to look at the frequency of manually assigned codes and the relationship between
those codes and other categorical or numeric variables. When used with QDA Miner, WordStat can
perform content analysis on whole documents or selected segments of those documents tagged with
specific user defined codes.
The WordStat module may be accessed in both of these programs from the CONTENT ANALYSIS command
in the ANALYSIS menu.
A few additional utility programs are also included with WordStat that may be run as standalone
applications or be accessed directly through WordStat:
Report Manager - This application has been designed to store, edit and organize documents, notes,
quotes, tables of results, graphics and images created by QDA Miner and WordStat or imported from
other applications.
Document Conversion Wizard - This utility program provides an easy way to import numerous
documents and create a project file. It can also be used to split large files into smaller units and to
extract various numeric and alphanumeric data from structured documents.
KEYWORD-IN-CONTEXT
Ability to display a Keyword-In-Context (KWIC) table of any included, leftover or user defined
word, word pattern or phrase.
KWIC tables may be sorted in ascending order of case number, words with context, or on values of
independent variables.
Ability to jump from a specific occurrence in the KWIC table to the original text variable in order to
view or edit the selected word.
KWIC tables may be saved in data files for further processing.
Customizable KWIC and report function to display all hits as lists of paragraphs, sentences or user
defined segments.
4- EXCLUSION PROCESS
An exclusion process may be applied to remove words that you do not want to be included in the
content analysis. This process requires the specification of an exclusion list. Such a process is used
mainly to remove words with little semantic value such as pronouns, conjunctions, etc., but may
also be used to remove some words or phrases used too frequently or with little discriminative
value.
The categorization process allows one to change specific words, word patterns or phrases to
other words, keywords or content categories and/or to extract a list of specific words or codes.
This process requires the specification of an inclusion dictionary. This dictionary may be used
to remove variant forms of a word in order to treat all of them as a single word. It may also be
used as a thesaurus to perform automatic coding of words into categories or concepts. For
example, words such as "good", "excellent" or "satisfied" may all be coded as instances of a
single category named "positive evaluation", while words like "bad", "unsatisfied" or
expressions like "not satisfied" may be categorized as "negative evaluation".
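The exclusion and categorization processes described above can be illustrated with a minimal Python sketch. It is not WordStat's implementation: the `EXCLUSION` set, the `DICTIONARY` mapping and the `categorize` function are hypothetical names introduced for illustration, and the sketch assumes phrases of at most two words.

```python
import re
from collections import Counter

# Hypothetical exclusion list and inclusion dictionary (illustration only).
EXCLUSION = {"THE", "WAS", "I", "AM", "AND", "WITH", "BUT"}
DICTIONARY = {
    "GOOD": "POSITIVE EVALUATION",
    "EXCELLENT": "POSITIVE EVALUATION",
    "SATISFIED": "POSITIVE EVALUATION",
    "BAD": "NEGATIVE EVALUATION",
    "UNSATISFIED": "NEGATIVE EVALUATION",
    "NOT SATISFIED": "NEGATIVE EVALUATION",
}

def categorize(text):
    """Count category hits, checking two-word phrases before single words."""
    words = re.findall(r"[A-Za-z]+", text.upper())
    counts = Counter()
    i = 0
    while i < len(words):
        # Try the two-word phrase first so "NOT SATISFIED" wins over "SATISFIED".
        if i + 1 < len(words):
            phrase = words[i] + " " + words[i + 1]
            if phrase in DICTIONARY:
                counts[DICTIONARY[phrase]] += 1
                i += 2
                continue
        word = words[i]
        # Excluded words are dropped; remaining words are coded if listed.
        if word not in EXCLUSION and word in DICTIONARY:
            counts[DICTIONARY[word]] += 1
        i += 1
    return counts
```

Calling `categorize` on a sentence containing "excellent", "good" and "not satisfied" yields two hits for "POSITIVE EVALUATION" and one for "NEGATIVE EVALUATION", because the phrase is matched before the single word "satisfied".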
[Figure: Automatic categorization of texts]
Click the OPEN AN EXISTING PROJECT button and select the SEEKING.WPJ file located in the
default folder.
If you closed or disabled this introductory dialog box, then from the main screen select the OPEN
PROJECT command from the FILE menu and select the SEEKING.WPJ file located in the default
folder.
Move to Step #4
From SimStat
Step #1 - Open the data file
From within SimStat, select the FILE | DATA | OPEN command sequence and select the
SEEKING.DBF file.
Step #7 - Examining the relationship between included categories and the gender of the author
Click the sixth tab (Crosstab).
Click the WITH drop-down list box and select GENDER to display a contingency table of category
frequencies by gender.
The TABULATE option allows one to choose whether the table should be based on the total
frequency of included words or on the total number of cases containing those words.
The SORT BY option allows one to sort the table on the word or category name (alphabetical order)
or by descending order of keyword frequency. You may also click any column header to sort the grid
in ascending or descending order of the values found in this column.
The DISPLAY option allows one to specify the information displayed:
Count
Row percent
Column percent
Total percent
Category percent (for case occurrences)
Percent of total words (for keyword frequency)
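The first four DISPLAY statistics can be computed from a contingency table as in the following Python sketch. The function name `crosstab_percentages` and the table layout (keyword mapped to per-group counts) are assumptions made for illustration. The last two statistics are omitted because they require the total number of cases or words, which is not part of the table itself.

```python
def pct(part, whole):
    """Return part as a percentage of whole, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0

def crosstab_percentages(table):
    """table maps keyword -> {group: count}; returns per-cell statistics."""
    groups = sorted({g for row in table.values() for g in row})
    row_totals = {k: sum(row.values()) for k, row in table.items()}
    col_totals = {g: sum(row.get(g, 0) for row in table.values()) for g in groups}
    grand_total = sum(row_totals.values())
    cells = {}
    for keyword, row in table.items():
        for group in groups:
            n = row.get(group, 0)
            cells[(keyword, group)] = {
                "count": n,
                "row_pct": pct(n, row_totals[keyword]),   # share of the keyword's row
                "col_pct": pct(n, col_totals[group]),     # share of the group's column
                "total_pct": pct(n, grand_total),         # share of the whole table
            }
    return cells
```

For a table such as `{"LOVE": {"F": 30, "M": 10}, "MONEY": {"F": 20, "M": 40}}`, the cell for LOVE and F has a row percent of 75, a column percent of 60 and a total percent of 30.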
Step #9 - Visualizing the relationship between categories and the age of the author.
Use the mouse to highlight cells of the categories you would like to compare.
Click the button or press the right button of the mouse and select the Chart Selected Rows menu
item.
Click the button to close this dialog box and return to WordStat main window.
Click the button to display the KWIC table for this word or category.
To sort the table on the case number, on the keyword along with the prior or subsequent words, or on
the sex of the respondent, use the SORT BY drop-down list box.
To display KWIC tables of any user defined word or word pattern, set the LIST option to "User
defined", enter your word pattern (with or without wildcards) in the WORD edit box and click the
button.
Click the button. Note: If this button is inactive, click the button to refresh the
Step #17 - Quitting the module and returning to QDA Miner or SimStat
Click the button in the upper left-hand corner of WordStat and select the EXIT command or
click the X mark in the upper right-hand corner.
2. An alternative approach would be to build a content analysis process that would take into account
the misspelling of words. To achieve this, one may use the Substitution feature to automatically
replace those misspelled words with their correct forms or add the most commonly misspelled
keywords into the content analysis dictionary.
Remove hyphenation
While WordStat can be configured to accept compound words with dashes, it cannot distinguish the dash in a
compound word from a hyphen inserted to break a word at the end of a line. As a consequence, a hyphenated
word will often be treated as two separate words. It is thus recommended to revise the text to ensure no
hyphenation is present.
The program displays a dialog box where one can specify the spreadsheet page and the range of cells where
the data are located. You must specify a valid range name or provide upper left and lower right cells,
separated by two periods (such as A1..H20). If you set the Range Name list box to ALL, the program
attempts to read the whole page.
While the first method can read textual data stored in word processor documents, the last three methods
require the data to be stored on disk in plain ASCII files without any formatting or typesetting code. Most
word processors offer an option to save a document as a plain text file. If you don't know how to create such
a text file, please refer to your word processor manual.
The resulting file consists of a SimStat data file with two variables: RECNO, a numeric variable containing
a sequential number going from 1 up to the total number of cases encountered in the input file, and TEXT, a
memo variable containing the textual data for this case.
Note: Importation of numerous text variables may be achieved by performing successive importations of
page delimited memo files and then using the APPEND VARIABLES command to merge the resulting files
into a single one. In order to achieve this, great care should be taken to give unique names to the various
TEXT variables and to ensure that the case sequence of the various text files is identical.
Dictionaries - This page allows one to adjust various text analysis processes, create and modify
dictionaries, exclusion and substitution lists, as well as add, remove and edit entries in those
dictionaries.
Options - This page contains various options controlling how the text data will be processed. It also
includes options affecting linguistic tools as well as program appearance.
Frequencies - This page displays a table of the frequency of keywords or content categories. One may
also get a list of leftover words, allowing one to modify the current categorization dictionary, the
exclusion list or the substitution list.
Extraction - This page allows one to perform topic modeling, find the most common phrases, extract
named entities, as well as misspellings and uncommon words, and assign them to the current
categorization dictionary, the exclusion list or the substitution list. One may also use the misspelling
page to batch-replace misspellings in the original documents.
Co-occurrences - This page allows one to explore connections between words, keywords, phrases or
content categories using hierarchical clustering, multidimensional scaling, link analysis and proximity
plots.
Phrase Finder - This page allows one to extract the most common phrases and idioms and to easily
add them to the current dictionaries. Co-occurrence and comparison techniques such as clustering,
multidimensional scaling and correspondence analysis are also available from this page.
Crosstab - This page allows one to compare keyword frequencies across values of numerical,
categorical or date variables. Along with a table of frequencies, several statistics and graphical
techniques may be applied, including correspondence analysis, heatmaps, bubble charts, bar charts and
line charts. The automatic document classification feature can also be accessed from this page.
Keyword-in-Context - This page allows one to display a concordance table of words, word patterns or
phrases, or of all items related to a content category. Such a table is very useful to validate a dictionary
by allowing one to examine in context how words are being used.
Classification - This page gives access to the automated text classification module that allows one to
apply a machine-learning approach to the existing textual database. Options allow one to develop a
classification model that can later be used to classify uncategorized documents into
predefined classes.
Two additional pull-down menus can be accessed to perform various tasks:
Clicking the button in the upper left-hand corner of the main window displays a menu that allows one
to leave WordStat and return to the calling application as well as to perform various tasks such as editing
The button in the upper right-hand corner provides access to this help file, which can also be accessed
at any time by pressing the F1 key. In addition, this menu allows you to check whether you are using the
latest version of WordStat and also gives access to specific important information and some useful links to
the Provalis Research website.
The following section provides a description of the four processing steps involved in the transformation of
textual data into keywords or content categories. Additional information about dictionary creation and
maintenance can be found on page 114.
STEP #1 - PREPROCESSING
The preprocessing option allows for the custom transformation of the text to be analyzed prior to, or in
place of the execution of the other three standard processes provided by WordStat: lemmatization,
exclusion and categorization. This transformation is accomplished by executing specially
designed external routines, supplied either as an EXE file or as a function in a DLL.
This feature is provided to offer greater flexibility by allowing any user with programming
skills or resources to customize the processing of textual information. For more information on this
feature see Configuring External Preprocessing Routines on page 139.
STEP #3 - SUBSTITUTION
The substitution process may be used to automatically replace specific words with other word forms. It
may be used to substitute common misspellings or perform lemmatization. One could also use this
process to perform a simple type of categorization where specific words are replaced with keywords.
WordStat provides four predefined substitution processes to perform lemmatization on documents in
English, French, Italian and Spanish. Lemmatization is a process by which various forms of words are
reduced to a more limited number of canonical forms. A typical example of lemmatization would be the
conversion of plurals to singulars and past tense verbs to present tense verbs. The lemmatization
algorithm implemented in WordStat is a dictionary-moderated method, partly inspired by Krovetz's
KSTEM suffix substitution algorithm. Since the lemmatization algorithm does not rely on prior
part-of-speech tagging of words, it is much faster than traditional lemmatization routines. It may
result in a few invalid word substitutions, but those errors usually have no major consequences on
the result of an analysis. However, it is important to remember that lemmatization, like stemming, may
decrease the measurement precision of some concepts or topics (please refer to Step #2 for more
information on this issue). WordStat offers a way to monitor all substitutions performed by this routine
and to override any by creating a list of custom substitutions. For more information on such a feature,
see Monitoring and Customizing Substitutions on page 135.
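The general idea of a dictionary-moderated suffix substitution, combined with a list of custom substitutions that overrides it, can be sketched in Python as follows. This is not WordStat's algorithm or Krovetz's KSTEM: the `LEXICON`, `SUFFIX_RULES` and `CUSTOM` tables are tiny hypothetical examples, and a real lemmatizer relies on a full language dictionary and a much larger rule set.

```python
# Hypothetical resources (illustration only).
LEXICON = {"STUDY", "WALK", "MAKE", "CAT", "BUS", "NEWS"}
SUFFIX_RULES = [
    ("IES", "Y"), ("ING", ""), ("ING", "E"),
    ("ED", ""), ("ED", "E"), ("ES", ""), ("S", ""),
]
CUSTOM = {"WENT": "GO"}  # user-defined substitutions take priority

def lemmatize(word):
    """Return a canonical form, accepting a suffix rule only if the
    resulting candidate is a known word in the lexicon."""
    word = word.upper()
    if word in CUSTOM:
        return CUSTOM[word]
    if word in LEXICON:
        return word          # already a canonical form, e.g. NEWS stays NEWS
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word              # no validated substitution: leave unchanged
```

The lexicon check is what moderates the substitution: "NEWS" is left alone rather than reduced to "NEW", and an unknown word such as "UNKNOWNS" is returned unchanged because no candidate form can be validated.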
In the above example, words like CANADA, USA, or MEXICO may be coded as either
NORTH-AMERICA or COUNTRY, depending on whether the categorization is performed up to the first or
second level of the dictionary (see Level of Analysis, page 30).
Wildcards such as *, ?, # and the square brackets [ and ] are supported. For example, the following
item under the SUPPORT category:
SUPPORT
  SUPPORT*
will change SUPPORT, SUPPORTS, SUPPORTING, SUPPORTIVE, SUPPORTER, etc. into a single
word SUPPORT, while the following word pattern:
SUPPORT
  *SUPPORT*
will also substitute all words containing the substring "SUPPORT", such as UNSUPPORTEDLY,
UNSUPPORTED, etc.
An expression that includes several words may also be substituted by joining the various words with
underscore characters. For example, you may replace the expression "going out" with the category
"NIGHTLIFE" by specifying the following item:
NIGHTLIFE
  GOING_OUT
You may also use wildcards in expressions such as:
NIGHTLIFE
  GO*_OUT
to substitute several forms of an expression at once.
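The wildcard and phrase items above can be approximated with regular expressions, as in this Python sketch. The exact semantics of each wildcard in WordStat are not spelled out in this section, so the sketch assumes that * matches any run of word characters, ? matches one word character, # matches one digit, square brackets denote a set of characters, and an underscore matches the whitespace between two words. The function names are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a dictionary item into a compiled regex (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            parts.append(r"\w*")          # any run of word characters
        elif ch == "?":
            parts.append(r"\w")           # exactly one word character
        elif ch == "#":
            parts.append(r"\d")           # exactly one digit (assumption)
        elif ch == "_":
            parts.append(r"\s+")          # gap between words of a phrase
        elif ch == "[":
            end = pattern.index("]", i)   # raises ValueError if unbalanced
            parts.append(pattern[i:end + 1])
            i = end
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def substitute(text, items):
    """Apply (pattern, category) pairs in order and return the recoded text."""
    for pattern, category in items:
        text = pattern_to_regex(pattern).sub(category, text)
    return text
```

With the items `("SUPPORT*", "SUPPORT")` and `("GO*_OUT", "NIGHTLIFE")`, the text "She supports going out" becomes "She SUPPORT NIGHTLIFE". The pattern `SUPPORT*` leaves "unsupported" untouched, whereas `*SUPPORT*` recodes it.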
Integer weights can also be assigned to specific items so that a specific word or word pattern may
count for more than one instance of the category. For example, in order to compute an
CATEGORIZATION SETTINGS
LEVEL OF ANALYSIS - This option allows one to specify up to which level the coding should be
performed. For example, in the following dictionary:
COUNTRY
  NORTH-AMERICA
    CANADA (1)
    UNITED-STATES (1)
    USA (1)
    MEXICO (1)
  SOUTH-AMERICA
    BRAZIL (1)
    CHILI (1)
if a level of 1 is specified, all words that are stored at a higher level than the root level will be
coded as the parent category at this first level. For example, words like CANADA and MEXICO
will be coded as COUNTRY along with other country names like BRAZIL. Setting the level of
analysis to a numeric value of 2 will result in the coding of those two words as
NORTH-AMERICA, while BRAZIL will be coded as SOUTH-AMERICA. Items stored at the same or
at a lower level than this option will remain unchanged.
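The effect of the level setting on the dictionary above can be sketched in Python by storing, for each item, its path from the root and truncating that path at the requested level. The nested-dict representation and the names `build_paths` and `code_word` are assumptions made for illustration, not WordStat's internal format.

```python
# The example dictionary as a nested dict; leaf values are item weights.
DICTIONARY = {
    "COUNTRY": {
        "NORTH-AMERICA": {"CANADA": 1, "UNITED-STATES": 1, "USA": 1, "MEXICO": 1},
        "SOUTH-AMERICA": {"BRAZIL": 1, "CHILI": 1},
    }
}

def build_paths(tree, prefix=()):
    """Map each item to its path, e.g. CANADA -> (COUNTRY, NORTH-AMERICA, CANADA)."""
    paths = {}
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            paths.update(build_paths(child, path))
        else:
            paths[name] = path
    return paths

def code_word(word, paths, level):
    """Return the name used for a word when coding up to the given level."""
    path = paths.get(word.upper())
    if path is None:
        return None                      # word is not in the dictionary
    # Items stored at or above the requested level keep their own name.
    return path[min(level, len(path)) - 1]
```

With `paths = build_paths(DICTIONARY)`, `code_word("Canada", paths, 1)` gives COUNTRY, a level of 2 gives NORTH-AMERICA, and a level of 3 or more gives CANADA itself.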
Setting the LEVEL option to AS SHOWN instructs WordStat to match the level of
categorization performed to the level of detail currently displayed in the tree view of the
categorization dictionary. This option allows one to set different levels of categorization by
expanding broad categories that should be broken down and by collapsing categories for
which finer details are not needed. For example, if we modify the above tree by collapsing the
NORTH-AMERICA category, WordStat will display it the following way:
COUNTRY
  NORTH-AMERICA
  SOUTH-AMERICA
    BRAZIL (1)
    CHILI (1)
The program will report frequencies of individual countries like BRAZIL or CHILI but will
categorize every instance of CANADA, UNITED-STATES, USA and MEXICO as
NORTH-AMERICA.
Please note that it is possible to prevent a category from being broken down into subcategories
or items, even if the level of analysis is set to a higher setting, or if it is set to AS SHOWN and
the items contained in this category are visible. Such a feature is useful when the content of a
category consists of different ways of referring to the exact same thing (for example
To make a category unbreakable, select it in the dictionary tree, click the button,
and put a check mark in the Unbreakable box. The folder icon normally used to represent
categories will be transformed into a folder icon with a key inside. You may also select the
category, right click, and then select UNBREAKABLE | YES from the pop-up menu. To
unlock the folder, follow the previously described steps for editing the category and remove
the check mark in the Unbreakable box or select UNBREAKABLE | NO from the pop-up
menu.
CATEGORIES ONLY - When the LEVEL OF ANALYSIS option is set to a value higher than one,
this option instructs WordStat to limit the level increase to the coding of the last category at or
below the specified level. This option is especially useful when working with unbalanced
hierarchical categorization systems where individual words are stored at different levels. For
example, in the following dictionary:
SENSATION
  ODOR
    AROMA (1)
    BREATH (1)
    FRAGRANCE (1)
    NOSE (1)
ANXIETY
  AFRAID (1)
  TREMOR (1)
setting the level of analysis to 2 without enabling this option would code words like AROMA
or BREATH as ODOR, but would include in the final results individual words like TREMOR
or AFRAID. Enabling the CATEGORIES ONLY option ensures that individual words won't
be included but will be coded as their parent category.
USE FULL PATH AS CATEGORY NAME - When the LEVEL OF ANALYSIS option is set to a
value higher than one, this option instructs WordStat to substitute the full path of an item as
the category name. The slash ( / ) character is used to separate the various levels. For example,
in the above example, setting this option to true and the level of analysis to 2 will code the word
AROMA as SENSATION/ODOR. Increasing the level of analysis up to 3 will return
SENSATION/ODOR/AROMA.
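The CATEGORIES ONLY and USE FULL PATH AS CATEGORY NAME options can be illustrated with the following Python sketch, which reproduces the outcomes described above for the SENSATION and ANXIETY example. The nested-dict layout and the function names are hypothetical, and the behavior when both options are combined is an assumption of this sketch rather than documented WordStat behavior.

```python
DICTIONARY = {
    "SENSATION": {"ODOR": {"AROMA": 1, "BREATH": 1, "FRAGRANCE": 1, "NOSE": 1}},
    "ANXIETY": {"AFRAID": 1, "TREMOR": 1},
}

def item_paths(tree, prefix=()):
    """Map each item to its path from the root of the dictionary."""
    paths = {}
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            paths.update(item_paths(child, path))
        else:
            paths[name] = path
    return paths

def code_word(word, paths, level, categories_only=False, full_path=False):
    """Code a word up to `level`, optionally never returning individual words."""
    path = paths.get(word.upper())
    if path is None:
        return None
    depth = min(level, len(path))
    if categories_only and depth == len(path):
        # Stop at the last category instead of the word itself.
        depth = max(len(path) - 1, 1)
    selected = path[:depth]
    return "/".join(selected) if full_path else selected[-1]
```

At level 2, AROMA is coded as ODOR while TREMOR remains an individual word; passing `categories_only=True` codes TREMOR as ANXIETY instead. With `full_path=True`, AROMA becomes SENSATION/ODOR at level 2 and SENSATION/ODOR/AROMA at level 3.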
ALLOW OVERLAP - By default, categories are mutually exclusive such that a word can only be
entered in a single category. Enabling this option allows one to create overlapping categories
where words can be classified simultaneously into two or more categories. However, please
take note that current multivariate techniques available in WordStat such as clustering,
correspondence analysis and multidimensional scaling as well as other multivariate statistical
procedures make the assumption that categories are statistically independent. Using
overlapping categories creates data that clearly violate this assumption and may yield dubious
results.
SHOW WARNINGS - Some items in an exclusion list or categorization dictionary may remain
TYPE                                   DESCRIPTION
Item also in the exclusion list        An item found in a categorization dictionary cannot be recognized if it matches an item found in the exclusion list.
Phrase starts with an excluded word    In order to be recognized, a phrase cannot start with a word found in the exclusion list. Therefore, this excluded word should preferably be removed from the exclusion list in order for the phrase to be recognized.
Enabling the Show Warnings option instructs WordStat to identify potential compatibility
problems affecting items in a dictionary, and it displays a list of those problems in a special
dialog box. This dialog box is displayed prior to the application of dictionaries for a content
analysis.
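The two compatibility problems listed above can be detected with a simple check, sketched here in Python. The function name `find_warnings` is hypothetical, and the sketch assumes that phrases are written with underscores between words and that items contain no wildcards.

```python
def find_warnings(dictionary_items, exclusion_list):
    """Return (item, warning type) pairs for items that conflict with the exclusion list."""
    excluded = {word.upper() for word in exclusion_list}
    warnings = []
    for item in dictionary_items:
        words = item.upper().split("_")
        if len(words) == 1 and words[0] in excluded:
            warnings.append((item, "Item also in the exclusion list"))
        elif len(words) > 1 and words[0] in excluded:
            warnings.append((item, "Phrase starts with an excluded word"))
    return warnings
```

For the items GOOD, NOT_SATISFIED and THE checked against an exclusion list containing "the" and "not", the function flags NOT_SATISFIED as a phrase starting with an excluded word and THE as an item also present in the exclusion list.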
For more information on how to open, activate or deactivate a dictionary or how to add, edit or remove an
entry in a dictionary, see Creating and Maintaining Dictionaries, page 114.
DICTIONARY OPTIONS
ADD WORDS - When the inclusion dictionary is disabled, all words that are not found in the
exclusion list will be included in the final keyword frequency analysis. This option allows one
to restrict the number of words included to the most frequent ones by setting a minimum
Frequency or Case Occurrence criterion for inclusion. This option may also be used while
the inclusion list is active to add other frequently used words to this list.
However, this option can only be used to add new words to the list of words and categories
found in this inclusion dictionary and cannot be used to remove any of those items. To
remove items in this inclusion dictionary based on a frequency or case occurrence criterion
see the REMOVE WORDS option below.
REMOVE WORDS - This option allows one to restrict the number of included words or categories
to the most frequent ones by setting a minimum Frequency or Case Occurrence criterion for
TEXT TO INCLUDE
DON'T PROCESS TEXT WITHIN BRACES - This option can be used to instruct the program to
skip all text found between braces (i.e. { and } ). This option is especially useful when you
want to insert comments or annotations in the text variable without affecting the content
analysis. It can also be used to ignore in an interview transcript all questions, prompts, and
other verbal interventions made by the interviewer.
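The effect of this option can be approximated with a regular expression, as in the Python sketch below. The function name `strip_braced_text` is hypothetical, and the sketch assumes that braces are balanced and not nested.

```python
import re

def strip_braced_text(text):
    """Replace every {...} segment with a space so it is ignored by the analysis."""
    return re.sub(r"\{[^{}]*\}", " ", text)
```

For example, applying the function to "{Q: Why?} Because I like it. {probe}" leaves only the respondent's words, "Because I like it.", surrounded by whitespace.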
CHARACTERS
ACCEPT NUMERIC CHARACTERS - By default, every word consisting of numeric values or of a
mix of letters and numbers is excluded from the analysis. This option can be used to include
those words.
ADD CHARACTERS APPEARING - This set of options allows one to specify which characters,
besides letters of the alphabet, should be considered as an integral part of a word. For
example, the word "ex-wife" is treated as a single word if the hyphen is included in the list
of valid characters, and as two separate words ("ex" and "wife") otherwise. Two edit boxes may be
used to specify additional characters. The ANYWHERE option is used to specify special
characters that will be considered as part of a word, no matter where they appear, while the
EMBEDDED IN WORDS option should be used to specify characters that should be enclosed
within other valid characters and not at the beginning or the end of a word. For example,
adding the period and comma to the list of characters embedded in words will allow one to
retrieve numeric values such as 97.5 or 1,000,000 or domain names like www.google.com as
a single token without the risk of retrieving words immediately followed by commas or
periods.
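The difference between the ANYWHERE and EMBEDDED IN WORDS character lists can be sketched with a regular-expression tokenizer in Python. This is an illustration under stated assumptions, not WordStat's tokenizer: the function `make_tokenizer` and its parameters are hypothetical, only unaccented letters are handled, and digits are always accepted (as if ACCEPT NUMERIC CHARACTERS were enabled).

```python
import re

def make_tokenizer(anywhere="", embedded=""):
    """Build a tokenizer from two lists of extra word characters.

    anywhere: characters valid at any position in a word.
    embedded: characters valid only between two other valid characters.
    """
    core = "[A-Za-z0-9" + re.escape(anywhere) + "]+"
    if embedded:
        # An embedded character must be followed by at least one core character.
        pattern = core + "(?:[" + re.escape(embedded) + "]" + core + ")*"
    else:
        pattern = core
    regex = re.compile(pattern)
    return lambda text: regex.findall(text)
```

With `embedded=".,"`, the sentence "It cost 1,000,000 dollars, see www.google.com." yields the tokens "1,000,000" and "www.google.com" while "dollars" is returned without its trailing comma. With `anywhere="-"`, "ex-wife" is kept as one token; with no extra characters it is split into "ex" and "wife".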
CASE SENSITIVE - By default, WordStat internally converts all text to uppercase letters so that
processing of words is case-insensitive. This may be inappropriate if one wants to identify
proper nouns or analyze text written in some European languages like German where
differences in letter case may denote different meanings. Enabling this option prevents the
internal conversion to uppercase letters, so that two instances of the same word that differ
only in case (lower or upper case) are treated as two distinct words.
CASE PROCESSING
RANDOM SAMPLE - When this option is activated, the program will randomly select a fraction of
all cases and perform the content analysis on this subsample. The proportion of cases can be
specified using the spin button located at the right of the checkbox. This option reduces the
processing time for large files and is especially useful during the initial phase of an analysis
where dictionaries are constructed and categorization schema are developed and revised. It
also allows one to preview the kind of results that would be obtained on very large data files.
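As a rough illustration of what analyzing a random subsample means, the following Python sketch
draws a fixed proportion of cases without replacement. The function sample_cases and its seed
parameter are hypothetical; how WordStat selects cases internally is not documented here.

```python
import random

def sample_cases(cases, proportion, seed=None):
    """Return a random subsample containing roughly `proportion` of the cases."""
    if not cases:
        return []
    rng = random.Random(seed)          # a seed makes the draw repeatable
    size = max(1, round(len(cases) * proportion))
    size = min(size, len(cases))
    return rng.sample(cases, size)     # sampling without replacement

subsample = sample_cases(list(range(200)), 0.25, seed=7)
print(len(subsample))
```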
INCLUDE RECORDS WITH MISSING VALUES - When examining the relationship between
textual data and categorical or numerical variables, WordStat will skip any cases with a
missing value on any one of these variables. Enabling this option instructs WordStat to
include all cases, whether or not values are missing. All missing values are assigned to an
additional class labeled as "MISSING." Any analysis involving comparisons between classes
LANGUAGES PAGE
ADD OR REMOVE LANGUAGE RESOURCE FILES - WordStat may use various resources (such as
spelling dictionaries, thesauri, and lemmatization routines) to process some natural languages
more efficiently. By default, WordStat installs resources for processing English documents.
Clicking this button allows one to download and install resource files for other languages or
remove language resources previously installed.
ACTIVE SPELLING DICTIONARIES - WordStat makes use of language dictionaries to spell-check
existing textual data and to suggest inflected forms of words found in the user dictionary. This
group of options lets one specify which dictionary to use with the current data file.
ACTIVE THESAURUS - WordStat's SUGGEST feature can use a thesaurus to suggest synonyms of
existing words in the text collection or in the user categorization dictionary. This option allows
one to select the language of the thesaurus (for more information on this feature, see Using
Lexical Tools for Dictionary Building, page 127).
OTHERS PAGE
COLOR SCHEME - A color scheme is a set of colors used for background, page tabs, borders and
buttons. WordStat comes with several color schemes. To choose a desired color scheme, simply
select one from the drop-down list. To disable this feature and use the default Windows theme, set
this option to System Default.
FLAT TABLES (WITHOUT GRID LINES) - When this option is unchecked, tables in WordStat are
displayed with grid lines while column and row headers are displayed in 3-D. Enabling it
removes grid lines and flattens row and column headers.
PERCENT DECIMAL PLACES - Use this option to modify the number of decimal places used to
display percentages in frequency tables and in crosstabulation tables.
SHOW HARD RETURNS AS - In KWIC lists and reports, hard returns normally used to mark the
beginning of a new paragraph are represented by a symbol. This option allows toggling on and
off the display of this symbol.
TREATMENT OF ITEMS NOT FOUND IN NORM FILES - When comparing keyword frequencies with
normative data, a specific word in the frequency list may be absent from the normative data file.
This may occur for neologisms, technical terms, proper nouns, low frequency words as well as
misspelled words. This option allows one to choose whether, in such a situation, the cells for
expected frequency and comparison statistics should be left empty or if those statistics should be
extrapolated by setting the expected frequency to the lowest possible frequency (based on the size
of the normative body of text that was used to compute the normative data).
By default, the table shows the included words in descending order of frequency. The table includes the
following statistics:
FREQUENCY     Number of occurrences of the word or category name.
% SHOWN       Percentage based on the total number of words displayed in the table.
% PROCESSED   Percentage based on the total number of words encountered during the
              analysis.
% TOTAL       Percentage based on the total number of words less those excluded by the
              exclusion list.
NO CASES      Number of cases where this keyword appears.
% CASES       Percentage of cases where this keyword appears.
TF*IDF        Term frequency weighted by inverse document frequency. This weighting
              is based on the assumption that the more often a term occurs in a document,
              the more representative it is of that document's content, yet the more
              documents in which the term occurs, the less discriminating it is.
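The idea behind the TF*IDF weighting can be sketched in a few lines of Python. The formula below
is the common textbook form (frequency multiplied by the base-10 logarithm of total cases over
cases containing the term); the exact formula used by WordStat is not given in this section, so
the function tf_idf and its arguments are assumptions made for illustration.

```python
import math

def tf_idf(term_frequency, cases_with_term, total_cases):
    """Textbook TF*IDF: frequency weighted by log10(total cases / cases with term)."""
    if cases_with_term == 0:
        return 0.0
    return term_frequency * math.log10(total_cases / cases_with_term)

# A word found in every case gets no weight, however frequent it is.
print(tf_idf(10, 100, 100))
# The same frequency concentrated in 10 of 100 cases is weighted higher.
print(tf_idf(10, 10, 100))
```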
Tabs at the top of the table allow you to access (1) a frequency table of all Included content categories or
keywords, (2) a frequency table of Leftover Words consisting of individual words that have not been
SORT BY - This option allows one to display words in the frequency table in alphabetical order, by
keyword endings, or in descending order of keyword frequency (NO WORDS column) or
case occurrence (NO CASES column). Sorting the table by keyword endings facilitates the
identification of plural forms of words that should be replaced by their singular form, or of
verbs that should be replaced by their infinitive form. One may also sort the table in ascending
order on any column by clicking its column header. Clicking a second time on the
same column header sorts the rows in descending order.
The button can be used to move one or several words to the exclusion or substitution list or
to add or remove a word from the inclusion list. The permitted moves depend on the words
currently displayed. If you want to remove a word from the inclusion list, the DISPLAY option
should be set to INCLUDED or ALL. To add a word to the inclusion list, the DISPLAY should be
set to LEFTOVER or ALL. This button is also used to display a Keyword-In-Context table of the
selected keyword.
It is also possible to quickly access the pop-up menu invoked by this button by pressing the right
button of the mouse anywhere on the grid (see below).
The button allows one to produce bar charts or pie charts to visually display the distribution of
specific keywords or categories. To produce such charts:
The button allows one to create normative frequency data from the current file, to store them
on disk and to compare currently displayed frequencies with previously saved norms. See Creating
and Using Norm Files on page 148 for more information on this topic.
The button may be used to append frequency information to the current data file, save to disk
a matrix of word or keyword frequency by cases or export the current categorization model. For
more information on one of these topics see Exporting Frequency Data (page 181) or Exporting
Categorization Models (page 183).
The button allows one to access a keyword retrieval feature to retrieve all documents,
paragraphs or sentences containing a specific keyword or a combination of keywords. See
Keyword Retrieval (page 150) for more information on this topic.
The button allows one to automatically attach QDA Miner tags to all paragraphs or sentences
associated with currently displayed content categories or keywords. When clicked, a dialog box
asks whether the coding should be applied to whole paragraphs or to individual sentences. If some
WordStat keywords have no corresponding QDA Miner codes, new codes will be created under a
special codebook category before the autocoding process begins.
The button is used to draw color guidelines on alternate rows in order to facilitate the reading
of large tables. When clicking this button color guidelines are shown. Clicking this button again
removes the color guidelines.
The button allows one to view an automatic suggestion panel displaying leftover words
potentially related to the currently selected item. For more information on the auto-suggest panel,
see Working with the Auto-Suggest Panel on page 42.
The button displays various statistics on the text categorization process, such as the total
number of words processed, and the number and proportion of words that have been excluded and of words that
have been categorized. The dialog box also displays document statistics - such as the average word
length of sentences, paragraphs and documents as well as several coverage statistics, including the
percentage of cases, paragraphs and sentences containing at least one keyword and the proportion
of words that have been categorized or included. The coverage statistics are especially useful when
one applies a content analysis dictionary developed to describe a specific data set to new data sets.
A significant decrease in coverage may indicate the need to update a dictionary in order to better
reflect changes over time or specific differences in this new data set.
To cancel a modification:
If you want to undo the last assignment, deletion or modification made using this panel, click the
button. Note: If you leave the mouse cursor over this button, a hint window will appear
showing you which modification will be canceled by clicking this button.
The Topics modeling tool will automatically extract the most important topics from a text collection
using factor analysis. Results may be saved as a content analysis dictionary or may be further
examined using co-occurrence analysis or crosstabulation.
The Phrases extraction feature will identify idioms and common phrases and will allow one to add
them to a content analysis dictionary as well as perform co-occurrence analysis and comparison
analysis of those phrases.
The Named Entities extraction feature can identify proper nouns, names of people, locations or
organizations as well as acronyms. One may then select relevant ones and move them to the
categorization dictionary.
The Misspellings & Unknowns extraction feature provides a tool that identifies misspellings and
some technical terms by comparing the list of word forms encountered in the entire text collection
against a list of common words. Extracted words may be added to the current categorization dictionary
or to a substitution process. They may also be replaced in the original documents with a properly
spelled form.
The current implementation of the topic modeling procedure has a limit of 2500 words or content
categories (we are working on ways to increase the capability to at least twice this amount). To ensure the
stability of the factoring solution, low frequency items should preferably be excluded. It is thus strongly
recommended to remove any word occurring fewer than 10 times on smaller data sets, and ideally fewer than 30 to
50 times on larger ones. Stemming, lemmatization or the creation of a categorization dictionary may also be
used to group words or phrases, including less frequent ones, prior to the topic extraction.
WordStat provides the following analysis options to control the topic modeling process:
Segmentation - This option allows one to specify whether the data to be used for topic modeling will
be based on the co-occurrence of words in the same document, or whether they will be based on
co-occurrence within paragraphs or sentences. The choice of segmentation should ideally reflect
how topics are being distributed in a typical document and across documents as well as the
objective of the analysis. When the text collection consists of long documents containing multiple
topics (such as long political speeches) and one needs to identify all topics in order to compare
their relative frequencies, then performing a segmentation by paragraph or by sentence may be
more sensitive than computing co-occurrences by documents. Alternatively, if one attempts to
differentiate documents by identifying domains or disciplines, or to identify the dominant issue of
documents, then performing the analysis at the document level may be more appropriate. When
analyzing responses to open-ended questions, which may include several topics listed in a single
paragraph, segmenting by sentence may also result in a more precise extraction of the various
topics they contain.
No. of Topics - Setting this option allows one to specify how many topics to extract.
Loading - This option allows one to set the minimum factor loading an item (typically a word) should
reach in order to be retained in the factor solution. By default, this value is set to 0.4. Increasing
the cutoff value will reduce the number of words, keeping only the more representative ones, while
reducing it may include words that are somewhat less characteristic of the extracted topic.
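The effect of the Loading cutoff can be sketched as follows. This is only an illustration of the
filtering step, not WordStat's factor analysis: the function topic_keywords is hypothetical, the
loadings are made-up values, and the sketch assumes the cutoff is compared with the raw loading.

```python
def topic_keywords(loadings, cutoff=0.4):
    """Keep the words whose loading on one factor reaches the cutoff.

    loadings: dict mapping each word to its loading on the factor.
    Returns the retained words in descending order of loading.
    """
    kept = [(word, value) for word, value in loadings.items() if value >= cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in kept]

# Made-up loadings for a single factor.
factor = {"TAX": 0.72, "BUDGET": 0.55, "WEATHER": 0.12, "DEFICIT": 0.40}
print(topic_keywords(factor))               # default cutoff of 0.4
print(topic_keywords(factor, cutoff=0.6))   # stricter cutoff keeps fewer words
```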
Once the options have been set, click the button to perform the analysis. Please note that
extracting topics on more than a few hundred words can take several minutes. Once extracted, the Topic
NO The factor number. Please note that some factor numbers may be omitted if none of
their items attained the factor loading cutoff criteria. When factors are being merged
by the user, this column displays the numbers of all factors that have been merged
together.
NAME WordStat uses an algorithm to automatically provide a label for the extracted topic.
This label may be edited by clicking the button.
KEYWORDS This column lists all keywords meeting the factor loading cutoff criteria in descending
order of factor loading.
% VAR This column shows the percentage of variance explained. Please note that the smaller
the segment one chooses, the lower the percentage.
FREQ This column displays the total frequency of all items listed in the keywords column.
CASES This column displays the number of cases containing at least one of the items listed in
the keywords column.
% CASES This column displays the percentage of cases with at least one of the items listed in
the keywords column.
This button allows you to delete the topic on the selected row.
Click this button to merge a topic into another one. You first need to select the row containing the
first topic you would like to merge and then click this button. A dialog box will appear with a
list of all other topics. Select the second topic and click OK.
To rename a topic, select it first and then click this button. Type the new name and click OK.
To retrieve segments associated with a topic, select it and click this button. All text segments
containing at least two keywords of the selected topic will be retrieved and presented in a table
format. You may, however, change both the type of segments retrieved (paragraphs, sentences or
full documents) and the minimum number of topic words needed for retrieval.
This button allows one to perform co-occurrence analysis of all the extracted topics including
clustering and multidimensional scaling, and create proximity plots as well as link charts. For
more information on the various features available, see the Co-Occurrence Page topic.
This button allows one to perform full crosstabulation analysis of all the displayed topics with
structured data, apply statistical analysis and create various charts such as correspondence plots,
heatmaps, bubble charts and bar charts. For more information on the various features available
for crosstabulation analysis, see the Crosstab Page topic.
This button stores the extracted topics currently displayed into a new categorization dictionary
where folders at the first level correspond to different topics, and where each of those folders
contains the associated words. A dialog box allows one to save
Press this button to append a copy of the topic table to the Report Manager. A descriptive title
will be provided automatically. To edit this title or to enter a new one, hold down the SHIFT
keyboard key while clicking this button (for more information on the Report Manager, see the
Report Management Feature topic).
This button allows one to store the topic table to disk in various formats, including Excel, tab and
comma delimited files, plain text, HTML, XML, SPSS or Stata files.
Clicking this button allows you to print a copy of the displayed chart.
On the right of this table, a panel allows one to look at the distribution of the selected topic among values of
up to two structured variables. One may choose to display this distribution using either a vertical bar chart, a
horizontal bar chart or a line chart by clicking on the corresponding button. Four statistics may also be
represented on those charts:
Before scanning for phrases, one has to set various options that will be used to determine the extent of the
scanning process. The first two options that need to be set are the minimum and maximum number of words
a phrase can have (Min words and Max words). These two values determine the processing time, the
memory requirement and the number of resulting phrases. The larger the range between those
minimum and maximum values, the longer it will take to collect all possible sequences of words. The Min.
Frequency or Min. Cases options allow one to eliminate from the list phrases that appear only a few times
by setting a minimum frequency criterion. When set to Min. Frequency, the criterion specifies the
minimum number of times a phrase must appear regardless of whether it comes from a single document or
from multiple documents. Setting it to Min. Cases allows one to require those occurrences to appear in a
minimum specified number of cases.
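To make the interplay of these options concrete, here is a minimal Python sketch that collects
every word sequence between a minimum and a maximum length and then applies both kinds of
threshold. It is not WordStat's phrase finder (for instance, it does not remove short phrases
contained in longer ones); the function find_phrases and its parameter names are hypothetical.

```python
from collections import Counter

def find_phrases(cases, min_words=2, max_words=3, min_frequency=2, min_cases=1):
    """Return {phrase: (frequency, number of cases)} for phrases meeting both thresholds."""
    frequency = Counter()    # total occurrences across all cases
    case_count = Counter()   # number of cases containing the phrase
    for text in cases:
        words = text.upper().split()
        seen = set()
        for n in range(min_words, max_words + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                frequency[phrase] += 1
                seen.add(phrase)
        case_count.update(seen)   # each case counts once per phrase
    return {
        phrase: (frequency[phrase], case_count[phrase])
        for phrase in frequency
        if frequency[phrase] >= min_frequency and case_count[phrase] >= min_cases
    }

cases = ["I am looking for a job", "looking for a house", "looking for looking for"]
print(find_phrases(cases, min_words=2, max_words=2, min_frequency=2))
```

Widening the range between min_words and max_words increases the number of candidate sequences
that must be stored, which is why it affects both processing time and memory.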
When these options have been set properly, click the button to perform the search.
By default, found phrases are presented in descending order of frequency. The Sort By list box can be used
to reorder the obtained list in descending order of frequency, in alphabetical order or in descending order of
Finding overlaps
While WordStat tries to reduce redundancy in the list of phrases by automatically removing short phrases
that are part of larger ones, the resulting list may still contain items that are not independent of each other
such as phrases that sometimes overlap. In order to allow users to take into account potential overlaps when
selecting phrases, WordStat provides a display option that allows one to see when a selected phrase includes
a shorter one, is part of a larger one, or sometimes overlaps other phrases. Such information is especially
useful when one needs to identify idioms that are more specific, often found in longer phrases, or more
generic ones, usually composed of shorter phrases.
To enable the display of information regarding overlaps, simply click the button. A window appears to
the right of the table. Selecting a phrase in the table automatically shows all other items that overlap this
selected item. Each phrase is accompanied by a ratio indicating the total number of times this other phrase
occurs and how many times it overlaps with the selected item. For example, if one selects the phrase I'M
LOOKING FOR in the table showing it occurs 26 times in a document collection, one may notice that it
Choose the Statistic that should be used to assess the relationship between the frequency or case occurrence
of the phrases and classes of the categorical variable. The Chi-square is the overall chi-square value
computed on all classes of the categorical variable, while the Max Chi option is the chi-square value
computed on the class with the highest case occurrence or frequency against all the other classes. Select
None if you don't want to display any comparison statistic.
Check the Show highest class option to display a column indicating the label of the class with the highest
relative frequency or case occurrence. In the event that two or more classes obtain the same high
percentage, the cell will list all the labels associated with each of those classes.
Click the OK button to perform the computation.
Once the computation is completed, several additional columns are added to the right side of the table. To
sort rows based on values in any of the newly created columns, click the appropriate column heading.
Clicking several times on the same column heading toggles between ascending and descending order.
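The difference between the Chi-square and Max Chi statistics can be sketched in Python for the
case occurrence of a single phrase. This is an illustrative computation using the standard
Pearson chi-square, not WordStat's code; the function names are hypothetical, every class is
assumed to contain at least one case, and the "highest" class is assumed to be the one with the
highest percentage of cases containing the phrase.

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def phrase_statistics(present, class_sizes):
    """present[k]: cases of class k containing the phrase; class_sizes[k]: cases in class k."""
    absent = [size - count for count, size in zip(present, class_sizes)]
    overall = chi_square([present, absent])            # all classes at once
    top = max(range(len(present)), key=lambda k: present[k] / class_sizes[k])
    rest_present = sum(present) - present[top]
    rest_absent = sum(absent) - absent[top]
    # Highest class against all the other classes pooled together.
    max_chi = chi_square([[present[top], rest_present], [absent[top], rest_absent]])
    return overall, max_chi, top

print(phrase_statistics([30, 20, 10], [100, 100, 100]))
```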
Enabling the Phrase containing option and entering a string in the edit box allows one to display only
phrases containing the specified string. If a comparison has been performed between classes of a
The Processing options are almost identical to those found at the top of the Options page (see page 33),
allowing one to add to the extracted phrases single words occurring more than a specific number of
times or in a specific number of cases, or to remove items that are too frequent or not frequent enough. One may
also restrict the analysis to a specific number of items.
The Saving into a new dictionary options may be used to store phrases and words into a new dictionary
file. Enabling this option and selecting All Phrases will allow the opportunity to store all phrases extracted
using the Phrase Finder page, whether or not they meet the specified frequency criteria. Selecting the other
option will store only phrases meeting those criteria as well as all words that were also included in the
analysis.
Click the button. A dialog box similar to this one will appear:
The Processing and Saving options are identical to the ones available when performing co-occurrence
analysis on phrases (see previous page). Two additional options are displayed to select the base statistic for
the correspondence analysis (Frequency or Case Occurrence) as well as the categorical variable
containing the classes to be compared.
For more information on crosstabulation, see Crosstab page (page 85).
Click the Find button to search the first item matching the typed string. Clicking this button again finds the
next occurrence of the search string, starting at the currently selected item.
Clicking the button brings up a dialog box with the following three options:
Remove items in the dictionary - This option removes from the final list named entities that are
already in the active categorization dictionary. By default, this option is enabled.
Remove known common words - Enabling this option removes extracted single-word entities that are
also common words. This option is especially useful to remove capitalized words in titles or words
in social media data that have been put in full capitals, typically to stress their importance. It may,
however, miss named entities that are also common nouns such as "Apple",
"Windows" or some acronyms.
Minimum frequency - This option allows one to eliminate from the list named entities that appear only
a few times.
When these options have been set properly, click the button to perform the search. Retrieved
items are listed in descending order of frequencies. The Total column indicates the total frequency of this
named entity, while the Unique column reports how many of those do not overlap with other ones.
Dictionary - By default, the extraction is performed with reference to common English words (British and
American English). To identify unknown words in documents written in another language or to
exclude technical terms from a specific domain, click the language name(s) listed beside the
Dictionary option or modify the settings on the Language page located on the main Options page
of WordStat.
Remove words already in the categorization dictionary - Some misspellings or uncommon words
may have already been added to an exclusion list or a categorization dictionary. This option
automatically removes those items from the lists of retrieved words.
Find match for words in the categorization dictionary - WordStat will attempt to match unknown
words with existing ones in the language spelling dictionary and identify the most likely
To search for unknown words, click the button. A dialog box similar to this one will appear.
Available operations
Up to four types of transformation or processing are allowed on these words: 1) you can replace all instances
of a selected word in the original document by another word or phrase; 2) you can assign the word to an
existing content category; 3) you can instruct WordStat to automatically replace this word with another
one by adding it to a live substitution process; or 4) you can add this word to a custom list of valid words,
causing the program to ignore it the next time there is a search for vocabulary words. You may
also obtain a keyword-in-context list associated with a specific word in order to decide how that word
should be treated.
Except for the keyword-in-context list, none of these four operations is performed immediately.
Instead they are added to an action list allowing you to review, modify or cancel previously defined actions
prior to the application of all the specified changes.
Click the button. A dialog box similar to this one will appear.
Click the button. A dialog box will appear with a list of all categories and
subcategories of the current content analysis dictionary.
Select the category under which this item should be stored and then click OK.
You may add words from this list to the current categorization dictionary or to the exclusion list, or
assign it to the substitution process in order to have it replaced automatically by another word. To
perform any one of those assignments, simply select and drag the item into the proper location in the
dictionary panel to the left of the table (see Using the Dictionary Panel on page 41).
Click the button. A dialog box similar to this one will appear.
Click the button. All words associated with the removed actions are moved
back to the list of unknown words and positioned at the bottom of the list.
It is also possible to perform most of those actions or to produce a keyword-in-context table by selecting an
item and either clicking the button or right-clicking.
WordStat remembers text replacements made in prior sessions and can reapply them in batch. One can edit
entries in this list of replacements, remove items, merge lists by importing lists created by other users, or
export this list to a file. To perform those operations, click the Setup button on the
Misspellings & Unknown page and then click the Batch Replacements Settings button. A dialog box
similar to this one will appear:
To edit a replacement
Select the row containing the item you want to edit.
Click the Edit button. A dialog box similar to the one shown above will appear.
Edit any one of the entries in the Substitute or the With edit box.
Click the OK button to confirm the change.
The Min Frequency option allows one to eliminate from the list all words appearing only a few times by
setting a minimum frequency criterion.
Once this option has been set, click the button to start searching for vocabulary words. The
retrieved words are then listed in a frequency table on the left of the screen, in descending
order of frequency. To sort this list in alphabetical order, click the top of the first column.
Clustering Cases/Documents
When the clustering is set to be performed on cases or documents, the distance matrix used for
clustering and multidimensional scaling consists of cosine coefficients computed on the relative
frequency of the various keywords. The more similar two documents are in terms of the distribution of
keywords, the higher the coefficient. The case label that is used to identify the various cases can be set
by choosing the Edit Case Descriptors command from the WordStat main menu (see page 166).
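A cosine coefficient computed on relative keyword frequencies can be sketched as follows. The
function cosine_similarity and the sample frequency dictionaries are hypothetical and only
illustrate the standard cosine formula; they are not taken from WordStat.

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Cosine coefficient between two documents given as {keyword: frequency} dicts."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    # Relative frequencies remove the effect of document length.
    rel_a = {k: v / total_a for k, v in freq_a.items()}
    rel_b = {k: v / total_b for k, v in freq_b.items()}
    dot = sum(rel_a[k] * rel_b[k] for k in rel_a if k in rel_b)
    norm_a = math.sqrt(sum(v * v for v in rel_a.values()))
    norm_b = math.sqrt(sum(v * v for v in rel_b.values()))
    return dot / (norm_a * norm_b)

short_doc = {"TAX": 2, "JOBS": 1}
long_doc = {"TAX": 4, "JOBS": 2}       # same distribution, twice as long
other_doc = {"WEATHER": 3}
print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```

Two documents with the same keyword distribution obtain a coefficient of about 1 whatever their
length, while documents sharing no keyword obtain 0.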
Clustering keywords
When clustering keywords or content categories, several options are available to define co-occurrence
and choose which similarity index will be computed from the observed co-occurrences.
CO-OCCURRENCE - This option allows you to specify how a co-occurrence will be defined. By
default, a co-occurrence is said to happen every time two words or two categories appear in the same
case (by case option). You may also restrict the definition of co-occurrence to entries that appear in
the same paragraph or the same sentence, or to words or categories that are located in the same user
defined section (delimited by a / character). Finally, you may restrict even further the definition of
co-occurrences by limiting the co-occurrences to a small window of words of specified length. Such
a small window is especially useful when doing an analysis directly on words (rather than categories)
since it allows one to identify idioms or phrases that may need to be added to the categorization
dictionary. Co-occurrence on larger text segments such as cases or paragraphs may be more
appropriate to identify the co-occurrence of themes in individual subjects.
INDEX - The Index option lets you select the similarity measure used in clustering and in
multidimensional scaling. Four measures are available. The first three measures are based on the
mere occurrences of specific words or categories in a case and do not take into account their
frequency. In all those indexes, joint absences are excluded from consideration.
Jaccard's coefficient - This coefficient is computed from a fourfold table as a/(a+b+c)
where a represents cases where both items occur, and b and c represent cases where one item is
found but not the other. In this coefficient equal weight is given to matches and non matches.
Sorensen's coefficient - This coefficient (also known as the Dice coefficient) is similar to
Jaccard's but matches are weighted double. Its computing formula is 2a/(2a+b+c) where a
represents cases where both items occur, and b and c represent cases where one item is present
but the other one is absent.
Phi coefficient - This coefficient is a measure of association for two binary variables. It is
similar to the Pearson correlation coefficient in its interpretation.
Probabilistic - Traditional co-occurrence measures do not take into account the possibility that two
words will sometimes co-occur by chance. As a consequence, clustering solutions obtained using
those metrics are biased toward the formation of clusters of high-frequency items. While the problem
may remain undetected or negligible when clustering low frequency words or when analyzing
co-occurrence within a limited context (such as within a sentence, within a window or within a few
words), the problem becomes much more apparent when clustering broad content categories or
frequently used words. Enabling this option applies a correction to either the Jaccard or the Sorensen
coefficient.
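The Jaccard and Sorensen formulas given above can be checked with a short Python sketch. The
counts a, b and c follow the definitions in the text; the helper cooccurrence_counts and the
sample cases are hypothetical, and the probabilistic correction is not reproduced since its
formula is not described here.

```python
def cooccurrence_counts(cases, item_x, item_y):
    """Count a (both items present), b (only item_x) and c (only item_y) over all cases."""
    a = b = c = 0
    for items in cases:
        has_x, has_y = item_x in items, item_y in items
        if has_x and has_y:
            a += 1
        elif has_x:
            b += 1
        elif has_y:
            c += 1
    return a, b, c   # joint absences are simply ignored

def jaccard(a, b, c):
    return a / (a + b + c) if (a + b + c) else 0.0

def sorensen(a, b, c):
    return 2 * a / (2 * a + b + c) if (2 * a + b + c) else 0.0

cases = [{"BRAIN", "TUMOR"}, {"BRAIN", "TUMOR", "CANCER"}, {"BRAIN"}, {"TUMOR"}, {"CANCER"}]
a, b, c = cooccurrence_counts(cases, "BRAIN", "TUMOR")
print(a, b, c, jaccard(a, b, c), sorensen(a, b, c))
```

Because matches are weighted double, the Sorensen value is never lower than the Jaccard value
for the same counts.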
CLUSTERING TYPE - Two broad types of keyword clustering are available. The first method is based
on keyword co-occurrences (First Order Clustering) and will group together words appearing near
each other or in the same document (depending on the selected co-occurrence window). The second
clustering method is based on co-occurrence profiles (Second Order Clustering) and will consider
that two keywords are close to each other, not necessarily because they co-occur but because they
both occur in similar environments. One of the benefits of this clustering method is its ability to
group words that are synonyms or alternate forms of the same word. For example, while TUMOR
and TUMOUR will seldom or never occur together in the same document, second order clustering
may find them to be pretty close because they both co-occur with words like BRAIN or CANCER.
Second order clustering will also group words that are related semantically such as MILK, JUICE,
and WINE because of their propensity to be associated with similar verbs like DRINK or POUR or
nouns like GLASS (for more information, see Grefenstette, 1994).
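The contrast between first and second order clustering can be illustrated with a small Python
sketch that compares co-occurrence profiles instead of direct co-occurrences. The functions
below are hypothetical and use a plain cosine between profiles; the similarity measure WordStat
applies to profiles is not specified in this section.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_profiles(cases):
    """For each word, count how many cases it shares with every other word."""
    profiles = defaultdict(Counter)
    for words in cases:
        unique = set(words)
        for word in unique:
            for other in unique:
                if other != word:
                    profiles[word][other] += 1
    return profiles

def profile_similarity(profiles, x, y):
    """Cosine between the co-occurrence profiles of x and y (second order)."""
    px, py = profiles[x], profiles[y]
    dot = sum(px[k] * py[k] for k in px if k in py)
    norm = math.sqrt(sum(v * v for v in px.values())) * math.sqrt(sum(v * v for v in py.values()))
    return dot / norm if norm else 0.0

cases = [["TUMOR", "BRAIN", "CANCER"], ["TUMOUR", "BRAIN", "CANCER"], ["MILK", "GLASS"]]
profiles = cooccurrence_profiles(cases)
print(profiles["TUMOR"]["TUMOUR"])                       # first order: never together
print(profile_similarity(profiles, "TUMOR", "TUMOUR"))   # second order: same neighbours
```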
REMOVE SINGLE WORD CLUSTERS - One way to extract potentially interesting knowledge from
dendrograms is to focus on the aggregation of items at an early stage of the clustering process.
However, when clustering hundreds or thousands of items, the identification of those items requires
the user to scroll through a very long dendrogram which includes many clusters of isolated items.
Enabling this option simplifies the use of the dendrogram by hiding all single item clusters and
allowing one to concentrate only on the strongest associations. Setting this option also removes
isolated items from multi-dimensional scaling plots, greatly enhancing their value when analyzing a
large number of items. Please note, however, that when this option is enabled, changing the number
of clusters while viewing a 2-D or 3-D MDS plot will cause the program to recompute the distance
and location of remaining items.
TOLERANCE - This option specifies the tolerance factor that is used to determine when the algorithm
has converged to a solution. Reducing the tolerance value may produce a slightly more accurate
result but will increase the number of iterations and the running time.
MAXIMUM ITERATIONS - This option allows one to specify the maximum number of iterations that
are to be performed during the fitting procedure. If the solution does not converge to the limit
specified by the TOLERANCE option before the maximum number of iterations is reached, the
process is stopped and the results are displayed.
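The interplay between TOLERANCE and MAXIMUM ITERATIONS can be sketched with a generic fitting loop. This is only an illustration under assumed names (`fit`, `step`, `loss`) and a toy one-dimensional objective; it does not reproduce the stress minimization WordStat performs.

```python
def fit(step, loss, start, tolerance=1e-4, max_iterations=100):
    """Iterate until the loss improves by less than `tolerance`
    or until `max_iterations` is reached, whichever comes first."""
    current = start
    previous_loss = loss(current)
    for iteration in range(1, max_iterations + 1):
        current = step(current)
        current_loss = loss(current)
        if abs(previous_loss - current_loss) < tolerance:
            return current, iteration, True      # converged
        previous_loss = current_loss
    return current, max_iterations, False        # stopped at the iteration limit

# Toy problem (an assumption for the example): minimize (x - 3)^2 by gradient descent.
def loss(x):
    return (x - 3.0) ** 2

def step(x):
    return x - 0.1 * 2.0 * (x - 3.0)

loose = fit(step, loss, start=0.0, tolerance=1e-2)
tight = fit(step, loss, start=0.0, tolerance=1e-6)
capped = fit(step, loss, start=0.0, tolerance=1e-6, max_iterations=5)
```

The smaller tolerance ends closer to the optimum but needs more iterations, while the capped run stops before converging and simply returns its current state.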
INITIAL CONFIGURATION - This option allows one to specify whether the multidimensional scaling
will be applied on a random configuration of points or on the result of a classical scaling.
Selecting the Classical Scaling option instructs WordStat to perform a classical scaling first on the
similarity matrix, and then use the derived configuration as initial values for the ordinal
multidimensional scaling analysis.
Selecting the Randomized Location option instructs WordStat to perform the multidimensional
scaling analysis on a random configuration of points. By default, WordStat initializes the random
routine before each analysis with a new random value. The seed value used for the creation of this
initial configuration is stored along with the final stress value in the history list box, located at the
bottom of the dialog box. The Seed option may be used to specify a starting number that will be
used to initialize the randomization process and produce a fixed random sequence. To recall a
specific seed value used previously, double-click the proper line in the history list box.
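The role of the seed can be shown with a short sketch. The function name, the coordinate range and the way a new seed is drawn are assumptions for the example, not a description of WordStat's random routine.

```python
import random

def random_configuration(n_items, dimensions=2, seed=None):
    """Return (seed, points). When no seed is given, a new one is drawn
    and returned so that the same configuration can be recalled later."""
    if seed is None:
        seed = random.SystemRandom().randrange(2 ** 31)
    rng = random.Random(seed)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dimensions)]
              for _ in range(n_items)]
    return seed, points

# A fixed seed always reproduces the same starting configuration.
seed, first_run = random_configuration(5, seed=42)
_, second_run = random_configuration(5, seed=42)
```

Recording the seed next to the result, as the history list box does, is what makes a random start reproducible.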
NO CLUSTERS - This option allows you to set how many clusters the clustering solution should have.
Different colors are used both in the dendrogram and in the 2-D and 3-D maps to indicate
membership of specific items to different clusters. However, if the option to remove single item
clusters is enabled, an increase in the number of clusters may in fact result in a decrease in the
number of clusters displayed and in the overall height of the dendrogram since all single item clusters
will be hidden.
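The following sketch shows why asking for more clusters may display fewer of them once single item clusters are hidden. The item labels and the two partitions are hypothetical.

```python
def visible_clusters(clusters):
    """Keep only clusters with more than one item (single item clusters are hidden)."""
    return [cluster for cluster in clusters if len(cluster) > 1]

# The same hypothetical tree cut at two levels: the five-cluster solution
# is obtained by splitting two clusters of the three-cluster solution.
three_clusters = [["A", "B", "C"], ["D", "E"], ["F", "G"]]
five_clusters = [["A", "B", "C"], ["D"], ["E"], ["F"], ["G"]]

shown_with_three = len(visible_clusters(three_clusters))
shown_with_five = len(visible_clusters(five_clusters))
```

Here the three-cluster solution shows three clusters, while the five-cluster solution shows only one.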
DISPLAY - This option lets one choose whether the vertical lines of the dendrogram represent the
agglomeration schedule or the similarity indices.
When clustering keywords or content categories, clicking this button displays bars beside
each dendrogram item to represent their relative frequencies.
Use this button to increase the dendrogram font size and focus on a smaller portion of the
tree.
Use this button to reduce the dendrogram font size and view a larger portion of the tree.
This button allows one to perform full crosstabulation analysis with structured data, apply
statistical analysis and create various charts such as correspondence plots, heatmaps, bubble
charts and bar charts. A dialog box allows one to restrict the analysis to specific clusters
containing a minimum number of items and cluster names are automatically generated using
characteristic words and phrases of each cluster. For more information on the various
features available for crosstabulation analysis, see the Crosstab Page topic on page 85.
This button stores the cluster solution currently displayed into a new categorization
dictionary where folders at the first level correspond to different clusters, and where each of
Press this button to append a copy of the graphic in the Report Manager. A descriptive title
will be provided automatically. To edit this title or to enter a new one, hold down the SHIFT
keyboard key while clicking this button (for more information on the Report Manager, see
page 191).
This button saves the displayed dendrogram to a graphic file. WordStat supports
three different file formats: .BMP (Windows bitmap files), .PNG (Portable Network Graphics
compressed files) and .JPG (JPEG compressed files).
To retrieve text segments or documents associated with a specific cluster, click anywhere on
a cluster to select it (the selected cluster is displayed using thicker black lines), and then click
this button to retrieve the associated documents. When performing first order clustering on
keywords, this operation retrieves all text segments containing at least two keywords of the
selected cluster. When performing second order clustering of keywords, all text segments
containing a single one of those keywords will be retrieved.
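The retrieval rule described above can be sketched as follows. The function name and data are hypothetical, and "a single one of those keywords" is read here as "at least one", which is an assumption.

```python
def retrieve_segments(segments, cluster_keywords, second_order=False):
    """Return the indices of segments matching the selected cluster.

    First order clustering: at least two cluster keywords must be present.
    Second order clustering: one cluster keyword is enough (assumed reading).
    """
    minimum = 1 if second_order else 2
    keywords = set(cluster_keywords)
    return [index for index, segment in enumerate(segments)
            if len(keywords & set(segment)) >= minimum]

# Hypothetical text segments, each reduced to its set of keywords.
segments = [{"BRAIN", "TUMOR"}, {"TUMOR"}, {"MILK"}]
cluster = ["TUMOR", "BRAIN", "CANCER"]

first_order_hits = retrieve_segments(segments, cluster)
second_order_hits = retrieve_segments(segments, cluster, second_order=True)
```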
The slide ruler provides another way of quickly changing the number of clusters included in
the clustering solution. Moving the slider to the left increases the minimum distance required
to form a cluster and thus produces a dendrogram with more clusters. Moving the slider to
the right aggregates smaller clusters into bigger ones. However, if the option to remove
single-item clusters is enabled, an increase in the number of clusters may, in fact, result in a
decrease in the number of clusters displayed and in the overall height of the dendrogram.
Note: Clustering using other similarity or distance measures or agglomeration methods may be achieved
using the MVSP cluster analysis procedure (see Performing Multivariate Analysis, page 189).
Clicking the button allows one to jump to the full size Link Analysis page (page 74).
Clicking this button lets you zoom in on a plot. To zoom in on an area of the plot, hold the left
mouse button and drag the mouse down/right. You'll see a rectangle around the selected
area. Release the left mouse button to zoom.
Clicking this button restores the original viewing area of the plot.
This button is used to create a copy of the chart to the clipboard. When this button is
clicked, a pop-up menu appears, allowing you to select whether the chart should be
copied as a bitmap or as a metafile.
Pressing down this button creates a constrained multidimensional scaling. This mapping
algorithm allows one to preserve the clustering structure in multidimensional scaling
plots, making the interpretation of 2-D and 3-D MDS maps a lot easier and more
consistent with the clustering solutions. Enabling this option allows one to use the MDS
module to create maps of concepts similar to those suggested by Trochim in his
Concept Mapping procedure.
Pressing down this button displays lines to represent relationships between data points
of the multidimensional scaling plot. When the button is down, a cursor will appear in a
tool panel below the plot, allowing you to select the minimum association strengths to
be displayed.
Clicking this button creates a bubble plot where the areas of data points are proportional
to the relative frequency of those items. This type of display is especially useful when
one needs to take into account a third variable, in this specific case the frequency of
items, when interpreting the distance between data points.
Press this button to append a copy of the graphic in the Report Manager. A descriptive
title will be provided automatically. To edit this title or to enter a new one, hold down
the SHIFT keyboard key while clicking this button (for more information on the Report
Manager, see page 191).
This button allows storing the displayed multidimensional scaling plot into a graphic
file. WordStat supports four different file formats: .BMP (Windows bitmap files), .PNG
(Portable Network Graphics compressed files) and .JPG (JPEG compressed files), as well
as .WSX, a proprietary file format (WordStat Chart file). Charts stored in the latter
format may be opened, further edited and customized using the Chart Editor external
utility program.
Clicking this button allows you to print a copy of the displayed chart.
This button can be used to show or hide left, bottom and back walls.
Clicking this button draws anchor lines from the floor to the data point to better locate
data points in all 3 dimensions.
Clicking this button allows you to change the viewing angle of the chart. To rotate the
chart, make sure this button is selected, click any area of the chart, hold the mouse
button and drag the mouse to apply the desired rotation.
Layout selection
The following three buttons allow you to select one of the three layout types.
Clicking this button assigns the location of nodes in a multidimensional space such that nodes
that co-occur more often are plotted close together, while those co-occurring less often are
plotted far from each other.
Clicking this button draws nodes in a circle. Nodes that co-occur often are plotted close
together.
This track bar can be used to hide links. By default, the cursor is positioned to the right.
Moving it to the left gradually removes the weakest links, allowing you to identify more
clearly the strongest ones.
By default, a graph includes links reaching a specific statistical threshold. Clicking this
button adds additional links by gradually reducing this statistical threshold.
Clicking this button removes all links that have been hidden using the track bar control (see
above), deletes all nodes that are no longer connected, and refreshes the display to take into
account the lower number of elements in the graph.
Navigation mode
The following three buttons allow you to select the default mouse cursor behavior.
Click this button down to select specific nodes and links. To select multiple items, hold the
CTRL key down while clicking on additional items or select the rectangular region of the
graph containing the items you want to select.
To select a node along with all nodes connected to it (neighbors), hold the ALT key while
clicking the node. One may also select a node, right-click and choose GET NEIGHBORS.
Once items are selected, they can then be moved, deleted or edited. One may also search for
associated text segments either by right-clicking to call a contextual menu or by clicking the
button on the toolbar.
Clicking this button down lets you control which part of the image is visible in the image
window.
Clicking this button down allows you to zoom in on a specific region. Once this button is down,
click the upper left corner of the rectangular region you want to zoom in on, drag the mouse
down to the lower right corner of this region and then release the mouse button.
This list box lets you set the zoom level to predefined levels. One can also type in the desired
zoom level.
Editing buttons
This button allows you to display or hide the numerical value associated with links.
This button allows you to change the size of nodes, links and their associated text. When
clicked, a dialog box with four track bars appears. Moving a cursor to the right increases
the size of its associated element, while moving it to the left decreases its size.
Clicking this button deletes the selected nodes and links.
Clicking this button brings up a dialog box that allows you to change the style, the color and the
size of the selected nodes and links.
To retrieve segments associated with a topic, select it and click this button. When more than
one node or when a link is selected, all text segments containing at least two keywords will
be retrieved and presented in a table format. You may however change the type of segments
retrieved (paragraphs, sentences or full documents) or the minimum number of topic items
needed for retrieval.
This button is used to create a copy of the graph to the clipboard. When this button is
clicked, a pop-up menu appears, allowing one to select whether the graph should be copied
as a bitmap or as a metafile.
Press this button to append a copy of the graph in the Report Manager. A descriptive title
will be provided automatically. To edit this title or to enter a new one, hold down the SHIFT
keyboard key while clicking this button (for more information on the Report Manager, see
the Report Management Feature topic, page 191).
This button saves the displayed graph to a graphic file.
WordStat supports the following file formats: .BMP (Windows bitmap files), .PNG (Portable
Network Graphics compressed files) and .JPG (JPEG compressed files).
Clicking this button allows you to print a copy of the displayed graph.
2. Double-clicking a node
When the graph is in selection mode, double-clicking a node will set the node being clicked on as the
target item and display all nodes associated with it according to the last criterion settings (primary
links). All currently displayed nodes not associated with this new target item will be removed.
If the SHIFT key is held down while double-clicking a node, the graph will also include nodes
associated to a second degree with the target node (secondary links) as long as they also reach the same
criterion.
Clicking the button on the toolbar brings up a dialog box that allows one to manually select items
to be graphed. This dialog box looks like the one below:
Once the options have been set, click the button to update the graph and display the selected
nodes and links.
To select a keyword or a case that will be used as the point of reference, one can choose from the
KEYWORD or CASE drop down check list located at the top of the page. One can also freely browse
through different keywords or cases by double-clicking its bar in the Proximity Plot. The co-occurrence or
similarity to more than one target item may be displayed in a single chart allowing easy comparisons. When
several target items are selected, the proximity plot may consist of bars clustered side by side (clustered
bars), or stacked, representing either the total amount (stacked bars), or the relative distribution of scores
When looking at keyword co-occurrences, selecting a bar enables the button. Clicking this button
retrieves every document or text segment containing both keywords, allowing one to further explore the
factors that may explain this co-occurrence. When examining the similarity of documents rather than
keywords, clicking this button retrieves both documents and displays them side by side in a dialog box.
Right-clicking any existing bar displays a menu that allows one to remove the selected item, move it to the
list of target items either by adding it to the existing bars or replacing one of them. One may also retrieve
documents or text segments using this popup menu.
The Table page allows one to examine in more detail the numerical values behind the computation of those
plots. When the distance measure is based on co-occurrences, the table provides detailed information, such
as the number of times a given keyword co-occurs with another one (CO-OCCURS) and the number of
times it appears in the absence of this selected keyword (DO NOT). Such a table also includes the number
of times the selected keyword appears in the absence of the given keyword (IS ABSENT). In the example
below (computed using the paragraph as the frequency criterion), we can see on the highlighted line that the
word MILITARY co-occurs 107 times with IRAQ, but this word is encountered in 285 paragraphs without
the word IRAQ, while IRAQ is found in 1,182 paragraphs in the absence of MILITARY. The Jaccard
coefficient of 0.109 indicates that of all paragraphs containing either one of these words, 10.9 percent
contain both words. Note, however, that not all proximity measures can be interpreted as easily. To
facilitate the interpretation of this table, the status bar provides a textual interpretation of some of the
statistics.
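The Jaccard coefficient is commonly defined from the three counts shown in the table as CO-OCCURS / (CO-OCCURS + DO NOT + IS ABSENT). The sketch below applies this standard definition to illustrative counts that are not taken from the table; the function name is hypothetical.

```python
def jaccard(co_occurs, do_not, is_absent):
    """Share of units containing either keyword that contain both.

    co_occurs: units containing both keywords
    do_not:    units containing the first keyword without the second
    is_absent: units containing the second keyword without the first
    """
    either = co_occurs + do_not + is_absent
    return co_occurs / either if either else 0.0

# Illustrative counts only.
coefficient = jaccard(co_occurs=30, do_not=50, is_absent=20)
```

With these counts, 30 of the 100 units containing either keyword contain both, which gives a coefficient of 0.3.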
The first two pages display by default only the lower triangular part of the matrix of co-occurrence and
similarity. Selecting the Full Matrix option will display data on both sides of the diagonal.
Click the button. A descriptive title will be provided automatically for the table. To edit this
title or to enter a new one, hold down the SHIFT keyboard key while clicking this button.
For more information on the Report Manager, see the Report Management Feature (page 191).
Options
TABULATE - The TABULATE option allows choosing whether the values in the table should be based
on the total frequency of keywords or the number of cases containing those keywords.
WITH - The WITH drop down list allows choices on how the keyword count should be broken down.
The following options are available:
<other keywords> - display a square table showing the number of co-occurrences of words in the
same case.
<case number> - display the keyword occurrence or frequency for each individual case.
ANY INDEPENDENT VARIABLE - If numeric, categorical or date variables were selected as
independent variables, their names will appear in this list box. Selecting any of those variable
names will display a contingency table allowing for the assessment of the relationship between this
variable and the keywords or content categories. When a date variable is selected, a dialog box like
the one below will appear allowing one to automatically recode those dates into various time
periods or date units like week days or months.
SORT BY - The SORT BY option presents the opportunity to sort the table by keyword or category
names (alphabetical order) or by descending order of frequency or case occurrence. When a statistic
is displayed (see option STATISTICS), the table can also be sorted based on the value of this statistic
or on its statistical probability. It is also possible to sort on the values of any specific column by
clicking this column heading. Clicking several times on the same column heading toggles between
ascending and descending sort orders.
DISPLAY - The DISPLAY list box allows one to specify the information displayed in the table. The
following options are available:
Count
Row percent
Column percent
Total percent
When the TABULATE option is set to Case Occurrence, two additional statistics are also available:
Percent of cases (percentage of all cases or individuals)
Category percent (percentage of cases or individuals in this subgroup)
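The first four DISPLAY options can be illustrated on a small contingency table. The counts and the function name are hypothetical, and the sketch assumes that no row or column total is zero.

```python
def table_percentages(counts):
    """Return row, column and total percentages for a table of counts.

    `counts` is a list of rows (one per keyword); columns are subgroups.
    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in counts]
    column_totals = [sum(column) for column in zip(*counts)]
    grand_total = sum(row_totals)
    row_pct = [[100.0 * v / row_totals[i] for v in row]
               for i, row in enumerate(counts)]
    column_pct = [[100.0 * v / column_totals[j] for j, v in enumerate(row)]
                  for row in counts]
    total_pct = [[100.0 * v / grand_total for v in row] for row in counts]
    return row_pct, column_pct, total_pct

# Hypothetical counts: two keywords (rows) by two subgroups (columns).
counts = [[10, 30],
          [20, 40]]
row_pct, column_pct, total_pct = table_percentages(counts)
```

Row percentages sum to 100 across each row, column percentages down each column, and total percentages over the whole table.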
STATISTIC - When keyword frequency or occurrence is broken down by an independent variable (see
WITH option), a drop down list box will appear. This list box allows one to choose among 12
association measures to assess the relationship between this independent variable and the utilization
of each word or category.
PROBABILITY - The probability option allows one to select whether the probability value should be
computed using a 1-tailed or 2-tailed test. Probabilities of Chi-square, Likelihood ratio, and Student's
F are always computed using a 2-tailed test.
AGREEMENT - When comparing word or category usage between different alphanumeric variables, a
drop down list box will appear. This list box allows one to choose among 8 different inter-rater
agreement measures to assess the reliability of coding.
The button is used to reapply the content analysis process to the current data set. This button is
disabled by default and becomes enabled when changes are made to any one of the currently active text
analysis processes, such as the categorization dictionary, the exclusion list or the substitution process.
Clicking this button will instruct WordStat to reprocess the text collection and update the current table.
Set the TABULATE and DISPLAY options so that the information you want to visualize is
displayed in the table.
Using the mouse, select the rows you would like to display. Multiple disjoint rows can be
selected through clicking while holding down the CTRL key.
Click the button or press the right button of the mouse and select the Chart Selected Rows
command.
A dialog box like this one will appear:
This window allows one to graphically examine the relationship between codes and values of an
independent variable. The bar chart should preferably be used to display the distribution of various
categories within subgroups as defined by a nominal independent variable, while the line chart should
preferably be used to examine the relationship between those categories and an ordinal or quantitative
variable.
A quick way to retrieve documents, paragraphs or sentences associated with a specific bar or pie slice
is by right-clicking and selecting the keyword retrieval command.
Controls Description
Press this button to vertically display the labels on the bottom axis.
Press this button to append a copy of the graphic in the Report Manager. A
descriptive title will be provided automatically. To edit this title or to enter a new
one, hold down the SHIFT keyboard key while clicking this button. For more
information on the Report Manager, see the Report Management Feature (page
191).
Press this button to save a chart on disk. Charts are saved in a proprietary format and
may be edited and customized using the Chart Editor.
Pressing this button allows you to print a copy of the displayed chart.
Pressing this button causes the values represented on the bottom axis to be exchanged
with those represented by different lines or bars (legend).
Click this button to turn on/off the 3-D perspective for the current chart.
This button allows you to edit various features of the chart such as the left and bottom
axes, the chart and axis titles, the location of the legend, etc.
This button is used to create a copy of the chart to the clipboard. When this button is
clicked, a pop-up menu appears allowing you to select whether the chart should be
copied as a bitmap or as a metafile.
Pressing this button closes the chart dialog box and returns to WordStat's main
screen.
Clicking the button on the chart dialog box gives access to a dialog box to customize the
appearance of bar charts and line charts. The options available in this dialog box represent only a small
portion of all settings available.
To further customize the chart, modify data points, value labels, or series order, click the button
located to the right-hand side of the dialog box.
LEGEND
Location - This option positions the legend. Legends may be placed at Top, Left, Right and Bottom
side of the chart.
From top - When the legend is displayed on the left or the right side of the chart, this option specifies
the legend's top position in percent of total chart height.
From left - When the legend is displayed at the top or the bottom of the chart, this option specifies the
legend's left position in percent of total chart width.
TITLES
Proper titles and axis labels are of utmost importance when describing the information displayed in a
chart. By default, WordStat uses variable names and labels as well as other predefined settings to
provide such descriptions.
The title page allows one to modify the top title, as well as the labels on the left, bottom and right axes.
To edit the title, select the proper radio button. Enter several lines of text for each title by pressing the
<Enter> key at the end of a line before entering the next line.
The Font button to the right-hand side of the edit box allows changing the font size or style of the
related title.
3-D VIEW
Orthogonal - Turning this option off disables the free elevation and rotation of the 3-D chart.
Zoom - This option zooms the whole chart. Expressed as a percentage, increasing the value positively
will bring the chart towards the viewer, increasing the overall chart size as the Zoom value increases.
3-D Percent - The 3-D Percent property indicates the size ratio between chart dimensions and chart
depth by specifying a percent number from 1 to 100.
Perspective - Use this property with Orthogonal unchecked to modify the 3-D perspective of the
Chart. Larger values add more depth perspective.
Bar shadow - Enabling this option adds dark shades to the sides of 3-D bars. Turning it off will color
the sides of the bar the same as the front.
Bar width - This option determines the percent of total bar width used. Setting this value to 100 makes
adjacent bars touch each other.
Click the button. A descriptive title will be provided automatically for the table.
To edit this title or to enter a new one, hold down the SHIFT keyboard key while clicking this
button.
For more information, see the Report Management Feature topic on page 191.
When displaying rules, only the keywords or key phrases associated with the first item of those rules are
displayed. For example, in a rule like:
the KWIC list will contain only items in the SATISFACTION category meeting the conditions specified by
this rule.
Once an inconsistency has been detected, it becomes possible to reduce it by making changes to the textual
data or to the dictionaries. For example, the researcher may change all occurrences of the word KILL in the
original text for either KILL1 or KILL2 in order to differentiate the different meanings and then add only
one of these modified words (say KILL1) to the substitution or inclusion dictionary. The word KILLY may
also be added to the dictionary of excluded words. The categorization of phrases may also be used to
distinguish various meanings of a word. For example, the use of KIND to refer to the adjective
("considerate and helpful nature") may be reliably differentiated from the use of KIND as a noun ("category
of things") or as an adverb by categorizing the phrase "KIND OF" as instances of this word used as a noun
or as an adverb and by categorizing the remaining instances of KIND as the adjective. Disambiguation may
also be performed by identifying words in close proximity that are associated with specific meanings and by
creating categorization rules (see Working with Rules on page 124).
The KWIC technique is also useful to highlight syntactical or semantic differences in word usage between
individuals or subgroups of individuals. For example, candidates from two different political parties may use
the word "rights" in their discourses at the same relative frequency, but we may find that these two groups
use this word with quite different meanings. We may also find that the meaning of a word like "moral"
evolves with the age of a child.
The upper-right part of the screen provides a list of all instances of keywords associated with a dictionary
category or of a specific word or phrase along with its surrounding text. The panel on the left shows a tree
view of items and their context in descending order of frequency and may be used to browse through and
filter through the KWIC list on the right. The text panel below the KWIC table displays the full document
from which the selected keyword comes and highlights it. The text panel can be used to examine the
full context of a keyword, but may also be used to add words and phrases to the current categorization
dictionary or to the exclusion list. To assign a word or a phrase to a list or content category, position the text
cursor on the word you want to assign, or select one or several words with the mouse and right-click to
display a contextual menu. Select the To Categorization Dictionary or the To Exclusion List menu item.
LIST - This option specifies whether the words displayed in the KWIC table should
be selected either from the list of included words or from the list of all remaining words that have not been
explicitly excluded. The option User Specified allows one to enter a word or word pattern at the
keyboard and search for all instances of this expression.
WORD - This option allows one to choose among all keywords belonging to the list of Included or
Leftover words (see above). When the LIST option is set to User Specified, this option becomes an
edit box where one can type a word or word pattern. (Wildcards such as * and ? are supported).
SORT BY - This option allows for sorting the keyword-in-context table in ascending order on
any of the following options:
Case number - The KWIC table is sorted in ascending order of case position.
Once the settings have been set, click the button to start searching all instances of the selected
keyword.
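A keyword-in-context search with a user-specified word pattern can be sketched as follows. This is an illustration only: the function name, the whitespace tokenization, the context width and the use of shell-style matching for * and ? are assumptions, not WordStat's actual search routine.

```python
from fnmatch import fnmatchcase

def kwic(documents, pattern, context=3):
    """Return (case number, left context, keyword, right context) rows
    for every word matching `pattern`, in ascending order of case number."""
    rows = []
    for case_number, text in enumerate(documents, start=1):
        words = text.split()
        for position, word in enumerate(words):
            # Compare in upper case, ignoring punctuation attached to the word.
            token = word.upper().strip('.,;:!?"')
            if fnmatchcase(token, pattern.upper()):
                left = " ".join(words[max(0, position - context):position])
                right = " ".join(words[position + 1:position + 1 + context])
                rows.append((case_number, left, word, right))
    return rows

# Hypothetical cases.
documents = [
    "The tumor was removed from the brain.",
    "A brain tumour is a serious kind of cancer.",
]
rows = kwic(documents, "TUM*R", context=2)
```

The pattern TUM*R retrieves both spellings, each with two words of context on either side.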
Clicking the button produces a concordance report on the keywords currently displayed in the KWIC
table. The sort order and context delimiter of the current KWIC table are used to determine the display
order and the amount of context displayed in this concordance report. This report is displayed in a text
editor dialog box (see below) and may be modified, stored on disk in RTF, HTML or plain text format,
printed, or cut and pasted to another application. Graphics may also be pasted anywhere in this report.
Settings
The Settings page allows one to select which variable contains the values to predict and choose a validation
method.
n-folds - This method consists of splitting the training set into smaller partitions and testing each
partition on the classification performance obtained by a model developed on the remaining ones. For
example, when using a five-fold cross-validation method, the training set is divided randomly into five
subsets, each containing approximately 20% of the documents. For each subset, the program tests the
accuracy obtained by a classification model developed on the remaining 80% of the original training
set. The performances obtained on all five classifiers are then used to estimate the performance of the
External file - A more conventional method for assessing the performance of a classifier is to test the
accuracy of the classifier on an entirely different set of documents that have also been classified but are
totally independent of the training set on which the categorization model is based. To perform such a
test, WordStat requires the test set to be stored in a different data file. When this option is selected, an
Open File dialog box is displayed allowing one to identify the file containing the external set. WordStat
then displays a dialog box like the one below allowing one to choose the text variable containing the
documents to be used for classification and the numerical variable containing the class to which this
document belongs. Once set, click OK to return to the classification page.
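The n-folds method described above can be sketched as follows. The function name, the shuffling with a fixed seed and the round-robin assignment of documents to folds are assumptions made for the example.

```python
import random

def n_fold_splits(n_documents, n_folds=5, seed=0):
    """Yield (training indices, test indices) pairs, one per fold.

    Documents are shuffled, dealt into `n_folds` partitions, and each
    partition is held out once while the others form the training set.
    """
    indices = list(range(n_documents))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

splits = list(n_fold_splits(100, n_folds=5))
```

With 100 documents and five folds, each model is developed on 80 documents and tested on the remaining 20, and every document is tested exactly once.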
Once the variable and the validation method have been set, click the button to continue.
WordStat will compute all statistics needed, and will then automatically move to the Select Features page.
The strength of the relationship between an item and the classes of the categorical variable can be
computed either on the occurrence (present or absent), on the frequency of items in each class, or on
the percentage of words. To change the base statistic used for assessing differences among classes, set
the Compute statistics on list to the proper option.
The discriminative strength of each item is assessed using three statistics and is presented in a table
containing the following information:
Global Chi - The overall chi-square value computed on all classes of the categorical variable.
Max Chi - The chi-square value computed on the class with the highest case occurrence or
frequency against all the other classes.
Biserial - The biserial correlation computed between the class of the categorical variable
with the highest case occurrences and the remaining classes. This coefficient
assumes that the presence or absence of a class is determined by a trait normally
distributed. Contrary to the standard correlation coefficient, this measure of
association may yield a value lower than -1.0 or higher than +1.0.
Predict - Indicates the class in which the item most frequently occurs. When the highest
case occurrence appears for more than one class, the column includes the labels
of all those classes.
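The difference between the Global Chi and Max Chi statistics can be sketched with the standard Pearson chi-square computed on case occurrence. The counts and function names are hypothetical, the class with the highest occurrence is chosen by raw count (an assumption), and every row and column total is assumed to be non-zero.

```python
def chi_square(table):
    """Pearson chi-square for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    chi = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            chi += (observed - expected) ** 2 / expected
    return chi

def global_and_max_chi(present, class_sizes):
    """`present[i]` is the number of cases of class i containing the item."""
    table = [[p, n - p] for p, n in zip(present, class_sizes)]
    top = max(range(len(present)), key=lambda i: present[i])
    rest_present = sum(present) - present[top]
    rest_size = sum(class_sizes) - class_sizes[top]
    # Collapse all other classes into one row: top class against the rest.
    collapsed = [table[top], [rest_present, rest_size - rest_present]]
    return chi_square(table), chi_square(collapsed), top

# Hypothetical item found in 40, 20 and 10 cases of three classes of 50 cases.
global_chi, max_chi, predicted = global_and_max_chi([40, 20, 10], [50, 50, 50])
```

In this example the global value is 37.5 and the value for the top class against the rest is about 33.48.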
Clicking any column header sorts the table in ascending order of the data in that column. Clicking the
same column header a second time sorts its content in descending order. The check boxes in the first
column show, by default, that all items are to be included in the classification model. To manually
remove an item, simply click in the box to remove the check mark.
The FILTER TABLE option allows one to display only the terms characteristic of a specific class. It
may also be set to <selected> to display only items that have been selected for inclusion in the
categorization model. To display all items set this option to <all classes>.
The lower portion of the page displays a bar chart with the percentage of cases in each class of the
categorical variable containing the selected item. Classes are presented in descending order of case
occurrence. This graph is synchronized with the above table so that changing the selected item in this
table results in the display of the corresponding distribution chart.
To display the chart on the right-hand side of the table, click the button. To bring the bar chart
back to the bottom of the table, click the button.
To access the feature selection dialog box, click the button. The following dialog box will
appear:
The panel at the top of the page allows one to select the machine-learning algorithm to use, set
various analysis options and choose how the classification model will be tested.
Learning options
USE - This option is used to select the item statistic to be used in training and classification. Choosing
Case Occurrence results in the use of binary weights, indicating whether or not a word or keyword
occurs in the document. Selecting Keyword Frequency allows one to use additional information
related to how often this item occurs in each document. Percentage of Words and Percentage of
Keywords provide two methods to normalize the obtained frequency to take into account the
document length. Such normalization is performed by dividing the frequency either by the total
number of words found in the document or the total number of keywords that have been extracted by
WordStat.
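The four statistics offered by the USE option can be sketched as follows. This is only an illustration of the definitions given above, not WordStat's implementation; the function and key names are hypothetical, and tokenization is reduced to a whitespace split.

```python
from collections import Counter


def item_statistics(tokens, keywords):
    """Return the four candidate weights for each keyword in one document.

    tokens   -- all words of the document, in order
    keywords -- the set of items extracted for classification
    """
    counts = Counter(t for t in tokens if t in keywords)
    total_words = len(tokens)
    total_keywords = sum(counts.values())
    stats = {}
    for item in keywords:
        freq = counts.get(item, 0)
        stats[item] = {
            # Binary weight: does the item occur at all?
            "case_occurrence": 1 if freq > 0 else 0,
            # Raw number of occurrences in the document.
            "keyword_frequency": freq,
            # Frequency normalized by document length in words.
            "pct_of_words": 100.0 * freq / total_words if total_words else 0.0,
            # Frequency normalized by the number of extracted keywords.
            "pct_of_keywords": 100.0 * freq / total_keywords if total_keywords else 0.0,
        }
    return stats


doc = "the service was good and the staff was good".split()
stats = item_statistics(doc, {"good", "service", "bad"})
```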
FEATURE WEIGHTING - Feature weighting has been presented as an alternative to feature selection or
as a way to further improve classification accuracy from selected item sets. This method consists of
giving more weight to items that are rather good at differentiating documents from distinct classes
and negligible weight to those that are distributed evenly among classes. The most frequently used
weight in information retrieval is the TF*IDF measure where the frequency of an item is adjusted to
take into account the number of documents containing this item. However, such a weighting can be
considered to be only a crude approximation of the capacity of the item to differentiate documents
from distinct classes. More accurate performance of the classifier can be expected from using a
weight based on a more direct indicator of this discriminative capability such as the Global Chi-
square or the Max Chi described previously.
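For reference, one common textbook variant of the TF*IDF weight multiplies the item frequency by the logarithm of the inverse document frequency. The sketch below uses that variant as an assumption; the manual does not specify which formula WordStat applies, and the function name is hypothetical.

```python
import math


def tf_idf(term_freqs):
    """term_freqs: list of dicts, one per document, mapping item -> frequency."""
    n_docs = len(term_freqs)
    # Number of documents containing each item.
    doc_freq = {}
    for doc in term_freqs:
        for item, freq in doc.items():
            if freq > 0:
                doc_freq[item] = doc_freq.get(item, 0) + 1
    # ASSUMPTION: weight = frequency * ln(N / document frequency).
    return [
        {item: freq * math.log(n_docs / doc_freq[item])
         for item, freq in doc.items() if freq > 0}
        for doc in term_freqs
    ]


docs = [{"price": 2, "staff": 1}, {"price": 1}, {"price": 3, "room": 2}]
weights = tf_idf(docs)
```

In this example "price" occurs in every document and therefore receives a weight of zero, which shows how the measure discounts items that are spread across all documents regardless of class.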
Results
A common way of assessing the accuracy of a classifier is by comparing the accuracy of predicted
class membership against actual membership. Such information is provided by the Confusion Matrix
where each predicted class is plotted against the actual class. Accurate predictions are plotted in the
diagonal going from the top left to the bottom right of the table. Values in this diagonal are printed in
bold characters for easy identification. Values in cells below or above this diagonal represent
classification errors. Besides the actual number of documents in each cell, the table shows the row,
column and total percentages. Row percentages represent the percentage of documents in a class that have
been classified in a specific way, while column percentages express the percentage of a specific
prediction actually belonging to a known class. This table may be used to identify which classes are the
easiest or hardest to predict, as well as which classification errors are the most common. To facilitate
comparisons across the classes of the categorical variable, two related statistics are printed on the right
of the table: Precision is the probability that documents identified as belonging to a class are correctly
classified, and Recall is the probability that documents in a class are correctly identified.
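The relationship between the confusion matrix and these two statistics can be sketched in a few lines of Python. The sketch assumes rows hold the actual classes and columns the predicted classes, which matches the description of row and column percentages above; the function name and the counts are hypothetical.

```python
def precision_recall(matrix, labels):
    """matrix[i][j] = number of documents of actual class i predicted as class j."""
    results = {}
    for k, label in enumerate(labels):
        correct = matrix[k][k]                            # diagonal cell
        actual_total = sum(matrix[k])                     # row total
        predicted_total = sum(row[k] for row in matrix)   # column total
        results[label] = {
            "precision": correct / predicted_total if predicted_total else 0.0,
            "recall": correct / actual_total if actual_total else 0.0,
        }
    return results


labels = ["positive", "negative"]
matrix = [[40, 10],   # actual positive: 40 predicted positive, 10 predicted negative
          [5, 45]]    # actual negative: 5 predicted positive, 45 predicted negative
scores = precision_recall(matrix, labels)
```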
Several statistics are provided to assess the global performance of the classifier. The Nominal
Accuracy measure is the proportion of documents correctly classified. It is considered a micro-average
statistic since it gives equal weight to documents regardless of how they are distributed among classes
of the categorical variable. The Average Precision and Average Recall measures are macro-average
statistics obtained by computing the mean precision and recall obtained for every class. The Ordinal
Data from prior trials are presented either in the form of a table (Table page) or as a line chart (Graph
page). Clicking any column header of the table sorts it in ascending order of the data in this column.
Clicking the same column header a second time sorts its content in descending order. By default, the
displayed statistics are computed for all classes of the categorical variable. To restrict the display of
either the table or the line chart to statistics related to a single class, set the Class list box to the desired
class. Setting this option to <all classes> brings back the micro- and macro-average statistics computed
on all classes.
Data from specific trials may be deleted from the table by selecting their rows and clicking the
button.
The Graph page allows one to compare the performance of various settings and the relationship
between those settings using a line chart like the one shown below.
Control Description
Press this button to save either the table or the chart on disk. The table may be saved
to disk in Excel, plain ASCII, text delimited, or HTML format. Charts may be saved in BMP,
JPG or PNG graphic file format or may be stored on disk in a proprietary format
(.WSX file extension) that may later be edited and customized using the Chart Editor.
Pressing this button prints a copy of the displayed table or chart.
This button allows the editing of various features of the chart such as the left and
bottom axis, the chart and axis titles, the location of the legend, etc.
This button is used to create a copy of the chart to the clipboard. When this button is
clicked, a pop-up menu appears allowing one to select whether the chart should be
copied as a bitmap or as a metafile.
The History page also gives access to the Experiment feature where one can quickly perform a series
of classification experiments. To access this feature click the button.
The basic principle of this dialog box is to create a list of classification experiments involving different
settings and then to instruct WordStat to perform all those experiments one after the other. Experiments
first need to be defined in the upper part of the dialog box and then to be moved to the table of
experiments located at the bottom of the dialog box. After several experiments have been defined and
added to the list, one can execute all of them at once.
The first steps involve setting the experiment options. The Feature Selection option located at the top of
the dialog box provides automatic feature selection options similar to those available from the first
page of the document classification dialog box with only one exception: while the original dialog box
allows one to select only one feature set size at a time, the current dialog box allows one to set
numerous feature set sizes at once. For example, by entering the following string in the Select edit box:
50 100 150 200 300
five classification experiments will be performed using the same analysis settings but on feature set
sizes of 50, 100, 150, 200 and 300, picked out using the chosen selection method. For more information
on the Statistics, Based on and Optimization options, see Performing Automatic Feature Selection
(page 114).
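Conceptually, the string typed in the Select edit box expands into one experiment per feature set size. A minimal sketch of that expansion follows; the function name and the settings keys are hypothetical and only stand in for whatever analysis settings are currently selected.

```python
def expand_experiments(select_string, base_settings):
    """Create one experiment definition per feature set size in select_string."""
    sizes = [int(token) for token in select_string.split()]
    # Every experiment shares the same settings and differs only in its size.
    return [dict(base_settings, feature_set_size=size) for size in sizes]


experiments = expand_experiments(
    "50 100 150 200 300",
    {"selection_statistic": "Global Chi"},  # hypothetical setting
)
```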
The Learning and Testing groups of options also provide settings similar to those available on the
Learn & Test page. Please refer to that section for information on the available algorithms and options.
Clicking the button displays a dialog box that allows one to quickly define numerous variations of
the current classifier and to add all those at once to the list of experiments to be performed.
Click the button. The program performs all the experiments in the list that had not been
executed before. Once an experiment is completed, the Executed column for this item is set to Yes
preventing the program from executing the same experiments twice. Results of every experiment are
automatically appended to the History page.
Click the button to close the dialog box and return to the History page of the classification dialog box.
The document classification feature supports numerous file formats, such as plain ASCII text files as
well as HTML, Rich Text, MS Word, WordPerfect, and Acrobat PDF files. Detailed results of
classifications are displayed in a table at the bottom of the dialog box and may be either saved to disk
or printed. When applied to the current database or another database, the automatic classification
feature may be useful to categorize unclassified documents or to review existing classifications based
on the results of the new classifier.
Click the Classify button to apply the current classifier to all documents in the list.
Select one or several text or document variables that will be used for classification purposes and
click OK. The content of the data file is displayed in a table, while the text to be classified is
displayed on its right. You can resize this text window by dragging its left border.
Click the Classify button to apply the current classifier to all documents contained in the selected
text variables.
To store the predicted class or the computed score obtained for every class, click the button.
A dialog box similar to the following will appear:
Click the button located at the top of the Learn & Test page. A standard Save File dialog
box will appear.
Enter the file name under which you would like to save the classifier and then click the Save
button. WordStat will automatically provide a .wclas file extension.
Document classifiers may be retrieved and applied to new documents using the WordStat Document
Classifier utility program (see page 184). A special Software Developer's Kit (SDK) is also available
upon request from Provalis Research allowing any programmer to integrate WordStat categorization
and classification technologies into one's own database or document management system (see page
188).
Click the button located to the right of the exclusion list or of the categorization dictionary.
A dialog box enables specifying the name and location of the new dictionary file. If a dictionary file
is already active, it will ask whether existing entries should be copied to the new dictionary file. If
you answer Yes, all entries in the previously opened dictionary will be retrieved and stored in the
new one. Answering No will result in an empty dictionary.
In the Dictionary Viewer group box, select the dictionary to which you would like to add new words.
Select the rows containing the words you would like to add.
Select the dictionary to which you would like to add the selected words.
Select the rows containing the words you would like to add.
Press the right button of the mouse to display a pop-up menu.
Select the dictionary to which you would like to add the selected words.
Note: You may also drag and drop a word into the dictionary panel to the right of the frequency table (see
Using the Dictionary Panel, page 41).
From the text editor
Click the button or press the right button of the mouse to display a pop-up menu.
Select the dictionary to which you would like to add the selected words.
If you choose to add a word to the exclusion list, the word will automatically be stored in this file
without any dialog box. If the Inclusion dictionary is selected, the program will display a dialog box
similar to the following:
Weights can also be assigned to specific entries, so that a specific word, word pattern, or
expression may count for more than one instance of the concept. The default value for this option is 1.
If you want to add a word to a category that does not exist yet, you first need to create that category (see
below) and then follow the above steps to add the word to this new category.
In the Dictionary Viewer group box, select the Inclusion radio button.
Press the button and select Category. The Add Categories dialog box will appear.
Select the Main Category or the Subcategory radio button depending on whether you want this new
category to appear at the main level or whether you want it to be created under an existing category.
If you choose to create a sub-category, you then need to select from the Location outline the category
under which you would like to store it.
Type the category names you would like to add in the edit box, one item per line, and click the Add
button.
Select the words or categories you would like to delete and click the button. If a
non-empty category is selected, you will be asked to confirm its deletion. If you answer Yes, all
words and subcategories belonging to this category will also be erased.
Select the item you want to modify and click the button.
By default, the dragged item is stored under the category at the cursor position. To move a word or a
category to the main level or to the same level as the category under the cursor, simply hold the ALT
key while dropping the dragged item.
Importing Dictionaries
To import a dictionary stored in an Excel, CSV or tab-delimited file, the data file has to be formatted such
that the names of categories are listed in up to four columns, and the items such as words and phrases are
listed in a separate column. The data file may also contain an additional column containing the weight to be
used for each item, expressed as a positive integer or floating-point value.
The first row must contain a header that identifies the content of each column. A typical
dictionary table consisting of items stored in a hierarchical dictionary with two levels (main categories and
subcategories) along with individual weights may look like this:
A sparse version of it that omits repetitions of category names may look like this:
To import such kinds of files, move to the Dictionaries page of WordStat and click the IMPORT button
located on the far right of the Categorization check box. An Open File dialog box will appear allowing you
to select the file containing the dictionary you want to import. Once selected, click Open. A dialog box
similar to this one will appear:
If the names of the categories are not fully specified as in our first example, but instead resemble the
second table where only the first occurrence of each category name has been typed in, then one has to select
the Sparse Entry of Category check box to make sure that the items will be imported and stored in their
proper category.
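The effect of the Sparse Entry of Category option can be pictured as filling each blank category cell with the last value seen in that column. The Python sketch below illustrates the idea on a small CSV table; the column names, the sample entries, the function name, and the exact fill rule (a new name at one level clears the deeper levels) are all assumptions made for illustration and are not taken from WordStat.

```python
import csv
import io


def read_dictionary(text, category_columns, item_column, weight_column, sparse=False):
    """Parse a delimited dictionary table into (category path, item, weight) rows."""
    rows = []
    previous = [""] * len(category_columns)
    for record in csv.DictReader(io.StringIO(text)):
        path = [(record[col] or "").strip() for col in category_columns]
        if sparse:
            filled, new_branch = [], False
            for level, name in enumerate(path):
                if name:
                    # A typed-in name starts a new branch at this level.
                    new_branch = True
                    filled.append(name)
                else:
                    # Blank cell: inherit from the row above unless a parent changed.
                    filled.append("" if new_branch else previous[level])
            path = filled
        previous = path
        weight = float(record.get(weight_column) or 1)  # the default weight is 1
        rows.append((tuple(n for n in path if n), record[item_column].strip(), weight))
    return rows


# Hypothetical sparse table with two category levels and a weight column.
table = "\n".join([
    "Category,Subcategory,Item,Weight",
    "EMOTION,POSITIVE,happy,1",
    ",,joyful,1.5",
    ",NEGATIVE,sad,1",
    "COGNITION,,think,2",
])
entries = read_dictionary(table, ["Category", "Subcategory"], "Item", "Weight", sparse=True)
```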
Once all the options have been set, click the OK button. A Save File dialog box will ask you to enter a
dictionary file name. WordStat will point to the default folder where other dictionaries are stored. You may
browse to a different folder to store the new dictionary at another location. Once saved, the newly created
dictionary will become the default one.
Exporting Dictionaries
To export a dictionary, move to the Dictionaries page of WordStat and click the EXPORT button located on
the far right of the Categorization check box. A Save File dialog box will appear. Select the proper file
format under which you would like to save the file, type the name of the file and click OK to create it.
Merging Dictionaries
The WordStat Merge feature allows one to append categories and items contained in one categorization
dictionary into another dictionary. To merge dictionaries:
From the dictionary page of WordStat, open the dictionary into which you would like to import
new categories.
Select the categories you would like to import by clicking in the box beside the desired
category(ies) and clicking OK. To select all items, right click anywhere on the list of categories
and select Check All. Choosing Uncheck All removes all check marks previously entered.
If an imported category already exists in the currently active dictionary, WordStat ignores duplicate
items and only imports new items not already found in the original category. New categories are
appended to the existing structure along with all their items.
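The merge behavior described above can be sketched as follows. To keep the example short, a dictionary is modeled as a flat mapping from category name to a list of items, which ignores subcategories; this simplification and the function name are assumptions.

```python
def merge_dictionaries(active, imported, selected):
    """Append the selected categories of `imported` to `active`.

    Both dictionaries map a category name to a list of items.
    """
    for category in selected:
        if category in active:
            # Existing category: import only items not already present.
            existing = set(active[category])
            for item in imported[category]:
                if item not in existing:
                    active[category].append(item)
                    existing.add(item)
        else:
            # New category: append it along with all its items.
            active[category] = list(imported[category])
    return active


active = {"POSITIVE": ["good", "great"]}
imported = {"POSITIVE": ["great", "excellent"], "NEGATIVE": ["bad"], "NEUTRAL": ["okay"]}
merged = merge_dictionaries(active, imported, ["POSITIVE", "NEGATIVE"])
```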
FONT SIZE - This option may be used to adjust the font size used to print dictionary items.
NUMBER OF COLUMNS - Dictionaries may be printed with up to seven columns per page, allowing
one to print large dictionaries on fewer pages. Please note that when increasing the number of
columns per page, it may be necessary to decrease the font size to prevent the overlapping of items in
adjacent columns.
START NEW PAGE ON ROOT CATEGORY - Root categories are the dictionary categories still visible
when the dictionary is fully collapsed. Selecting this option instructs the program to start the printing
of all items starting from this root category at the top of a new page.
FOOTER - Enable this option to print a footer at the bottom of each page. A footer can consist of up to
three items: The Filename (printed on the left margin of the footer), the Page Number (located at the
bottom center of the page), and the Date (printed on the right margin of the footer).
To access the dictionary options of the currently active dictionary, click the button located
on the Dictionary page.
Description page
The first page of this dialog box allows one to enter text to describe the dictionary. Use this
description to inform other users about the intended use of a created dictionary, its assumptions, its
strengths and limitations, etc. Such an option may also be used to document for personal use how the
dictionary was created, what remains to be done, etc. A check box located in the upper left-hand
corner of this page can be used to automatically display this description in a dialog box when a user
opens the dictionary. The Version edit box allows you to insert a short alphanumeric string that will be used to
identify the version number of the dictionary.
Lock Options
Once a categorization dictionary has been developed and validated, one may want to restrict further
changes to the dictionary or prevent other users from setting content analysis options that would be
incompatible with this dictionary. For example, a categorization dictionary may require the presence
of a specific exclusion list to handle exceptions or may assume that the program accepts specific
characters or requires prior lemmatization of the documents. The Lock Options page allows one to
restrict operations and changes in analysis options for the currently active dictionary. However,
please note that those restrictions are neither permanent nor password protected. They may be
overridden at any time by accessing the Dictionary Options dialog box again and by removing any of
the restrictions that have been set previously.
To regain access to those editing operations, click the button and disable the items that
correspond to the dictionary to edit.
All the remaining options on this dialog box are used to prevent further changes to various options
found on the Dictionary or the Options page. Before locking any item, you have to make sure the
corresponding option is properly set. For example, if the categorization dictionary is not compatible
with prior stemming and requires some special characters to be treated as valid, you first need to
disable the Stemming option on the dictionary page, move to the Options page and enter all those
special characters in the Valid Characters edit box prior to locking those options in the Lock
Options page.
Please note that the dictionary description as well as any restrictions applied to a dictionary are
automatically saved in a file with the same name as the categorization dictionary file but with an
.NFO file extension. To make sure the description and the various options follow the dictionary,
always make sure to include this .NFO file along with the .CAT categorization dictionary file.
the first item, SATISFIED, refers to a single word while #PROFESSOR will match any item found in the
PROFESSOR content category.
Just like words or phrases, rules may be stored anywhere in a categorization dictionary. A rule consists of a
target item and from one to four conditions, each condition consisting of another item linked to the first
item using a Boolean (AND, NOT) or a proximity operator (NEAR, BEFORE and AFTER, or their
negative forms, NOT NEAR, NOT BEFORE, NOT AFTER). The context in which those conditions will be
tested also needs to be specified, allowing one to either consider the content of the entire document or
restrict the test to a single paragraph or a single sentence. When a proximity operation is used, one also has
to specify the maximum distance in number of words that must separate the two items in order for this
proximity condition to be tested as true or false.
To create a rule, click the button and select the RULES menu item. A dialog box similar to the
one below will appear:
item1 AND item2 ...both items occur in the same document, paragraph or sentence.
item1 NOT item2 ...the first item occurs in the document, paragraph or sentence but not the
second one.
item1 NEAR item2 ...both items occur in the same document, paragraph or sentence, and are
no more than n words apart.
item1 BEFORE item2 ...both items occur in the same document, paragraph or sentence, and the
second item appears after the first one within the next n words.
item1 AFTER item2 ...both items occur in the same document, paragraph or sentence and the
first item appears after the second one within the next n words.
item1 NOT NEAR item2 ...the first item occurs in a document, paragraph or sentence, and is not
found within n words of the second item.
item1 NOT BEFORE item2 ...the first item occurs in a document, paragraph or sentence, and is not
followed within n words by the second item.
item1 NOT AFTER item2 ...the first item occurs in a document, paragraph or sentence, and does not
occur within n words after the second item.
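The operators in the table above can be sketched in Python on a tokenized context (a document, paragraph or sentence). The sketch makes several assumptions that the manual does not confirm: distance is measured as the difference between word positions, a condition is true if at least one occurrence of the first item satisfies it, and items are single words. The function name is hypothetical.

```python
def rule_matches(tokens, item1, operator, item2, n=0):
    """Evaluate `item1 OPERATOR item2` within one context given as a list of words."""
    first = [i for i, tok in enumerate(tokens) if tok == item1]
    second = [i for i, tok in enumerate(tokens) if tok == item2]
    if operator == "AND":
        return bool(first) and bool(second)
    if operator == "NOT":
        return bool(first) and not second

    # Proximity tests for one occurrence `a` of the first item.
    checks = {
        "NEAR": lambda a: any(abs(a - b) <= n for b in second),
        "BEFORE": lambda a: any(0 < b - a <= n for b in second),  # item2 follows item1
        "AFTER": lambda a: any(0 < a - b <= n for b in second),   # item1 follows item2
    }
    if operator.startswith("NOT "):
        check = checks[operator[4:]]
        # ASSUMPTION: true if some occurrence of item1 fails the proximity test.
        return any(not check(a) for a in first)
    return any(checks[operator](a) for a in first)


tokens = "i am not satisfied with the professor".split()
near = rule_matches(tokens, "satisfied", "NEAR", "professor", n=3)
negated = rule_matches(tokens, "satisfied", "NOT AFTER", "not", n=2)
```

Here `near` is true because "professor" lies three words after "satisfied", while `negated` is false because "satisfied" does occur within two words after "not".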
By default, operators are set to <none>. To add an additional criterion, set its operator to a valid Boolean or
proximity operator. To remove criteria, set the operator immediately below the last desired criterion to
<none>. When more than one condition is set, you will be asked to specify whether you want to match all
criteria or match any one of those.
Once the rule has been properly defined, click the button located in the lower right-hand corner
of the dialog box to append the rule definition to the selected content category and to clear the form. Once
you have finished entering rules, click the close button to quit this dialog box and return to the WordStat
main screen.
Please note that, in order to prevent any recursive or cross-reference problems in rules, content categories
can only refer to words, word patterns or phrases stored in categories and will thus ignore the presence of
other rules. For example, if a category named #SATISFACTION contains 10 word patterns and three rules,
any reference to this category in a rule will take into account those 10 word patterns and will ignore instances
where any one of the three rules has been found to be true.
A spelling dictionary is used to propose inflected forms of existing words already in your dictionary.
Several dictionaries are currently available for different human languages such as English, French,
Italian, Dutch, etc.
Two English thesauri are also used to propose synonyms of words already in your dictionary.
A WordNet based lexical database is used to find synonyms, antonyms as well as hypernyms,
hyponyms, coordinate terms, holonyms, meronyms, etc. This database contains over 150,000 root
words (including many proper nouns) and offers over 120,000 synonym sets. The availability of word
sense definitions allows for manual as well as automatic filtering of proper word senses.
These three tools are available through the auto suggest panel on the frequency list (see page 42) as well as
through two dictionary-building commands.
The Basic command uses the selected spelling dictionaries and the two thesauri to identify related
synonyms and inflected forms.
The Advanced command gives you access to a more powerful dictionary-building tool that uses a
WordNet based lexical database to find, not only synonyms, but all related words such as hypernyms,
hyponyms, holonyms, meronyms, coordinate terms as well as the selected spell-checking dictionaries
to find inflected forms of those words.
Select the Dictionaries page by clicking the first tab at the top of the main WordStat screen.
WordStat will immediately start looking for synonyms and inflected forms of all words in your
inclusion dictionary and will report them in a dialog box like this one:
Point to the Programs folder in the Windows' Start menu, then select Provalis Research and then click
Dictionary Builder.
To access the advanced dictionary-building tool from within WordStat:
Select the Dictionaries page by clicking the first tab at the top of the main WordStat screen.
The first page is used to set various dictionary and search options. The second and third pages are
used to find words and idioms semantically related to existing entries in the dictionary, while the last
page is used to find derived forms of those entries.
DICTIONARY PAGE
The first page of the dictionary builder program allows you to select or change the WordStat
dictionary, specify the words and categories you want to work with, along with the type of relationship
to look for. It also allows you to specify how the program will search for inflected forms of existing
words in your dictionary.
DEFINITIONS PAGE
Using a comprehensive lexical database such as WordNet to find related words and phrases has one
major drawback. Searching for numerous types of relationship for even small WordStat dictionaries
can yield a huge number of suggested words. For example, when searching for suggested words for a
dictionary containing 129 words grouped under 13 categories, more than 12,000 new words and
phrases were obtained, many of them unrelated to the existing categories. Browsing through such a
huge number of suggestions to find the most relevant ones can be an overwhelming task. The
Definitions page was created to somewhat reduce this burden by providing an intermediary step where
the user can select, for each of the words, the word senses that are the most relevant to the containing
category. The program offers both manual and automatic selections of word senses and also allows one
to combine both methods.
Best - This rule instructs the program to select for each word, the sense that has obtained the
highest relevance score. When selecting the highest score, a 20% tolerance is used so that, on some
occasions, more than one word sense will be selected. This selection rule is the most conservative
one and ensures that relevant word senses are the most likely to be selected. However, we have
also found that this selection method may lack some sensitivity and may fail to select other
relevant word senses (false negatives).
Relevance > 0 - This rule instructs the program to select all word senses that have been found to be
related, even slightly, to the category. This selection rule is very liberal in that it is the most likely
to select most relevant word senses at the cost of a lack of specificity (too many false positives).
Relevance > 0.1 - This rule is slightly more conservative than the previous one, in that it also
rejects all word senses that have obtained a score of 0.1. Besides a score of zero, 0.1 is the lowest
score that may be obtained. Experience has shown that, very often, word senses with such a low
score are unrelated to the category. Removing those word senses thus results in an increase in
specificity along with only a marginal decrease in sensitivity.
The application of any of these three rules is performed by selecting the proper rule from the Select
drop down list. This list box may also be used to select or unselect all definitions.
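The three automatic selection rules can be summarized with a short Python sketch. The manual does not spell out how the 20% tolerance of the Best rule is computed; the sketch assumes it keeps every sense whose score is at least 80% of the highest score. The function name, the sense identifiers and the scores are hypothetical.

```python
def select_senses(scores, rule):
    """scores: dict mapping a word-sense identifier to its relevance score."""
    if rule == "Best":
        best = max(scores.values(), default=0.0)
        # ASSUMPTION: the 20% tolerance keeps senses scoring at least 80% of the best.
        return {s for s, v in scores.items() if v > 0 and v >= 0.8 * best}
    if rule == "Relevance > 0":
        return {s for s, v in scores.items() if v > 0}
    if rule == "Relevance > 0.1":
        # Strictly greater, so a score of exactly 0.1 is rejected.
        return {s for s, v in scores.items() if v > 0.1}
    raise ValueError("unknown selection rule: " + rule)


scores = {"bank#1": 1.0, "bank#2": 0.9, "bank#3": 0.1, "bank#4": 0.0}
best = select_senses(scores, "Best")
liberal = select_senses(scores, "Relevance > 0")
moderate = select_senses(scores, "Relevance > 0.1")
```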
Manual selection of word senses can be carried out either alone or after an automatic selection has
been made by the program. Manual selection is performed simply by browsing through the list of all
definitions and selecting those that are related to the current category while making sure unrelated
definitions are unselected. The decision to include or exclude a specific word sense may rely on the
displayed definition, on the relevance score, and also on the examination of all words that have been
found to be related to this specific word sense. Those suggested words are automatically displayed in
the right panel of the Definition page when the word definition is highlighted.
Selected word senses may be saved on disk by clicking the button, and later retrieved by clicking
the button.
Once the word senses have been chosen, activating the Words page will start the search, extract all
words and phrases related to the selected word senses, and will display them by categories and by the
type of relationship (synonyms, antonyms, etc.)
WORDS PAGE
The Words page displays a list of suggested words and idioms that were found to be related to existing
words in the various categories and allows you to select suggestions and add them to the existing
dictionary. The "All words" page includes a list of all words and idioms that were suggested,
irrespective of their relationship with the existing entries. The remaining pages allow one to examine
those same words by the nature of their relationship with existing entries.
Specificity index
Very often, a word is suggested in more than one category. This is especially true when the
dictionary includes categories that are semantically close to each other. One good example of such a
categorization system is the Lasswell dictionary that tries to differentiate ten different forms of power
relations (power gain, power loss, cooperation, authoritative, conflict, doctrine, etc.). When making a
decision on whether a word should be added to a given category, it is important to consider whether
this word is specific to this category or whether it has also been suggested in other categories. The
Compute Specificity button allows one to obtain a specificity index as well as a list of all the other
categories in which this item also appears. This specificity index is computed by making the sum of
all relevance scores obtained by this word in the various categories and computing the proportion of
this total score that is related to the current category. A specificity of 1.0 indicates that this item has
only been suggested for this category. When the item has been found to be related to more than one
category, a list of all other categories in which it also appears is displayed in the Other Categories
column along with the relevance score obtained in each of those categories. You can use this
information to decide to which category this word should be added.
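The specificity index described above is a simple proportion, which the following Python sketch reproduces. The function name, the category names and the relevance scores are hypothetical.

```python
def specificity(relevance_by_category, category):
    """Share of a word's total relevance score that belongs to `category`."""
    total = sum(relevance_by_category.values())
    return relevance_by_category.get(category, 0.0) / total if total else 0.0


# A word suggested in two categories, and a word suggested in only one.
shared = {"CONFLICT": 3.0, "POWER LOSS": 1.0}
unique = {"COOPERATION": 2.0}

shared_index = specificity(shared, "CONFLICT")       # 3.0 / (3.0 + 1.0)
unique_index = specificity(unique, "COOPERATION")    # only one category
other_categories = {c: s for c, s in shared.items() if c != "CONFLICT"}
```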
The Inflected Form page lists all words whose spelling begins with the same letters as existing words
and that were not already included in the actual dictionary. For example, if the word "understanding" is
found in the dictionary, the program will suggest words like "understandings", "understandingly". If the
Match Partial Word option is enabled (see Dictionary page), this same word will also yield words like
"understands", "understandable", and "understandably". The From column displays the original word
from which the inflected form has been derived.
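The default behavior of the Inflected Form page amounts to a prefix search. The sketch below illustrates it under the assumption that candidate words come from a word list such as a spelling dictionary; it does not model the Match Partial Word option, and the function name and word list are hypothetical.

```python
def inflected_forms(dictionary_words, vocabulary):
    """Return (suggested word, original word) pairs for words starting with a dictionary word."""
    existing = set(dictionary_words)
    suggestions = []
    for candidate in sorted(vocabulary):
        if candidate in existing:
            continue  # already in the dictionary
        for word in dictionary_words:
            if candidate.startswith(word):
                suggestions.append((candidate, word))
                break
    return suggestions


vocabulary = {"understanding", "understandings", "understandingly",
              "understands", "misunderstanding"}
forms = inflected_forms(["understanding"], vocabulary)
```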
Click the button to quit the dictionary builder program and return to WordStat.
Click the button to the right of the substitution list to review all substitutions performed. A dialog
box similar to this one will appear:
The Substitution Process page of this dialog box provides information about the internal process (if any)
involved in the lemmatization or substitution routine as well as a list of all manual substitutions.
Type in the Original edit box the word you would like to replace, then type the replacement word in
the Replace with edit box and click OK to create the new substitution rule. This new rule is
automatically added to the substitution process.
To cancel a modification:
All changes are automatically saved to disk. To cancel any change made to the list of manual
substitutions during the current WordStat session, click the button. A list of all changes
performed in this substitution process will be displayed.
Select the modification you would like to cancel and click the Undo button.
The table presents all performed substitutions in alphabetical order as well as other information (such as
the substituted word, how frequently the substitution has been made, the length of the original word as well as
the inverted version of the original word). Clicking any column header sorts the table in ascending order of
The initial word is automatically entered in the Original edit box. Then type the replacement word
and click OK to create the rule. This new substitution rule is automatically added to a list of
exceptions and will automatically be accessed when using the currently selected lemmatization
routine.
Click the button. A descriptive title will be provided automatically for the table. To edit this
title or to enter a new one, hold down the SHIFT keyboard key while clicking this button (for more
information on the Report Manager, see page 191).
Click the button. If modifications have been made to the list but have not been saved, you will
be prompted whether those modifications need to be saved. Choosing NO will result in the loss of all
changes made to the list since you entered this dialog box or since the last time those modifications
had been saved.
Click the button located to the right of the Preprocessing list box.
Type the name you would like to give to this preprocessing routine and click OK. This name will be
added to the list box of available routines.
Enter the name of the program file including the full path or click the button to display a dialog
box that allows browsing through folders and then select the appropriate program file.
In the Working Dir edit box, specify the working directory for the program if necessary. Specifying
$TEMP as the working directory instructs the program to set the working directory to the temporary
folder.
Enter the parameters to transfer to the program at start-up. Typically, you will transfer the input and
output file names, as well as any command line options needed for the external routine, to perform
$IN The temporary text file created by WordStat and to be processed by the
external routine (the actual file name used is WORDSTAT.IN)
$OUT The name of the text file created by the external routine and retrieved by
WordStat (the actual file name used is WORDSTAT.OUT).
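As an illustration of the file exchange described above, here is a minimal sketch of an external routine written in Python. It assumes that the parameters passed at start-up are simply $IN followed by $OUT, so that the two file names arrive as command line arguments. The lowercasing step, the function names and the cp1252 (ANSI) encoding are placeholders chosen for the example, not documented WordStat behavior.

```python
import sys

def transform(text):
    """Placeholder transformation (hypothetical): lowercase the whole text."""
    return text.lower()

def run(in_path, out_path, encoding="cp1252"):
    # Read the temporary file whose name WordStat substitutes for $IN.
    with open(in_path, "r", encoding=encoding, errors="replace") as src:
        text = src.read()
    # Write the transformed text to the file name substituted for $OUT.
    with open(out_path, "w", encoding=encoding, errors="replace") as dst:
        dst.write(transform(text))

if __name__ == "__main__":
    # Assumed parameter string in the dialog box: $IN $OUT
    if len(sys.argv) == 3:
        run(sys.argv[1], sys.argv[2])
```

Any real routine would replace transform() with its own processing, for example stemming or tagging.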
Click the button located to the right of the Preprocessing list box.
Type the desired name for this preprocessing routine and click OK. This name will be added to the
list box of available routines.
Enter the name of the DLL file including the full path or click the button to display a dialog box
that allows browsing through folders and then select a DLL.
Once a DLL file has been entered, the dialog box will provide a list of functions that are likely to be
compatible with WordStat (with names starting with "WS"). Select the function containing the
transformation routine needed.
WordStat must set apart in advance a "buffer size" that will contain the transformed text. By default,
the memory space is equal to the length of the original document. For many text transformation
routines, such as stemming or lemmatization, which often result in shorter text, this space should be
large enough. However, for other types of text preprocessing (e.g., part-of-speech tagging or
transformation of words into n-grams), the size of the transformed text may be twice or three times
larger than the original. The Buffer Size option allows you to specify how much larger the memory
space should be in order to hold the transformed text. A numerical value between 1 and 10 can be
used to represent the value by which the original text should be multiplied. For example, if this
Once all the options have been set, click the OK button.
Other tasks
To edit the settings of an external routine:
Select the preprocessing routine you would like to edit from the dropdown list.
Select the preprocessing routine you would like to remove from the dropdown list.
It is also possible from this dialog box to examine and edit all numeric and alphanumeric values stored in
other variables of the data file. To view and edit those values for the current case, the Show all variables
check box should be selected. When enabled, the screen will split vertically. On the left side, a panel with a
list of all variables with their values for this case will be shown. To edit any of those values, press the F2
key or double-click the value to edit.
The following table provides a short description of available buttons and controls:
CONTROL DESCRIPTION
This button allows the importation of text from various file formats, including plain
text, RTF, MS Word, WordPerfect, MS Write or HTML. If the variable containing
the document supports only plain text documents, then all formatting options and
unsupported features, such as bullets, graphics or headers, are removed.
Export the current document to disk. Plain text documents may only be saved as plain
ANSI documents, while RTF documents may be saved in plain text or RTF format.
The color code dialog box allows you to assign specific font and background colors to each category of
the current inclusion dictionary.
Three types of charts may be used to depict the distribution of keywords or content categories:
The vertical bar chart is the default chart used to display absolute or relative frequencies
of keywords or content categories.
The horizontal bar chart displays the same information as the vertical one but is especially
useful when the number of keywords is high and their labels cannot be displayed entirely
on the bottom axis.
The pie chart is useful to display the relative frequency of each keyword and compare
individual values to other values and to the whole. Numerical values displayed in pie
charts are always expressed in percentages of either the total frequency or case
occurrences.
The Plot option allows one to select the values that will be used as the scale for the length of bars in
bar charts or as the percentage base for pie charts. For bar charts the options are:
FREQUENCY Number of occurrences of the keyword
% SHOWN Percentage based on the total number of keywords displayed in the table
% PROCESSED Percentage based on the total number of words encountered during the analysis
% TOTAL Percentage based on the total number of words that have not been excluded
NO OF CASES Number of cases where this keyword appears
% CASES Percentage of cases where this keyword appears
For pie charts, two options are available to specify how percentages will be computed:
FREQUENCY Percentage based on the total frequency of keywords
NO OF CASES Percentage based on the total number of case occurrences
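The bar chart scales listed above can be restated as simple ratios. The following sketch is not WordStat's own code: the function name, the argument names and the reading of "% SHOWN" as a share of the summed frequency of displayed keywords are assumptions made for illustration.

```python
def plot_values(freq, cases_with_keyword, total_shown, total_processed,
                total_not_excluded, n_cases):
    """Return the six bar chart scales for a single keyword."""
    def pct(part, whole):
        # Guard against an empty table or an empty project.
        return 100.0 * part / whole if whole else 0.0

    return {
        "FREQUENCY": freq,
        "% SHOWN": pct(freq, total_shown),
        "% PROCESSED": pct(freq, total_processed),
        "% TOTAL": pct(freq, total_not_excluded),
        "NO OF CASES": cases_with_keyword,
        "% CASES": pct(cases_with_keyword, n_cases),
    }
```

For instance, a keyword occurring 50 times in 20 of 40 cases, with 200 displayed keyword occurrences, 1000 processed words and 500 non-excluded words, gives 25% shown, 5% processed, 10% total and 50% of cases.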
The View Others option displays an additional bar or slice representing all items in the frequency table that
have not been selected.
The following table provides a short description of available buttons and controls:
Press this button to append a copy of the graphic to the Report Manager. A
descriptive title will be provided automatically. To edit this title or to enter a new
one, hold down the SHIFT keyboard key while clicking this button (for more
information on the Report Manager, see page 191).
Press this button to retrieve a chart previously saved on disk.
Press this button to save a chart on disk. Charts are saved in a proprietary format and
may be edited and customized using the Chart Editor.
Pressing this button allows you to print a copy of the displayed chart.
Click this button to turn on/off the 3-D perspective for the current chart.
This button allows you to edit various features of the chart such as the left and bottom
axes, the chart and axis titles, the location of the legend, etc.
This button is used to create a copy of the chart to the clipboard. When this button is
clicked, a pop-up menu appears allowing you to select whether the chart should be
copied as a bitmap or as a metafile.
Pressing this button closes the chart dialog box and returns to WordStat's main
screen.
Clicking the button on the chart dialog box gives access to a dialog box to customize the
appearance of bar charts and line charts. The options available in this dialog box represent only a small
portion of all settings available.
To further customize the chart, modify data points, value labels, or series order, click the button
located to the right-hand side of the dialog box.
TITLES
Proper titles and axis labels are of utmost importance when describing the information displayed in a
chart. By default, WordStat uses variable names and labels as well as other predefined settings to
provide such descriptions.
The title page allows one to modify the top title, as well as the labels on the left, bottom and right axes.
To edit the title, select the proper radio button. Enter several lines of text for each title by pressing the
<Enter> key at the end of a line before entering the next line.
The Font button to the right-hand side of the edit box allows changing the font size or style of the
related title.
3-D VIEW
Orthogonal - Turning this option off disables the free elevation and rotation of the 3-D chart.
Zoom - This option zooms the whole chart. The value is expressed as a percentage: increasing it
brings the chart toward the viewer and enlarges the overall chart size.
3-D Percent - The 3-D Percent property indicates the size ratio between chart dimensions and chart
depth by specifying a percent number from 1 to 100.
Perspective - Use this property with Orthogonal unchecked to modify the 3-D perspective of the
Chart. Larger values add more depth perspective.
Bar shadow - Enabling this option adds dark shades to the sides of 3-D bars. Turning it off will color
the sides of the bar the same as the front.
Bar width - This option determines the percent of total bar width used. Setting this value to 100
makes adjacent bars touch.
Bar depth - Use this property to limit the depth that each bar series uses. By default, bars will take up
the part proportional to the number of bar series in the chart so that the back of a bar will join the
front of the bar immediately behind it. To insert a gap between series of bars, decrease this value.
Pie depth - Use this property to change the thickness of the pie chart.
Click the button and select the SAVE AS A NORM FILE command. A file-saving dialog box
will be displayed.
Click the button and select the REMOVE NORM STATISTICS command.
RETRIEVE - This option determines the text unit on which the search will be performed as well as what
will be retrieved. You can select three different text units:
The Documents search unit allows WordStat to apply the search expression on each document
associated with a specific case and, if a specific document meets the search condition, its location
will be displayed.
Setting this option to Paragraphs allows WordStat to display any paragraph meeting the search
condition.
When selecting Sentences as the search unit, WordStat returns sentences meeting the search
condition.
To enter a second filtering condition, click the button. You can choose to link the two
filtering conditions using one of three Boolean operators: AND, OR or NOT. Choosing
AND will retrieve all text units fulfilling both criteria; selecting OR will result in a retrieval of text
units meeting either the first or the second condition, or both, while choosing the NOT Boolean
operator will retrieve text units meeting the first condition but not the second one.
VARIABLE FILTERING - The second group of options allows one to use other variables to restrict the
retrieved text units to specific cases selected according to some logical condition. This filtering
condition may consist of a simple expression, or may include up to two expressions joined by a
logical operator (i.e., AND, OR). In the above screen shot, only text units from cases where the
variable GENDER is equal to MEN will be retrieved.
The following table shows the various operators available for each data type:
STRING Contains
Does not contain
Is empty
Is not empty
ADD VARIABLES - This dropdown checklist box may optionally be used to add the values stored in
one or more variables to the table of retrieved segments for the specific case from which a text
segment originated.
Once all the search options have been set properly, simply click the button to retrieve the
selected text units.
This table contains the case number and the variable from which the segment originates, as well as the value
of all additional variables selected by the user (see the ADD VARIABLES option above). When searching
for paragraphs or sentences, the table also displays the text associated with the retrieved unit and its location
(its paragraph and sentence number). By using the arrow keys or by clicking a row, the associated text is
displayed in a separate window at the bottom of the screen with all keywords in bold. Selecting specific
words or phrases in this text window and right-clicking displays a pop-up menu to assign them to a content
category or the exclusion list, or to obtain a keyword-in-context list.
Click the button to assign the selected code to the highlighted text segment.
Click the button to assign the selected code to all text segments matching the search expression.
NOTE: To automatically attach QDA Miner tags to all paragraphs or sentences associated with all
currently displayed content categories or keywords, you may use the autocoding feature available by
clicking the button on the Frequencies page (see page 37).
Click the button. A dialog box will appear allowing you to create a new QDA Miner code (for
more information, see Adding a QDA Miner code).
Click the button. The sort order of the current table is used to determine the display order in the
report. This report is displayed in a text-editing dialog box and may be modified, stored on disk (in
RTF, HTML or plain text format), printed, or cut-and-pasted into another application. Graphics and
tables may also be inserted anywhere in this report.
Click the button. A dialog box similar to this one will appear:
To adjust the font size and style of the title and labels
Click the Change button beside the Title or Labels options. A Font setting dialog box will appear,
letting you change the font, the font size, style, and color.
Press this button to append a copy of the chart to the Report Manager. A descriptive title
will be provided automatically. To edit this title or to enter a new one, hold down the SHIFT
keyboard key while clicking this button (for more information on the Report Manager, see
page 191).
Click this button to save a chart on disk. Charts may be saved in BMP, JPG or PNG graphic
file format or may be saved in a proprietary format (.WSX file extension) that can later be
edited and customized using the Chart Editor.
This button creates a copy of the chart to the clipboard. When this button is clicked, a
shortcut menu appears allowing you to select whether the chart should be copied as a bitmap
or as a metafile.
Pressing this button closes the chart dialog box and returns to WordStat's main screen.
HEATMAP PAGE
The main section to the right of this page, as shown in the screen shot below, displays the heatmap grid
representing the relative frequencies of each cell (row and column intersection) using different brightness or
color tones. Optional dendrograms are displayed at the top and to the left margin of this grid. The size of
these dendrograms may be adjusted by moving the mouse cursor over the bottom edge (upper dendrogram)
or the right edge (dendrogram on the left) of the dendrogram and dragging its limit to the desired size.
The font size used to display the row and column values may also be adjusted by clicking the or
buttons located on the top toolbar. The size of cells and the distance between dendrogram leaves are
automatically adjusted to the new font size.
To identify which specific cases or text documents are associated with a cell or group of cells, simply select
the rectangular area for which you would like to obtain such a list, click the button and select
documents, paragraphs or sentences. WordStat locates the documents or text segments associated with those
cells and displays them in the Keyword Retrieval dialog box (see page 150).
PLOT When more than two axes have been extracted, this control allows you to select all the
possible axis combinations that can be graphed on the two axes of the plot.
Words This checkbox allows you to display or hide the row points (i.e., words or category names)
Groups This checkbox allows you to display or hide the column points (i.e., subgroup labels).
Clicking this button enables zooming in on the plot. To zoom in on an area of the plot, hold the left
mouse button and drag the mouse down/right. You'll see a rectangle around the selected area.
Release the left mouse button to zoom.
Clicking this button restores the original viewing area of the plot.
Clicking this button closes the Dendrogram & Concept Map dialog box and returns to
WordStat's main window.
This button can be used to show or hide left, bottom and back walls.
Clicking this button draws anchor lines from the floor to the data point to better locate data
points in all 3 dimensions.
Clicking this button allows you to change the viewing angle of the chart. To rotate the chart,
make sure this button is selected, click any area of the chart, hold the mouse button and drag
the mouse to apply the desired rotation.
Locating a data point on the depth dimension of a 3-D plot can be very difficult especially
when the plot remains static. One often has to rotate this plot constantly on the various axes
to get an accurate idea of where the data point is located on this third axis. Clicking this
button forces WordStat to rotate the plot automatically. To disable the automatic rotation,
click a second time.
Words or categories closely associated with two subgroups will be plotted at an angle from the
origin that lies between those two groups. In the above example, the rock groups Korn and
Metallica seem to be characteristic of both 15- and 16-year-old listeners.
For a more comprehensive description of this method, its computation and applications, see Greenacre
(1984). For an application of correspondence analysis to the analysis of textual data see Lebart, Salem,
and Berry (1998).
This dialog box allows you to specify a label that will be used to describe each case. The label may be
changed by editing the text in the DESCRIPTION STRING edit box. To insert the value stored in a specific
variable into the description, simply enter the variable name in uppercase letters and enclose this name
between braces. Alternatively, you can insert a variable name at the current caret location by clicking the
corresponding item in the VARIABLES list located just above the edit box.
If you enter the following string:
{GENDER} subject - {AGE} years old
The {GENDER} and {AGE} strings will be replaced with their corresponding value for this specific case.
If the current case contains information about a seventeen-year-old male, the above string will be displayed
as:
Male subject - 17 years old
It is also possible to insert the following string:
{CASENUM}
This string will display a unique case number, representing the physical order of this case in the project file.
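The substitution described above can be sketched in a few lines of Python. This only illustrates the placeholder rule (uppercase variable names between braces, plus {CASENUM}). The function name, the dictionary holding the case values and the choice to leave unknown names untouched are assumptions.

```python
import re

def describe_case(template, case, case_number):
    """Replace {NAME} placeholders with the values stored for one case."""
    def substitute(match):
        name = match.group(1)
        if name == "CASENUM":
            return str(case_number)
        # Unknown names are left as typed rather than guessed at.
        return str(case.get(name, match.group(0)))

    return re.sub(r"\{([A-Z0-9_]+)\}", substitute, template)
```

Called with the template {GENDER} subject - {AGE} years old and a case holding GENDER = Male and AGE = 17, the function returns the string shown in the example above.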
This dialog box consists of two major sections. The upper section allows one to specify up to four filtering
conditions joined by logical operators (such as AND, OR). Each condition consists of a variable, an
operator and, if needed, some numerical, categorical or string values. The following table presents the
various operators available for each data type.
The lower section allows one to filter cases based on the presence of words or phrases associated with
specific content categories in the document being processed by WordStat. This section is disabled by
default and becomes active as soon as a content analysis has been performed. The AND, OR and NOT
Boolean operators are used to determine how those two categories should be combined.
For example:
APPEARANCE AND ART Selects only cases that contain words in both categories.
APPEARANCE OR ART Selects cases that contain words in either one of these two categories.
APPEARANCE NOT ART Selects all cases containing a word in category APPEARANCE but that
do not contain any word found in the ART category.
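The three combinations above amount to simple set tests on the categories found in each case. The sketch below assumes a hypothetical mapping from case numbers to the set of categories detected in that case. It is meant only to make the logic explicit.

```python
def select_cases(case_categories, first, operator, second):
    """Return the case numbers satisfying `first <operator> second`."""
    tests = {
        "AND": lambda cats: first in cats and second in cats,
        "OR": lambda cats: first in cats or second in cats,
        "NOT": lambda cats: first in cats and second not in cats,
    }
    keep = tests[operator]
    return [case for case, cats in case_categories.items() if keep(cats)]

# Made-up data: the categories found in four cases.
cases = {
    1: {"APPEARANCE", "ART"},
    2: {"APPEARANCE"},
    3: {"ART"},
    4: set(),
}
```

With these data, AND keeps case 1 only, OR keeps cases 1, 2 and 3, and NOT keeps case 2 only.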
Once a filtering expression has been entered you can apply the filter and leave this dialog box by clicking
the Apply button. If the filter expression is invalid, an error message will appear and the dialog
box will remain open.
To temporarily deactivate the current filter expression, click the Ignore button. The filter expression will be
kept in memory and may be reactivated by selecting the FILTER CASES command again and clicking
Apply.
To exit from the dialog box and restore the previously active filtering expression, click the Close button.
A more advanced xBase filtering dialog box can be accessed by clicking the button. The
filtering expression can be typed directly into the Filter edit box using the proper syntax, or one can use any
element displayed in the upper part of the dialog box to build a valid expression. To obtain more
information on expression operators and evaluation rules and supported xBase functions, see page 171.
VARIABLE NAME LIST BOX - Double-clicking a variable name from the list box located to the left of
the dialog box inserts that name in the edit box at the current caret position.
FUNCTION LIST BOX - A list of valid xBase functions is displayed to the right of the dialog box.
Double-clicking an xBase function in the list box inserts that function at the current caret position.
When a function requires one or several arguments, the argument section remains highlighted. To
replace the highlighted text with a value, an expression or a variable name, simply type the proper
text on the keyboard or select a variable name or function.
NUMERIC, BOOLEAN AND RELATIONAL OPERATORS BUTTONS - Clicking any relational or
Boolean operator button or any numeric button inserts the corresponding symbol in the edit box at the
current caret position.
The following section provides a description of xBase syntax rules used in the FILTER command and a
detailed description of each xBase function.
String Operators
+ Joins two strings. Trailing spaces in each string are left in place.
- Joins two strings and removes trailing spaces from the string preceding the operator and
places them at the end of the string following the minus sign operator.
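The difference between the two string operators is easiest to see on fixed-width values padded with spaces. The following Python sketch mimics the behavior described above. The function names are hypothetical and the code is not part of xBase.

```python
def xbase_plus(left, right):
    """'+' operator: plain concatenation, trailing spaces stay in place."""
    return left + right

def xbase_minus(left, right):
    """'-' operator: move the left operand's trailing spaces to the end."""
    stripped = left.rstrip(" ")
    moved = " " * (len(left) - len(stripped))
    return stripped + right + moved
```

Joining "John  " and "Smith " with + keeps the gap between the two names, while - yields "JohnSmith" followed by three spaces. Both results have the same total length.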
Numeric Operators
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponentiation (or **)
Relational Operators
= Equal to
== Exactly equal to
<> Not equal to
# Not equal to
!= Not equal to
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
$ Is contained in
Evaluation Order
When more than one type of operator appears in an xBase expression, the order of evaluation is as
follows: Expressions containing more than one operator are evaluated from left to right. Parentheses
are used to change the evaluation order. If parentheses are nested, the innermost set is evaluated first.
Logical operators are evaluated as NOT first, AND second, and OR last. Logical evaluation order may
also be altered with parentheses. In multiple conditional expressions that contain the NOT operator,
always use parentheses to enclose the NOT operator with the expression to which it applies.
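Python happens to rank its logical operators in the same order (not, then and, then or), so it can serve as an analogy for how an unparenthesized condition is grouped. The functions below are purely illustrative and use hypothetical Boolean inputs.

```python
from itertools import product

def implicit(a, b, c):
    # No parentheses: NOT binds first, then AND, then OR.
    return not a and b or c

def explicit(a, b, c):
    # The same grouping spelled out with parentheses.
    return ((not a) and b) or c

def different(a, b, c):
    # Parentheses change the meaning: NOT now covers the whole expression.
    return not (a and (b or c))

# The first two forms agree for every combination of truth values.
same_everywhere = all(
    implicit(*values) == explicit(*values)
    for values in product([False, True], repeat=3)
)
```

The third form shows why parentheses around NOT are recommended: when all three inputs are true, the first two forms are true while the third is false.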
ALIAS()
ALLTRIM (String)
Trims both leading and trailing spaces from a string. The string may be derived from any valid xBase
expression.
AT (SearchString, TargetString)
Determines whether a search string is contained within a target. If found, the function returns the
position of the search string within the target string (relative to 1). If not found, the function returns 0
(zero).
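The return convention (position counted from 1, or 0 when the search string is absent) can be illustrated with Python's own string search, which counts from 0 and returns -1 on failure. The sketch also covers the right-to-left variant described later in this list. Both function names are hypothetical, and edge cases such as an empty search string are not addressed.

```python
def at(search, target):
    """Position of the first match, counted from 1, or 0 when absent."""
    return target.find(search) + 1

def rat(search, target):
    """Position of the last match, counted from 1, or 0 when absent."""
    return target.rfind(search) + 1
```

For example, searching for "a" in "banana" gives 2 from the left and 6 from the right, while searching for "z" gives 0 in both cases.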
CHR (Val)
CTOD (String)
Converts a character string into an xBase date. The string must be formatted according to the Windows
date format settings.
CTOD("12/31/94")
DATE ()
Returns the system date (today). Use DTOC(DATE()) to retrieve today's date formatted according to
the Windows settings.
DAY (DateVariable)
Returns the day portion of an xBase date as an integer.
DELETED ()
Returns True if the case is deleted and False if not deleted.
DESCEND (String)
An xBase function that inverts a key value using 2's complement arithmetic. The result of the operation
is the arithmetic inverse of the key value. When inverted keys are sorted in ascending sequence, the
result is in descending order. A filter expression could be
DESCEND(DTOS(billdate)) + CUSTNO
DTOC (DateVariable)
Converts an xBase date into a character string formatted according to the Windows settings. For
example, if the date format was American and the date variable contained March 21, 1995,
DTOC(datevariable) would return '03/21/1995'.
DTOS (DateVariable)
Converts an xBase date into a string formatted according to standard xBase storage conventions
(CCYYMMDD). For example, December 21, 1993 would be returned as '19931221'. Indexes that
contain date elements should use the DTOS() function, which naturally collates into oldest date first.
EMPTY (Variable)
Reports the empty status of any xBase variable. Character and date variables are empty if they consist
entirely of spaces. Numeric variables are empty if they evaluate to zero. Logical variables are empty if
they evaluate to False.
Memo variables that contain no reference to a memo block in the associated memo file are empty.
INDEXKEY ()
Returns the current index key as a string. (Same as ORDKEY()).
LEN (Expression)
Returns the length of the expression result as an integer.
LOWER (String)
Converts the string expression into lower case.
MONTH (DateVariable)
Returns the month portion of an xBase date as an integer.
ORDER ()
Returns the current index order as an integer.
ORDKEY ()
Returns the current index key as a string. (Same as INDEXKEY())
RAT (SearchString, TargetString)
Determines whether a search string is contained within a target, starting from the right side of the target
string. If found, the function returns the position of the search string within the target string (relative to
1). If not found, the function returns 0 (zero).
RECCOUNT ()
RECNO ()
RIGHT (Expression, Length)
Returns the rightmost characters of the expression for the defined length.
SELECT ()
Returns the workarea number for the current work area as a long integer.
SPACE (Length)
STOD (String)
The inverse of DTOS(). STOD() converts a string formatted according to standard xBase storage
conventions (CCYYMMDD) to an xBase Date formatted according to the Windows settings.
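The CCYYMMDD storage form used by DTOS() and STOD() is useful because plain string sorting then matches chronological order. The Python sketch below reproduces the two conversions with the standard datetime module. The function names mirror the xBase ones for readability only.

```python
from datetime import date, datetime

def dtos(d):
    """Format a date as CCYYMMDD, the storage form described for DTOS()."""
    return d.strftime("%Y%m%d")

def stod(s):
    """Parse a CCYYMMDD string back into a date, mirroring STOD()."""
    return datetime.strptime(s, "%Y%m%d").date()

# Sorting the strings puts the oldest date first.
keys = sorted([dtos(date(1995, 3, 21)), dtos(date(1993, 12, 21))])
```

Here December 21, 1993 becomes '19931221', as in the DTOS() example above, and sorts ahead of '19950321'.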
STR (Number, Length, Decimals)
Converts a number into a right-justified string with the specified number of decimal digits following the decimal point. The
total length of the string is defined by the length parameter. STR(RECNO(), 5, 0) is a common
indexing element that ensures creation of unique keys if appended to another variable element.
If the decimals parameter is omitted, the function defaults to zero decimal places. If the length
parameter is omitted as well, the length of the result is the length of the variable.
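The effect of the length and decimals parameters can be approximated with Python's fixed-point formatting. This is a rough sketch under stated assumptions: the function name is hypothetical, the length must be given explicitly, and the way xBase rounds halves or handles numbers too wide for the requested length is not reproduced.

```python
def xbase_str(number, length, decimals=0):
    """Right-justify `number` in `length` characters with `decimals` digits."""
    return f"{number:{length}.{decimals}f}"

# A key built from a name and a padded record number (made-up values).
key = "SMITH" + xbase_str(7, 5)
```

With these definitions, xbase_str(42, 5) returns '   42' and xbase_str(3.14159, 8, 2) returns '    3.14'. The fixed width is what keeps keys built this way aligned and unique.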
TIME ()
Returns the system time as a string in the form HH:MM:SS.
TRANSFORM (Expression, Picture)
Transform converts strings and numeric values into formatted character strings. The function
transforms the result of the first expression in accordance with the second picture string.
The picture string is made up of two parts. The first part is the Function string and it is optional for both
strings and numeric values (as long as the second Template string is present).
A character string transformation picture may consist of only a Function string or only a Template or
both.
A numeric picture must contain a Template string; the Function string is optional.
A logical value must contain only a Template string with Template characters L or Y.
The Function string consists of a leading @ character followed by one or more formatting characters. If
the Function string is present, the @ character must be the first character in the picture string with its
formatting characters immediately following and it may not contain spaces.
If a Template string exists as well, it follows the Function string. A single space separates the Function
string and the Template string.
Function string characters allowed for numeric values are:
B left justify;
C display CR after positive numbers;
X display DR after negative numbers;
Z blank a zero value;
( enclose negative numbers in parentheses.
Example: Where "phone" is a character variable holding a phone number with no formatting characters.
'transform(phone, "@R (###) ###-####")' returns '(909) 699-6776'.
If the formatting characters were actually present in the variable, the "@R" function would be omitted.
TRIM (String)
Removes trailing spaces from the string expression.
UPPER (String)
Converts the string expression into upper case. Character variables used in index expressions
should always be converted to upper case to ensure a correct collating sequence.
VAL (String)
Converts a string of numeric characters into its equivalent numeric value. The conversion stops at
the first non-numeric character encountered (or the end of the string).
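The "stop at the first non-numeric character" rule can be sketched with a regular expression that captures only the leading numeric part of the string. Several details here are assumptions rather than documented behavior: leading spaces are skipped, a sign and a decimal point count as numeric, and a string with no leading number yields zero.

```python
import re

# Leading number: optional sign, digits, optional decimal part.
_LEADING_NUMBER = re.compile(r"\s*([+-]?\d+(?:\.\d*)?|[+-]?\.\d+)")

def val(text):
    """Convert the leading numeric part of `text`, ignoring what follows."""
    match = _LEADING_NUMBER.match(text)
    return float(match.group(1)) if match else 0.0
```

Under these assumptions, val("12.5kg") gives 12.5, val("123abc") gives 123.0 and val("abc") gives 0.0.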
YEAR (DateVariable)
Returns the year portion of an xBase date as an integer.
Unique Keywords
Unique keywords or code names may be inserted anywhere in the text. Those keywords should preferably
not be existing dictionary words. For example, a keyword may consist of an abbreviation of one or several words,
or include special symbols (such as #, &, ^, _, etc.) or numeric digits. The retrieval of those codes can then be
achieved by adding all those keywords to the inclusion list. If special symbols have been used, they should
also be specified as valid characters on the options page.
LEVEL OF MEASUREMENT
WARNING: When the computation is based on keyword frequencies (rather than case occurrences) and
when codes can be used more than one time per case, it is usually recommended to use ordinal or interval
level agreement measures. Otherwise a difference in frequency for a specific code will be treated as a single
disagreement. For example, if for a single case a coder assigns a code twice and another coder uses this
same code three times, nominal level agreement measures will treat this difference as a single disagreement
and will ignore the fact that both raters may be in agreement in two instances. As a result, the overall
agreement level will be underestimated. However, nominal level measures may still be used in those
situations if the researcher wishes to treat any difference in frequency as a disagreement.
DESTINATION - This option allows you to choose whether the new variables should be appended to the
current data file or written into a new file. If this last option is selected, a dialog box will appear
allowing you to specify the name and location of the new file. When data are saved to a new data
file, additional variables are created to store the case number and the numerical values of each
independent variable.
DATA TO SAVE - This option allows one to choose among four different kinds of data that may be
saved:
Keyword frequencies
Case Occurrences (i.e., a dummy variable with 0 when absent or 1 when present)
Percentage of words (i.e., the frequency of the keyword divided by the total number of words in
the case)
TF*IDF (i.e., the keyword frequency weighted by inverse document frequency).
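To make the four kinds of data concrete, the sketch below computes them for one keyword across a small set of cases. It is not WordStat's implementation. The input shape (one list of words per case) and the function name are hypothetical, and the inverse document frequency is assumed to be the natural logarithm of the number of cases divided by the number of cases containing the keyword, a common formulation that may differ from the one actually used.

```python
import math

def keyword_stats(cases, keyword):
    """Compute the four exportable values for one keyword, case by case."""
    n_cases = len(cases)
    doc_freq = sum(1 for words in cases if keyword in words)
    # Assumed weighting: ln(cases / cases containing the keyword).
    idf = math.log(n_cases / doc_freq) if doc_freq else 0.0
    rows = []
    for words in cases:
        freq = words.count(keyword)
        rows.append({
            "frequency": freq,
            "occurrence": 1 if freq else 0,
            "pct_words": 100.0 * freq / len(words) if words else 0.0,
            "tfidf": freq * idf,
        })
    return rows

# Made-up data: two cases, already split into words.
rows = keyword_stats([["art", "art", "music", "dance"], ["music", "dance"]], "art")
```

In this example the first case has a frequency of 2, an occurrence of 1 and a word percentage of 50, while the second case has zeros throughout.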
VARIABLE NAMES - This option lets you determine what method should be used by WordStat to
create new variable names. When set to KEYWORD, the program will attempt to use each keyword
VARIABLE TYPE - By default, WordStat saves keyword statistics in as many variables as there are
keywords or content categories listed on the frequency page. For example, if the frequency table
contains 100 items, then 100 variables will be necessary to store the statistics associated with each
item. When you choose to store the occurrence of codes, WordStat offers you the possibility of
storing the observed occurrences in a limited number of polynomial (or multinomial) variables. For
example, if the maximum number of different content categories per case is no more than 10, then
you may instruct WordStat to create 10 numeric variables and store, in each of those, a numeric
value representing one of the content categories. If fewer than 10 categories are found in a specific
case, then the remaining variables are left empty. To store values in a limited set of nominal
variables, choose the Multiple Polynomial Variables option and enter the Maximum Number of
Variables that should be used for storing the values representing the content categories. To store the
name of those categories rather than their numerical values, select Multiple String Variables
instead. If the maximum number of content categories found in a single case is higher than the
specified number of variables, then a warning message will appear to let you know that some
information has been lost and to indicate the maximum number of content categories encountered in
the project. To export occurrences as zeros and ones in as many variables as there are codes, select
the Multiple Dichotomous Variables option.
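The difference between the three variable types can be sketched in Python as follows. The function name, the mode labels, and the use of a 1-based position in the category list as the "numeric value" of a category are assumptions for illustration only. The manual does not specify how WordStat assigns numeric codes.

```python
def encode_case(found, categories, mode="dichotomous", max_vars=None):
    """Encode the categories found in one case.

    found      -- set of category names observed in the case
    categories -- ordered list of all categories in the frequency table
    mode       -- "dichotomous", "numeric" or "string" (hypothetical labels)
    max_vars   -- maximum number of variables for the two limited modes
    Returns (values, truncated); truncated is True when information is lost.
    """
    if mode == "dichotomous":
        # One 0/1 variable per category, so nothing can be lost.
        return [1 if c in found else 0 for c in categories], False

    if max_vars is None:
        raise ValueError("max_vars is required for numeric and string modes")

    present = [c for c in categories if c in found]
    truncated = len(present) > max_vars      # would trigger the warning message
    kept = present[:max_vars]

    if mode == "numeric":
        # ASSUMPTION: the numeric code is the 1-based position of the category.
        values = [categories.index(c) + 1 for c in kept]
    elif mode == "string":
        values = list(kept)
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Unused variables are left empty.
    values += [None] * (max_vars - len(values))
    return values, truncated
```

With four categories and a case containing two of them, the dichotomous mode yields four 0/1 values, while the limited modes yield `max_vars` values padded with empty entries.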
ADD VARIABLES - This drop-down checklist box may be used to add the values stored in one or more
variables to the exported data file along with the statistics.
SAVE TOTAL NUMBER OF WORDS - This option appends a numeric variable named TOTWORDS
that contains the total number of words processed in each case.
Clicking the PREVIEW button displays a grid allowing one to see what the data file will look like.
Specific settings, such as the inclusion of special characters or numerical digits, may also need to be set in
order to collect relevant information.
In order to apply such a process to external documents, one should normally import the documents into a
SimStat data file or QDA Miner project file, run WordStat and replicate the exact same settings as those
originally used. An alternate solution is to export the categorization model to disk, which would include all
the relevant information and settings, and then use either the WordStat Document Classifier utility program
or functions of the Software Developer's Kit (release date Summer 2005) to retrieve the saved model and
apply it to the new documents.
Go to the Frequency page and click the button located at the top of the page.
Select the Export Categorization Model command. A dialog box should appear asking you for a
file name.
Enter the file name of the model you want to create and click Save.
By default, categorization model files are saved with a .wcat file extension in the \Models subfolder under
the program folder. NOTE: While the information in the exclusion list and categorization dictionary is all
stored in the categorization file, running a categorization model from outside WordStat may still require the
availability of some resource files such as language dictionaries or preprocessing libraries (EXE or DLL).
This should not cause an inconvenience when applying those models on the same computer as the one used
to create the model, since information about the original locations of those resource files is always stored
within the model file. However, when attempting to apply those categorization models on another
computer, the calling application may have some difficulty locating the needed resource files. Those files
should be stored either under paths identical to those on the original computer, in the application folder or
under specific subfolders. When an attempt is made to apply a saved categorization or classification model
for which some resource files are missing, an error message will be displayed providing the list of all
missing files, their original location and alternate locations where they might be.
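The lookup order described above can be sketched in Python. This is only an illustration of the idea, not the logic used by WordStat or its Software Developer's Kit: the function name, the `subfolders` argument, and the `exists` hook are hypothetical, and the manual does not name the specific subfolders that are searched.

```python
from pathlib import Path

def locate_resources(original_paths, app_folder, subfolders=(), exists=None):
    """Try to find each resource file recorded in a model file.

    original_paths -- file paths as stored in the model (original computer)
    app_folder     -- folder of the calling application
    subfolders     -- names of subfolders to search (hypothetical argument)
    exists         -- predicate used to test a path; defaults to Path.is_file
    Returns (found, missing), where missing lists the alternate locations tried.
    """
    if exists is None:
        exists = lambda p: p.is_file()
    app_folder = Path(app_folder)
    found, missing = {}, []
    for original in original_paths:
        name = Path(original).name
        # 1. identical path, 2. application folder, 3. its subfolders
        candidates = [Path(original), app_folder / name]
        candidates += [app_folder / sub / name for sub in subfolders]
        hit = next((p for p in candidates if exists(p)), None)
        if hit is not None:
            found[original] = hit
        else:
            missing.append((original, candidates[1:]))
    return found, missing
```

The `missing` list carries the same information as the error message described above: each missing file, its original location, and the alternate locations where it could be placed.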
For information on how to apply the saved categorization model, please refer to the WordStat Document
Classifier section or to the WordStat Software Developer's Kit.
To load the document to be analyzed, select the OPEN command from the DOCUMENT menu or
click the button. An Open File dialog box will be displayed. In the File of Type list box, select
the format of the file you would like to read, locate the file, select it and click the Open button.
You may also type directly in the text editing window or paste text previously copied to the
clipboard by moving to the text editor and then selecting the PASTE command from the
DOCUMENT menu or by clicking the button.
To open the categorization or the text classification model, select the OPEN command from the
MODEL menu or click the button. Content analysis models are stored in files with a .wcat file
extension, while document classification models are stored in files with a .wclas file extension.
Select the model you would like to use and click the Open button.
Select the APPLY command from the MODEL menu or click the button.
When a categorization dictionary is applied, a single frequency table is displayed at the bottom of the
page with the following statistics:
% SHOWN    Percentage based on the total number of keywords displayed in the table.
% TOTAL    Percentage based on the total number of words that have not been explicitly
           excluded.
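The two percentages differ only in their denominator, as the following Python sketch shows. The function and argument names are hypothetical; the counts are assumed to be supplied by the caller.

```python
def percentage_columns(shown_counts, words_not_excluded):
    """Compute % SHOWN and % TOTAL for each keyword in the table.

    shown_counts       -- dict mapping each displayed keyword to its frequency
    words_not_excluded -- total number of words not explicitly excluded
    """
    total_shown = sum(shown_counts.values())
    table = {}
    for keyword, freq in shown_counts.items():
        table[keyword] = {
            "frequency": freq,
            # Denominator: keywords displayed in the table.
            "pct_shown": 100.0 * freq / total_shown if total_shown else 0.0,
            # Denominator: all words that were not excluded.
            "pct_total": 100.0 * freq / words_not_excluded if words_not_excluded else 0.0,
        }
    return table
```

For example, a keyword with 3 of the 4 displayed occurrences in a 20-word document has a % SHOWN of 75 and a % TOTAL of 15.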
When a classifier is used, a second table is shown allowing you to examine the classification decision made
by the classifier as well as the computed values associated with each class of the categorical variable. When
the k-Nearest Neighbors algorithm is used for classification and the database containing the training set can
be located, a third table is shown, displaying the "k" most similar documents, their ranking and their
similarity scores.
Select the OPEN DATA FILE command from the DOCUMENT menu, or click the button. An
Open File dialog box will be displayed. Locate the data file containing the documents to analyze,
select it and click the Open button.
The content of the data file is displayed in a table while the text to be categorized or classified is
displayed on its right.
To open the categorization or the text classification model, select the OPEN command from the
MODEL menu, or click the button. Content analysis models are stored in files with a .wcat file
extension, while document classification models are stored in files with a .wclas file extension.
Select the model you would like to use and click the Open button.
Select the APPLY command from the MODEL menu, or click the button.
When a categorization dictionary is applied, a single frequency table is displayed at the bottom of the
screen with the number of occurrences of each keyword included in the model, as well as the total
number of words.
When a classifier is used, a second table is shown, allowing one to examine the classification
decision made by the classifier as well as the computed values associated with each class of the
categorical variable. This table is synchronized with the database shown at the top of the screen, so
that moving from one row to another - either in the database or this classification table - moves to the
corresponding row in the other table.
If the k-Nearest Neighbors algorithm is used for classification and the database containing the
training set can be located, a third table is shown, displaying the "k" most similar documents, their
ranking and their similarity scores.
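To make the content of the second and third tables more concrete, here is a minimal k-Nearest Neighbors sketch in Python. It is an illustration under stated assumptions, not WordStat's classifier: documents are represented as raw word counts, similarity is assumed to be the cosine measure, and each class score is assumed to be the sum of the similarities of the neighbors belonging to that class. The manual does not specify which similarity measure or voting rule the program uses.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(doc_tokens, training, k=3):
    """Classify a document from its k most similar training documents.

    training -- list of (label, tokens) pairs (hypothetical layout)
    Returns (predicted_class, class_scores, neighbors), where neighbors is a
    list of (rank, training_index, label, similarity) tuples.
    """
    if not training:
        raise ValueError("the training set is empty")
    query = Counter(doc_tokens)
    scored = [(cosine(query, Counter(tokens)), label, i)
              for i, (label, tokens) in enumerate(training)]
    scored.sort(key=lambda s: -s[0])          # most similar first; ties keep order
    neighbors = [(rank, i, label, sim)
                 for rank, (sim, label, i) in enumerate(scored[:k], start=1)]
    # ASSUMPTION: each class score is the summed similarity of its neighbors.
    class_scores = Counter()
    for _, _, label, sim in neighbors:
        class_scores[label] += sim
    predicted = max(class_scores, key=class_scores.get)
    return predicted, dict(class_scores), neighbors
```

In this sketch, `class_scores` corresponds to the computed values shown for each class, and `neighbors` corresponds to the table of the "k" most similar documents with their ranking and similarity scores.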
To store in the opened data file either the predicted class or the computed score obtained for every
class, click the button. A dialog box similar to this one will appear:
Enter the variable name that will contain the predicted class.
To save the scores associated with each class and upon which the classification has been made, put a
check mark beside Save scores for all classes and enter a variable prefix (up to 7 characters).
Variable names are created by adding successive numeric values to this prefix. For example, if the
edit box at the right of the Variable Prefix option is set to "CLASS", the variable names will be
CLASS1, CLASS2, CLASS3, etc.
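The naming rule can be expressed as a one-line Python sketch. The function name is hypothetical, and the check on the prefix length simply mirrors the 7-character limit mentioned above.

```python
def score_variable_names(prefix, n_classes, max_prefix_len=7):
    """Build the score variable names by appending 1, 2, 3... to the prefix."""
    if not prefix or len(prefix) > max_prefix_len:
        raise ValueError(f"the prefix must have 1 to {max_prefix_len} characters")
    return [f"{prefix}{i}" for i in range(1, n_classes + 1)]
```

Calling `score_variable_names("CLASS", 3)` produces the three names given in the example above.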
If any one of the specified variables does not exist, WordStat will create new ones and store the
numerical values associated with either the predicted class or the class scores. A confirmation dialog
In the Save as type list box, select the file format under which you would like to save the table. The
following formats are supported: ASCII file (*.TXT), Tab delimited file (*.TAB), Comma delimited
file (*.CSV), HTML file (*.HTM; *.HTML), Excel spreadsheet file (*.XLS).
To print a table:
For more information on this Software Developer's Kit, please contact [email protected].
Select the CHOOSE X-Y command from the STATISTICS menu and assign all the newly created
variables to the list of independent or dependent variables. (The distinction between dependent and
independent variables is not relevant for this kind of analysis. However, all variables assigned to a
single category will be processed together.)
Choose the OTHER | FACTOR ANALYSIS command from the STATISTICS menu.
The button, found in many locations in QDA Miner, may be used to copy entire documents, tables and
charts to the Report Manager.
Selected text segments or image areas may also be appended by clicking the button.
To access the Report Manager from QDA Miner, run the REPORT MANAGER command from the
PROJECT menu.
The program presents its information as an outline, allowing a hierarchical organization of miscellaneous
pieces of information that is ideal for project management, organizing ideas, structuring information, or
designing and writing a research report.
The workspace emulates the appearance of Windows Explorer or of a standard Help file with the Table of
Contents (TOC) on the left and the Editor on the right.
Item Editor
The largest panel on the right of the program window is the Item Editor, which is like a built-in word
processor. This is where the item selected in the Table of Contents can be edited. Clicking a Table of
Contents item displays its contents for editing.
Toolbar
The Toolbar provides quick access to the most frequently used functions. Position the mouse pointer over a
toolbar button and wait for a brief description of its function to appear.
Comment panel
The Comment panel below the Item Editor allows the insertion and editing of comments related to the
selected topic. When new items are added to the Report Manager from QDA Miner, a default comment is
often already present, providing useful information about the origin of this item.
To rename an item:
Select the item to be renamed.
Select the RENAME command from the ITEMS menu or click the toolbar button.
In the Item Title Dialog, change the title.
Click OK.
To delete an item:
Select the item to delete in the Table of Contents.
Select the DELETE command from the FILE menu or click the toolbar button.
You will be asked to confirm that you really want to delete the item. If you're sure, then click Yes.
NOTE: Deleting an item cannot be undone.
Moving Items
As more items are created and the Report Manager hierarchy grows, it is inevitable that you will want to
move items around, either to place one item under another, or to promote one to a higher level.
The easiest way to move items in the Table of Contents is by using drag-and-drop operations. Using the
mouse, you can move an item to a different location or move a group of items stored under a "parent" item
by dragging this "parent" item to its new location.
Select the item to move by clicking and holding down the left mouse button. (Keep the mouse
button pressed until the drag-and-drop operation is completed.)
Drag the item to its new location and, only then, release the mouse button.
To promote the selected item, click the button, or select the PROMOTE command from the
ITEMS menu.
To demote the selected item, click the button, or select the DEMOTE command from the
ITEMS menu.
To move the selected item up relative to its siblings, click the button or select the MOVE UP
command from the ITEMS menu.
To move an item down relative to its siblings, click the button or select the MOVE DOWN
command from the ITEMS menu.
Editing Documents
The Report Manager offers many editing features to create and edit both simple text documents and
documents with complex formatting, as well as tables and graphics. When a document item is selected in
the Table of Contents, a DOCUMENT menu appears, displaying all available formatting and editing
options. A similar menu can also be obtained by right-clicking anywhere in this document. The toolbar
portion directly over the editing area also displays buttons to access the most often-used editing and
formatting functions.
Individual documents may also be printed or exported to disk in various file formats such as plain text, Rich
Text or HTML format. An IMPORT command is also available to read a document file stored in plain text,
Rich Text, MS Word, WordPerfect, HTML and a few additional formats. Executing such a command will
replace the existing content with the content of the imported file.
Editing Charts
Many charts saved in the Report Manager may be edited using many of the same options as those available
in QDA Miner. These include the multidimensional scaling plot obtained through the CODE CO-OCCURRENCE
analysis command, the correspondence analysis plots, the bar charts and line charts created by the
CODING BY VARIABLES command, and the bar charts and pie charts produced by the CODING
FREQUENCY command. To obtain information on the display options available for those charts, see their
corresponding pages in this manual. Other charts, such as dendrograms and heatmaps, are stored as image
files and therefore cannot be modified. However, just like other charts, they may be exported to disk in
various file formats such as BMP, JPG or PNG graphic files.
The Text to Find edit box allows you to specify the text you want to find. The Case Sensitive and Whole
Word Only options function in the same way as in a standard word processor.
The search starts at the current topic item. Select Forward to continue searching items below the current
one, or Backward to move up and search items above the current document or table in reverse order. To
search all items, select the top item in the Table of Contents before using the Global Search dialog box.
The Scope option box is used to specify what is to be searched. You can restrict the search to Documents,
Tables, or Comments attached to items, or any combination of these three.
Once the search options have been set, click the Find button to start the search as well as to continue
searching for additional instances of this text.
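The way the search options combine can be sketched in Python. This is an illustration only: the `Item` structure, the flat list standing in for the outline order, and the assumption that the current item itself is included in the search are not taken from the program.

```python
import re
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    kind: str          # "document" or "table" (hypothetical labels)
    text: str = ""
    comment: str = ""

def global_search(items, start, text, forward=True, case_sensitive=False,
                  whole_word=False, scope=("document", "table", "comment")):
    """Return (index, where) pairs for every item containing the text.

    items -- items in outline order; start -- index of the current item
    """
    pattern = re.escape(text)
    if whole_word:
        pattern = rf"\b{pattern}\b"
    regex = re.compile(pattern, 0 if case_sensitive else re.IGNORECASE)

    # ASSUMPTION: the current item is searched first, then the items
    # below it (Forward) or above it (Backward).
    order = range(start, len(items)) if forward else range(start, -1, -1)
    hits = []
    for i in order:
        item = items[i]
        if item.kind in scope and regex.search(item.text):
            hits.append((i, item.kind))
        if "comment" in scope and regex.search(item.comment):
            hits.append((i, "comment"))
    return hits
```

Starting at index 0 with the Forward direction covers every item, which matches the advice above to select the top item before searching everything.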
By default, all items are marked for export. To prevent some items from being included in the exported file,
simply remove the check marks beside them. Clicking a "parent" item affects all "children" items in the
same way. To unselect all items, uncheck the project item located at the very top of the tree.
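The effect of checking or unchecking a "parent" item can be pictured with a small Python sketch of a checklist tree. The `Node` structure and function names are hypothetical and only model the behavior described above.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    checked: bool = True                 # all items are marked by default
    children: list = field(default_factory=list)

def set_checked(node, checked):
    """Check or uncheck an item and all of its children."""
    node.checked = checked
    for child in node.children:
        set_checked(child, checked)

def items_to_export(node):
    """List the titles of all checked items, in outline order."""
    titles = [node.title] if node.checked else []
    for child in node.children:
        titles.extend(items_to_export(child))
    return titles
```

Unchecking the node at the top of the tree clears every item, after which individual items can be checked again one by one.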
Once the selection process is completed, click the OK button. A Save File dialog box will be displayed,
allowing you to enter a file name and select the location where the file should be saved. After the file is
created, you will be asked if you want to view this file. Clicking Yes opens the file in a web browser if
it is an HTML document, or in MS Word or WordPad if it is a Word document.
If you have any comments or suggestions for further improvement, please contact Provalis
Research:
By Phone: 514-899-1672
By FAX: 514-899-1750
By Email: [email protected]