7 Annex
7.1 Functions of the Expression Editor
7.1.1 Arithmetic Operators
7.1.2 Boolean Operators
7.1.3 Date Operators
7.1.4 Miscellaneous Operators
7.1.5 String Operators
7.1.6 Conversion Operators
This document is addressed to people who want to evaluate or use InfiniteInsight®, and in particular the
InfiniteInsight® Explorer - Semantic Layer feature.
Before reading this guide, you should read chapters 2 and 3 of the InfiniteInsight® - User Guide that present
respectively:
An introduction to InfiniteInsight®
The essential concepts related to the use of InfiniteInsight® features
No prior knowledge of SQL is required to use Data Manipulation - only knowledge about how to work with tables
and columns accessed through ODBC sources. Furthermore, users must have “read” access on these ODBC
sources.
To use the Java graphical interface, users need write access on the tables KXADMIN and CONNECTORSTABLE,
which are used to store representations of data manipulations.
For more technical details regarding InfiniteInsight®, please contact us. We will be happy to provide you with more
technical information and documentation.
This document introduces you to the main functionalities of the Data Manipulation feature.
One of the useful features of Data Manipulation is the ability to declare arguments. Arguments are symbols with
associated values that can be changed before executing the data manipulations. They can be used anywhere
within Data Manipulation.
KXEN has not built a special engine to execute these data manipulations, since they can all be performed by the
standard SQL engines embedded in all major relational databases. Instead, the KXEN Data Manipulation module
can be seen as an object-oriented layer that is used to generate data manipulation statements in SQL, which are
processed, in turn, by the database server.
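As an illustration, a simple data manipulation that keeps two fields and filters the lines might generate a statement along the following lines. This is only a sketch: the table and field names are hypothetical, and the exact SQL depends on your database.

    SELECT CUSTOMER_ID, AGE      -- the visible fields of the data manipulation
    FROM CUSTOMERS               -- the source table
    WHERE AGE >= 18;             -- a filter condition

The statement is generated by InfiniteInsight® and executed by the database server, so the data never needs to leave the database.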
To facilitate reading, certain publishing conventions are applied throughout this guide. These are presented in the
following table.
Graphical interface features and file names: Arial bold (example: Click Next)
The titles of particularly useful sections: Garamond italicized bold (example: See Operations)
This section provides you with some useful definitions and details the technical requirements for using the
InfiniteInsight® Explorer - Semantic Layer feature.
IN THIS CHAPTER
Definitions
Technical Requirements
2.1 Definitions
Data Manipulation
Tabular representation of data made of lines and columns. Each line represents an "observation". Roles such as
"input", "skip", "target" or "weight" can be assigned to columns.
Data Preparation
Set of operations needed to create a Data Manipulation. It can be broken up into two separate phases:
InfiniteInsight® Explorer - Semantic Layer and Data Encoding.
InfiniteInsight® Explorer - Semantic Layer
Business-intended data transformations, such as target definition or data set filtering.
Data Encoding
Technically driven data transformations that are automatically handled by InfiniteInsight® Platform.
A data manipulation is based on existing database tables. The first step in creating one is to select the database
and the table you want to work with.
2 Click the Browse button corresponding to the field Database Source. The window Data Source Selection
opens.
3 Select the database from which you want to create a new data manipulation.
Two types of tables can be displayed in the window Data Source Selection:
data manipulations created with InfiniteInsight®,
standard database tables.
7 In the list, select the table you want to use in the new data manipulation.
8 Click the OK button to validate your selection.
9 In the field New Table Alias, enter the name you want to use to refer to the selected table. By default the alias
is based on the selected table name.
The metadata repository allows you to specify the location where the metadata should be stored.
The Data Manipulation Editor provides you with seven tabs, allowing you to create new fields, merge tables, filter
the data, define prompts, view the data, view the SQL query, and view the documentation corresponding to the
current data manipulation.
You can go back and forth between those tabs to define the new data set.
The expression editor allows you to create fields (one by one or several at a time) and to edit filter conditions as
you would do with a calculator.
To build your expression, you have at your disposal:
Functions: they allow you to build complex expressions with one or more fields.
Variables and their values:
the Fields of your database
the Field sets you have defined in your data manipulation
the Prompts you have defined in your data manipulation
the defined Categories for the existing fields/variables.
Field Association
Messages
To Create a Field
1 Enter an expression in the text area located in the upper part of the panel.
Note - To learn more about a function, move the mouse over it to display an explanation label (for example, over
the Arithmetic Operator "Absolute").
You can also insert these elements in a chosen position by drag-and-dropping them from one of the trees
to the text area.
The color of the indicator located above the Messages area indicates the state of the formula. For further
details, go to section Messages.
2 To validate the formula, click the OK button.
3 A pop-up appears to name the new computed field. Enter a name in the Name field.
Field Sets
Creating several fields by applying the same calculation to various existing fields is a frequent need. The
expression editor allows you to do that thanks to the use of field sets.
For example, the use of field sets allows you to sum up a large number of fields, or to compute their maximum.
3 In the field Alias Mask, enter a mask to filter the fields by their name. A mask is made of a part common to
the names of all the fields you want to see displayed, and of the star character (*), which stands for the parts
that differ between the field names. For example, the mask INCOME_* matches the fields INCOME_APRIL,
INCOME_MAY and INCOME_JUNE. The star can be used anywhere in the mask and as many times as needed.
4 Uncheck the fields you do not want to keep in the field set.
5 In the field Set Name, enter a name for the new field set.
6 Click the OK button. The window closes and the new field set is displayed in the list under the item Field Sets.
3 Click the Edit option. The field set editing window opens.
Note - Changing the name of a field set amounts to duplicating it.
When you want to apply a calculation to fields whose names have a common root, you can create a field set on the
fly in the formula text field.
A field set created on the fly is defined by a mask, that is, a fixed part common to the names of all the wanted
fields and a wildcard character representing the part of the names that changes for each field. Three wildcard
characters can be used to define the field sets: the at sign (@), the hash sign (#) and the dollar sign ($). Only one
wildcard can be used for each field set, and the same wildcard cannot be used twice in the same formula.
A field set can be used to define a list of arguments for n-ary functions, that is, functions with an undefined
number of arguments. When a field set is used as an argument of an n-ary function, and in this case only, you
need to frame the set name with braces { } to force its interpretation as an argument list.
As an example, let's consider a database table whose fields INCOME_APRIL, INCOME_MAY, INCOME_JUNE
contain the monthly income. If you want to create a field containing the highest monthly income for the quarter,
you need to use the following formula GREATESTN({INCOME_@}) which uses the INCOME_@ field set as a list of
variables and thus amounts to the formula GREATESTN(INCOME_APRIL, INCOME_MAY, INCOME_JUNE).
However, when using the field set without braces, as in the formula GREATESTN(INCOME_@), three fields are
created by the three following formulas: GREATESTN(INCOME_APRIL), GREATESTN(INCOME_MAY) and
GREATESTN(INCOME_JUNE), which cannot be used.
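For reference, once expanded, such a formula corresponds to SQL along the following lines. This is a sketch only: the table name INCOMES is hypothetical, and the GREATEST function is assumed to be available in your database.

    -- Hypothetical translation of GREATESTN({INCOME_@}):
    SELECT GREATEST(INCOME_APRIL, INCOME_MAY, INCOME_JUNE) AS HIGHEST_INCOME
    FROM INCOMES;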
A prompt allows you to require a value from the user when using the data manipulation in any feature of
InfiniteInsight®.
To create a Prompt, refer to section Prompts > Creating a Prompt (see "Creating a Prompt" on page 43).
Categories
Using the existing variable categories to create conditions with interval, equality or inequality relations is a
frequent need. The expression editor allows you to do that through the category extraction feature.
Extracting Categories
To Extract Categories
To extract categories from the variables of your database:
1 In the Variables section, double-click the option Categories. A sub-tree is displayed with the mention Extract
Categories... .
2 Double-click the option Extract Categories... A new window that lists all available fields opens.
8 Choose the type of variable, the variable and the category that you want to use, for example Nominal,
client_type and OWNER respectively.
2 Double-click the option Extract Categories... A new window that lists all available fields opens.
Note - The Sub-sampling mode replaces the Sample size mode available in case the box Advanced
Settings is not checked. Please refer to the procedure To Extract Categories (see "Extracting Categories" on
page 13) for details about the Sample size mode.
Line Selection: The categories are extracted from the range of lines in the data set defined by the values typed
in the First Line and Last Line fields.
Random Selection: The categories are extracted from lines picked randomly in the data set. The number of lines
used to extract categories equals the proportion of the total number of lines in the data set, defined using the
Proportion slider.
Random Selection - Advanced Mode: The categories are extracted from lines picked randomly in the data set. A
random value between 0 and 1 is attributed to each line. Lines with a value in the Selected Range (value / 100)
are used for category extraction. Other lines are skipped.
Line Selection + Random Selection: The categories are extracted from lines picked randomly within the range of
lines defined by the values typed in the First Line and Last Line fields. The number of lines used to extract
categories equals the proportion of the total number of lines within the defined range, defined using the
Proportion slider.
Line Selection + Random Selection - Advanced Mode: The categories are extracted from lines picked randomly
within the range of lines defined by the values typed in the First Line and Last Line fields. A random value
between 0 and 1 is attributed to each line in the defined range. Lines with a value in the Selected Range
(value / 100) are used for category extraction. Other lines are skipped.
Note - Whatever the chosen combination, some categories of the data set may be missing in the expression
editor if you choose to perform category extraction on a small portion of the data set.
7 Click the Extract button. A progress bar is displayed.
8 When the progress bar closes, click the Close button. Two tree items have been added in the Categories
section: Nominal Variables and Other Variables.
9 Choose the type of variable, the variable and the category that you want to use, for example Nominal,
Occupation and Adm-clerical respectively.
When using several field sets in the same formula, you must select in the drop-down list Fields Association how
the fields from these sets will be associated:
Associate by Position: the fields from each set are associated depending on their position in the database
table.
Associate by Value: the fields from each set are associated depending on the value represented by the
wildcard characters used to define the field sets.
Do Cartesian Product: all the fields from one set are associated to all the fields from the other set.
Examples:
Let's consider a database table containing the following fields ordered as displayed:
income_january
income_april
income_february
income_march
expenses_march
expenses_january
expenses_april
The following formula INCOME_@ - EXPENSES_# uses two field sets, one grouping all the fields starting
with INCOME_, and the other all the fields starting with EXPENSES_:

INCOME_@          EXPENSES_#
income_january    expenses_march
income_april      expenses_january
income_february   expenses_april
income_march
1. If the option Associate by Position is selected, the fields from each set are associated in the order shown
above, resulting in the following calculations:

Calculation                        Position
income_january - expenses_march    1
income_april - expenses_january    2
income_february - expenses_april   3
2. If the option Associate by Value is selected, InfiniteInsight® will try to match the value represented by @
with the value represented by #, resulting in the following calculations:
income_january - expenses_january
income_april - expenses_april
income_march - expenses_march
3. If the option Do Cartesian Product is selected, all the fields from one set are associated with all the fields
from the other set, resulting in the following calculations:
income_january - expenses_march
income_january - expenses_january
income_january - expenses_april
income_april - expenses_march
income_april - expenses_january
income_april - expenses_april
income_february - expenses_march
income_february - expenses_january
income_february - expenses_april
income_march - expenses_march
income_march - expenses_january
income_march - expenses_april
Messages
The color of the indicator located above the Messages area indicates the state of the formula.
red: the formula contains an error, which is reported in the message area. It is not possible to validate it (the
Next button is disabled).
yellow: the formula can be validated, but some inconsistencies may occur and are reported in the message area.
green: the formula is valid.
In the case of an error or a warning, the Messages area located in the lower part of the panel provides details to
help you understand the problem.
The Fields tab displays the fields from the source table and allows you to add your own fields.
Each field can be referred to by an alias. Aliases are usually used to differentiate fields having the same name but
coming from different tables. By default, a field alias is the field name but you can change it.
Field visibility allows you to choose which fields will appear in the data set.
A field description allows you to add comments on this field to make it easier to use.
or
1 Right-click the line corresponding to the selected field. A contextual menu is displayed.
2 Select the Edit option. The panel New Computed Field is displayed.
New computed fields are useful when you want to add new variables that are not saved in any table but can be
derived from other data on demand. You should add variables when you think doing so can render more
information available to a predictive or descriptive model. For example, a ratio between two variables that has
business meaning or the conversion of a birth date to an age are useful transformations that would be very
difficult for modeling software to infer automatically.
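As an illustration, both examples above translate to simple SQL expressions. The sketch below assumes a hypothetical CUSTOMERS table with INCOME, EXPENSES and BIRTH_DATE columns; the exact date arithmetic varies between databases.

    SELECT INCOME / NULLIF(EXPENSES, 0) AS SPENDING_RATIO,   -- ratio with business meaning
           EXTRACT(YEAR FROM CURRENT_DATE)
             - EXTRACT(YEAR FROM BIRTH_DATE) AS AGE          -- rough birth date to age conversion
    FROM CUSTOMERS;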
Technically, this can be decomposed into six types of variable creation:
Aggregate,
Condition,
Lookup Table,
Normalization,
SQL Expression,
Function.
You can also use the Expression Editor provided by InfiniteInsight® to define new fields as you want (see section
Expression Editor (on page 8)).
2 Select the type of field you want to create. The corresponding edition panel is displayed.
or
Right-click anywhere in the field list. A contextual menu is displayed.
Aggregates
The Aggregate option allows you to create aggregates similar to those automatically created by the InfiniteInsight®
Explorer - Event Logging feature. You can use InfiniteInsight® Explorer - Event Logging to automatically generate
many different aggregates, then use an InfiniteInsight® Modeler - Regression/Classification model to identify the
important ones for your business issue. You can then create a new data set containing only the most relevant
aggregates.
Exists: checks if at least one event exists for the current reference. Returns 0 if no event has been found, 1 if at
least one event has been found.
NotExists: checks if no event exists for the current reference. Returns 0 if at least one event has been found, 1 if
no event has been found.
First: identifies the value of the first chronological occurrence for the current reference. Note - needs a date
column.
Last: identifies the value of the last chronological occurrence for the current reference. Note - needs a date
column.
2 In the list Target Column, select the variable to which you want to apply the selected function. When you
select the Count operation, an additional option (*) allows you to avoid selecting a specific column.
It is possible to select more than one Target Column.
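Conceptually, such aggregates correspond to grouped SQL over the event table. The sketch below assumes hypothetical CUSTOMERS and EVENTS tables linked by a CUSTOMER_ID key, with an EVENT_DATE column.

    SELECT c.CUSTOMER_ID,
           CASE WHEN COUNT(e.CUSTOMER_ID) > 0
                THEN 1 ELSE 0 END AS EVENT_EXISTS,   -- Exists: 1 if at least one event found
           MIN(e.EVENT_DATE) AS FIRST_EVENT_DATE     -- First relies on the date column
    FROM CUSTOMERS c
    LEFT OUTER JOIN EVENTS e ON e.CUSTOMER_ID = c.CUSTOMER_ID
    GROUP BY c.CUSTOMER_ID;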
To Define a Filter
Two types are available:
Filter Event Table shows the list of the filter(s) performed on the Event Table;
Filter Reference Table shows the list of the filter(s) performed on the Reference Table.
1 Click the Add button to define a complex condition using the expression editor.
2 To amend a previously defined condition, select it and click the Edit button.
3 To delete a previously defined condition, select it and click the Remove button.
To Define a Pivot
1 Select the variable to filter by in the Variables drop-down list.
2 To add categories to the table, you can:
automatically extract the variable categories by clicking the magnifier button located next to the list, and
then select the values to keep or exclude by checking the corresponding Selection box;
or
enter a value in the field New Category and click the + button;
or
load a list of categories from a file by clicking the open file button located at the right of the Categories
table, and then select the values to keep or exclude by checking the corresponding Selection box.
Note - The number of created variables is indicated at the bottom of the panel. This number grows exponentially
when filtering by pivot. The higher the number of variables, the longer the model training takes.
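When filtering by pivot, each kept category typically becomes an aggregate of its own, which is why the number of created variables grows so quickly. A sketch, again with hypothetical EVENTS rows carrying a PRODUCT column:

    SELECT CUSTOMER_ID,
           SUM(CASE WHEN PRODUCT = 'BOOKS' THEN 1 ELSE 0 END) AS COUNT_BOOKS,
           SUM(CASE WHEN PRODUCT = 'MUSIC' THEN 1 ELSE 0 END) AS COUNT_MUSIC
    FROM EVENTS
    GROUP BY CUSTOMER_ID;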
Defining a Condition
A condition allows you to define a field value depending on another field value case by case.
To Create a Condition
1 In the New list, select the option New Condition. The panel New Condition is displayed.
3 Set the condition parameters as detailed in the section Expression Editor (on page 8).
4 Click the OK button. <condition> is replaced by the condition you have defined.
7 Repeat steps 4 and 5 for each new case. The order in which the cases are listed is important since the first
true condition will determine the computed field value.
8 Use the buttons located on the left to order the cases.
You can delete a case from the list by selecting it and clicking the button Remove Case.
9 On the line Else, click <return value> to define the value to be used when none of the defined cases are true.
10 Click the Next button. The pop-up Enter the Computed Field Name opens.
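Because the first true case determines the value, a condition behaves like a searched CASE expression in SQL, whose branches are evaluated in order. A sketch with hypothetical cases on an AGE field:

    CASE
        WHEN AGE < 18 THEN 'MINOR'    -- tested first
        WHEN AGE < 65 THEN 'ADULT'    -- tested only if the first case is false
        ELSE 'SENIOR'                 -- used when none of the cases are true
    END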
The user specifies cases, each of which is made up of a list of discrete values and a corresponding label. A classic
example is a “look-up table” that is used as a dictionary to translate values from identifiers into strings, or to group
values representing fine distinctions into a smaller number of more general bins.
2 In the Field list, select the field from which the value will come.
3 Select the type of results you want to generate in the list Output Storage.
4 In the column IF, double-click the value <UNDEFINED> to enter the first value you want to add as an entry.
5 In the column THEN, double-click the value <UNDEFINED> to enter the result value corresponding to that first
input value.
6 To add another set of values, click the + button located on the right. A new line with undefined values is
displayed.
7 Repeat steps 4 and 5 for this new entry.
8 In the field Other Values, enter the value that will correspond to the field values not set as specific entries.
9 Click the Next button. The pop-up Enter the Computed Field Name opens.
2 Select the entry you want to delete and click the - button located on the right. The selected entry is deleted.
WARNING - If the nominal variable you have selected contains too many categories (for example an Id variable), the categories will
not be extracted.
2 If you have already extracted the categories from the selected variable once, select the option Select
Reference Variable.
3 Otherwise, select the option Extract values and fill. The pop-up window Extract Field Values opens.
Defining a Normalization
Normalization is a standard InfiniteInsight® Explorer - Semantic Layer primitive that appears in PMML (Predictive
Model Markup Language), a data mining specification language defined by the Data Mining Group (DMG).
Normalization is frequently applied to numeric variables prior to data mining and consists of a piece-wise linear
transform with the resultant variable typically ranging from 0 to 1. This can be used for rank transformations,
where the output represents magnitude in terms of the approximate proportion (percentile) of values below the
input value. Alternatively, a field may be converted based on how many standard deviations a value is from the
field's mean. Part of normalization is also specification of what value to use when a numeric input value is
unknown or out of the range seen in the training data.
2 Select a field in the list Select a Field to Normalize. Only integer or number fields are displayed.
3 In the section Normalization Points, click the + button to add a normalization point.
You need to define at least two points, but you can add more if needed by repeating steps 3 to 5 for each new
point.
7 In the list Minimum Values of the section Define Out of Range Behavior, select the behavior that should be
applied to values lower than the lowest point previously set. The available values are detailed in the following
table:
Null Value: out-of-range values are replaced by the Null value, meaning that they are not displayed on the plot.
8 Click the Next button. The pop-up Enter the Computed Field Name opens.
2 Enter a valid SQL expression in the text field. If the expression is not correct, an error will be displayed when
you display the View Data tab.
3 Select the type of result the SQL expression will return in the list Result Type.
4 In the Type list, select the value type of the result.
5 Click the OK button. The pop-up Enter the Computed Field Name opens.
3.2.3 Merge
The Merge tab allows you to create a merge between the source table and another table in your database, that is,
to add information contained in another table when the selected field from the source table is equal to the
selected field of the target table.
Creating a Merge
To Create a Merge
Table
1 Select the table to be joined in the list Target Table.
2 Click the + button.
3 Select the source field in the list Source Field.
If the selected table contains fields corresponding to the source field, they are displayed in the list Target
Field; otherwise the message No Fields Available is displayed.
5 Click OK to create the merge. This button is only activated when all the elements needed for the merge are
selected.
Once a merge has been created, all the fields of the target table are added to the list Source Field allowing you
to create new merges from them.
Fields
When merging new fields, you can define the naming yourself by adding a prefix or a suffix to the aliases.
In the Prefix field, enter the prefix of your choice followed by an underscore.
In the Suffix field, enter the suffix of your choice preceded by an underscore.
Removing a Merge
To Remove a Merge
1 Select the merge to remove in the list.
2 Click the Remove button.
After the merge of two fields from different tables, you have the possibility to create a composite key by joining
two other fields of these same tables.
3.2.4 Filters
The Filters tab allows you to select only the records of interest for answering your business question. You can
compare a field value to another field, a constant or a prompt, and keep only the corresponding observations.
You can choose if you want the selected records to match all the filters (which corresponds to the logical operator
AND) or at least one of the listed filters (which corresponds to the logical operator OR).
To Create a Filter
1 Click the button New Condition. The Expression Editor opens.
2 Set the parameters as detailed in the section Expression Editor (on page 8).
Removing a Filter
To Remove a Filter
1 Select the filter you want to remove in the list.
2 Click the Remove button.
3.2.5 Prompts
A prompt allows you to require a value from the user when using the data manipulation in any feature of
InfiniteInsight®.
There are two ways to set a prompt, either in the Prompt tab or when creating a new field or filter.
To Create a Prompt
1 On the Prompt tab, click the button New Prompt. The Prompt Editor opens.
2 Enter a name for the prompt in the Name field. This name will allow you to select the prompt as a value when
creating a field or filter. Select the type in the Type list.
3 Enter the default value in the Value field. When prompting the user, this value will be suggested by default.
4 Enter the sentence that will ask the user for a value in the Description field. For example: "What is the
minimum age required?".
5 Click the OK button to create the new prompt.
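At run time, the value supplied for the prompt is substituted into the generated SQL. As a sketch, a filter built on a hypothetical prompt named MIN_AGE with the default value 18 would contribute a clause such as:

    WHERE AGE >= 18    -- 18 is the prompt value supplied (or kept) by the user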
Notes
- The prompts do not appear in the tab View Data.
- Prompts that are not used in the data manipulation (for example in a field or a filter) are deleted when you
save it.
Editing a Prompt
To Edit a Prompt
1 On the Prompt tab, click the Edit button. The Prompt Editor is displayed.
Removing a Prompt
You can only delete prompts that are not used in a field or a filter.
To Remove a Prompt
1 In the Prompt tab, select the prompt you want to delete.
2 Click the Remove button. If the prompt is used somewhere in the data set, a message box informs you that
the prompt cannot be deleted.
The tab View Data displays the data manipulation content. It allows you to verify if the results correspond to what
you expect. In this panel you can sort the data by column and select the rows to display.
2 In the field Last Row Index, enter the number of the last row you want to display.
3 Click the Refresh button to display the selected rows in the table above.
Searching a Variable
When your data set contains a great number of variables, you can search for and display a specific variable
using the Search button located in the lower part of the panel View Data.
To Search a Variable
1 Click the Search button. The Search window is displayed.
2 Select the tab corresponding to the type of search you want to do.
The Index tab allows you to find a variable by its index number. This number can be found in the
first column of the Fields tab.
The Variable tab allows you to find a variable by selecting its name in a list.
The tab SQL Statements displays the SQL query corresponding to the data manipulation being built.
The Documentation tab allows you to get an overview of your data manipulation. It contains all the options
selected for your data manipulation like Filters, Merge, Prompts or Expressions.
This screen shows:
Graphic Summary
Visible / Invisible Fields
Prompt
Expressions
Filters
For example, the Graphic Summary displays:
the list of tables included in your data manipulation and how they were processed - with merge or
aggregates for example;
and the fields kept for your data manipulation, known as the Visible Fields.
The Visible Fields section displays the visible fields and their column order in a table.
The Overview Settings allow you to choose how the overview will be formatted, both for viewing and exporting.
The generated file can be saved in .txt, .htm and .rtf file formats.
InfiniteInsight® provides three ways to save a data manipulation that can be used concurrently:
as an InfiniteInsight® Data Manipulation,
as a table or a view,
as a KxShell script.
1 When you have set all the parameters for the new data manipulation, click the Next button. The panel Save
and Export is displayed.
2 In the field Data Manipulation Name, enter a name for the newly created data set. This name allows you to
recognize and select the data set in your database.
3 You can enter additional information in the Description field.
2 Choose if you want to Save as a Table or to Save as a View by selecting the appropriate option.
3 Enter the name of the new table or view in the field Name of the Table/View.
2 Use the Browse button corresponding to the Folder field to indicate where the script is to be saved.
3 In the field KxShell Script, enter the name of the script file.
Now that you have saved your data manipulation, it can be used in another InfiniteInsight® feature.
1 In the InfiniteInsight® main menu, select the feature you want to use.
3 With the Browse button of the Folder field, select the database where you have saved the data manipulation
created with the corresponding feature. If necessary, enter the login and password granting you access to the
database.
4 With the Browse button of the Estimation field, select the data manipulation.
In the database, the data manipulations created with the InfiniteInsight® Explorer - Semantic Layer feature are
represented by a dedicated icon.
5 Click the Next button. If the data manipulation uses prompts, they are displayed at this time.
You can either keep the default value, or enter a new value corresponding to your needs.
6 Click the Save button to validate the change. The Cancel button automatically keeps the default value.
7 Once all the prompts have been validated, the description panel is displayed.
You can now use the feature as you would with a standard data source.
The user wants to create a model on “customer” to predict the response to a mailing campaign, but the
information on “customer” is spread over several tables. One table contains reference information about the
customer. One field in this table represents the region, which is in fact an identifier (foreign key) that links to
another table containing demographic information about the region. A second field in the customer reference
table contains an identifier that links to another table describing the associated contract. The user would like to
complete the customer data set with demographics about the customer's region and characteristics of the
contract in order to see if this information can help to predict the customer response.
Technically, if the two tables can be accessed through the same database, this operation is a join. More
specifically, we are interested here in “outer left” joins, where the number of lines of the reference table is not
changed by the fact that information exists in the “decoration” table. If the information does not exist, then the
extra fields should be brought back empty.
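In SQL terms, the decoration described above is a left outer join. The sketch below assumes hypothetical CUSTOMER and REGION tables linked by a REGION_ID foreign key.

    -- Every customer line is kept; the region fields come back empty (NULL)
    -- when no matching region exists.
    SELECT c.*, r.POPULATION, r.MEDIAN_INCOME
    FROM CUSTOMER c
    LEFT OUTER JOIN REGION r ON c.REGION_ID = r.REGION_ID;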
A different type of use case is that of aggregation, where there may be more than one line in the decoration table
that can be associated with each “event” in the reference table. This situation occurs when dealing with
“transactions”. An example is a “fact table” that consists of records indicating that a given customer bought a
certain product on a specific date for a specified price. Of course, there can be several such occurrences per
customer. Accordingly, it is necessary to summarize them in one or more ways (for example into fields that
extend a customer table) if the data is to be used in predictive modeling or segmentation. This is accomplished
via aggregation (sometimes called pivoting, or transposition) and is addressed via InfiniteInsight® aggregation
modules, Event Logging and Sequence Coding.
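As a sketch of such a summary, assume a hypothetical PURCHASES fact table with CUSTOMER_ID, PURCHASE_DATE and PRICE columns:

    -- One line per customer, summarizing all of that customer's transactions.
    SELECT CUSTOMER_ID,
           COUNT(*) AS NB_PURCHASES,
           SUM(PRICE) AS TOTAL_SPENT,
           MAX(PURCHASE_DATE) AS LAST_PURCHASE_DATE
    FROM PURCHASES
    GROUP BY CUSTOMER_ID;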
5.2.2 Filtering
The user wants to create a model predicting churn only for customers associated with a prepaid agreement. In
this case, one table contains reference information about the customer; one field contains an identifier that links
to another table containing information about his contract and the contract type. The user would like to keep only
customers with prepaid contracts to train the model.
Technically, this can be seen as a “where” clause in an SQL “select” statement. Sometimes, the same “where”
clause needs to be used for both training and apply data sets, and sometimes, the “where” clause can be used to
separate the training data set from the apply data set. In the latter case, it is important to be able to set the value
of an argument at runtime.
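Both uses can be sketched with hypothetical columns, where the DATASET flag would be driven by an argument set at runtime:

    -- Same "where" clause for the training and apply data sets:
    WHERE CONTRACT_TYPE = 'PREPAID'

    -- Separating the training data set from the apply data set,
    -- with the flag value set through an argument at runtime:
    WHERE CONTRACT_TYPE = 'PREPAID' AND DATASET = 'TRAIN'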
The user wants to add new variables that are not saved in any table but can be derived from other data on
demand. When using InfiniteInsight® Platform, variable creation should be primarily business-driven since the
software already knows how to transform data to accommodate its algorithms. Users should add variables when
they think this could provide more information for a predictive or descriptive model. For example, a ratio between
two variables that has business meaning or the conversion of a birth date to an age are useful transformations
that would be very difficult to infer automatically with modeling software.
Technically, this can be decomposed into several types of variable creation:
Using a predefined function
Normalization
Case-based processing
A very wide range of potential functions could interest users. However, most of these functions are built using one
or more of the following elements:
Mathematical operators: such as +, *, -, /, %
Logical operators: AND, OR
String manipulations: LIKE, SUB-STRING, TRIM, LEFT-TRIM, RIGHT-TRIM, UPPER-CASE, LOWER-CASE.
Mathematical functions: LOG, EXP
Date manipulations: YEAR, MONTH, DAY, TIME, MINUTE, SECOND, DATE ADDITION, DATE DIFFERENCE
Note - The DayOfTheWeek function returns an integer that varies from 1 to 7 where 1 stands for Sunday, 2 for
Monday, ... and 7 for Saturday. This is the standard behavior of major RDBMS (including DB2, ORACLE,
SQLServer, ACCESS).
Normalization is a standard InfiniteInsight® Explorer - Semantic Layer primitive that appears in PMML (Predictive
Model Markup Language), a data mining specification language defined by the Data Mining Group (DMG).
Normalization is frequently applied to numeric variables prior to data mining and consists of a piece-wise linear
transform with the resultant variable typically ranging from 0 to 1. This can be used for rank transformations,
where the output represents a magnitude in terms of the approximate proportion (percentile) of values below the
input value. Alternatively, a field may be converted based on how many standard deviations a value is from the
field's mean. Part of normalization is also the specification of what value to use when a numeric input value is
unknown or out of the range seen in the training data.
An example of normalization is graphed below. The X-axis represents the original data, the Y-axis the normalized
value. The minimum and maximum X values correspond to the 0 and 1 values on the Y-axis, respectively. Each
point is defined based on the normalization method. Normalization values corresponding to data values falling
in between normalization points on the X-axis are calculated using straight-line interpolation. The graph is shaped
like an S, indicating that the normalization method is designed to enhance (spread out) differences between values
close to the mean while diminishing the magnitude of extreme (very high or very low) values.
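A minimal sketch of such a piece-wise linear transform, assuming just two normalization points (0, 0) and (100, 1) on a hypothetical INCOME field, with out-of-range values clipped:

    CASE
        WHEN INCOME < 0   THEN 0.0    -- below the lowest point
        WHEN INCOME > 100 THEN 1.0    -- above the highest point
        ELSE INCOME / 100.0           -- straight-line interpolation in between
    END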
Case-based Processing
This can be decomposed into sub-operations. A lot of user-defined manipulations are based on “cases”. A classic
example is a “look-up table” that is used as a dictionary to translate values from identifiers into strings, or to group
values representing fine distinctions into a smaller number of more general bins. A different example is the
application of complex conditions to generate outcome indicators or to segment a continuous value into
sub-ranges with corresponding segment identifiers. Sometimes, such manipulations can be done through joins
with a decoration table, but often it is easier to define them directly. In InfiniteInsight® platform, we have focused
on the following cases:
Look-up table
The user specifies cases, each of which is made up of a list of discrete values and a corresponding label.
Numeric case
The user specifies ranges (minimum and maximum values) and corresponding labels.
Generic case
The user specifies different cases, where each is defined through a possibly-complex Boolean expression.
The result of each expression is associated with either a user-defined variable, a field from the database, a
constant or a prompt.
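The first two kinds of case can be sketched as follows, with hypothetical fields and labels:

    -- Look-up table: discrete values translated into labels.
    CASE CLIENT_TYPE WHEN 1 THEN 'OWNER' WHEN 2 THEN 'TENANT' ELSE 'UNKNOWN' END

    -- Numeric case: ranges mapped to a smaller number of bins.
    CASE WHEN INCOME < 1000 THEN 'LOW'
         WHEN INCOME < 5000 THEN 'MEDIUM'
         ELSE 'HIGH' END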
It is possible to migrate a Data Manipulation while performing a data transfer. This option is available via the
InfiniteInsight® Toolkit feature.
The data transfer process encompasses various steps starting from the Data Manipulation creation in
InfiniteInsight® Explorer to performing its transfer in the InfiniteInsight® Toolkit.
2 Select:
a Data Type - in the example Data Base,
a Database Source - in the example kxendemo,
and a Table - in the example kxenodbc.client.
3 Click Next.
The panel Data Manipulation Editor appears.
For further details on how to perform this step, see Creating a Data Manipulation (on page 6).
Important!
The new database must contain the same source table(s) as the existing database; that is, if the Data Manipulation is derived from
sources TableA and TableB, both TableA and TableB must be migrated to the new database before performing the Data
Manipulation transfer.
7 Edit the correct field mapping for the Target by clicking on the Edit Field Mapping icon.
The Edit Field Mapping screen appears. By default, there is no mapping defined.
10 Click Check.
A pop-up window indicates that the mapping for the transfer is correct.
3 Select the transferred Data Manipulation and click Open. The Data Manipulation loads.
7 Annex
7.1 Functions of the Expression Editor
7.1.1 Arithmetic Operators
Function Syntax/Example
Add add(numeric)
Example:
add(1,2,3) => 6
Exponential exp(numeric)
Example:
exp(16) => 8886110
Logarithm ln(numeric)
Example:
ln(13) => 2.56495
Multiply multiply(numeric)
Example:
multiply(2,4) => 8
Round round(number)
Example:
round(15.2) => 15
Round Up ceil(number)
Example:
ceil(156.3) => 157
Sign sign(numeric)
Example:
sign(123456) => 1
sign(-123456) => -1
7.1.2 Boolean Operators
Function Syntax/Example
And Returns "1" if all the operands are true, otherwise returns "0"
and(<boolean>)
Example:
and(age > 18)
Does not Start with Returns "1" if the first string does not start with the second one
notStartsWith(Test:string, Candidate:string)
Example:
notStartsWith("KxTimestamp","Kx") => 0
Greater than Returns "1" if the left operand is greater than the right one
greater(LeftOperand:any, RightOperand:any)
Example:
greater(15,16) => 0
greater(16,15) => 1
Is True isTrue(boolean)
Example:
isTrue(age<30)
Checks if age < 30 is true.
If true, it returns "1" otherwise it returns "0".
Is not in List Returns "1" if the selected field is not in the selected values
notIsInList(Field:any, <Values:any>)
Example:
notIsInList(disp_id, 12, 16)
if disp_id = 1 the result is true; if disp_id = 12 or disp_id = 16 the result is false.
Not not(boolean)
Not Like Returns "1" if the given string does not match the given pattern
Use % to determine the place of the pattern in the string.
For example, if the pattern ends with %, you are looking for a string that starts with the pattern.
If the pattern starts with %, you are looking for a string that ends with the pattern. You can use
more than one % in the pattern.
notLike(Test Operand:string, Candidate Operand:string)
7.1.4 Miscellaneous Operators
Function Syntax/Example
If Valid Returns the value of the second argument if the first argument is true, otherwise returns null
ifValid(boolean, any)
7.1.5 String Operators
Function Syntax/Example
Concatenate concat(<string>)
Example:
concat("data", "mining") => datamining
Does not Start with Returns true if the first string does not start with the second one
notStartsWith(string, string)
Like Indicates whether the given string matches the given pattern
Use % to determine the place of the pattern in the string.
For example, if the pattern ends with %, you are looking for a string that starts with the pattern. If
the pattern starts with %, you are looking for a string that ends with the pattern. You can use more
than one % in the pattern.
like(string, pattern)
Not Like Returns "1" if the given string does not match the given pattern
Use % to determine the place of the pattern in the string.
For example, if the pattern ends with %, you are looking for a string that starts with the pattern. If
the pattern starts with %, you are looking for a string that ends with the pattern. You can use more
than one % in the pattern.
notLike(string, pattern)
Replace Replaces all occurrences of a specified string value with another string value
replace(tested string, string to replace, replacement string)
Start with Returns true if the first string starts with the second one
startsWith(StartWithStringTest:string, StartWithStringCandidate:string)
7.1.6 Conversion Operators
Function Syntax/Example
Converts Date to Integer Returns the number of days elapsed since 1900-01-01
dateToInteger(yyyy-mm-dd)
Example:
dateToInteger(2010-01-01) => 40177
Converts String to Date Note - If the content of the string does not correspond to a date, it leads to an
error.
stringToDate(string)
Converts String to DateTime Note - If the content of the string does not correspond to a DateTime, it leads to
an error.
stringToDateTime(string)
Converts String to Integer Note - If the content of the string does not correspond to an integer, it leads to
an error.
stringToInt(string)
Converts String to Number Note - If the content of the string does not correspond to a number, it leads to
an error.
stringToNumber(string)