Ab Initio Cook>Book
NOTICE
This document contains confidential and proprietary information of Ab Initio Software Corporation.
Use and disclosure are restricted by license and/or non-disclosure agreements. You may not access,
read, and/or copy this document unless you (directly or through your employer) are obligated to
Ab Initio to maintain its confidentiality and to use it only as authorized by Ab Initio. You may not copy
the printed version of this document, or transmit this document to any recipient unless the recipient
is obligated to Ab Initio to maintain its confidentiality and to use it only as authorized by Ab Initio.
COPYRIGHTS
Copyright 1997-2006 Ab Initio Software Corporation. All rights reserved.
Reproduction, adaptation, or translation without prior written permission is prohibited, except as allowed under copyright
law or license from Ab Initio Software Corporation.
TRADEMARKS
The following are worldwide trademarks or service marks of Ab Initio Software Corporation (those marked are registered
in the U.S. Trademark Office, and may be registered in other countries):
Certain product, service, or company designations for companies other than Ab Initio Software Corporation are mentioned in
this document for identification purposes only. Such designations are often claimed as trademarks or service marks. In
instances where Ab Initio Software Corporation is aware of a claim, the designation appears in initial capital or all capital
letters. However, readers should contact the appropriate companies for more complete information regarding such
designations and their registration status.
WARRANTY DISCLAIMER
THIS DOCUMENT IS SUBJECT TO CHANGE WITHOUT NOTICE. AB INITIO SOFTWARE CORPORATION MAKES NO WARRANTY OF ANY KIND WITH
REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE. AB INITIO SOFTWARE CORPORATION SHALL NOT BE LIABLE FOR ERRORS CONTAINED HEREIN OR FOR INCIDENTAL OR
CONSEQUENTIAL DAMAGE IN CONNECTION WITH THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL.
Parsing names............................................................................................... 5
Graph for parsing names ...................................................................................................................6
Creating the graph .............................................................................................................................6
Filling gaps................................................................................................... 33
Graph for filling gaps .......................................................................................................................34
Configuring the SCAN ......................................................................................................................36
The SCAN's output .........................................................................................................40
Writing the NORMALIZE transform ...................................................................................................41
Specifying the output record format.................................................................................................42
Running the graph and viewing the results ......................................................................................43
Measuring performance................................................................................ 45
Simple benchmarking graph ............................................................................................................45
Checking the accuracy of the outputs ..............................................................................................49
Other suggestions for measuring performance.................................................................................51
Startup time............................................................................................................................52
Elapsed time...........................................................................................................................53
Low-level operations...............................................................................................................54
Global sort.................................................................................................... 57
Simple global sort............................................................................................................................57
Sampling the data...................................................................................................................59
Calculating and broadcasting the splitters ..............................................................................59
Partitioning by range and sorting the partitions .......................................................................60
Checking the results ...............................................................................................................61
Producing serial output ...................................................................................................................62
Finding differences....................................................................................... 65
Creating a differencing graph...........................................................................................................68
Variations on simple differencing.....................................................................................................74
Quantiling .................................................................................................... 89
Simple serial deciling.......................................................................................................................90
Fixing decile boundaries ..................................................................................................................92
Parallel deciling...............................................................................................................................95
Getting the total record counts and starting record counts for each partition ..........................97
Calculating the splitters and fixing the boundaries across partitions .....................................101
Preparing the decile splitters for the final JOIN ......................................................................102
Assigning the deciles and outputting the data .......................................................................102
Parameterized parallel quantiling ..................................................................................................103
Creating a quantiling subgraph and inserting it into a test graph ...........................................104
Adding and exporting parameters ..........................................................................................106
Replacing hardcoded values with parameters .......................................................................107
Testing your quantiling subgraph...........................................................................................109
Index...................................................................................................................................................221
Setting up
To set up your Cook>Book environment:
1. Open a new instance of the GDE (to ensure that you will be running the setup script in the
host directory specified in the Run Settings dialog).
2. From the GDE menu bar, choose Run > Execute Command (or press F8).
3. In the Execute Command dialog, type the following and click OK:
$AB_HOME/examples/setup/set-up-cookbook.ksh
The input file stores the first name, middle initial, and last name in the name field of the
following record format (in people.dml):
record
decimal(',') id;
string(',') name;
decimal('\n') age;
end
tname (type string, length ""): Holds in.name with leading and trailing blank characters removed:
string_lrtrim(in.name)

end_of_first (type integer, length 4): Represents the location of the first blank character in tname:
string_index(tname, ' ')

beginning_of_last (type integer, length 4): Represents the location of the last blank character in tname:
string_rindex(tname, ' ')
a. Hold down the Shift key and double-click the REFORMAT component to display the
Transform Editor.
NOTE: If the Transform Editor opens in Text View, choose View > Grid View.
b. Switch to the Variables tab and choose Edit > Local Variable to display the Variables
Editor.
c. On the Fields tab, enter the name, type, and length attributes shown above, and save
the variable.
a. Position the cursor in the first blank line on the Business Rules tab.
b. Choose Edit > Rule to display the Expression Editor.
c. In the Functions pane, expand the String Functions item and double-click
string_substring(str, start, length) to display an empty function definition
(string_substring(?, ?, ?)) in the bottom pane.
d. Replace the question marks with the following parameters, either by typing them
directly or by dragging the variable names from the Fields pane and typing as needed:
string_substring(tname, beginning_of_last + 1,
length_of(tname))
e. Click OK to return to the Transform Editor.
f. Drag the rule port to the last_name field in the Outputs pane.
NOTE: In the statement that computes last_name, the call to string_substring uses
the length of the entire string length_of(tname) for the length of the substring.
This has the effect of simply taking all the characters starting at beginning_of_last +
1. While the length calculation could have been refined (to length_of(tname) -
beginning_of_last), the result would be the same, as string_substring simply returns
the available characters.
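For reference, the remaining output rules in name-parsing.mp follow the same pattern as the last_name rule above. A sketch of all three rules, assuming a name of the form first name, middle initial, last name separated by single blanks (the output field names first_name and middle_initial are assumed here; only last_name appears in the steps above):

out.first_name :: string_substring(tname, 1, end_of_first - 1);
out.middle_initial :: string_substring(tname, end_of_first + 1,
beginning_of_last - end_of_first - 1);
out.last_name :: string_substring(tname, beginning_of_last + 1,
length_of(tname));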
NOTE: The REFORMAT component in the graph is not explicitly described here, as it is merely
a placeholder representing any component or set of components that need to be tested
with a large volume of data.
a. Hold down the Shift key and double-click the FILTER BY EXPRESSION component to display the
Expression Editor.
The input file contains the following data, which is sorted on the date field (dt):
Globally unique ID
The key parameter of the Scan: Assign id Globally component is set to the empty key: { },
causing the entire input file to be treated as a single key group. The SCAN simply counts the
records and assigns the current count to the id field for each record, so that every record is
uniquely identified across the entire output file. The output data looks like this:
NOTE: Like the other two SCAN components in the graph, this is a sorted SCAN (its sorted_input
parameter is set to Input must be sorted or grouped). The other two SCANs work without
upstream sorting, because the input data was already sorted on the dt field, and key was set
to either the empty key: { } or the dt field. But because the third SCAN has a two-field key
parameter, {dt; kind}, the transform will fail if the input data is not sorted on both of those
fields. Therefore, this SCAN must be preceded by a SORT WITHIN GROUPS component that uses
dt as its major_key parameter and kind as its minor_key parameter.
The output data looks like this:
NOTE: The scope of uniqueness for the count value generated by the SCAN and assigned to an
output field is determined by the key parameter of the SCAN, as described in Using the
SCAN's key parameter to specify the key field(s) in the output (page 31).
1. Attach a SCAN component to your input file.
2. Hold down the Shift key and double-click the SCAN component to display the Transform
Editor.
NOTE: If the Transform Editor opens in Grid View, choose View > Text.
3. Delete all default transform functions except initialize, scan, and finalize.
4. Define the type temporary_type to be a decimal("") with an appropriate name (such as
count) by editing the default definition to read as follows:
type temporary_type =
record
decimal("") count;
end;
5. Code the initialize function as follows to initialize count to zero:
temp::initialize(in) =
begin
temp.count :: 0;
end;
6. Code the scan function as follows to increment count by 1 for each record it processes:
temp::scan(temp, in) =
begin
temp.count :: temp.count + 1;
end;
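The remaining function, finalize, is not shown here. Based on the output described earlier (a running count assigned to every record), it would typically copy the input record and assign the count to the output key field. A sketch, assuming the output field is named id as in the first SCAN of making-unique-keys.mp:

out::finalize(temp, in) =
begin
out.id :: temp.count;
out.* :: in.*;
end;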
Scope: Global
Key parameter: The empty key: { }
Every record across the entire output dataset is uniquely identified by the output field to which the SCAN's count field is assigned. In the output for the first SCAN in making-unique-keys.mp, the unique key is the id field.

Scope: A specific key group
Key parameter: The field or fields that form the key of the key group
Each record within a key group is uniquely identified by the combination of the field(s) specified in the key parameter and the output field to which the SCAN's count field is assigned:
- In the output for the second SCAN in making-unique-keys.mp, the unique key is the combination of the dt and id fields.
- In the output for the third SCAN in making-unique-keys.mp, the unique key is the combination of the dt, kind, and id fields.
NOTE: If the input file is not already sorted on this field or these fields, you can use a SORT or SORT WITHIN GROUPS as necessary, or use an in-memory SCAN instead.
NOTE: Using a special value allows you to treat the first record differently (see the
finalize function in Step c).
b. In the scan function, write the current balance date to the this_date field of the current
temporary record (for use in processing the next record), and the this_date field of the
previous temporary record to the prev_date field of the current temporary record:
out :: scan(temp, in) =
begin
out.this_date :: in.bal_date;
out.prev_date :: temp.this_date;
...
Startup time
The effect of startup time is proportionally larger when processing small data than when
processing large data, so you may need to take it into account when measuring or estimating
performance. Following is a description of how to do this for the sample record numbers and
times listed in the table below.
RECORDS      TIME (SEC)
1            3.15
50,000       4.35
500,000      15.15 (projected)
To project the time (incorporating startup time) for a run on big data:
1. Run the graph on very small data (one record) to get the startup time (here 3.15 sec).
2. Run the graph on small data (here 50,000 records take 4.35 sec).
3. Subtract the startup time from the total time for small data to get a scaling factor for use
with large data (here 4.35 - 3.15).
4. Multiply the ratio of big data to small data by the scaling factor, and then add back the
startup time. For example:
(500,000/50,000) * (4.35 - 3.15) + 3.15 = 15.15
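In general, if T1 is the startup time from step 1 and T2 is the small-data time from step 2:

projected time = (big record count / small record count) * (T2 - T1) + T1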
Elapsed time
You can measure elapsed time in different ways. Since elapsed time is sensitive to load, you
should run on an unloaded system and/or do multiple runs and take the minimum, as described
earlier. You can get information on elapsed time in several ways:
- Elapsed time for an entire graph: Displayed on the bottom right, in the GDE status
bar. If the status bar is not visible, choose View > Toolbars > Status Bar.
Low-level operations
To measure the cost of a very-low-level operation (processing you might do multiple times per
record or field) and factor out per-record processing and I/O overhead, use a for loop to execute
the operation N times per record. The graph benchmarking-low-level.mp uses this technique to
compare the performance of two different DML functions that extract strings with a specific
pattern.
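The graph itself is not reproduced here, but the repeat-N-times technique can be sketched as follows. This is a minimal illustration rather than the actual transform in benchmarking-low-level.mp; the field name, the operation being timed, and the repetition count are all placeholders:

out::reformat(in) =
begin
let integer(4) i = 0;
let string("") result = "";
/* Repeat the low-level operation many times per record so that its cost
   dominates the per-record and I/O overhead. */
while (i < 1000)
begin
result = string_lrtrim(in.name);  /* operation being measured (placeholder) */
i = i + 1;
end
out.* :: in.*;
end;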
a. Set its key specifier to the field on which to sort. (In global_sort.mp, this is the id field.)
However, that will badly affect parallel performance, as successive partitions of the SORT
component will have to wait for the CONCATENATE to consume data from preceding partitions.
Instead, it would be better to address the problem ab initio (that is, freshly, from the beginning). Rather
than finding splitters and partitioning by range, you can simply partition in round-robin fashion
(evenly distributing the data), sort the partitioned data, and then collect the data using a MERGE
component (as in global_sort_serial_simple.mp).
NOTE: If you create these graphs by modifying global_sort.mp, do not forget to change the
URL for the output file to file:$AI_SERIAL/out_file.dat.
Trash: Unchanged records (that is, records that are identical in both input files).

Different Data file: Records showing changes between the Old Data and New Data datasets. Each record includes the key field(s), plus an old_ field and a new_ field for each non-key field:
- For non-key fields that have not changed, these fields are blank.
- For non-key fields that have changed, the old_ field and new_ field contain the old and new values of the field.

Updates file: Records from the New Data file that have keys matching those of records in the Old Data file but have one or more differing non-key fields; in other words, these are updated versions of existing records.

Adds file: Records from the New Data file whose keys do not appear in any record in the Old Data file.

Deletes file: Records from the Old Data file whose keys do not appear in any record in the New Data file.
NOTE: This transform works because the fields in the output record format (specified
in difference_input.dml) do not have default values. For fields with default values,
you would have to use something like the following to ensure that records with
differences would be sent to the reject port:
out.dec_field ::
if (in.old_dec_field == in.new_dec_field)
in.old_dec_field
else force_error("Old and new values are different");
string and integer types: Since all combinations of characters are valid strings, and all sequences
of bytes are valid binary integers, you can safely use the non-equal operator (!=) to compare
string and integer types. The comparison expression for a field named f is simply:
in0.f != in1.f
decimal, date, and datetime types: If you think the quality of your data is questionable, check the
validity of decimal, date, and datetime values before comparing them. Do not simply cast them
to strings before comparing them, as values that are equal in their original types may have
different representations as strings. For example, 5 and 5.0 are equal as decimals, and
22APR1962 and 22apr1962 are equal as dates, but neither of these pairs is equivalent as
strings. Then do one of the following:
- If they are valid, compare them as their actual types.
- If they are invalid, cast them to strings of the appropriate length, and then compare
them as strings.
real types: You can safely compare the values of real types with the non-equal operator, in the
sense that this operation never produces an error. The expression in0.f != in1.f, for example,
never rejects a record.
Note, however, that the IEEE representation accommodates a value that is not a number (NaN).
Such values always compare as unequal. If you think your dataset may contain NaNs, you
should take special care to treat all NaNs as equal to one another by using the math_isnan
function in a comparison expression like this:
if (math_isnan(in0.f) && math_isnan(in1.f))
0
else
in0.f != in1.f
void types: You cannot directly compare the values of void types. Instead, reinterpret them as
strings and then compare the strings. To compare two fields of type void(100), use an expression
like this:
reinterpret_as(string(100), in0.f)
!= reinterpret_as(string(100), in1.f)
This function works for vectors of data types that can be compared using the non-equal
operator (!=). For vectors of other types (such as reals, subrecords, and vectors), you may
need to replace this expression:
in0[i] != in1[i]
with a comparison appropriate to the element type.

record types: Compare the values of record types field by field, using the appropriate rules for
each field's type.

union types: Compare the values of union types by comparing their largest members.

For fuzzy comparisons of real values, instead of using the non-equal operator directly, you can
compare within a tolerance by replacing an expression such as in0.f != in1.f with this:
math_abs(in0.f - in1.f) > 1e-10
Here the tolerance has been arbitrarily chosen to be 0.0000000001 (that is, 10^-10). Setting the
tolerance for a comparison requires consideration of business rules (what is too small to
matter?) and the details of floating-point arithmetic (how close can you expect the values to
be?). You may also find it useful to create a function that defines the tolerance for a large
collection of graphs or projects.
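Such a shared function might live in a common DML file that all the differencing graphs include. A sketch (the function name and the tolerance value are illustrative):

/* Returns 1 if the two real values differ by more than the agreed tolerance. */
out::fuzzy_not_equal(a, b) =
begin
out :: math_abs(a - b) > 1e-10;
end;

A comparison rule can then call fuzzy_not_equal(in0.f, in1.f) instead of using in0.f != in1.f directly.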
Generated record formats and transforms can be useful in all sorts of other cases. For example:
- Data validation frequently requires applying one or more tests to all fields in each
record and attaching to each unacceptable field an identification of which test(s) failed.
To automate the generation of a transform and an output record format for such
validation, you could create a graph that produces them from an input record format
and a list of rules.
- Developers who receive field names and types in formats other than DML files (for
example, data dictionaries in the form of spreadsheets) could create a graph to convert
such information into DML files.
To parse the passed-in lists of key fields and ignored fields (including the newline field) into
vectors of strings to be stored in the Special Fields lookup file, the metadata generation graph
uses two NORMALIZE components that call the user-defined function parse_field_list. A GATHER
component then combines the two flows and outputs them to the lookup file.
NOTE: This recipe demonstrates one of the key ideas behind quantiling but is
flawed, as described later. The recipe Fixing decile boundaries (page 92) addresses the flaw.
To create a serial graph like quantile1.mp:
1. Connect a ROLLUP component to your input file and use it to count the records, as follows:
a. Set key to the empty key: {}
out::join(in0, in1) =
begin
record_number = record_number + 1;
out.* :: in0.*;
out.decile :: ((record_number * 10 - 1) / in1.count) + 1;
end;
This code assigns the correct decile to each record in sequence: (record_number * 10 -
1) results in a number from 9 (for the first record) to just under 10 times the number of
records (for the last record). That quantity divided by the number of records yields a
value between 0 (for the early records) and 9 (for the last records). To label the deciles 1
through 10, the code adds 1 to that value.
4. Run the graph and examine the output data to find the flaw in this graph:
By assigning the same decile value to all records in the same key group (specifically, the decile
value of the first record in the group), the SCAN fixes any bad boundaries caused by the JOIN.
temp::scan(temp, in) =
begin
temp.decile :: temp.decile;
end;
out::finalize(temp, in) =
begin
out.decile :: temp.decile;
out.* :: in.*;
end;
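The initialize function for this SCAN is not shown above. For the scan and finalize functions to propagate the decile of the first record in each key group, initialize would have to capture that value; a sketch:

temp::initialize(in) =
begin
temp.decile :: in.decile;
end;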
Parallel deciling
The above descriptions of deciling are sufficient for serial data. However, if you are processing
large amounts of data in parallel, the records that belong in each decile will probably be
distributed across partitions, so you must take the steps summarized on page 96 to find and fix
the decile boundaries in parallel (see quantile3.mp).
Getting the total record counts and starting record counts for each partition
To calculate the deciles properly, you must calculate the total count of records across all
partitions, as well as the starting count for each partition (that is, the total count of records in the
earlier partitions). You can do this by creating a subgraph similar to Get Counts in quantile3.mp
as described after the following figure, and connecting it to one of the out ports of the REPLICATE.
temp::initialize(in) =
begin
temp :: 0;
end;
temp::scan(temp, in) =
begin
temp :: temp + in.count;
end;
out::finalize(temp, in) =
begin
out.starting_count :: temp - in.count;
end;
A more complex task is fixing the boundaries to ensure that all records with the same
scored_value will be assigned to the same decile, even though they could be in different
partitions. You must write a JOIN transform that outputs a record only when it is the last record
assigned a given decile value. The transform shown below does this by comparing the decile of
the current record to that of the next, and putting out a record only when they differ. Here the
inputs are the following:
- in0 is the partitioned, sorted records (from the Global Sort subgraph).
- in1 is the starting counts for each partition (from the Get Counts subgraph).
- in2 is the total record count (from the Get Counts subgraph).
let integer(8) record_number = 0;
let integer(4) this_decile = -1;
let integer(4) next_decile = -1;
- DENORMALIZE SORTED with its key set to the empty key: { } (to build a vector out of
them)
- BROADCAST (to send the vector to each partition of the JOIN)
out::join(in0, in1) =
begin
if (current_decile == 0)
begin
next_decile_boundary = in1.splitters[current_decile];
current_decile = 1;
end
while (in0.scored_value > next_decile_boundary)
begin
next_decile_boundary = in1.splitters[current_decile];
current_decile = current_decile + 1;
end
out.decile :: current_decile;
out.* :: in0.*;
end;
This graph should run, since you have not made any changes other than converting part of it to
a subgraph.

To update a containing graph with changes made to a linked subgraph: In the containing graph,
right-click the linked subgraph and choose Update. (You can select multiple components and
update all of them.)
serial_layout SERIAL_LAYOUT
parallel_layout PARALLEL_LAYOUT
SAMPLE_SIZE SAMPLE_SIZE
MAX_CORE MAX_CORE
NOTE: The grid in the Edit Parameters dialog allows global, graph-wide search and
replace operations, but only on items that are open in the view.
scored_value ${SCORED_VALUE}
decimal(12) ${SCORED_VALUE_TYPE}
3. Choose Edit > Replace, click Match Case (again to avoid making replacements in the
parameters you just defined), and where the values shown in the table below are values that you
want to parameterize, replace them as shown:
Do not replace the value 10 where it appears in the setting for max_core.
Do replace decile when it occurs in out_metadata; these are record formats used by the
components.
Do not replace decile in instances of [Same as], this_decile, or next_decile.
10 ${MAX_QUANTILES}
decile ${QUANTILE_FIELD_NAME}
4. For all the above replacements that occurred in a transform or record format, change the
interpretation from constant to ${} substitution as follows:
a. Choose Edit > Parameters.
b. On the left pane of the Edit Parameters dialog, click the plus signs to expand the
component listings for my-Subgraph_Parallel_Quantile, subgraph Get_Counts, and
subgraph Global_Sort.
c. Choose Edit > Find and search for ${ .
${SCORED_VALUE} scored_value
${SCORED_VALUE_TYPE} decimal(12)
${QUANTILE_FIELD_NAME} decile
${MAX_QUANTILES} 10
By adding a setting of the form summary=path, you can tell the graph to write the tracking
information to the file specified in path, in a summary format that is easy to read from
another graph. (For details, see AB_REPORT and Generating a summary file in Ab Initio Help.)
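For example, the summary setting might look like the following, using the path that the audit graph reads back later in this recipe:

summary=$RUN/audit1.summary.log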
The format of the summary file is specified in $AB_HOME/include/summary.dml, which
contains a record format with conditional fields: the presence or absence of the fields in a
specific summary file is determined by the kind of object (flow, component, job-start, or phase-
end) being reported on. After running the graph, you can examine the summary file to find the
flow(s) that are relevant for your audit.
2. In the INPUT FILE component, specify the path to the previously generated summary file as the
URL:
file:$RUN/audit1.summary.log
3. Set the key on the ROLLUP to the empty key: { }
Here the output of the ROLLUP is unimportant. Its sole purpose is to force errors if the values
from the summary file do not match.
4. In the ROLLUP component, create variables to store the values from the summary file:
let decimal("") input_accounts = sum(in.num_records,
((in.kind=="flow") and
(in.name=="Partition_by_Key_Account_Num.in")));
let decimal("") output_accounts = sum(in.num_records,
((in.kind=="flow") and (in.name=="Join_Cust_Acct.out")));
Because the summary file uses a record format with conditional fields, in.name will exist only
for certain values of in.kind, so a condition such as in.kind=="flow" is required to avoid NULL values.
NOTE: If you use summary files for auditing, make sure the original graph includes a legend
documenting the need to change the audit graph if the original graph changes.
NOTE: If you follow these steps to audit a similar graph, you should test the audit graph as
described in Testing the audit graph (page 129).
log_type out::final_log_output() =
begin
out.event_type :: "audit";
out.event_text :: count;
end;
out::reformat(in) =
begin
count = count + 1;
out.* :: in.*;
end;
log_type out::final_log_output() =
begin
out.event_type :: "audit";
out.event_text :: string_concat(count, "|", sum_balances);
end;
out::reformat(in) =
begin
count = count + 1;
sum_balances = sum_balances + first_defined(in.bal, 0);
out.* :: in.*;
end;
c. For the Transaction file, you also need to accumulate the count and the sum, so write
an embedded transform that includes log_type.dml, creates global variables count and
sum_tran, and defines a function named final_log_output to send them to the log port,
as shown in the following code. (Again, the function must be named final_log_output.
For details, see About the final_log_output function on page 124).
include "/~$AI_DML/log_type.dml";
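/* The remainder of this transform is not shown in the original; the following is
   a sketch modeled on the Balance transform above. The input field name in.amount
   is taken from transaction_type.dml and may differ in the actual graph. */
let decimal("") count = 0;
let decimal("") sum_tran = 0;

log_type out::final_log_output() =
begin
out.event_type :: "audit";
out.event_text :: string_concat(count, "|", sum_tran);
end;

out::reformat(in) =
begin
count = count + 1;
sum_tran = sum_tran + first_defined(in.amount, 0);
out.* :: in.*;
end;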
Reading back and rolling up the values from the log files
Before your subgraph can audit the results by comparing the log values with the output values
from the main graph, you must generate the needed log values (those with an event_type of
audit) in the appropriate format. Also, since the log files were attached to multiple components
executing in parallel, you need to roll these up to get the total values for comparison.
To do this, write separate transforms to read back and roll up the log values for the following
two cases:
- Customer and Account logs: count values
- Balance and Transaction logs: count and sum values
Balance and transaction logs (count and sum values): Reading back the log values for the balances
and transactions is more complex, because two values separated by "|" are stored in a single
in.event_text field in the log file. You can use the reinterpret_as function to interpret this string
as a record containing those two values. In audit4.mp, the code for this is in
sum_log_counts_values.xfr.
To read back the log values for the balances and transactions:
- Create an expression with a record format that matches the format used for the event_text in
the log file, and assign this expression to a variable as follows:
a. On the Variables tab of the ROLLUP transform, add the following expression:
reinterpret_as(record decimal("|") count; decimal("|\n") sum;
end, in.event_text);
b. Drag the expression across to the Variables output pane.
The GDE creates a variable of type record with a default name such as expr1 and the
following format, and assigns the expression to it:
record
decimal("|") count;
decimal("|\n") sum;
end
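The ROLLUP's output rules can then aggregate the fields of this variable across the log records. A sketch (the output field names here are illustrative, though the audit JOIN later refers to a sum_values field):

out.count :: sum(expr1.count);
out.sum_values :: sum(expr1.sum);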
TIP: Change the names of the in ports on the JOIN component to make them more
descriptive.
NOTE: The log port of a transform always has one or more records, even if the transform did
not receive any input records.
TIP: Another way of ensuring that a transform gets called is to add a GENERATE RECORDS
component to create a single dummy record.
3. Set the record_required parameter of the log port's flow into the JOIN to True (record
required on this input). (In the Audit Totals subgraph, this is record_required1.)
4. To ensure that the log port's flow has exactly one record, set the appropriate dedup
parameter of the JOIN to Dedup this input before joining. (In the Audit Totals subgraph, this
is dedup1.)
5. Test for NULL and force an error whenever the counts and totals on the inputs and outputs do
not match as expected.
In other words, write a sequence of conditions of the form if (error condition)
force_error("error message"). For example:
if (is_null(totals))
force_error("No records in output dataset");
...
if ((totals.debits + totals.credits) != transactions.sum_values)
force_error(string_concat(
"Sum of input transaction amounts '",
(decimal("")) transactions.sum_values,
"' does not match sum of output debits and credits '",
(decimal("")) (totals.debits + totals.credits),
"'"
));
All three input files have the same record format, which is based on the user-defined type
specified in transaction_type.dml:
type transaction =
record
date("YYYY.MM.DD") trans_date;
decimal(9,0) account_id;
decimal(10.2) amount;
string('\n') description;
end;
The input files use this by including transaction_type.dml in the record format and identifying
transaction as the top-level type:
include "/~$AI_DML/transaction_type.dml";
metadata type = transaction;
3. Click OK twice to close the Define Multifile Partitions dialog and the Properties dialog.
NOTE: You must enclose the command in parentheses and precede the opening parenthesis
with a dollar sign: $(...) to indicate that what follows is a ksh command.
3. Click OK twice to close the Define Multifile Partitions dialog and the Properties dialog.
To read multiple files from a list and write their contents to a single serial file:
1. Add an INPUT FILE component whose URL specifies a file containing a list of the files to read
and whose record format specifies how the filenames are separated.
In read_multiple_files_simple.mp, the URL of Input File List is file:$AI_SERIAL/
list_of_files.dat, and the referenced file has the following content:
trans-2006-01.dat
trans-2006-02.dat
trans-2006-03.dat
The record format for list_of_files.dat is:
record
string('\n') filename;
end
2. Connect a READ MULTIPLE FILES component to the INPUT FILE and configure it as follows:
a. Hold down the Shift key and double-click the component to generate a default
transform. (If the Transform Editor is displayed in Package View, switch to Text View,
and you will be prompted to generate the transform.)
out::reformat(in) =
begin
out.* :: in.*;
out.trans_kind :: map_trans_kind(in.description);
end;
4. Attach an OUTPUT FILE component to the REFORMAT component and configure it as follows:
a. On the Description tab of the Properties dialog for the OUTPUT FILE, specify the URL for
the serial file to which you want to write the output data. For example:
file:$AI_SERIAL/processed.dat
b. On the Ports tab of the Properties dialog for the OUTPUT FILE, specify the record format
for the output data by selecting Use file and then browsing to the location of a file
containing the record format.
To read multiple files from a list and write reformatted data to renamed output files:
1. Add an INPUT FILE component whose URL specifies a file containing a list of the files to read
and whose record format specifies how the filenames are separated.
In read_and_write_multiple_files.mp, the URL of Input File List is file:$AI_SERIAL/
list_of_files.dat, and the referenced file has the following content:
trans-2006-01.dat
trans-2006-02.dat
trans-2006-03.dat
For each customer transaction, this format results in at least two records (one header and one
trailer) with all nine fields in each record, and NULL values for each empty field. To see this, view
the data in ht1_input Flat Records:
type body_type =
record
string(',') item;
decimal(',') quantity;
decimal('\n') cost;
end;
type trailer_type =
record
decimal(',') total_quantity;
decimal('\n') total_cost;
end;
For each customer transaction, this format also results in at least two records (one header and
one trailer) with three subrecords in each record, and NULL values for each subrecord that has
no content. To see this, view the data in ht1_input Using Subrecords:
As you can see by viewing the data in ht1_input Using Vectors, this format results in just one
record for each customer transaction: it consists of the header data, a vector of body data, and
one trailer record, and it has no NULL fields:
c. Outputs fields from the header and trailer records, along with booleans indicating
whether any of the calculated values in the original data are incorrect:

out.transaction_id :: first(in.transaction_id);
out.customer_name :: first(in.customer_name);
out.mismatched_num_items :: actual_num_items != first(in.num_items);
out.mismatched_total_quantity :: actual_total_quantity != last(in.total_quantity);
out.mismatched_total_cost :: actual_total_cost != last(in.total_cost);
out.num_items :: first(in.num_items);
out.total_quantity :: last(in.total_quantity);

first AND last AGGREGATION FUNCTIONS: The first and last functions return values from the
first and last records in a key group. In ht_validate1.mp, these are the header and trailer
records, respectively.
This graph is identical to ht_validate1.mp (page 169) except that instead of using a ROLLUP
(which processes groups of records), it uses a REFORMAT, because each header, body, and
trailer group is contained in a single record.
The transform in the Reformat: Validate and Summarize component does the following:
The graph ht_build.mp computes those values as part of its processing. Specifically, it does the
following:
1. Rolls up the input records to compute the data for the above three fields (page 178).
2. Reformats the data to generic records that contain all eight fields, plus a kind field with a
value of H, B, or T to designate each record as a header, body, or trailer; and an ordering field
with a value of 1, 2, or 3 (page 179). The graph does this in two steps:
a. Reformats the body records
b. Reformats the rolled-up header records and trailer records
3. Merges the reformatted records, assembling them in header, body, trailer order (page 183).
4. Reformats the merged records, using a conditional record format to populate the header,
body, and trailer fields appropriately in the output data (page 184).
The OUTPUT FILE contains the names of each pair of cities and the distances between them. The
record format for this data is in route-lengths.dml:
record
string(',') route;
decimal(9.2) distance;
end;
/* Compute result: */
out :1: if (x <= 1.0 and x >= 0)
radius * 2 * math_asin(math_sqrt(x));
out :2: if (x > 1)
radius * 2 * math_asin(1.0);
out :3: 0;
end;
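The fragment above shows only the final, prioritized rules of the distance function; the code that computes x is not reproduced. A possible full definition, assuming the standard haversine formula, latitude and longitude in degrees, and a mean Earth radius in kilometers (the parameter and variable names are illustrative):

out::great_circle_distance(lat1, lon1, lat2, lon2) =
begin
let real(8) radius = 6371.0;  /* mean Earth radius in km (assumed) */
let real(8) deg_to_rad = 3.141592653589793 / 180;
let real(8) dlat = (lat2 - lat1) * deg_to_rad / 2;
let real(8) dlon = (lon2 - lon1) * deg_to_rad / 2;
/* Haversine term; rounding can push it slightly above 1.0, hence rule 2 below. */
let real(8) x = math_sin(dlat) * math_sin(dlat) +
math_cos(lat1 * deg_to_rad) * math_cos(lat2 * deg_to_rad) *
math_sin(dlon) * math_sin(dlon);
/* Compute result: */
out :1: if (x <= 1.0 and x >= 0)
radius * 2 * math_asin(math_sqrt(x));
out :2: if (x > 1)
radius * 2 * math_asin(1.0);
out :3: 0;
end;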
out::reformat(in) =
begin
out.route :: string_concat(in.from_location.name, " to ",
in.to_location.name);
out.distance ::
great_circle_distance(in.from_location.latitude,
in.from_location.longitude,
in.to_location.latitude,
in.to_location.longitude);
end;
Simplifying the code and controlling the output by using the right data type
Because the graph uses the fixed-point decimal type decimal(9.2) for distance in the output
record format rather than a floating-point decimal type (such as decimal(9) or decimal('\n')), it
generates easily readable output data with no additional code to round the results of the
floating-point arithmetic.
To see the effect of using a floating-point decimal type in the output record format for this
graph:
1. Save calculating-distances.mp under a different name (such as my-calculating-
distances.mp).
2. On the Ports tab of the Properties dialog of the Shipping Distances (in km) output file, select
Embed and click Yes when asked whether you would like to use the contents of the file
route_lengths.dml as the embedded value. (By doing this, you prevent this experiment from
overwriting the record format for the original graph.)
3. Change the record format of the distance field in the output file to the following and click OK:
decimal('\n') distance;
out::join(in0, in1) =
begin
out.customer_id :: in1.customer_id;
out.customer_place :: in1.place;
out.branch_id :: in0.branch_id;
out.branch_place :: in0.place;
out.distance :: great_circle_distance(in0.latitude,
in0.longitude, in1.latitude, in1.longitude);
end;
3. Uses a SORT WITHIN GROUPS to produce, for each customer_id, a list of records sorted in
ascending order by the distance to each branch:
a. Sets the major-key parameter to customer_id and the order to Ascending:
4. Uses a DEDUP SORTED component to get the shortest distance for each customer_id:
a. Sets the key parameter to customer_id and the order to Ascending.
b. Sets the keep parameter to first to keep the record with the smallest value (shortest
distance).
5. Outputs the data, producing records in the format specified by nearest-branch.dml:
record
decimal(",") customer_id;
string(",") customer_place;
decimal(",") branch_id;
string(",") branch_place;
decimal("\n".2) distance;
end
out::length(in) =
begin
out :: 17;
end;
out::normalize(in, index) =
begin
let point_km_type location_km = map_coordinates(in);
let integer(4) square_size = 1 << index;
/*The size of a square, in kilometers, is 2 to the power of
its index.*/
out.location_deg.* :: in.*;
out.location_km :: location_km;
out.square_size :: square_size;
out.branch_id :: in.branch_id;
out.square :: containing_square(location_km, square_size);
out.branch_place :: in.place;
end;
position_deg: Location for which the nearest branch is sought, in degrees. Initial value: the
location provided in the input record. The transform does not change the value of this variable.

position_km: Location for which the nearest branch is sought, in kilometers. Initial value: the
location provided in the input record, converted to kilometers. The transform does not change
the value of this variable.

size: Length (in km) of a side of a square in the current grid. Initial value: 1. Initially, this is
the size of the smallest square to search in; the optimal value depends on the typical distance
between branches. At the end of each iteration, size is doubled, and the next iteration searches
the next-larger grid size.

closest_branch: Location of the branch that is currently closest, as a branch_lookup_type.
Initial value: far_away_branch. Initially, this is the location of the imaginary branch defined
earlier (see Step 3 above), which is farther away than any actual branch. Whenever the
transform finds a branch that is closer, it assigns the newly found branch location to this
variable. When the loop has iterated to search the next-larger grid size (see size above) and has
not found a branch closer than the current one (see distance_to_closest_branch below), the
loop terminates, and the value of closest_branch is output.
a. Inside the loop, the transform calculates a vector containing the coordinates of the
southwest corners of the square containing the point, and of the three squares closest
to it, each of length size:
begin
four_squares = four_containing_squares(position_km, size);
Cook>Book graphs listed by component and feature

quantile3.mp 95
CONCATENATE multiplying-data-with-departitioners.mp 18
quantile3.mp 95
ht_validate2.mp 174
multiplying-data-by-3-way-self-join-with-FBE.mp 19
FUSE benchmarking-with-checking.mp 49
GATHER multiplying-data-with-departitioners.mp 18
INTERLEAVE multiplying-data-with-departitioners.mp 18
JOIN difference.mp 65
nearest-branch-simple.mp 193
quantile1.mp 90
quantile2.mp 92
quantile3.mp 95
MERGE global_sort_serial_simple.mp 62
ht_build.mp 176
NORMALIZE fill-gaps.mp 34
multiplying-data-by-fixed-number.mp 14
read_multiple_files_simple.mp 140
difference.mp 65
ht_process1.mp 160
ht_validate2.mp 174
make_difference_metadata.mp 84
name-parsing.mp 6
nearest-branch-quadtree.mp 197
REPLICATE benchmarking.mp 45
benchmarking-with-checking.mp 49
difference.mp 65
ht_validate1.mp 169
ht_validate2.mp 174
quantile3.mp 95
audit2.mp 115
audit3.mp 116
audit4.mp 119
ht_build.mp 176
ht_process1.mp 160
ht_validate1.mp 169
ht_validate2.mp 174
quantile1.mp 90
quantile2.mp 92
SAMPLE global_sort.mp 57
SCAN fill-gaps.mp 34
ht_process2.mp 165
making-unique-keys.mp 21
quantile2.mp 92
SORT global_sort.mp 57
global_sort_serial_out.mp 62
global_sort_serial_simple.mp 62
quantile1.mp 90
quantile2.mp 92
nearest-branch-simple.mp 193
TRASH benchmarking.mp 45
difference.mp 65
ad_hoc_explicit.mp 134
ad_hoc_parameter.mp 137
ad_hoc_wildcard.mp 135
calculating-distances.mp 186
ht_data.mp 155
nearest-branch-quadtree.mp 199
nearest-branch-simple.mp 193
read_and_write_multiple_files.mp 144
read_multiple_files_simple.mp 140
ht_validate2.mp 174
make_difference_metadata.mp 85
nearest-branch-quadtree.mp 203
nearest-branch-quadtree.mp 199
make_difference_metadata.mp 85
benchmarking.mp 45
generic_difference.mp 80
make_difference_metadata.mp 84
multiplying-data-with-parameter.mp 15
Metaprogramming make_difference_metadata.mp 84
Index

A
ad hoc multifiles
  about 132
  explicitly listing files 134
  using a command to list files 137
  using a parameter to list files 137
  using wildcards to list files 135
ad hoc output multifiles 138
add_fields function in metaprogramming example 85
add_rules function in metaprogramming example 87
advanced topics
  building flat data into header/body/trailer subrecord format
    reformatting to generic records 179
    reformatting to subrecord format 184
auditing
  graph, testing 129
  sample application to audit 111
  simple, counting records 113
  using log files 119
automated generation of record formats and transforms 79

B
benchmarking
  elapsed time 53
  graph 45
  graph with checking 49
  low-level operations 54
  other suggestions for 51
  startup time 52

E
equality checks
  fuzzy 78
  safe 75
U
unions, equality checks 75
unique keys
making 21
using a SCAN component to generate 30
user-defined types, creating 140
V
validating input data
flat 169
vectors 174
vectors
equality checks 75
normalizing data 167
to hold body data 158
validating vector input data 174
voids, equality checks 75
W
writing multiple files 144