Ab Initio Notes - Dynamic Lookups
CPU time
I/O
Memory
A 32-bit architecture has 2^32 unique addresses because each bit can be set to either 0 or 1
and there are 32 bits in each address. 2^32 is 4,294,967,296 -- about 4 billion.
Each address references a byte of memory; so, there are enough unique addresses to reference
4,294,967,296 bytes of memory, which is 4 gigabytes (GB). Thus, the maximum amount of
memory that a process in a 32-bit architecture can reference is 4 GB.
In a 64-bit architecture, there are 2^64 unique addresses, so theoretically 16 exabytes (EB) of
memory could be referenced. However, in practice, most 64-bit architectures use only 48 bits
for addresses. The set of unique 48-bit addresses can reference 2^48 (281,474,976,710,656)
bytes of memory; that is, 256 terabytes (TB) of memory. As with 32-bit architectures, the
address space is usually divided into kernel space and user space. For example, processes
running on one common 64-bit architecture can address 128 TB of kernel space and 128 TB of
user space.
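The address-space figures above can be checked with a few lines of arithmetic. This is a minimal Python sketch, not Ab Initio code; the helper names are made up for illustration, and the units are binary (1 KB = 1024 bytes), which is what the figures above assume.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


def in_binary_units(n_bytes: int) -> str:
    """Express an exact power-of-two byte count in the largest whole binary unit."""
    units = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]
    idx = 0
    while n_bytes >= 1024 and n_bytes % 1024 == 0 and idx < len(units) - 1:
        n_bytes //= 1024
        idx += 1
    return f"{n_bytes} {units[idx]}"


for bits in (32, 47, 48, 64):
    total = addressable_bytes(bits)
    print(f"{bits}-bit addresses: {total:,} bytes = {in_binary_units(total)}")
```

The 47-bit row corresponds to one half of a 48-bit address space, which matches the 128 TB user-space and 128 TB kernel-space split mentioned above.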
“Memory mapping” is taking data from a file on disk and mapping the bytes in the file
to virtual address space. If you write the lookup file index -- if you create it ahead of
time, before it's needed by the graph -- and you save it in a file on disk, the lookup file
index can be memory mapped as well. Otherwise, if the index is not on disk, the index
is computed by the component that calls the lookup() function, in the private memory
of that component.
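To make the idea of memory mapping concrete, here is a small Python sketch using the standard mmap module. It is only an analogy for what happens to a lookup file: the fixed 4-byte record layout and the function name are assumptions made for the example, not the Ab Initio file format.

```python
import mmap
import os
import tempfile


def read_record_mapped(path: str, record_size: int, record_number: int) -> bytes:
    """Map a file into virtual address space and slice one fixed-width record from it."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are read from disk only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = record_number * record_size
            return bytes(mapped[start:start + record_size])


# Demo: a tiny "data file" holding three 4-byte records.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"AAAABBBBCCCC")
    demo_path = tmp.name

second_record = read_record_mapped(demo_path, record_size=4, record_number=1)
os.remove(demo_path)
print(second_record)
```

The mapping reserves address space for the whole file even though only one record is read, which is the trade-off the notes describe for statically loaded lookups.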
0 – Load all records. If any partial records are present at the end of the lookup data
file, or if you specify a lookup index with indexpath and the index and the data files do
not match, the load will fail.
The load might fail if the lookup data file is being modified on disk at the same time as
the call to lookup_load.
-1 – Load all complete records. Allows but ignores any incomplete records being
written at that time. For this option, do not specify an index with indexpath. Instead,
let the function generate an index as part of the lookup operation.
-4 – Load zero (0) records when the lookup file is first set up in memory. This option
enables you to subsequently call lookup_reload at will to specify a number of
complete records to load into memory. Any incomplete records being written at the
time lookup_reload is called are ignored.
-5 – Allow later, separate lookup calls to append new lookup data to the file. This case
differs from regular appendable lookup calls (load_behavior=-2), which check for
growth of the lookup file at every function call. This option checks for growth of the
lookup file only at every checkpoint or computepoint of a continuous graph, thereby
improving performance.
LOOKUP_LOAD_ALL_GENERATIONS 0
LOOKUP_LOAD_ALL_COMPLETE_GENERATIONS -1
LOOKUP_LOAD_APPENDABLE -2
LOOKUP_LOAD_UPDATABLE -3
LOOKUP_LOAD_NO_GENERATIONS -4
LOOKUP_LOAD_LAZY_APPENDABLE -5
load_behavior NOTE:
0 – Loads the lookup data file, all generations from earliest to
latest (ICFFs) or all records (non-ICFFs).
-4 – Loads zero (0) generations or records when the lookup file
is first set up in memory. For reloadable ICFFs and non-ICFF
lookup files.
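As a compact reference, the load_behavior constants listed above can be modeled as an enumeration. This is a Python sketch for study purposes only, not DML. The constant names and values come from the list above, the one-line summaries paraphrase these notes, and the describe helper is hypothetical.

```python
from enum import IntEnum


class LoadBehavior(IntEnum):
    """Names and values as listed in the notes above."""
    LOOKUP_LOAD_ALL_GENERATIONS = 0
    LOOKUP_LOAD_ALL_COMPLETE_GENERATIONS = -1
    LOOKUP_LOAD_APPENDABLE = -2
    LOOKUP_LOAD_UPDATABLE = -3
    LOOKUP_LOAD_NO_GENERATIONS = -4
    LOOKUP_LOAD_LAZY_APPENDABLE = -5


# Summaries paraphrased from these notes; -3 is listed but not described here.
SUMMARY = {
    LoadBehavior.LOOKUP_LOAD_ALL_GENERATIONS:
        "load everything; fails on partial records or a mismatched index",
    LoadBehavior.LOOKUP_LOAD_ALL_COMPLETE_GENERATIONS:
        "load complete records only; do not pass indexpath",
    LoadBehavior.LOOKUP_LOAD_APPENDABLE:
        "appendable; checks for file growth at every function call",
    LoadBehavior.LOOKUP_LOAD_NO_GENERATIONS:
        "load nothing at first; call lookup_reload later",
    LoadBehavior.LOOKUP_LOAD_LAZY_APPENDABLE:
        "appendable; checks for growth only at checkpoints or computepoints",
}


def describe(value: int) -> str:
    """Return a one-line description for a numeric load_behavior value."""
    behavior = LoadBehavior(value)
    summary = SUMMARY.get(behavior, "not described in these notes")
    return f"{behavior.name} ({value}): {summary}"


for v in (0, -1, -2, -3, -4, -5):
    print(describe(v))
```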
Statically loading a lookup file creates a memory map for the lookup dataset. Because of the
memory mapping, a disadvantage of static loading is that a lookup dataset occupies a fixed
amount of virtual address space, even when the graph is not performing a lookup operation
with this dataset. (For more information about memory mapping, see “Memory mapping
lookup files”.)
By dynamically loading lookup data, you control the following, on a per-component basis:
Which lookup datasets are loaded and how many copies of each. (See “Managing
memory for dynamic data lookup”.)
When a lookup dataset is loaded and unloaded. A lookup dataset is loaded when you
call lookup_load on it within a transform. If you do not explicitly unload a dataset by
later calling lookup_unload on it, the dataset is unloaded when its parent component
process ends. (See “Implicitly unloading a dataset”.)
This control offers the following advantages:
It helps to conserve memory. Applications can unload datasets that are not immediately
needed and load only the ones needed to process the current input record.
It allows the original data on disk to change, even while your graph is searching through
data dynamically loaded into memory. This kind of parallelism is only possible with
dynamically loaded data.
Before you work with a lookup dataset dynamically, you must define a lookup template, a
component- or DML-based specification that spells out the nature of the lookup data and the
operations that can be performed on it; see “Lookup templates and dynamic loading”.
After a lookup template has been specified, you dynamically load and unload lookup data in the
following steps:
1. Loading the dataset into memory when it is needed by calling the lookup_load function.
2. Searching lookup data with a key and retrieving records by calling the appropriate DML
core lookup functions.
3. Releasing memory by unloading the dataset after use, or allowing the graph to unload it
for you.
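The three steps above (load, search, unload) can be pictured with a small in-memory model. This is a Python analogy under stated assumptions, not DML: the LookupStore class and its methods are hypothetical stand-ins that only mirror the roles of lookup_load, the lookup functions, and lookup_unload.

```python
class LookupStore:
    """Hypothetical stand-in for dynamically loaded lookup data held in memory."""

    def __init__(self):
        self._loaded = {}   # lookup identifier -> {key value: record}
        self._next_id = 0

    def load(self, records, key):
        """Step 1: build an in-memory index on `key` and return an identifier for it."""
        lookup_id = self._next_id
        self._next_id += 1
        self._loaded[lookup_id] = {rec[key]: rec for rec in records}
        return lookup_id

    def lookup(self, lookup_id, key_value):
        """Step 2: retrieve the record matching key_value, or None if absent."""
        return self._loaded[lookup_id].get(key_value)

    def unload(self, lookup_id):
        """Step 3: release the memory held for this dataset."""
        del self._loaded[lookup_id]

    def loaded_count(self):
        return len(self._loaded)


store = LookupStore()
lid = store.load(
    [{"keynum": 1, "keyname": "one"}, {"keynum": 2, "keyname": "two"}],
    key="keynum",
)
found = store.lookup(lid, 2)
store.unload(lid)
print(found, store.loaded_count())
```

The identifier returned by load plays the part that the lookup identifier (LID) plays in the notes: later calls refer to the loaded data through it.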
This section explains the ways that dynamic loading is enabled by lookup templates,
implemented as either a LOOKUP TEMPLATE component or a lookup template type (a special
DML record type).
To set up the dynamic loading process, begin with either of these starting points:
In your graph, use a LOOKUP TEMPLATE component. (This is in contrast to static lookups, which
use a LOOKUP FILE component.)
For more information, see “Dynamic data lookup: The LOOKUP TEMPLATE component”.
In your transform or inline DML, create a lookup template type: a user-designed record of a
certain format. This approach frees dynamic data lookup from requiring any lookup components
at all and is implemented entirely in DML. You then create and use one or more instances,
or handles, of this type to perform dynamic lookup operations.
For more information, see “Dynamic data lookup using DML: Lookup template types”.
What these different methods have in common is that they implement a lookup template. Such
a template specifies characteristics of a lookup data file, as follows:
The lookup template also specifies, within the scope of the template component or type, how
your transform interacts with the lookup file. Always remember that the data format in the
lookup file and the record format specified in the lookup template must be identical.
Upon dynamic loading of a lookup dataset with a lookup template, the lookup_load function
returns, first of all, a lookup identifier (LID) of predefined type lookup_identifier_type. The LID
allows you to reference, within your transform DML code, the lookup data as loaded into
memory. In the case of a lookup handle of lookup template record type, the LID is one field in
the larger record structure.
TIP:
It is often a good practice to declare the lookup identifier or lookup handle variable as a global
variable in your component package. (The variable can still be assigned a return value
from lookup_load within a transform, not at the package level.) The identifier or handle is then
accessible throughout the package. If your package contains more than one distinct data
lookup, declare multiple global identifier or handle variables.
The Co>Operating System provides the following memory management controls that allow you
to prevent memory exhaustion while working with dynamically loaded lookup datasets:
The parameter load_once of the LOOKUP TEMPLATE component and the lookup
template type: When this parameter is set to True (as it is by default), the
Co>Operating System prevents a component’s transform from loading a dataset that is
already present in memory. This parameter is set, and prevents multiple loadings, on a
per-component basis.
Setting load_once to False comes at the potential cost of having multiple data structures
associated with a lookup dataset loaded in memory at the same time, per component, and at
the ultimate risk of memory exhaustion.
For more information, see “Managing duplicate lookup datasets in memory”.
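The effect of load_once can be pictured with a small cache. This is a Python sketch of the idea only. The TemplateLoader class is hypothetical, and it assumes that a repeated load of an already-loaded dataset simply hands back the existing copy; these notes do not say what lookup_load actually returns in that case.

```python
class TemplateLoader:
    """Hypothetical model of one component's loads under a load_once flag."""

    def __init__(self, load_once=True):
        self.load_once = load_once
        self._by_path = {}        # datapath -> most recently loaded copy
        self.copies_in_memory = []  # every copy still held by this component

    def load(self, datapath, read_records):
        # With load_once set, a dataset already in memory is not loaded again.
        if self.load_once and datapath in self._by_path:
            return self._by_path[datapath]
        table = dict(read_records(datapath))
        self._by_path[datapath] = table
        self.copies_in_memory.append(table)
        return table


def fake_reader(datapath):
    """Stand-in for reading a lookup data file; returns (key, value) pairs."""
    return [(1, "one"), (2, "two")]


careful = TemplateLoader(load_once=True)
wasteful = TemplateLoader(load_once=False)
for _ in range(3):  # e.g. one load call per input record
    careful.load("north.dat", fake_reader)
    wasteful.load("north.dat", fake_reader)

print(len(careful.copies_in_memory), len(wasteful.copies_in_memory))
```

With the flag off, the number of copies grows with the number of load calls, which is the memory-exhaustion risk described above.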
In recoverable continuous jobs, the saving of dynamic lookup information in checkpoint files
requires dynamic lookup based on a lookup template type. In this context, the following are
true:
Lookup datasets created in memory with lookup_create are written to checkpoint files
on disk, if the associated handles are declared as non-transient global variables. Upon
recovery after a graph failure, an in-memory lookup dataset and its handle that were
saved at checkpoints are restored from the last checkpoint.
Lookup datasets loaded into memory from disk with lookup_load are not saved to a
checkpoint file. But their handles (if non-transient global variables) and the disk paths to
their data and index files are saved at checkpoints.
If these in-memory lookup datasets are large or numerous, the associated checkpoint files will
be correspondingly large and costly to write.
A lookup ID and a lookup dataset in memory associated with a LOOKUP TEMPLATE component
are not saved at checkpoints in any case.
For more information on continuous graphs and components and the recovery of created
lookup data, see the following topics:
o Using “transient” to prevent lookup handles from being added to the checkpoint
file
A LOOKUP TEMPLATE component can substitute for a LOOKUP FILE component when you want
to dynamically load and unload lookup data. The LOOKUP TEMPLATE component is in essence a
virtual lookup file. It contains no lookup data, but it allows you to associate lookup data on the
fly with a search key and a DML record format.
When you call the lookup_load function from within a transform, you specify the dataset you
want to load. The Co>Operating System uses the LOOKUP TEMPLATE information to load that
data dynamically into memory.
Use a LOOKUP TEMPLATE component instead of an INPUT FILE or LOOKUP FILE component
when your graph needs dynamic lookup, either to access a lookup dataset that can be identified
only during graph execution or to repeatedly access one changing dataset by repeatedly
loading, unloading, and reloading it. Dynamic lookup places a whole dataset in main memory
first, before starting lookup data operations on it. Loading a lookup dataset is made more
efficient if the lookup data file was created with a precomputed index; see “Preindexing lookup
data”.
NOTE:
In place of a LOOKUP TEMPLATE component in a graph, you can set up a transform with a
lookup template type. For more information, see “Looking up data dynamically with lookup
template types in component transforms”.
Dynamic loading
When the dataset is subject to change during graph execution, a LOOKUP TEMPLATE component
allows you to refresh the memory-resident data by unloading and reloading it as needed.
Dynamically unloading lookup data also enables you to free up memory as soon as your
transform is done using it, instead of waiting until the end of a lookup component process,
which is when the Co>Operating System removes statically loaded lookup data from memory.
Indexing and precomputed indexes
For dynamically loaded lookup data files, the availability of
a precomputed index improves performance, because, otherwise, the index has to be created
upon loading with lookup_load. A precomputed index can be created when a lookup data file is
created. Precomputed lookup index files depend on your platform and are thus not portable.
Related topics
Component folding
LOOKUP TEMPLATE
AB_LOOKUP_MAX_LOADS
When you place a LOOKUP TEMPLATE component in your graph, you define it by specifying two
parameters:
For detailed information, see “Looking up data dynamically with a LOOKUP TEMPLATE
component”.
This graph shows a simple example of how to use lookup components to dynamically load data.
Real-world applications may be more complex. For example, you could use separate lookup
tables, one for date information and one for region information. A batch job might run a new
lookup operation daily, for use with a continuously running query job.
1. The Generate Lookup Data component generates lookup data with the
specified keynum and keyname.
2. The Odd vs. Even component evaluates each keynum and separates the data records
into either the odd or even WRITE LOOKUP component, as appropriate. (The WRITE
LOOKUP component requires a key parameter — in this case, keynum.)
3. Each WRITE LOOKUP component writes two files: a data file and an index for that data
file.
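The write side of this graph (split records by odd or even keynum, then write a data file plus an index for each half) can be imitated in a few lines. This is a Python sketch only. The 16-byte fixed-width record layout, the .dat and .idx suffixes, and the JSON index are all assumptions made for the example; they are not the formats that WRITE LOOKUP produces.

```python
import json
import os
import tempfile

RECORD_SIZE = 16  # assumed layout: 4-digit keynum followed by a 12-character keyname


def write_lookup(path_prefix, records):
    """Write <prefix>.dat (fixed-width records) and <prefix>.idx (keynum -> byte offset)."""
    index = {}
    with open(path_prefix + ".dat", "wb") as data_file:
        for keynum, keyname in records:
            index[str(keynum)] = data_file.tell()
            data_file.write(f"{keynum:04d}{keyname:<12}".encode("ascii"))
    with open(path_prefix + ".idx", "w") as index_file:
        json.dump(index, index_file)


def read_by_key(path_prefix, keynum):
    """Use the precomputed index to seek straight to one record; None if the key is absent."""
    with open(path_prefix + ".idx") as index_file:
        index = json.load(index_file)
    offset = index.get(str(keynum))
    if offset is None:
        return None
    with open(path_prefix + ".dat", "rb") as data_file:
        data_file.seek(offset)
        raw = data_file.read(RECORD_SIZE).decode("ascii")
    return int(raw[:4]), raw[4:].rstrip()


# Generate lookup data, then separate it into odd and even files.
records = [(n, f"name{n}") for n in range(1, 9)]
workdir = tempfile.mkdtemp()
odd_prefix = os.path.join(workdir, "odd")
even_prefix = os.path.join(workdir, "even")
write_lookup(odd_prefix, [r for r in records if r[0] % 2 == 1])
write_lookup(even_prefix, [r for r in records if r[0] % 2 == 0])

print(read_by_key(odd_prefix, 3), read_by_key(even_prefix, 3))
```

Because the index is written together with the data, a reader does not have to scan the data file to build one, which is the benefit of a precomputed index.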
1. The My Template LOOKUP TEMPLATE component defines the record format (the same
one used by Generate Lookup Data) and a key (keynum) for the lookup operation.
Because this graph uses a non-compressed lookup file, the component’s parameters are set as
follows:
o keep_on_disk is False.
o block_compressed is False.
2. The Generate Random Keys component generates random datakeys and passes them to
the REFORMAT component.
b. Loads the appropriate data file into memory by calling lookup_load in its transform
function. (It first unloads the current lookup file if one is already loaded.)
4. The keynum and keyname data from the lookup file is written to the Looked-up
Values component.
You use a lookup template type when you want to load and unload lookup data dynamically
without using a LOOKUP TEMPLATE component in your graph. The lookup template type defines
the lookup record format and key within itself. (It also optionally defines lookup data file and
index file paths.)
NOTE:
Take care not to confuse two different record formats: the record format of the lookup template
type, and the record format of the lookup data itself.
Actual lookup operations with lookup template types use lookup handles, which are instances of
lookup template types. You can simply create a lookup handle directly, leaving the type implicit;
or you can explicitly declare, then instantiate, such a type. For more information about handles,
see “Handles” and “Records” in the DML Guide and Reference.
The lookup handle technique provides more flexibility by letting you define lookups entirely
within DML, without referencing a lookup component in your graph.
CAUTION!
To remain valid as a lookup template type, a record cannot contain any fields other than the
fields specified in the following syntax.
A lookup template type is a specific DML record type that you define in your DML code. You use
such a type by creating a lookup handle record directly, with a call to
the lookup_make_template function, or as an instance of an explicitly declared type. After it is
created, such a handle can simplify how you call the lookup_load, lookup,
and lookup_unload functions. For a specific example of a lookup template type, see “Looking up
data dynamically with lookup template types in component transforms”.
A lookup template record type conforms to the following syntax and contains the specified
required and optional elements. The first is a field; the rest are function fields.
TIP:
To work around the restrictions on allowed fields in a lookup template record type, you can
embed this type as a subrecord in a larger type that includes any other fields that you want to
associate with the lookup template type. This subrecord must be used for the lookup template
type, not the whole record.
A required lookup ID field of type lookup_identifier_type (long or integer(8)). Its name
must be id. Its default value must be -1, the value used to denote an invalid handle
that identifies nothing.
o A RecordFormat function field, which must use the allocate function or NULL:
Optional function fields that specify the lookup data and index file locations, and that
must be declared as type string(''):
If defined with your location strings, these optional function fields can substitute for one or
both of the datapath and indexpath arguments in the lookup_load function call. In a lookup
template type, these strings must be string constants or evaluate to string constants.
TIP:
If you want to specify datapath, indexpath, or both, as non-constant string variables or non-
constant string expressions, you must omit each variable path from the lookup template type
and specify them instead as the corresponding arguments in the lookup_load call.
Optional function fields that act as flags and that should be declared as boolean,
type bool:
For more information about the load_once and only_last_key_instance fields, see “Managing
duplicate lookup datasets in memory”. The direct_addressed field is required if the key field is
not included and optional otherwise. The practical case for omitting the key field occurs
if direct_addressed returns true.
These function fields must return false or true. If not otherwise specified, the default values are
shown. These flags function the same way as the flag parameters in the LOOKUP
TEMPLATE component, and their default values are the same. All of the preceding default field
values must be constants or evaluate to constants.
An optional function field that specifies the location of the technical repository that
contains the lookup data, and that must be declared as type string(''):
This string value will be interpreted by the EME Technical Repository using Parameter Definition
Language (PDL) and does not need to be a constant. The string can use $ or ${} substitution
syntax. For more information, see “Parameter Definition Language” in the Parameter Reference.
Like any DML record type, a lookup template type can be declared in the following places:
One or more graph parameter definitions inside inline DML; see “Dynamic data lookup
without components: Inline DML data lookup”.
A DML package, elsewhere in your filesystem, that your graph can include by reference.
In many cases, you will need only one lookup handle record of a lookup template type, used in
one component, and not the lookup template type itself. The type does not need to be named,
and can be declared, unnamed, within a call to lookup_load or lookup_create.
With lookup_make_template, you can avoid explicitly declaring a lookup template type
altogether and instead create a lookup handle of an implicit, unnamed lookup template type
specified by the function arguments.
However, in some situations, you will want to reuse a lookup template type in different
components, graphs, and environments. To create multiple lookup handles of a single such type
by repeatedly referencing the type with lookup_load or lookup_create, you must first declare
and store a named record type with a lookup template type structure. A DML package is the
preferred way to store such a type and make it accessible to multiple components and graphs.
For more information, see “DML packages” in the DML Guide and Reference.
As with any declared type, after you declare a lookup template type, you cannot change the
type’s structure. In addition, you cannot change the values of the fields in a lookup handle
record, with one exception: the id field of a lookup handle does change its value as you use the
record in various dynamic lookup operations. The other field values are fixed by the record type
definition. If you need a lookup template type with different field values, you must define and
use another such type.
While you can declare a lookup template type in DML in a component, a lookup template type
frees your dynamic data lookup from having to reference components. The lookup template
type can be declared in a DML package file on disk that your graph references, or in a Parameter
Definition Language (PDL) $[ ] construct that contains inline DML as part of a graph parameter’s
definition.
Avoiding references to components is essential if you want to perform data lookup in inline DML,
for example, in a graph parameter definition. You can define a parameter’s value with an
inline DML expression, but you cannot reference the graph’s components there. You can still
perform dynamic data lookups, however, if you use a lookup template type. The type is either
declared on the fly within the inline DML, or is included by the inline DML from a package file on
disk through the AB_DML_DEFS graph parameter. For an example, see “Looking up data
dynamically with lookup template types in inline DML”. For specific requirements for a lookup
template type in DML, see “Record type structure of a lookup template type”.
A DML expression enclosed by a PDL inline $[ ] construct is evaluated during the parameter
evaluation phase, before graph execution. A DML object created in inline DML exists only in that
inline DML, while the inline expression is being evaluated. After evaluation, only the result of
the expression survives, to be used as part of a parameter definition, and this result cannot be
changed.
Related topics
Records
Handles
Depending on your application and its memory requirements, you can let the Co>Operating
System clear the memory at the end of the graph phase, or you can use
the lookup_unload function to explicitly unload a lookup dataset. There are cases for adopting
each technique, as follows:
When the total amount of lookup data is small compared to available memory, you may not
need to unload datasets from memory explicitly. However, if you do not intend to call
the lookup_unload function, you need to be careful to avoid loading multiple copies of the same
lookup dataset.
In the following transform, the dataset is never explicitly unloaded. But the Co>Operating
System clears it from memory when the last record has been processed and the graph phase
ends.
out::reformat(in) =
begin
end;
Notice that this transform loads a new copy of the same lookup dataset into memory with each
input record. This slows performance and wastes memory. Because the Co>Operating System
does not unload these datasets until the last record is processed, the system might actually run
out of memory.
Dynamic lookup provides a way to prevent this from happening, depending on how you define
the lookup template; see “Managing duplicate lookup datasets in memory”.
If you have large lookup datasets or limited memory, the lookup_unload function becomes
indispensable. In this example, suppose that you have two large lookup files,
north.dat and south.dat, each requiring about 30 gigabytes of memory, so that it is not
practical to keep both loaded at the same time. The two hold similar records, and each input
record contains a string field, region, that indicates the record’s region (N or S).
The following transform processes records arriving from both regions in arbitrary order. With
each input record, the transform checks whether the proper lookup dataset is loaded. If it is
not, the current dataset is unloaded and the proper one loaded.
out :: reformat(in) =
begin
if (new_region != region)
begin
if (new_region == "N")
else
region = new_region;
end
out.cust_name :: in.cust_name;
end;
This transform assumes dynamic loading with a LOOKUP TEMPLATE component. The same logic
works for dynamic loading and unloading with a lookup template type.
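The region-switching logic described above (keep one dataset loaded, and swap it only when the incoming record's region differs from the loaded one) can be sketched as follows. This is a Python analogy, not a reconstruction of the DML transform. The RegionLookup class, the loader callable, and the sample data are all hypothetical.

```python
class RegionLookup:
    """Keeps at most one regional dataset in memory and swaps it on demand."""

    def __init__(self, loader):
        self._loader = loader   # callable: region code -> dict of lookup records
        self.region = None      # region of the dataset currently loaded, if any
        self._table = None
        self.loads = 0          # how many times a dataset had to be loaded

    def lookup(self, new_region, key):
        if new_region != self.region:
            self._table = None                      # "unload" the current dataset first
            self._table = self._loader(new_region)  # then load the proper one
            self.region = new_region
            self.loads += 1
        return self._table.get(key)


# Stand-ins for north.dat and south.dat.
DATASETS = {
    "N": {101: "Northwind Ltd"},
    "S": {202: "Southgate Inc"},
}

lookups = RegionLookup(lambda region: dict(DATASETS[region]))
inputs = [("N", 101), ("N", 101), ("S", 202), ("N", 101)]
names = [lookups.lookup(region, cust_id) for region, cust_id in inputs]
print(names, lookups.loads)
```

When records arrive in arbitrary order, the number of swaps depends on how often the region changes between consecutive records, so sorting or grouping the input by region beforehand would reduce the reload cost.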
Related topics
lookup_unload
lookup_load
Dynamic data lookup: The LOOKUP TEMPLATE component