Usage of Datasets
1. INTRODUCTION
The Dataset stage is a file stage. A data set consists of operating system files managed by a control
(descriptor) file. The stage allows you to read data from or write data to a data set, and it can have a
single input link or a single output link. It can be configured to execute in parallel or sequential
mode. The file naming convention for the Dataset stage is Filename.ds, for example,
xxx.ds. Data sets come in two types: persistent (landed to disk as .ds files) and virtual
(existing only on the links while a job is running).
The following diagram shows the persistent and virtual data sets.
1. Properties Tab: In this tab, you specify the name of the input file and the update mode
for the Dataset stage. The available modes are
o Overwrite
o Append
2. Partition Tab: In this tab, you specify the partition type used by the stage, and you
can also sort the input data. The available partition types are
o Auto
o Round Robin
o Entire
o Hash
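To illustrate what Hash partitioning does (this is an illustration of the idea, not the engine's actual implementation), the sketch below assigns each row to one of four partitions by hashing its key column, so that rows with the same key always land on the same partition. The key values are placeholders.

```shell
# Illustration only: hash partitioning routes every row with the same key
# to the same partition (hash of the key, modulo the partition count).
PARTITIONS=4
assign() {
  key=$1
  hash=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  echo $((hash % PARTITIONS))
}
p1=$(assign "CUST100")
p2=$(assign "CUST100")   # same key -> same partition as p1
p3=$(assign "CUST200")
echo "CUST100 -> partition $p1 (again: $p2), CUST200 -> partition $p3"
```

Round Robin, by contrast, deals rows out cyclically regardless of their values, and Entire copies the full data to every partition.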
3. Columns Tab: In this tab, you specify the column metadata for the stage,
view the data for the stage, and load or save the table definition. The available
properties are
o Column Name
o Length
o Scale
o Nullable
o Extended
Below are the high-level steps to be followed to create a Dataset job.
NOTE: Data generated from the Row Generator stage can be truncated in the after-job
subroutine.
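A sketch of such an after-job truncation is below. The dataset name daily_run.ds is a placeholder, and the exact orchadmin truncate syntax should be checked against your installation; the guard lets the sketch run even where orchadmin is not on the PATH.

```shell
# Hedged sketch: truncate a dataset from an after-job subroutine.
# daily_run.ds is a placeholder; verify `orchadmin` usage on your install.
DS_FILE="daily_run.ds"
CMD="orchadmin truncate $DS_FILE"   # removes the records, keeps the descriptor
if command -v orchadmin >/dev/null 2>&1; then
  $CMD
else
  echo "would run: $CMD"
fi
```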
5. DATA MANAGEMENT
Datasets are operating system files managed by control files. (The control
file records which configuration file was used to create the dataset and points to the
data files that hold the actual data.)
a. Control/Descriptor File
The descriptor file for a data set contains the following information:
1. Data set header information.
2. Creation time and date of the data set.
3. The schema of the data set.
4. A copy of the configuration file used when the data set was
created.
Schema Information
b. Data File: These are the actual files that hold the data. The data in a
dataset can be viewed using the options below.
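From the GUI, the data can be viewed through Tools > Data Set Management in the Designer. From the command line, a hedged sketch is below: orchadmin dump prints the records and dsrecords reports the record count. The file name xxx.ds is a placeholder, and the guard lets the sketch run where the tools are not on the PATH.

```shell
# Hedged sketch: command-line ways to look at a dataset's contents.
# xxx.ds is a placeholder descriptor file.
DS_FILE="xxx.ds"
DUMP_CMD="orchadmin dump $DS_FILE"   # print the records
COUNT_CMD="dsrecords $DS_FILE"       # report the record count
for CMD in "$DUMP_CMD" "$COUNT_CMD"; do
  if command -v "${CMD%% *}" >/dev/null 2>&1; then
    $CMD
  else
    echo "would run: $CMD"
  fi
done
```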
To see the data files, where they were created, the total bytes, and the split across nodes:
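A hedged sketch of that lookup is below. The orchadmin describe subcommand reports this information; the -f flag (list the data files) is from memory, so check the usage output on your installation. The file name xxx.ds is a placeholder.

```shell
# Hedged sketch: describe a dataset's data files and sizes per node.
# xxx.ds is a placeholder; verify the describe flags on your install.
DS_FILE="xxx.ds"
CMD="orchadmin describe -f $DS_FILE"
if command -v orchadmin >/dev/null 2>&1; then
  $CMD
else
  echo "would run: $CMD"
fi
```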
Control_file_dataset.txt
1) Daily .ds files were created in the Project serial run and were not deleted in the
post-process, which in the long run led to a "space constraint" issue.
2) The persistent data were deleted during the cleaning process, so history
information was lost.
3) The output load-ready files were created even when the input files were not
present in the Project serial run (ideally the job should have failed).
1) Clean the daily files (.ds) using the ORCHADMIN command in the post-process.
2) Create a special node and resource disk in the config file to maintain the
persistent data. Refer to the CYA_Directory_Change section in the APPENDIX.
3) Include an IF-THEN-ELSE check and abort the process if an input file does not exist.
if [[ -f $PROJECT_SERIAL_RUN/etrap_cases.txt \
   && -f $PROJECT_SERIAL_RUN/etrap_banking_activities.txt ]]; then
  $DSHOME/bin/dsjob -run -local -jobstatus -mode NORMAL $DSproject $job_name
else
  echo "Input files do not exist. Please check."
  exit 1
fi
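Resolution 1 above can be sketched as below. The deletion goes through orchadmin rather than plain rm, so the data files on every node are removed along with the descriptor; the directory, age threshold, and file name are placeholders, and the guard lets the sketch run where orchadmin is absent.

```shell
# Hedged sketch of the post-process cleanup: remove daily datasets older
# than 7 days via orchadmin (plain rm would orphan the data files on
# the nodes). Directory and retention period are placeholders.
cleanup_datasets() {
  dir=$1
  find "$dir" -maxdepth 1 -name '*.ds' -mtime +7 | while read -r ds; do
    if command -v orchadmin >/dev/null 2>&1; then
      orchadmin rm "$ds"            # removes descriptor and data files
    else
      echo "would run: orchadmin rm $ds"
    fi
  done
}
# demo on a scratch directory with one artificially old descriptor
demo=$(mktemp -d)
touch -t 202001010000 "$demo/old_run.ds"
cleanup_datasets "$demo"
```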
8. ADVANTAGES OF DATASETS
o Parallelism: data is stored in partitioned form and is read and written in
parallel.
o Partition information is not lost when the data is stored in a dataset rather
than in another stage file.
o Storage space is much less compared to the other stage file types.
9. DISADVANTAGES OF DATASETS
o Data stored in a dataset is in binary form and hence is not in a readable
format.
o An output file is generated even when the input file is not available.
o The same configuration file must be maintained to read and write the dataset.
APPENDIX
The following directory path was created to store the Persistent data.
{
node "etl_server"
{
fastname "dit1w104m7"
pools ""
resource disk "/etrade/IBM/dit/InformationServer/Server/Datasets"
{pools ""}
resource scratchdisk "/etrade/IBM/dit/InformationServer/Server/Scratch"
{pools ""}
}
node "db2_server"
{
fastname "dwdev1w88m7"
pools "db2"
resource disk "/tmp" {pools ""}
resource scratchdisk "/tmp" {pools ""}
}
node "db2_server2"
{
fastname "dwdev2w88m7"
pools "db2"
resource disk "/tmp" {pools ""}
resource scratchdisk "/tmp" {pools ""}
}
node "etl_server_spl"
{
fastname "dit1w104m7"
pools "keys"
resource disk "/etrade/home/suiteadm/db2_common/static/BIS_KEYS"
{pools ""}
resource scratchdisk "/etrade/dit/crm/batch/uscrm/mfs_6way/f5/Scratch"
{pools ""}
}
}
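Since the same configuration file must be used to read and write a dataset (disadvantage 3 above), jobs typically point at it by exporting APT_CONFIG_FILE before both the writing and the reading job run. The path below is a placeholder, not the actual location used in this project.

```shell
# The same config file must be in effect when writing AND reading the dataset.
# The path is a placeholder for the configuration shown above.
export APT_CONFIG_FILE=/etrade/IBM/dit/InformationServer/Server/Configurations/persistent.apt
echo "using config: $APT_CONFIG_FILE"
```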