
Usage of Datasets

1. INTRODUCTION

The Dataset stage is a file stage. A dataset is a collection of operating system files
managed by a control (descriptor) file. The stage allows you to read data from or write
data to a dataset. It can have a single input link or a single output link, and it can be
configured to execute in parallel or sequential mode. The file naming convention for the
Dataset stage is Filename.ds, for example, xxx.ds. The types of dataset are

 Persistent Dataset (data stored in the file stage) and


 Virtual Dataset (data moved between stages as link data)

[Diagram: Persistent and Virtual datasets]


2. PROPERTIES OF DATASET STAGE

1. Properties Tab: In this tab, we can specify the file name and the update mode
for the Dataset stage. The available modes are

o Overwrite
o Append

2. Partition Tab: In this tab, we can specify the partition type used by the stage
and, optionally, sort the input data. The available partition types are

o Auto
o Round Robin
o Entire
o Hash


3. Columns Tab: In this tab, we can specify the column metadata for the stage,
view the data for the stage, and load or save the table definition. The available
properties are

o Column Name
o Length
o Scale
o Nullable
o Extended


3. CREATION OF A DATASET JOB

Below are the high-level steps to create a Dataset job.

1. Drag & Drop Row Generator Stage


2. Load the metadata
3. Specify the no. of records to be generated
4. Drag & Drop the Dataset stage and connect to Row generator stage.
5. Set the following properties in the Dataset stage (TGT_brk_src_risk_event_id):
o File name for the Output Dataset stage
o Update mode for the Dataset – Overwrite by default
o Load the column metadata
6. Save and compile the job.
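
Once compiled, the job can also be launched from the command line with the dsjob
utility, as in this sketch (the project and job names are placeholders):

$DSHOME/bin/dsjob -run -jobstatus -mode NORMAL <project_name> <job_name>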


4. EXECUTION OF THE DATASET JOB


When the job runs, the run-time behavior is as follows:

o Creation of the control file at the path specified in the Filename property


o Creation of the data file in the resource disk allocated for the particular node.
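
For example (the descriptor path is illustrative; the resource disk path is taken from
the configuration in the APPENDIX), both artifacts can be checked from the command line:

ls -l /data/project/xxx.ds                                  # control/descriptor file
ls -l /etrade/IBM/dit/InformationServer/Server/Datasets/    # data file(s) on the node's resource disk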

NOTE: Data generated from the Row Generator stage can be truncated in the after-job
subroutine using the command below:

orchadmin truncate path/filename.ds
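
As a sketch of how this can be wired up: set the job's after-job subroutine to the
built-in ExecSH routine and pass the truncate command as its input value (the path
below is a placeholder):

orchadmin truncate /data/project/filename.ds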

5. DATA MANAGEMENT

Datasets are operating system files managed by control files. (The control file holds
information about the configuration file used to create the dataset and points to the
data files that hold the actual data.)

a. Control/Descriptor File
The descriptor file for a data set contains the following information:
1. Data set header information.
2. Creation time and date of the data set.
3. The schema of the data set.
4. A copy of the configuration file used when the data set was
created.


[Screenshot: schema information stored in the descriptor file]

b. Data File: These are the actual files that hold the data. The data in a
dataset can be viewed with the Data Set Management tool in the Designer or from
the command line (see Section 6).
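
Where the graphical viewer is not available, the same data can be inspected from
the command line, as in this sketch (the dataset path is a placeholder):

orchadmin dump /data/project/xxx.ds | head -20    # show only the first 20 lines of output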


Data can be viewed for each partition separately.


6. DATASET MANAGEMENT IN UNIX


Using the orchadmin command in UNIX, we can manage, view, sort, copy, and remove
datasets in DataStage. The following are the various commands available for
manipulating datasets:

Dump data from dataset to text file


orchadmin dump -field itemVal dataset_creation2.ds > dataset_creation.txt
orchadmin dump -name -field USERID cleansed_cya_daily_riskt.ds | sort |
uniq -c

Getting unique data from the dataset file


orchadmin dump -name -field <field_name> <dataset> | sort | uniq -c
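
A small extension of the same pipeline ranks the most frequent values of the field
(the field and dataset names are placeholders, as above):

orchadmin dump -name -field <field_name> <dataset> | sort | uniq -c | sort -rn | head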

Viewing the record schema in the dataset


orchadmin describe -s dataset_creation2.ds
The above command displays the schema details of the dataset.


To list the data files, where they were created, the total bytes, and the node split:

orchadmin describe -f dataset_creation2.ds


To view the partition split and the total records in each partition/node:
orchadmin describe -v dataset_creation2.ds


Copying a dataset file


orchadmin cp/copy dataset_creation2.ds dataset_creation3.ds

Truncating dataset file


orchadmin truncate <<dataset.ds>>

Deletion of Dataset file


orchadmin rm/delete/del <<dataset.ds>>
This will delete both the dataset descriptor file and the data files.
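
Note that removing the .ds file with a plain UNIX rm deletes only the descriptor and
leaves the data files orphaned on the resource disks, which is why orchadmin should be
used instead. A sketch with an illustrative path:

rm /data/project/xxx.ds             # wrong: removes only the descriptor file
orchadmin rm /data/project/xxx.ds   # right: removes the descriptor and all data files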

[Attachment: Control_file_dataset.txt, a sample control/descriptor file]


7. DATASET IMPLEMENTATION IN ETRADE


In ETrade BIS, we have used the Dataset stage in the Cyota (CYA) feed as an
input/output file, created using the nodex1.apt (single node) configuration. There
were a few issues with the Dataset stage, which were later identified and fixed.
The details of the issues identified are as follows:

1) Daily .ds files were created in the Project serial run and were not deleted in the
post process, which led to a "space constraint" issue in the long run.
2) The persistent data were deleted during the cleanup process, hence history
information was lost.
3) The output load-ready files were created even when the input files were not
present in the Project serial run (ideally the job should have failed).

Steps followed to overcome the issues identified:

1) Cleaned the daily files (.ds) using the ORCHADMIN command in the post process
(see the sketch after the check below).
2) Created a special node and resource disk in the config file to maintain the
persistent data. Refer to the CYA_Directory_Change section in the APPENDIX.
3) Included an IF-THEN-ELSE check to abort the process if the input files do not exist:

if [[ -f $PROJECT_SERIAL_RUN/etrap_cases.txt &&
      -f $PROJECT_SERIAL_RUN/etrap_banking_activities.txt ]]; then
    $DSHOME/bin/dsjob -run -local -jobstatus -mode NORMAL $DSproject $job_name
else
    echo "Input files do not exist. Please check."
    exit 1
fi
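
For step 1, a minimal cleanup sketch for the post process (the directory variable
and the *.ds pattern are assumptions):

for ds in $PROJECT_SERIAL_RUN/*.ds; do
    [ -f "$ds" ] && orchadmin rm "$ds"   # remove descriptor and data files together
done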

8. ADVANTAGES OF DATASETS

 Parallelism
 Partitioning information is not lost when the data is stored in a dataset rather
than in another file stage.
 Storage space required is much less than for other file stages.

9. DISADVANTAGES OF DATASETS

 Data stored in a dataset is in binary form and hence not in a readable
format.
 An output file is generated even when the input file is not available.
 The same configuration file must be maintained to read and write the dataset.


APPENDIX

CYA Directory Change:

The following directory path was created to store the Persistent data.
{
node "etl_server"
{
fastname "dit1w104m7"
pools ""
resource disk "/etrade/IBM/dit/InformationServer/Server/Datasets"
{pools ""}
resource scratchdisk "/etrade/IBM/dit/InformationServer/Server/Scratch"
{pools ""}
}
node "db2_server"
{
fastname "dwdev1w88m7"
pools "db2"
resource disk "/tmp" {pools ""}
resource scratchdisk "/tmp" {pools ""}
}
node "db2_server2"
{
fastname "dwdev2w88m7"
pools "db2"
resource disk "/tmp" {pools ""}
resource scratchdisk "/tmp" {pools ""}
}
node "etl_server_spl"
{
fastname "dit1w104m7"
pools "keys"
resource disk "/etrade/home/suiteadm/db2_common/static/BIS_KEYS"
{pools ""}
resource scratchdisk "/etrade/dit/crm/batch/uscrm/mfs_6way/f5/Scratch"
{pools ""}
}
}
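
To make jobs use this configuration, point the APT_CONFIG_FILE environment variable
at it, either in the job/project environment or in the shell before invoking dsjob
(the configuration file path below is a placeholder):

export APT_CONFIG_FILE=/etrade/IBM/dit/InformationServer/Server/Configurations/persistent_node.apt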
