Pentaho Data Integration (PDI) Tutorial
The following tutorial is intended for users who are new to the Pentaho suite or who are evaluating Pentaho as
a data integration and business analysis solution. The tutorial consists of six basic steps, demonstrating how to
build a data integration transformation and a job using the features and tools provided by Pentaho Data
Integration (PDI).
The Data Integration perspective of PDI (also called Spoon) allows you to create two basic file types:
transformations and jobs. Transformations describe the data flows for ETL such as reading from a source,
transforming data and loading it into a target location. Jobs coordinate ETL activities, such as defining the flow
and dependencies that determine the order in which transformations run, or preparing for execution by checking
conditions such as, "Is my source file available?" or "Does a table exist in my database?"
The aim of this tutorial is to walk you through the basic concepts and processes involved in building a
transformation with PDI in a typical business scenario. In this scenario, you are loading a flat file (.CSV) of sales
data into a database so that mailing lists can be generated. Several of the customer records are missing postal
codes (zip codes) that must be resolved before loading into the database. In the preview feature of PDI, you will
use a combination of steps to cleanse, format, standardize, and categorize the sample data. The six basic steps
are:
• Prerequisites
• Step 1: Extract and load data
• Step 2: Filter for missing codes
• Step 3: Resolve missing data
• Step 4: Clean the data
• Step 5: Run the transformation
• Step 6: Orchestrate with jobs
https://help.hitachivantara.com/Documentation/Pentaho/9.0/Setup/Pentaho_Data_Integration_(PDI)_tutorial 1/33
Updated: Tue, 21 Mar 2023 08:57:35 GMT
Prerequisites
To complete this tutorial, you need the following items:
1. Select File > New > Transformation in the upper left corner of the PDI window.
2. Under the Design tab, expand the Input node, then select and drag a Text File Input step onto the
canvas.
3. Double-click the Text File Input step. In the Text file input window, you can set the step's various
properties.
5. Click Browse to locate the source file, sales_data.csv, in the ...\design-tools\data-
integration\samples\transformations\files folder. The Browse button appears in the top
right side of the window near the File or Directory field.
1. Click the Content tab, then set the Format field to Unix.
2. Click the File tab again and click Show file content near the bottom of the window.
3. The Number of lines (0=all lines) window appears. Click the OK button to accept the default.
4. The Content of first file window displays the file. Examine the file to see how the input file is delimited,
what enclosure character is used, and whether or not a header row is present.
In the sample, the input file is comma delimited, the enclosure character is a quotation mark ("), and
the file contains a single header row containing field names.
5. Click the Close button to close the window.
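The three properties examined above (separator, enclosure, header row) can also be checked outside PDI. The following Python sketch is only an illustration: the sample text and its column names are assumptions standing in for sales_data.csv, and the csv module belongs to Python's standard library, not to PDI.

```python
import csv
import io

# Hypothetical stand-in for the first lines of sales_data.csv;
# the real file contains more columns and rows.
SAMPLE = (
    '"ORDERNUMBER","CITY","STATE","POSTALCODE"\n'
    '"10107","NYC","NY","10022"\n'
    '"10121","Reims","",""\n'
)


def read_rows(text, separator=",", enclosure='"'):
    """Parse delimited text whose first line is a header row."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter=separator, quotechar=enclosure
    )
    return list(reader)


rows = read_rows(SAMPLE)
print(rows[0]["CITY"])
```

If the separator or enclosure were set incorrectly, the field names would not line up with the values, which is the same symptom you would look for in the PDI preview.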
1. Click the Content tab. The fields under the Content tab allow you to define how your data is formatted.
2. Verify that the Separator is set to comma (,) and that the Enclosure is set to quotation mark ("). Enable
Header because the file contains one header row.
3. Click the Fields tab and click Get Fields to retrieve the input fields from your source file. When the
Number of lines to sample window appears, enter 0 in the field then click OK.
4. If the Scan Result window displays, click Close to close the window.
5. To verify that the data is being read correctly, click the Content tab, then click Preview Rows.
6. In the Enter the number of rows you would like to preview window, click OK to accept the default.
The Examine preview data window appears.
7. Review the data. Do you notice any missing, incomplete, or inconsistent data?
9. Give the transformation a name and provide additional properties using the Transformation Properties
window. There are multiple ways to open the Transformation Properties window.
◦ Right-click any empty space on the canvas and select Properties.
◦ Double-click any empty space on the canvas.
◦ Use the CTRL-T keyboard combination.
10. In the Transformation Name field, type: Getting Started Transformation.
Below the name you will see that the filename is empty.
11. Click OK to close the Transformation Properties window.
1. Under the Design tab, expand the contents of the Output node.
3. Create a hop between the Read Sales Data and Table Output steps. To create the hop:
2. Click the Read Sales Data (Text File Input) step and drag the mouse to draw a line to the Table
Output step.
4. Double-click the Table Output step to open its Edit properties dialog box.
1. Click New next to the Connection field. You must create a connection to the database.
The Database Connection dialog box appears.
2. Provide the settings for connecting to the database.
Field Setting
3. Click Test to make sure your entries are correct. A success message appears. Click OK.
Note: If you get an error when testing your connection, ensure that you have provided the correct
settings information as described in the table and that the sample database is running. See Start and
Stop the Pentaho Server for information about how to start the Pentaho Server.
4. Click OK to exit the Database Connection window.
2. This table does not exist in the target database, so Pentaho can generate the DDL to create the table
and execute it. In this scenario, the DDL is based on the stream of data coming from the previous step,
which is Read Sales Data.
4. Click the SQL button at the bottom of the Table output dialog box to generate the DDL for creating your
target table.
5. The Simple SQL editor window appears with the SQL statements needed to create the table.
6. Click Execute to execute the SQL statement.
The Results of the SQL statements window appears.
7. Examine the results, then click OK to close the Results of the SQL statements window.
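To make the idea of generating DDL from the incoming stream more concrete, here is a minimal Python sketch. It is not the statement PDI produces: the table name SALES_DATA, the field list, and the type mapping are all assumptions chosen for illustration, and the SQL types PDI emits depend on the target database.

```python
# Hypothetical field metadata as (name, type, length) tuples; the real
# stream coming from Read Sales Data carries more fields than shown here.
FIELDS = [
    ("ORDERNUMBER", "Integer", None),
    ("SALES", "Number", None),
    ("CITY", "String", 50),
    ("POSTALCODE", "String", 9),
]

# Illustrative mapping only; actual SQL types vary by database.
SQL_TYPES = {
    "Integer": "BIGINT",
    "Number": "DOUBLE PRECISION",
    "String": "VARCHAR",
}


def create_table_ddl(table, fields):
    """Build a CREATE TABLE statement from stream field metadata."""
    columns = []
    for name, kind, length in fields:
        sql_type = SQL_TYPES[kind]
        if kind == "String" and length is not None:
            sql_type = f"{sql_type}({length})"
        columns.append(f"  {name} {sql_type}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n)"


# SALES_DATA is a placeholder table name for this sketch.
ddl = create_table_ddl("SALES_DATA", FIELDS)
print(ddl)
```

The point of the sketch is that the column list is derived entirely from the field metadata of the previous step, so changing the stream layout changes the generated DDL.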
2. Here you specify the number of rows to preview. Optionally, you can configure break-points which
pause execution based on a defined condition, such as a field having a specific value or exceeding a
threshold.
3. Click the Quick Launch button. Preview the data and notice that several of the input rows are missing
values for the POSTALCODE field.
4. Click the Stop button on the preview window to end the preview.
1. Add a Filter Rows step to your transformation. Under the Design tab, select Flow > Filter Rows.
2. You need to insert your Filter Rows step between your Read Sales Data step and your Write to
Database step.
1. Right-click and delete the hop between the Read Sales Data and Write to Database steps.
2. Create a hop between the Read Sales Data step and the Filter Rows step. To create a hop, click
the first step, hold down the SHIFT key, and drag to draw a line to the next step.
3. Create a hop between the Filter Rows step and Write to Database step.
3. Double-click the Filter Rows step. The Filter Rows window appears.
The Fields window appears. These are the conditions you can select.
6. In the Fields window select POSTALCODE and click OK.
7. Click the comparison operator (set to = by default), and select IS NOT NULL from the Functions
window that appears.
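The condition configured above splits the stream in two: rows where POSTALCODE is present, and rows where it is missing. A minimal Python sketch of that split follows. The function name, the sample rows, and the choice to treat an empty string as null are assumptions of this sketch, not a description of PDI internals.

```python
def filter_rows(rows, field):
    """Split rows into (true_stream, false_stream) on `field IS NOT NULL`."""
    true_stream, false_stream = [], []
    for row in rows:
        value = row.get(field)
        # Treating an empty string as null is an assumption of this sketch.
        if value is not None and value != "":
            true_stream.append(row)
        else:
            false_stream.append(row)
    return true_stream, false_stream


# Hypothetical sample rows.
sample = [
    {"CITY": "NYC", "POSTALCODE": "10022"},
    {"CITY": "Reims", "POSTALCODE": None},
]
has_zip, missing_zip = filter_rows(sample, "POSTALCODE")
print(len(has_zip), len(missing_zip))
```

Rows in the false stream are the ones that still need a postal code; rows in the true stream can continue toward the database unchanged.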
First, you will use a Text file input step to read from the source file. Then, you will use a Stream lookup step to
bring the resolved postal codes into the stream. Last, you will use the Select values step to rename fields on
the stream, remove unnecessary fields, and more.
2. Open the Text File Input step window, then enter Read Postal Codes in the Step name property.
1. Click the Content tab, then set the Format field to Unix.
2. Click the File tab again and click Show file content near the bottom of the window.
3. The Number of lines (0=all lines) window appears. Click the OK button to accept the default.
5. Examine the file to see how the input file is delimited, what enclosure character is used, and whether
or not a header row is present. In the example, the input file is comma (,) delimited, the enclosure
character is a quotation mark ("), and the file contains a single header row containing field names.
1. In the Content tab, change the Separator character to a comma (,) and confirm that the Enclosure
setting is a quotation mark (").
Make sure the Header option is selected.
2. Under the Fields tab, click Get Fields to retrieve the data from your .csv file.
3. The Number of lines to sample window appears. Enter 0 in the field, then click OK.
Resolve missing zip code information
Follow these steps to resolve the missing postal code information.
Procedure
1. Add a Stream Lookup step to your transformation by clicking the Design tab, expanding the Lookup
folder, then choosing Stream Lookup.
2. Draw a hop from the Filter Missing Zips step to the Stream lookup step. In the dialog box that appears,
select Result is FALSE.
3. Create a hop from the Read Postal Codes step to the Stream lookup step.
4. Double-click on the Stream lookup step to open the Stream Value Lookup window.
6. From the Lookup step drop-down box, select Read Postal Codes as the lookup step. Perform the
following:
1. Define the CITY and STATE fields in the key(s) to look up the value(s) table.
2. In row #1, click the drop down in the Field column and select CITY.
4. In row #2, click the drop down field in the Field column and select STATE.
7. Click Get Lookup Fields to pull the three fields from the Read Postal Codes step.
8. POSTALCODE is the only field you want to retrieve. To delete the CITY and STATE lines, right-click in the
line and select Delete Selected Lines.
9. In the New Name field, give POSTALCODE a new name of ZIP_RESOLVED and make sure the Type is set
to String.
11. Click OK to close the Stream Value Lookup edit properties dialog box.
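Conceptually, the lookup configured above matches each incoming row to the postal code file on CITY and STATE, and adds the matching POSTALCODE to the row under the name ZIP_RESOLVED. The Python sketch below illustrates that idea with a dictionary index. The function name and all sample values are hypothetical, and returning None for unmatched rows is an assumption of the sketch.

```python
def stream_lookup(rows, lookup_rows, keys, value_field, new_name):
    """Add `new_name` to each row by matching `keys` against lookup_rows."""
    # Index the lookup stream once so each row is resolved in constant time.
    index = {tuple(r[k] for k in keys): r[value_field] for r in lookup_rows}
    result = []
    for row in rows:
        enriched = dict(row)
        # Rows without a match get None (an assumption of this sketch).
        enriched[new_name] = index.get(tuple(row[k] for k in keys))
        result.append(enriched)
    return result


# Hypothetical data standing in for the two input streams.
missing_zip_rows = [
    {"CUSTOMERNAME": "Example Co", "CITY": "Springfield", "STATE": "XX"},
    {"CUSTOMERNAME": "Other Co", "CITY": "Nowhere", "STATE": "YY"},
]
postal_codes = [{"CITY": "Springfield", "STATE": "XX", "POSTALCODE": "00000"}]

resolved = stream_lookup(
    missing_zip_rows, postal_codes, ("CITY", "STATE"), "POSTALCODE", "ZIP_RESOLVED"
)
print(resolved[0]["ZIP_RESOLVED"])
```

Only the retrieved value is added to the row, which mirrors deleting the CITY and STATE lines from the list of fields to retrieve.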
1. To preview the data, select the Lookup Missing Zips step, then right-click. From the menu that appears,
select Preview.
2. In the Transformation debug dialog window, click Quick Launch to preview the data flowing through
this step.
4. Click Close to close the window.
Results
The execution results near the bottom of the PDI window display updated metrics in the Step Metrics tab.
1. Add a Select Values step to your transformation by expanding the Transform folder and choosing
Select Values.
2. Create a hop from the Lookup Missing Zips to the Select Values step.
3. Double-click the Select Values step to open its properties dialog box.
4. Rename the Select Values step to Prepare Field Layout.
5. Click Get fields to select to retrieve all fields and begin modifying the stream layout.
6. In the Fields list, find the # column and click the number for the ZIP_RESOLVED field.
Use CTRL+UP (macOS: COMMAND+UP) to move ZIP_RESOLVED just below the POSTALCODE field (the one
that still contains null values).
7. Select the old POSTALCODE field in the list (line 20), right-click in the line and select Delete Selected
Lines.
8. The original POSTALCODE field was formatted as a 9-character string. You must modify your new field
to match that format. Click the Meta-Data tab.
9. In the first row of the Fields to alter the meta-data for table, click in the Fieldname column
and select ZIP_RESOLVED. Perform the following steps:
2. Select String in the Type column, and type 9 in the Length column.
10. Draw a hop from the Prepare Field Layout (Select values) step to the Write to Database (Table output)
step.
11. When prompted, select the Main output of the step option.
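The layout change described above can be pictured as a small row-reshaping function: the old POSTALCODE value is dropped, and the resolved value takes its position as a string. The Python sketch below is only an illustration. Reusing the name POSTALCODE for the resolved field is an assumption (the rename is not spelled out in the steps above), and the length of 9 is treated as descriptive metadata that the sketch does not enforce.

```python
def prepare_field_layout(row):
    """Drop the old POSTALCODE and put the resolved value in its place."""
    out = {}
    for name, value in row.items():
        if name == "POSTALCODE":
            # Reusing the POSTALCODE name here is an assumption of this sketch.
            resolved = row.get("ZIP_RESOLVED")
            out["POSTALCODE"] = None if resolved is None else str(resolved)
        elif name != "ZIP_RESOLVED":
            out[name] = value
    return out


# Hypothetical row as it might arrive from the lookup step.
row = {"CITY": "Reims", "POSTALCODE": None, "SALES": 100.0, "ZIP_RESOLVED": 51100}
laid_out = prepare_field_layout(row)
print(list(laid_out))
```

After this step, rows from the lookup branch have the same field order as rows that already had a postal code, which is what allows both branches to feed the same downstream step.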
data by mapping United States to USA using the Value mapper step. Cleaning the data ensures there is
In addition, this section of the tutorial demonstrates how to use buckets for categorizing the SALES data into
small, medium, and large categories using the Number range step. The tutorial shows how to insert these
cleaning and categorizing functions into your transformation, just prior to the Write to Database step on the
canvas.
Procedure
1. Delete both hops connected to the Write to Database step. For each hop, right-click and select Delete.
2. Create some extra space on the canvas. Drag the Write to Database step toward the right on your
canvas.
3. Add the Value mapper step to your transformation by expanding the Transform folder and choosing
Value mapper.
4. Create a hop between the Filter Missing Zips and Value mapper steps. In the dialog box that appears,
select Result is TRUE.
5. Create a hop between the Prepare Field Layout and Value mapper steps. When prompted, select the
Main output of the step option.
1. Double-click the Value mapper step to open its properties dialog box.
3. In the Field Values table, define the United States and USA field values.
1. In row #1, click the field in the Source value column and type United States
2. Then, click the field in the Target value column and type USA
4. Click OK.
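The mapping defined above behaves like a dictionary lookup with a pass-through for values that have no entry. A minimal Python sketch follows. The field name COUNTRY and the function name are assumptions made for illustration; only the United States to USA pair comes from the tutorial.

```python
# Source value -> target value, as entered in the Field Values table.
COUNTRY_MAP = {"United States": "USA"}


def map_value(row, field, mapping):
    """Return a copy of row with `field` replaced by its mapped value."""
    mapped = dict(row)
    # Values without a mapping entry pass through unchanged.
    mapped[field] = mapping.get(row[field], row[field])
    return mapped


# COUNTRY is a hypothetical field name for this sketch.
print(map_value({"COUNTRY": "United States"}, "COUNTRY", COUNTRY_MAP))
print(map_value({"COUNTRY": "France"}, "COUNTRY", COUNTRY_MAP))
```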
Apply ranges
Follow these steps to apply ranges to your transformation.
Procedure
1. Add a Number range step to your transformation by expanding the Transform folder and choosing
Number range.
2. Create a hop between the Value mapper and Number range steps.
3. Create a hop between the Number range and Write to Database (built using Table output) steps. When
prompted, select the Main output of the step option.
4. Double-click the Number range step to open its properties dialog box.
7. In the Ranges (min <=x< max) table, define the Lower Bound and Upper Bound field ranges along with
the bucket Value.
1. In row #1, click the field in the Upper Bound column and type 3000.0. Then, click the field in
the Value column and type Small.
2. In row #2, click the field in the Lower Bound column and type 3000.0, then click the field in the
Upper Bound column and type 7000.0. Click the field in the Value column and type Medium.
3. In row #3, click the field in the Lower Bound column and type 7000.0. Then, click the field in
the Value column and type Large.
8. Click OK.
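The three rows entered above follow the min <= x < max rule shown in the table title, so a value of exactly 3000.0 falls into Medium and exactly 7000.0 falls into Large. The Python sketch below works through that rule. Treating an empty bound as unbounded and returning "unknown" when nothing matches are assumptions of this sketch.

```python
import math

# (lower bound, upper bound, value); an empty bound in the table is
# modeled as infinity, which is an assumption of this sketch.
RANGES = [
    (-math.inf, 3000.0, "Small"),
    (3000.0, 7000.0, "Medium"),
    (7000.0, math.inf, "Large"),
]


def bucket(value, ranges=RANGES, default="unknown"):
    """Return the label of the first range where lower <= value < upper."""
    for lower, upper, label in ranges:
        if lower <= value < upper:
            return label
    return default


for sales in (2999.99, 3000.0, 6999.99, 7000.0):
    print(sales, bucket(sales))
```

Because the upper bound is exclusive, adjacent ranges can share a boundary value without overlapping.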
1. Double-click the Write to Database step to open its Edit properties dialog box.
2. Click the SQL button at the bottom of the Table output dialog box to generate the new DDL for editing/
altering your original target table.
1. The Simple SQL editor window appears with the SQL statements needed to alter the table.
3. The Results of the SQL statements window appears. Examine the results, then click OK to close
the Results of the SQL statements window.
3. Save your transformation.
Results
After the transformation runs, the Execution Results panel opens below the canvas.
• Step Metrics
Provides statistics for each step in your transformation including how many records were read, written,
caused an error, processing speed (rows per second) and more. This tab also indicates whether an
error occurred in a transformation step.
This tutorial introduces no intentional transformation errors, so the transformation should run
correctly. But, if a mistake does occur, steps that caused the transformation to fail are highlighted in
red. In the example below, the Lookup Missing Zips step caused an error.
• Logging
Displays the logging details for the most recent execution of the transformation. It also allows you to
drill deeper to determine where errors occur. Error lines are highlighted in red. In the example below,
the Lookup Missing Zips step caused an error because it attempted to lookup values on a field called
POSTALCODE2, which did not exist in the lookup stream.
• Execution History
Provides access to the Step Metrics and log information from previous executions of the
transformation. This feature works only if you have configured your transformation to log to a database
through the Logging tab of the Transformation Settings dialog box. For more information on
configuring logging or viewing the execution history, see Analyze your transformation results.
• Performance Graph
Analyzes the performance of steps based on a variety of metrics including how many records were
read, written, caused an error, processing speed (rows per second) and more. Like the Execution
History, this feature requires you to configure your transformation to log to a database through the
Logging tab found in the Transformation Settings dialog box.
• Metrics tab
Displays a Gantt chart after the transformation or job runs. This information includes how long it takes
to connect to a database, how much time is spent executing a SQL query, or how long it takes to load a
transformation.
• Preview Data
• Defining the flow and dependencies that control the linear order for the transformations to run.
• Preparing for execution by checking conditions such as, "Is my source file available?" or "Does a table
exist?"
• Performing bulk load database operations.
• Assisting file management, such as posting or retrieving files using FTP, copying files and deleting files.
• Sending success or failure notifications through email.
For this part of the tutorial, imagine that an external system is responsible for placing your sales_data.csv
input in its source location every Saturday night at 9 p.m. You want to create a job that will verify that the file
has arrived and then run the transformation to load the records into the database. In a subsequent exercise,
you will schedule the job to run every Sunday morning at 9 a.m.
The following steps assume that you have built a Getting Started transformation as described in Step 1: Extract
and load data of the tutorial.
Procedure
2. Expand the General folder and drag a Start job entry onto the graphical workspace.
The Start job entry defines where the execution will begin.
Note: Jobs run their entries in sequential order, whereas transformations can run their steps in parallel.
3. Expand the Conditions folder and add a File Exists job entry.
4. Draw a hop from the Start job entry to the File Exists job entry.
5. Double-click the File Exists job entry to open its Edit Properties dialog box. Click Browse and set the
filter near the bottom of the window to All Files. Select the sales_data.csv from the following
location: ...\design-tools\data-integration\samples\transformations\files.
9. Draw a hop between the File Exists and the Transformation job entries.
10. Double-click the Transformation job entry to open its Edit Properties dialog box.
11. Click Browse to open the Select repository object window. Browse to and select the Getting Started
transformation.
14. Click the Run icon in the toolbar. When the Run Options window appears, choose Local environment type
and click Run. The Execution Results panel should open showing you the job metrics and log
information for the job execution.
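The job built in this procedure is a short sequential chain: start, check that the file exists, then run the transformation. The Python sketch below mirrors that control flow. It does not invoke PDI: run_job, fake_transformation, and the returned status strings are hypothetical names, and the temporary directory only stands in for the sample files folder.

```python
import os
import tempfile


def run_job(source_path, transformation):
    """Start -> File Exists -> Transformation, executed in that order."""
    if not os.path.isfile(source_path):
        # The transformation entry is never reached when the check fails.
        return "file missing: transformation skipped"
    transformation(source_path)
    return "transformation finished"


calls = []


def fake_transformation(path):
    """Hypothetical stand-in for the Getting Started transformation."""
    calls.append(path)


with tempfile.TemporaryDirectory() as folder:
    present = os.path.join(folder, "sales_data.csv")
    open(present, "w").close()  # create an empty placeholder file
    outcome_present = run_job(present, fake_transformation)
    outcome_missing = run_job(
        os.path.join(folder, "missing.csv"), fake_transformation
    )

print(outcome_present)
print(outcome_missing)
```

The check runs before the load, so a late or missing file results in a skipped run rather than a failed transformation.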