Packt+ | Advance your knowledge in tech

You're reading from Pentaho Data Integration Quick Start Guide Create ETL processes using Pentaho

Product type Paperback

Published in Aug 2018

Publisher Packt

ISBN-13 9781789343328

Length 178 pages

Edition 1st Edition

Languages

Java

Tools

Pentaho

Concepts

Business Intelligence

Author (1):

Carina Roldán

View More author details

Table of Contents (15) Chapters

Title Page

Dedication

Packt Upsell

Foreword

Contributors

Preface

1. Getting Started with PDI FREE CHAPTER

2. Getting Familiar with Spoon

3. Extracting Data

4. Transforming Data

5. Loading Data

6. Orchestrating Your Work

1. Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Designing, previewing, and running transformations

In this section, we will create a transformation that is a bit more interesting than the one you already built. In doing this, you will have a chance to learn about the process of designing transformations, while also previewing your work.

The task is as follows: you will be given a file with a list of cities in the USA, along with their zip codes and their state names. You will have to generate a file containing only the cities in the state of NY, sorted by zip code. We will split the task into the following steps:

Designing and previewing the transformation
Learning to deal with errors that may appear
Saving and running the transformation

Designing and previewing a transformation

Let's start by developing the first part of the transformation. We will read the file and filter the data. In this case, the solution is quite straightforward (this will not always be the case). There is a PDI step for each of the tasks to accomplish. The CSV file input step will serve for reading the file, and the Filter rows step will filter the rows. The instructions are as follows:

First, create a transformation. You can do this from the main menu, from the main toolbar, or by pressing Ctrl + T.
From the Input folder that contains steps, drag and drop a CSV file input step to the work area.
Double-click on the step. A configuration window will show up.
Click on the Browse... button to locate the file. For this exercise, we will use a file that comes with the PDI software. You will find it in the following path, under the installation folder: samples\transformations\files\Zipssortedbycitystate.csv.
Click onGet Fields. The grid will be filled with the columns found in the file:

Configuring a CSV file input step

Click on Preview, then click on OK. A window with sample data will appear, as shown in the following screenshot:

Sample data

Click on Close to close the Examine preview data window, and then click on OK to close the configuration window.

Now that we have read the file, the data is available for further processing. The rows coming from the CSV file input step will flow towards the next step, which will be the filter:

From the Flow folder, drag and drop a Filter rows step.
Click on the output connector in the CSV file input step to create a hop towards the Filter rows step.
You will be prompted for the kind of hop. Select Main output of step, shown as follows:

Selecting a kind of hop

Double-click on the Filter rows step to configure the filter.
Fill in the configuration window, as shown in the following screenshot, to indicate that we will only keep rows with states equal to NY:

Configuring a filter

Close the window. The following is what you should have so far:

Simple transformation

Now, we will preview the results to see if we get what we expected:

Make sure that the Filter rows step is selected.

Note

When a step is selected, its border becomes wider, as shown in the previous screenshot.

Press F10 to preview the results. Alternatively, click on the Preview icon (the icon that looks like an eye) in the transformation toolbar. Then, click on Quick Launch. A window with the filtered rows will appear, as follows:

Previewing data

Note

By default, only 1,000 rows are previewed. If you want to look at more data, just click on Get more rows.

Click on Stop to stop the previewing process and close the window.

Note

As you can see in the preceding image, when you preview or run a transformation, a small window with metrics is displayed above the steps while the rows are being processed. These metrics are the same as those shown in the Steps metrics tab in the Execution Results window.

Understanding the logging options

PDI logs all of the executions of a transformation. By default, the level of the logging details is basic, but there are seven possible levels of logging, ranging from Nothing at all to Rowlevel (very detailed), which is the most detailed level of logging. You can change the level of logging as follows:

If you will run a transformation, in the Execute a transformation window, before clicking on Run, select the proper option:

Selecting the Log level

If you are previewing a transformation, instead of clicking on Quick Launch, select Configure. This will show you the Execute a transformation window. In this window, choose Log level, and then click on Run.

Understanding the Step Metrics tab

Before continuing, let's observe what is happening in the Execution Results window. You already know the Logging tab, which displays every task that you are performing. Now, click on the Step Metrics tab. You will see the following:

Step Metrics tab

In this tab, there is a grid with one row for each of the steps in the transformation. In this case, we have two of them: one for the CSV file input step, and one for the Filter rows step. The columns in the grid describe what happened in each step. The following are the most relevant columns in our example:

Read: The number of rows coming from the previous step
Written: The number of rows that leave the current step toward the next one
Input: The number of rows coming from external sources

For instance, the rows that the CSV file input step reads from the file travel toward the Filter rows step. In other words, the output of the CSV file input step, displayed under the Written column, is the input of the Filter rows step, displayed under the Read column.

Also, if you look at the Filter rows line in the Steps Metrics grid, the number under the Written column represents the number of rows that will leave the step (that is, the rows remaining after filtering).

CSV file input is the only step that gets data from an external source – a file. Therefore, this is the only step that has a value greater than zero in the Input column.

The last columns in the grid – Time, Speed (r/s), and input/output – are metrics to monitor the performance of the execution. As to the rest of the columns in the grid, they will be described in later chapters.

Dealing with errors while designing

Now, we will continue working on the transformation created in the previous section. This time, we will sort the final data by ZIP code. This is a very simple task, but we will use it as a method to learn how to deal with errors that may appear while we are designing:

From the Transform folder that contains steps, drag and drop a Sort rows step to the work area.
Create a hop from the Filter rows step to this one. Again, you will be prompted for the kind of hop. Select the Main output of step option:

Kinds of hop leaving a Filter rows step

Double-click on the Sort rows icon. Fill in the grid as follows:

Sorting data

Close the window.
Make sure that the Sort rows step is selected, and run a preview like you did before. If you followed the steps as explained, you will get an error.

There are several indications that will help you to understand that an error occurred:

A small red icon will appear in the upper-right corner of one or more steps. These are the steps that are causing the error.
The backgrounds of the corresponding rows in the Step Metrics tab will change to red:

Errors in the Step Metrics tab

The Logging tab will contain text explaining the error:

Errors in the Logging tab

In this case, as stated in the log, the problem was that we were referring to a field that doesn't exist. We typed ZIPCODE instead of POSTALCODE. Let's fix it, as follows:

Double-click on the Sort rows step and fix the name of the field
Close the window and run a preview again
You will see the rows with states equal to NY, sorted by ZIP code

Saving and running a transformation

The last task before saving and running the transformation is to send the results to a file. This is quite easy:

From the Output folder, drag and drop a Text file output step to the work area. Create a hop from the Sort rows step to this new step. Note that this time, you don't have to choose the kind of hop; a default kind of hop will be created.
Double-click on the Text file output step. In the configuration window, provide a name for the file that we will generate. You should specify the full path, for instance, C:/Pentaho/data/ny_cities.

Note

You don't have to type the extension; it is automatically added, as indicated in the extension textbox.

Close the window.

The transformation is complete. The only task to perform now is to save it and run it, as follows:

Save the transformation. You can do so by pressing Ctrl + S or by selecting the proper option from Main Menu or Main Toolbar.

Once the transformation has saved, you can run it. Do so by pressing F9. In the Logging tab of the Execution Results window, you will see the log of the execution. If you select the Preview data tab in the same window, you will see sample data coming from the step currently selected. As an example, click on the Filter rows step and look at the data in the Preview data tab. You will see all of the rows for the state of NY, although they are still out of order:

Preview data tab

If you click on the Sort rows step, you will see the same, but ordered. Also, a file should have been created with the same information. Browse your system to look for the generated file. Its content should be something like the following:

     CITY;STATE;POSTALCODE
      NEW YORK;NY;10001
      NEW YORK;NY;10003
      NEW YORK;NY;10005
      ...
      ...
      ELMIRA;NY;14925
      HOLTSVILLE;NY;501
      FISHERS ISLAND;NY;6390

Note

If you look at the sample lines, you will note that the code 501 is between 14925 and 6390. The codes are not sorted by number, but alphabetically. This is because the ZIP code was defined as a String in the input step.