Pentaho Data Integration Quick Start Guide

Chapter 2. Getting Familiar with Spoon

This chapter will show you how to work with Spoon by designing, debugging, and testing a transformation. In addition to exploring Spoon features, you will also learn the basics for handling errors when you are designing a transformation.

This chapter will cover the following topics:

Exploring the Spoon interface
Designing, previewing, and running transformations
Defining and using Kettle variables
Running transformations with the Pan utility

Designing, previewing, and running transformations

In this section, we will create a transformation that is a bit more interesting than the one you already built. In doing this, you will have a chance to learn about the process of designing transformations, while also previewing your work.

The task is as follows: you will be given a file with a list of cities in the USA, along with their ZIP codes and state names. You will have to generate a file containing only the cities in the state of NY, sorted by ZIP code. We will split the task into the following steps:

Designing and previewing the transformation
Learning to deal with errors that may appear
Saving and running the transformation

Designing and previewing a transformation

Let's start by developing the first part of the transformation. We will read the file and filter the data. In this case, the solution is quite straightforward (this will not always be the case). There is a PDI step for each of the tasks to accomplish. The CSV file input step will serve for reading the file, and the Filter rows step will filter the rows. The instructions are as follows:

First, create a transformation. You can do this from the main menu, from the main toolbar, or by pressing Ctrl + N.
From the Input folder that contains steps, drag and drop a CSV file input step to the work area.
Double-click on the step. A configuration window will show up.
Click on the Browse... button to locate the file. For this exercise, we will use a file that comes with the PDI software. You will find it in the following path, under the installation folder: samples\transformations\files\Zipssortedbycitystate.csv.
Click on Get Fields. The grid will be filled with the columns found in the file:

Configuring a CSV file input step

Click on Preview, then click on OK. A window with sample data will appear, as shown in the following screenshot:

Sample data

Click on Close to close the Examine preview data window, and then click on OK to close the configuration window.

Now that we have read the file, the data is available for further processing. The rows coming from the CSV file input step will flow towards the next step, which will be the filter:

From the Flow folder, drag and drop a Filter rows step.
Click on the output connector in the CSV file input step to create a hop towards the Filter rows step.
You will be prompted for the kind of hop. Select Main output of step, shown as follows:

Selecting a kind of hop

Double-click on the Filter rows step to configure the filter.
Fill in the configuration window, as shown in the following screenshot, to indicate that we will only keep rows with states equal to NY:

Configuring a filter

Close the window. The following is what you should have so far:

Simple transformation
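
For readers who think in code, the two steps configured so far behave roughly like the following Python sketch. It is only an analogy, not how PDI is implemented: the sample rows and the comma delimiter are assumptions made for illustration, and only the column names come from this chapter.

```python
import csv
import io

# Hypothetical sample standing in for the real file. The column names follow
# the chapter; the rows and the comma delimiter are assumptions.
SAMPLE = """CITY,STATE,POSTALCODE
NEW YORK,NY,10001
BOSTON,MA,02108
ELMIRA,NY,14925
"""


def read_rows(text):
    """Plays the role of the CSV file input step: one dict per data line."""
    yield from csv.DictReader(io.StringIO(text))


def filter_rows(rows, state):
    """Plays the role of the Filter rows step: keep rows whose STATE matches."""
    return (row for row in rows if row["STATE"] == state)


# Rows flow from the first function into the second, like rows along a hop.
ny_rows = list(filter_rows(read_rows(SAMPLE), "NY"))
for row in ny_rows:
    print(row["CITY"], row["STATE"], row["POSTALCODE"])
```

Each function consumes rows from the previous one, which mirrors how a hop connects the output of one step to the input of the next.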

Now, we will preview the results to see if we get what we expected:

Make sure that the Filter rows step is selected.

Note

When a step is selected, its border becomes wider, as shown in the previous screenshot.

Press F10 to preview the results. Alternatively, click on the Preview icon (the icon that looks like an eye) in the transformation toolbar. Then, click on Quick Launch. A window with the filtered rows will appear, as follows:

Previewing data

Note

By default, only 1,000 rows are previewed. If you want to look at more data, just click on Get more rows.

Click on Stop to stop the previewing process and close the window.

Note

As you can see in the preceding image, when you preview or run a transformation, a small window with metrics is displayed above the steps while the rows are being processed. These metrics are the same as those shown in the Step Metrics tab in the Execution Results window.

Understanding the logging options

PDI logs every execution of a transformation. By default, the logging level is Basic, but there are seven possible levels, ranging from Nothing at all to Rowlevel (very detailed). You can change the level of logging as follows:

If you are running a transformation, in the Execute a transformation window, before clicking on Run, select the proper option:

Selecting the Log level

If you are previewing a transformation, instead of clicking on Quick Launch, select Configure. This will show you the Execute a transformation window. In this window, choose Log level, and then click on Run.

Understanding the Step Metrics tab

Before continuing, let's observe what is happening in the Execution Results window. You already know the Logging tab, which displays every task that you are performing. Now, click on the Step Metrics tab. You will see the following:

Step Metrics tab

In this tab, there is a grid with one row for each of the steps in the transformation. In this case, we have two of them: one for the CSV file input step, and one for the Filter rows step. The columns in the grid describe what happened in each step. The following are the most relevant columns in our example:

Read: The number of rows coming from the previous step
Written: The number of rows that leave the current step toward the next one
Input: The number of rows coming from external sources

For instance, the rows that the CSV file input step reads from the file travel toward the Filter rows step. In other words, the output of the CSV file input step, displayed under the Written column, is the input of the Filter rows step, displayed under the Read column.

Also, if you look at the Filter rows line in the Step Metrics grid, the number under the Written column represents the number of rows that will leave the step (that is, the rows remaining after filtering).

CSV file input is the only step that gets data from an external source – a file. Therefore, this is the only step that has a value greater than zero in the Input column.

The last columns in the grid – Time, Speed (r/s), and input/output – are metrics to monitor the performance of the execution. As to the rest of the columns in the grid, they will be described in later chapters.
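
The relationship between the Input, Read, and Written columns can be illustrated with a toy model. The following Python sketch is an assumption-laden simplification (the class, the sample lines, and the semicolon delimiter are all hypothetical), meant only to show why Written for one step equals Read for the next:

```python
from dataclasses import dataclass


@dataclass
class StepMetrics:
    name: str
    input_rows: int = 0  # rows coming from an external source
    read: int = 0        # rows coming from the previous step
    written: int = 0     # rows leaving toward the next step


def run(file_lines):
    csv_step = StepMetrics("CSV file input")
    filter_step = StepMetrics("Filter rows")
    for line in file_lines:
        csv_step.input_rows += 1   # the row came from the file
        csv_step.written += 1      # and is passed along the hop
        filter_step.read += 1      # where the filter receives it
        if line.split(";")[1] == "NY":
            filter_step.written += 1  # only matching rows leave the filter
    return csv_step, filter_step


# Hypothetical sample lines in CITY;STATE;POSTALCODE order.
lines = ["NEW YORK;NY;10001", "BOSTON;MA;02108", "ELMIRA;NY;14925"]
csv_step, filter_step = run(lines)
print(csv_step)
print(filter_step)
```

In this model, only the first step has a nonzero input count, and the filter writes fewer rows than it reads.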

Dealing with errors while designing

Now, we will continue working on the transformation created in the previous section. This time, we will sort the final data by ZIP code. This is a very simple task, but we will use it as a method to learn how to deal with errors that may appear while we are designing:

From the Transform folder that contains steps, drag and drop a Sort rows step to the work area.
Create a hop from the Filter rows step to this one. Again, you will be prompted for the kind of hop. Select the Main output of step option:

Kinds of hop leaving a Filter rows step

Double-click on the Sort rows icon. Fill in the grid as follows:

Sorting data

Close the window.
Make sure that the Sort rows step is selected, and run a preview like you did before. If you followed the steps as explained, you will get an error.

There are several indications that will help you to understand that an error occurred:

A small red icon will appear in the upper-right corner of one or more steps. These are the steps that are causing the error.
The backgrounds of the corresponding rows in the Step Metrics tab will change to red:

Errors in the Step Metrics tab

The Logging tab will contain text explaining the error:

Errors in the Logging tab

In this case, as stated in the log, the problem was that we were referring to a field that doesn't exist. We typed ZIPCODE instead of POSTALCODE. Let's fix it, as follows:

Double-click on the Sort rows step and fix the name of the field
Close the window and run a preview again
You will see the rows with states equal to NY, sorted by ZIP code
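
The same kind of mistake is easy to reproduce outside of PDI. The following Python sketch, which uses hypothetical sample rows and a hypothetical sort_rows helper, fails when asked to sort by a field that is not in the data and succeeds once the name is corrected:

```python
# Hypothetical rows; only the field names come from the chapter.
rows = [
    {"CITY": "ELMIRA", "STATE": "NY", "POSTALCODE": "14925"},
    {"CITY": "NEW YORK", "STATE": "NY", "POSTALCODE": "10001"},
]


def sort_rows(rows, field):
    """Sort rows by one field, failing early if the field does not exist."""
    if rows and field not in rows[0]:
        available = ", ".join(rows[0])
        raise ValueError(f"field {field} not found; available fields: {available}")
    return sorted(rows, key=lambda row: row[field])


error_message = None
try:
    sort_rows(rows, "ZIPCODE")  # wrong name, as in the exercise
except ValueError as err:
    error_message = str(err)
    print("Error:", error_message)

fixed = sort_rows(rows, "POSTALCODE")  # corrected name
print([row["POSTALCODE"] for row in fixed])
```

As in Spoon, the error message names the missing field, which is usually enough to spot the typo.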

Saving and running a transformation

The last task before saving and running the transformation is to send the results to a file. This is quite easy:

From the Output folder, drag and drop a Text file output step to the work area. Create a hop from the Sort rows step to this new step. Note that this time, you don't have to choose the kind of hop; a default kind of hop will be created.
Double-click on the Text file output step. In the configuration window, provide a name for the file that we will generate. You should specify the full path, for instance, C:/Pentaho/data/ny_cities.

Note

You don't have to type the extension; it is automatically added, as indicated in the extension textbox.

Close the window.

The transformation is complete. The only task to perform now is to save it and run it, as follows:

Save the transformation. You can do so by pressing Ctrl + S or by selecting the proper option from Main Menu or Main Toolbar.

Once the transformation has been saved, you can run it. Do so by pressing F9. In the Logging tab of the Execution Results window, you will see the log of the execution. If you select the Preview data tab in the same window, you will see sample data coming from the step currently selected. As an example, click on the Filter rows step and look at the data in the Preview data tab. You will see all of the rows for the state of NY, although they are still out of order:

Preview data tab

If you click on the Sort rows step, you will see the same, but ordered. Also, a file should have been created with the same information. Browse your system to look for the generated file. Its content should be something like the following:

      CITY;STATE;POSTALCODE
      NEW YORK;NY;10001
      NEW YORK;NY;10003
      NEW YORK;NY;10005
      ...
      ...
      ELMIRA;NY;14925
      HOLTSVILLE;NY;501
      FISHERS ISLAND;NY;6390

Note

If you look at the sample lines, you will note that the code 501 is between 14925 and 6390. The codes are not sorted by number, but alphabetically. This is because the ZIP code was defined as a String in the input step.
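
The difference between the two orderings is easy to see in a few lines of Python. This is only an illustration of text versus numeric comparison, using four of the codes from the sample above:

```python
codes = ["6390", "10001", "501", "14925"]

# Compared as text, values are ordered character by character,
# so "14925" sorts before "501" because "1" comes before "5".
as_text = sorted(codes)

# Compared as numbers, the values are ordered by magnitude.
as_numbers = sorted(codes, key=int)

print(as_text)     # ['10001', '14925', '501', '6390']
print(as_numbers)  # ['501', '6390', '10001', '14925']
```

The first result matches the order in the generated file, which is what you would expect when the field is treated as a String.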

Defining and using Kettle variables

In PDI, you can define and use variables, just as you do when you code in any computer language. We already defined a couple of variables when we created the kettle.properties file in Chapter 1, Getting Started with PDI. Now, we will see where and how to use them.

It's simple: any time you see a dollar sign by the side of a textbox, you can use a variable:

Sample textboxes that allow variables

You can reference a variable by enclosing its name in curly braces, preceded by a dollar sign (for example, ${INPUT_FOLDER}).

Note

A less used notation for a variable is as follows: %%<variable name>%% (for example, %%INPUT_FOLDER%%).
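
To make the two notations concrete, the following Python sketch mimics the substitution. It is not PDI's implementation: the substitute function, the sample value, and the choice to leave unknown variables untouched are assumptions of this sketch.

```python
import re

# Matches ${NAME} or %%NAME%%. Names may contain dots.
PATTERN = re.compile(r"\$\{([^}]+)\}|%%([^%]+)%%")


def substitute(text, variables):
    """Replace known variable references and leave unknown ones as they are."""
    def replace(match):
        name = match.group(1) or match.group(2)
        return variables.get(name, match.group(0))
    return PATTERN.sub(replace, text)


# Hypothetical value, for illustration only.
variables = {"INPUT_FOLDER": "C:/pdi_files/input"}

with_braces = substitute("${INPUT_FOLDER}/cities.csv", variables)
with_percent = substitute("%%INPUT_FOLDER%%/cities.csv", variables)
unknown = substitute("${NOT_DEFINED}/cities.csv", variables)

print(with_braces)
print(with_percent)
print(unknown)
```

Both notations resolve to the same text, and variable references can sit next to static text such as the file name.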

Let's go back to the transformation created in the previous section. Instead of a fixed value for the location of the output file, we will use variables. The following describes how to do it:

Open the transformation (if you had closed it). You can do this from Main Menu or from Main Toolbar.
Double-click on the Text file output step. Replace the full path for the location of the file with the following: ${OUTPUT_FOLDER}/${FILENAME}.

Note

Note that you can combine variables, and can also mix variable names with static text.

Close the window and press F9 to run the transformation.

In the window that appears, select the Variables tab. You will see the names of both variables – OUTPUT_FOLDER and FILENAME:

Variables in the Execute a Transformation window

The OUTPUT_FOLDER variable already has a value, which is taken from the kettle.properties file. The FILENAME variable doesn't have a value yet.

To the right of the name, type the name that you want to give to the output file, as shown in the following screenshot:

Entering values for variables

Click on Run
Browse the filesystem to make sure that the file with the name provided was generated

Besides the user-defined variables – those created by you, either in the kettle.properties file or inside Spoon – PDI has a list of predefined variables that you can also use. The list mainly includes variables related to the environment (for example, ${os.name}, for the name of the operating system on which you are working, or ${Internal.Entry.Current.Directory}, which references the file directory where the current job or transformation is saved). To see the full list of variables, both predefined and user-defined, just position the cursor inside any textbox where a variable is allowed, and press Ctrl + Spacebar. A full list will be displayed.

If you click on any of the variables for a second, the actual value of the variable will be shown, as indicated in the following screenshot:

PDI variables

If you double-click on a variable name, the name will be transcribed into the textbox.

Using named parameters

In the last exercise, you used two variables: one created in the kettle.properties file, and the other created inside of Spoon at runtime. There are still more ways to define variables. One of them is to create a named parameter. Named parameters are variables that you define in a transformation, and they can have a default value. You only have to supply a value if it differs from the default. Let's look at how it works, as follows:

Open the last transformation (if you had closed it).
Double-click anywhere in the work area except on the steps or hops. This will open the Transformation properties window.

Click on the Parameters tab. This is where we define the named parameters.
Fill in the grid as shown, replacing the path in the example with the real path where you have PDI installed:

Defining a named parameter

Close the window.
Double-click on the CSV file input step. Replace the full path of the location of the file with the following: ${SAMPLES_DIR}/Zipssortedbycitystate.csv.
Close the window and save the transformation.

Press F9 to run the transformation. The Parameters tab in the Run Options window will show the named parameter that we just defined:

Running a transformation with a named parameter

Click on Run. PDI will replace the value of the variable, exactly as it did before.

Note that this time, we didn't supply a value for the variable, as it already had a proper value. Now, suppose that we move the samples folder to a different location. The following describes how we can provide the new value:

Press F9 to run the transformation.
In the Parameters tab, fill in the Value column with the proper value, as shown in the following screenshot:

Supplying a value for a named parameter

Click on Run. PDI will replace the value of the variable with the value that you provided, and will read the file from that location.
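
The rule "use the default unless a value is supplied" can be sketched in Python as follows. The resolve_parameters function and the default path are hypothetical and only illustrate the behavior described above; they are not part of PDI.

```python
def resolve_parameters(defaults, supplied=None):
    """Start from the defaults and override only the values that were supplied."""
    resolved = dict(defaults)
    for name, value in (supplied or {}).items():
        if name not in defaults:
            raise KeyError(f"unknown parameter: {name}")
        if value:  # an empty value means "keep the default"
            resolved[name] = value
    return resolved


# Hypothetical default; replace it with the real path of your installation.
defaults = {"SAMPLES_DIR": "C:/pdi/samples/transformations/files"}

first_run = resolve_parameters(defaults)
second_run = resolve_parameters(defaults, {"SAMPLES_DIR": "c:/samples"})

print(first_run["SAMPLES_DIR"])
print(second_run["SAMPLES_DIR"])
```

The first call corresponds to running without typing anything in the Value column, and the second to supplying a new location for the samples folder.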

Running transformations with the Pan utility

So far, you have used Spoon to create and run transformations. However, if you want to run a transformation in a production environment, you won't use Spoon, but a command-line utility named Pan.

Let's quickly look at how to use this tool.

If you browse the PDI installation directory, you will see two versions of the utility: Pan.bat and Pan.sh. You will use the first if you have a Windows environment, and the second for other systems.

Note

In the next step-by-step tutorial, we will assume that you have Windows, but you should make the required adjustments if you have a different system.

The simplest way to run a transformation with Pan is to provide the full path of the transformation that you want to run. You can execute Pan in Windows as follows:

Pan.bat /file=<ktr file name>

For Unix, Linux, and other Unix-like systems, use the following command:

./Pan.sh /file=<ktr file name>

Let's suppose that you want to run the first transformation created in this chapter, which is located in the following directory:

c:/pdi_labs/my_first_transformation.ktr

In order to run it, follow these instructions:

Open a Terminal window
Go to the directory where PDI is installed and type the following code:

       Pan /file=c:/pdi_labs/my_first_transformation.ktr

Note

You must include the full path for the transformation file. If the name contains spaces, surround it with double quotes.

After running the command, you will see the log of the execution, which is the same log that you see in the Execution Results window in Spoon. In order to change the log level, just add the following:

/level:<log level>

The possible values for the log level are Nothing, Minimal, Error, Basic, Detailed, Debug, and Rowlevel.

As an example, the following command will print not only the basic log, but also the details of every row that is being processed:

Pan /level:Rowlevel /file=c:/pdi_labs/my_first_transformation.ktr

The details of the rows are as follows:

 ...
 2018/06/10 12:33:43 - Sort rows.0 - Read row: [YONKERS], [NY], [10701]
 2018/06/10 12:33:43 - Sort rows.0 - Read row: [YONKERS], [NY], [10703]
 2018/06/10 12:33:43 - Sort rows.0 - Read row: [YONKERS], [NY], [10705]
 2018/06/10 12:33:43 - Sort rows.0 - Read row: [YORKVILLE], [NY], [13495]
 2018/06/10 12:33:43 - Sort rows.0 - Read row: [YULAN], [NY], [12792]
 2018/06/10 12:33:43 - Sort rows.0 - Signaling 'output done' to 0 output rowsets.
 2018/06/10 12:33:43 - Sort rows.0 - Finished processing (I=0, O=0, R=1146, W=1146, U=0, E=0)
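
If you need these counters in a script, the final line of each step can be parsed. The following Python sketch assumes that the line has exactly the format shown above; the parse_finished function is hypothetical, and the letters R and W are read here as the Read and Written counts from the Step Metrics tab.

```python
import re

LINE = ("2018/06/10 12:33:43 - Sort rows.0 - "
        "Finished processing (I=0, O=0, R=1146, W=1146, U=0, E=0)")


def parse_finished(line):
    """Return (step name, counters) for a 'Finished processing' line, else None."""
    match = re.search(r" - (.+?) - Finished processing \((.+)\)", line)
    if match is None:
        return None
    step, counters = match.groups()
    metrics = {}
    for pair in counters.split(", "):
        key, value = pair.split("=")
        metrics[key] = int(value)
    return step, metrics


step, metrics = parse_finished(LINE)
print(step, metrics)
```

Lines that do not match the pattern, such as the Read row lines, simply return None.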

In the last version of our transformation, we added a named parameter with the path where PDI had to look for the input file. In Spoon, you provided the value in the Execution window. When running the transformation with Pan, you do it by using the param option, as follows:

/param:<parameter name>=<parameter value>

In our example, supposing that the new value is c:/samples, we build the command-line parameter as follows:

/param:"SAMPLES_DIR=c:/samples"
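
When Pan is launched from a script, it can help to assemble the command in one place. The following Python sketch only builds the argument list and does not run anything. The build_pan_command function is hypothetical, it uses only the file, level, and param options described in this section (in the slash form), and it assumes the Windows launcher.

```python
def build_pan_command(ktr_path, level=None, params=None, executable="Pan.bat"):
    """Assemble a Pan command line as a list of arguments."""
    command = [executable, f"/file={ktr_path}"]
    if level:
        command.append(f"/level:{level}")
    for name, value in (params or {}).items():
        command.append(f"/param:{name}={value}")
    return command


command = build_pan_command(
    "c:/pdi_labs/my_first_transformation.ktr",
    level="Basic",
    params={"SAMPLES_DIR": "c:/samples"},
)
print(command)
```

Keeping the arguments in a list, rather than in a single string, means that a function such as subprocess.run can pass values containing spaces without manual quoting.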

Note

If you want to know all of the possible options for the Pan command, run Pan.bat or Pan.sh without parameters, and all of the options will be displayed.
