Kettle FAQ
17/04/07
Index
1. Preface
2. Beginning user questions
2.1. Problems starting spoon
2.2. What's the difference between transformations and jobs?
2.3. Rule on mixing row 'types' on a hop in a transformation
2.4. On duplicate fieldnames in a transformation
2.5. On empty strings and NULL
2.6. How to copy/duplicate a field in a transformation?
2.7. How to do a database join with PDI?
2.8. How to sequentialize transformations?
3. Bug reports and feature requests
3.1. Preface
3.2. The links to click
3.3. What to put in a bug report?
3.4. What to put in a change request?
3.5. A word of thanks
4. Source code access
4.1. Windows
4.2. Linux
4.3. Web SVN access
4.4. Eclipse
4.4.1. Subclipse
4.4.2. Creating patches
5. Further User questions
5.1. Strings bigger than defined String length
5.2. Decimal point doesn't show in .csv output
5.3. Function call returning boolean fails in Oracle
5.4. Difference between variables/arguments in launcher
5.5. How to use database connections from repository
5.6. On inserting booleans into a MySQL database
5.7. Calculator ignores result type on division
5.8. HTTP Client Step questions
5.8.1. The HTTP client step doesn't do anything
5.8.2. The HTTP client step and SOAP
5.9. Javascript questions
5.9.1. How to check for the existence of fields in rows
5.9.2. How to add a new field in a row
5.9.3. How to replace a field in a row
5.9.4. How to create a new row
5.9.5. How to use something as NVL in javascript?
5.9.6. Example of how to split fields
5.10. Shell job entry questions
5.10.1. How to check for the return code of a shell script/batch file
5.11. Call DB Procedure Step questions
5.11.1. The Call DB Procedure step doesn't do anything
6. Twilight user-development questions
6.1. Things that were once proposed but were rejected
6.1.1. Implement connection type as a variable parameter
6.1.2. Implement on the fly DDL creation of tables, ...
6.1.3. Implement a step that shows a dialog and asks parameters
6.1.4. Implement serialization of transformations using reflection
6.1.5. Implement retry on connection issues
6.1.6. Implement GUI components in a transformation or job
6.1.7. Hardcoding Locale
7. Development questions
7.1. Development guidelines
7.1.1. Priority on development
7.1.2. Division of functionality in steps and job entries
7.1.3. Rows on a single hop have to be of the same structure
7.1.4. Null and "" are the same in PDI
7.1.5. On converting data to fit the corresponding Metadata
7.1.6. On logging in steps in Pentaho Data Integration
7.1.7. On using XML in Pentaho Data Integration
7.1.8. On dropdown boxes and storing values
7.1.9. On using I18N in PDI
7.1.10. On using Locale's in PDI
7.1.11. On reformatting source code
7.1.12. On using non-temporary storage
7.2. On using Subversion
7.3. How do you start developing your own plug-in step
7.4. Can you change the XML input step to process my file?
7.5. On Serializable and Binary
7.6. Success factors of PDI
7.6.1. Modular design
Assuming you downloaded the binary version of Pentaho Data Integration: check whether you extracted the zip file while maintaining the directory structure. Under the main directory there should be a directory called lib that contains a file called kettle.jar. If this is not the case, re-extract the zip file in the proper way.
If you fetched the sources of Pentaho Data Integration and compiled them yourself, you are probably executing the spoon script from the wrong directory. The source distribution has a directory called bin that contains the scripts, but if you compile the proper way, the distribution-ready Pentaho Data Integration will be in a directory called distrib. You should start the spoon script from that directory.
This will duplicate fieldA to fieldB and fieldC.
2) Use a Calculator step and use e.g. the NVL(A,B) operation as follows:
This will have the same effect as the first solution: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.
This will have the same effect as the previous solutions: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.
To see a list of the open change requests in the latest development version, go here:
http://www.javaforge.com/proj/tracker/browseTracker.do?tracker_id=1274&reset=open&view_id=-1&pagesize=0
- a short but complete description of the problem;
- a stack trace, if one is available. From version 2.4.0 on, you need to click on the "Details" button in the error dialogs;
- the version number of Kettle; the build number and build date are also nice to have. Use pan -version to obtain this information;
- sometimes a small transformation that shows the problem can be really helpful; consider attaching it to the bug report;
- consider attaching a sample of the input file you used.
Create a directory, for example "Kettle", right-click on that directory and pick "SVN Checkout...". To update to the latest changes, pick "SVN Update...".
4.2. Linux
If you are running Linux, you need to install the "svn" Subversion client. The checkout command for the latest stuff is then:
svn checkout svn://source.pentaho.org/svnkettleroot/Kettle/trunk/
I output this table via a File output step (with extension .csv) and open it with Excel, I get the following:
How do I fix this and add the .00 to the last 2 rows?
A: First of all, in the File output step you have to use a format of "#.00" for the column. This gives all of the values in the output file a precision of 2 decimals after the decimal point, as you can see if you open the file with Notepad. If you open the .csv file with Excel, the ".00" will still not be there: Excel applies a default formatting to .csv files which hides the .00. As of PDI version 2.4.0 there's an Excel output step that can do some basic Excel formatting.
A: It's an Oracle thing. That sort of return value is only valid in a PL/SQL block, because an Oracle table can't contain a boolean data type (boolean is not part of the SQL standard). It's suggested that you return a varchar with 'true' / 'false' in it (or 'Y' / 'N'). If you then convert the data type to boolean, you may find that you get a boolean.
For a reference from the Oracle manuals on this behaviour:
It is not feasible for Oracle JDBC drivers to support calling arguments or return values of the PL/SQL RECORD, BOOLEAN, or table with non-scalar element types. However, Oracle JDBC drivers support PL/SQL index-by table of scalar element types. For a complete description of this, see Chapter 11, Accessing PL/SQL Index-by Tables.
As a workaround to PL/SQL RECORD, BOOLEAN, or non-scalar table types, create wrapper procedures that handle the data as types supported by JDBC. For example, to wrap a stored procedure that uses PL/SQL booleans, create a stored procedure that takes a character or number from JDBC and passes it to the original procedure as BOOLEAN or, for an output parameter, accepts a BOOLEAN argument from the original procedure and passes it as a CHAR or NUMBER to JDBC. Similarly, to wrap a stored procedure that uses PL/SQL records, create a stored procedure that handles a record in its individual components (such as CHAR and NUMBER) or in a structured object type. To wrap a stored procedure that uses PL/SQL tables, break the data into components or perhaps use Oracle collection types.
5.8. HTTP Client Step questions
5.8.1. The HTTP client step doesn't do anything
Q: The HTTP client step doesn't do anything, how do I make it work?
A: The HTTP client step needs to be triggered. Use a Row generator step generating e.g. 1 empty row and link that with a hop to the HTTP client step.
5.9. Javascript questions
5.9.1. How to check for the existence of fields in rows
Q: How do I check for the existence of a certain value in a row?
A: The following snippet will let you check this. But keep in mind that you cannot mix rows in PDI: all rows flowing over a single hop must have the same number of fields, and those fields must have the same names and types. The snippet:
var idx = row.searchValueIndex("lookup");
if ( idx < 0 ) {
    // doesn't exist
} else {
    var lookupValue = row.getValue(idx);
}

5.9.2. How to add a new field in a row
var value = Packages.be.ibridge.kettle.core.value.Value.getInstance();
value.setName("name_of_field");
value.setValue("value_of_field"); // possibly using types other than String
row.addValue(value);
field1.setValue(100);
setValue() takes all possible types that can be used in PDI (also String, Dates, ...).
var newRow = row.Clone(); // make a copy
// modify newRow
_step_.putRow(newRow); // sends an extra row on the output of the step

Note that you should make sure that the row you're putting out to the next steps has the same layout as the row that normally gets sent out. Also note that newRow is put out before the regular row.
fieldName.nvl('1');
which would replace the value of fieldName with the value of '1' if fieldName is null.
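The semantics described above can be sketched outside PDI in plain JavaScript. This is only an illustration: the nvl function below is a hypothetical stand-alone helper, not the nvl() method of a PDI field, and it assumes that "null" means a JavaScript null or undefined value.

```javascript
// Hypothetical helper, for illustration only (not the PDI Value API).
// Returns the replacement when the value is null or undefined,
// otherwise returns the value unchanged.
function nvl(value, replacement) {
  return value === null || value === undefined ? replacement : value;
}

console.log(nvl(null, "1")); // the replacement is used
console.log(nvl("42", "1")); // the original value is kept
```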
java;
var str = Merchant_Code.getString();
var code = "";
var name = "";
// Walk the string until the first non-digit character is found.
for (var i = 0; i < str.length(); i++) {
    var c = str.charAt(i);
    if ( ! java.lang.Character.isDigit(c) ) {
        code = str.substring(0, i); // leading digits
        name = str.substring(i);    // remainder of the string
        Alert("code=" + code + ", name=" + name);
        break;
    }
}
The Alert() is just to show the fields of course. After the outer for loop you could add code and name in new separate fields e.g.
5.10. Shell job entry questions
5.10.1. How to check for the return code of a shell script/batch file
The Shell job entry treats a return code of 0 as success and anything else as failure. You can use hops to control the resulting flow.
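The return code convention can be tried outside PDI with a small Node.js sketch. It is not the Shell job entry itself: runShell is a hypothetical helper, and the "exit 0" / "exit 3" command lines are only stand-ins for a real script or batch file.

```javascript
// Illustrative only: plain Node.js, not the PDI Shell job entry.
const { spawnSync } = require("child_process");

// Runs a command line through the system shell and classifies the result
// the way the text describes: exit code 0 is success, anything else is failure.
function runShell(commandLine) {
  const result = spawnSync(commandLine, { shell: true });
  return { exitCode: result.status, success: result.status === 0 };
}

const ok = runShell("exit 0");
const failed = runShell("exit 3");
console.log("first command succeeded:", ok.success);
console.log("second command exit code:", failed.exitCode);
```

In a job, the two outcomes correspond to the success and failure hops leaving the Shell job entry.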
5.11. Call DB Procedure Step questions
5.11.1. The Call DB Procedure step doesn't do anything
Q: The Call DB Procedure step doesn't do anything, my transformation finishes without doing anything and without issuing errors. How do I make it work?
A: The Call DB Procedure needs to be triggered. Use a Row generator step generating e.g. 1 empty row and link that with a hop to the Call DB Procedure step.
- For anything but simple SQL statements, the SQL you write will be database type dependent. E.g. if you use Oracle analytics, your SQL won't run anymore on DB2 or MySQL. Currently you know the type of the database and you can use the full functionality of the database;
- Starting with PDI version 2.3.1 and continuing in later versions, more database-specific settings were introduced. So just specifying that this is Oracle, MySQL, ... is not sufficient anymore, and a way to parameterize these specific options would need to be found as well (which would make it pretty complex);
- How many data warehouses run on multiple types of databases? Most data warehouses are created based on specific operational systems and target only specific database types, since the resulting reports and cubes would also need to run on those databases. So the use of connection type parameterization would probably not be that large.
Possible workaround: maintain duplicate jobs for multiple databases. Alternatively you can use the generic ODBC connection, which supports variable substitution for the driver as of PDI version 2.5.0GA. The disadvantage of the latter solution is that the special database processing for some types of database will not be done.
7. Development questions
7.1. Development guidelines
7.1.1. Priority on development
- Correctness/Consistency: If a tool is not correct, it's not going to be trusted, however fast it may be. It can't be that the same input produces output A in one case and output B in another;
- Backwards compatibility: Everyone likes upgrades to go smoothly. Installing the new binaries and being able to run without testing is the ideal. Of course, in some cases compatibility has to be broken for the greater good in the long term, but then it should be clearly documented (for upgrades);
- Speed: There is a need for speed. No-one wants to wait 30 minutes to insert 100.000 rows;
- User friendliness: It should not be a torment to use a tool. It should allow both novice and expert users to get their job done. As an example: any XML or configuration file should have a GUI element to manage it and should never have to be edited manually.
That is because otherwise the string you send to the log is always calculated, even when it is never written. For the Basic and Minimal logging levels this doesn't matter, as normally they would always be on, but it does for Debug and Rowlevel.
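A minimal sketch of such a guard, in plain JavaScript rather than PDI code. The logger, its isDebug() and logDebug() functions, the numeric level values and describeRow() are all hypothetical names made up for this illustration; only the level names come from the text above.

```javascript
// Hypothetical logger for illustration; these names are not PDI's logging API.
const LEVELS = { MINIMAL: 1, BASIC: 2, DEBUG: 3, ROWLEVEL: 4 };

function createLogger(level) {
  const lines = [];
  return {
    lines,
    isDebug: () => level >= LEVELS.DEBUG,
    logDebug: (message) => {
      if (level >= LEVELS.DEBUG) lines.push(message);
    },
  };
}

let formatCalls = 0;

// Stands in for an expensive string calculation, e.g. rendering a whole row.
function describeRow(row) {
  formatCalls++;
  return "row: " + JSON.stringify(row);
}

function processRow(log, row) {
  // Guard: the message is only built when debug logging is switched on.
  if (log.isDebug()) {
    log.logDebug(describeRow(row));
  }
}

const basicLog = createLogger(LEVELS.BASIC);
processRow(basicLog, { id: 1 });
const debugLog = createLogger(LEVELS.DEBUG);
processRow(debugLog, { id: 2 });
console.log("messages built:", formatCalls);
```

Without the guard, describeRow() would run for every row even at the Basic level, where the result is thrown away.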
- if someone wants to add extra values in the future, he must use the order you defined first;
- it makes the XML output much harder to read.
It's better to convert from a Locale string in the GUI to some English equivalent which is then stored. As an example:
- Suppose on the GUI you have a dropdown box with the values Date mask and Date time mask;
- Instead of using a 1 in the output for Date mask and a 2 for Date time mask, it would be better to put DATE_MASK in the output for Date mask and DATE_TIME_MASK for Date time mask;
- Also note that DATE_MASK/DATE_TIME_MASK would then not be allowed to be subject to I18N translation (which is ok for transformation/job files).
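The idea can be sketched in plain JavaScript as a two-way mapping between stable codes and translated labels. The LABELS table, the "nl" translations and the labelToCode/codeToLabel helpers are hypothetical and only serve this illustration; DATE_MASK and DATE_TIME_MASK are the codes from the example above.

```javascript
// Hypothetical labels and translations, for illustration only.
const LABELS = {
  en: { DATE_MASK: "Date mask", DATE_TIME_MASK: "Date time mask" },
  nl: { DATE_MASK: "Datummasker", DATE_TIME_MASK: "Datum-tijdmasker" },
};

// GUI -> file: turn the label the user picked into the stable, untranslated code.
function labelToCode(locale, label) {
  const match = Object.entries(LABELS[locale]).find(([, text]) => text === label);
  if (!match) throw new Error("Unknown label: " + label);
  return match[0];
}

// File -> GUI: turn the stored code back into a label for the current locale.
function codeToLabel(locale, code) {
  const label = LABELS[locale][code];
  if (label === undefined) throw new Error("Unknown code: " + code);
  return label;
}

// A value chosen in one locale can be shown in another, because only the
// code is stored.
const stored = labelToCode("nl", "Datummasker");
console.log(stored, "->", codeToLabel("en", stored));
```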
- Only translate what a normal user will see; it doesn't make sense to translate all debug messages in PDI. Some performance improvements were achieved in PDI just by removing some of the translations for debug messages;
- Make sure you don't translate strings used in the control logic of PDI:
  - If you would e.g. make the default name of a new step language dependent, jobs/transformations would still be usable across different locales;
  - If you would e.g. make the tags used in the XML generated for the step language dependent, there would be a problem when a user switches his locale;
  - If you would translate non-tag strings used in the control logic, you will also have a problem. E.g. in the repository manager, Administrator is used to indicate which user is the administrator (and this is used in the PDI control logic). So if you translated Administrator to a certain language, this would only work as long as you didn't switch locales.
- StepMetaInterface: contains the meta-data;
- StepInterface: performs the actual work, implements logic;
- StepDialogInterface: pops up a dialog in Spoon;
- StepData: contains temporary data like ResultSets, file handles, input streams, etc.
The easiest way to implement these is to look at the existing steps and to inherit from the base classes: BaseStepMeta, BaseStep. Package up these 4 classes and any others you might need in a jar file, for example foo.jar. After that, all you need to do is design an icon and save it in PNG format, for example foo.png. 32x32 is the default size, although other dimensions should work equally well. Then create a bootstrap in the form of plugin.xml. Look at the example plugin.xml to see all the options in action. Finally, put all 3 files (foo.jar, foo.png and plugin.xml) in a directory with a name of your choice (for example foo):
plugins/steps/foo/
or
$HOME/.kettle/plugins/steps/foo/
The PDI step loader will search for the file plugin.xml during startup and load the specified class as well as the extra jars/classes you might need.
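Before starting Spoon, the contents of the plug-in directory can be checked with a few lines of Node.js. This is only a convenience sketch: missingPluginFiles is a hypothetical helper that is not part of PDI, and it assumes the jar and the icon share the base name foo, as in the example above.

```javascript
// Illustrative helper, not part of PDI.
const fs = require("fs");
const path = require("path");

// Returns the names of the expected plug-in files that are missing from pluginDir.
function missingPluginFiles(pluginDir, baseName) {
  const expected = [baseName + ".jar", baseName + ".png", "plugin.xml"];
  return expected.filter((file) => !fs.existsSync(path.join(pluginDir, file)));
}

// Lists whatever is still missing from the example directory used above.
console.log(missingPluginFiles(path.join("plugins", "steps", "foo"), "foo"));
```

An empty list means all three files are in place; it says nothing about whether plugin.xml itself is valid.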
7.4. Can you change the XML input step to process my file?
Q: I have to process an XML file which currently can't be processed by Kettle, e.g. there's one optional field which depends on the value of an element and that should also be included as a field in a row, ... Can you build this functionality into the XML input step?
A: First of all, it depends on what functionality you need. If the functionality is generally useful, it can be built in. If it would only be useful for you, it wouldn't make sense to build it in. As alternative solutions: consider processing the XML file via a Javascript step or, if what is required is very complex, consider writing your own PDI step which you maintain yourself (outside of the PDI distribution).