Modern data processing demands workflows that are not just robust but also flexible and adaptive. Organizations today rely on tools that enable them to ingest, transform, and output data in dynamic ways, responding to varying inputs and operational scenarios. Pentaho Data Integration (PDI), also known as Kettle, is a platform that facilitates such operations through its visual and metadata-driven approach to creating transformations and jobs.
At its core, a transformation in PDI defines how data flows from source to output, undergoing various stages such as filtering, cleansing, formatting, or calculation. A job, on the other hand, orchestrates the execution of transformations, manages decision-making, and oversees file operations, conditional flows, and execution branches. Together, transformations and jobs help in creating reusable, modular, and intelligent data workflows.
This article explores how to create such workflows using advanced features within Pentaho, with a specific focus on capturing runtime inputs, validating parameters, controlling the execution flow, and building structured job sequences that react accordingly.
Capturing Dynamic Input Using Variables
The need to make workflows adaptable starts with their ability to receive and act on parameters. Imagine a scenario where the processing logic depends on which examination file is passed at runtime. Instead of hardcoding the filename within each transformation, a more scalable approach is to allow it to be passed dynamically as a parameter.
To capture this parameter in a transformation, start by creating a new transformation in the Spoon interface. Drag a get system info step onto the canvas. This step exposes various runtime properties, including command-line arguments. Configure it to capture the first command-line argument and store it in a field named filename.
This dynamic capture allows the workflow to be versatile. Whether you are processing exam1.txt or exam9.txt, the system adapts accordingly, making it suitable for automation where multiple files are processed sequentially or conditionally.
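For reference, a transformation like this is typically launched from the command line with Pan, PDI's transformation runner. The sketch below is illustrative only; the installation and file paths are assumptions, and the exam file is passed as the first positional argument so that the get system info step can pick it up.

```sh
# Illustrative Pan invocation (paths are examples only): the exam file is passed
# as the first command-line argument, which the get system info step captures
# into the "filename" field.
./pan.sh -file=/home/pdi/transformations/getting_filename.ktr -level=Basic \
    /home/pdi/input/exam1.txt
```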
Implementing Validation to Ensure Reliable Execution
Merely accepting a filename is not enough. It must be validated to ensure the transformation behaves reliably. This involves adding a filtering mechanism that inspects whether the captured filename is valid or null. Use a filter step to define the condition: if the filename field is not null, continue; if it is null, abort the transformation.
On the canvas, connect the get system info step to a filter rows step. Now, introduce two branches from this filter:
- One leading to an abort step, which will end the transformation with a user-friendly message indicating that a filename is required.
- Another leading to a set variables step, which stores the filename in a variable (for example, FILENAME) whose scope makes it available to subsequent transformations and jobs in the same run.
This flow ensures that any user or process calling the transformation without a filename will immediately be informed of the error, preventing unnecessary downstream failures. It is also a good practice in larger pipelines where invalid input at one point can ripple through multiple steps and create cascading issues.
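If you prefer scripting to the visual condition editor, the same check can be sketched in a JavaScript (Modified Java Script Value) step placed between the get system info and filter rows steps. This is an optional alternative, not a required step; the field name filename matches the one captured above, and the Boolean result is what the filter would test.

```javascript
// Sketch of the validation condition; "filename" is the field captured by the
// get system info step. Declare filename_ok as a Boolean output field in the
// step's Fields grid so the filter rows step can branch on it.
var filename_ok = (filename != null && ("" + filename).length > 0);
```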
Saving and Reusing the Transformation
Once the transformation is structured and validated, save it with a meaningful name such as getting_filename. This makes it easier to identify and reuse across different jobs or modules. Reusability is an essential aspect of efficient ETL design. By decoupling logic into small, specific units, you can reduce redundancy, increase maintainability, and enable modular execution.
The getting_filename transformation now acts as a dynamic input validator and variable setter. Any job that requires a filename as input can begin with this transformation to standardize the input and control the execution path.
Modifying an Existing Transformation to Use Runtime Input
The next step is to modify an existing transformation so that it can use the dynamically set variable. Suppose you have a transformation that reads examination results from a fixed file path. To make it adaptable, open this transformation and remove the step that manually sets the file path.
Instead, navigate to the file input step (such as a text file input step) and configure it to use a variable. In the file name field, enter ${FILENAME} in place of the hardcoded path. This tells PDI to resolve the variable that was set earlier in the execution.
Disable any options that suggest accepting filenames from previous steps unless explicitly needed. The transformation now becomes file-agnostic. Regardless of which file is passed in the command line, the transformation will dynamically read from it, provided the file exists.
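If you also need the resolved value inside a script step, for instance to log which file is being read, the JavaScript step can read the same variable. A minimal sketch, assuming the variable was named FILENAME by the getting_filename transformation:

```javascript
// Read the FILENAME variable set by getting_filename and note it in the log;
// the second argument to getVariable() is the default used when it is unset.
var resolvedPath = getVariable("FILENAME", "");
writeToLog("Reading examination data from: " + resolvedPath);
```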
Save this updated transformation under a new name like examinations_dynamic to distinguish it from its static counterpart.
Creating a Job to Control Transformation Execution
Now that you have a dynamic filename handler and a transformation that uses it, the next goal is to build a job that ties them together. Create a new job using Spoon. Drag a start step onto the canvas, followed by a transformation step. Link the start to the transformation step.
Configure the transformation step to point to getting_filename, the transformation that handles input. This step will be executed first in the job, capturing and validating the input filename.
Next, add a file existence checker. Drag a file exists step from the palette and link it to the output of the getting_filename transformation. Configure this step to look for the file specified in ${FILENAME}. This ensures that the job does not proceed if the file does not exist, thereby preventing downstream errors.
From the file exists step, create two branches:
- One leading to the main transformation that processes the examination file, such as examinations_dynamic.
- Another leading to a logging step or an abort step that triggers if the file is not found.
In the case where the file exists, the data flows seamlessly through to the transformation. Otherwise, the job logs an appropriate error or halts gracefully.
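For completeness, the same existence check can also be carried out inside a transformation with a JavaScript step and plain Java I/O, which is useful when you want to log the outcome or branch within the transformation rather than at the job level. A sketch, assuming the FILENAME variable from earlier:

```javascript
// Check whether the file named in ${FILENAME} exists; the Boolean field can
// feed a filter rows step, mirroring the job-level file exists entry.
var path = getVariable("FILENAME", "");
var exam_file_found = new java.io.File(path).exists();
writeToLog(exam_file_found ? "Found file: " + path : "Missing file: " + path);
```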
Enhancing Job Control with Logging and Abort Mechanisms
To further enhance the robustness of the job, consider incorporating a write-to-log step that captures the nature of any errors. This can be particularly useful in production environments where logs serve as the primary audit trail for job execution.
Configure the logging step to report missing files, invalid formats, or missing parameters. Spoon also color-codes job hops (green for success paths, red for failure paths, black for unconditional hops), which makes it easy to distinguish successful and failing outcomes at a glance.
You can add final success and abort steps at the end of the flow to formalize the job’s exit points. Success steps can be connected to the normal execution path, while abort steps can conclude branches that failed validation or file checks.
Saving this job with a name like examination_processor provides a reusable component that can be triggered from command-line utilities, schedulers, or parent jobs.
Executing and Testing the Job with Different Scenarios
Testing the job thoroughly is crucial before deploying it to production. Run the job several times with varying inputs, as illustrated by the sample Kitchen commands after this list:
- Without passing a filename, to confirm that the job aborts and logs an appropriate error.
- With a non-existent filename, to verify that the file existence check fails.
- With a valid file, to ensure that the entire job executes successfully.
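The Kitchen commands below sketch those three scenarios. Every path and file name is illustrative, and they assume the job passes its command-line arguments through to the getting_filename transformation (the usual behavior when the transformation entry defines no arguments of its own).

```sh
# 1. No filename argument: getting_filename should abort with the "filename required" message.
./kitchen.sh -file=/home/pdi/jobs/examination_processor.kjb -level=Basic

# 2. Non-existent file: the file exists check should route the job down its failure branch.
./kitchen.sh -file=/home/pdi/jobs/examination_processor.kjb -level=Basic /home/pdi/input/no_such_exam.txt

# 3. Valid file: the whole job should run to completion and process the data.
./kitchen.sh -file=/home/pdi/jobs/examination_processor.kjb -level=Basic /home/pdi/input/exam5.txt
```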
Each test run provides insights into how well the job handles unexpected input, how user-friendly the error messages are, and whether the data gets processed accurately.
For added clarity, switch the logging level between Minimal and Basic to capture different amounts of detail. This allows you to balance verbosity with relevance depending on the audience or stage of execution.
Benefits of Modular and Dynamic Workflow Design
The approach detailed above emphasizes modularity, flexibility, and error control. Instead of embedding hardcoded values into every transformation, variables are used. Instead of blindly executing steps, validation gates and conditional flows are introduced. This design:
- Reduces the risk of runtime errors
- Makes it easier to update and maintain logic
- Supports dynamic file processing
- Provides clear error messages and logging
- Enables better integration with automated tools
Workflows constructed this way can be scheduled to run in batch jobs, triggered from external systems, or reused in larger process flows. The use of jobs to orchestrate transformations ensures that tasks are executed in the right order, with fail-safes in place.
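As one example of such scheduling, a cron entry can run the job nightly through Kitchen; the schedule, paths, and log file below are purely illustrative.

```sh
# Illustrative crontab entry: refresh the examination processing every night at 2 a.m.
0 2 * * * /opt/pdi/kitchen.sh -file=/home/pdi/jobs/examination_processor.kjb /home/pdi/input/exam_daily.txt >> /var/log/pdi/examinations.log 2>&1
```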
Preparing for the Next Phase of Workflow Enhancement
With dynamic input handling and foundational control mechanisms in place, the next step is to enhance the logic to analyze and generate outputs. For example, you may want to read examination data and extract top performers, identify students needing improvement, or split data into separate categories.
This requires more than just reading and writing files—it involves calculations, sorting, grouping, and filtering, often across multiple transformations chained within jobs.
By continuing to build on this foundation, you can evolve from basic file processing to complete data analytics workflows that automate insight generation and reporting.
Introduction to Output Generation Workflows
Once data ingestion and validation are successfully implemented, the next logical step in a data integration workflow is generating useful output. In many real-world cases, this involves transforming raw input into summarized or filtered data that offers insights—such as identifying top performers from examination scores.
Pentaho Data Integration (PDI) offers a robust set of transformation tools to process and structure data efficiently. In this segment, you will explore how to design transformations that:
- Read and clean data
- Manipulate fields
- Calculate performance metrics
- Output refined datasets such as top 10 performers for each subject
This article focuses on breaking down complex transformation logic into understandable components, each contributing toward a larger goal of meaningful data output.
Designing the Transformation for Top Scores
Begin by creating a new transformation and saving it with a descriptive name like top_scores. This transformation will process the cleaned global examination file and extract students with the highest marks in each subject area.
The input file should already contain consolidated student data from previous steps, such as names, scores, and other metadata. Use a text file input step to load this data into the transformation canvas.
This step should read the global examination file, which aggregates examination entries. The structure is expected to contain student identifiers, names, and their scores in multiple subjects such as writing, reading, speaking, and listening.
Refining and Restructuring the Input Data
Once the input step is in place, it’s important to clean and structure the data for accurate analysis. Add a select values step next. This allows you to remove any unnecessary columns that do not contribute to the performance analysis, such as processing timestamps or source file names.
Then, add a split fields step. This step helps in separating combined values. For instance, if the full name of the student is in one field, you can split it into first and last name. This makes it easier to sort, filter, and personalize output later.
Follow this with a formula step. In this step, apply logic to:
- Convert all name fields to uppercase for consistency
- Normalize scores, if required (e.g., convert scores from a 100-point scale to a 5-point scale by dividing each by 20)
This cleaned and standardized dataset is now ready for sorting and filtering.
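As a point of reference, the same clean-up can be sketched in a JavaScript step instead of the formula step. The field names below (student_name, student_lastname, writing) are assumptions based on the fields used later in this article.

```javascript
// Uppercase the name fields and rescale a 100-point writing score to a
// 5-point scale; declare the new fields in the step's Fields grid.
var student_name_up     = student_name.toUpperCase();
var student_lastname_up = student_lastname.toUpperCase();
var writing_5 = writing / 20;
```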
Filtering and Ranking Students
The goal is to identify the top 10 students in each subject. Start by focusing on one subject, such as writing. Add a sort rows step to arrange the students in descending order of their writing scores.
Once sorted, add a JavaScript step to limit the output to just the top 10 rows. The script should monitor the row count and allow only the first 10 to proceed. This technique filters out everything beyond the 10 highest scores.
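One common way to write that script in the Modified Java Script Value step is to keep a row counter and skip everything after the tenth row. The sketch below assumes a separate script tab marked as the start script initializes the counter once.

```javascript
// Start script (separate tab, runs once before any rows):
//   var rowCounter = 0;
//
// Main script, executed once for each incoming (already sorted) row:
rowCounter++;
if (rowCounter > 10) {
    trans_Status = SKIP_TRANSFORMATION;      // drop every row after the first 10
} else {
    trans_Status = CONTINUE_TRANSFORMATION;  // let the top 10 rows pass through
}
```

Because the rows arrive already sorted in descending order, the ten rows that pass through are exactly the ten highest writing scores.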
After this filtering step, add an add sequence step to assign ranks to each student. This step creates a new field that increments with each row, starting from 1 for the highest score down to 10.
Use another select values step to rename and rearrange fields. Change the sequence field to position, and rename the score field from writing to score. This improves the clarity of the final report.
Writing Top Scores to Output Files
To generate a report of top performers, add a text file output step. Configure it to write the top 10 writing scores to a file named writing_top10.txt. Use a variable for the output directory to ensure flexibility, for example ${LABSOUTPUT}.
Specify which fields to include in the output file, such as:
- position
- student_code
- student_name
- student_lastname
- score
Preview the output and validate that all fields are correctly populated and formatted. Save the transformation at this stage to avoid losing progress.
Repeating the Process for Other Subjects
The same pattern can now be applied to other subjects. Duplicate the block of steps starting from sort rows through to the output step. Update the sorting field to reading, speaking, or listening as needed.
Ensure that the sequence field is uniquely named for each subject, such as seq_r, seq_s, and seq_l, before being renamed to position.
Likewise, update the output file names to reading_top10.txt, speaking_top10.txt, and listening_top10.txt.
By the end of this process, the transformation will consist of four similar branches—each dedicated to one subject. The structure remains modular, and each part can be tested and adjusted independently.
Saving and Testing the Transformation
Once all branches are created, validate each by previewing the final output steps. Save the transformation with an appropriate name, and run it.
On successful execution, four output files should be created, each containing the top 10 students for a specific subject. Review these files to ensure data accuracy, correct rankings, and proper formatting.
Having this transformation in place allows you to update performance leaderboards regularly by simply running the transformation after new data is appended.
Optimizing with Subtransformations
As transformations grow larger, managing them can become cumbersome. Pentaho allows for modularity through subtransformations. These allow you to break down a large transformation into smaller, manageable components.
Begin by identifying logical sections of your current transformation. The data preparation steps (input, cleansing, normalization) can be moved to a subtransformation. Similarly, the filtering, ranking, and output steps can be bundled together.
To implement this, select and copy the data preparation steps. Paste them into a new transformation and end it with a copy rows to result step. Save it with a name like top_scores_preparing.
Then, open your original transformation and delete the preparation steps, leaving only the processing and output steps. Add a get rows from result step at the beginning of the canvas. This step receives the output from the preparation subtransformation. Save this as top_scores_processing.
Now, these two transformations can work together in a modular, sequential way. This also improves testing and debugging since you can isolate issues to either preparation or processing.
Executing the Modular Workflow with a Job
To execute both subtransformations in sequence, create a new job. Start with a start step, followed by two transformation steps.
The first transformation entry should point to top_scores_preparing. The second should point to top_scores_processing.
Link them in order and save the job with a name like top_scores_flow. This job now orchestrates the complete logic of generating top performers from cleaned examination data.
Test the job by running it with a valid input file already present in the global examination file. All four output files should be created or updated as expected.
Integrating Output Generation into a Larger Workflow
In practical use, the goal is to automatically generate top score files whenever new data is appended. Instead of triggering this process manually, you can integrate it into your main examination job.
Open the examination job that appends new data to the global file. At the end of this job, add a new job entry. Configure it to execute top_scores_flow.
Link it to follow the last transformation that appends the data. This ensures that every time a new file is processed, the top score analysis is also refreshed.
This approach maintains consistency, removes manual dependencies, and provides immediate insights from the most recent data.
Testing with New Data Inputs
Now that everything is integrated, pick a new examination file that hasn’t been appended yet. Run the main job with this file as the argument.
Upon completion, verify that the new file has been appended correctly and that all four top performer reports have been updated.
Inspect the job metrics and execution logs to confirm that both the main transformation and the output generation job executed without errors.
Summary of Workflow Achievements
At this stage, your workflow accomplishes several advanced tasks:
- Dynamically reads examination input files
- Validates and sets filenames as variables
- Processes and cleans examination data
- Sorts and filters for top performance
- Outputs subject-wise ranked reports
- Uses subtransformations for modular design
- Chains processes using jobs for automation
This approach not only improves efficiency but also ensures that every time new data is introduced, the reporting pipeline automatically updates the outputs—making your workflow dynamic and self-sustaining.
Preparing for Row-Based Iterations
While current workflows are effective for batch-level processing, some scenarios may require row-level iterations. For example, you may want to generate personalized reports for each student with a writing score below a threshold. This introduces another advanced concept: iterating transformations for each row of input.
The next phase will explore how to create such iteration workflows, combining selection, filtering, and file generation logic into a row-by-row execution model using Pentaho jobs and transformations.
Introduction to Iterative Processing in ETL Workflows
In data integration projects, batch processing is common. However, there are scenarios where executing a transformation or job multiple times—once for each row of input—is necessary. Such row-level iteration is often needed when generating personalized reports, performing custom logic per user, or creating separate files based on individual records.
Pentaho Data Integration offers a streamlined way to perform iterative processing using its job architecture. This article explores how to build a solution that:
- Filters specific student records
- Iterates through each matching row
- Executes a transformation individually for every row
- Generates separate files dynamically
This enables tailored outputs and automated reporting in large-scale data environments.
Filtering Specific Rows for Iteration
The first step is to identify the rows that need individual processing. Suppose the goal is to find all students with a writing score below 60 and generate separate files for each one.
Start by creating a new transformation and save it with a name like students_list. This transformation will:
- Read the global examination file
- Filter for students with low writing scores
- Select key fields for use in downstream steps
Add a text file input step and point it to the global data file where all examination results are stored. After the input step, insert a filter rows step to define the condition: writing score less than 60.
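For readers who prefer scripting, the same condition can be expressed in a JavaScript step by skipping rows that do not match; the field name writing is an assumption about the global file's layout.

```javascript
// Pass only students whose writing score is below 60; all other rows are skipped.
trans_Status = (writing < 60) ? CONTINUE_TRANSFORMATION : SKIP_TRANSFORMATION;
```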
Connect the filter to a select values step. Retain only the necessary fields such as student code and name. These fields will later be used as inputs to generate personalized outputs.
To prepare this dataset for iteration in a job, add a copy rows to result step at the end of the transformation. This sends the matching rows to a job for further processing, one by one.
Save the transformation and preview it to ensure that it’s returning the correct student records.
Designing the Job for Iterative Execution
Now, design a job to use the filtered data from the students_list transformation and trigger another transformation for each student record.
Create a new job and save it with a descriptive name. Begin by adding a start step. Then drag a delete files step, followed by two transformation entries.
The delete files step is useful for clearing old outputs before generating new ones. Configure this step to target files in the output folder matching a pattern, such as all files that start with hello and end with .txt.
Next, configure the first transformation entry to run the students_list transformation. This supplies the rows for iteration.
The second transformation entry is where row-based iteration occurs. Point it to another transformation, for example hello_each, which will use one row at a time to generate output.
Most importantly, enable the option to “Execute for every input row?” on the second transformation entry. This instructs Pentaho to pass one row at a time from the result set into the second transformation.
Link all job entries in sequence and save the job.
Creating the Output Transformation for Each Row
Create the transformation hello_each that takes each student’s data and creates a personalized file. This transformation will be executed once per student with a writing score below 60.
Within this transformation, add a get rows from result step at the beginning to receive the fields the job supplies for the current row. Alternatively, use a get variables step if the values are passed to the transformation as variables rather than as result rows.
Add a formula or set field value step to prepare content for the output file. For example, you might write a message like:
“Hello [Student Name], your writing score is below 60. Please consult your teacher.”
After constructing the content, use a text file output step to write the message to a file. Name the file dynamically using a pattern that includes the current timestamp or student code to ensure uniqueness. Use a variable such as ${LABSOUTPUT} in the file path to make the destination configurable.
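A sketch of that per-row logic in a JavaScript step, assuming the incoming row carries the student_name and student_code fields kept by students_list and that ${LABSOUTPUT} points to the output directory. Both new fields would be declared in the step's Fields grid, and the computed file name can drive the output step if you configure it to take the file name from a field.

```javascript
// Build the personalized message and a unique, per-student output file name.
var message  = "Hello " + student_name
             + ", your writing score is below 60. Please consult your teacher.";
var out_file = getVariable("LABSOUTPUT", "") + "/hello_" + student_code + ".txt";
```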
When configured properly, each execution of the transformation will produce a separate file for one student. Save and test the transformation by passing in a sample row.
Running the Job and Validating Output
Return to the main job that links the students_list transformation and hello_each transformation. Press F9 to open the job execution window.
Run the job and observe the job metrics and execution logs. If everything is configured correctly:
- The job will read the global data
- Filter for students with low scores
- Iterate through each one
- Generate a unique file for each
Once completed, navigate to the output directory. You should see files with names such as hello_134522.txt, hello_134523.txt, and so on. Each file will contain a personalized message.
This setup demonstrates the power of Pentaho’s job architecture to process and output data on a per-row basis, an essential feature for user-specific reports or alerts.
Combining Iterative Jobs with Existing Workflows
The row-based iteration job is highly useful but works best when integrated into a larger workflow. Suppose you already have a job that processes and appends examination data to a global file. After appending the data and generating top score reports, you might want to automatically trigger the iterative job to notify students with low performance.
To accomplish this, open your main job and add a new job entry after the final transformation or report generation step. Configure this new entry to execute the row iteration job.
Link the new job entry to the previous step. Now, every time new data is appended, the top performers are reported, and individual alerts are also generated for underperforming students.
This not only reduces manual intervention but ensures that feedback is delivered in real-time as part of a complete data processing pipeline.
Parameterizing the Workflow for Flexibility
Hardcoding values can reduce portability and reusability. To avoid this, make use of parameters and variables throughout the workflow.
For example:
- Use ${FILENAME} for input file names
- Use ${LABSOUTPUT} for output directories
- Use ${Internal.Job.Filename.Directory} to locate transformation files dynamically
By setting these variables at the job level or passing them through command-line arguments, your workflow becomes more dynamic and easier to deploy across different environments or datasets.
This parameterization allows you to:
- Easily run the same workflow on different datasets
- Move the job between environments without changing internal paths
- Adjust configurations centrally through a properties file (see the sketch below)
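One common place for such central configuration is the kettle.properties file in the PDI home directory (typically the .kettle folder under your user home); the path below is illustrative.

```properties
# Illustrative kettle.properties entry: every job and transformation started on
# this machine can then resolve ${LABSOUTPUT} without any per-job setup.
LABSOUTPUT=/home/pdi/pdi_labs/output
```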
Validating and Debugging Iterative Executions
Iterative workflows can be harder to debug since transformations are run repeatedly. Here are some tips to validate them:
- Set logging level to basic or detailed for more insight
- Add write-to-log steps inside transformations to capture key values
- Preview transformations individually before integrating them into jobs
- Use small datasets during testing to keep log output manageable
Ensure that your output filenames are unique for each row to prevent overwriting. Including timestamps or row-specific values (like student codes) helps avoid this problem.
After validating the job, reset the logging level to minimal for production environments to improve performance.
Summary of the Complete Workflow Chain
At this stage, your Pentaho workflow is capable of:
- Accepting dynamic input
- Validating and storing input parameters
- Reading and preparing data from examination files
- Generating top 10 performer lists by subject
- Writing subject-specific summary reports
- Identifying underperforming students
- Iterating through filtered rows to generate personalized alerts
All these operations are encapsulated within modular transformations and jobs, using variables, conditional logic, and automated chaining. The result is a scalable, repeatable, and efficient system for data integration and reporting.
This approach transforms a manual, error-prone reporting process into a real-time, automated intelligence system suitable for any data-driven institution or organization.
Final Thoughts
With the combination of transformation logic, job orchestration, variable handling, and iterative processing, Pentaho enables the construction of complex data workflows that are intuitive yet powerful.
Whether you are building ETL pipelines for business intelligence, automating reports for education, or integrating input/output operations across various file types and formats, the techniques covered in this series form a solid foundation.
Start small, modularize your logic, and iterate your process design. Over time, your workflows will become easier to manage, more reliable, and highly adaptable to changing business requirements.