
CloverDX Jobflows – Build, Monitor, and Manage Complex Workflows

Written by CloverDX | October 16, 2012

Lately, I was confronted with a data integration task: extract customer data from an Oracle database to an XML file, upload it to an FTP server, and finally encrypt that file. If you use a single CloverDX graph to do this, the overall logic can easily get lost in the complexity of the graph. You will need many components, execute them in different phases and, most importantly, your solution will be vulnerable to all kinds of errors – causing the graph to fail for not-so-obvious reasons.

So what would be a mighty tool for such complex tasks? The answer is the new jobflow functionality in CloverDX 3.3.

Jobflows are a high-level approach to managing ETL processes. They allow you to combine ETL with other activities to form complex business processes – providing orchestration, conditional job execution, and error handling. Actions participating in a jobflow include scripts, executables, web services (REST/SOAP) and common file operations (both local and remote).

Jobflows example

First of all, let's take a look at the picture below. It shows the principal logic of the example I will be discussing in this article.

(yellow blocks – jobflow *.jbf, green blocks – ETL graphs *.grf)

Step One – Extract from DB

Let us say that the first step in our complex set of tasks will be responsible for extracting rows from the Oracle DB and converting the data to XML. It will do nothing more and nothing less, and after doing so, it will report its results to the next step in the jobflow pipeline (Did the job fail? How long did it run? When was it started? What was the error message?). This step will be performed by an underlying ETL graph, whose execution is managed by the ExecuteGraph component.

If you are scratching your head trying to figure out how jobs communicate with each other, keep in mind that ExecuteGraph is a component like any other, i.e. it is connected to other components by edges. Jobflow components can use pre-defined (template) metadata containing fields that let you send job results down the chain to the following components (try right-click on an edge → New metadata from template).
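
To illustrate, this is roughly how a downstream component could read such a job-result token – a minimal CTL2 sketch, assuming the template metadata carries fields named status, duration and errorMessage (verify the exact names after creating the metadata from the template); the note field and the FINISHED_OK value are my own assumptions:

    // Reading the job results straight from the incoming token (CTL2).
    // status, duration and errorMessage are assumed template fields; duration is
    // assumed to be in milliseconds and "note" belongs to my own output metadata.
    function integer transform() {
        if ($in.0.status != "FINISHED_OK") {
            $out.0.note = "Extraction failed: " + $in.0.errorMessage;
        } else {
            $out.0.note = "Extraction finished in " + num2str($in.0.duration) + " ms";
        }
        return ALL;
    }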

Step Two – Send XML to FTP

In this example, the DB extraction step is connected to a File Operation component called CopyFiles that uploads the result to an FTP server:

In fact, the metadata describes a token – a piece of information that flows along jobflow edges – similar to a record in an ETL graph.
Tokens are responsible for triggering events in a jobflow; you can make components react to incoming tokens in various ways.


Input mapping

Back to the example: how do I let the CopyFiles component know the name and path of the XML file to be uploaded to the FTP server? I added an extra metadata field to the output edge of the ExecuteGraph component. As soon as ExecuteGraph finishes, it populates the field with the name of the file the executed graph has created (see the Dictionary and Output Mapping section). I used the Output mapping of ExecuteGraph to map the file name to the output edge. Correspondingly, the Input mapping of the CopyFiles component lets you read the source/target file paths from the input edge. The affected attribute in this case is sourceURL, so you would write the Input mapping like this:

$out.0.sourceURL = "${DATAOUT_DIR}" + "/" + $in.0.xmlName;

Notice the $out.0 – in the Input mapping it represents the internal structure of the component's attributes. It acts as a record, so you can work with it in CTL2. Its metadata is fixed by the component definition – check the transformation dialog to see all the attributes you can map.
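
A slightly fuller Input mapping for CopyFiles might look like the sketch below – the sourceURL and targetURL attributes belong to the component, while the FTP address, credentials and target folder are made-up placeholders:

    // Input mapping of CopyFiles (CTL2): take the file name from the incoming token
    // and build both the source and target URLs. The FTP host, credentials and
    // upload folder are illustrative placeholders only.
    function integer transform() {
        $out.0.sourceURL = "${DATAOUT_DIR}" + "/" + $in.0.xmlName;
        $out.0.targetURL = "ftp://user:password@ftp.example.com/upload/" + $in.0.xmlName;
        return ALL;
    }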

Dictionary and Output Mapping

I have not yet mentioned how the DB extraction job communicates with its dependent jobs – in particular, where and how the next job gets the XML file name. Obviously, the extraction graph needs to 'send out' its internal parameter – the XML file name. This is carried out via one of CloverDX's earlier features – the dictionary.

Dictionary is an interface between the graph and its surroundings, e.g. other jobflow components, externally called procedures, etc. Using a dictionary to pass parameters between two jobs works like this:

  • First, you create a dictionary entry in the ETL graph (via Outline → Dictionary) and configure its data type. This determines what kind of data will be exchanged – string in this case.
  • Second, you populate the dictionary entry with a value within your ETL graph – either using the SetJobOutput component or in CTL (e.g. in a Reformat; see the sketch below the note).
  • Now, in the parent jobflow, you can map the dictionary entry content in ExecuteGraph's Output mapping. The dictionary entry will appear in the left panel under Dictionary folder:

Note – Dictionary can also be manipulated with the GetJobInput and SetJobOutput components. These allow you to read and write (respectively) from/to a dictionary inside an ETL graph. GetJobInput is usually a starting component that reads the dictionary value and maps it to its output edge, passing it to the rest of the graph. SetJobOutput, on the other hand, will typically be placed at the end of your graph where you want to make the results accessible to the parent.
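
For completeness, this is roughly how the extraction graph could fill the dictionary entry in CTL2 – a sketch assuming a string entry named xmlName created in Outline → Dictionary; the file-name logic itself is purely illustrative:

    // Somewhere in the extraction graph (e.g. in a Reformat): pass records through
    // unchanged and store the generated file name into the dictionary entry
    // "xmlName" so the parent jobflow can pick it up in ExecuteGraph's Output mapping.
    function integer transform() {
        $out.0.* = $in.0.*;
        dictionary.xmlName = "customers_" + date2str(today(), "yyyy-MM-dd") + ".xml";
        return ALL;
    }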

Error Handling

So far, I have not taken into account any errors either on the graph or on the jobflow level. Let's look at the middle step in my jobflow hierarchy – what happens if, for example, the file server goes down during the FTP transfer?

  • Obviously, the XML file will not be uploaded completely and the ETL graph will fail.
  • Next, the error will be signalled to the ExecuteGraph component which called the graph.
  • The error info is sent to the second output port of the component, where you can react to it – stop the jobflow by placing the Fail component there, send an email to the support team (EmailSender), or handle the error with another graph or a nested jobflow.

In the picture below, I react to an FTP transfer error by writing an error log (notice this is handled by a separate graph; see the sketch at the end of this section) and then stopping the jobflow using the Fail component:

There are many more possible reactions to failures. I could have configured the executing component not to Stop processing on fail – if an error occurred in the executed graph, the jobflow would not break and incoming tokens would still trigger new graph executions (i.e. errors would be ignored).
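
To make the error-logging branch concrete: the failure details can be handed over to the logging graph through the Input mapping of the ExecuteGraph that runs it. A minimal sketch, assuming the logging graph exposes an input dictionary entry named errorText and that the error token keeps the template field errorMessage:

    // Input mapping of the ExecuteGraph that runs the error-logging graph (CTL2).
    // "errorText" is an assumed input dictionary entry of the logging graph;
    // "errorMessage" is read from the error token produced by the failed step.
    function integer transform() {
        $out.0.errorText = $in.0.errorMessage;
        return ALL;
    }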

Step Three – Analyse Jobflow Stats

The penultimate step in my jobflow analyses the previous job runs from a statistical point of view. First, it combines (Combine) tokens from the preceding jobs and then computes the average duration of each one. If that goes okay, the jobflow ends with Success.
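
If your tokens carry start/end timestamps rather than a ready-made duration, a Reformat along these lines can derive the value before the averaging step – a sketch whose field names (startTime, endTime, durationMillis) are assumptions rather than guaranteed template names:

    // Derive each run's duration in milliseconds from the token's timestamps (CTL2).
    // startTime and endTime are assumed template fields; durationMillis is a field of
    // my own metadata that a subsequent Aggregate can average per job.
    function integer transform() {
        $out.0.* = $in.0.*;
        $out.0.durationMillis = dateDiff($in.0.endTime, $in.0.startTime, millisec);
        return ALL;
    }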

Step Four – Encrypt Output

The overall result is then signalled to the logic one level higher (the top jobflow in the main diagram), which executes all the underlying tasks through a single ExecuteJobflow component. Using ExecuteScript, it calls a binary utility to produce a hash of the XML file created earlier. The ListFiles component in front of it helps assemble the complete file URL.
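
As a rough sketch, the ExecuteScript step could assemble its command line in its Input mapping from the URL delivered by ListFiles. I am assuming here that the script text is exposed as a field named command in the mapping dialog, that the ListFiles token carries the path in a URL field, and that sha256sum is just one possible hashing utility – check the dialogs for the exact names:

    // Input mapping of ExecuteScript (CTL2): build the shell command from the file
    // URL listed by ListFiles. The "command" field, the "URL" field and the
    // sha256sum utility are all assumptions made for illustration.
    function integer transform() {
        $out.0.command = "sha256sum " + $in.0.URL + " > " + $in.0.URL + ".sha256";
        return ALL;
    }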