Using Python in CloverDX

Python has long been one of the premier choices for data engineers looking to organize, clean, and analyze data from various sources. It is relatively easy to learn with powerful libraries that enable anyone from data scientists to business analysts to work with datasets to derive value.

However, utilizing python for data parsing, organizing, and cleansing at an enterprise scale often leads us into the trap of re-inventing the wheel when we try to script handling the intricacies of real-life data processes. Parsing files, handling exceptions, or scheduling scripts in cron are just a few tedious jobs you shouldn't be doing yourself in 21st century. By combing your Python skills with a complete data integration platform like CloverDX, you can focus on deriving value from your data rather than wrangling with mundane tasks that surround your business logic.

CloverDX gives you the flexibility to connect to any data source, on-prem or in the cloud, in any format. We can use CloverDX to push our data through data pipelines that will parse, clean, and transform your data into something we can use in analytic processes. CloverDX is designed to be the robust data backbone of an organization. Supporting Python-based analysis and data manipulation is just one of the many ways we can transform organizations’ data infrastructure.

CloverDX process for Python integration
Figure 1: CloverDX process for Python integration

In this example, we’re going to grab sales data from a CSV file, clean/validate our data, push it through a python script utilizing NumPy libraries for analysis, and finally write out to a target database for our Marketing team to use.

Setup

I've already written a python script to take incoming sales data, analyze it, and turn it into a readable Excel report for our Marketing and Sales teams. I can leverage CloverDX to act as my ingestion engine to easily grab data from any source on the fly without having to manually re-write how I pull and parse data in my Python script.

First, I'll use a JSON Extract component to pick apart a piece of JSON and restructure it so that I can effectively use it as an input into the rest of our data process. In this case, I’ll be restructuring into a flat CSV format.

JSON file reader
Figure 2: JSON file reader.

My JSON Extract Mapping looks like this:

JSONReader configuration dialog
Figure 3: JSONReader configuration dialog
Next, I'll perform a data quality check to make sure all of my records on my incoming data stream are useable. For example, I'm only interested in records with valid dates and over a certain unit-price threshold. I'm going to use a Data Validator component to establish my business rules for incoming data.
Data quality / data sanity check
Figure 4: Data quality / data sanity check

My Data Validation component will allow me to create the following rules:

Validator component configuration dialog
Figure 5: Validator component configuration dialog

Our next step is restructuring our data into standard-in (stdin) format so we can run it through our python script.

Data serialization for Python script
Figure 6: Data serialization for Python script
We're going to use a Flat File Writer component to transform our data from a columnar format to a single discrete record for processing in Python.
Serialization to component's output port
Figure 7: Serialization to component's output port
From there, we're going to utilize my Python Script using a Subgraph that I've built.
Python interpretter as Subgraph
Figure 8: Python interpreter as Subgraph
Subgraphs in CloverDX allow you to wrap business logic (like a piece of python data analysis in our case) in a reusable format that you can share within your team or organization.
Implementation of Python interpretter
Figure 9: Implementation of Python interpreter

In this example, we're using an Execute Script component to call out to my Python script.

Within this component, we can run any script (including Python). We can either paste our script directly into the “Script” input box or we can provide the URL of our .py file in the “ScriptURL” box. We can also parameterize all of these inputs to dynamically call on different data sources and analysis processes from a single graph.

Python interpreter configuration
Figure 10: Python interpreter configuration

Our script is going to be called on the incoming stdin input. The Python script looks like this:

import sys
import pandas as pd
import numpy as np

sales = pd.read_csv(sys.stdin)
report = sales.pivot_table(
	values = ['ext price', 'quantity'],
    index = ['Account Name'], 
    columns = ['category'], 
    fill_value = 0, 
    aggfunc = np.sum
)
report.to_csv(sys.stdout, index = True, header = False, line_terminator = '\n')

The resulting analysis from our script is going to go the the output port of our subgraph and out back into our original graph.

Now that we've set up our Subgraph, let's go back to the graph level to add some finishing touches.

Our subgraph output is going to be a single continuous string in Python StdIn format. We’re going to run it through a Flat File Reader to normalize it into a columnar form that’s easier to digest in the form of a business report.

Deserialization of Python script output
Figure 11: Deserialization of Python script output
My Flat File Reader configuration looks like this:
Reading from component's input port
Figure 12: Reading from component's input port

Finally, we can convert this output into a nice Excel report using our Excel Spreadsheet Writer Component.

Outputting process data
Figure 13: Outputting process data
We're going to set the "FileURL" property to our first output port (0) in a discrete format in order to incorporate it into our next component.
Binary serialization in CloverDX
Figure 14: Binary transport serialization in CloverDX

Our spreadsheet report output is going to look like this:

Excel report as a result of Python script
Figure 15: Excel report as a result of Python script

Now we can focus on the final part of our process: distributing our report via email:E-Mailing of a report

Figure 16: E-Mailing of a report
It's being sent directly into our Email Sender Component as a field called "report" in a byte format.
Configuration of EmailSender to send email with binary attachment
Figure 17: Configuration of EmailSender to send email with binary attachment
As you can see, CloverDX makes data cleaning, transformation, and ingestion incredibly easy. It allows you to leverage your existing Python investments on a platform that allows you to automate, orchestrate, and monitor your data pipelines.

Drag&Drop URL to Project Navigator or import project into CloverDX Designer

More from Tech Blog

  • Understanding the CloverDX Project Lifecycle

    Data Integration pipelines, like all software systems, have a lifecycle that includes transitions between development, test and production environments.... CloverDX How-To
  • New release: bugfixes for Eclipse Temurin JDK issues

    Today, we released three bugfix releases of CloverDX – CloverDX 5.8.3, CloverDX 5.9.2 and CloverDX 5.10.2. All three releases fix an issue which prevented... CloverDX
  • Hierarchical data structures - JSON

    Historically, there were 2 options to read JSON files prior to CloverDX 5.6. Following 5.6 we introduced a new option – to work with JSON directly in CTL.... CloverDX How-To
  • Customizing metadata propagation

    Metadata propagation, i.e. the ability to push metadata out from connected components is in the product since CloverETL 4.0.0. A new addition in CloverDX... Java
  • CTL2 error handling - try/catch block

    Poor data quality, format changes and unreachable data sources are just a few examples of runtime problems that can wreak havoc on a seemingly robust data... Feature

Visit CloverDX Blog

Read On