Using Python in CloverDX

Python has long been one of the premier choices for data engineers looking to organize, clean, and analyze data from various sources. It is relatively easy to learn with powerful libraries that enable anyone from data scientists to business analysts to work with datasets to derive value.

However, utilizing python for data parsing, organizing, and cleansing at an enterprise scale often leads us into the trap of re-inventing the wheel when we try to script handling the intricacies of real-life data processes. Parsing files, handling exceptions, or scheduling scripts in cron are just a few tedious jobs you shouldn't be doing yourself in 21st century. By combing your Python skills with a complete data integration platform like CloverDX, you can focus on deriving value from your data rather than wrangling with mundane tasks that surround your business logic.

CloverDX gives you the flexibility to connect to any data source, on-prem or in the cloud, in any format. We can use CloverDX to push our data through data pipelines that will parse, clean, and transform your data into something we can use in analytic processes. CloverDX is designed to be the robust data backbone of an organization. Supporting Python-based analysis and data manipulation is just one of the many ways we can transform organizations’ data infrastructure.

Figure 1: CloverDX process for Python integration

In this example, we’re going to grab sales data from a CSV file, clean/validate our data, push it through a python script utilizing NumPy libraries for analysis, and finally write out to a target database for our Marketing team to use.

Setup

I've already written a python script to take incoming sales data, analyze it, and turn it into a readable Excel report for our Marketing and Sales teams. I can leverage CloverDX to act as my ingestion engine to easily grab data from any source on the fly without having to manually re-write how I pull and parse data in my Python script.

First, I'll use a JSON Extract component to pick apart a piece of JSON and restructure it so that I can effectively use it as an input into the rest of our data process. In this case, I’ll be restructuring into a flat CSV format.

My JSON Extract Mapping looks like this:

Figure 3: JSONReader configuration dialog

Next, I'll perform a data quality check to make sure all of my records on my incoming data stream are useable. For example, I'm only interested in records with valid dates and over a certain unit-price threshold. I'm going to use a Data Validator component to establish my business rules for incoming data.

Figure 4: Data quality / data sanity check

My Data Validation component will allow me to create the following rules:

Figure 5: Validator component configuration dialog

Our next step is restructuring our data into standard-in (stdin) format so we can run it through our python script.

Figure 6: Data serialization for Python script

We're going to use a Flat File Writer component to transform our data from a columnar format to a single discrete record for processing in Python.

Figure 7: Serialization to component's output port

From there, we're going to utilize my Python Script using a Subgraph that I've built.

Python interpretter as Subgraph — Figure 8: Python interpreter as Subgraph

Subgraphs in CloverDX allow you to wrap business logic (like a piece of python data analysis in our case) in a reusable format that you can share within your team or organization.

Implementation of Python interpretter — Figure 9: Implementation of Python interpreter

In this example, we're using an Execute Script component to call out to my Python script.

Within this component, we can run any script (including Python). We can either paste our script directly into the “Script” input box or we can provide the URL of our .py file in the “ScriptURL” box. We can also parameterize all of these inputs to dynamically call on different data sources and analysis processes from a single graph.

Figure 10: Python interpreter configuration

Our script is going to be called on the incoming stdin input. The Python script looks like this:

import sys
import pandas as pd
import numpy as np

sales = pd.read_csv(sys.stdin)
report = sales.pivot_table(
	values = ['ext price', 'quantity'],
    index = ['Account Name'], 
    columns = ['category'], 
    fill_value = 0, 
    aggfunc = np.sum
)
report.to_csv(sys.stdout, index = True, header = False, line_terminator = '\n')

The resulting analysis from our script is going to go the the output port of our subgraph and out back into our original graph.

Now that we've set up our Subgraph, let's go back to the graph level to add some finishing touches.

Our subgraph output is going to be a single continuous string in Python StdIn format. We’re going to run it through a Flat File Reader to normalize it into a columnar form that’s easier to digest in the form of a business report.