Python has long been one of the premier choices for data engineers looking to organize, clean, and analyze data from various sources. It is relatively easy to learn, and its powerful libraries enable anyone from data scientists to business analysts to derive value from their datasets.
However, using Python for data parsing, organizing, and cleansing at enterprise scale often leads us into the trap of re-inventing the wheel as we script around the intricacies of real-life data processes. Parsing files, handling exceptions, or scheduling scripts in cron are just a few of the tedious jobs you shouldn't be doing yourself in the 21st century. By combining your Python skills with a complete data integration platform like CloverDX, you can focus on deriving value from your data rather than wrangling with the mundane tasks that surround your business logic.
CloverDX gives you the flexibility to connect to any data source, on-prem or in the cloud, in any format. We can use CloverDX to push our data through pipelines that parse, clean, and transform it into something we can use in analytic processes. CloverDX is designed to be the robust data backbone of an organization; supporting Python-based analysis and data manipulation is just one of the many ways it can transform an organization's data infrastructure.
Figure 1: CloverDX process for Python integration
In this example, we're going to grab sales data from a CSV file, clean and validate it, push it through a Python script that uses NumPy for analysis, and finally write the results out to a target database for our Marketing team to use.
Setup
I've already written a Python script that takes incoming sales data, analyzes it, and turns it into a readable Excel report for our Marketing and Sales teams. I can use CloverDX as my ingestion engine to grab data from any source on the fly, without having to manually rewrite how I pull and parse data in my Python script.
First, I'll use a JSON Extract component to pick apart a piece of JSON and restructure it so that I can effectively use it as an input into the rest of our data process. In this case, I’ll be restructuring into a flat CSV format.
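To make the JSON Extract step concrete, here is a rough Python sketch of the same kind of flattening. The nested structure and field names (`orders`, `items`, `sku`, and so on) are made up for illustration; the real mapping is defined in the JSONReader dialog shown below.

```python
import csv
import io
import json

# Hypothetical nested sales JSON, similar in spirit to the input described above.
raw = json.loads("""
{
  "orders": [
    {"id": 1, "customer": {"name": "Acme"}, "items": [
      {"sku": "A-100", "units": 3, "unit_price": 19.99}
    ]}
  ]
}
""")

def flatten(doc):
    """Yield one flat row per order item, the way the JSON mapping does."""
    for order in doc["orders"]:
        for item in order["items"]:
            yield {
                "order_id": order["id"],
                "customer": order["customer"]["name"],
                "sku": item["sku"],
                "units": item["units"],
                "unit_price": item["unit_price"],
            }

rows = list(flatten(raw))

# Serialize the flat rows as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```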
Figure 2: JSON file reader.
My JSON Extract Mapping looks like this:
Figure 3: JSONReader configuration dialog
Next, I'll perform a data quality check to make sure all of the records on my incoming data stream are usable. For example, I'm only interested in records with valid dates and a unit price above a certain threshold. I'm going to use a Data Validator component to establish my business rules for incoming data.
Figure 4: Data quality / data sanity check
My Data Validation component will allow me to create the following rules:
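Only two of those rules are spelled out above: a parseable date and a unit-price floor. As a rough Python equivalent of that kind of check (the field names, date format, and threshold value are assumptions, not the actual Validator configuration):

```python
from datetime import datetime

# Assumed threshold; the real value lives in the Data Validator configuration.
MIN_UNIT_PRICE = 10.0

def is_valid(record):
    """Mirror the two rules mentioned above: a valid date and a unit-price floor."""
    try:
        datetime.strptime(record["date"], "%Y-%m-%d")
    except (KeyError, ValueError):
        return False
    try:
        return float(record["unit_price"]) > MIN_UNIT_PRICE
    except (KeyError, ValueError):
        return False

good = {"date": "2023-05-01", "unit_price": "19.99"}
bad = {"date": "not-a-date", "unit_price": "19.99"}
print(is_valid(good), is_valid(bad))  # the second record fails the date rule
```

In the graph, records that fail these rules are routed to the Validator's reject port instead of silently disappearing.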
Our next step is restructuring our data into standard input (stdin) format so we can run it through our Python script.
Figure 6: Data serialization for Python script
We're going to use a Flat File Writer component to transform our data from a columnar format to a single discrete record for processing in Python.
Figure 7: Serialization to component's output port
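In Python terms, this serialization step amounts to collapsing the columnar records into one CSV text blob that can be piped to the script's stdin. A minimal sketch, with made-up field names:

```python
import csv
import io

# Columnar records as they arrive on the component's input port.
records = [
    {"date": "2023-05-01", "sku": "A-100", "units": "3", "unit_price": "19.99"},
    {"date": "2023-05-02", "sku": "B-200", "units": "1", "unit_price": "42.50"},
]

# Serialize everything into a single string, ready to be fed to stdin.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=records[0].keys())
writer.writeheader()
writer.writerows(records)
payload = buf.getvalue()
print(payload)
```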
From there, we're going to run my Python script inside a Subgraph that I've built.
Figure 8: Python interpreter as Subgraph
Subgraphs in CloverDX allow you to wrap business logic (like a piece of Python data analysis in our case) in a reusable format that you can share within your team or organization.
Figure 9: Implementation of Python interpreter
In this example, we're using an Execute Script component to call out to my Python script.
Within this component, we can run any script (including Python). We can either paste our script directly into the “Script” input box or we can provide the URL of our .py file in the “ScriptURL” box. We can also parameterize all of these inputs to dynamically call on different data sources and analysis processes from a single graph.
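Conceptually, what happens under the hood is much like launching the script as a child process and wiring the serialized records to its stdin. The sketch below uses a trivial pass-through script written to a temporary file (standing in for the "ScriptURL" property); the real script would do the NumPy analysis:

```python
import os
import subprocess
import sys
import tempfile

# Stand-in script body: it just echoes stdin to stdout, which is enough
# to demonstrate the stdin/stdout contract the component relies on.
script_body = "import sys\nsys.stdout.write(sys.stdin.read())\n"

payload = "date,units,unit_price\n2023-05-01,3,19.99\n"

# Write the script to a temp .py file (playing the role of "ScriptURL").
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script_body)
    script_path = f.name

try:
    # Run the script, feed the serialized records to its stdin,
    # and capture stdout for the component's output port.
    result = subprocess.run(
        [sys.executable, script_path],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
finally:
    os.unlink(script_path)

print(result.stdout)
```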
Figure 10: Python interpreter configuration
Our script is going to be called on the incoming stdin input. The Python script looks like this:
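As a rough stand-in for a script of this shape, the sketch below reads CSV text from stdin, computes revenue statistics with NumPy, and writes a small CSV summary to stdout. The column names (`units`, `unit_price`) and the chosen statistics are assumptions for illustration:

```python
import csv
import io
import sys

import numpy as np

def analyze(stdin_text):
    """Read CSV text, compute revenue statistics with NumPy,
    and return a one-row CSV summary for the output port."""
    rows = list(csv.DictReader(io.StringIO(stdin_text)))
    units = np.array([float(r["units"]) for r in rows])
    prices = np.array([float(r["unit_price"]) for r in rows])
    revenue = units * prices

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["total_revenue", "mean_order_value", "max_order_value"])
    writer.writerow([revenue.sum(), revenue.mean(), revenue.max()])
    return out.getvalue()

if __name__ == "__main__":
    sys.stdout.write(analyze(sys.stdin.read()))
```

The key design point is that the script stays a plain stdin-to-stdout filter, so CloverDX can swap data sources in and out without the script changing.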
The resulting analysis from our script is going to go to the output port of our subgraph and back into our original graph.
Now that we've set up our Subgraph, let's go back to the graph level to add some finishing touches.
Our subgraph output is going to be a single continuous string: the Python script's stdout. We're going to run it through a Flat File Reader to normalize it back into a columnar form that's easier to digest in a business report.
Figure 11: Deserialization of Python script output
My Flat File Reader configuration looks like this:
Figure 12: Reading from component's input port
Finally, we can convert this output into a nice Excel report using our Excel Spreadsheet Writer Component.
Figure 13: Outputting process data
We're going to set the "FileURL" property to our first output port (0) in a discrete format in order to incorporate it into our next component.
Figure 14: Binary transport serialization in CloverDX
Our spreadsheet report output is going to look like this:
Figure 15: Excel report as a result of Python script
Now we can focus on the final part of our process: distributing our report via email:
Figure 16: E-Mailing of a report
It's being sent directly into our Email Sender Component as a field called "report" in a byte format.
Figure 17: Configuration of EmailSender to send email with binary attachment
As you can see, CloverDX makes data cleaning, transformation, and ingestion incredibly easy. It lets you leverage your existing Python investments on a platform that automates, orchestrates, and monitors your data pipelines.