Building custom REST APIs for ETL processes

HTTP APIs currently drive the data integration space. No matter which systems enterprises use, most of them these days support some way to extract or modify their internal data through an interface. But what if the data integration software itself could provide data through similar interfaces, perhaps to augment the native functionality of a source system? In CloverDX, these programmable interfaces are called Data Services.

Data Services is a feature that allows you to consume or publish data into or out of the CloverDX Data Management platform. It lets you quickly expose a data pipeline (with or without transformations) in the form of a REST API. All of this can be done in CloverDX Designer and does not require extensive knowledge of HTTP, JavaScript, C#, and the like. This greatly reduces the effort and shortens timelines for creating or connecting to APIs in your organization.

In this article, we're going to walk through converting an existing job to make it HTTP context-aware, so it can be consumed either by an external application via an API call, or directly by a person who just wants to plug YouTube channel performance data into their favorite reporting tool, such as Microsoft Power BI, on demand.

Data Services support GET, POST, PUT, PATCH, and DELETE calls. This allows for simplified data management using an ETL process driven by a synchronous REST API call. The logic of the service is implemented as a CloverDX ETL job with full access to the HTTP context of the REST call: incoming data, request parameters, and headers. The service can respond with data, HTTP response headers, and status codes, all under the direct control of the service developer.
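To give a feel for the consumer side, here is a minimal sketch of calling a Data Service from Python using the requests library. The server URL, endpoint path, parameter name, and credentials are hypothetical placeholders, not part of the solution we'll build below:

```python
import requests

# Hypothetical URL -- a published Data Service lives on your CloverDX Server
url = "https://clover.example.com/clover/data-service/customers"

# A synchronous GET call; POST/PUT/PATCH/DELETE work the same way
resp = requests.get(url, params={"country": "US"}, auth=("user", "password"))

print(resp.status_code)                  # status code chosen by the service developer
print(resp.headers.get("Content-Type"))  # response headers set by the job
print(resp.text)                         # response body produced by the ETL pipeline
```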

Introducing the existing batch job

Even though we could start from scratch, for the sake of brevity we'll take an existing job that previously ran daily on a schedule, delivering YouTube channel performance reports as a CSV/Excel email attachment.

1 basic report graph
Source ETL transformation producing the YouTube video performance report

To briefly introduce what's going on here: we start with a query to the YouTube API endpoint (the purple "Search YouTube" box) to get the top results for a search string. We then send a second API request to the YouTube analytics API to gather statistics for each video, such as the number of views, thumbs up/down, and number of comments. Next, we do a bit of validation to remove videos we're not interested in (in this case, the Validator component filters out videos in any language other than English). Finally, we sort on the view count, keeping only the top results, which are then written to a flat file.
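For readers who prefer code to boxes, the same pipeline logic could be sketched in Python against the public YouTube Data API v3. This is an illustration only, not what the CloverDX components actually execute; the API key, the language field, and the 50-result search page are assumptions:

```python
import requests

API_KEY = "..."  # assumption: a YouTube Data API v3 key

def top_videos(query: str, result_count: int) -> list:
    # 1. Search YouTube for videos matching the query string
    search = requests.get(
        "https://www.googleapis.com/youtube/v3/search",
        params={"part": "snippet", "q": query, "type": "video",
                "maxResults": 50, "key": API_KEY},
    ).json()
    video_ids = [item["id"]["videoId"] for item in search.get("items", [])]

    # 2. Fetch statistics (views, likes, comments) for those videos
    details = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "snippet,statistics", "id": ",".join(video_ids),
                "key": API_KEY},
    ).json()

    # 3. Validate: keep only English-language videos
    videos = [v for v in details.get("items", [])
              if v["snippet"].get("defaultAudioLanguage", "").startswith("en")]

    # 4. Sort by view count descending and keep only the top N
    videos.sort(key=lambda v: int(v["statistics"].get("viewCount", 0)),
                reverse=True)
    return videos[:result_count]
```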

Conversion to a REST job

To make this job HTTP context-aware (i.e., convert it to a Data Service), we simply right-click the graph we'd like to convert and select Convert graph to data service.

Convert to data service
Conversion of ETL job to REST API

This automatically creates a new Data Service job from the original graph. The most noticeable difference is that the canvas now has a pillar on the left (HTTP input: parameters and request body) and one on the right (HTTP output: response body), with all our original components in the middle.

data-service-converted
Original ETL job converted to Data Service

Parameters

Most HTTP requests have parameters. A parameter is any value a service consumer might want to change at the time of the web service call.

In our example, the graph already had two parameters: QUERYSTRING and RESULT_COUNT. We want the Data Service to accept these and pass them into the graph.

In the Data Service world, there are two types of parameters: job parameters (those of the ETL job) and request parameters (provided by the HTTP call). The two are completely independent of each other, although request parameters can be used in the graph just like job parameters. It sometimes happens that the original job already has some parameters configured, as in our case. Before you proceed, make sure no job parameters (labelled simply as "Parameters" in the Outline) are set as required. A required job parameter can lead to strange behavior where the Data Service fails because of a missing value even when that value is provided via the HTTP request: the request parameter is set, but the job parameter is not. This brings us to another gotcha: a request parameter and a job parameter sharing the same name results in undefined behavior, precisely because both can be referenced the same way.

Parametrize a REST job

At the bottom left of the graph view, click the Endpoint Configuration tab. From here, we can add request parameters under the Input section, as seen below. I gave them new names to make it more obvious what changes in other parts of the solution: they are the original names, just prefixed with DS_.

Another option would be to remove the original job parameters altogether and create request parameters with the same names; then no other changes to the original solution would be needed. Sometimes this is not possible, typically when dealing with external parameters. We're taking the longer route here because I want to demonstrate how to handle that situation.

data-service-parameters
Configuration of HTTP request parameters

The original job parameters are used in the first component, Search Parameters. We will replace them with the new request parameters, effectively rendering the existing graph parameters unused. Alternatively, we could change the graph parameter defaults: instead of QUERYSTRING="value", we'd have QUERYSTRING="${DS_QUERYSTRING}". This approach usually offers the best ratio of convenience to non-breaking change.

Moreover, it is more common for HTTP parameters to be lowercase. Choosing lowercase can sometimes make a solution easier to follow precisely because it breaks with CloverDX parameter naming best practices: those recommend uppercase names, as they are easier to spot visually in code (treat them like programming constants), so lowercase request parameters stand out as something different.

data-service-switch-parameters
Replacing job parameters with Data Service request parameters

Final touches

Finally, we reroute the output of the transformation logic from the FlatFileWriter to the right-side (Output) pillar; the FlatFileWriter can then be removed. The result is a job like the one below.

data-service-final
Reroute transformation result to HTTP Response body

We can then go back to the Endpoint Configuration tab and click Test. Your browser should open the endpoint so you can run queries based on the defined request parameters. The test page makes a call and displays the results of the query.

data-service-test
Test solution

When we're satisfied with the Data Service responses, we simply click Publish in the Endpoint Configuration tab, which makes the service available on the server to users with appropriate permissions. Additional properties (such as security and monitoring) can be set in the CloverDX Server Console, under the Data Services module.
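Once the service is published, any HTTP client can consume it. Here is a hypothetical Python call as a sketch; the host, endpoint path, and credentials are assumptions, while DS_QUERYSTRING and DS_RESULT_COUNT are the request parameters we defined earlier:

```python
import requests

# Hypothetical host and path; the parameters match those defined above
url = "https://clover.example.com/clover/data-service/youtube-report"
resp = requests.get(url,
                    params={"DS_QUERYSTRING": "data integration",
                            "DS_RESULT_COUNT": 10},
                    auth=("user", "password"))  # assumption: basic authentication
resp.raise_for_status()
print(resp.text)  # the report, now delivered on demand instead of by email
```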

Summary

We've covered the basics of how to jump-start Data Service development from an existing job. This knowledge should help you build API endpoints for your various graphs, but it only scratches the surface of what CloverDX can do. Please browse our wiki for additional information, or feel free to reach out to us if you need additional help.
