Efficient data modelling with DBT and an ETL data pipeline

At CloverDX, we sometimes get asked whether and how we can work with DBT. These questions typically come up when IT or data engineering wants to empower data science teams with augmented, pre-checked, standardized or otherwise sanitized datasets. In this two-part miniseries I’d like to explain some of the theory behind the ETL and ELT data processing techniques and why companies choose DBT, before walking you through a use case where CloverDX plays a vital role in curating data before self-serve reporting in DBT takes place.

Tooling decisions

Assuming you have already decided that DBT is your tool of choice for analytics and you’re on the market for a tool to make data available to it, there are plenty of options to choose from. DBT recommends Fivetran, which you can even use to manipulate data before running DBT; if you only need to “dump” data into a database on a regular basis, you can use Stitch. Or, when there is a need for more complex pre-processing and robust data pipelines, you might prefer more traditional tools like CloverDX, Talend or Ab Initio.

Once a decision on a tool is made, the data strategy is typically dictated by the tool of choice. Some tools (like CloverDX) may be deployed in both ETL and ELT pipelines for data ingest, but usually one or the other is the preferred, more natural approach. I’d call that a consequence of choices made for technical fit: infrastructure cost, feature set, budget, existing talent and other factors.

Now may be a good time to point out why you’d even want to pre-process data for DBT when DBT itself is a data transformation tool. The answer is really simple: the intended user base. DBT is aimed at data scientists, and generally at those with strong SQL skills. But SQL on its own can be rather limiting when it comes to complex data transformations. It is possible to build an entire data warehousing solution using SQL commands alone, but it may still be easier, faster and more robust to off-load that work to an ETL tool or a set of scripts. More traditional ETL tools are aimed at data engineers, who may benefit from stronger programming support (Python, Java, .NET, …), and serve more as a framework with tooling for efficient data access and manipulation.

Before we get into the integration specifics of CloverDX and DBT, let’s introduce the basic concepts I’ve already touched on briefly.

ETL vs. ELT

Before we go any further, it is important to understand what ETL and ELT stand for. The individual letters mean the same in both: Extract, Transform, Load. The position of the T tells us where data manipulation happens. In the case of ETL, transformation happens in transit from the source to the target (typically a database), so the database serves as storage only. ETL is target-independent: the same implementation can be used for any target system, and migrating from one target to another usually means just switching the data pump configuration.

diagram-etl
An ETL process applies transformations in transit
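
To make the distinction concrete, here is a minimal ETL sketch in Python. Everything in it (the connection string, file and column names) is purely illustrative; the point is that the reshaping happens in transit, inside the tool, and only the finished result reaches the target.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder target; any database would do, since ETL is target-independent
engine = create_engine("postgresql://user:secret@dwh-host/analytics")

customers = pd.read_csv("customers.csv")                  # Extract
customers["email"] = customers["email"].str.lower()       # Transform, outside the DB
customers = customers.drop_duplicates(subset="email")
customers.to_sql("customers", engine, if_exists="replace", index=False)  # Load
```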

Switching the T and the L gives us ELT, meaning data manipulation happens directly in the data target. Simple.

diagram-elt
An ELT process uses the backend storage engine to do the transformations
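
For contrast, a minimal ELT sketch of the same job, assuming PostgreSQL as the target (names again illustrative): the raw file is bulk-loaded first, and the database engine itself performs the deduplication and cleanup.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:secret@dwh-host/analytics")

with engine.begin() as conn:
    # Load first: the exact bulk-load mechanism is database-specific
    # (COPY is PostgreSQL's and needs a server-visible file)
    conn.execute(text("COPY raw_customers FROM '/data/customers.csv' CSV HEADER"))
    # Transform second, inside the target engine
    conn.execute(text("""
        CREATE TABLE customers AS
        SELECT DISTINCT ON (lower(email)) id, name, lower(email) AS email
        FROM raw_customers
    """))
```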

Other variations of the three letters are, in essence, just optimizations or a decomposition of responsibility among various tools; below, we discuss ETLT and ELTT as concepts built on top of the two. In the context of this miniseries, the last T is a transformation implemented in DBT, but in general terms it could be any kind of data manipulation before the reporting itself, using e.g. Alteryx, Tableau Prep or similar tools.

diagram-etlt
The second transformation makes small modifications for reporting (e.g., building a data mart)

A couple of things usually speak in favor of ELT: performance, cost-effectiveness (not always true; more about that later) and maintainability. Performance mostly because of lower network overhead and the fact that targets are typically databases heavily optimized for aggregation, join and analytical queries. Obviously, each of these benefits is circumstantial; for example, using a SaaS-based ELT tool means data is still routed via a cloud provider to the target database. Cost-effectiveness and maintainability stem from the fact that ELT software is usually optimized to work with one specific target database, and these days it is either a SaaS service or simply an integral part of the database engine. We started to encounter this pattern more frequently with the rise of Amazon Redshift and other scalable cloud databases sitting in state-of-the-art network infrastructure.

Traditional ETL relies on a middleman between the data source and the data target. This typically means data needs to be transferred twice (before and after transformation), adding some network overhead; it also brings the additional cost of maintaining another server. On the flip side, you get far superior transformation and job management capabilities compared to relying on a database’s native tooling. ETL tools are usually the Swiss Army knife when it comes to supporting various sources and targets; SaaS tools, it is true, typically have a host of “read” connectors but are compatible with a very limited number of targets (databases). Moreover, we very frequently come across cases where companies have simply outgrown their ELT platform because the charges for running larger database clusters became far too expensive. Running additional on-premise server(s) to support ETL processes may therefore be cheaper in comparison as data volumes or the complexity of transformations grow.

With that said, we can look at DBT as an analysis engine playing the final transformation part of the ELT or ETLT/ELTT patterns, where the inbound data pipelines have to be implemented using other tools, with or without prior sanitization.

diagram-eltt
ELTT is analogous to ETLT; it just uses ELT as the "workhorse"

Why DBT?

With DBT, the game has changed for many data analysts and scientists. DBT cleverly combines SQL and Jinja templating (and, by extension, Python) to build data models and relationships, optimizing queries along the way. Plus, the license cost is very favorable. In tandem with CloverDX it is better still: DBT can be used completely free of charge, as its paid functionality can easily be replaced by CloverDX features, gaining additional capabilities in the process.
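
To illustrate the Python side: on some adapters (Snowflake among them), DBT lets you write a model as a Python function instead of a SQL file. A minimal, hypothetical sketch follows; stg_orders and the column names are made up for the example.

```python
import pandas as pd

def model(dbt, session):
    # dbt.ref() resolves dependencies just like {{ ref() }} in SQL models,
    # so the model still participates in DBT's lineage graph
    dbt.config(materialized="table")

    # On Snowflake, dbt.ref() returns a Snowpark DataFrame; convert to pandas
    orders = dbt.ref("stg_orders").to_pandas()

    # Aggregate a lifetime value per customer
    summary = (orders.groupby("CUSTOMER_ID", as_index=False)["AMOUNT"]
                     .sum()
                     .rename(columns={"AMOUNT": "LIFETIME_VALUE"}))
    return summary  # DBT materializes the returned DataFrame as a table
```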

It is no wonder it gained traction so quickly: Python and SQL are in the basic toolkit of any data analyst or scientist, even a junior one.

We frequently see customers who like DBT for its dependency management and for the transparency of code it provides, even as the number of interdependent data models grows.

DBT automation features

A typical trademark of any production environment, especially in the data industry, is some level of automation. Although there are many one-time migration projects, I’m not aware of a single one that used DBT as its “workhorse”. Far more often, data pipelines are executed on a schedule or by an external trigger. DBT comes with DBT Cloud, allowing its subscribers to schedule DBT runs or to configure webhooks. Especially with webhook support, the integration can be powerful, as it allows external orchestration using virtually any modern tool or application. The drawback is the additional cost: it requires DBT Cloud and is therefore a paid solution (but still a very cheap one).
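
For illustration, triggering a DBT Cloud job from outside boils down to one authenticated call to its API (v2 at the time of writing). The account ID, job ID and token below are placeholders:

```python
import requests

ACCOUNT_ID = 12345             # placeholder DBT Cloud account id
JOB_ID = 67890                 # placeholder job id
TOKEN = "dbt-cloud-api-token"  # placeholder service token

# Trigger the job; the response carries a run id that can be polled
# for status using the same API
response = requests.post(
    f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    headers={"Authorization": f"Token {TOKEN}"},
    json={"cause": "Triggered by an upstream data pipeline"},
)
response.raise_for_status()
print("run id:", response.json()["data"]["id"])
```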

Managing data with DBT and CloverDX

However, even with external scheduling and orchestration, DBT will not solve any of the Extract or Load processes. It may even prove difficult to solve a data engineer’s top challenges, such as incomplete, incoherent or duplicated data and other quality problems, within the transformation part of the process. With all this comes error reporting, profiling, etc., or perhaps anonymization, which is a frequent ask nowadays. That’s where an ETL tool, in the example at hand CloverDX, finds its place: sanitizing data before any kind of analysis can even begin. Unless data is fixed beforehand, any analysis may be skewed without anyone even realizing it; data nobody trusts is useless.

CloverDX is a nice addition to this concept and works well with DBT, typically in an ETLT setup. ELT (and, by extension, ELTT) is also supported by CloverDX, but it is a far less common approach, simply because the power of the platform lies in its robust orchestration options, error management and rich set of transformation components and functions.

Previously, I mentioned the automation options that come with DBT Cloud. Simple scheduling may be limiting, but the option to trigger a job via a webhook call opens up better integration with most modern tools, including iPaaS platforms and external orchestrators. Most integrations of DBT with CloverDX Server that I know of, however, use neither and rely on DBT’s CLI instead, which brings some useful benefits I’ll explain later in this miniseries.
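
As a sketch of that CLI route, here is how an orchestrator might invoke DBT, shown in Python for brevity (the project paths are hypothetical; the same two commands could equally be issued from a script step in CloverDX):

```python
import subprocess

# Build the models, then run their tests; fail loudly on any error
for command in (["dbt", "run"], ["dbt", "test"]):
    result = subprocess.run(
        command + ["--project-dir", "/opt/dbt/analytics",
                   "--profiles-dir", "/opt/dbt"],
        capture_output=True, text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        raise RuntimeError(f"{' '.join(command)} failed:\n{result.stderr}")
```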

Wrap up

We’ve explained what the ETL and ELT strategies are. In the next article, I’m going to show what ETLT (an extension of the traditional ETL process) looks like using the DBT CLI and DBT Cloud. The ETL part will get data from an XLSX spreadsheet, anonymize sensitive information and evaluate a score for the data profile; we’ll then load the data into Snowflake in an enriched raw format and use DBT to (re)build the data models used for analysis.
