CloverDX Data Profiler: Under the Hood of Data Profiling Application
Recently we released the second public beta of a new member of the CloverDX family – the CloverDX Data Profiler. In this article we’d like to share an overview of the technical architecture of the CloverDX Profiler – the why and how of its design.
The CloverDX Data Profiler is a data profiling application that helps with data quality. It provides users with various statistics about their data, such as the average value, the number of empty values, histogram-like charts etc. When designing the application we had several goals in mind:
- Performance – The data of interest can be huge, and profiling it needs to be reasonably quick so that users don’t have to wait too long for results – we want profiling to be an interactive process, not a batch job.
- CloverDX Integration – Data profiling is often just one part of a more complex workflow – for example, the starting point of a data integration project. We want to make the transition from data profiling to the subsequent data integration steps in CloverDX as smooth as possible.
- Web presentation – An important group of users of data profiling tools are people interested only in viewing the profiling results; they don’t need to design or execute data profiling jobs. Presenting the profiling results on the web gives such users a very simple way to access them.
CloverDX Profiler Architecture
Let’s have a look at the main building blocks of CloverDX Data Profiler:
- CloverDX Data Profiler GUI – This is the graphical tool where users create and execute data profiling jobs. We’ll talk about the jobs later in more detail, but basically they describe a data source to profile (e.g. file, database etc.) and how to profile it using metrics (e.g. calculate the maximum value of a database column). The GUI is based on the Eclipse RCP platform and reuses components of CloverDX Designer – this is the obvious way to reuse GUI components for concepts shared with CloverDX (e.g. metadata, database connections, data analytics etc.) and it paves the way for future deep integration.
- CloverDX Data Profiler Engine – Profiling jobs created and edited in the CloverDX Profiler GUI are stored as ordinary XML files, similarly to CloverDX graphs. The profiler engine takes these jobs, analyzes them, and automatically generates and runs a CloverDX graph. The CloverDX graph performs the actual profiling – so the CloverDX Data Profiler Engine is a relatively thin wrapper on top of CloverDX Engine.
- CloverDX Engine – As described above, the real work of data profiling is performed by the CloverDX Engine, which runs a specially designed transformation graph. Because of this, the profiler gains access to the wide feature set and great performance of the CloverDX Engine.
- Result Storage – The CloverDX graphs performing the profiling store their results in a Result Storage database. This can be an embedded Derby database for single-user scenarios, or a shared external database for scenarios where multiple users create and execute profiling jobs and need to share their results. We support a range of databases for result storage – Oracle, MySQL, PostgreSQL and MSSQL.
- Reporting Server – As observing profiling results is done via a web interface, we have a Reporting Server that both serves the web content and supplies it with the profiling results stored in the Result Storage. The Reporting Server isolates the web interface from the details of the Result Storage by exposing the profiling results as JSON via a REST interface. We kept the Reporting Server intentionally simple – more complex server-side functionality, such as scheduling, will come from integration with CloverDX Server.
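To give an idea of the shape of the data the web interface consumes, here is a sketch of what a response from the Reporting Server’s REST interface might look like. The endpoint path, field names and values below are purely hypothetical, for illustration only – they are not the actual API. A hypothetical request such as `GET /reports/42/fields/age` could return:

```json
{
  "field": "age",
  "metrics": {
    "min": 18,
    "max": 97,
    "emptyCount": 12
  }
}
```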
All parts of the CloverDX Data Profiler are currently bundled in one simple-to-use package as a standalone application – just start the CloverDX Data Profiler and it automatically launches an embedded Result Storage, Reporting Server etc. However, we’ve laid the foundations for separating the building blocks for bigger deployment scenarios and better integration with the rest of the CloverDX family.
Data Profiling Process
Now let’s have a look at the nitty-gritty details of actually profiling data. The basic premise is that CloverDX already gives us all the tools needed for profiling – a transformation graph has all the expressive power needed to profile data:
- Reading a data source – CloverDX has a wide range of reader components and connectors to 3rd-party systems.
- Calculating metrics – Transformations in CloverDX are extensible by Java code or by scripts in our proprietary scripting language CTL.
- Storing results – This is simply writing to a database, which is one of the basic use cases of CloverDX.
So anyone could manually create a CloverDX graph that profiles data. But it’s quite a complex task that would take the user’s focus away from the core questions: which data source to profile and which metrics to use. In the CloverDX Data Profiler, the user describes this core information in a profiling job, and the CloverDX Profiler Engine then transforms it into a CloverDX transformation graph. The profiling job is defined in a relatively simple XML file that can be edited in a graphical editor. The job primarily contains the following information:
- Data source – Description of the data source to profile – path to a file, definition of a database connection etc.
- Metadata – Description of the data source structure using CloverDX metadata
- Metrics – Multiple metrics can be enabled on each field of the metadata, e.g. minimum value, longest string, median etc.
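As a concrete illustration, a minimal profiling job covering these three pieces of information might look roughly like this. Note that the element and attribute names below are our own sketch for this article, not the actual job file schema:

```xml
<!-- Illustrative sketch only – element names may differ from the real schema -->
<ProfilingJob>
  <DataSource type="file" url="customers.csv"/>
  <Metadata>
    <Field name="age" type="integer">
      <Metric id="min"/>
      <Metric id="max"/>
    </Field>
    <Field name="name" type="string">
      <Metric id="longestString"/>
    </Field>
  </Metadata>
</ProfilingJob>
```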
The above picture demonstrates the process of running a profiling job:
- The profiling job is stored in an XML file that contains all required information. The job is created and edited in the CloverDX Profiler GUI.
- The job XML file is taken by the CloverDX Profiler Engine which creates a CloverDX transformation graph and runs it.
- The picture of the transformation graph shows an actual profiling graph: it reads the data source, calculates metrics – most importantly, in parallel branches – and stores the results. The details of the branching depend on the kinds of metrics used: some can be computed on the fly on the original data, while others (e.g. median) require all the data to be sorted first. The number of branches depends on the selection of metrics and the number of available CPU cores – we optimize the graph for high performance.
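The split between on-the-fly metrics and metrics that need the whole data set can be sketched in plain Java. This is only an illustration of the two computation styles that motivate the separate branches – it is not CloverDX Engine code:

```java
import java.util.Arrays;

public class MetricBranches {

    // On-the-fly branch: a single streaming pass with constant memory.
    // Metrics like min, max or empty-value counts work this way.
    static double[] streamingMinMax(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        return new double[] {min, max};
    }

    // Sort-based branch: median needs the entire data set sorted first,
    // so it cannot share the streaming pass above.
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] data = {4.0, 1.0, 7.0, 3.0};
        double[] mm = streamingMinMax(data);
        System.out.println("min=" + mm[0] + " max=" + mm[1]
                + " median=" + median(data));
    }
}
```

Because the two styles have such different resource profiles, it makes sense for the generated graph to run them as independent branches that can be parallelized across CPU cores.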
This article is a brief introduction to the architecture of the CloverDX Data Profiler. As you can see, we’ve saved a lot of effort by using the power of CloverDX at the core. The CloverDX Data Profiler can also be seen as a successful example of embedding CloverDX.