Data integration is the process of combining data from multiple sources to extract additional value. The process usually involves actions like joining, cleansing, validating and enriching the data along the way.
It's not unusual to find, even in smaller organizations, dozens of applications being used. In larger organizations the applications can number in the thousands. Being able to work with all the data you have, across different applications, is essential in order to understand the information that your data holds.
Think, for example, of a situation where you need to combine various Excel spreadsheets with the information held in an Access database - this is a simple data integration task. At the other end of the scale is an extensive business intelligence platform that needs to work with data from, say, a customer database, a web analytics platform, an invoicing system, a stock management platform and third-party systems such as address databases.
Data consolidation is a process that combines data from multiple systems and stores it in a single location. The goal is to simplify the application landscape with a single data location that other applications, such as BI, reporting and data mining, can access without having to handle the complexities, and often impossibilities, of working with the data at source.
A very common example of data consolidation is a data warehouse. The primary objective of a data warehouse is to aggregate any information that is spread across multiple systems and present it to reporting or data mining systems in order to extract maximum value from the available data from a single data source.
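As a minimal sketch of consolidation, the snippet below merges records from two hypothetical source exports (a CRM and an invoicing system) into a single table keyed by customer ID. All names and fields are illustrative, not taken from any particular product.

```python
# Two hypothetical source extracts, represented as lists of records.
crm_rows = [
    {"customer_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"customer_id": 2, "name": "Bob", "email": "bob@example.com"},
]
invoicing_rows = [
    {"customer_id": 1, "total_invoiced": 1200.0},
    {"customer_id": 2, "total_invoiced": 340.5},
]

def consolidate(crm, invoicing):
    """Join both sources on customer_id into one consolidated record set."""
    invoiced_by_id = {r["customer_id"]: r["total_invoiced"] for r in invoicing}
    warehouse = []
    for row in crm:
        warehouse.append({
            **row,
            "total_invoiced": invoiced_by_id.get(row["customer_id"], 0.0),
        })
    return warehouse

warehouse_table = consolidate(crm_rows, invoicing_rows)
```

A real consolidation pipeline would of course write the result into a warehouse table rather than an in-memory list, but the essential step - joining disparate sources into one unified record per entity - is the same.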
Data propagation is an integration process that copies data from one system to another. This is often used to enrich data in a system by adding more information that can be found elsewhere or to prevent the need for duplicate data entry in multiple applications.
For example, enriching a licensing application with data from a CRM application can allow teams that manage licenses to serve customers more effectively by avoiding the need to use two different applications and enter customer information into both.
Another common data propagation task is the propagation of data from production systems to QA or pre-production labs in support of end-to-end testing of various in-house applications during their regular development and maintenance cycles.
Such data propagation often involves logic that anonymizes data or selects just a subset of the source data due to security or volume considerations.
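A simple sketch of that idea, under assumed record shapes: select only one region's rows (volume subsetting) and replace direct identifiers with a one-way hash (anonymization) before copying to a QA environment.

```python
import hashlib

# Hypothetical production records; field names are illustrative.
production_rows = [
    {"id": 1, "name": "Alice Smith", "email": "alice@example.com", "region": "EU"},
    {"id": 2, "name": "Bob Jones", "email": "bob@example.com", "region": "US"},
    {"id": 3, "name": "Carol White", "email": "carol@example.com", "region": "EU"},
]

def anonymize(row):
    # A stable one-way hash keeps rows joinable across systems while
    # removing the personal information itself.
    digest = hashlib.sha256(row["email"].encode()).hexdigest()[:8]
    return {**row, "name": f"user-{digest}", "email": f"{digest}@test.invalid"}

def propagate_to_qa(rows, region):
    """Copy only a subset (one region), anonymized, to the QA environment."""
    return [anonymize(r) for r in rows if r["region"] == region]

qa_rows = propagate_to_qa(production_rows, "EU")
```

The hashing keeps referential integrity intact (the same email always maps to the same pseudonym), which matters when the anonymized data has to stay consistent across several propagated systems.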
Data virtualization provides a near real-time unified view of the data in multiple different systems without having to store all of the data in a single location.
When data is queried, the virtualization system queries the source systems and builds a unified view of the data on the fly. Various caches are often employed to ensure that queries are fast and do not overload the source applications.
Being able to expose virtualization queries through multiple mechanisms, including easily configured API endpoints, is important in order to make the query results as accessible as possible.
There are several ways in which data integration can be performed. Each has a different level of automation, ranging from minimal automation with the manual integration approach, through to the more comprehensive common data storage technique (e.g. a data warehouse).
With manual integration, users gather all the relevant information themselves by accessing each source system or web interface directly. No unified view of the data exists.
This approach requires the individual applications to implement all the integration efforts. The downside of this technique is that it's only manageable with a very limited number of applications.
This approach transfers the integration logic from the individual applications to a new middleware layer. Although the integration logic is no longer implemented in the applications themselves, they still need to partially participate in the data integration.
The uniform data access - or virtual integration - technique leaves data in the source systems and defines a set of views that provide access to a unified view of the customer across the whole enterprise.
For example, when a user accesses the customer information, the particular details of the customer are transparently acquired from the respective system.
The main benefits of virtual integration are that the data stays in the source systems, so there is no separate copy to build and keep in sync, and that users see a near real-time view. However, the drawbacks can include the extra query load placed on the source systems and a dependence on caching to keep response times acceptable.
This usually means creating a new system that keeps a copy of the data from the source systems, storing and managing it independently of the original systems.
The best-known example of this approach is a data warehouse. The benefits include data version management and being able to combine data from very different sources (mainframes, databases, flat files, etc.). The physical integration, however, requires a separate system to handle the vast volumes of data.
Any data integration initiative should be driven by the business, not IT. The project needs a thorough analysis of the requirements so it can be designed to achieve the goals the business needs. This includes having a business champion who understands the data assets of the company and can lead discussions about the long-term aims of the data integration project.
The biggest data integration challenge is often simply the basics of integrating data from disparate, often incompatible, sources. To help design a successful project, you need to thoroughly analyze:
Custom coded data integration solutions can work perfectly well with simple or smaller projects. But as the amount of code grows, maintainability becomes a serious challenge. It's important to be aware of potential pitfalls of a custom coded approach, especially if your data needs increase.
Data integration works with live systems that can’t usually be shut down while the integration is querying or updating the data. It's important to ensure that the integration doesn’t negatively impact the involved applications.
This can be mitigated with proper resource monitoring. For example, monitoring response times and disk or memory usage can help detect overloading of production systems.
Since data integration is a continuous process, the implementation needs to be as robust as possible. The only way to ensure this is rigorous testing before deployment to production.
It is good practice to test on production data by taking a copy of the data and configuring testing instances of all involved applications or systems. Such end-to-end testing can be beneficial not only to the integration itself, but also to the other systems involved in the integration.
When designing a data integration, it is always important to decide how to represent the data. If the integration copies data from one application to another (data propagation), the representation is dictated by the applications involved.
However, for data consolidation and virtualization efforts, selecting the suitable representation of each entity can make or break a project. If the data representation is hard to understand, it will be avoided, resulting in applications adding point-to-point interfaces rather than using a centralized data location.
Building a comprehensive, unified data model may sound like a good idea, but these attempts almost always fail due to the sheer complexity (even with just a handful of applications). It's much better in practice to start small - with just the minimum required data that serves the business needs.
Whereas data integration can apply to any combining of different data sources, ETL (Extract, Transform, Load) is most often used to describe the process of getting data into a data warehouse specifically. But ETL can also describe any kind of process where the data is transformed between the source and target system.
ETL stands for Extract, Transform and Load, and is a process for getting data from one place (the source system) to another (the target), while performing some kind of transformation on the data in the middle. There’s also often a Cleaning step in the process to ensure the quality of the data you’re working with.
The first step covers getting the data from the source system and making it available for processing. The aim is to do this using as few resources as possible, and in a way that doesn't negatively affect the source system in terms of performance, response time or any kind of locking.
Extracts can be done in several ways: taking just the data that has changed since the last extract (an incremental extract), or taking a full extract of the data, which is then compared with the previous extract to identify changes. The frequency with which these extracts are done can be extremely important; particularly for full extracts, you can often be dealing with volumes in the tens of gigabytes.
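The full-extract comparison can be sketched as a snapshot diff. The keys and fields below are illustrative; in practice the snapshots would come from files or staging tables rather than in-memory dictionaries.

```python
# Previous and current full extracts, keyed by primary key.
previous_extract = {
    1: {"id": 1, "status": "active"},
    2: {"id": 2, "status": "active"},
}
current_extract = {
    1: {"id": 1, "status": "active"},
    2: {"id": 2, "status": "closed"},   # changed since last extract
    3: {"id": 3, "status": "active"},   # new row
}

def diff_extracts(previous, current):
    """Compare two snapshots to find inserted, updated and deleted rows."""
    inserted = [row for key, row in current.items() if key not in previous]
    updated = [row for key, row in current.items()
               if key in previous and previous[key] != row]
    deleted = [row for key, row in previous.items() if key not in current]
    return inserted, updated, deleted

inserted, updated, deleted = diff_extracts(previous_extract, current_extract)
```

Only the resulting change set then needs to flow through the rest of the pipeline, which is what makes the comparison worthwhile despite the cost of pulling a full extract.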
Cleaning the extracted data is important to maintain the quality of the target system. Data cleaning should apply basic data unification rules, such as standardizing date and number formats, unifying codes that differ between systems (for example gender or status values), and normalizing identifiers such as phone numbers or postal codes.
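A few such unification rules can be sketched on a single record. The specific mappings and field names are illustrative assumptions, not a fixed standard.

```python
import re

def clean_row(row):
    """Apply simple, illustrative unification rules to one source record."""
    cleaned = dict(row)
    # Unify gender codes that differ between source systems.
    gender_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
    cleaned["gender"] = gender_map.get(str(row.get("gender", "")).lower(), "Unknown")
    # Normalize phone numbers to digits only.
    cleaned["phone"] = re.sub(r"\D", "", str(row.get("phone", "")))
    # Trim stray whitespace from names.
    cleaned["name"] = str(row.get("name", "")).strip()
    return cleaned

cleaned = clean_row({"name": " Alice ", "gender": "F", "phone": "+1 (555) 010-2030"})
```

Each rule is trivial on its own; the value comes from applying the same rules to every source so that records can later be matched and joined reliably.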
The transform step applies a set of rules to transform the data from the source to the target. This includes converting data to the same dimension (i.e. a conformed dimension) using the same units so that it can later be joined.
The transformation step can also involve joining data from several sources; generating aggregates; generating surrogate keys; sorting; deriving new calculated values; and applying advanced validation rules.
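Several of those operations can be combined in a small sketch: unifying currencies into one conformed "amount in EUR" dimension, assigning surrogate keys, and generating a per-customer aggregate. The exchange rates and record shapes are illustrative assumptions.

```python
# Orders from two sources in different currencies.
usd_orders = [{"customer": "Alice", "amount": 100.0, "currency": "USD"}]
eur_orders = [{"customer": "Alice", "amount": 50.0, "currency": "EUR"},
              {"customer": "Bob", "amount": 80.0, "currency": "EUR"}]

RATES_TO_EUR = {"USD": 0.9, "EUR": 1.0}  # illustrative fixed rates

def transform(*sources):
    """Conform units, assign surrogate keys, and aggregate per customer."""
    surrogate_key = 0
    totals = {}
    rows = []
    for source in sources:
        for order in source:
            surrogate_key += 1  # surrogate key independent of source IDs
            amount_eur = order["amount"] * RATES_TO_EUR[order["currency"]]
            rows.append({"sk": surrogate_key,
                         "customer": order["customer"],
                         "amount_eur": amount_eur})
            totals[order["customer"]] = totals.get(order["customer"], 0.0) + amount_eur
    return rows, totals

fact_rows, totals_by_customer = transform(usd_orders, eur_orders)
```

Because both sources are converted into the same unit before aggregation, the per-customer totals are meaningful - which is exactly what a conformed dimension buys you.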
The load step involves loading the cleaned, transformed data into the target system (often a database). During this step, it's important to make sure the load is performed correctly, using as few resources as possible. To make the process efficient, it's good practice to disable any constraints and indexes before the load and re-enable them only after the load is complete.
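A minimal sketch of that practice, using SQLite as a stand-in target (table and index names are illustrative): the index is dropped before a bulk insert and rebuilt once afterwards, with the load done in a single transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_facts_id ON facts (id)")

rows = [(i, float(i) * 1.5) for i in range(1000)]

# Drop the index so it isn't maintained row by row during the bulk load.
conn.execute("DROP INDEX idx_facts_id")
with conn:  # one transaction for the whole bulk insert
    conn.executemany("INSERT INTO facts VALUES (?, ?)", rows)
# Rebuild the index once, after all rows are in place.
conn.execute("CREATE INDEX idx_facts_id ON facts (id)")

loaded = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
```

On a real warehouse the same idea applies at much larger scale, where the per-row cost of maintaining indexes and checking constraints during a load of millions of rows becomes the dominant expense.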
Using specialist data integration software can make dealing with a large number of data sources, formats or transformations easier. It can also help automate your data integration processes to save manual effort, increase repeatability and reliability, and build in error monitoring and handling.
But choosing the right software to suit your organization's requirements can be a complicated decision. Before you start the process, it’s crucial to have a good understanding of just what it is you want to achieve. Other things you need to consider are: