Data Integration

What is Data Integration?

Data integration is the process of combining data from multiple different sources in order to extract additional value. The process usually involves actions like joining, cleansing, validating and enriching the data along the way.

It's not unusual to find, even in smaller organizations, dozens of applications being used. In larger organizations the applications can number in the thousands. Being able to work with all the data you have, across different applications, is essential in order to understand the information that your data holds. 

Think for example of a situation where you need to combine various Excel spreadsheets with the information held in an Access database - this is a simple data integration task. At the other end of the scale is an extensive business intelligence platform that needs to work with data from maybe a customer database, web analytics platform, an invoicing system, stock management platform and third party systems such as address databases.
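At its simplest, this kind of integration is a join on a shared key. The sketch below (with hypothetical customer and invoice data standing in for a spreadsheet export and a database query) shows the basic idea of enriching one source with another:

```python
# Minimal sketch of a data integration join: combining customer records
# (e.g. from a spreadsheet export) with invoice rows (e.g. from a database).
# All names and fields here are hypothetical.

customers = [
    {"customer_id": 1, "name": "Acme Ltd", "city": "London"},
    {"customer_id": 2, "name": "Globex", "city": "Paris"},
]

invoices = [
    {"invoice_id": 100, "customer_id": 1, "amount": 250.0},
    {"invoice_id": 101, "customer_id": 1, "amount": 99.5},
    {"invoice_id": 102, "customer_id": 2, "amount": 400.0},
]

# Index customers by their key, then enrich each invoice with customer data.
by_id = {c["customer_id"]: c for c in customers}
unified = [
    {**inv,
     "name": by_id[inv["customer_id"]]["name"],
     "city": by_id[inv["customer_id"]]["city"]}
    for inv in invoices
]

print(unified[0])
```

Real projects add cleansing and validation around this core join, but the shape of the task is the same.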

Data Integration Examples

  • For a 360-degree customer view: By combining data from CRM applications, marketing applications and accounting systems, companies can derive insights into how customers react to various marketing campaigns and measure effectiveness against campaign cost and actual sales data.
  • Creating a data warehouse: Typically created to store data collected from many different applications, data in a data warehouse is formatted so that a unified view of a whole department, or even the whole company, can be easily and effectively queried.
  • Data migration: A data migration project often requires integrating data from two (or more) sources into a single new system. For instance, migrating a customer database and a marketing system into one new CRM.

Types of Data Integration

Data Consolidation

Data consolidation is a process that combines data from multiple systems and stores it in a single location. The goal is to simplify the application landscape with a single data location for other applications, such as BI, reporting and data mining, to access without having to deal with the complexities and often impossibilities of dealing with the data at source.

A very common example of data consolidation is a data warehouse. The primary objective of a data warehouse is to aggregate any information that is spread across multiple systems and present it to reporting or data mining systems in order to extract maximum value from the available data from a single data source.
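Conceptually, consolidation means copying rows from several sources into one store that downstream tools query instead. A minimal sketch, using an in-memory SQLite database and two hypothetical source systems represented as plain lists:

```python
import sqlite3

# Hypothetical sketch of data consolidation: rows from two source
# "systems" (here just Python lists) are copied into one shared table.
crm_rows = [("alice@example.com", "CRM")]
billing_rows = [("bob@example.com", "Billing")]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE contacts (email TEXT, source TEXT)")
for rows in (crm_rows, billing_rows):
    warehouse.executemany("INSERT INTO contacts VALUES (?, ?)", rows)

# Reporting or data mining tools now query one location, not each source.
count = warehouse.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
print(count)  # 2
```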


Data Propagation

Data propagation is an integration process that copies data from one system to another. This is often used to enrich data in a system by adding more information that can be found elsewhere or to prevent the need for duplicate data entry in multiple applications.

For example, enriching a licensing application with data from a CRM application can allow teams that manage licenses to serve customers more effectively by avoiding the need to use two different applications and enter customer information into both.

Another common data propagation task is the propagation of data from production systems to QA or pre-production labs in support of end-to-end testing of various in-house applications during their regular development and maintenance cycles.

Such data propagation often involves logic that anonymizes data or selects just a subset of the source data due to security or volume considerations.
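A hedged sketch of that pattern, with hypothetical production records: a subset is selected to keep volumes down, and the email field is replaced with a stable hash so records can still be joined in QA without exposing real addresses.

```python
import hashlib

# Hypothetical sketch: propagating a subset of production records to a
# QA environment, anonymizing personal fields along the way.
production = [
    {"id": i, "email": f"user{i}@example.com",
     "plan": "pro" if i % 2 else "free"}
    for i in range(1000)
]

def anonymize(record):
    # Replace the email with a stable hash so joins still work in QA.
    digest = hashlib.sha256(record["email"].encode()).hexdigest()[:12]
    return {**record, "email": f"{digest}@masked.invalid"}

# Select a small subset (every 100th record) to keep QA volumes low.
qa_copy = [anonymize(r) for r in production if r["id"] % 100 == 0]
print(len(qa_copy))  # 10
```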

Virtualization

Data virtualization provides a near real-time unified view of the data in multiple different systems without having to store all of the data in a single location.

When data is queried, the virtualization system queries the source systems and builds a unified view of the data on the fly. Various caches are often employed to ensure that queries are fast and do not overload the source applications.
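The mechanism can be sketched as a function that fans out to the source systems at request time, merges the results, and caches the unified view so repeated queries do not hit the sources again. The source "systems" below are hypothetical stand-in functions:

```python
# Hypothetical sketch of data virtualization: a query fans out to two
# source systems at request time and merges the results on the fly;
# a small cache avoids re-querying the sources for repeated requests.
calls = {"crm": 0, "billing": 0}

def query_crm(customer_id):
    calls["crm"] += 1
    return {"name": "Acme Ltd"}      # stand-in for a real CRM call

def query_billing(customer_id):
    calls["billing"] += 1
    return {"balance": 120.0}        # stand-in for a real billing call

_cache = {}

def unified_view(customer_id):
    if customer_id not in _cache:    # build the view on the fly, then cache it
        _cache[customer_id] = {**query_crm(customer_id),
                               **query_billing(customer_id)}
    return _cache[customer_id]

view = unified_view(42)
unified_view(42)                     # second call is served from the cache
print(view, calls)                   # each source was queried only once
```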

Being able to expose virtualization queries through multiple mechanisms, including easily configured API endpoints, is important in order to make the query results as accessible as possible.

Data Integration Techniques

There are several ways in which data integration can be performed. Each offers a different level of automation, ranging from the largely manual integration approach through to the more comprehensive common data storage technique (e.g. a data warehouse).

Manual Integration or Common User Interface

Users gather all the relevant information themselves by accessing each source system or web page interface directly. No unified view of the data exists.

Application Based Integration

This approach requires each application to implement all the integration effort itself. The downside of this technique is that it's only manageable with a very limited number of applications.

Middleware Data Integration

This approach transfers the integration logic from particular applications to a new middleware layer. Although the integration logic is not implemented in the applications any more, there is still a need for the applications to partially participate in the data integration.


Uniform Data Access or Virtual Integration

The uniform data access - or virtual integration - technique leaves data in the source systems and defines a set of views that provide a unified view of the data across the whole enterprise.

For example, when a user accesses the customer information, the particular details of the customer are transparently acquired from the respective system. 

The main benefits of the virtual integration are:

  • Near-zero latency in propagating data updates from the source systems to the consolidated view.
  • There’s no need for a separate store for the consolidated data.

However, the drawbacks can include:

  • Limited information about the data's history and version management.
  • The method can often only be applied to 'similar' data sources (e.g. the same type of database).
  • The fact that the access to the user data generates extra load on the source systems, which they may not have been designed to accommodate.

Common Data Storage or Physical Data Integration

This usually means creating a new system which keeps a copy of the data from the source systems to store and manage it independently of the original system. 

The most well known example of this approach is a data warehouse. The benefits include data version management and being able to combine data from very different sources (mainframes, databases, flat files, etc.) The physical integration, however, requires a separate system to handle the vast volumes of data.

Challenges of Data Integration

Business vs IT

Any data integration initiative should be driven by business, not IT. The project needs a thorough analysis of the requirements so it can be designed to achieve the goals the business needs. This includes having a business champion who understands the data assets of the company and can lead discussions about the long-term aims of the data integration project. 


Technical considerations

The biggest data integration challenge is often simply the basics of integrating data from disparate, often incompatible, sources. To help design a successful project, you need to thoroughly analyze:

  • From which systems will the data be sourced?
  • Is all the data needed to fulfil the requirements available?
  • What are the options for extracting the data? Update notification? Incremental extracts? Full extracts? 
  • How often will you need to run your extracts?
  • Are the required data fields populated properly and consistently? 
  • What volume of data are you going to need to process?

Business process considerations

You’ll also need to discover who within the business owns the systems you need to work with, and ensure that they are fully involved in the process. Your data security and backup policies can also affect your integration.

Planning for the new system (and the future)

As well as the technical integration, you’ll need to plan who within the organization will be responsible for managing the new system, and what the support arrangements and SLAs are. Before the project begins, you also need to have a plan in place for future development: who will fund and manage ongoing maintenance and upgrades?

Considering your future requirements early on in the process can also save expensive headaches further down the line. Can your integration easily scale to handle larger data volumes or new sources?

Poor error handling

As the number of systems involved in your integrations grows, the possibility of failure also increases, so it’s important to have a good strategy for handling errors such as downtime or poor data quality. Building in automated error notification means that errors can be detected and fixed quickly, whether automatically as part of an error management pipeline, or flagged for manual resolution.
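One common shape for this is routing bad records to an error queue instead of letting one failure stop the whole run. A minimal sketch with hypothetical records (in a real pipeline the error branch would trigger a notification rather than just collecting the record):

```python
# Hypothetical sketch of automated error handling in an integration run:
# bad records are routed to an error list for notification or manual
# review instead of aborting the whole load.
incoming = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "oops"},   # deliberately bad data
    {"id": 3, "amount": "7"},
]

loaded, errors = [], []
for record in incoming:
    try:
        loaded.append({"id": record["id"], "amount": float(record["amount"])})
    except (KeyError, ValueError) as exc:
        # In a real pipeline this would raise an alert (email, monitoring
        # system) as well as recording the failed record and the reason.
        errors.append({"record": record, "reason": str(exc)})

print(len(loaded), len(errors))  # 2 1
```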

Custom coded solutions

Custom coded data integration solutions can work perfectly well with simple or smaller projects. But as the amount of code grows, maintainability becomes a serious challenge. It's important to be aware of potential pitfalls of a custom coded approach, especially if your data needs increase.

  • Logging is typically an afterthought, so when issues arise there is a lack of diagnostic information just when you need it. Developers who carry a lot of knowledge in their heads can also cause issues when they leave.
  • Integration with new technology is slow to implement.
  • Performance is rarely a consideration in the early days, but years later it often is, and serious bottlenecks can develop - fixing them can mean a full (time-consuming) refactoring.
  • Bugs can and will happen and debugging can be hard and cause downtime or interruptions.

Data Integration Best Practice

Data quality monitoring

Monitoring data quality in every involved application is vital to prevent a single low-quality source polluting multiple applications.

Resource usage monitoring

Data integration works with live systems that can’t usually be shut down while the integration is querying or updating the data. It's important to ensure that the integration doesn’t negatively impact the involved applications.

This can be mitigated with proper resource monitoring. For example, monitoring response times and disk or memory usage can all help detect overloads of production systems.

Testing

Since data integration is a continuous process, the implementation needs to be as robust as possible. The only way to ensure this is rigorous testing before deployment to production.

It is good practice to test on production data by taking a copy of the data and configuring testing instances of all involved applications or systems. Such end-to-end testing can be beneficial not only to the integration itself, but also to the other systems involved in the integration.

Data design

When designing a data integration, it is always important to decide how to represent the data. If the integration copies data from one application to another (data propagation), the design is given by the applications involved.

However, for data consolidation and virtualization efforts, selecting the suitable representation of each entity can make or break a project. If the data representation is hard to understand, it will be avoided, resulting in applications adding point-to-point interfaces rather than using a centralized data location.

Start small

Building a comprehensive, unified data model may sound like a good idea, but these attempts almost always fail due to the sheer complexity (even with just a handful of applications). It’s much better in practice to start small - with just the minimum required data that serves the business needs.

Data Integration vs ETL

What's the difference between data integration and ETL?

Whereas data integration covers any combining of different data sources, ETL (Extract, Transform, Load) most often describes the process of getting data into a data warehouse specifically. However, ETL can also describe any process where data is transformed between the source and target system.

What is ETL?

ETL stands for Extract, Transform and Load, and is a process for getting data from one place (the source system) to another (the target), while performing some kind of transformation on the data in the middle. There’s also often a Cleaning step in the process to ensure the quality of the data you’re working with.

Steps in the ETL Process

Extract

The first step covers getting the data from the source system and making it available for processing. The aim is to do this using as little resource as possible, and in a way that doesn’t negatively affect the source system in terms of performance, response time or any kind of locking. 

Extracts can be done in several ways: either taking just the data that’s changed since the last extract, or taking a full extract of the data, which is then compared with the previous extract to identify changes. The frequency at which these extracts run can be extremely important - particularly for full extracts, you can often be dealing with volumes in the tens of gigabytes.
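An incremental extract is often driven by a "watermark": the timestamp of the last successful run, with only rows modified after it being pulled. A minimal sketch with hypothetical source rows:

```python
from datetime import datetime

# Hypothetical sketch of an incremental extract: only rows modified since
# the last successful run (the stored "watermark") are pulled.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 15)},
    {"id": 3, "updated_at": datetime(2024, 2, 1)},
]

last_watermark = datetime(2024, 1, 10)   # saved at the end of the previous run

# Pull only the changed rows, then advance the watermark for the next run.
delta = [r for r in source_rows if r["updated_at"] > last_watermark]
new_watermark = max(r["updated_at"] for r in delta)

print([r["id"] for r in delta])  # [2, 3]
```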

Clean

Cleaning the extracted data is important to maintain the quality of the target system. Data cleaning should perform basic data unification rules, such as:

  • Converting phone numbers and zip codes to a standardized form
  • Making coded values consistent (e.g. ensuring that Male/Female and M/F are translated into one standard format)
  • Removing or standardizing missing values
  • Validating address fields and converting them to proper naming (e.g. unifying Street, St., St, and Str.)
  • Validating address fields against each other (e.g. making sure the State and Country, or the City and Zip Code values are valid together)
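A few of these rules can be sketched as small, testable functions. The mappings below are illustrative, not exhaustive:

```python
import re

# Hypothetical sketch of basic cleaning rules like those listed above:
# phone normalization, gender-code unification, street-suffix unification.
def clean_phone(raw):
    # Strip everything that isn't a digit.
    return re.sub(r"\D", "", raw)

GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def clean_gender(raw):
    # Unknown values become None so they can be flagged downstream.
    return GENDER_MAP.get(raw.strip().lower())

STREET_MAP = {"st": "Street", "st.": "Street",
              "str.": "Street", "street": "Street"}

def clean_street_suffix(word):
    # Unrecognized words pass through unchanged.
    return STREET_MAP.get(word.strip().lower(), word)

print(clean_phone("(020) 7946-0123"))   # 02079460123
print(clean_gender("Male"))             # M
print(clean_street_suffix("St."))       # Street
```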

Transform

The transform step applies a set of rules to transform the data from the source to the target. This includes converting any data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.

The transformation step can also involve joining data from several sources; generating aggregates; generating surrogate keys; sorting; deriving new calculated values; and applying advanced validation rules.
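Three of those operations - converting to a conformed unit, assigning surrogate keys, and aggregating - can be sketched together on hypothetical product rows:

```python
import itertools

# Hypothetical transform step: convert mixed weight units to one
# conformed unit (kilograms), assign surrogate keys, and aggregate.
rows = [
    {"sku": "A", "weight": 500, "unit": "g"},
    {"sku": "B", "weight": 2, "unit": "kg"},
    {"sku": "A", "weight": 1500, "unit": "g"},
]

TO_KG = {"g": 0.001, "kg": 1.0}
key_seq = itertools.count(1)          # surrogate key generator

transformed = [
    {"surrogate_key": next(key_seq),
     "sku": r["sku"],
     "weight_kg": r["weight"] * TO_KG[r["unit"]]}
    for r in rows
]

# Aggregate: total weight per SKU, now safe to sum in a single unit.
totals = {}
for r in transformed:
    totals[r["sku"]] = totals.get(r["sku"], 0.0) + r["weight_kg"]

print(totals)
```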

Load

The load step involves loading the cleaned, transformed data into the target system (often a database). During this step, it's important to make sure the load is performed correctly and uses as few resources as possible. To make the process efficient, it's good practice to disable any constraints and indexes before the load and re-enable them only after the load is completed.
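The drop-load-rebuild pattern for indexes can be sketched with SQLite (the table and index names here are hypothetical; the same idea applies to most relational databases):

```python
import sqlite3

# Hypothetical sketch of the load step: drop the index before a bulk
# load and recreate it afterwards, so rows insert without per-row
# index maintenance.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (id INTEGER, value REAL)")
db.execute("CREATE INDEX idx_facts_id ON facts (id)")

clean_rows = [(i, i * 1.5) for i in range(1000)]

db.execute("DROP INDEX idx_facts_id")                  # disable the index
db.executemany("INSERT INTO facts VALUES (?, ?)", clean_rows)
db.execute("CREATE INDEX idx_facts_id ON facts (id)")  # rebuild it once
db.commit()

count = db.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
print(count)  # 1000
```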

Data Integration Software

Using specialist data integration software can make dealing with a large number of data sources, formats or transformations easier. It can also help automate your data integration processes to save manual effort, increase repeatability and reliability, and build in error monitoring and handling. 

But choosing the right software to suit your organization's requirements can be a complicated decision. Before you start the process, it’s crucial to have a good understanding of just what it is you want to achieve. Other things you need to consider are:

  • Technical considerations, such as where your data sits
  • What your future needs might be, so you can scale easily
  • Who will be using the software, and what their level of technical expertise is
  • How you plan to implement the solution - by yourself, or with an expert's help

