Data Integration

A reference guide

A data integration reference guide for anyone wanting a deeper understanding of what is often involved in data integration projects.

Basics

What is data integration?

    Data Integration is the process of combining data from multiple sources in order to extract additional value. Data integrations are often long-lived, as an organization comes to depend on the benefits they provide.

Why should I care?

    Organizations run many different applications and systems each of which has its own data. It is not uncommon, even in smaller organizations, to find dozens of applications. In larger organizations the applications can number in the thousands.

    In order to unlock the full informational potential of your organization’s data, being able to work flexibly with data stored across all applications is very important. A proliferation of Excel spreadsheets and Access databases is a common manifestation of the need to perform data integration tasks. However, the need also occurs at a far more strategic level in cases such as Business Intelligence and Reporting platforms that need to work with data originating in multiple applications.

    The instinctive reaction in IT departments can be to code their way to a solution. However, this is not only slow, it also carries the risk that such solutions become harder and harder to maintain over time.

    A strategic approach to Data Integration allows an organization to efficiently expose their data to applications, systems and even DIY solutions. This can involve coding where necessary but will usually draw on tools to provide visual clarity and speed of development to the integration team.

Examples

    One of the most common data integrations is between CRM applications, marketing applications and accounting systems. This allows an organization to derive interesting insights about how their customers react to various marketing campaigns and to measure campaign effectiveness against actual sales and campaign cost.

    Another very common data integration project centers on the creation and maintenance of data warehouses. A data warehouse typically stores data collected from many different applications formatted so that a unified view of the whole department, or even the whole company, can be easily and effectively queried.

Common types of integration

Consolidation

    Data consolidation is a process that combines data from multiple systems and stores it in a single location. The goal is to simplify the application landscape with a single data location that other applications, such as BI, reporting and data mining, can access without having to deal with the complexities, and often the impossibility, of working with the data at source.

    A very common example of data consolidation is a data warehouse. The primary objective of a data warehouse is to aggregate information that is spread across multiple systems and present it to reporting or data mining systems through a single data source, so that maximum value can be extracted from the available data.
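
    To make the idea concrete, the short Python sketch below consolidates customer records from two hypothetical sources into a single set of warehouse rows. The sources, field names and merge key are assumptions made purely for illustration; real consolidations are usually built with integration tooling rather than hand-written scripts.

        # Minimal consolidation sketch: merge customer records from two
        # hypothetical exports (CRM and accounting) into one row per customer,
        # keyed by email address.

        crm_customers = [
            {"email": "ann@example.com", "name": "Ann Smith", "segment": "SMB"},
            {"email": "bob@example.com", "name": "Bob Jones", "segment": "Enterprise"},
        ]
        accounting_customers = [
            {"email": "ann@example.com", "total_invoiced": 12500.0},
            {"email": "carol@example.com", "total_invoiced": 800.0},
        ]

        def consolidate(*sources):
            """Merge records from all sources into one dictionary per customer."""
            merged = {}
            for source in sources:
                for record in source:
                    key = record["email"].lower()
                    merged.setdefault(key, {}).update(record)
            return list(merged.values())

        warehouse_rows = consolidate(crm_customers, accounting_customers)
        for row in warehouse_rows:
            print(row)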

Propagation

    Data propagation is an integration process that copies data from one system to another. This is often used to enrich data in a system by adding more information that can be found elsewhere or to prevent a need for duplicate data entry in multiple applications.

    For example, enriching a licensing application with data from a CRM application can allow teams that manage licenses to serve customers more effectively by avoiding the need to use two different applications and enter customer information into both.

    Another common data propagation task is the propagation of data from production systems to QA or pre-production labs in support of end-to-end testing of various in-house applications during their regular development and maintenance cycles.

    Such data propagation often involves logic that anonymizes data or selects just a subset of the source data due to volume or security considerations.
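
    As an illustration of this kind of logic, the Python sketch below selects a random subset of hypothetical production records and anonymizes them before they are handed to a QA environment. The field names, sampling rate and hashing scheme are assumptions for the example, not a prescribed approach.

        import hashlib
        import random

        def anonymize(record):
            """Replace personally identifiable fields with stable pseudonyms."""
            digest = hashlib.sha256(record["email"].encode()).hexdigest()[:10]
            return {**record, "name": f"user_{digest}", "email": f"{digest}@test.invalid"}

        def propagate_to_qa(production_rows, sample_rate=0.1, seed=42):
            """Take a reproducible random subset of production data and anonymize it."""
            rng = random.Random(seed)
            sample = [r for r in production_rows if rng.random() < sample_rate]
            return [anonymize(r) for r in sample]

        production = [{"email": f"user{i}@example.com", "name": f"User {i}", "plan": "pro"}
                      for i in range(1000)]
        qa_rows = propagate_to_qa(production)
        print(len(qa_rows), "records propagated to QA")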

Virtualization

    Data virtualization provides a near real-time unified view of the data in multiple different systems without having to store all of the data in a single location.

    When data is queried, the virtualization system queries the source systems and builds a unified view of the data on the fly. Various caches are often employed to ensure that queries are fast and do not overload the source applications.

    Being able to expose virtualization queries through multiple mechanisms, including easily configured API endpoints, is important in order to make the query results as accessible as possible.
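
    The Python sketch below illustrates the general idea with two stubbed source systems and a simple time-based cache. It is a conceptual example only and is not intended to reflect how any particular virtualization product works internally.

        import time

        class VirtualCustomerView:
            """Builds a unified customer view on demand, with a short-lived cache
            so that repeated queries do not overload the source systems."""

            def __init__(self, crm_lookup, billing_lookup, ttl_seconds=60):
                self._crm = crm_lookup          # callable: customer_id -> dict
                self._billing = billing_lookup  # callable: customer_id -> dict
                self._ttl = ttl_seconds
                self._cache = {}                # customer_id -> (timestamp, record)

            def get(self, customer_id):
                cached = self._cache.get(customer_id)
                if cached and time.time() - cached[0] < self._ttl:
                    return cached[1]            # served from the cache
                # Query both source systems and merge the results on the fly.
                record = {**self._crm(customer_id), **self._billing(customer_id)}
                self._cache[customer_id] = (time.time(), record)
                return record

        # Example usage with stubbed source systems.
        view = VirtualCustomerView(
            crm_lookup=lambda cid: {"id": cid, "name": "Ann Smith"},
            billing_lookup=lambda cid: {"id": cid, "balance": 120.50},
        )
        print(view.get(42))   # queries both sources
        print(view.get(42))   # answered from the cache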

Pitfalls

Inadequate system knowledge

    Data applications, especially mission-critical ones, tend to be in production for extended periods of time as companies typically do not want to invest in technology unless absolutely necessary.

    This means that institutional knowledge of applications is often lost due to inadequate documentation or staff turnover.

    A lack of system knowledge can negatively affect data migration or integration projects due to over-optimistic estimates or data mapping issues that only manifest themselves in later stages of the project.

    As such, a lack of knowledge can be very expensive and any poor estimates or deficient data mapping exercises can lead to costly project restarts and substantial budget overruns.

Changing requirements or scope

    It is not uncommon for the project scope to change during a project’s lifetime, and this is one of the main reasons why so many IT projects end up over budget.

    One of the most common reasons for this is that the involved applications or data sources are not necessarily fully understood or taken into account during the analysis phase. This is compounded by the fact that documentation of the systems involved is often incomplete, leading to guesses rather than estimates of what is required to implement the business logic. Such guesses are normally overly optimistic.

    The best way to avoid this is to be thorough and honest during the design phase and to ensure that all stakeholders are invited to contribute to the scope discussions. While this can be difficult and time-consuming in larger organizations, the benefits can easily outweigh the expense of a slightly longer design phase.

Poor error handling and reporting

    Having a good strategy for handling failures is extremely important: data integrations remain operationally active for long periods, and the likelihood of failure grows rapidly as the number of systems involved grows. The most common issues faced are:

    • Downtime of the various systems involved
    • Data quality issues

    It is important that data integration processes automatically detect and handle as many errors as possible. However, there are also cases where manual intervention is required and so suitable notification mechanisms should be built in.

    Error handling should not only be addressed during the initial development phase but also as a process of continuous improvement as new error conditions are encountered.
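
    As a simple illustration of combining automatic handling with escalation to a human, the Python sketch below retries a step that fails with a simulated connection error and notifies operators once the retries are exhausted. The step, retry policy and notification channel are all placeholders.

        import logging
        import time

        logging.basicConfig(level=logging.INFO)
        log = logging.getLogger("integration")

        def notify_operators(message):
            """Stand-in for a real notification channel (email, chat, ticketing...)."""
            log.error("MANUAL INTERVENTION NEEDED: %s", message)

        def run_with_retries(step, retries=3, delay_seconds=5):
            """Retry transient failures (e.g. source system downtime) automatically,
            and escalate to a human only when automatic handling is exhausted."""
            for attempt in range(1, retries + 1):
                try:
                    return step()
                except ConnectionError as exc:
                    log.warning("Attempt %d/%d failed: %s", attempt, retries, exc)
                    time.sleep(delay_seconds)
            notify_operators(f"Step {step.__name__} failed after {retries} attempts")
            raise RuntimeError("integration step failed")

        def load_orders():
            # Placeholder for a call to a source system that may be down.
            raise ConnectionError("source system unavailable")

        # run_with_retries(load_orders)  # would retry, then notify and raise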

Incomplete or missing update strategies

    Dealing with updates of various applications that are involved in the integration is also an important consideration. System updates can occur at any time and there is often little or no warning.

    As updates that affect data happen somewhat infrequently, there is a tendency not to plan for them properly, which can impact system availability.

    It is usually not possible to avoid downtime during updates, but it is important to plan for such eventualities in order to minimize their impact.

    Planning an orderly rollback is also important. Inadequate rollback planning can have very serious consequences indeed, yet is often not implemented until after the first major failure.

Custom coded solutions

    Companies regularly take a coding approach when working with data. This can work perfectly well in the short term or for simpler projects. However, there are some important considerations over time.

    • As the amount of code grows, maintainability becomes a serious challenge
    • Logging is typically an afterthought, so when issues arise there is often a lack of diagnostic information just when you need it
    • Integration with new technology is slow to implement
    • Performance is rarely a consideration in the early days, but years later it often is, and serious bottlenecks can develop
    • Performance bottlenecks can require a full refactoring that can take a great deal of time
    • Bugs can and will happen, and debugging can be hard and cause downtime or interruptions
    • Developers carry a lot of knowledge in their heads and take it with them when they leave
    • Code is often poorly documented, so maintenance depends on institutional knowledge; the departure of developers familiar with the codebase can leave the company with major maintenance headaches

The Integration Process

Data Discovery

    Data discovery is an often overlooked and underestimated part of any data-related project. People often make false assumptions about their data because most people view the data from their own perspective only. However, data discovery is a critical part of the design phase as it provides the input for the scope definition and project estimates.

    Discovery involves a formal analysis of the data itself, finding relationships within and between datasets. One also has to consider all applications that use the data and need to be reconfigured or updated, especially in the case of data migrations.

    As such, discovery in larger organizations can be an activity that involves various teams and often crosses departmental boundaries.

Data Validation

    Data validation is an important process in achieving sufficient data quality in the target system and goes hand in hand with data cleansing. Data issues are discovered during validation and data that does not pass the validation step is marked for cleansing.

    Data validation can be a manual or an automated process. A high degree of automation is, of course, preferable and indeed necessary for larger systems. Nevertheless, some manual validation may still be necessary, especially for the most important data that cannot be allowed through without human confirmation or correction.
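
    A minimal sketch of automated validation might look like the following Python example, in which records failing any rule are marked for cleansing. The rules and field names are illustrative assumptions only.

        import re

        # Each rule returns an error message, or None if the record passes.
        RULES = [
            lambda r: None if r.get("email") and re.match(r"[^@\s]+@[^@\s]+$", r["email"])
                      else "invalid email",
            lambda r: None if r.get("amount", 0) >= 0 else "negative amount",
        ]

        def validate(records):
            """Split records into valid ones and ones marked for cleansing."""
            valid, for_cleansing = [], []
            for record in records:
                errors = [msg for rule in RULES if (msg := rule(record))]
                if errors:
                    for_cleansing.append({**record, "_errors": errors})
                else:
                    valid.append(record)
            return valid, for_cleansing

        valid, dirty = validate([
            {"email": "ann@example.com", "amount": 10.0},
            {"email": "not-an-email", "amount": -5.0},
        ])
        print(len(valid), "valid,", len(dirty), "marked for cleansing")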

Data Mapping

    Data mapping is a process that maps data from the format used in one system to a format used by another system. The mapping can take a significant amount of time as:

    • multiple applications often store data for the same entity
    • applications can be unfamiliar or poorly documented.

    The data mapping documentation has a significant impact on the overall implementation effort as, in many cases, there is no single correct way of mapping data between different structures. Common reasons for this are:

    • there is no direct one to one mapping between values
    • data structures representing the same entity are too different.

    Proper data mapping requires detailed knowledge from the data discovery project phase. It also usually involves substantial input from data consumers.

    The mapping process is simplified by tools that visualize the mapping between different entities and automate parts of the process.

    Data mapping also needs to consider the future development of the applications involved. Data structures may change over time, and it is important that the mapping and its implementation can accommodate such changes as easily as possible.
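
    The Python sketch below illustrates a simple mapping from a hypothetical CRM record to a hypothetical accounting structure, including a lookup table for a value that has no direct one-to-one equivalent. All field names and codes are assumptions made for the example.

        # Code table for a value with no direct one-to-one mapping.
        SEGMENT_TO_ACCOUNT_TYPE = {"SMB": "standard", "Enterprise": "key_account"}

        def map_customer(src):
            """Map one source record to the target structure."""
            return {
                "customer_name": f'{src["first_name"]} {src["last_name"]}',
                "contact_email": src["email"].lower(),
                # Values without a direct equivalent go through an explicit lookup.
                "account_type": SEGMENT_TO_ACCOUNT_TYPE.get(src.get("segment"), "standard"),
            }

        print(map_customer({
            "first_name": "Ann", "last_name": "Smith",
            "email": "Ann@Example.com", "segment": "Enterprise",
        }))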

Data Cleansing

    During data discovery, you will often find that the data cannot be used in its current form and first needs to be cleansed. There are many different reasons for low data quality, ranging from simple ones (anything involving human data entry is likely to have various errors including typos, missing data, data misuse etc.) all the way to complex issues stemming from improper data handling practices and software bugs.

    Data cleansing is the process of taking “dirty” data in its original location and cleaning it before it is used in any data transformation. Data cleansing is often an integral part of the business logic with the data being cleaned in the transformation but left unchanged in the originating system. Other approaches can also be used. For example, a separate, clean copy of the data can be created if the data needs to be reused or if cleansing is time-consuming and requires human interaction.
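
    As an illustration, the following Python sketch cleanses a copy of a record inside the transformation while leaving the source record untouched. The cleansing rules shown (normalizing names, stripping phone numbers down to digits) are examples only.

        def clean_phone(raw):
            """Keep digits only; return None when no usable number remains."""
            digits = "".join(ch for ch in (raw or "") if ch.isdigit())
            return digits or None

        def cleanse(record):
            """Return a cleaned copy used by the transformation; the record in the
            source system is left unchanged."""
            return {
                **record,
                "name": (record.get("name") or "").strip().title(),
                "phone": clean_phone(record.get("phone")),
            }

        print(cleanse({"name": "  ann SMITH ", "phone": "+1 (555) 010-9999"}))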

Best Practice

Data Quality Monitoring

    Monitoring of data quality for every involved application is vital in order to prevent pollution of multiple applications with low quality data coming from a single source.

    Monitoring often consists of data validation rules that are applied to each record as it is transformed into its destination format.

    The choice of which data attributes to monitor, and how to monitor them, is one of the key decisions in the design phase of the whole project. Too much monitoring, with overly detailed reports, can overwhelm stakeholders and result in important problems being overlooked. On the other hand, too little monitoring means that important observations are simply not reported.

    Proper data quality monitoring is, therefore, about carefully balancing the investment in building monitoring rules with the volume of output.
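
    One way to strike that balance is to aggregate rule failures into a compact summary rather than reporting every record individually, as in the illustrative Python sketch below (the rules and fields are assumptions for the example).

        from collections import Counter

        def monitor_quality(records, rules):
            """Apply each rule to each record and return failure counts per rule,
            giving a compact quality summary instead of per-record reports."""
            failures = Counter()
            for record in records:
                for name, rule in rules.items():
                    if not rule(record):
                        failures[name] += 1
            return {"records": len(records), "failures": dict(failures)}

        rules = {
            "email_present": lambda r: bool(r.get("email")),
            "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
        }
        summary = monitor_quality(
            [{"email": "ann@example.com", "amount": 10}, {"email": "", "amount": -1}],
            rules,
        )
        print(summary)  # {'records': 2, 'failures': {'email_present': 1, 'amount_non_negative': 1}}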

Resource usage monitoring

    Data integration works with live systems that cannot be shut down while the integration is querying or updating the data. It is therefore very important to ensure that the integration does not negatively impact the involved applications.

    This can be mitigated with proper resource monitoring. For example, monitoring response times, disk usage, memory usage and so on can help detect when production systems are becoming overloaded.
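
    A very small illustration of the idea in Python: time each query against a source system and warn when the response time crosses a threshold. The threshold and logging channel are placeholders.

        import logging
        import time

        log = logging.getLogger("resource-monitor")

        def timed_query(run_query, warn_after_seconds=2.0):
            """Run a query against a source system and warn when the response time
            suggests the system may be getting overloaded."""
            start = time.monotonic()
            result = run_query()
            elapsed = time.monotonic() - start
            if elapsed > warn_after_seconds:
                log.warning("Source query took %.1fs (threshold %.1fs)",
                            elapsed, warn_after_seconds)
            return result

        rows = timed_query(lambda: [{"id": 1}, {"id": 2}])  # stubbed source query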

Testing

    Since data integration is a continuous process, it is very important that the implementation is as robust as possible. This can only be achieved through rigorous testing before deployment into production.

    It is good practice to test on production data by taking a copy of the data and configuring test instances of all involved applications or systems. Such end-to-end testing can benefit not only the integration itself, but also the other systems involved in the integration.

Data design

    When designing a data integration, it is always important to decide how to represent the data. If the integration copies data from one application to another (data propagation), the design is largely dictated by the applications involved.

    However, for data consolidation and virtualization efforts, selecting a suitable representation of each entity can make or break a project. If the data representation is hard to understand, it will be avoided, resulting in applications adding point-to-point interfaces rather than using the centralized data location.

    Avoiding Unified Models

    Experience shows that trying to build a comprehensive, unified data model almost always fails due to the sheer complexity, even if just a handful of applications are involved.

    In practice, it is best to start with the required data consumers and then deliver just the necessary subset of data to serve their needs.

Integration using CloverDX

Connectivity

    One of the first questions asked is how to connect to the many different input and output systems in existence.

    Integrations and migrations often involve dozens, or even hundreds, of different applications, each with its own storage and connectivity requirements. This can include various file types in remote and local file systems, web services, database types, messaging queues and more.

    CloverDX provides a wide variety of “connectors” that provide the desired connectivity and also supports powerful abstraction of connectivity parameters. This makes it simple to work with the data regardless of source or target.

    For example, file processing abstraction allows you to work with your files whether local, in the Cloud or via FTP. Similarly, we offer powerful database abstraction so you can use the same set of components with any relational database.

    Such abstractions allow graph developers to build their solution locally before simply switching the connectivity parameters when ready for the production deployment.

    • Structured data: delimited and fixed-length files in any format
    • Hierarchical data: JSON, XML and custom formats
    • Excel files: XLS and XLSX
    • Relational databases: any database with a JDBC driver
    • NoSQL databases: MongoDB
    • Big Data: HDFS, S3
    • REST and SOAP APIs
    • Other: SalesForce, Tableau, email, LDAP, JMS, ...

Reusability

    In the world of software development, reusability is taken for granted through mechanisms such as functions and classes.

    In CloverDX we fully support reusability by allowing developers to build “child transformations”. A child transformation is a regular transformation that can be referenced from any other transformation. As well as providing the advantages of reusability, child transformations can significantly improve the readability of your top-level transformations.

    CloverDX also supports reusability in several other ways. You can manage your own code libraries, store externalized connection settings for your databases, create shared libraries of your data structures and more.

    CloverDX is fully compatible with all common version control systems, so you can collaborate on projects using your preferred SCM tools such as git, Mercurial and Subversion.

Automation and Jobflows

    It is important that your transformation tool allows developers to build complete pipelines consisting of several transformations and processes with inter-process dependencies.

    CloverDX handles this using Jobflows. A Jobflow executes and monitors internal processes (data transformations or other Jobflows) as well as external processes (typically scripts). Together with the powerful error handling and visual nature of the jobflow design process, CloverDX allows developers to very quickly build entire pipelines that properly react to success and failure.

    Thanks to its ability to automate entire processes, monitor and send appropriate notifications, CloverDX can quickly become a central orchestration point, notifying responsible and interested parties on the progress of migrations and other data integration processes in real time.

    API Endpoints

    CloverDX exposes multiple different APIs that are useful for integration with 3rd party tools. For example it is possible to trigger jobs via REST or SOAP API calls or to monitor various external systems for changes such as arriving files or JMS messages.
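
    As a rough illustration of what triggering a job over HTTP can look like from a client, the Python sketch below posts to a placeholder URL. The actual paths, payloads and authentication are defined by the CloverDX Server API documentation and are not reproduced here.

        import requests

        # Placeholder URL and credentials; consult the CloverDX Server documentation
        # for the real API paths and parameters.
        SERVER_URL = "https://clover.example.com/api/run-job"

        def trigger_job(job_name):
            """Ask the server to run a named job and return its response payload."""
            response = requests.post(
                SERVER_URL,
                json={"job": job_name},
                auth=("integration_user", "secret"),  # placeholder credentials
                timeout=30,
            )
            response.raise_for_status()
            return response.json()

        # trigger_job("nightly_consolidation")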

Data Structure Management (Metadata)

    You’ll often have to work with a large number of different data structures. Each application may have its own way of storing data even if they serve the same purpose. For example, consider something as simple as storing information about customers in a CRM system: there are countless ways of storing details about a person or a company, including the associated contact information.

    CloverDX can be thought of as “strongly typed”. Before processing each record, CloverDX needs to fully understand the structure and attributes of the data. This ensures that the transformation can enforce static business logic checking and prevent mismatched record types in various operations. This in turn can prevent logical errors when working with seemingly similar but actually different data structures.

    CloverDX’s ability to parametrize any aspect of a data transformation allows you to design generic transformations that generate their data type information on the fly based on the data they process. This can dramatically simplify the design of graphs that operate on more complex data.
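
    The same "know the structure before you process it" idea can be illustrated in plain Python with an explicit record definition. This is only an analogy for the concept; it is not how CloverDX metadata is actually defined.

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class CustomerRecord:
            """An explicit record definition: every field and its type is known
            before processing starts, so mismatches fail early and loudly."""
            customer_id: int
            email: str
            lifetime_value: float

        def parse_customer(row: dict) -> CustomerRecord:
            # Conversion fails immediately if a value cannot satisfy the declared type.
            return CustomerRecord(
                customer_id=int(row["customer_id"]),
                email=str(row["email"]),
                lifetime_value=float(row["lifetime_value"]),
            )

        print(parse_customer({"customer_id": "42", "email": "ann@example.com",
                              "lifetime_value": "1999.90"}))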

Validation

    Validation is one way to ensure clean, valid data. The aim is to automate the detection of as many invalid records as possible, minimizing the need for human intervention.

    To simplify the process, CloverDX provides many built-in components with UI-based configuration, removing the need for coding.

    For example, the Validator component allows developers to quickly and visually design complex, rule-based validations. These validation rules generate information for each error that can then be fed into a reporting pipeline for the creation of quality reports.

    The ProfilerProbe component allows for easy measurement of various data properties such as data patterns, basic statistical properties of the data (minimum, maximum etc.), unique values and more.

Creating APIs

    Many data integrations depend on being able to define and quickly expose APIs that can be consumed by external applications. In CloverDX, this can be very effectively done for REST APIs via the Data Services feature.

    CloverDX Data Services allows developers to design a REST API and visually develop transformation logic that sits behind this API call. The resulting API endpoint can be quickly published from the CloverDX Server.

    A CloverDX API endpoint can be used by many different systems such as Tableau, SalesForce and any other services that can call REST APIs.

    Behind each published endpoint, a regular transformation is launched to handle the request and produce the response.
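
    From the consumer's point of view, a published endpoint is used like any other REST API. The Python sketch below calls a placeholder URL with a query parameter; the real endpoint path, parameters and authentication depend entirely on how the Data Service is designed and published.

        import requests

        # Placeholder URL for a published Data Services endpoint.
        ENDPOINT = "https://clover.example.com/data-services/customers"

        def fetch_customers(segment):
            """Consume the published REST endpoint like any other web API."""
            response = requests.get(ENDPOINT, params={"segment": segment}, timeout=30)
            response.raise_for_status()
            return response.json()

        # for customer in fetch_customers("Enterprise"):
        #     print(customer)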