Data Ingestion

A reference guide

This data ingestion reference guide for managers and technical staff provides a deeper understanding of what is typically involved and how to prepare.

Basics

What is data ingestion?

    Data ingestion is undertaken in every vertical market and across all business lines. It is the process of moving, or on-boarding, data from one or more data sources into an application data store: a continuous stream of data must be read, transformed, and written into a target system so that it is ready for application use.

    Data ingestion differs from data migration in that a migration is a wholesale move from one system to another, with all the timing and coordination challenges that brings, whereas an ingestion usually involves

    • either pulling data from operational systems into other applications that need to work in parallel with the source application
    • or loading data from one or more data sources that may or may not be associated with an application.

    The data to be ingested can be streamed or batch-loaded into your infrastructure.

    • Streamed ingestion is chosen for real-time, transactional, event-driven applications, for example a credit card swipe that might require execution of a fraud detection algorithm.
    • Batched ingestion is used when data can, or needs to, be loaded in batches or groups of records. Batched ingestion is typically done at a much lower cadence, but with much higher efficiency (both modes are sketched below).
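
    For illustration, here is a minimal sketch (in plain Python, not CloverDX-specific) contrasting the two modes; the event sources, sinks and the "processed" flag are purely hypothetical stand-ins.

        from typing import Callable, Dict, Iterable, List

        def handle_stream(events: Iterable[Dict], write_event: Callable[[Dict], None]) -> None:
            # Streamed ingestion: each event is transformed and written as it arrives,
            # so per-event logic (e.g. a fraud check on a card swipe) can run immediately.
            for event in events:
                write_event({**event, "processed": True})

        def handle_batch(records: Iterable[Dict], load_batch: Callable[[List[Dict]], None],
                         batch_size: int = 5000) -> None:
            # Batched ingestion: records are accumulated and loaded in groups,
            # trading latency for far fewer (and more efficient) write operations.
            batch: List[Dict] = []
            for record in records:
                batch.append({**record, "processed": True})
                if len(batch) >= batch_size:
                    load_batch(batch)
                    batch = []
            if batch:
                load_batch(batch)

        # Example usage with stand-in sources and sinks:
        handle_stream([{"id": 1}, {"id": 2}], write_event=print)
        handle_batch(({"id": i} for i in range(12000)), load_batch=lambda b: print(f"loaded {len(b)} records"))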

Why should I care?

    Despite being a common process, setting up a data ingestion pipeline is rarely as simple as one would think. Complications arise as you are often consuming data managed and understood by third parties and trying to bend it to your own needs.

    This can be especially challenging if the source data is inadequately documented and managed. For example, a marketing team might need to load data from an operational system into a departmental application. You will need to consider the following.

    • Is the data to be ingested of sufficient quality?
    • After the data has been ingested, is it usable “as is” in the target application?
    • Is the data stream reliable and stable?
    • How will you access the source data, and to what extent does IT need to be involved?
    • How often does the source data update, and how often should you refresh it?
    • How will the process be automated?

Examples

    The following examples are a few real world ingestion use cases encountered by CloverDX customers.

    Wealth Management - Salesforce

    A wealth management firm needs to load tens of millions of data records on a daily basis for use by a nationwide network of financial advisors.

    • Data has to be read from multiple data sources
    • Complex transformations need to take place
    • Data has to be automatically validated and cleansed
    • Data has to be written to Salesforce, taking into account API timeouts and batch restrictions (see the sketch after this list)
    • All processing needs to be completed in under one hour to provide up-to-date information to advisors
    • The previous script-based approach was buggy and had become impossible to maintain
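
    As a rough illustration of the batching and retry concern above, the following sketch splits records into limited-size batches and retries on timeout. The batch limit, retry policy and post_batch function are illustrative assumptions, not Salesforce-specific values or APIs.

        import time
        from typing import Callable, Dict, List, Sequence

        def write_in_batches(records: Sequence[Dict],
                             post_batch: Callable[[List[Dict]], None],
                             batch_limit: int = 200,
                             max_retries: int = 3,
                             backoff_seconds: float = 5.0) -> None:
            for start in range(0, len(records), batch_limit):
                batch = list(records[start:start + batch_limit])
                for attempt in range(1, max_retries + 1):
                    try:
                        post_batch(batch)   # a real implementation would call the target API here
                        break
                    except TimeoutError:
                        if attempt == max_retries:
                            raise
                        time.sleep(backoff_seconds * attempt)   # simple linear backoff

        # Example usage with a stand-in API call:
        write_in_batches([{"Id": i} for i in range(500)], post_batch=lambda b: print(f"sent {len(b)}"))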

    EE - Tableau+Redshift Reporting Platform

    EE, the UK’s largest mobile operator, built a Redshift-based data warehouse to take data from many in-house systems and deliver a corporate reporting platform.

    • Data ingested from multiple in-house systems
    • Data held on 28 million EE customers
    • Extensive mappings and transformations
    • Data loaded into Amazon Redshift
    • Ingestion involves both streaming and batch loading triggered by schedules and events
    • Complete removal of the need to write and maintain scripts
    • Frees developers from mundane data processing tasks, allowing them to focus on core business work
    • Tool-based approach more efficient and maintainable than scripts

    GoodData - API driven data load for customer dashboards

    GoodData, a cloud-based analytics platform, provides customers with an API to load, map and transform data.

    • Each customer has proprietary data
    • Each data set has its own set of rules associated with the ingestion process
    • The ingestion needs to be able to update existing data as well as overwrite it
    • GoodData needs to be able to build the transformations very quickly on behalf of the customer
    • Each customer is provided with a unique REST API endpoint, configured in CloverDX, that launches the ingestion process

Things to consider

Automation

    One of the primary considerations for a data ingestion project is how to automate the process. There will be a constant flow of data that you will not want to manually supervise or initiate.

    Human involvement is one of the biggest sources of failed or interrupted ingestions, so eliminating as much human interaction as possible is a key consideration in ensuring trouble free data ingestion. This is even more important if the ingestion occurs frequently.

    Other complexities can also arise.

    • A need to guarantee data availability with failovers, data recovery plans, standby servers, operations continuity etc.
    • Setting automated data quality thresholds (see the sketch after this list)
    • Providing an ingest alert mechanism with associated logs and reports
    • Ensuring minimum data quality criteria are met at the batch, rather than record, level (data profiling)
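
    A minimal sketch of such a batch-level quality gate, assuming illustrative thresholds and a stand-in alert channel, might look like this:

        from typing import Dict, List

        def alert(message: str) -> None:
            print("ALERT:", message)   # stand-in for an email, chat or monitoring integration

        def batch_quality_gate(records: List[Dict],
                               required_fields=("id", "email"),
                               max_error_rate: float = 0.02) -> bool:
            # Count records missing any required field, then compare the error
            # rate against the threshold at the batch (not record) level.
            invalid = sum(1 for r in records if any(not r.get(f) for f in required_fields))
            error_rate = invalid / len(records) if records else 0.0
            if error_rate > max_error_rate:
                alert(f"Batch rejected: {error_rate:.1%} invalid records (threshold {max_error_rate:.1%})")
                return False
            return True

        # Example usage:
        accepted = batch_quality_gate([{"id": 1, "email": "a@b.c"}, {"id": 2, "email": ""}])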

Data mapping

    Data mapping is a process that maps data from the format used in one system to a format used by another system. The mapping can take a significant amount of time as

    • multiple applications often store data for the same entity
    • applications can be unfamiliar or poorly documented.

    The data mapping documentation has a significant impact on the overall implementation effort as, in many cases, there is no single correct way of mapping data between different structures. Common reasons for this are

    • there is no direct one to one mapping between values
    • data structures representing the same entity are too different.

    Proper data mapping requires detailed knowledge from the data discovery project phase. It also usually involves substantial input from data consumers.

    The mapping process is simplified with tools that visualize the mapping between different entities and provide automation of the mapping process.
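
    To make this concrete, here is a minimal sketch of a field-level mapping with made-up source and target fields; real mappings are usually driven by a mapping specification produced during data discovery.

        from typing import Dict

        FIELD_MAP = {            # source field -> target field
            "cust_no": "customer_id",
            "fname":   "first_name",
            "lname":   "last_name",
            "tel":     "phone",
        }

        def map_customer(source: Dict) -> Dict:
            target = {dst: source.get(src) for src, dst in FIELD_MAP.items()}
            # A value with no one-to-one correspondence has to be derived:
            target["full_name"] = f"{source.get('fname', '')} {source.get('lname', '')}".strip()
            return target

        print(map_customer({"cust_no": 42, "fname": "Ada", "lname": "Lovelace", "tel": "555-0100"}))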

    Data mapping also needs to consider the future development of the applications involved. Data structures might change over time, and it is important for the mapping and its implementation to be able to accommodate such changes as easily as possible.

Data ingestion parameters

    There are typically four primary considerations when setting up new data pipelines.

    • Velocity – at what speed does the data flow into your system or application?
    • Size – what volume of data needs to be loaded?
    • Frequency – do you need to process in real time or can you batch the loads?
    • Format – what format is your data in: structured, semi-structured, unstructured? Your solution design should account for all of your formats (the sketch below captures these four parameters).
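
    As a simple illustration, these four parameters could be captured up front for each pipeline roughly as follows; the field names and values are purely illustrative.

        from dataclasses import dataclass
        from enum import Enum

        class Frequency(Enum):
            REAL_TIME = "real_time"
            MICRO_BATCH = "micro_batch"
            DAILY_BATCH = "daily_batch"

        @dataclass
        class IngestionProfile:
            source_name: str
            velocity_records_per_sec: int   # expected peak inflow rate
            size_gb_per_day: float          # expected daily volume
            frequency: Frequency            # how often loads run
            data_format: str                # e.g. "csv", "json", "parquet"

        crm_feed = IngestionProfile("crm_customers", 50, 2.5, Frequency.DAILY_BATCH, "csv")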

    It is also very important to consider the future of the ingestion pipeline, for example growing data volumes or the increasing demands of end users, who typically want their data faster.

    If these are not properly considered, there can be negative consequences, typically degraded performance or availability and the resulting impact on SLAs.

Data cleansing

    During data discovery, you will often find that the data cannot be used in its current form and first needs to be cleansed. There are many different reasons for low data quality, ranging from simple ones (anything involving human data entry is likely to have various errors including typos, missing data, data misuse etc.) all the way to complex issues stemming from improper data handling practices and software bugs.

    Data cleansing is the process of taking “dirty” data in its original location and cleaning it before it is used in any data transformation. Data cleansing is often an integral part of the business logic with the data being cleaned in the transformation but left unchanged in the originating system. Other approaches can also be used. For example, a separate, clean copy of the data can be created if the data needs to be reused or if cleansing is time-consuming and requires human interaction.
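
    A minimal cleansing sketch, assuming typical human-entry issues such as stray whitespace, inconsistent case and missing values (the field names and rules are illustrative only):

        from typing import Dict, Optional

        def clean_record(record: Dict) -> Dict:
            cleaned = dict(record)
            # Trim whitespace and normalise case on text fields.
            for field in ("first_name", "last_name", "city"):
                if isinstance(cleaned.get(field), str):
                    cleaned[field] = cleaned[field].strip().title()
            # Normalise the email address; treat empty strings as missing.
            email: Optional[str] = cleaned.get("email")
            cleaned["email"] = email.strip().lower() if email and email.strip() else None
            return cleaned

        print(clean_record({"first_name": "  aDA ", "email": " ADA@EXAMPLE.COM "}))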

Governance and safeguards

    Another important aspect of the planning phase is to decide how to expose the data to users. Typical questions asked in this phase of pipeline design can include:

    • Will this be used internally?
    • Will this be used externally?
    • Who will have access and what kind of access will they have?
    • Do you have sensitive data that will need to be protected and regulated?

    These considerations are often not planned properly and result in delays, cost overruns and increased end user frustration.

Pitfalls

Real-time or batch?

    It is important to understand the required frequency of ingestion as it will have a major impact on the performance, budget and complexity of the project.

    The decision process often starts with the users and the systems that produce the data. Typical questions that are asked at this stage are:

    • How frequently does the source publish new data?
    • Is the source batched, streamed or event-driven?
    • Does the whole pipeline need to be real-time or is batching sufficient to meet the SLAs and keep end users happy?

    There is a spectrum of options between fully real-time and fully batched approaches. For example, it might be possible to micro-batch your pipeline to get near real-time updates, or even to combine different approaches for different source systems.
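
    For example, a micro-batching loop might collect events until either a batch size or a time window is reached, giving near real-time loads without a per-event write; the thresholds below are illustrative assumptions.

        import time
        from typing import Callable, Dict, Iterable, List

        def micro_batch(events: Iterable[Dict],
                        load: Callable[[List[Dict]], None],
                        max_records: int = 1000,
                        max_seconds: float = 5.0) -> None:
            batch: List[Dict] = []
            window_start = time.monotonic()
            for event in events:
                batch.append(event)
                # Flush when either the size or the time threshold is hit.
                if len(batch) >= max_records or time.monotonic() - window_start >= max_seconds:
                    load(batch)
                    batch, window_start = [], time.monotonic()
            if batch:
                load(batch)

        micro_batch(({"id": i} for i in range(2500)), load=lambda b: print(f"loaded {len(b)} records"))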

    Understanding the requirements of the whole pipeline in detail will ensure that the correct ingestion design decisions are made.

Not understanding your data quality

    Data quality is one of the most underestimated properties of data. Data in any system that has been in production for a while can have all sorts of data quality issues. These can range from simple issues such as typos through to missing or invalid data.

    Data owners often have only a vague idea about the overall quality of their data and the impact on subsequent data oriented processes. While they will clearly understand more obvious issues, they may be completely unaware of more complex or legacy problems.

    We would strongly recommend doing a full data quality evaluation of your production data early on in a data centric project. This can be complicated by security restrictions, but in our experience, data in a test environment never fully captures the depth and complexity of the issues that appear in production systems.

Not planning for new sources, technologies and applications

    In today’s world, businesses are constantly looking to optimize their systems to understand their data, expose it flexibly and extract maximum value from it.

    As a result, it is important to factor in these eventualities and be as prepared as possible to deal with

    • short term requests from management
    • system and data structure changes that might be forced on you in the future
    • new systems coming onstream that need to be incorporated

Custom coded solutions

    Companies regularly take a coding approach when working with data. This can work perfectly well in the short term or for simpler projects. However, there are some important considerations over time.

    • As the amount of code grows, maintainability becomes a serious challenge
    • Logging is typically an afterthought so when issues arise, there is a lack of diagnostic information when you need it
    • Integration with new technology is slow to implement
    • Performance in the early days is rarely a consideration but years later it often is and serious bottlenecks can develop
    • Performance bottlenecks can require a full refactoring that can take a great deal of time
    • Bugs can and will happen and debugging can be hard and cause downtime or interruptions
    • Developers carry a lot of knowledge in their heads and when they leave they take it with them
    • Code is often poorly documented, so maintenance depends on institutional knowledge that disappears when those familiar with the codebase leave the company

Not understanding infrastructure

    Most data integration projects involve IT and require a thorough understanding of your infrastructure. There are many moving parts to a project and these all need to be understood and planned for in order to avoid unforeseen delays mid-project.

    • Appropriately configured firewall rules that allow connection to external data sources
    • Suitable access permissions to all systems
    • Knowledge of existing data structures, especially those in other domains or departments
    • Impact of temporary unavailability of source data on your ingestion processes
    • Knowledge of governance and compliance processes, such as password rotations, and how they impact ingestion

Inadequate system knowledge

    Data applications, especially mission-critical ones, tend to be in production for extended periods of time as companies typically do not want to invest in technology unless absolutely necessary.

    This means that institutional knowledge of applications is often lost due to inadequate documentation or staff turnover.

    A lack of system knowledge can negatively affect data migration or integration projects due to over-optimistic estimates or data mapping issues that only manifest themselves in later stages of the project.

    As such, a lack of knowledge can be very expensive and any poor estimates or deficient data mapping exercises can lead to costly project restarts and substantial budget overruns.

Changing requirements or scope

    It is not uncommon for a project’s scope to change during its lifetime, which is why so many IT projects end up over budget.

    One of the most common reasons for this is that the involved applications or data sources are not necessarily fully understood or taken into account during the analysis phase. This is compounded by the fact that documentation of the systems involved is often incomplete, leading to guesses rather than estimates of what is required to implement the business logic. Such guesses are normally overly optimistic.

    The best way to avoid this is to be thorough and honest during the design phase and to ensure that all stakeholders are invited to contribute to the scope discussions. While this can be difficult and time-consuming in larger organizations, the benefits can easily outweigh the expense of a slightly longer design phase.

Ignoring the possibility of failures

    Contemplating failure is rarely taken seriously enough. Ensuring that contingencies are in place, should a migration or other integration process fail, is very important, especially for mission-critical applications. A failed migration can leave both the legacy and the new system inoperable, causing major organizational disruption.

    It is important to set thresholds so that the migration phase is only considered failed if it makes more sense to perform a rollback instead of living with the result and fixing the issues afterwards.

    Some projects might require a success threshold close to 100%, whereas others can tolerate a lower one. If the new system, even with its migration shortcomings, delivers a good enough outcome and its improved functionality can be taken advantage of, there may be no point in rolling back.
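
    A minimal sketch of such a threshold check, with purely illustrative numbers and stand-in outcomes, might look like this:

        def decide_rollback(migrated_ok: int, migrated_failed: int,
                            success_threshold: float = 0.98) -> str:
            total = migrated_ok + migrated_failed
            success_rate = migrated_ok / total if total else 0.0
            if success_rate >= success_threshold:
                return "keep the new system and fix the remaining issues forward"
            return "roll back to the legacy system"

        print(decide_rollback(migrated_ok=9920, migrated_failed=80))   # 99.2% -> keep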

    Rollback is especially important for trickle data migrations. As they have multiple phases, the chance of at least one failure is high. As a result, it is important to design rollbacks in such a way that they do not make subsequent data migration phases unnecessarily difficult or even impossible.

Best Practice

Data Quality Monitoring

    Monitoring of data quality for every involved application is vital in order to prevent pollution of multiple applications with low quality data coming from a single source.

    Monitoring often consists of data validation rules that are applied to each record as it is transformed into its destination format.

    The choice of which data attributes to monitor, and how to monitor them, is one of the key decisions in the design phase of the whole project. If you have too much monitoring with overly detailed reports, you can overwhelm stakeholders, resulting in important problems being overlooked. On the other hand, too little monitoring is undesirable, as important observations are simply not reported.

    Proper data quality monitoring is, therefore, about carefully balancing the investment in building monitoring rules with the volume of output.

Reporting

    Reporting is a key part of data migrations and other integrations, and is rarely done well. Well-executed reporting ensures that stakeholders get all the status information very quickly and are able to react in a short timeframe. This, in turn, shortens the time it takes to determine whether or not everything was successful.

    A variety of reports could be generated. Some common examples are

    • data quality reports
    • effort reports
    • resource utilization reports

    A well designed reporting pipeline will notify relevant parties so they can take necessary action should issues arise.

Integration using CloverDX

Connectivity

    One of the first questions asked is how to connect to the many different input and output systems in existence.

    Integrations and migrations often involve dozens, or even hundreds, of different applications, each with its own storage and connectivity requirements. This can include various file types in remote and local file systems, web services, database types, messaging queues and more.

    CloverDX offers a wide variety of “connectors” that provide the desired connectivity and also supports powerful abstraction of connectivity parameters. This makes it simple to work with the data regardless of source or target.

    For example, file processing abstraction allows you to work with your files whether they are local, in the cloud or accessed via FTP. Similarly, we offer powerful database abstraction so you can use the same set of components with any relational database.

    Such abstractions allow graph developers to build their solution locally before simply switching the connectivity parameters when ready for the production deployment.

    • Structured data: delimited and fixed-length files in any format
    • Hierarchical data: JSON, XML and custom formats
    • Excel files: XLS and XLSX
    • Relational databases: any database with JDBC driver
    • NoSQL databases: MongoDB
    • Big Data: HDFS, S3
    • REST and SOAP APIs
    • Other: Salesforce, Tableau, email, LDAP, JMS, ...

Data Structure Management (Metadata)

    You’ll often have to work with a large number of different data structures. Each application may have its own way of storing data, even when the applications serve the same purpose. For example, consider something as simple as storing information about customers in a CRM system: there are countless ways of storing details about a person or a company, including their associated contact information.

    CloverDX can be thought of as “strongly typed”. Before processing each record, CloverDX needs to fully understand the structure and attributes of the data. This ensures that the transformation can enforce static business logic checking and prevent mismatched record types in various operations. This in turn can prevent logical errors when working with seemingly similar but actually different data structures.

    CloverDX’s ability to parametrize any aspect of a data transformation allows you to design generic transformations that generate their data type information on the fly based on the data they process. This can dramatically simplify the design of graphs that operate on more complex data.

Validation

    Validation is one way to ensure clean, valid data. The aim is to automate the detection of as many invalid records as possible, minimizing the need for human intervention.

    To simplify the process, CloverDX provides many built-in components with a UI based configuration, removing the need for coding.

    For example, the Validator component allows developers to quickly and visually design complex, rule-based validations. These validation rules generate information for each error that can then be fed into a reporting pipeline for the creation of quality reports.
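
    The same idea, expressed as a plain-code analogue rather than the Validator component itself, is a set of rules applied per record with each failure captured for reporting (the rules below are made up for illustration):

        import re
        from typing import Callable, Dict, List, Tuple

        Rule = Tuple[str, Callable[[Dict], bool]]   # (error message, predicate that must hold)

        RULES: List[Rule] = [
            ("missing customer id",  lambda r: bool(r.get("customer_id"))),
            ("invalid email format", lambda r: bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "")))),
            ("negative balance",     lambda r: r.get("balance", 0) >= 0),
        ]

        def validate(record: Dict) -> List[Dict]:
            # One error entry per failed rule, ready to feed a reporting pipeline.
            return [{"record": record.get("customer_id"), "error": message}
                    for message, predicate in RULES if not predicate(record)]

        print(validate({"customer_id": "C1", "email": "not-an-email", "balance": -10}))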

    The ProfilerProbe component allows for easy measurement of various data properties such as data patterns, basic statistical properties of the data (minimum, maximum etc.), unique values and more.

Data mapping

    Data mapping is a fundamental part of data migration, data ingestion and other integration processes. It is essential to support the widest range of data transformation options as well as various means of streamlining transformation design and testing.

    CloverDX provides many mapping related components ranging from the simple (sorting, deduplication, lookups etc.) through to various programmable components that allow you to code your own business logic (Reformat, Normalizer, Denormalizer, Joiners, etc.).

    Many CloverDX components allow you to code your own business logic using our own scripting language, CTL, or Java, giving you access to the widest possible range of your own or external libraries.

    CloverDX provides comprehensive code debugging. Further, we also support data debugging by allowing you to inspect data in full detail as it flows through each part of your transformation.

Reusability

    In the world of software development, reusability is taken for granted through mechanisms such as functions and classes.

    In CloverDX we fully support reusability by allowing developers to build “child transformations”. A child transformation is a regular transformation that can be referenced from any other transformation. As well as providing the advantages of reusability, child transformations can significantly improve the readability of your top-level transformations.

    CloverDX also supports reusability in several other ways. You can manage your own code libraries, store externalized connection settings for your databases, create shared libraries of your data structures and more.

    CloverDX is fully compatible with all common version control systems, so you can collaborate on projects using your preferred SCM tools such as git, Mercurial and Subversion.

Automation and Jobflows

    It is important that your transformation tool allows developers to build complete pipelines consisting of several transformations and processes with inter-process dependencies.

    CloverDX handles this using Jobflows. A Jobflow executes and monitors internal processes (data transformations or other Jobflows) as well as external processes (typically scripts). Together with the powerful error handling and visual nature of the jobflow design process, CloverDX allows developers to very quickly build entire pipelines that properly react to success and failure.

    Thanks to its ability to automate entire processes, monitor them and send appropriate notifications, CloverDX can quickly become a central orchestration point, keeping responsible and interested parties informed of the progress of migrations and other data integration processes in real time.

    API Endpoints

    CloverDX exposes multiple APIs that are useful for integration with third-party tools. For example, it is possible to trigger jobs via REST or SOAP API calls, or to monitor various external systems for changes such as arriving files or JMS messages.

Creating APIs

    Many data integrations depend on being able to define and quickly expose APIs that can be consumed by external applications. In CloverDX, this can be very effectively done for REST APIs via the Data Services feature.

    CloverDX Data Services allows developers to design a REST API and visually develop transformation logic that sits behind this API call. The resulting API endpoint can be quickly published from the CloverDX Server.

    A CloverDX API endpoint can be used by many different systems such as Tableau, Salesforce and any other service that can call REST APIs.
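
    As an illustration, an external system might call such an endpoint roughly as follows; the URL, payload and response shape are placeholders, not the actual CloverDX API.

        import json
        import urllib.request

        def trigger_ingestion(payload: dict,
                              url: str = "https://example.com/data-service/ingest") -> dict:
            request = urllib.request.Request(
                url,
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            with urllib.request.urlopen(request, timeout=30) as response:
                return json.loads(response.read().decode("utf-8"))

        # Example usage (would only succeed against a real, published endpoint):
        # result = trigger_ingestion({"customer": "42", "file": "daily_export.csv"})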

    Typically, each call launches a transformation that implements the logic behind the endpoint.