A data ingestion reference guide for managers and technical staff, providing a deeper understanding of what is often involved and how to prepare.
What is data ingestion?
Data ingestion is a process that is undertaken in every vertical market and across all business lines. It is the process of moving or on-boarding data from one or more data sources into an application data store. A continuous stream of data must be read, transformed, and written into a target system so it is ready for application usage.
Data ingestion differs from data migration in that a migration is a wholesale move from one system to another, with all the timing and coordination challenges that brings, whereas an ingestion is usually an ongoing process that repeatedly or continuously feeds data into a target system.
The data to be ingested can be streamed or batch-loaded into your infrastructure.
Why should I care?
Despite being a common process, setting up a data ingestion pipeline is rarely as simple as one would think. Complications arise as you are often consuming data managed and understood by third parties and trying to bend it to your own needs.
This can be especially challenging if the source data is inadequately documented and managed. For example, a marketing team might need to load data from an operational system into a departmental application.
The following examples are a few real world ingestion use cases encountered by CloverDX customers.
A wealth management firm needs to load tens of millions of data records on a daily basis for use by a nationwide network of financial advisors.
EE, the UK’s largest mobile operator, built a Redshift-based data warehouse to take data from many in-house systems and deliver a corporate reporting platform.
GoodData, a cloud-based analytics platform, provides customers with an API to load, map and transform data.
Things to consider
One of the primary considerations for a data ingestion project is how to automate the process. There will be a constant flow of data that you will not want to manually supervise or initiate.
Human involvement is one of the biggest sources of failed or interrupted ingestions, so eliminating as much human interaction as possible is a key consideration in ensuring trouble free data ingestion. This is even more important if the ingestion occurs frequently.
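As a minimal sketch of what removing human interaction can look like in practice (Python; the function and job names here are hypothetical, not part of any specific tool), transient failures can be retried automatically and only escalated to a person when all retries are exhausted:

```python
import time

def run_with_retries(job, max_attempts=3, delay_seconds=0):
    """Run an ingestion job, retrying transient failures without human intervention."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # escalate only after all automated retries are exhausted
            time.sleep(delay_seconds)

# Hypothetical job for illustration: fails twice, then succeeds.
calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient source outage")
    return "loaded"

result = run_with_retries(flaky_load)
```

The same pattern extends naturally to scheduled runs: a scheduler triggers the job, and the retry wrapper absorbs the routine failures that would otherwise need a person to restart the pipeline.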
Other complexities can also arise.
Data mapping is a process that maps data from the format used in one system to a format used by another system. The mapping can take a significant amount of time, as it requires a detailed understanding of the structures on both sides and of how they relate to each other.
The data mapping documentation has a significant impact on the overall implementation effort as, in many cases, there is no single correct way of mapping data between different structures. Common reasons for this include ambiguous correspondences between fields, differences in granularity, and structural mismatches between the two systems.
Proper data mapping requires detailed knowledge from the data discovery project phase. It also usually involves substantial input from data consumers.
The mapping process is simplified with tools that visualize the mapping between different entities and provide automation of the mapping process.
Data mapping also needs to consider the future development of applications involved. In such cases some data structures might change and it is important for the mapping and its implementation to be able to accommodate such changes as easily as possible.
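At its core, a field mapping can be expressed as a simple lookup table, which also makes it easy to update when structures change. The sketch below (Python; all field names are illustrative assumptions, not taken from any real system) shows the idea:

```python
# Hypothetical mapping from a source CRM record to a target schema.
FIELD_MAP = {
    "cust_name": "full_name",
    "cust_email": "email",
    "cust_phone": "phone",
}

def map_record(source):
    """Rename fields per FIELD_MAP; fields with no mapping are dropped deliberately."""
    return {target: source[src] for src, target in FIELD_MAP.items() if src in source}

mapped = map_record({"cust_name": "Ada Lovelace",
                     "cust_email": "ada@example.com",
                     "legacy_id": 7})
```

Keeping the mapping in data rather than code means a structural change in either system becomes an edit to the table, not a rewrite of the transformation.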
Data ingestion parameters
There are typically four primary considerations when setting up new data pipelines.
It is also very important to consider the future of the ingestion pipeline: for example, growing data volumes or the increasing demands of end users, who typically want data faster.
If these are not properly considered, there can be negative consequences, typically for performance and availability, with a potential impact on SLAs.
During data discovery, you will often find that the data cannot be used in its current form and first needs to be cleansed. There are many different reasons for low data quality, ranging from simple ones (anything involving human data entry is likely to have various errors including typos, missing data, data misuse etc.) all the way to complex issues stemming from improper data handling practices and software bugs.
Data cleansing is the process of taking “dirty” data in its original location and cleaning it before it is used in any data transformation. Data cleansing is often an integral part of the business logic with the data being cleaned in the transformation but left unchanged in the originating system. Other approaches can also be used. For example, a separate, clean copy of the data can be created if the data needs to be reused or if cleansing is time-consuming and requires human interaction.
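To make this concrete, here is a minimal cleansing sketch (Python; the field names and rules are illustrative assumptions). It produces a cleaned copy for the transformation while leaving the source records untouched, and rejects records that cannot be repaired:

```python
def cleanse(record):
    """Return a cleaned copy of the record, or None if the record is unusable.
    The source record is left unchanged, mirroring cleansing done inside the
    transformation rather than in the originating system."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    email = cleaned.get("email", "")
    if "@" not in email:   # minimal validity check; real rules would be far richer
        return None
    cleaned["email"] = email.lower()
    return cleaned

dirty = [{"email": " Ada@Example.COM "}, {"email": "not-an-email"}]
clean = [r for r in (cleanse(d) for d in dirty) if r is not None]
```

Rejected records would normally be routed to a quarantine area or a report for human review rather than silently dropped.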
Governance and safeguards
Another important aspect of the planning phase is to decide how to expose the data to users: who needs access, in what form, and with what safeguards in place.
These considerations are often not planned properly and result in delays, cost overruns and increased end user frustration.
Real-time or batch?
It is important to understand the required frequency of ingestion as it will have a major impact on the performance, budget and complexity of the project.
The decision process often starts with the users and with the systems that produce the data: how quickly do users need fresh data, and how often can each source deliver it?
There is a spectrum of approaches between real-time and batched approaches. For example, it might be possible to micro-batch your pipeline to get near real-time updates or even implement various combinations of different approaches for different source systems.
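Micro-batching itself is conceptually simple: records arriving on a stream are grouped into small batches and loaded at short intervals. A minimal sketch (Python; purely illustrative, not tied to any particular ingestion tool):

```python
def micro_batches(stream, batch_size=3):
    """Group an incoming stream of records into small batches
    so a batch-oriented loader can deliver near real-time updates."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

batches = list(micro_batches(range(7), batch_size=3))
```

In a real pipeline the batch boundary would typically be time-based as well as size-based, so that a slow source still produces timely updates.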
Understanding the requirements of the whole pipeline in detail will ensure that the correct ingestion design decisions are made.
Not understanding your data quality
Data quality is one of the most underestimated properties of data. Data in any system that has been in production for a while can have all sorts of data quality issues. These can range from simple issues such as typos through to missing or invalid data.
Data owners often have only a vague idea about the overall quality of their data and the impact on subsequent data oriented processes. While they will clearly understand more obvious issues, they may be completely unaware of more complex or legacy problems.
We would strongly recommend doing a full data quality evaluation of your production data early on in a data centric project. This can be complicated by security restrictions, but in our experience, data in a test environment never fully captures the depth and complexity of the issues that appear in production systems.
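Even a basic profiling pass over a sample of production data can surface issues early. As an illustrative sketch (Python; field names and sample data are hypothetical), computing per-field missing-value rates is often the first step:

```python
def profile(records, fields):
    """Per-field rate of missing values across a sample of records."""
    missing = {f: 0 for f in fields}
    for r in records:
        for f in fields:
            if r.get(f) in (None, ""):
                missing[f] += 1
    return {f: missing[f] / len(records) for f in fields}

# Hypothetical sample drawn from a production extract.
sample = [
    {"name": "Ada", "phone": ""},
    {"name": "", "phone": "555-0100"},
    {"name": "Grace", "phone": ""},
]
missing_rates = profile(sample, ["name", "phone"])
```

Extending the same loop to value patterns, ranges and uniqueness gives a rough but honest picture of production data quality before the design is locked in.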
Not planning for new sources, technologies and applications
In today’s world, businesses are constantly looking to optimize their systems to understand their data, expose it flexibly and extract maximum value from it.
As a result, it is important to factor in these eventualities and be as prepared as possible to deal with new data sources, technologies and applications as they emerge.
Custom coded solutions
Companies regularly take a coding approach when working with data. This can work perfectly well in the short term or for simpler projects. However, there are some important considerations over time.
Not understanding infrastructure
Most data integration projects involve IT and require a thorough understanding of your infrastructure. There are many moving parts to a project, and these all need to be understood and planned for in order to avoid unforeseen delays mid-project.
Inadequate system knowledge
Data applications, especially mission-critical ones, tend to be in production for extended periods of time as companies typically do not want to invest in technology unless absolutely necessary.
This means that institutional knowledge of applications is often lost due to inadequate documentation or staff turnover.
A lack of system knowledge can negatively affect data migration or integration projects due to over-optimistic estimates or data mapping issues that only manifest themselves in later stages of the project.
As such, a lack of knowledge can be very expensive and any poor estimates or deficient data mapping exercises can lead to costly project restarts and substantial budget overruns.
Changing requirements or scope
It is not uncommon for the project scope to change during a project’s lifetime, which is why so many IT projects end up over budget.
One of the most common reasons for this is that the involved applications or data sources are not necessarily fully understood or taken into account during the analysis phase. This is compounded by the fact that documentation of the systems involved is often incomplete, leading to guesses rather than estimates of what is required to implement the business logic. Such guesses are normally overly optimistic.
The best way to avoid this is to be thorough and honest during the design phase and to ensure that all stakeholders are invited to contribute to the scope discussions. While this can be difficult and time-consuming in larger organizations, the benefits can easily outweigh the expense of a slightly longer design phase.
Ignoring the possibility of failures
Contemplating failures is rarely taken seriously enough. Ensuring that contingencies are in place, should a migration or other integration process fail, is very important, especially for mission-critical applications. A failed migration can leave both the legacy and the new system inoperable, causing major organizational disruption.
It is important to set thresholds so that the migration phase is only considered failed if it makes more sense to perform a rollback instead of living with the result and fixing the issues afterwards.
Some projects might require a success threshold close to 100% whereas others could have a lower threshold. If the benefits of using the new system, even with its migration shortcomings, deliver a good enough outcome, and the improved functionality of the new system can be taken advantage of, there may be no point in rolling back.
Rollback is especially important for trickle data migrations. As they have multiple phases, the chance of at least one failure is high. As a result, it is important to design rollbacks in such a way that subsequent data migration phases are not made unnecessarily difficult or even impossible.
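The threshold decision itself can be stated very simply. The sketch below (Python; the numbers and threshold values are illustrative assumptions) shows how the same migration result can lead to opposite decisions depending on the agreed threshold:

```python
def should_roll_back(migrated_ok, total, success_threshold):
    """Decide whether a migration phase counts as failed.
    The threshold is project-specific: near 1.0 for critical data,
    lower where fixing issues forward is acceptable."""
    success_rate = migrated_ok / total
    return success_rate < success_threshold

# 97% of records migrated cleanly:
decision_strict = should_roll_back(970, 1000, success_threshold=0.999)   # roll back
decision_lenient = should_roll_back(970, 1000, success_threshold=0.95)   # fix forward
```

The hard part is not the arithmetic but agreeing the threshold with stakeholders before the migration runs, so the decision is mechanical rather than political when a failure occurs.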
Data Quality Monitoring
Monitoring of data quality for every involved application is vital in order to prevent pollution of multiple applications with low quality data coming from a single source.
Monitoring often consists of data validation rules that are applied to each record as it is transformed into its destination format.
Choice of data attributes to monitor, and how to monitor them, is one of the key decisions of the design phase of the whole project. If you have too much monitoring with overly detailed reports, you can overwhelm stakeholders resulting in important problems being overlooked. On the other hand, too little monitoring is undesirable as important observations are simply not being reported.
Proper data quality monitoring is, therefore, about carefully balancing the investment in building monitoring rules with the volume of output.
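A per-record validation pass is typically just a list of rules applied to each record, with the collected errors feeding the quality report. As an illustrative sketch (Python; the rules and field names are hypothetical):

```python
# Illustrative rule set; each rule returns an error message or None.
RULES = [
    lambda r: None if r.get("id") else "missing id",
    lambda r: None if "@" in r.get("email", "") else "invalid email",
]

def validate(record):
    """Apply every rule to a record and collect the errors for reporting."""
    return [err for err in (rule(record) for rule in RULES) if err is not None]

errors = validate({"id": 42, "email": "bad-address"})
```

Because the rules are data-driven, adding or retiring a check is a one-line change, which helps keep the monitoring effort in proportion to the value of its output.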
Reporting is a key part of data migrations and other integrations, and is rarely done well. Well-executed reporting ensures that stakeholders get status information quickly and are able to react in a short timeframe. This, in turn, shortens the time it takes to determine whether or not everything was successful.
A variety of reports could be generated. Some common examples are record counts, rejected-record summaries and reconciliation reports.
A well designed reporting pipeline will notify relevant parties so they can take necessary action should issues arise.
Integration using CloverDX
One of the first questions asked is how to connect to the many different input and output systems in existence.
Integrations and migrations often involve dozens, or even hundreds, of different applications, each with its own storage and connectivity requirements. This can include various file types in remote and local file systems, web services, database types, messaging queues and more.
CloverDX provides a wide variety of “connectors” that provide the desired connectivity and also supports powerful abstraction of connectivity parameters. This makes it simple to work with the data regardless of source or target.
For example, file processing abstraction allows you to work with your files whether local, in the Cloud or via FTP. Similarly, we offer powerful database abstraction so you can use the same set of components with any relational database.
Such abstractions allow graph developers to build their solution locally before simply switching the connectivity parameters when ready for the production deployment.
Data Structure Management (Metadata)
You’ll often have to work with a large number of different data structures. Each application may have its own way of storing data even if they serve the same purpose. For example, consider something as simple as storing information about customers in a CRM system where there are infinitely many ways of storing details about a person or a company including associated contact information.
CloverDX can be thought of as “strongly typed”. Before processing each record, CloverDX needs to fully understand the structure and attributes of the data. This ensures that the transformation can enforce static business logic checking and prevent mismatched record types in various operations. This in turn can prevent logical errors when working with seemingly similar but actually different data structures.
CloverDX’s ability to parametrize any aspect of a data transformation allows you to design generic transformations that generate their data type information on the fly based on the data they process. This can dramatically simplify the design of graphs that operate on more complex data.
Validation is one way to ensure clean, valid data. The aim is to automate the detection of as many invalid records as possible, minimizing the need for human intervention.
To simplify the process, CloverDX provides many built-in components with a UI based configuration, removing the need for coding.
For example, the Validator component allows developers to quickly and visually design complex, rule-based validations. These validation rules generate information for each error that can then be fed into a reporting pipeline for the creation of quality reports.
The ProfilerProbe component allows for easy measurement of various data properties such as data patterns, basic statistical properties of the data (minimum, maximum etc.), unique values and more.
Data mapping is a fundamental part of data migration, data ingestion and other integration processes. It is essential to support the widest range of data transformation options as well as various means of streamlining transformation design and testing.
CloverDX provides many mapping related components ranging from the simple (sorting, deduplication, lookups etc.) through to various programmable components that allow you to code your own business logic (Reformat, Normalizer, Denormalizer, Joiners, etc.).
Many CloverDX components allow you to code your own business logic using our own scripting language, CTL, or Java, giving you access to the widest possible range of your own or external libraries.
CloverDX provides comprehensive code debugging. Further, we also support data debugging by allowing you to inspect data in full detail as it flows through each part of your transformation.
In the world of software development, reusability is taken for granted through mechanisms such as functions and classes.
In CloverDX we fully support reusability by allowing developers to build “child transformations”. A child transformation is a regular transformation that can be referenced from any other transformation. As well as giving the advantages of reusability, you can significantly simplify the readability of your top level transformations.
CloverDX also supports reusability in several other ways. You can manage your own code libraries, store externalized connection settings for your databases, create shared libraries of your data structures and more.
CloverDX is fully compatible with all common version control systems, so you can collaborate on projects using your preferred SCM tools such as git, Mercurial and Subversion.
Automation and Jobflows
It is important that your transformation tool allows developers to build complete pipelines consisting of several transformations and processes with inter-process dependencies.
CloverDX handles this using Jobflows. A Jobflow executes and monitors internal processes (data transformations or other Jobflows) as well as external processes (typically scripts). Together with the powerful error handling and visual nature of the jobflow design process, CloverDX allows developers to very quickly build entire pipelines that properly react to success and failure.
Thanks to its ability to automate entire processes, monitor and send appropriate notifications, CloverDX can quickly become a central orchestration point, notifying responsible and interested parties on the progress of migrations and other data integration processes in real time.
CloverDX exposes multiple different APIs that are useful for integration with third-party tools. For example, it is possible to trigger jobs via REST or SOAP API calls, or to monitor various external systems for changes such as arriving files or JMS messages.
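As a rough illustration of what triggering a job over REST looks like from a client's side (Python; the server URL, path and job name below are hypothetical placeholders, not the actual CloverDX Server API, so consult the CloverDX Server documentation for the real URL scheme and authentication):

```python
import urllib.request

# Hypothetical endpoint and job name for illustration only.
SERVER = "https://clover.example.com"
job = "ingest/daily_load.grf"

req = urllib.request.Request(
    f"{SERVER}/api/jobs?job={job}",   # illustrative path, not the real API
    method="POST",
)
# urllib.request.urlopen(req) would submit the request; it is omitted here
# so the sketch stays self-contained and does not depend on a live server.
```

The point is that any system capable of issuing an HTTP request, from a cron script to a monitoring tool, can act as the trigger for a pipeline.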
Many data integrations depend on being able to define and quickly expose APIs that can be consumed by external applications. In CloverDX, this can be very effectively done for REST APIs via the Data Services feature.
CloverDX Data Services allows developers to design a REST API and visually develop transformation logic that sits behind this API call. The resulting API endpoint can be quickly published from the CloverDX Server.
A CloverDX API endpoint can be used by many different systems such as Tableau, SalesForce and any other services that can call REST APIs.
Typically, calling such an endpoint launches a transformation that implements the logic behind the API.