Data ingestion is the process of moving or onboarding data from one or more data sources into an application data store. Every business in every industry undertakes some kind of data ingestion, from a small-scale instance of pulling data from one application into another, all the way to an enterprise-wide system that takes in a continuous stream of data from multiple sources, reads it, transforms it, and writes it into a target system so it’s ready for some other use.
A data migration is a wholesale move from one system to another, with all the timing and coordination challenges that brings. Migration is a one-time affair, although it can take significant time and resources.
Data ingestion, on the other hand, usually involves repeatedly pulling in data from sources typically not associated with the target application, often dealing with multiple incompatible formats and applying transformations along the way.
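As a minimal sketch of what "incompatible formats and transformations" can look like in practice, the example below maps customer records from two hypothetical sources (a CRM exporting CSV and a billing system exporting JSON — the field names and sample data are invented for illustration) onto one common target schema:

```python
import csv
import io
import json

# Hypothetical sample payloads: two sources describe the same kind of
# record with different formats and different field names.
CRM_CSV = "id,full_name,signup\n42,Ada Lovelace,2023-01-15\n"
BILLING_JSON = '[{"customer_id": 7, "name": "Alan Turing", "created": "2022-11-03"}]'

def from_crm_csv(text):
    """Map the CRM's CSV columns onto the target field names."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "name": row["full_name"], "signup_date": row["signup"]}

def from_billing_json(text):
    """Map the billing system's JSON keys onto the same target fields."""
    for row in json.loads(text):
        yield {"id": row["customer_id"], "name": row["name"], "signup_date": row["created"]}

# After normalization, every record has the same shape regardless of source.
records = list(from_crm_csv(CRM_CSV)) + list(from_billing_json(BILLING_JSON))
```

Keeping one small mapping function per source makes it cheap to add a new source later without touching the target side of the pipeline.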
There are two main methods of data ingest: batch, where data is collected and loaded in scheduled chunks, and streaming (real-time), where data is ingested continuously as it is produced.
Data ingestion can take a wide variety of forms. These are just a couple of real-world examples:
Setting up a data ingestion pipeline is rarely as simple as you’d think. Often, you’re consuming data managed and understood by third parties and trying to bend it to your own needs. This can be especially challenging if the source data is inadequately documented and managed.
For example, your marketing team might need to load data from an operational system into a marketing application. Before you start, you’ll need to consider these questions:
When you’re dealing with a constant flow of data, you don’t want to have to manually supervise it, or initiate a process every time you need your target system updated. Plan for this from the very beginning; otherwise you’ll end up wasting a lot of time on repetitive tasks.
Human error can lead to data integrations failing, so eliminating as much human interaction as possible can help keep your data ingest trouble-free. (This is even more important if ingestion occurs frequently.)
Both these points can be addressed by automating your ingest process.
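One piece of that automation is making each ingest run self-healing, so transient failures don't need a person to notice and rerun them. The sketch below assumes hypothetical `extract` and `load` callables standing in for your real source and target calls, and retries a failed cycle a bounded number of times before escalating:

```python
import time

def ingest_once(extract, load, retries=3, backoff_seconds=0.0):
    """Run one extract/load cycle, retrying transient failures.

    Returns the number of attempts the cycle took to succeed.
    """
    for attempt in range(1, retries + 1):
        try:
            load(extract())
            return attempt
        except Exception:
            if attempt == retries:
                raise  # surface to monitoring/alerting, not to a human mid-run
            time.sleep(backoff_seconds)

# Example: a hypothetical source that fails twice before responding.
_calls = {"count": 0}

def flaky_extract():
    _calls["count"] += 1
    if _calls["count"] < 3:
        raise ConnectionError("transient source error")
    return [{"id": 1}]

loaded = []
attempts = ingest_once(flaky_extract, loaded.append)
```

In production the whole cycle would then be triggered on a schedule (cron, Airflow, or similar) rather than run by hand.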
You’ll also need to consider other potential complexities, such as:
Data ingest can also be used as part of a larger data pipeline, where other events or actions are triggered by data arriving in a certain location. For example, a system might monitor a particular directory or folder and trigger a process whenever new data appears there.
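A simple polling version of that directory-monitor trigger could look like the sketch below. This is an illustration, not a production pattern: real deployments more often use filesystem notifications (inotify, the `watchdog` library) or cloud storage events, and the `handler` here is a hypothetical stand-in for whatever ingest job you'd kick off.

```python
import os
import tempfile
import time

def watch_directory(path, handler, poll_seconds=5, max_polls=None, seen=None):
    """Poll `path` and call `handler(filepath)` once for each new file."""
    seen = set() if seen is None else seen
    polls = 0
    while max_polls is None or polls < max_polls:
        for name in sorted(os.listdir(path)):
            if name not in seen:
                seen.add(name)
                handler(os.path.join(path, name))  # e.g. kick off an ingest job
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)

# Example run against a temporary landing directory with one dropped file.
handled = []
with tempfile.TemporaryDirectory() as landing:
    open(os.path.join(landing, "orders.csv"), "w").close()
    watch_directory(landing, handled.append, poll_seconds=0, max_polls=1)
```

Note that tracking `seen` filenames in memory is the weak point of this sketch; a durable pipeline would persist that state (or move processed files out of the landing directory) so a restart doesn't re-trigger old files.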
There are typically four primary considerations when setting up new data pipelines:
It’s also important to consider the future of the ingestion pipeline: for example, growing data volumes, or the increasing demands of end users, who typically want data faster.
Another important aspect of the planning phase of your data ingest is to decide how to expose the data to users. Typical questions asked in this phase of pipeline design can include:
These considerations are often not planned properly, resulting in delays, cost overruns and increased end-user frustration.
It’s important to understand how often your data needs to be ingested, as this will have a major impact on the performance, budget and complexity of the project.
There is a spectrum of approaches between real-time and batch ingest. For example, it might be possible to micro-batch your pipeline to get near-real-time updates, or even to use different approaches for different source systems.
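The core of a micro-batching approach can be sketched as a small buffer that flushes to the target either when it is full or when a time window expires, trading a little latency for far fewer, larger writes. In this sketch, `write_batch` is a hypothetical stand-in for your target system's bulk-load call, and the size/age thresholds are illustrative:

```python
import time

class MicroBatcher:
    """Buffer records and flush them to the target in small batches."""

    def __init__(self, write_batch, max_size=100, max_age_seconds=1.0):
        self.write_batch = write_batch  # e.g. a bulk insert into the target
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.buffer = []
        self.opened_at = None

    def add(self, record):
        if not self.buffer:
            self.opened_at = time.monotonic()  # start of this batch's window
        self.buffer.append(record)
        if len(self.buffer) >= self.max_size or time.monotonic() - self.opened_at >= self.max_age:
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []

# Example: collect batches in a list instead of writing to a real target.
batches = []
batcher = MicroBatcher(batches.append, max_size=2, max_age_seconds=60.0)
for record in [1, 2, 3]:
    batcher.add(record)
batcher.flush()  # drain whatever remains at the end of the stream
```

Tuning `max_size` and `max_age_seconds` is exactly the latency-versus-throughput trade-off described above: smaller values behave more like streaming, larger ones more like batch.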
Understanding the requirements of the whole pipeline in detail will help you make the right decision on ingestion design.
The decision process often starts with the users and the systems that produce the data. Typical questions asked at this stage include: