The importance of validation in data ingest
When constructing your data ingest pipeline, the exciting part is the transformation: the mappings, reformattings and changes you build into your processes. But how much time do you spend fixing problems with data after it’s been through that transformation stage? Are you accounting not just for format requirements but also for the inefficiencies that arise when data isn’t fully validated, whether that means fixing a problem immediately or dealing with it down the line when an error or missing data set is discovered?
Validation is not just a step in the data ingest process; it’s the backbone that holds it all together.
The basic steps of any data ingest pipeline are:
- Read – how are you detecting the data that needs to be ingested and assessing its format and readiness?
- Transform – what mappings and transformations need to apply to the data, and how can you ensure this is done in a way that will maintain quality and context?
- Validate – is the data now in the format and structure that the target destination requires?
- Write – how should your data be uploaded, and what metadata is needed about the process?
- Complete – how are you documenting that this process has taken place successfully before archiving the data?
But even though validation is a step in its own right, it must also be considered throughout the rest of the pipeline. Each of those steps is an individual process, and each one needs to be checked and re-checked. Otherwise your entire process and data set could be compromised.
So here’s how you should be validating your data at each step of the data ingest pipeline:
In the Read stage, validation is partially about de-duping. You have to make sure not only that the data does indeed need to be ingested, but also that it’s not been ingested before. Then you need to validate any metadata structure before attempting to transform the data.
However, validating the incoming data is useful for more than just de-duping. As we’ll discuss in the Complete stage, reconciliation is an important process that will take your data ingest process from useful to strategic, and it starts in the Read stage: what you record about the incoming data here is what you later reconcile against.
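The Read-stage checks described above could be sketched roughly as follows. This is a minimal illustration, not a CloverDX feature: it assumes the incoming data is a CSV file, and the function name `read_stage_check`, the `REQUIRED_COLUMNS` set and the in-memory `seen_fingerprints` store are all hypothetical.

```python
import csv
import hashlib
import io

# Hypothetical columns the target expects; adjust to your own schema.
REQUIRED_COLUMNS = {"id", "birth_date"}


def read_stage_check(content: bytes, seen_fingerprints: set) -> dict:
    """De-dupe an incoming file and validate its header before transforming it."""
    # A content hash identifies a file we have already ingested,
    # even if it arrives again under a different name.
    fingerprint = hashlib.sha256(content).hexdigest()
    if fingerprint in seen_fingerprints:
        return {"ok": False, "reason": "duplicate", "fingerprint": fingerprint}

    # Validate the metadata structure (here, the CSV header).
    reader = csv.DictReader(io.StringIO(content.decode("utf-8")))
    header = set(reader.fieldnames or [])
    missing = sorted(REQUIRED_COLUMNS - header)
    if missing:
        reason = "missing columns: " + ", ".join(missing)
        return {"ok": False, "reason": reason, "fingerprint": fingerprint}

    # Record the row count now so it can be reconciled after the write.
    row_count = sum(1 for _ in reader)
    seen_fingerprints.add(fingerprint)
    return {"ok": True, "reason": "", "fingerprint": fingerprint,
            "row_count": row_count}
```

In a real pipeline the set of fingerprints would live somewhere durable, such as a database table, so that duplicates are still detected after a restart.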
In the Transform stage, it’s important to confirm that the desired transformation has taken place without any errors. And when transformation errors inevitably occur, it’s important to surface them in an actionable way. This means the error report must be easy to read for whoever has to act on it, be that a technical user, a business user or a machine, and must provide enough context to make the solution clear.
Too often, transformation errors are reported in unfriendly formats with no clear prioritization or context. By making those error reports easier to read and act on, you can more quickly ensure the data is ready to move through the rest of your pipeline without causing problems downstream.
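As one possible illustration of actionable error reporting, the sketch below collects transformation errors as structured records and renders them two ways: prioritized text for a person, and JSON for a machine. The `TransformError` fields, the `DD/MM/YYYY` input format and the function names are assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import List, Optional

# Lower number = higher priority in the human-readable report.
SEVERITY_ORDER = {"error": 0, "warning": 1}


@dataclass
class TransformError:
    record_id: str
    field: str
    value: str
    severity: str      # "error" or "warning"
    message: str       # what went wrong
    suggestion: str    # how to fix it


def transform_birth_date(record: dict,
                         errors: List[TransformError]) -> Optional[dict]:
    """Convert birth_date from DD/MM/YYYY to ISO 8601, or log an error."""
    raw = str(record.get("birth_date") or "")
    try:
        parsed = datetime.strptime(raw, "%d/%m/%Y").date()
    except ValueError:
        errors.append(TransformError(
            record_id=str(record.get("id", "?")),
            field="birth_date",
            value=raw,
            severity="error",
            message=f"'{raw}' is not a valid DD/MM/YYYY date.",
            suggestion="Resend the value as DD/MM/YYYY, for example 12/10/1985.",
        ))
        return None
    return {**record, "birth_date": parsed.isoformat()}


def render_for_person(errors: List[TransformError]) -> str:
    """Most severe first, one line per error, with the fix spelled out."""
    ordered = sorted(errors, key=lambda e: SEVERITY_ORDER[e.severity])
    return "\n".join(
        f"[{e.severity.upper()}] record {e.record_id}, field '{e.field}': "
        f"{e.message} Suggested fix: {e.suggestion}"
        for e in ordered
    )


def render_for_machine(errors: List[TransformError]) -> str:
    """The same errors as JSON, for automated handling."""
    return json.dumps([asdict(e) for e in errors])
```

Keeping the errors as structured data until the last moment is the design choice that matters here: the same list can feed an email to a business user, a log line for a developer, or a retry queue.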
Of course, the Validate stage itself is all about validation: has your data transformed correctly? Is it now in the right format, with the right metadata? Is it fully ready for import? The important thing here is that you’re validating data that is already in the right format, which is reliably possible only if validation is built into the transformation stage as well.
In the Write stage, you should validate not only before you write the data to the target platform, but also after the fact. Has the ingest taken place successfully? And do you have all the necessary information about that ingest process?
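A simple after-the-fact check for the Write stage is to compare the number of records you intended to write with the number that actually arrived. The sketch below uses an SQLite database as a stand-in target; the `people` table, its columns and the `write_and_verify` function are hypothetical.

```python
import sqlite3


def write_and_verify(conn: sqlite3.Connection, rows: list) -> dict:
    """Write rows to the target, then confirm how many actually landed."""
    before = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]

    # The connection context manager commits on success and rolls back on error.
    with conn:
        # INSERT OR IGNORE silently skips rows that violate the primary key,
        # so the write "succeeds" even when some records are dropped.
        conn.executemany(
            "INSERT OR IGNORE INTO people (id, birth_date) VALUES (?, ?)",
            rows,
        )

    after = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
    written = after - before
    return {
        "expected": len(rows),
        "written": written,
        "ok": written == len(rows),
    }
```

The returned dictionary is also the kind of information worth keeping about the ingest: it can be logged alongside the batch so the Complete stage has something to reconcile.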
Finally, the Complete stage may feel like a formality (log the import, archive the data), but this is where reconciliation happens. Even if your data is technically correct and has ingested successfully, does it look the way you would expect it to? For example, if you’re ingesting birth date information and you normally see a fairly even split across the months, but in one import half your records have an October birth date, that’s worth recording and flagging, in the system and possibly to a user.
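The birth date example could be checked with something as small as the sketch below. The function name `flag_skewed_months` and the 25% threshold are assumptions chosen for illustration; a real threshold should come from what your historical imports look like.

```python
from collections import Counter
from datetime import date
from typing import Dict, List


def flag_skewed_months(birth_dates: List[date],
                       max_share: float = 0.25) -> Dict[int, float]:
    """Return each birth month whose share of the batch exceeds max_share."""
    total = len(birth_dates)
    if total == 0:
        return {}

    counts = Counter(d.month for d in birth_dates)
    return {
        month: count / total
        for month, count in counts.items()
        if count / total > max_share
    }
```

An empty result means the batch looks as expected. A non-empty one does not necessarily mean the data is wrong, only that it is unusual enough to record and possibly bring to a user’s attention.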
How CloverDX approaches validation
At CloverDX, we believe in an automation-first approach, especially when it comes to data ingest. By automating as much of your data ingest process as possible, you’re able to free up valuable developer time and put control of the process in the hands of your business users.
The key to doing that lies in error reporting. At CloverDX, we believe that good error reporting means a format appropriate to the stage of the process and the means of resolution. Different errors need different reporting in terms of format, context and immediacy. Each report must be easily readable and actionable by whoever is dealing with the error, be that a business user, a developer or a machine, and it must be delivered in time to resolve the issue without creating more problems down the line.
CloverDX’s built-in Validator component combines with your schema definition to help identify everything from individual mapping errors to full-process failures. This gives you the agility to correct data errors in real time and the big-picture context to refine the process over time.
With the right validation processes, even the worst, messiest data can be ingested successfully and reliably. For more on how to architect your systems for effective control of bad data, download our whitepaper.