By understanding the sources of bad data, implementing validation and profiling techniques, considering data quality at each pipeline step, and managing data errors effectively, you can mitigate the risks associated with poor data quality.
Understanding data quality
Sources of bad data
Examples of bad data
Mitigating bad data risks in data pipelines
Most quality issues arise when you put your data in motion – an integration, ingestion, or reshaping for use in warehouse and so on. These data processes typically involve a data pipeline with multiple stages, each of which can allow errors to creep in. And they snowball – a seemingly small error can cause further errors throughout the pipeline and result in improper results for the end user.
Here's three things (plus one bonus thing) you can do to in your data pipelines to ensure better data quality:
1. Validation and profiling:
Incorporate explicit data quality checks in your pipeline, such as validation and profiling stages.
When building your data pipeline, you should build in stages that explicitly deal with data quality - scrutinizing the data on the way in and the way out. Two key ways of doing this are data validation and data profiling:
Validation involves applying business rules to each record to ensure data meets specific criteria (such as: Does the ID field have right number of characters? Are the begin and end time stamps in the right order? Do I have a valid phone number in one of these 3 fields?) Records that pass the validation checks can proceed to the next step in the pipeline; those that don’t get pulled out for assessment and correction.
Profiling involves creating statistical profiles of entire datasets to identify questionable data. These checks can be added at various pipeline stages, helping identify and filter out bad data effectively, and detecting when data quality might be beginning to decay.
2. Data quality in each pipeline step:
Recognize that poor data quality can emerge from any stage in the data pipeline.
Implement checks at each step to capture and address data quality issues as soon as possible, reducing the chances of propagating errors further downstream. Similar to unit testing in software development, treating each pipeline step as a testable component ensures early detection and resolution of data quality problems.
3. Data error management
Instead of merely capturing data errors, consider establishing a data error management process.
This process acts as a central hub for collecting rejected data from the pipeline, allowing for effective correction and reprocessing. It involves a combination of people, processes, and technology, and the data error manager should use a standard format for reporting errors.
One important consideration is who has responsibility for fixing data errors. Even though IT staff might be the first to detect errors in the pipeline, they’re probably not the right choice to be the ones who correct it – they might not know the context of the data, or have the authority to fix it. Involving business users or domain experts ensures proper context and authority for making data corrections.
It's also important to standardize error reporting formats to streamline the resolution process. Error reports need to be presented in a way that a human can understand, and that gives them an actionable piece of intelligence they can act on to correct the data.
4. Data about your data
Implementing these stages into your pipelines to identify and handle bad data can not only improve your immediate data quality, but it can also give you a valuable opportunity to identify larger business issues such as flawed processes or training requirements (e.g. 'All of our West region POS devices are using the wrong timestamp') or uncover and reconcile differences in interpretations or definitions (e.g. 'Is the project closed because the Status field says Closed, or because the Close Date is in the past?).
Good responses to bad data