Data Quality

A reference guide

This Data Quality reference guide for managers and technical staff provides a deeper understanding of what is typically involved and how to prepare.

Basics

What is Data Quality?

    Data quality is a measure of data’s fitness for purpose. It covers many different quantitative and qualitative properties, including:

    • Completeness
    • Internal consistency
    • Accuracy

    Data quality processes are designed to ensure that the data remains useful and that any problematic data is discovered as early as possible and then manually or automatically corrected.

Why should I care?

    Most organizations produce and consume vast quantities of data. Many decisions, ranging from the mundane to the strategic, are based on data from several sources. As such, high data quality is essential for efficient function of the organization.

    Even when the importance of data quality is recognized, issues often go unnoticed or remain uncorrected for prolonged periods of time. This often stems from the fact that humans are good at dealing with low-quality data (a postman can usually deliver to a poorly written address).

    Automated systems, on the other hand, often cope badly with poor quality data and this often leads to increased manual labor or sub-optimal decisions that have negative financial consequences.

Measures of data quality

Measuring data quality

    To evaluate data quality, it is helpful to define a set of measures that provide a common assessment framework.

    These measures provide information about some measurable aspect of the data in question. There are six standard measures for data quality.

    • Completeness
    • Accuracy
    • Consistency
    • Validity
    • Integrity
    • Timeliness

Completeness

    Completeness is a measure of whether all expected data is actually present in a given data set. Completeness can be measured for whole records as well as for attributes.

    For example, the data set of all company employees is complete on a record level when it contains a record for every employee.

    On an attribute level, the situation is more complex since quite a lot of data might be available for each employee (name, role, address, email, phone number, etc.) and some of these attributes might be mandatory (name, address) while others are only optional (phone number, personal email).
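
    To make this concrete, the following minimal sketch (in Python, independent of any particular tool) computes attribute-level completeness for a set of mandatory fields. The field names and sample records are purely illustrative.

      # Minimal sketch: attribute-level completeness for mandatory fields.
      # Field names and sample records are hypothetical.
      MANDATORY = ["name", "address"]

      employees = [
          {"name": "Jane Doe", "address": "12 High St", "phone": None},
          {"name": "John Roe", "address": None, "phone": "555-0100"},
      ]

      def completeness(records, fields):
          """Share of mandatory attribute values that are actually present."""
          total = len(records) * len(fields)
          present = sum(
              1 for r in records for f in fields if r.get(f) not in (None, "")
          )
          return present / total if total else 1.0

      print(f"Mandatory-field completeness: {completeness(employees, MANDATORY):.0%}")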

Accuracy

    Accuracy describes the degree to which data correctly reflects the real-world object it represents.

    An employee data set, for example, might contain inaccurate data if employees do not report their address changes or if spelling or other errors are introduced during data entry.

    Accuracy can be measured as the percentage of entity attributes that match the properties of the real-world object. In practice this is often very hard to do, since determining whether an attribute value is accurate can be difficult or impossible.
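
    Where a trusted reference is available, the measurement itself is simple. The sketch below (hypothetical records, not tied to any tool) counts the attributes that match a reference version of the same employee.

      # Minimal sketch: accuracy as the percentage of attributes that match a
      # trusted reference. Obtaining such a reference is usually the hard part.
      record    = {"name": "Jane Doe", "address": "12 High St", "role": "Analyst"}
      reference = {"name": "Jane Doe", "address": "14 High St", "role": "Analyst"}

      matches = sum(1 for field in reference if record.get(field) == reference[field])
      accuracy = matches / len(reference)
      print(f"Accuracy: {accuracy:.0%}")  # 2 of 3 attributes match -> 67%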

Consistency

    Consistency measures whether different systems that store information about the same object remain synchronized.

    Employee data, for example, might be stored in both an HR system and a project management system. Most of the data is system-specific, but some data points are common to both and can easily fall out of sync, leaving one system with lower quality data.
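
    A simple way to quantify this is to cross-check the shared attributes between the two systems. The sketch below uses hypothetical extracts keyed by employee id; it only illustrates the idea.

      # Minimal sketch: flag employees whose shared attributes differ between
      # two systems (hypothetical extracts keyed by employee id).
      hr_system = {101: {"email": "jane@corp.example", "department": "Finance"}}
      pm_system = {101: {"email": "jane@corp.example", "department": "Sales"}}

      SHARED_FIELDS = ["email", "department"]

      for emp_id, hr_rec in hr_system.items():
          pm_rec = pm_system.get(emp_id, {})
          for field in SHARED_FIELDS:
              if hr_rec.get(field) != pm_rec.get(field):
                  print(f"Employee {emp_id}: inconsistent {field} "
                        f"(HR={hr_rec.get(field)!r}, PM={pm_rec.get(field)!r})")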

Validity (Conformity)

    Validity is the measure of how well data meets the requirements placed on attribute values, such as value size, allowed values for enumerations, or data format in systems that store free-form data (e.g. whether dates are stored as MM/dd/yyyy or dd/MM/yyyy in text files).

    More sophisticated checks can, of course, implement complex cross-field or cross-system validations.

    Data often becomes invalid due to programming errors or data entry errors. It is often possible, as well as desirable, to design automated systems that can flag or even correct invalid records.
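
    A few representative rules are sketched below. The field names, formats and allowed values are assumptions chosen purely for illustration.

      # Minimal sketch: validity rules applied to a single record
      # (field names, formats and allowed values are hypothetical).
      from datetime import datetime

      ALLOWED_COUNTRIES = {"US", "GB", "DE"}

      def validity_errors(record):
          errors = []
          try:
              datetime.strptime(record.get("hire_date", ""), "%Y-%m-%d")
          except ValueError:
              errors.append("hire_date is not in YYYY-MM-DD format")
          if record.get("country") not in ALLOWED_COUNTRIES:
              errors.append("country is not an allowed code")
          if len(record.get("employee_id", "")) != 6:
              errors.append("employee_id must be 6 characters long")
          return errors

      print(validity_errors({"hire_date": "12/31/2020", "country": "FR", "employee_id": "E1"}))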

Integrity

    Integrity is the measure of the validity of relationships between different data entities. This concept is often applied to database applications where various integrity constraints can be enforced directly by the database (foreign keys, unique constraints etc.).

    You will also encounter relationships between data stored in non-relational data stores, in which case the integrity constraints must be enforced by the application logic.

    Integrity is most commonly broken in cases where common data attributes are stored in multiple systems.
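
    Outside a relational database, such checks have to live in application code. The sketch below looks for "orphan" references in hypothetical collections that no database constraint protects.

      # Minimal sketch: referential-integrity check enforced in application
      # logic (collections and field names are hypothetical).
      departments = {"FIN", "ENG", "OPS"}
      employees = [
          {"id": 101, "department": "FIN"},
          {"id": 102, "department": "HR"},   # orphan reference
      ]

      orphans = [e["id"] for e in employees if e["department"] not in departments]
      if orphans:
          print(f"Employees referencing unknown departments: {orphans}")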

Timeliness

    Timeliness reflects the degree to which data represents reality at a specific point in time. It can be thought of as the difference between something happening and that change being recorded in the system and propagated to users.
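
    In practice this is often expressed as the lag between an event timestamp and the time the record becomes available to consumers, as in the sketch below (timestamps are hypothetical).

      # Minimal sketch: timeliness as the lag between an event occurring and
      # the record being visible downstream (timestamps are hypothetical).
      from datetime import datetime

      event_time  = datetime(2024, 3, 1, 9, 0)    # change happened in the real world
      loaded_time = datetime(2024, 3, 1, 15, 30)  # record became visible to consumers

      print(f"Data lag: {loaded_time - event_time}")  # 6:30:00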

Pitfalls

Unrealistic data entry expectations

    Many data quality issues are created at the point where data is entered into a computer. Even though the degree of automation in many organizations can be high, there is always a considerable amount of data that is entered manually.

    Manual data entry inevitably brings mistakes, whether misspellings or data entered into the wrong field. Some mistakes are caused by lack of training or a poor user interface, and some applications perform little or no validation.

Changing business processes

    Data quality issues are often caused by business process changes that are incorrectly communicated. For example, a new attribute may be added to an entity without all systems that work with that entity being properly updated, resulting in lower quality data.

    If invalid data has propagated to other systems, correcting the data at that point can be problematic and costly, especially if it impacts multiple systems. It is, therefore, much easier and less costly to properly enforce validation rules in the source application.

    It is important to always consider the whole application landscape, even when implementing changes that might seem trivial (like adding an attribute to an entity).

Poor data organization

    Today’s world is powered by data. What was considered a huge volume of data just a few years ago is now considered normal, and with storage getting cheaper, many organizations tend to store data indefinitely.

    While keeping data organized might seem to be a simple task, it is actually a major challenge in larger organizations. In many cases, no single party or department is formally assigned responsibility, which often leads to each department doing their own thing.

    A seemingly trivial but typical example is the naming of documents. If there’s no standard for document naming and storage, improvised systems (wikis, portals etc.) and conventions arise, making it hard to locate documents and resulting in growing inefficiencies. Staff turnover issues compound this still further.

Lack of data quality monitoring

    Monitoring of data quality for every involved application is vital in order to prevent pollution of multiple applications with low quality data coming from a single source.

    Monitoring often consists of data validation rules that are applied to each record as it is transformed into its destination format.

    Choice of data attributes to monitor, and how to monitor them, is one of the key decisions in the design phase of the whole project. Too much monitoring, with overly detailed reports, can overwhelm stakeholders and cause important problems to be overlooked. Too little monitoring is equally undesirable, as important observations simply go unreported.

    Proper data quality monitoring is, therefore, about carefully balancing the investment in building monitoring rules with the volume of output.
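
    As a tool-agnostic illustration, the sketch below applies a couple of hypothetical validation rules to every record and aggregates the failures into the kind of summary a monitoring report might contain.

      # Minimal sketch: apply validation rules per record and aggregate the
      # results into a small monitoring summary (rules and data are hypothetical).
      from collections import Counter

      rules = {
          "missing_email": lambda r: not r.get("email"),
          "negative_salary": lambda r: r.get("salary", 0) < 0,
      }

      records = [
          {"email": "a@example.com", "salary": 50000},
          {"email": "", "salary": -1},
      ]

      failures = Counter()
      for record in records:
          for rule_name, failed in rules.items():
              if failed(record):
                  failures[rule_name] += 1

      for rule_name, count in failures.items():
          print(f"{rule_name}: {count} of {len(records)} records")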

Maintaining data quality

Data Discovery

    Data discovery is an often overlooked and underestimated part of any data related project. People often make false assumptions about their data as most people view the data from their own perspective only. However, data discovery is a critical part of the design phase as it provides the input for the scope definition and project estimates.

    Discovery involves a formal analysis of the data itself, finding relationships within and between datasets. One also has to consider all applications that use the data and need to be reconfigured or updated, especially in the case of data migrations.

    As such, discovery in larger organizations can be an activity involving various teams, often crossing departmental boundaries.
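
    Part of the technical side of discovery is simple profiling. The sketch below (hypothetical records, no particular tool assumed) reports missing-value counts and distinct values per column, which is often a useful first input for scoping.

      # Minimal sketch: basic profiling output that often feeds data discovery
      # (missing values and distinct values per column; the dataset is hypothetical).
      records = [
          {"country": "US", "status": "active"},
          {"country": None, "status": "active"},
          {"country": "GB", "status": "left"},
      ]

      columns = {key for record in records for key in record}
      for column in sorted(columns):
          values = [record.get(column) for record in records]
          missing = sum(value in (None, "") for value in values)
          distinct = sorted({value for value in values if value not in (None, "")})
          print(f"{column}: {missing}/{len(values)} missing, distinct values: {distinct}")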

Data Validation

    Data validation is an important process in achieving sufficient data quality in the target system and goes hand in hand with data cleansing. Data issues are discovered during validation and data that does not pass the validation step is marked for cleansing.

    Data validation can be a manual or an automated process. A high degree of automation is, of course, preferable and indeed necessary for larger systems. Nevertheless, some manual validation may still be necessary, especially for the most important data that cannot be allowed through without human confirmation or correction.
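
    Conceptually, automated validation splits records into a valid stream and a stream marked for cleansing, as in the sketch below (the rule and records are hypothetical).

      # Minimal sketch: route records that fail validation to a cleansing queue
      # while valid records continue (rule and records are hypothetical).
      def is_valid(record):
          return bool(record.get("email")) and "@" in record["email"]

      records = [{"email": "jane@corp.example"}, {"email": "not-an-email"}]

      valid, needs_cleansing = [], []
      for record in records:
          (valid if is_valid(record) else needs_cleansing).append(record)

      print(f"{len(valid)} valid, {len(needs_cleansing)} marked for cleansing")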

Data Cleansing

    During data discovery, you will often find that the data cannot be used in its current form and first needs to be cleansed. There are many different reasons for low data quality, ranging from simple ones (anything involving human data entry is likely to have various errors including typos, missing data, data misuse etc.) all the way to complex issues stemming from improper data handling practices and software bugs.

    Data cleansing is the process of taking “dirty” data in its original location and cleaning it before it is used in any data transformation. Data cleansing is often an integral part of the business logic with the data being cleaned in the transformation but left unchanged in the originating system. Other approaches can also be used. For example, a separate, clean copy of the data can be created if the data needs to be reused or if cleansing is time-consuming and requires human interaction.
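
    The sketch below illustrates the in-transformation approach: records are cleansed as they are processed, while the source data is left untouched. The standardization rules shown are hypothetical.

      # Minimal sketch: cleanse a record inside the transformation, leaving the
      # original source record unchanged (rules are hypothetical).
      import re

      def cleanse(record):
          cleaned = dict(record)  # work on a copy; the source record is not modified
          cleaned["name"] = record["name"].strip().title()
          cleaned["phone"] = re.sub(r"\D", "", record.get("phone", ""))  # digits only
          return cleaned

      source = {"name": "  jane DOE ", "phone": "+1 (555) 010-0100"}
      print(cleanse(source))  # {'name': 'Jane Doe', 'phone': '15550100100'}
      print(source)           # the original record is unchanged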

Best practices

Fixing data at source

    There are many situations where data propagates to multiple different systems from a single source application. Failure to enforce data quality in the source application results in low quality data propagating to those systems and having a multiplying effect.

    It is therefore well worth the effort of implementing as many validations as possible within source applications.

Data validation on input

    Any process that submits data should include validation. Many downstream issues can be prevented by implementing basic validation checks such as the following (a minimal sketch of such checks appears after this list):

    • date validation (date format, or even whether a date lies in the future, etc.)
    • verifying values against reference lists (e.g. checking state names or country codes)
    • checking for reasonable values in all required fields (e.g. do not allow “n/a” as a person’s last name)
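
    The field names, the shortened state list and the placeholder values in the sketch below are assumptions made for illustration only.

      # Minimal sketch of basic input checks (field names and rules are hypothetical).
      from datetime import date, datetime

      US_STATES = {"CA", "NY", "TX"}                    # shortened list for illustration
      PLACEHOLDERS = {"", "n/a", "none", "unknown", "-"}

      def input_errors(form):
          errors = []
          try:
              entered = datetime.strptime(form.get("start_date", ""), "%Y-%m-%d").date()
              if entered > date.today():
                  errors.append("start_date lies in the future")
          except ValueError:
              errors.append("start_date is not a valid YYYY-MM-DD date")
          if form.get("state") not in US_STATES:
              errors.append("state is not a recognized code")
          if form.get("last_name", "").strip().lower() in PLACEHOLDERS:
              errors.append("last_name looks like a placeholder value")
          return errors

      print(input_errors({"start_date": "2999-01-01", "state": "ZZ", "last_name": "n/a"}))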

Data Quality Monitoring

    Monitoring of data quality for every involved application is vital in order to prevent pollution of multiple applications with low quality data coming from a single source.

    Monitoring often consists of data validation rules that are applied to each record as it is transformed into its destination format.

    Choice of data attributes to monitor, and how to monitor them, is one of the key decisions in the design phase of the whole project. Too much monitoring, with overly detailed reports, can overwhelm stakeholders and cause important problems to be overlooked. Too little monitoring is equally undesirable, as important observations simply go unreported.

    Proper data quality monitoring is, therefore, about carefully balancing the investment in building monitoring rules with the volume of output.

Reporting

    Reporting is a key part of data migrations and other integrations, and it is rarely done well. Well-executed reporting ensures that stakeholders get status information quickly and are able to react in a short timeframe. This, in turn, shortens the time it takes to determine whether or not everything was successful.

    A variety of reports could be generated. Some common examples are:

    • data quality reports
    • effort reports
    • resource utilization reports

    A well designed reporting pipeline will notify relevant parties so they can take necessary action should issues arise.

Integration using CloverDX

Validation

    Validation is one way to ensure clean, valid data. The aim is to automate the detection of as many invalid records as possible, minimizing the need for human intervention.

    To simplify the process, CloverDX provides many built-in components with UI-based configuration, removing the need for coding.

    For example, the Validator component allows developers to quickly and visually design complex, rule-based validations. These validation rules generate information for each error that can then be fed into a reporting pipeline for the creation of quality reports.

    The ProfilerProbe component allows for easy measurement of various data properties such as data patterns, basic statistical properties of the data (minimum, maximum etc.), unique values and more.

Data cleansing

    CloverDX provides a wide array of functionality that can help with data cleansing. At a basic level, the Validator component can fix basic data formatting issues and standardize dates, numbers, phone numbers and so on.

    When more involved data cleansing is necessary, CloverDX provides tools to implement any data cleansing logic very quickly - from simple implementation in a single Reformat component all the way to complicated data processing pipelines that implement a data-driven rule engine.

Connectivity

    One of the first questions asked is how to connect to the many different input and output systems in existence.

    Integrations and migrations often involve dozens, or even hundreds, of different applications, each with its own storage and connectivity requirements. This can include various file types in remote and local file systems, web services, database types, messaging queues and more.

    CloverDX provides a wide variety of “connectors” that provide the desired connectivity and also supports powerful abstraction of connectivity parameters. This makes it simple to work with the data regardless of source or target.

    For example, file processing abstraction allows you to work with your files whether local, in the Cloud or via FTP. Similarly, we offer powerful database abstraction so you can use the same set of components with any relational database.

    Such abstractions allow graph developers to build their solution locally before simply switching the connectivity parameters when ready for the production deployment.

    • Structured data: delimited and fixed-length files in any format
    • Hierarchical data: JSON, XML and custom formats
    • Excel files: XLS and XLSX
    • Relational databases: any database with JDBC driver
    • NoSQL databases: MongoDB
    • Big Data: HDFS, S3
    • REST and SOAP APIs
    • Other: SalesForce, Tableau, email, LDAP, JMS, ...

Reusability

    In the world of software development, reusability is taken for granted through mechanisms such as functions and classes.

    In CloverDX we fully support reusability by allowing developers to build “child transformations”. A child transformation is a regular transformation that can be referenced from any other transformation. As well as providing the benefits of reusability, child transformations can significantly improve the readability of your top-level transformations.

    CloverDX also supports reusability in several other ways. You can manage your own code libraries, store externalized connection settings for your databases, create shared libraries of your data structures and more.

    CloverDX is fully compatible with all common version control systems, so you can collaborate on projects using your preferred SCM tools such as git, Mercurial and Subversion.

Automation and Jobflows

    It is important that your transformation tool allows developers to build complete pipelines consisting of several transformations and processes with inter-process dependencies.

    CloverDX handles this using Jobflows. A Jobflow executes and monitors internal processes (data transformations or other Jobflows) as well as external processes (typically scripts). Together with the powerful error handling and visual nature of the jobflow design process, CloverDX allows developers to very quickly build entire pipelines that properly react to success and failure.

    Thanks to its ability to automate entire processes, monitor them and send appropriate notifications, CloverDX can quickly become a central orchestration point, keeping responsible and interested parties informed of the progress of migrations and other data integration processes in real time.
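
    The orchestration idea itself can be illustrated generically. The sketch below is not the CloverDX Jobflow API; it simply shows jobs with dependencies, where a failure stops dependent jobs and triggers a notification.

      # Generic sketch of dependency-aware orchestration with failure handling.
      # This only illustrates the concept; it is not the CloverDX Jobflow API.
      def notify(message):
          print(f"NOTIFY: {message}")  # stand-in for an email or chat notification

      def run_pipeline(jobs, dependencies):
          done = set()
          for name, job in jobs.items():
              if any(dep not in done for dep in dependencies.get(name, [])):
                  notify(f"Skipping {name}: an upstream job did not finish")
                  continue
              try:
                  job()
                  done.add(name)
              except Exception as exc:
                  notify(f"Job {name} failed: {exc}")

      def extract():
          print("extracting source data")

      def validate():
          raise RuntimeError("bad data")  # simulated failure

      def load():
          print("loading into target")

      run_pipeline(
          {"extract": extract, "validate": validate, "load": load},
          dependencies={"validate": ["extract"], "load": ["validate"]},
      )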

    API Endpoints

    CloverDX exposes multiple different APIs that are useful for integration with 3rd party tools. For example, it is possible to trigger jobs via REST or SOAP API calls, or to monitor various external systems for changes such as arriving files or JMS messages.

Creating APIs

    Many data integrations depend on being able to define and quickly expose APIs that can be consumed by external applications. In CloverDX, this can be very effectively done for REST APIs via the Data Services feature.

    CloverDX Data Services allows developers to design a REST API and visually develop transformation logic that sits behind this API call. The resulting API endpoint can be quickly published from the CloverDX Server.

    A CloverDX API endpoint can be used by many different systems such as Tableau, SalesForce and any other services that can call REST APIs.

    Typically, calling such an endpoint launches a transformation that implements the required logic.
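
    As a generic illustration of the idea (this is not CloverDX Data Services), the sketch below exposes a trivial piece of validation logic as a REST endpoint. It assumes the Flask package is installed, and the /validate path is hypothetical.

      # Generic sketch: expose simple validation logic as a REST endpoint.
      # Not CloverDX Data Services; assumes Flask is installed, path is hypothetical.
      from flask import Flask, jsonify, request

      app = Flask(__name__)

      @app.route("/validate", methods=["POST"])
      def validate_record():
          record = request.get_json(force=True, silent=True) or {}
          errors = []
          if not record.get("email") or "@" not in record["email"]:
              errors.append("email is missing or malformed")
          return jsonify({"valid": not errors, "errors": errors})

      if __name__ == "__main__":
          app.run(port=8080)  # POST a JSON record to http://localhost:8080/validate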