CloverDX Blog on Data Integration

4 Tips for Solving Large-Scale Enterprise Data Classification Problems

Written by CloverDX | March 23, 2020

Are you struggling to classify large-scale data?

Unfortunately, for many organizations, data pools end up looking more like murky oceans, and understanding and classifying that data can take many months.

Three challenges stand in the way of large-scale data classification:

  • Scale of data – when there are thousands of structures to classify, the problem becomes increasingly complex.
  • Data quality – things like typos and phone number formats decrease data quality and confuse simple algorithms.
  • Poor use of data – for example, staff repurposing a data field to hold something it was never designed for.
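To make the data quality point concrete, here is a minimal Python sketch of how the same phone number in three formats can be normalized before any matching takes place. The function names, the treatment of a leading "00" as an international prefix, and the 7 to 15 digit range are assumptions for illustration only; real data will need rules that fit your own sources.

```python
import re

def normalize_phone(raw: str) -> str:
    """Keep digits only, so differently formatted numbers compare equal."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: a leading "00" is an international prefix, equivalent to "+".
    if digits.startswith("00"):
        digits = digits[2:]
    return digits

def looks_like_phone(raw: str) -> bool:
    # Assumption: 7 to 15 digits after normalization is treated as a phone number.
    return 7 <= len(normalize_phone(raw)) <= 15

samples = ["+1 (555) 010-9999", "001-555-010-9999", "1.555.010.9999"]
normalized = {normalize_phone(s) for s in samples}
```

A naive exact-match comparison would treat the three samples as different values. After normalization they collapse into one.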

These challenges make audits and dealing with regulators a nightmare because businesses don’t know what data they have or how they should treat it.

To shine a light and help your business overcome its large-scale enterprise data classification problems, follow these four tips.

Tip #1: When working at scale, you need to automate

There’s no need to set up the infrastructure and software for automation when you only have a handful of Excel spreadsheets. You'll save time and money by letting your IT team classify them by hand.

However, when you’re facing hundreds of gigabytes and thousands of tables, classification is almost impossible to handle manually. Here, you’re better off turning to an automated solution.
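As a rough illustration of what automation can mean at the level of a single column, the Python sketch below labels a column by checking what share of its values match a simple detector. Everything here is an assumption made for the example: the two detectors, the 80% threshold and the label names are hypothetical, and this is not a description of how any particular product works.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Deliberately loose email pattern; an assumption, not a full validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

DETECTORS = {
    "credit_card": luhn_valid,
    "email": lambda v: bool(EMAIL_RE.match(v)),
}

def classify_column(values, min_share=0.8):
    """Return the first label whose detector matches at least min_share of non-empty values."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "unknown"
    for label, detect in DETECTORS.items():
        share = sum(1 for v in cleaned if detect(v)) / len(cleaned)
        if share >= min_share:
            return label
    return "unknown"
```

Run across thousands of tables, a loop over `classify_column` does in minutes what would take a person weeks, although the "unknown" results still need a human to look at them.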

Moreover, it’s important to remember that data classification is a never-ending process. This means you need to design your process so that classification documents are updated regularly, which in practice means ruling out manual steps wherever you can.

Tip #2: Use clever algorithms

Using clever algorithms sounds obvious, but it’s also important to consider that they will only get you so far. You’ll always need to employ human judgement or ‘polish’ to trust the results of the algorithms.

Some organizations are just too reliant on algorithms. They expect them to work like magic bullets and solve all their data problems.

However, even if an algorithm solves an initial problem, if you don’t understand what it's done, your success will be short-lived. That’s because you’ll struggle to talk confidently with regulators about your data pipelines and processing intent. You need to be able to explain how your data is processed if you want to argue that it is processed properly.

It’s also best to follow a two-step process when using algorithms: let the algorithm handle the cases that are easy to classify, and enlist a person to train it on the cases that are more difficult and ambiguous.
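The two-step process can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical classifier that reports a confidence score between 0 and 1 for each prediction. The `Prediction` type, the `triage` function and the 0.9 threshold are all made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    column: str
    label: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical classifier

def triage(predictions, threshold=0.9):
    """Accept confident predictions automatically; queue the rest for a person."""
    accepted, review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            review.append(p)
    return accepted, review

preds = [
    Prediction("customer.email", "email", 0.98),
    Prediction("customer.notes", "address", 0.55),
]
accepted, review = triage(preds)
```

The decisions a person makes on the review queue can then be fed back as training examples, so that fewer cases need manual attention on the next run.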

Tip #3: Plan your resources

One of the sure-fire ways to doom your data classification project is to underestimate the resources you’ll need to complete it.

To scope your data classification project, you’ll need to clarify everything you need to complete the process, including:

  • If you’re working with thousands of tables of data, you’ll need a lot of resources to classify it.
  • Subject matter experts. When looking at specific fields of data, you’ll need subject matter experts to understand what’s what.
  • Timeframes, budget and further support. Factor in everything else you’ll need to make this a success. This includes budget, timeframes and access to other teams for technical support.

Tip #4: Avoid post-mortem classification if possible

It’s costly and inefficient to classify data after you start using it.

By using data models and other techniques to define your data before you start using it, you’ll dramatically improve your data classification efforts.
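One way to picture this is a data model where every field carries its classification from the moment it is declared. The Python sketch below uses dataclass field metadata for that purpose. The `Customer` model, the `sensitivity` key and its values are assumptions chosen for illustration, not a prescribed scheme.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Customer:
    # Each field declares its sensitivity up front (hypothetical labels).
    customer_id: int = field(metadata={"sensitivity": "internal"})
    email: str = field(metadata={"sensitivity": "personal"})
    card_number: str = field(metadata={"sensitivity": "payment"})

def classification_of(model):
    """Read the classification straight from the model definition."""
    return {
        f.name: f.metadata.get("sensitivity", "unclassified")
        for f in fields(model)
    }

data_map = classification_of(Customer)
```

Because the classification lives in the model, any field added later without a sensitivity label shows up as "unclassified" and is easy to flag before the data is ever used.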

Additionally, applying technologies (such as our data model bridge) that bind the data model and data definition to the process will increase your chances of success further. It’s another way to use automation to make the process less laborious and less prone to error.

Data classification on a large scale

With regulations such as the GDPR, CCPA and HIPAA becoming more stringent, this isn’t the time to take risks with your data.

To meet these regulations, you need to classify your data. Then, you’ll understand where it is, how sensitive it is, and how you should treat it.

But data classification remains a headache for many businesses. This is especially the case when the scale of your data is too much to handle manually. Yes, the tips we’ve covered will all help, but the fact remains that if you have large-scale data, classifying it manually is problematic and time-consuming.

This is where a tool like CloverDX Harvester can help.

It’s not a magic bullet, but with a bit of human help, Harvester will dramatically accelerate your data classification efforts. It does this by automatically creating a data map of the location and sensitivity of your data. No matter the type of data you’re handling - names, credit card numbers, addresses, etc. - Harvester will track it down and classify it.

This is a great way to keep your regulators happy, as you can show where you store data and how you treat it. You can also use this to decide which datasets need anonymizing and how to do it.

This turns your classification project from a burdensome, lengthy task into something that’s achievable within a matter of weeks while making it easy to maintain, update and manage your data pipelines.

To learn more about classification, anonymization, and how you can reduce the danger of your data, watch our webinar on Removing Danger From Data.