How to Build Painless Data Anonymization And Pseudonymization Into Your Data Architecture


Data anonymization and pseudonymization are necessary data privacy techniques for extracting value from your data whilst remaining GDPR compliant.

Both involve using de-identification methods, such as scrambling, masking, and blurring, to help conceal identifiable data. But there’s one key difference:

  • Anonymization scrubs your data of all identifiable information that could expose your data subject. For example, a simple but effective anonymization might involve replacing all names with a ‘*’ or replacing real credit card numbers with just 16 random digits.
  • Pseudonymization, on the other hand, does not remove all identifiable information, but does make it extremely difficult to link data back to its subject. Without the separately held ‘key’ (i.e. the record of which values were substituted), outside parties cannot readily work out the true identities behind your data.
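
To make the difference concrete, here is a minimal Python sketch of both approaches applied to a name field. The function and class names are hypothetical, and the in-memory lookup table stands in for the ‘key’; in practice that table would be stored separately from the data and under tighter access control.

```python
import secrets


def anonymize_name(_name):
    """Irreversible: every name collapses to the same '*' placeholder."""
    return "*"


class Pseudonymizer:
    """Replaces values with random tokens and keeps the mapping (the 'key')."""

    def __init__(self):
        # Assumption: these two tables ARE the key. Anyone holding them can
        # re-identify the data, so they must live apart from the data set.
        self._token_by_value = {}
        self._value_by_token = {}

    def pseudonymize(self, value):
        # The same input always gets the same token, so records stay linkable.
        if value not in self._token_by_value:
            token = "subj_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def reidentify(self, token):
        # Only possible for whoever holds the mapping.
        return self._value_by_token[token]
```

The anonymized output cannot be traced back to anyone, but it also cannot tell two customers apart. The pseudonymized tokens keep that distinction, at the cost of having a key to protect.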

Pseudonymization makes it easier to retain the usefulness of the data set with a decent level of protection, whereas simple anonymization (masking, deleting or randomly generating replacement data) provides stronger protection at the expense of losing much of the data’s value. As an example, randomly generated credit card numbers are useless for properly testing a web payment form or for analyzing “which card issuer has the most problems clearing payments”.

On the other hand, careful anonymization of the original card numbers would preserve critical information – card issuer, type, valid check digit – in a way that does not reveal any identifiable information about the holder.
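
As a rough illustration of what such careful anonymization could look like, the Python sketch below keeps the leading digits of a card number (which identify the issuer and card type), randomizes the account digits, and recomputes the final digit with the Luhn algorithm so the result still passes a check-digit test. Keeping six leading digits is an assumption made for this example, and the function names are hypothetical.

```python
import random


def luhn_check_digit(partial):
    """Return the Luhn check digit for a card number missing its last digit."""
    total = 0
    # Walk right to left; the digit next to the (missing) check digit is doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_is_valid(number):
    """True if the last digit of `number` is a correct Luhn check digit."""
    return luhn_check_digit(number[:-1]) == number[-1]


def anonymize_card(number, keep_prefix=6, rng=None):
    """Keep the issuer prefix, randomize the rest, restore a valid check digit."""
    rng = rng or random.SystemRandom()
    prefix = number[:keep_prefix]
    middle_len = len(number) - keep_prefix - 1
    middle = "".join(str(rng.randrange(10)) for _ in range(middle_len))
    partial = prefix + middle
    return partial + luhn_check_digit(partial)
```

A number produced this way would still be accepted by a payment form's format validation and still groups correctly by issuer, while the account digits no longer belong to the original holder.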

Implementing either practice is easier said than done. That’s because applying anonymization and pseudonymization on a larger scale requires thorough planning and careful execution.

We’ve boiled this down to the three Ds of de-identification.


1. Define your use case and the level of anonymization needed

Ultimately, the more data fields you anonymize, the less ‘realistic’ and usable your data becomes. On the other hand, the fewer fields you anonymize, the less secure the data becomes and the easier it is to re-identify. How you process, share and use your data should define the anonymization technique used.
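
One way to capture this trade-off is to write down a treatment for every field, per use case. The Python sketch below is only an illustration: the use-case names, field names and handlers are all hypothetical.

```python
# Hypothetical treatments, from least to most destructive.
HANDLERS = {
    "keep": lambda value: value,
    "year_only": lambda value: value[:4],       # "1984-03-17" -> "1984"
    "mask": lambda value: "*" * len(value),
}

# Hypothetical use cases: the more exposed the data, the stricter the profile.
PROFILES = {
    "internal_analytics": {"name": "mask", "birth_date": "year_only", "country": "keep"},
    "external_contractor": {"name": "mask", "birth_date": "mask", "country": "keep"},
}


def apply_profile(record, profile_name):
    """Return a copy of `record` treated according to the named profile."""
    profile = PROFILES[profile_name]
    # Fields without an explicit rule are masked, so a newly added column
    # is protected by default rather than leaked by default.
    return {
        field: HANDLERS[profile.get(field, "mask")](value)
        for field, value in record.items()
    }
```

Here the analytics profile keeps the birth year because age bands are useful for analysis, while the contractor profile gives that up in exchange for a lower re-identification risk.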

2. Discover your data

Taking the time to discover your own data might seem like an obvious next step. But for large organizations, this process is more akin to finding needles in not one, but multiple haystacks.

With numerous IT systems and hundreds of thousands of database tables, often containing similar data records, it’s difficult to work out what data you have and where it is. But, for compliance and business purposes, you need to understand where your data resides.
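
At its simplest, discovery means walking every table and column and flagging anything that looks sensitive, either by its name or by its contents. The Python sketch below does this for a single SQLite database; the name hints and value patterns are illustrative assumptions and far from a complete list.

```python
import re
import sqlite3

# Assumption: column names containing these words deserve a closer look.
NAME_HINTS = re.compile(r"(name|email|phone|card|ssn|birth)", re.IGNORECASE)

# Assumption: rough value shapes; real detection would need many more patterns.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "card_number": re.compile(r"^\d{13,19}$"),
}


def discover_sensitive_columns(conn, sample_size=100):
    """Return (table, column, reasons) for columns that look sensitive."""
    findings = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            reasons = []
            if NAME_HINTS.search(column):
                reasons.append("column name")
            rows = conn.execute(
                f'SELECT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL LIMIT ?',
                (sample_size,),
            ).fetchall()
            values = [str(row[0]) for row in rows]
            for label, pattern in VALUE_PATTERNS.items():
                # Flag the column only if every sampled value fits the pattern.
                if values and all(pattern.match(v) for v in values):
                    reasons.append(f"values look like {label}")
            if reasons:
                findings.append((table, column, reasons))
    return findings
```

Checking values as well as names matters because sensitive data often hides in columns with unhelpful names, such as an email address stored in a field called `contact`.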

This is a huge project for any large organization to undertake. As a result, you’ll need help from a consulting company or a data expert with the right tools for discovering and anonymizing your data simply and effectively.

3. Data anonymization and pseudonymization at scale

When anonymizing a single data set to send to a contractor, you can easily make do with Excel or other readily available anonymization tools.

However, if you have multiple use cases and, therefore, require various levels of anonymization, things become much more complex. This problem is only compounded when you consider the amount of your data that’s dotted all over your systems.

So, if you require large, enterprise-scale anonymization, the job will mean anonymizing entire databases at once, along with the data that ties records together (IDs, keys and anything else that referential integrity depends on). If this linking data is not anonymized or pseudonymized consistently, your anonymization process will fail.
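
A minimal Python sketch of the referential integrity problem: if a customer ID is replaced in one table, the same replacement has to appear everywhere that ID is used, or the joins break. The example below derives each replacement from a keyed hash (HMAC-SHA256), so the mapping is consistent across tables without sharing a lookup table. The table layout, field names and the 16-character truncation are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize_id(value, key):
    """Deterministically map an ID to a token; same ID + same key = same token."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a shorter token raises the collision risk.
    return digest[:16]


def pseudonymize_tables(customers, orders, key):
    """Treat both tables with the same key so the foreign key still matches."""
    new_customers = [
        {**row, "customer_id": pseudonymize_id(row["customer_id"], key), "name": "*"}
        for row in customers
    ]
    new_orders = [
        {**row, "customer_id": pseudonymize_id(row["customer_id"], key)}
        for row in orders
    ]
    return new_customers, new_orders
```

Because the token depends only on the original ID and the secret key, an order still points at the right (now pseudonymous) customer after both tables have been processed, even if they are processed in separate runs.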

To get your specified data treated correctly, you can either ask your developers to build a customized anonymization process internally or contact an expert who already has templates and tools built for the task. However, bear in mind that tackling this on an ad hoc basis internally may take months or even years to complete.

The CloverDX approach

If you don’t have the time to wait for your developers to build an anonymization process, it’s better to enlist the help of an automated tool that can do most of the work for you.

CloverDX’s anonymization framework simplifies setting up and operating your complex anonymization and pseudonymization process. We’ve developed a ‘Data Harvester’ that crawls your thousands of databases and finds the specific sensitive datasets you’re looking for at scale, cutting the time you’d spend doing it manually from potentially months to just a couple of weeks.

From here, CloverDX’s anonymization engine uses rules that define multiple targets for the different levels of data anonymization required. As your data grows and changes (as it inevitably will), you can easily re-configure the platform and continue to anonymize your sensitive information automatically and with minimal hassle.

So, are you ready to de-stress your data anonymization processes? Watch our data anonymization webinar for more information.


Posted on January 27, 2020