What is Data Anonymization?

Written by CloverDX | August 05, 2019

Data anonymization is a process of masking data so that individual people or records can’t be identified. It enables life-like data to be used for software testing, analytics, visualization or sharing with third parties, but with any sensitive data safely obscured.

Unlike some other methods of creating non-identifiable data, data anonymization produces a dataset that resembles your original as closely as possible, keeping important characteristics and relationships that can make a big difference to your analysis or testing.

Why would you want to use data anonymization?

In an ideal world, you’d test on your production data to give you the most accurate test results. But there are many reasons you can’t do that:

You need to comply with data protection, data privacy and compliance regulations. If your data contains personal information, you have a duty to protect it.
You need to get value from your data, but your data contains corporate information that you don’t want to share.
You need to protect your system resources; testing on live systems or production data can have an impact on performance.
You need to respect licensing restrictions: some software requires a full licence when working with production data, even in a development setting.
You need more data than you have available. For instance, machine learning or AI tools need large amounts of life-like data to create realistic models.

There are various ways of creating fake data that avoid some of these issues and reduce the risk of exposing sensitive information. We look at some of these below; each has its pros and cons. Essentially, for the most accurate testing you want data that’s fake but still captures real-world situations.

Different methods of data masking

Data can be made unidentifiable in various ways, some easier than others, and some more useful than others.

Randomized data

Randomizing data is relatively easy, but provides limited value. As you can see in the example below, your data may retain some formatting rules, but otherwise is pretty much gibberish. This means that if you want to test your software, you don’t have a very realistic dataset (so your tests aren’t going to catch many potential bugs). It also means that you can’t perform any meaningful analysis on your data.

	BEFORE (actual data)	AFTER (randomized data)
Name	Frank Smith	Xxuzyg Mbdhu
Social Security Number	543-69-1573	888-88-8888
City	Denver	Xyzzz
Date of Birth	24 Jul 1975	1 Jan 2000
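
As a rough illustration of the randomization described above, the Python sketch below replaces each letter and digit with a random one of the same kind and leaves separators in place. The function name `randomize_preserving_format` and the sample record are assumptions made for this example; they are not taken from any particular tool.

```python
import random
import string


def randomize_preserving_format(value: str, rng: random.Random) -> str:
    """Replace every letter and digit with a random one of the same kind."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as spaces and dashes
    return "".join(out)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
record = {"name": "Frank Smith", "ssn": "543-69-1573", "city": "Denver"}
randomized = {field: randomize_preserving_format(v, rng) for field, v in record.items()}
print(randomized)
```

The output keeps the shape of each field (word lengths, capitalization, dash positions) but nothing else, which is why it is of little use for realistic testing or analysis.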

Synthetic data

Synthetic data, meaning artificially generated data that resembles your original dataset but contains completely fake information, is a step above randomized data. It’s more elaborate to produce and consists of valid values (so it’s better for testing and analysis), but it still has some limitations.

	BEFORE (actual data)	AFTER (synthetic data)
Name	Frank Smith	John Doe
Social Security Number	543-69-1573	123-45-6789
City	Denver	Chicago
Date of Birth	24 Jul 1975	8 Feb 2014
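
One simple way to picture synthetic generation is to assemble each record from lookup lists and random numbers. The sketch below does that in Python; the lists `FIRST_NAMES`, `LAST_NAMES` and `CITIES` and the function `synthetic_record` are hypothetical and deliberately tiny, and the SSN is only formatted correctly, not checked against real issuing rules.

```python
import random
from datetime import date, timedelta

# Hypothetical lookup lists; a real generator would use far larger ones.
FIRST_NAMES = ["John", "Maria", "Wei", "Aisha"]
LAST_NAMES = ["Doe", "Garcia", "Chen", "Khan"]
CITIES = ["Chicago", "Denver", "New York", "Minneapolis"]


def synthetic_record(rng: random.Random) -> dict:
    """Build one fake record from lookup lists and random numbers."""
    birth = date(1940, 1, 1) + timedelta(days=rng.randrange(365 * 70))
    area, group, serial = rng.randrange(1, 900), rng.randrange(1, 100), rng.randrange(1, 10000)
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        # Correctly formatted, but not validated against SSN issuing rules.
        "ssn": f"{area:03d}-{group:02d}-{serial:04d}",
        "city": rng.choice(CITIES),
        "date_of_birth": birth.isoformat(),
    }


rng = random.Random(7)  # fixed seed only so the sketch is repeatable
rows = [synthetic_record(rng) for _ in range(3)]
for row in rows:
    print(row)
```

Every value is valid on its own, but nothing ties the result to the original dataset, which leads to the limitations listed next.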

Limitations of synthetic data

It’s only as good as the underlying datasets or models used to create it. For example:
- If you generate names using English characters only, you won’t get any non-English names (which could be important if you’re testing software that’s used by a large number of people, not all of whom have English names).
It doesn’t keep relationship or identification information that you might want. For example:
- US Social Security Numbers (SSNs) issued before mid-2011 begin with an area number that is linked to a state. You want to make sure you’re not using “real” SSNs that can be linked to a person, but you might want to keep information on how many SSNs from each state you have. Generating completely synthetic SSNs means you lose this.
It won’t always give you a realistic distribution of values (one that matches your original data):
- Synthesized data might distribute all values equally across a scale, whereas your original data might not have this distribution.
It won’t contain the same errors as your original data:
- Errors are difficult to synthesize, but they will undoubtedly exist in your production data, so you have to make sure you test for those conditions.
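
The distribution point above can be made concrete with a small Python sketch. It compares naive synthesis, where every distinct city is equally likely, with sampling weighted by the frequencies observed in the original column. The `original_cities` column and its 80/15/5 split are invented for illustration.

```python
import random
from collections import Counter

# Hypothetical "production" column: most customers are in Denver.
original_cities = ["Denver"] * 80 + ["Chicago"] * 15 + ["New York"] * 5

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
n = 10_000

# Naive synthesis: every distinct city is equally likely.
distinct = sorted(set(original_cities))
uniform_sample = [rng.choice(distinct) for _ in range(n)]

# Distribution-aware synthesis: weight each city by its observed frequency.
counts = Counter(original_cities)
weighted_sample = rng.choices(list(counts), weights=list(counts.values()), k=n)

uniform_share = Counter(uniform_sample)["Denver"] / n
weighted_share = Counter(weighted_sample)["Denver"] / n
print(f"Denver share: uniform={uniform_share:.2f}, weighted={weighted_share:.2f}")
```

The uniform sample puts roughly a third of the records in Denver, while the weighted sample stays close to the original 80 percent. Even the weighted version only reproduces one column at a time; relationships between columns and real-world errors are still lost.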


Synthetic data vs data anonymization

Unlike synthesized data, data anonymization does preserve some attributes of your original dataset.

Anonymization can, for example, change 'Frank from Denver' into 'John from Denver'. This is no longer a real person, but your data still keeps accurate information on the number of people in Denver. You do of course lose information on the number of Franks, so it’s important to decide which information you need to keep in your particular case.
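
A minimal Python sketch of the 'Frank from Denver' example, assuming a hypothetical list of replacement names: the name is swapped out, the city is left untouched, and the per-city counts are compared before and after.

```python
import random
from collections import Counter

# Hypothetical records and replacement names, for illustration only.
people = [
    {"name": "Frank", "city": "Denver"},
    {"name": "Alice", "city": "Denver"},
    {"name": "Frank", "city": "Chicago"},
]
REPLACEMENT_NAMES = ["John", "Maria", "Wei", "Aisha"]


def mask_names(records, rng):
    """Return copies of the records with the name replaced and the city kept."""
    return [{**r, "name": rng.choice(REPLACEMENT_NAMES)} for r in records]


masked = mask_names(people, random.Random(1))

cities_before = Counter(r["city"] for r in people)
cities_after = Counter(r["city"] for r in masked)
print(cities_before == cities_after)  # city counts survive the masking
print(Counter(r["name"] for r in masked))  # the number of Franks does not
```

Which columns to leave alone and which to replace is the design decision here; this sketch keeps the city and sacrifices the name distribution.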

Rather than creating completely fake data, data anonymization masks your existing dataset. There are several different methods of anonymizing your data, including changing certain values to remove identifying information, shuffling data around or altering values slightly.

The example below shows how some characteristics of the original production data remain, but it's no longer possible to identify individual records or tie sensitive information to a particular person.

	BEFORE (actual data)		AFTER (anonymized data)	
Name	Frank Smith	町達雄	町達雄	Frank Smith
Social Security Number	543-69-1573	235-41-8875	543-67-0008	235-81-9568
City	Denver	New York	Delaware	Minneapolis
Date of Birth	24 Jul 1975	14 Sep 1957	28 Jul 1975	17 Sep 1957
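
The kind of transformation shown in the table can be approximated with three of the methods mentioned earlier: shuffling names between records, changing SSNs while keeping their first three digits, and altering birth dates slightly. The Python below is a sketch under those assumptions, not a description of how CloverDX implements anonymization; the function `anonymize` and its `max_shift_days` parameter are made up for this example.

```python
import random
from datetime import date, timedelta

# The two "before" records from the table above.
records = [
    {"name": "Frank Smith", "ssn": "543-69-1573", "birth": date(1975, 7, 24)},
    {"name": "町達雄", "ssn": "235-41-8875", "birth": date(1957, 9, 14)},
]


def anonymize(rows, rng, max_shift_days=5):
    """Shuffle names across rows, keep the SSN prefix, nudge birth dates."""
    names = [r["name"] for r in rows]
    rng.shuffle(names)  # names stay realistic but are detached from their rows
    result = []
    for row, name in zip(rows, names):
        prefix = row["ssn"][:3]  # the first three digits are kept as-is
        rest = f"{rng.randrange(100):02d}-{rng.randrange(10000):04d}"
        shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
        result.append({"name": name, "ssn": f"{prefix}-{rest}", "birth": row["birth"] + shift})
    return result


anonymized = anonymize(records, random.Random(7))  # seed only for repeatability
for row in anonymized:
    print(row)
```

A shuffle can leave a name on its original row, and a random suffix can in principle reproduce the original digits, so a production process would need extra checks that this sketch leaves out.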

Advantages of data anonymization over synthetic data

Anonymized data:

Is more closely related to your original data
Can inject impurities into your data, making it more life-like
Keeps some useful information on relationships and statistical distribution
Can be used in end-to-end system testing, because it keeps relationships between systems
Can be used for AI or machine learning, because it doesn’t skew reality
Makes your testing better, so reduces the chance of embarrassing bugs being discovered after go-live


Learn more about data anonymization

Our webinar Data Anonymization for Better Software Testing explores how data anonymization can help you get better test data and improve the quality of your software releases. We go into detail about data anonymization, the different methods of achieving it, and the pros and cons of each approach, and we look at how CloverDX can help manage the data anonymization process at enterprise scale.

