Data anonymization is a process of masking data so that individual people or records can’t be identified. It enables life-like data to be used for software testing, analytics, visualization or sharing with third parties, but with any sensitive data safely obscured.
Unlike some other methods of creating non-identifiable data, data anonymization resembles your original dataset as closely as possible, keeping some important characteristics and relationships that can make a big difference to your analysis or testing.
Why would you want to use data anonymization?
In an ideal world, you’d test on your production data to give you the most accurate test results. But there are many reasons you can’t do that:
- You need to comply with data protection, data privacy and compliance regulations. If your data contains personal information, you have a duty to protect it.
- You need to get value from your data, but your data contains corporate information that you don’t want to share.
- You need to protect your system resources; testing on live systems or production data can have an impact on performance.
- You need to manage licensing restrictions. Some software requires a full licence when working with production data, even in a development environment.
- You need more data than you have available. For instance, machine learning or AI tools need large amounts of life-like data to create realistic models.
There are various ways of creating fake data that avoid some of these issues and reduce the risk of exposing sensitive information. We look at some of these below; all have their pros and cons. Essentially, for the most accurate testing you want to use data that’s fake, but still captures real-world situations.
Different methods of data masking
Data can be made unidentifiable in various ways, some easier than others, and some more useful than others.
Randomizing data is relatively easy, but provides limited value. As you can see in the example below, your data may retain some formatting rules, but otherwise is pretty much gibberish. This means that if you want to test your software, you don’t have a very realistic dataset (so your tests aren’t going to catch many potential bugs). It also means that you can’t perform any meaningful analysis on your data.
| | BEFORE (actual data) | AFTER (randomized data) |
| --- | --- | --- |
| Name | Frank Smith | Xxuzyg Mbdhu |
| Social Security Number | 543-69-1573 | 888-88-8888 |
| Date of Birth | 24 Jul 1975 | 1 Jan 2000 |
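As a rough illustration, randomization can be sketched as character-by-character substitution that keeps only formatting rules (digit stays digit, letter stays letter, separators survive). This is a minimal Python sketch of the general idea, not any particular product’s implementation; the `randomize` helper is hypothetical:

```python
import random
import string

def randomize(value: str) -> str:
    """Replace each character with a random one of the same class,
    preserving formatting such as dashes, spaces and capitalization."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isupper():
            out.append(random.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(random.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators like '-' and ' ' as-is
    return "".join(out)

# The output keeps the SSN shape (###-##-####) but is otherwise gibberish
print(randomize("543-69-1573"))
print(randomize("Frank Smith"))
```

Note how the result is useless for analysis: the format survives, but nothing else does.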
Synthetic data - generating artificial data that resembles your original dataset but contains completely fake information - is a step above randomized data. It’s more elaborate, generates valid values (so better for testing and analysis) but still has some limitations.
| | BEFORE (actual data) | AFTER (synthetic data) |
| --- | --- | --- |
| Name | Frank Smith | John Doe |
| Social Security Number | 543-69-1573 | 123-45-6789 |
| Date of Birth | 24 Jul 1975 | 8 Feb 2014 |
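A minimal sketch of synthetic generation, assuming tiny illustrative word lists (a real generator would draw on much larger datasets or models, which is exactly the limitation discussed below); the `synthetic_record` helper is hypothetical:

```python
import random

# Hypothetical word lists for illustration only; real generators
# use far larger source datasets
FIRST_NAMES = ["John", "Mary", "Ahmed", "Yuki", "Sofia"]
LAST_NAMES = ["Doe", "Garcia", "Chen", "Okafor", "Novak"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

def synthetic_record() -> dict:
    """Generate a completely artificial record: valid-looking values
    with no link to any real person, and no relationship to the
    distribution of the original dataset."""
    return {
        "name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}",
        # Nine digits in SSN format; makes no attempt to preserve
        # real state/area-number distributions
        "ssn": (f"{random.randint(100, 899):03d}-"
                f"{random.randint(10, 99):02d}-"
                f"{random.randint(1000, 9999):04d}"),
        "dob": (f"{random.randint(1, 28)} {random.choice(MONTHS)} "
                f"{random.randint(1950, 2010)}"),
    }

print(synthetic_record())
```

The values are valid enough for basic testing, but, as the limitations below show, they carry none of the statistical character of the original data.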
Limitations of synthetic data
- It’s only as good as the underlying datasets or models used to create it. For example:
  - If you use English characters only, you won’t get any non-English names generated (which could be important if you’re trying to test software that’s used by a large number of people).
- It doesn’t keep relationship or identification information that you might want, e.g.:
  - US Social Security Numbers (SSNs) contain numbers that identify a state. You want to make sure you’re not using “real” SSNs that can be linked to a person, but you might want to keep information on how many SSNs from each state you have. Generating completely synthetic SSNs means you lose this.
- It won’t always give you a realistic distribution of values (one that matches your original data):
  - Synthesized data might distribute all values equally across a scale, whereas your original data might not have this same distribution.
- It won’t contain the same errors as your original data:
  - Errors are difficult to synthesize, but they will undoubtedly exist in your production data, so you have to make sure you test for those conditions.
Synthetic data vs data anonymization
Unlike synthesized data, data anonymization does preserve some attributes of your original dataset.
Anonymization can, for example, change 'Frank from Denver' into 'John from Denver'. No longer a real person, but your data still keeps accurate information on the number of people in Denver (although you do, of course, lose information on the number of Franks). It’s important to decide which information matters for you to keep in your particular case.
Rather than creating completely fake data, data anonymization masks your existing dataset. There are several different methods of anonymizing your data, including changing certain values to remove identifying information, shuffling data around or altering values slightly.
The example below shows how some characteristics of the original production data remain, but it's no longer possible to identify individual records or tie sensitive information to a particular person.
| | BEFORE (actual data) | AFTER (anonymized data) |
| --- | --- | --- |
| Name | Frank Smith | 町 達雄 |
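The substitution-and-shuffling idea described above can be sketched as follows. This hypothetical `anonymize` helper (a simplified illustration, not a production implementation) replaces names with placeholders, shuffles SSNs between records, and leaves cities untouched, so dataset-level counts such as people per city or SSNs per state area number are preserved:

```python
import random

def anonymize(records: list) -> list:
    """Mask individual identities while keeping dataset-level facts:
    - names are replaced with placeholders (substitution)
    - SSNs are shuffled between records, so the overall distribution
      (e.g. per-state counts from the SSN area number) is unchanged
    - cities are kept as-is, so per-city counts stay accurate"""
    ssns = [r["ssn"] for r in records]
    random.shuffle(ssns)  # breaks the name-to-SSN link
    return [
        {"name": f"Person {i}", "ssn": ssn, "city": r["city"]}
        for i, (r, ssn) in enumerate(zip(records, ssns), start=1)
    ]

people = [
    {"name": "Frank Smith", "ssn": "543-69-1573", "city": "Denver"},
    {"name": "Jane Brown", "ssn": "543-12-9876", "city": "Denver"},
    {"name": "Ted Jones",  "ssn": "123-45-6789", "city": "Boston"},
]
masked = anonymize(people)
print(masked)
```

Note that shuffling keeps the original values present in the dataset; when the values themselves are sensitive, you would combine shuffling with masking of the value (e.g. altering the last digits) rather than shuffling alone.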
Advantages of data anonymization over synthetic data
- Is more closely related to your original data
- Can inject impurities into your data, making it more life-like
- Keeps some useful information on relationships and statistical distribution
- Can be used in end-to-end system testing, because it keeps relationships between systems
- Can be used for AI or machine learning, because it doesn’t skew reality
- Makes your testing more effective, reducing the chance of embarrassing bugs being discovered after go-live
Learn more about data anonymization
Our webinar Data Anonymization for Better Software Testing explores how data anonymization can help you get better test data and improve the quality of your software releases. We'll go into detail about data anonymization; the different methods of achieving it; and the pros and cons of each approach, as well as taking a look at how CloverDX can help manage the data anonymization process at enterprise scale.