What is Data Anonymization?

Data Anonymization
Posted August 05, 2019
5 min read

Data anonymization is the process of masking data so that individual people or records can’t be identified. It enables life-like data to be used for software testing, analytics, visualization or sharing with third parties, with any sensitive data safely obscured.

Unlike the output of some other methods of creating non-identifiable data, anonymized data resembles your original dataset as closely as possible, keeping important characteristics and relationships that can make a big difference to your analysis or testing.

Read more: Data Anonymization: 7 Essential Use Cases

Why would you want to use data anonymization?

In an ideal world, you’d test on your production data to give you the most accurate test results. But there are many reasons you can’t do that:

  • You need to comply with data protection, data privacy and compliance regulations. If your data contains personal information, you have a duty to protect it.
  • You need to get value from your data, but your data contains corporate information that you don’t want to share.
  • You need to protect your system resources; testing on live systems or production data can have an impact on performance.
  • You need to respect licensing restrictions: some software requires a full licence when working with production data, even in a development setting.
  • You need more data than you have available. For instance, machine learning or AI tools need large amounts of life-like data to create realistic models.

There are various ways of creating fake data that avoid some of these issues and reduce the risk of exposing sensitive information. We look at some of these below; each has its pros and cons. Essentially, for the most accurate testing you want data that’s fake but still captures real-world situations.

Read more: Why Your Business Needs A Data Anonymization Strategy


Different methods of data masking

Data can be made un-identifiable in various ways, some easier than others, and some more useful than others.

Randomized data

Randomizing data is relatively easy, but provides limited value. As you can see in the example below, your data may retain some formatting rules, but otherwise is pretty much gibberish. This means that if you want to test your software, you don’t have a very realistic dataset (so your tests aren’t going to catch many potential bugs). It also means that you can’t perform any meaningful analysis on your data.

Field                  | BEFORE (actual data) | AFTER (randomized data)
Name                   | Frank Smith          | Xxuzyg Mbdhu
Social Security Number | 543-69-1573          | 888-88-8888
City                   | Denver               | Xyzzz
Date of Birth          | 24 Jul 1975          | 1 Jan 2000
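
As a rough illustration of the randomized approach, here is a minimal Python sketch that swaps every letter for a random letter and every digit for a random digit, leaving spaces and hyphens alone. The function name `randomize_keep_format` and the sample record are assumptions made for this example, not part of any particular tool.

```python
import random
import string

def randomize_keep_format(value: str, rng: random.Random) -> str:
    """Replace letters and digits with random ones of the same kind."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # spaces, hyphens and other punctuation stay put
    return "".join(out)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
record = {"name": "Frank Smith", "ssn": "543-69-1573", "city": "Denver"}
randomized = {field: randomize_keep_format(v, rng) for field, v in record.items()}
print(randomized)
```

The output keeps the shape of each value (capitalization, hyphen positions, length), but the content is meaningless, which is exactly why this kind of data is of limited use for testing or analysis.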

Synthetic data 

Synthetic data, meaning artificial data that resembles your original dataset but contains completely fake information, is a step above randomized data. It’s more elaborate and generates valid values (so it’s better for testing and analysis), but it still has some limitations.

Field                  | BEFORE (actual data) | AFTER (synthetic data)
Name                   | Frank Smith          | John Doe
Social Security Number | 543-69-1573          | 123-45-6789
City                   | Denver               | Chicago
Date of Birth          | 24 Jul 1975          | 8 Feb 2014
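
One simple way to picture synthetic generation is drawing each field from a pool of plausible values. The sketch below assumes tiny, hypothetical lookup lists (`FIRST_NAMES`, `LAST_NAMES`, `CITIES`) and an arbitrary date range; a real generator would use far richer sources or a trained model.

```python
import random
from datetime import date, timedelta

# Hypothetical lookup lists, chosen only for illustration.
FIRST_NAMES = ["John", "Maria", "Wei", "Aisha"]
LAST_NAMES = ["Doe", "Garcia", "Chen", "Khan"]
CITIES = ["Chicago", "Boston", "Seattle", "Austin"]

def synthetic_record(rng: random.Random) -> dict:
    """Build one fake record that does not look at the original data at all."""
    start = date(1950, 1, 1)
    span_days = (date(2015, 1, 1) - start).days
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        # Right shape only (NNN-NN-NNNN); not checked against real SSN rules.
        "ssn": f"{rng.randint(1, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}",
        "city": rng.choice(CITIES),
        "birth": start + timedelta(days=rng.randrange(span_days)),
    }

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
for _ in range(3):
    print(synthetic_record(rng))
```

Note that `rng.choice` picks every city with equal probability, and names can only come from the lists supplied, so the output reflects the generator's inputs rather than the characteristics of your production data.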

Limitations of synthetic data

  • It’s only as good as the underlying datasets or models used to create it. For example:
    • If you use English characters only, you won’t get any non-English names generated (which could be important if you’re trying to test software that’s used by a large, international user base).
  • It doesn’t keep relationship or identification information that you might want, e.g.:
    • US Social Security Numbers (SSNs) issued before 2011 begin with an area number tied to a state. You want to make sure you’re not using “real” SSNs that can be linked to a person, but you might want to keep information on how many SSNs from each state you have. Generating completely synthetic SSNs means you lose this.
  • It won’t always give you a realistic distribution of values (one that matches your original data):
    • Synthesized data might distribute all values equally across a scale, whereas your original data might not have this same distribution.
  • It won’t contain the same errors as your original data:
    • Errors are difficult to synthesize, but are something that will undoubtedly exist in your production data, so you have to make sure you test for those conditions.
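
To make the SSN point in the list above concrete, here is a minimal sketch of a masking step that keeps the first three digits of each SSN and replaces the rest with random digits, so the count per leading block survives. The helper names `mask_ssn_keep_area` and `area_counts` are hypothetical, and the middle sample value is made up; whether keeping the prefix is acceptable depends on your own privacy requirements.

```python
import random
from collections import Counter

def mask_ssn_keep_area(ssn: str, rng: random.Random) -> str:
    """Keep the leading three digits, randomize the remaining six."""
    area = ssn.split("-")[0]
    return f"{area}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"

def area_counts(ssns):
    """Count SSNs per leading three-digit block."""
    return Counter(s[:3] for s in ssns)

rng = random.Random(1)  # fixed seed only so the sketch is repeatable
originals = ["543-69-1573", "543-11-2222", "235-41-8875"]  # sample values
masked = [mask_ssn_keep_area(s, rng) for s in originals]

print(masked)
print(area_counts(originals) == area_counts(masked))
```

A fully synthetic generator would produce the leading digits at random too, which is how the per-state counts get lost.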

Synthetic data vs data anonymization

Unlike synthesized data, data anonymization does preserve some attributes of your original dataset.

Anonymization can, for example, change 'Frank from Denver' into 'John from Denver'. This is no longer a real person, but your data still keeps accurate information on the number of people in Denver. You do, of course, lose information on the number of Franks, so it’s important to decide which information you need to keep in your particular case.

Rather than creating completely fake data, data anonymization masks your existing dataset. There are several different methods of anonymizing your data, including changing certain values to remove identifying information, shuffling data around or altering values slightly.
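
The three methods mentioned above (changing values, shuffling, and altering values slightly) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the replacement name list and the second sample record are hypothetical, and a production process would need to handle far more cases.

```python
import random
from collections import Counter
from datetime import date, timedelta

records = [
    {"name": "Frank Smith", "city": "Denver", "birth": date(1975, 7, 24)},
    {"name": "Ann Lee", "city": "Denver", "birth": date(1982, 3, 2)},  # made-up row
    {"name": "Raj Patel", "city": "New York", "birth": date(1957, 9, 14)},  # made-up name
]
REPLACEMENT_NAMES = ["John Doe", "Jane Roe", "Sam Poe"]  # hypothetical pool

def substitute(rows, field, replacements, rng):
    """Change values: swap each value for one drawn from a replacement pool."""
    return [{**r, field: rng.choice(replacements)} for r in rows]

def shuffle_column(rows, field, rng):
    """Shuffle: keep the same set of values but move them between records."""
    values = [r[field] for r in rows]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(rows, values)]

def perturb_dates(rows, field, max_days, rng):
    """Alter slightly: shift each date by up to max_days in either direction."""
    return [
        {**r, field: r[field] + timedelta(days=rng.randint(-max_days, max_days))}
        for r in rows
    ]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
anonymized = substitute(records, "name", REPLACEMENT_NAMES, rng)
anonymized = perturb_dates(anonymized, "birth", 5, rng)
shuffled = shuffle_column(records, "name", rng)

print(Counter(r["city"] for r in anonymized))  # city counts are untouched
```

In this sketch the city column is never modified, so the number of people per city is preserved while names and exact birth dates no longer match the originals.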

The example below shows how some characteristics of the original production data remain, but it's no longer possible to identify individual records or tie sensitive information to a particular person.

Field | BEFORE (actual data)      | AFTER (anonymized data)
      | Record 1    | Record 2    | Record 1    | Record 2
Name  | Frank Smith | 町 達雄     | 町 達雄     | Frank Smith
SSN   | 543-69-1573 | 235-41-8875 | 543-67-0008 | 235-81-9568
City  | Denver      | New York    | Delaware    | Minneapolis
Birth | 24 Jul 1975 | 14 Sep 1957 | 28 Jul 1975 | 17 Sep 1957


Advantages of data anonymization over synthetic data

Anonymized data:

  • Is more closely related to your original data
  • Can inject impurities into your data, making it more life-like
  • Keeps some useful information on relationships and statistical distribution
  • Can be used in end-to-end system testing, because it keeps relationships between systems
  • Can be used for AI or machine learning, because it doesn’t skew reality
  • Makes your testing better, so reduces the chance of embarrassing bugs being discovered after go-live

Learn more about data anonymization

Our webinar Data Anonymization for Better Software Testing explores how data anonymization can help you get better test data and improve the quality of your software releases. We'll go into detail about data anonymization; the different methods of achieving it; and the pros and cons of each approach, as well as taking a look at how CloverDX can help manage the data anonymization process at enterprise scale.

Watch the webinar now > 
