Data masking techniques and best practices

What is data masking?

Last updated on October 18, 2024 

Data masking is the process of permanently concealing PII and other sensitive data, while retaining its referential integrity and semantic consistency.

01

What is data masking?

Data masking is a method for protecting personal or sensitive data that creates a version of the data that can’t be identified or reverse-engineered while retaining referential integrity and usability.

The most common types of data that need masking are:
  • Personally Identifiable Information (PII), such as names and passport, Social Security, and telephone numbers
  • Protected Health Information (PHI) about an individual’s health status, care, or payment data
  • Protected financial data, as mandated by the Payment Card Industry Data Security Standard (PCI DSS) and the US Federal Trade Commission (FTC) acts and safeguards
  • Test data, associated with the Software Development Life Cycle (SDLC)

Masked data is generally used in non-production environments – such as software development, data science, and testing – that don’t require the original production data.

Simply defined, data masking combines the processes and tools for making sensitive data unrecognizable to – yet functional for – authorized users.

The data masking process

Iterative data masking lifecycle

The data masking process is an iterative lifecycle that can be broken down into 4 steps corresponding to 4 data masking requirements – discovering the data, defining the masking rules, deploying the masking functionality, and auditing the entire process on an ongoing basis.

With the right data masking process in place, your data teams can:

  1. Discover the data and its relationships 
    Your data masking tools should be in sync with a data catalog that collects, analyzes, and visualizes the metadata for all the information that requires masking.
  2. Define data masking rules
    You should be able to restrict access to sensitive data with Role-Based Access Controls (RBAC), and then define different masking rules for different types of users.
  3. Deploy the appropriate masking functions
    Your tools should have a wide range of built-in functions – e.g., redaction, scrambling, shuffling, and nulling out – but also allow you to customize functionality to your own specifications.
  4. Audit and report for compliance
    Generating exportable audit reports – for instance, including schema, table, and column name, as well as type, field, and probability of a match – is an ongoing part of the process.
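
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real tool: the regex patterns, the rule table, and the function names (`discover`, `deploy`, `audit`) are all assumptions made for the example, and pattern-based discovery like this will miss PII such as names.

```python
import re

# Hypothetical discovery patterns; a real data catalog would cover far more PII types.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def discover(rows):
    """Step 1: return {column: pii_type} for columns whose values all match a pattern."""
    found = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        for pii_type, pattern in PII_PATTERNS.items():
            if all(isinstance(v, str) and pattern.match(v) for v in values):
                found[column] = pii_type
    return found

# Step 2: masking rules keyed by PII type (illustrative choices only).
RULES = {
    "ssn": lambda v: "XXX-XX-" + v[-4:],  # partial redaction
    "email": lambda v: None,              # nulling out
}

def deploy(rows, found):
    """Step 3: apply the rule for each discovered column; other columns pass through."""
    return [
        {col: RULES[found[col]](val) if col in found else val for col, val in row.items()}
        for row in rows
    ]

def audit(table, found):
    """Step 4: build report rows naming the table, column, and detected PII type."""
    return [{"table": table, "column": c, "type": t} for c, t in sorted(found.items())]

rows = [
    {"name": "Rick Smith", "ssn": "123-45-6789", "email": "rick@example.com"},
    {"name": "Ann Lee", "ssn": "987-65-4321", "email": "ann@example.com"},
]
found = discover(rows)
masked = deploy(rows, found)
report = audit("customers", found)
```

Note that the `name` column passes through unmasked here, which is exactly the kind of gap the audit step is meant to surface.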

02

Types of data masking

Over time, many types of data masking have evolved to provide more sophistication, flexibility and security, including:

Static data masking

Non-production environments, such as those used for analytics, testing, training, and development purposes, often source data from production systems. In such cases, sensitive data is protected with static data masking, a one-way transformation that ensures the masking process cannot be undone. Repeatability is a key requirement for testing and analytics, because the same input data must deliver the same results. The masked values must therefore persist over time and across multiple extractions.

For software testing, static data masking is usually employed on a copy of a production database. Advanced masking tools make data look real enough to enable software development and testing, without exposing the original values.
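
One common way to get a one-way yet repeatable transformation is a keyed hash. The sketch below assumes HMAC-SHA256 and small hypothetical lists of fake names; the source does not say which algorithm any particular tool uses, so treat this as an illustration of the idea only.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would live in a secrets manager, never in code.
SECRET_KEY = b"example-masking-key"

# Hypothetical pools of realistic-looking replacement values.
FAKE_FIRST = ["Sam", "Dana", "Lee", "Kim", "Alex", "Jo", "Pat", "Max"]
FAKE_LAST = ["Jones", "Brown", "Clark", "Davis", "Evans", "Ford", "Gray", "Hall"]

def mask_name(real_name: str) -> str:
    """Map a real name to a fake one using a keyed one-way hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, real_name.encode("utf-8"), hashlib.sha256).digest()
    first = FAKE_FIRST[digest[0] % len(FAKE_FIRST)]
    last = FAKE_LAST[digest[1] % len(FAKE_LAST)]
    return f"{first} {last}"

# The same input always yields the same masked value, across runs and extractions.
first_extract = mask_name("Rick Smith")
second_extract = mask_name("Rick Smith")
```

With pools this small, different real names will sometimes map to the same fake name; a larger pool or a longer slice of the digest reduces that.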

Dynamic data masking

Dynamic data masking is used to protect, obscure, or block access to sensitive data on demand. While prevalent in production systems, it is also used when testers or data scientists require real data. It is performed in real time, in response to a data request. When the data is located in multiple source systems, keeping the masking consistent is difficult, especially across disparate environments and a wide variety of technologies.

Dynamic data masking streams masked data directly from the production environment, so the masked data never needs to be stored in a separate database. As a rule, it’s used for role-based security for applications – such as handling customer queries, or processing sensitive data, like health records – and in read-only scenarios, so that the masked data doesn’t get written back to the production system.

This technique is frequently used in customer service applications to ensure that support personnel can access the data they need to assist customers while masking sensitive information like credit card numbers or personal identifiers – to maintain privacy and compliance with data protection regulations.
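
A minimal sketch of the customer service scenario follows. The roles, the field policy, and the `read_record` function are hypothetical; the point is only that masking happens at read time, per role, while the stored record stays untouched.

```python
RECORD = {"name": "Rick Smith", "card_number": "4111111111111111", "plan": "Gold"}

# Hypothetical policy: the fields each role may see unmasked.
CLEAR_FIELDS = {
    "fraud_analyst": {"name", "card_number", "plan"},
    "support_agent": {"name", "plan"},
}

def mask_card(number: str) -> str:
    """Show only the last four digits of a card number."""
    return "*" * (len(number) - 4) + number[-4:]

# Field-specific maskers; any other restricted field is fully redacted.
MASKERS = {"card_number": mask_card}

def read_record(record: dict, role: str) -> dict:
    """Return a masked view for this role. The stored record is never modified."""
    allowed = CLEAR_FIELDS.get(role, set())
    view = {}
    for field, value in record.items():
        if field in allowed:
            view[field] = value
        else:
            view[field] = MASKERS.get(field, lambda _: "REDACTED")(value)
    return view

agent_view = read_record(RECORD, "support_agent")
analyst_view = read_record(RECORD, "fraud_analyst")
unknown_view = read_record(RECORD, "intern")
```

Defaulting unknown roles to an empty set of clear fields means a missing policy entry fails closed rather than open.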

On-the-fly data masking

When analytics or test data is extracted from production systems, staging sites are often used to integrate, cleanse, and transform the data, before masking it. The masked data is then delivered to the analytics or testing environment. This multi-stage process is slow, cumbersome, and risky due to the possible exposure of private data.

On-the-fly data masking is performed on data as it moves from one environment to another, such as from production to development or test. It’s ideal for enterprises engaging in continuous software development and large-scale data integrations. A subset of the masked data is generally delivered to authorized users upon request, because keeping a backup of all the masked data is inefficient and impractical.
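
The contrast with a staging-based flow can be sketched with a Python generator: each record is masked as it streams from source to target, so no unmasked copy is ever written to an intermediate store. The source rows, the `mask_in_flight` function, and the email masker are assumptions made for this example.

```python
def mask_in_flight(records, maskers):
    """Yield masked copies one record at a time, without staging the raw data."""
    for record in records:
        yield {
            field: maskers[field](value) if field in maskers else value
            for field, value in record.items()
        }

def production_rows():
    """Stand-in for a cursor over a production table (hypothetical data)."""
    yield {"id": 1, "email": "rick@example.com"}
    yield {"id": 2, "email": "ann@example.com"}

# Illustrative masker: keep the domain, replace the local part.
maskers = {"email": lambda v: "user@" + v.split("@", 1)[1]}

# The target environment only ever receives masked records.
test_env = list(mask_in_flight(production_rows(), maskers))
```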

Statistical data masking

Statistical data masking ensures that any masked data retains the same statistical characteristics and patterns as the real-world data it represents – such as the distribution, mean, and standard deviation. When production data is statistically masked, unauthorized users have great difficulty extracting any information of value afterwards.
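
As a rough illustration, the sketch below replaces a numeric column with random values that are rescaled to have the same mean and standard deviation as the original. It assumes a roughly bell-shaped column and preserves only those two statistics, not the full distribution; the function name and sample values are hypothetical.

```python
import random
import statistics

def statistically_mask(values, seed=0):
    """Replace values with random ones that keep the original mean and std deviation."""
    rng = random.Random(seed)
    target_mean = statistics.fmean(values)
    target_std = statistics.pstdev(values)
    # Draw standard-normal noise, then standardize it so the rescaled
    # output matches the target statistics rather than only approximating them.
    noise = [rng.gauss(0, 1) for _ in values]
    noise_mean = statistics.fmean(noise)
    noise_std = statistics.pstdev(noise)
    return [target_mean + target_std * (n - noise_mean) / noise_std for n in noise]

salaries = [52000, 61000, 58000, 75000, 49000]
masked_salaries = statistically_mask(salaries)
```

Aggregate analysis on `masked_salaries` gives the same mean and spread as the real column, while no individual value corresponds to a real person's salary.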

Test data masking

Software applications require extensive testing before they can be released into production. Test data management tools that provision production data for testing must mask the test data to protect sensitive information. For example, in a legacy modernization program, the modernized software components must undergo continuous testing, making test data masking a key component in the testing process. Masking data with referential context and relational integrity – from production systems to the test environments – is critical.

Unstructured data masking

When it comes to protecting data privacy, regulations do not differentiate between structured and unstructured data. Scanned documents and image files, such as insurance claims, bank checks, and medical records, contain sensitive data stored as images. Many different formats (e.g., pdf, png, csv, email, and Office docs) are used daily by enterprises in their regular interactions with individuals. With the potential for so much sensitive data to be exposed in unstructured files, the need for unstructured data masking is obvious.
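
For free text, the simplest form of unstructured masking is pattern-based redaction. The sketch below assumes two illustrative regular expressions and a hypothetical `redact_text` function; scanned images would first need text extraction (for example OCR), which is outside the scope of this example.

```python
import re

# Illustrative patterns only; real tools combine many patterns with other detection methods.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def redact_text(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Claimant Rick Smith (SSN 123-45-6789) can be reached at rick@example.com."
redacted = redact_text(note)
# redacted == "Claimant Rick Smith (SSN [SSN]) can be reached at [EMAIL]."
```

The name in the note survives, because names have no fixed pattern. Detecting them typically requires a dictionary or a trained model rather than a regular expression.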

 

Data masking of unstructured data
Masking of unstructured data is particularly important in
the financial services industry due to strict regulations.

03

Data masking best practices

Here are the most common data masking best practices to assure data security and compliance.

  • Identify where your sensitive data resides

    Learn the location, access, movement, and usage of your sensitive data across your systems and environments.

  • Determine the right data masking techniques

    Choose from among the most appropriate data masking methods for your data, based on its sensitivity, usage, and security policies. Data masking techniques include anonymization, pseudonymization, encrypted lookup substitution, redaction, shuffling, date aging, and nulling out.

  • Test your data masking

    Verify that your data masking techniques produce the expected results, and that the masked data is realistic and functional enough for your needs.

  • Keep your data masking techniques secure

    Ensure that only authorized personnel can access and modify your data masking algorithms, and that they’re stored and managed securely.

  • Ensure referential integrity

    Make sure that the same data masking technique is applied consistently to the same type of data across your systems, to maintain the relationships and logic of the data.

 

Data masking catalog
By managing data with a business entity approach,
referential integrity and consistency are ensured.

04

Data masking techniques

There are several core data masking techniques associated with data obfuscation, as indicated in the following table:

| Technique | How it works | Notes |
| --- | --- | --- |
| Data anonymization | Permanently replaces PII with fake, but realistic, data | Protects data privacy and supports testing / analytics |
| Pseudonymization | Swaps PII with random values while securely storing the original data when needed | Applies to unstructured as well as structured data |
| Encrypted lookup substitution | Creates a lookup table with alternative values for PII | Prevents data breaches by encrypting the table |
| Redaction | Replaces a field containing PII with generic values, completely or partially | Useful when PII isn’t required or when dynamic data masking is employed |
| Shuffling | Randomly inserts other masked data instead of redacting | Scrambles the real data in a dataset across multiple records |
| Date aging | Conceals confidential dates by applying random date transformations | Requires assurance that the new dates are consistent with the rest of the data |
| Nulling out | Protects PII by applying a null value to a data column | Prevents unauthorized viewing |
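
Three of these techniques (shuffling, date aging, and nulling out) are simple enough to sketch directly. The function names and sample rows below are hypothetical. The date aging function shifts all dates belonging to one record by the same random offset, which is one way to keep the new dates consistent with each other.

```python
import random
from datetime import date, timedelta

def shuffle_column(rows, column, seed=0):
    """Shuffling: redistribute one column's real values across the records."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

def age_dates(dates, max_days=30, seed=0):
    """Date aging: shift related dates by one random, nonzero offset in days."""
    rng = random.Random(seed)
    offset = rng.choice([n for n in range(-max_days, max_days + 1) if n != 0])
    return [d + timedelta(days=offset) for d in dates]

def null_out(rows, column):
    """Nulling out: replace every value in the column with None."""
    return [{**row, column: None} for row in rows]

rows = [
    {"name": "A", "salary": 50},
    {"name": "B", "salary": 60},
    {"name": "C", "salary": 70},
]
shuffled = shuffle_column(rows, "salary")
nulled = null_out(rows, "salary")

admitted, discharged = date(2024, 1, 10), date(2024, 3, 1)
aged = age_dates([admitted, discharged])
```

Because both dates move by the same offset, the interval between them is unchanged even though neither date matches the original.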

 

Data masking techniques
Data masking techniques are applied to create compliant,
realistic data for software testing and analytics.

05

Why is data masking important?

Data masking solutions are important to enterprises because they enable them to:

  • Achieve compliance with privacy laws, like CPRA, GDPR and HIPAA, by reducing the risk of exposing personal or sensitive data, as one aspect of the total compliance picture.
  • Protect data in testing environments from cyber-attacks, while preserving its usability and consistency.
  • Reduce the risk of data sharing, e.g., in the case of cloud migrations, or when integrating with third-party apps.

Masking tools are now needed more than ever before, to effectively safeguard sensitive data and to address the following challenges:

Regulatory compliance

Highly regulated industries, like financial services and healthcare, already operate under strict privacy regulations. Besides adhering to regional standards, such as Europe’s GDPR, California’s CPRA, or Brazil’s LGPD, companies in these fields rely on PII data masking to comply with the Payment Card Industry Data Security Standard (PCI DSS), and the Health Insurance Portability and Accountability Act (HIPAA). 

Insider threats

Many employees and third-party contractors access enterprise systems on a regular basis, for example for software testing or analytics purposes. Sensitive production data is particularly vulnerable, because it is often copied into development, testing, and other pre-production environments. According to a Ponemon Institute report, insider threats have risen 47% since 2018, and protecting sensitive data costs companies an average of $200,000 per year.

External threats

In 2020, personal data was compromised in 58% of data breaches, according to a Verizon report. The report further indicates that in 72% of cases, the victims were large enterprises. Given the vast volume, variety, and velocity of enterprise data, it is no wonder that breaches proliferate. Protecting sensitive data in non-production environments, for example by masking it, significantly reduces the risk.

Data governance

Your data masking tool should be secured with Role-Based Access Control (RBAC). While static data masking obscures a single dataset, dynamic data masking provides more granular controls. With dynamic data masking, permissions can be granted or denied at many different levels. Only those with the appropriate access rights can access the real data. Others will see only the parts that they are allowed to see. You should also be able to apply different masking policies to different users.

Flexibility

Data masking is highly customizable. Data teams can choose which data fields get masked, and how to select and format each substitute value. For example, every Social Security Number (SSN) has the format xxx-xx-xxxx, where “x” is a digit from 0 to 9. They can substitute the first five digits with the letter x, or replace all 9 digits with other random digits, according to their needs.
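
The two SSN options just described can be sketched as follows. The function names are hypothetical, and the random variant makes no attempt to avoid number ranges that are never issued as real SSNs.

```python
import random

def redact_ssn(ssn: str) -> str:
    """Option 1: replace the first five digits with the letter x, keep the last four."""
    return "xxx-xx-" + ssn[-4:]

def randomize_ssn(ssn: str, rng: random.Random) -> str:
    """Option 2: replace every digit with a random digit, keeping the dashes in place."""
    return "".join(str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in ssn)

partial = redact_ssn("123-45-6789")  # "xxx-xx-6789"
random_ssn = randomize_ssn("123-45-6789", random.Random(0))
```

Both outputs keep the xxx-xx-xxxx layout, so downstream code that validates the format continues to work.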

06

Data masking challenges

Not only must the altered data retain the basic characteristics of the original data, it must also be transformed enough to eliminate the risk of exposure, while retaining referential integrity.

Enterprise IT landscapes typically have many production systems that are deployed on premises and in the cloud, across a wide variety of technologies. To mask data effectively, here’s a checklist of must-haves:

  1. Format preservation

    Your data masking tool must understand your data and what it represents. When the real data is replaced with fake data, it should preserve the original format. This capability is essential for data fields that require a specific structure or order, such as dates.

  2. Referential integrity

    Relational database tables are connected through primary and foreign keys. When your masking solution hides or substitutes the values of a table’s primary key, those values must be changed consistently across all tables and databases that reference them. For example, if Rick Smith is masked as Sam Jones, that identity must be consistent wherever it resides.

  3. PII discovery

    PII is scattered across many different databases. The right data masking tool should be able to discover where it’s hiding with advanced capabilities like GenAI-powered PII discovery.

  4. Data governance

    Data access policies – based on role, location, or permissions – must be established and adhered to.

  5. Scalability

    Real-time access to structured and unstructured data and mass/batch data extraction must be ensured.

  6. Integration

    On-prem or cloud integration with any data source, technology, or vendor is a must, with connections to relational databases, NoSQL sources, legacy systems, message queues, flat files, XML documents, and more.
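
The first two items on this checklist, format preservation and referential integrity, can be illustrated together. The sketch below assumes a keyed hash (HMAC-SHA256) to derive replacement digits; the key, function name, and tables are hypothetical. It is not a standardized format-preserving encryption scheme, and distinct inputs can occasionally map to the same output.

```python
import hashlib
import hmac

KEY = b"example-masking-key"  # hypothetical; keep real keys out of source code

def mask_digits(value: str, key: bytes = KEY) -> str:
    """Replace each digit with a keyed pseudo-random digit; keep length and punctuation."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out, i = [], 0
    for ch in value:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)  # separators such as dashes stay where they are
    return "".join(out)

customers = [{"customer_id": "1001", "name": "Rick Smith"}]
orders = [
    {"order_id": "A-1", "customer_id": "1001"},
    {"order_id": "A-2", "customer_id": "1001"},
]

# Masking each table independently still yields matching keys,
# because the same input always maps to the same output.
masked_customers = [{**c, "customer_id": mask_digits(c["customer_id"])} for c in customers]
masked_orders = [{**o, "customer_id": mask_digits(o["customer_id"])} for o in orders]
```

Joins between `masked_customers` and `masked_orders` on `customer_id` still work, which is the practical meaning of referential integrity after masking.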

07

Use cases of data masking

Organizations use data masking to comply with data privacy regulations, like GDPR, CPRA, and HIPAA, mainly to safeguard sensitive data, such as Personally Identifiable Information (PII), Protected Health Information (PHI), and financial data. 

Common data masking use cases include:

Software development and testing

Software developers and testers often require real data for testing purposes, but access to production datasets is risky. Data masking methods allow them to work with lifelike test data, without revealing any sensitive information.

Analytics and research

With data masking software at their disposal, data analysts and scientists can work with large datasets knowing that confidential information is protected. At the same time, researchers can provide insights by analyzing trends without ever compromising individual privacy. 

Internal training

By masking data, you can provide real-world examples to your employees without exposing any business or customer data. Your staff can learn and practice skills without having to access any data they’re not authorized to see.

External collaboration

Sometimes you need to share data with external consultants, partners, or vendors. With effective PII masking, you can collaborate with third parties without the risk of exposing sensitive data.

Retrieval-augmented generation

Retrieval-Augmented Generation (RAG) is a generative AI framework that's quickly emerging as a key use case for data masking. GenAI-powered PII discovery uses your Large Language Model (LLM) to automatically identify and classify your sensitive data, so that it can be masked on the fly.

08

How does data masking work?

To understand how data masking works, let’s compare it to data encryption and data tokenization.

While data masking is irreversible, encryption and tokenization are both reversible in the sense that the original values can be derived from the obscured data. Here’s a brief explanation of the 3 methods:

Data masking

Data masking tools substitute realistic, but fake, data for the original values, to ensure data privacy. Development, support, data science, business intelligence, testing, and training teams use masked data to make use of a dataset without exposing real data to any risk. 

There are many techniques for masking data, such as data scrambling, data blinding, and data shuffling, several of which are described in the data masking techniques section. The process of permanently removing all PII from sensitive data is also known as data sanitization. There is no algorithm to recover the original values of masked data.

Data encryption

Data encryption uses an algorithm and a key to convert data into unreadable ciphertext, which can be restored only with the correct key. While data encryption is very secure, data teams can’t analyze or work with encrypted data. The stronger the encryption algorithm and key, the safer the data will be from unauthorized access. In a data masking vs encryption comparison, encryption is ideal for storing or transferring sensitive data securely.

Data tokenization

Data tokenization, which substitutes a sensitive data element with random characters (tokens), is a reversible process. The tokens can be mapped back to the original data, with the mappings stored in a secure “data vault”.

In a data masking vs tokenization comparison, tokenization supports operations like processing a credit card payment without revealing the credit card number. The real data never leaves the organization and can’t be seen or decrypted by a third-party processor.
 
Data tokenization supports the Payment Card Industry Data Security Standard (PCI DSS).

Data masking is not reversible, making it more secure and less costly than tokenization. It maintains referential context and relational integrity across systems and databases, which is critical in data analysis and software testing.

Relational integrity means that the data remains valid and consistent after de-identification. For example, a real credit card number can be replaced by any 16-digit figure. Once masked and validated, the new value will appear consistently across all systems.

There are 2 major differences between data masking and encryption/tokenization:

  1. Masked data is usable in its anonymized form.

  2. Once data is masked, the original value can’t be recovered.
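
The difference in reversibility can be made concrete with a small sketch. The `TokenVault` class and `mask_card` function are hypothetical and heavily simplified: a real vault would be an encrypted, access-controlled store, and real masking would usually also keep values consistent across systems, which this contrast deliberately leaves out.

```python
import secrets

class TokenVault:
    """Tokenization: random tokens plus a stored mapping back to the original."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        if value not in self._value_to_token:
            token = secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        # Reversal is possible only because the vault kept the mapping.
        return self._token_to_value[token]

def mask_card(number: str) -> str:
    """Masking: replace every character with a random digit and keep no mapping."""
    return "".join(str(secrets.randbelow(10)) for _ in number)

card = "4111111111111111"
vault = TokenVault()
token = vault.tokenize(card)
recovered = vault.detokenize(token)  # the original value comes back
masked = mask_card(card)             # nothing exists that could map this back
```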

Data masking via customizable functions
Data masking substitutes real information with random characters.

09

Benefits of data masking

Data masking offers many benefits, including:

  1. Data compliance
    Industries like financial services and healthcare are subject to strict data protection laws. Data masking ensures that only authorized users get access to confidential information.
  2. Data efficiency
    Data masking increases data efficiency by allowing masked datasets to be reused, reducing data linkage errors, and eliminating the need for secure storage.
  3. Data privacy
    Data masking minimizes the risk of exposing someone's identity by preventing unauthorized access to personal information.
  4. Data security
    Data masking reduces the chance of a data breach by concealing PII and other sensitive data. Simply put, masked datasets are a lot less attractive to malicious actors. 
  5. Data sharing 
    Sometimes you need to share data with third parties for analysis or research purposes. Data masking ensures that no sensitive data is ever revealed in the process. 
  6. Data testing
    Data masking lets you use realistic data for testing purposes, an increasingly important factor in software development.

10

Entity-based data masking

Entity-based data masking resolves the most common data masking challenges, while enabling data masking best practices. It masks all of the sensitive data associated with a specific business entity – e.g., customer, loan, order, or payment – and makes it accessible to authorized data consumers based on role-based access controls. 

While other data protection methods store sensitive data in a staging environment, entity-based data masking ingests, masks, and delivers masked data in flight.

Here's how it works:

  1. The business entity data is ingested from all relevant sources.

  2. The sensitive data is masked with referential integrity maintained.

  3. The masked dataset is delivered to downstream systems.

For example, if customer data is stored in 4 different source systems (let's say orders, invoices, payments, and service tickets), then entity-based data masking ingests and unifies customer data from the 4 systems to create a "golden record" for each customer. The PII data associated with the individual customer is masked consistently, and the anonymized customer data is provisioned to the downstream systems or data stores. Moreover, if, for example, the customer's status was masked to "VIP", which requires a certain payment threshold to have been met, then the customer's payments are increased accordingly to ensure semantic consistency with the VIP status.
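
The customer example can be sketched as follows. Everything here is an assumption made for illustration: the source tables, the VIP threshold, the fake name pool, and the function names. The sketch shows the three steps (ingest, mask with consistency, deliver) plus the payment adjustment that keeps the data semantically consistent with a masked VIP status.

```python
import hashlib
import hmac
import math

KEY = b"example-masking-key"  # hypothetical
FAKE_NAMES = ["Sam Jones", "Dana Brown", "Lee Clark", "Kim Davis"]
VIP_THRESHOLD = 10_000        # hypothetical payment total required for VIP status

sources = {
    "orders": [{"customer_id": 7, "name": "Rick Smith", "order": "A-1"}],
    "payments": [
        {"customer_id": 7, "name": "Rick Smith", "amount": 2_500},
        {"customer_id": 7, "name": "Rick Smith", "amount": 1_500},
    ],
    "tickets": [{"customer_id": 7, "name": "Rick Smith", "issue": "billing"}],
}

def fake_name(real_name: str) -> str:
    """Deterministic replacement, so every source system gets the same fake name."""
    digest = hmac.new(KEY, real_name.encode("utf-8"), hashlib.sha256).digest()
    return FAKE_NAMES[digest[0] % len(FAKE_NAMES)]

def build_entity(all_sources, customer_id):
    """Ingest: gather copies of this customer's rows from every source system."""
    return {
        system: [dict(row) for row in rows if row["customer_id"] == customer_id]
        for system, rows in all_sources.items()
    }

def mask_entity(entity, masked_status="VIP"):
    """Mask: apply one fake name everywhere, then align payments with the status."""
    records = {system: [dict(row) for row in rows] for system, rows in entity.items()}
    for rows in records.values():
        for row in rows:
            row["name"] = fake_name(row["name"])
    payments = records.get("payments", [])
    total = sum(p["amount"] for p in payments)
    if masked_status == "VIP" and 0 < total < VIP_THRESHOLD:
        factor = VIP_THRESHOLD / total
        for p in payments:
            p["amount"] = math.ceil(p["amount"] * factor)  # round up to reach the threshold
    return {"status": masked_status, "records": records}

golden_record = build_entity(sources, 7)
masked_customer = mask_entity(golden_record)  # ready to deliver downstream
```

Masking the entity as a unit is what makes the payment adjustment possible: a column-by-column approach would have no way to know that the status and the payments belong to the same customer.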

With business entities, it’s easy to protect data – at rest and in transit – for data analytics, software testing, and training environments. The entity-based approach supports structured and unstructured data masking, static and dynamic data masking, test data masking, and more. Images, PDFs, text, and XML files that may contain PII are protected, while operational and analytical workloads continue to run without interruption. 

If you’d like the latest data masking capabilities and want to avoid the vulnerabilities of conventional methods, a business entity approach is the right way to go.
