    Data De-Identification: What is it and Why do I Need it?

    Amitai Richman

    Product Marketing Director

    Data de-identification is a data masking method that severs the connection between data and the person associated with it, to ensure privacy compliance. 

    Table of Contents


    What is Data De-Identification?  
    How Does Data De-Identification Work? 
    Regulatory Demands for Data De-Identification  
    Data De-Identification vs Sanitization, Anonymization, and Tokenization 
    Data De-Identification Based on Business Entities  

    What is Data De-Identification? 

    Data de-identification is a method of data masking that removes Personally Identifiable Information (PII) from a document, media source, or other data record. It is the simplest and fastest way to protect the sensitive information found in datasets, while at the same time bringing organizations in line with data privacy regulations like HIPAA, GDPR, and more.  

    Data de-identification enables information in a dataset to be used for research, customer service, marketing, or any other authorized internal or external use, while greatly reducing the risk of individual privacy being compromised. By applying data de-identification techniques to their data masking tools, organizations can safeguard privacy, build trust and brand equity, and enhance their competitive edge. 

    How Does Data De-Identification Work? 

    To understand de-identification, it’s first necessary to differentiate between direct identifiers and indirect identifiers (also known as quasi-identifiers). This is because the method for securing identifiers depends, in large part, on the type of identifiers being secured. 

    • Direct identifiers are values which could clearly and uniquely identify an individual: name, email, address, social security number, etc.

    • Indirect identifiers are values that can potentially identify a person, but are more important in their value for analysis: demographic information, socio-economic details, etc.

    Once you’ve identified the type of identifier requiring data de-identification, there are numerous techniques to de-identify these values. Some of the most common include:

    1. Differential privacy
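      Calibrated random noise is added to the results of queries or analyses, so that patterns in the dataset can be studied while the presence or absence of any single individual has only a limited effect on the output.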

    2. Omission 
      Direct identifiers, like names, are simply omitted from the datasets. 

    3. Redaction
      Direct or indirect identifiers are removed or obfuscated from all types of data records, including video or audio files (via pixelation or other techniques). 

    4. Suppression
      Values are removed from the dataset or are replaced with similar indicative information.

    5. Hashing 
      Identifiers are passed through a one-way hash function, producing a fixed-length digest that cannot be decrypted back into the original value. 

    6. Swapping 
      Values are exchanged between individuals (for example, Joan’s salary is swapped with Dave’s), while leaving the aggregate value of the dataset field valid. 

    7. Pseudonymization 
      Pseudonymization substitutes direct or indirect identifiers with temporary but unique IDs or codes. 

    8. Micro-aggregation 
      Similar numerical values are grouped together, with individual values represented as the mean of the group (for example, for everyone aged 15, 16, and 17, the age field is changed to 16). 

    9. Generalization 
      The generalization technique substitutes an exact value with a less-specific value (for example, changing an exact birth date to just month/year). 

    10. K-anonymity 
      K-anonymity defines a set of quasi-identifiers and ensures that every combination of their values is shared by at least “K” individuals in the dataset. 

    11. Addition of noise 
      A random value, drawn from a distribution with mean zero and positive variance, is added to the original value of a variable. 
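
    Several of the techniques above can be sketched in a few lines of Python. The example below is illustrative only: the sample records, field names, secret key, and helper function names are all assumptions made for the sake of the example, not part of any particular product. Hashing is shown as a keyed hash (HMAC-SHA-256), a common choice because plain, unkeyed hashes of easily guessed values such as names can be matched by hashing candidate values.

```python
import hashlib
import hmac
import random
from collections import Counter
from statistics import mean

# Hypothetical sample records; every field name and value is made up.
RECORDS = [
    {"name": "Joan", "birth_date": "1990-04-12", "zip": "10001", "age": 15},
    {"name": "Dave", "birth_date": "1990-04-30", "zip": "10001", "age": 16},
    {"name": "Aiko", "birth_date": "1985-11-02", "zip": "94105", "age": 17},
]

# Assumption: the key is stored separately from the de-identified dataset.
SECRET_KEY = b"replace-with-a-real-secret"


def hash_identifier(value: str) -> str:
    """Hashing: keyed one-way digest of a direct identifier."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_birth_date(iso_date: str) -> str:
    """Generalization: keep only the year and month of a YYYY-MM-DD date."""
    return iso_date[:7]


def micro_aggregate(values: list) -> list:
    """Micro-aggregation: replace every value in a group with the group mean."""
    group_mean = mean(values)
    return [group_mean for _ in values]


def add_noise(value: float, sigma: float, rng: random.Random) -> float:
    """Noise addition: add a draw from a zero-mean normal distribution."""
    return value + rng.gauss(0.0, sigma)


def smallest_group_size(records: list, quasi_identifiers: list) -> int:
    """K-anonymity check: size of the smallest group of records that
    share the same combination of quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Treat the three ages as one micro-aggregation group.
AGES = micro_aggregate([r["age"] for r in RECORDS])

DEIDENTIFIED = [
    {
        "person_id": hash_identifier(r["name"]),                # name is replaced
        "birth_month": generalize_birth_date(r["birth_date"]),  # day is dropped
        "zip": r["zip"],
        "age": age,
    }
    for r, age in zip(RECORDS, AGES)
]

# The dataset is K-anonymous for a given K only if this value is at least K.
K = smallest_group_size(DEIDENTIFIED, ["birth_month", "zip"])
```

    In this sample, one record is the only one with its combination of birth month and zip code, so the smallest group has a single member and the dataset would need further generalization or suppression before it could satisfy K-anonymity for any K above 1.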

    Regulatory Demands for Data De-Identification  

    Numerous regulatory regimes require data de-identification to maintain compliance. For example, the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) delineates two separate techniques for data de-identification: 

    • Expert Determination
      Expert determination applies statistical and scientific principles to data to dramatically reduce the risk of re-identification. It’s considered the most flexible method of data de-identification because it can be customized to each use case. By using quantitative methods to reduce the risk of re-identification of anonymized data, expert determination also enables data generalization and automation. However, expert determination is an inherently manual process – requiring (as the name suggests) the involvement of a human statistical expert. This makes expert determination prohibitively expensive at scale.

    • Safe Harbor
      The Safe Harbor technique of data de-identification, developed by the US Department of Health and Human Services (HHS), requires the removal of 18 types of identifiers (both direct and indirect) to ensure that the information can’t be linked to a specific individual. Common identifiers include: 

    1. Name 
    2. Date of birth 
    3. Phone number 
    4. Street address 
    5. Fax number 
    6. Social Security Number 
    7. Email address 
    8. Bank account number 
    9. Medical record number 
    10. Health plan beneficiary number 
    11. Business license number 
    12. Vehicle registration number 
    13. Web URL 
    14. Device serial number 
    15. Internet Protocol (IP) address 
    16. Passport or driver’s license photo 
    17. Biometric identifier 
    18. Any unique ID number 

    HIPAA classifies these identifiers as Protected Health Information (PHI). This means that their usage and disclosure are limited – which is why they require data de-identification. Although the Safe Harbor technique is simple and cost-effective, it’s not suitable for every use case. In some cases, it is overly restrictive, rendering the data unusable; in others, it’s overly permissive, leaving multiple direct identifiers unsecured. 
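
    The removal step behind Safe Harbor can be illustrated with a minimal Python sketch. The field names and the sample record below are hypothetical, chosen only to mirror some of the identifier types listed above. This shows field removal alone and is not a complete or authoritative implementation of the rule, which carries additional conditions not covered here.

```python
# Hypothetical field names standing in for some of the listed identifier types.
SAFE_HARBOR_FIELDS = frozenset({
    "name",
    "date_of_birth",
    "phone_number",
    "street_address",
    "fax_number",
    "ssn",
    "email",
    "bank_account_number",
    "medical_record_number",
    "ip_address",
})


def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record without any field flagged as an identifier."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}


# Made-up record used only for illustration.
patient = {
    "name": "Joan Doe",
    "email": "joan@example.com",
    "ssn": "000-00-0000",
    "diagnosis_code": "J45",
    "visit_year": 2024,
}

cleaned = strip_identifiers(patient)
```

    A fixed deny-list like this is what makes the approach simple, and also what makes it rigid: every flagged field is dropped whether or not the use case needs it, and any identifying field that is not on the list passes through untouched.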

    Data De-Identification vs Sanitization, Anonymization, and Tokenization 

    Data de-identification should be differentiated from several other data privacy methodologies, including data sanitization, data anonymization, and data tokenization:

    • Sanitization 
      Also known as data cleansing or scrubbing, data sanitization detects, corrects, or removes personal or sensitive information from a dataset to keep unauthorized users from identifying specific people. Sanitization is typically the method of choice for data deletion or transfer (for example, in the recycling of a company computer). 

    • Anonymization 
      Data anonymization removes or obfuscates sensitive values, replacing them with realistic fake data and creating a version of the dataset that can’t be decoded or reverse engineered. There are a few ways to accomplish this, including word or character replacement or shuffling and encryption. Data anonymization tools are usually applied to direct identifiers like names and phone numbers – to make the information look consistent, remain usable, and appear real. 

    • Tokenization 
      Tokenization substitutes personal data with a random token – often completely random numbers or tokens created by one-way functions like hashes. Although a link is frequently maintained between the original information and the token (in an offsite token vault), there is no direct mathematical relationship between them. So, when data tokenization tools are employed, the tokenized data can’t be deciphered or reverse engineered (without the keys to the vault).
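
    The vault-based tokenization described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the TokenVault class and its method names are hypothetical, and a real vault would be a separately secured, access-controlled store rather than a Python dictionary.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for an offsite token vault."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for the value, reusing any existing one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it has no mathematical link to the value.
        token = secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only possible with vault access."""
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("000-00-0000")   # made-up sample value
original = vault.detokenize(token)
```

    Because the token is generated randomly rather than derived from the input, the only route back to the original value is the lookup table held in the vault.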

    Data De-Identification Based on Business Entities  

    The most effective method of data de-identification relies on the business entity approach to data masking challenges – where a “business entity” is a given asset or attribute of the business itself: a customer, device, or invoice. The data associated with each business entity instance is aggregated and stored in an individually encrypted Micro-Database™. By basing data de-identification on business entities, organizations raise productivity without compromising compliance and customer privacy. 
