    Confused by De-Identification vs Pseudonymization?

    Amitai Richman

    Product Marketing Director

    Confusing de-identification with pseudonymization can put personal or sensitive data at risk, create a false sense of security, and endanger compliance.   

    Table of Contents


    What is De-identification?
    What is Pseudonymization? 
    Why are De-identification vs Pseudonymization Needed? 
    Re-identification Risk Assessment via Intruder Testing 
    De-identification vs Pseudonymization Using Business Entities  

    What is De-identification? 

    Data de-identification eliminates direct and indirect identifiers in a dataset by removing Personally Identifiable Information (PII) from documents, media sources, and other data records. De-identification is one of the quickest and easiest ways to keep personal information safe while bringing the organization in line with data security regulations like HIPAA, GDPR, CPRA, and others.  

    Using data de-identification techniques, enterprises can effectively break the linkage between data and people’s identities. This empowers organizations to use their datasets for marketing, research, customer service, and many other internal or external uses – while still protecting individual privacy and ensuring compliance with data protection laws. 

    Since the method for de-identifying a dataset depends on the type of identifiers it contains, understanding how de-identification works requires differentiating between direct identifiers and indirect identifiers. Direct identifiers are values that can clearly and uniquely identify an individual, like name, address, phone number, email, Social Security number, and more. Indirect identifiers are values that might identify a person, yet are also important for analysis, like socio-economic classifications, demographic data, and more. 

    There are two types of de-identified data: 

    1. Standard de-identified data is suppressed, generalized, or swapped. For example, a measurement like 3.2 might be replaced by a range of 3.0-3.5, or a value like "male" could be swapped with "female". 

    2. Protected de-identified data is de-identified using the same techniques, but is further secured with safeguards and controls, adding an extra layer of safety. 
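    The three standard techniques named above (suppression, generalization, and swapping) can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names, the bucket width of 0.5, and the record layout are all assumptions made for the example.

```python
import random


def suppress(record, fields):
    """Suppression: drop the listed fields from a record entirely."""
    return {k: v for k, v in record.items() if k not in fields}


def generalize(value, width=0.5):
    """Generalization: replace a number with the range (bucket) containing it."""
    low = (value // width) * width
    return f"{low:.1f}-{low + width:.1f}"


def swap_column(records, field, seed=None):
    """Swapping: shuffle one field's values across records.

    The overall distribution of the field is preserved, but a value
    no longer necessarily belongs to the record it appears in.
    """
    rng = random.Random(seed)
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


# Hypothetical sample records, for illustration only.
records = [
    {"name": "Pat Doe", "score": 3.2, "sex": "male"},
    {"name": "Sam Roe", "score": 4.7, "sex": "female"},
]

cleaned = [suppress(r, {"name"}) for r in records]
cleaned = [{**r, "score": generalize(r["score"])} for r in cleaned]
cleaned = swap_column(cleaned, "sex", seed=1)
```

    With the default width, generalize(3.2) yields the range 3.0-3.5 used in the example above. The bucket width controls the trade-off: wider ranges hide more but leave less detail for analysis.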

    Interestingly, although “data de-identification” and “data anonymization” are frequently used interchangeably, they are different in that de-identification hides or removes only explicit identifiers, while anonymization ensures that any data can’t be linked to an individual in any way. 

    What is Pseudonymization? 

    Pseudonymization is a data protection technique that makes the sensitive data in a dataset unidentifiable unless additional information is applied. The "additional information", which varies with the pseudonymization technique in use, needs to be stored separately in accordance with data security standards. Note that pseudonymized data can still be indirectly linked to an individual and is thus still treated as personal data under regulations like GDPR. 

    Usually, pseudonymization is accomplished by substituting PII values like name, ID number, or date of birth with a random code. But there are numerous other methods of pseudonymization, including the use of: 

    • Cryptographic hash functions, which map input strings of arbitrary length to fixed-length outputs and are applied directly to the identifier  

    • Random number generators, which create a random number and then assign it to an identifier 

    • Message authentication codes, which are keyed-hash functions that require a secret key to generate the pseudonym for each data field 

    • Monotonic counters, which substitute an identifier with a unique, non-repeating value

    • Encryption, which safeguards identifiers as long as the encryption key remains uncompromised 
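    As a rough sketch of how several of these methods differ in practice, the Python below uses only the standard library. The names hash_pseudonym, mac_pseudonym, and TokenVault are hypothetical and chosen for this example; encryption is left out because it would need a third-party library. The lookup table inside TokenVault plays the role of the "additional information" that has to be stored separately.

```python
import hashlib
import hmac
import itertools
import secrets


def hash_pseudonym(identifier: str) -> str:
    """Unkeyed cryptographic hash: same input always gives the same output.

    Anyone who can guess candidate identifiers can recompute the hash,
    so this is weak for small input spaces such as phone numbers.
    """
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def mac_pseudonym(identifier: str, key: bytes) -> str:
    """Message authentication code (HMAC): a keyed hash.

    Without the secret key, the pseudonym cannot be recomputed.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


class TokenVault:
    """Maps identifiers to tokens and keeps the mapping for re-identification.

    make_token is any zero-argument callable that returns a fresh token.
    This sketch assumes make_token never returns the same token twice.
    """

    def __init__(self, make_token):
        self._make_token = make_token
        self._forward = {}
        self._reverse = {}

    def pseudonymize(self, identifier: str) -> str:
        if identifier not in self._forward:
            token = self._make_token()
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def reidentify(self, token: str) -> str:
        return self._reverse[token]


# Random number generator: tokens carry no information about the input.
random_vault = TokenVault(lambda: secrets.token_hex(8))

# Monotonic counter: tokens are unique and non-repeating by construction.
_counter = itertools.count(1)
counter_vault = TokenVault(lambda: f"ID-{next(_counter):06d}")
```

    The hash and MAC variants need no lookup table, which is convenient but means the pseudonym is derived from the identifier itself. The vault variants are only as safe as the storage of the mapping.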

    Why are De-identification vs Pseudonymization Needed? 

    With data breaches and cyberattacks more common than ever, data that reveals private information about individuals is in demand by malicious actors, and therefore highly vulnerable. At the same time, the regulatory environment is growing ever stricter – potentially exposing organizations that need to collect and use PII to fines and penalties for non-compliance with privacy laws.  

    The EU’s General Data Protection Regulation (GDPR) requires that data products and services store, process and display as little PII as possible. Organizations accomplish this wherever possible by: 

    • Not processing data that could directly identify an individual 

    • Gathering only non-sensitive data 

    • Pseudonymizing, or de-identifying, sensitive datasets on demand 

    De-identification and pseudonymization fall under different GDPR categories, which makes it critical for organizations to understand the difference. Because pseudonymized data can be re-identified by anyone holding the additional information, it is considered personal data under GDPR, and thus subject to the regulation's built-in restrictions. Data that has been de-identified to the point where it can no longer be attributed to an individual is not. This means that companies wishing to avoid regulatory liability can choose to de-identify their sensitive data, while those that might need access to data identifiers will often choose pseudonymization. 

    Re-identification Risk Assessment via Intruder Testing 

    Re-identification risk assessment can be an important part of determining whether de-identification vs pseudonymization is preferable for your organization – and how effectively the chosen system has been applied.  

    One methodology for re-identification risk assessment is intruder testing, which assesses whether “friendly intruders” (opportunistic players, as opposed to specialist hackers) could achieve re-identification of anonymized data if they tried. In a typical test scenario, an attacker attempts to re-identify one or more individuals in the de-identified dataset based on a predefined level of assumed knowledge, computational skill, and financial resources.  

    Intruder testing can help organizations determine whether their de-identification is sufficient. Setting up such a test requires certain assumptions. One is whether an unauthorized person could re-identify data by discovering which sections contain anonymized data and then matching them with public records. Another is the skill and tooling of a likely attacker, which depends on the nature of the dataset. For example, a dataset liable to attract state-supported threat actors would be attacked with more substantial re-identification resources than a less interesting dataset. 
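    The "matching with public records" scenario can be simulated as a simple linkage attack. The sketch below is one possible way to do it, under stated assumptions: the intruder knows a fixed set of indirect identifiers (here zip, birth_year, and sex, chosen purely for illustration), holds a public list with names, and counts a record as re-identified only when the match is unambiguous on both sides. All field names, function names, and sample data are hypothetical.

```python
from collections import Counter

# Assumed intruder knowledge: these indirect identifiers appear in both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key_of(record, fields=QUASI_IDENTIFIERS):
    """Combine a record's indirect identifiers into one comparable key."""
    return tuple(record[f] for f in fields)


def unique_fraction(released, fields=QUASI_IDENTIFIERS):
    """Share of released records whose key is unique within the release."""
    counts = Counter(key_of(r, fields) for r in released)
    unique = sum(1 for r in released if counts[key_of(r, fields)] == 1)
    return unique / len(released)


def linkage_attack(released, public, fields=QUASI_IDENTIFIERS):
    """Return (released_record, name) pairs the intruder can link with confidence."""
    released_counts = Counter(key_of(r, fields) for r in released)
    public_by_key = {}
    for p in public:
        public_by_key.setdefault(key_of(p, fields), []).append(p)

    matches = []
    for r in released:
        k = key_of(r, fields)
        candidates = public_by_key.get(k, [])
        # Count a hit only when exactly one released record and
        # exactly one public record share this key.
        if released_counts[k] == 1 and len(candidates) == 1:
            matches.append((r, candidates[0]["name"]))
    return matches


# Hypothetical de-identified release and public register.
released = [
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "A"},
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "B"},
    {"zip": "10002", "birth_year": 1975, "sex": "M", "diagnosis": "C"},
]
public = [
    {"name": "Pat Doe", "zip": "10002", "birth_year": 1975, "sex": "M"},
    {"name": "Sam Roe", "zip": "10001", "birth_year": 1980, "sex": "F"},
]

hits = linkage_attack(released, public)
```

    In this sample, only the third released record is linked to a name, because the first two share the same combination of indirect identifiers. Changing QUASI_IDENTIFIERS is a way to model a more or less knowledgeable intruder.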

    De-identification vs Pseudonymization Using Business Entities  

    The most effective and technologically advanced techniques for data de-identification and pseudonymization rely on entity-based data masking technology.  

    Under this model, a business entity is defined as all the data associated with a specific customer, invoice, device, etc. Data relating to each business entity instance (a single customer, for example) is stored and managed in an individually encrypted Micro-Database™.  

    The entity-based approach leverages intelligent business rules, enabling enterprises to better maintain productivity, while still ensuring compliance and customer privacy. 
