Data de-identification is a data masking method that severs the connection between data and the person associated with it, to ensure privacy compliance.
Table of Contents
What is Data De-Identification?
How Does Data De-Identification Work?
Regulatory Demands for Data De-Identification
Data De-Identification vs Sanitization, Anonymization, and Tokenization
Data De-Identification Based on Business Entities
Data de-identification is a method of data masking that removes Personally Identifiable Information (PII) from a document, media source, or other data record. It is a simple and fast way to protect the sensitive information found in datasets, while at the same time bringing organizations in line with data privacy regulations like HIPAA and GDPR.
Data de-identification enables information in a dataset to be used for research, customer service, marketing, or any other authorized internal or external use without the possibility of individual privacy being compromised. By applying data de-identification techniques to their data masking tools, organizations can safeguard privacy, build trust and brand equity, and enhance their competitive edge.
To understand de-identification, it’s first necessary to differentiate between direct identifiers and indirect identifiers (also known as quasi-identifiers). This is because the method for securing identifiers depends, in large part, on the type of identifiers being secured.
Direct identifiers are values which could clearly and uniquely identify an individual: name, email, address, social security number, etc.
Indirect identifiers are values that could identify a person when combined with other data, but whose main value lies in analysis: demographic information, socio-economic details, etc.
Once you’ve identified the type of identifier requiring data de-identification, there are numerous techniques to de-identify these values. Some of the most common include:
Differential privacy
Calibrated random noise is added to query results or statistics, so that patterns in a dataset can be analyzed without revealing whether any one individual's data is included.
Omission
Direct identifiers, like names, are simply omitted from the datasets.
Redaction
Direct or indirect identifiers are removed or obfuscated from all types of data records, including video or audio files (via pixelation or other techniques).
Suppression
Values are removed from the dataset or are replaced with similar indicative information.
Hashing
Identifiers are passed through a one-way hash function, producing fixed-length values that cannot be decrypted back to the original.
Swapping
Values are exchanged between individuals (for example, Joan’s salary is swapped with Dave’s), while leaving the aggregate value of the dataset field valid.
Pseudonymization
Pseudonymization substitutes direct or indirect identifiers with temporary but unique IDs or codes.
Micro-aggregation
Similar numerical values are grouped together, with individual values represented as the mean of the group (for example, for everyone aged 15, 16, and 17, the age field is changed to 16).
Generalization
The generalization technique substitutes an exact value with a less-specific value (for example, changing an exact birth date to just month/year).
K-anonymity
K-anonymity defines a set of quasi-identifiers and ensures that every combination of their values that appears in the dataset is shared by at least “K” individuals.
Addition of noise
Noise addition adds a randomly generated value, drawn from a distribution with mean zero and positive variance, to an original variable.
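Several of the techniques above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the `SECRET_KEY` value, and the field names in the usage example are all hypothetical. Hashing is shown here as a keyed hash (HMAC-SHA256) rather than a bare hash, since unkeyed hashes of low-entropy values such as phone numbers can often be guessed by brute force.

```python
import hashlib
import hmac
import random
from collections import Counter
from datetime import date

# Hypothetical key for illustration only; a real key would be stored securely.
SECRET_KEY = b"example-key-not-for-production"


def hash_identifier(value: str) -> str:
    """Hashing: one-way keyed digest (HMAC-SHA256) of a direct identifier."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_birth_date(d: date) -> str:
    """Generalization: keep only the year and month of an exact birth date."""
    return f"{d.year:04d}-{d.month:02d}"


def micro_aggregate(values: list, group_size: int) -> list:
    """Micro-aggregation: sort the values, split them into groups of
    `group_size`, and replace each value with the mean of its group.
    The last group may be smaller if the count is not a multiple."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


def add_noise(value: float, sigma: float, rng: random.Random) -> float:
    """Noise addition: add a Gaussian draw with mean zero and std dev `sigma`."""
    return value + rng.gauss(0.0, sigma)


def is_k_anonymous(records: list, quasi_identifiers: list, k: int) -> bool:
    """K-anonymity check: every combination of quasi-identifier values
    that occurs in `records` must occur at least `k` times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


if __name__ == "__main__":
    print(micro_aggregate([15, 16, 17], 3))          # [16.0, 16.0, 16.0]
    print(generalize_birth_date(date(1990, 5, 17)))  # 1990-05
    people = [
        {"age_band": "30-39", "zip3": "941"},
        {"age_band": "30-39", "zip3": "941"},
    ]
    print(is_k_anonymous(people, ["age_band", "zip3"], k=2))  # True
```

The k-anonymity function only checks a dataset; reaching a target K usually means generalizing or suppressing values first and then re-running the check.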
Numerous regulatory regimes require data de-identification to maintain compliance. For example, the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) delineates two separate techniques for data de-identification:
Expert Determination
Expert determination applies statistical and scientific principles to data to dramatically reduce the risk of re-identification. It’s considered the most flexible method of data de-identification because it can be customized to each use case. By using quantitative methods to reduce the risk of re-identification of anonymized data, expert determination also enables data generalization and automation. However, expert determination is an inherently manual process – requiring (as the name suggests) the involvement of a human statistical expert. This makes expert determination prohibitively expensive at scale.
Safe Harbor
The Safe Harbor technique of data de-identification, developed by the US Department of Health and Human Services (HHS), requires the removal of 18 types of identifiers (both direct and indirect) to assure that the information can’t be linked to a specific individual. Common identifiers include:
1. Name
2. Date of birth
3. Phone number
4. Street address
5. Fax number
6. Social Security Number
7. Email address
8. Bank account number
9. Medical record number
10. Health plan beneficiary number
11. Business license number
12. Vehicle registration number
13. Web URL
14. Device serial number
15. Internet Protocol (IP) address
16. Passport or driver’s license photo
17. Biometric identifier
18. Any unique ID number
HIPAA classifies these identifiers as Protected Health Information (PHI). This means that their usage and disclosure are limited – which is why they require data de-identification. Although the Safe Harbor technique is simple and cost-effective, it’s not suitable for every use case. In some cases, it is overly restrictive, rendering the data unusable; in others, it’s overly permissive, leaving multiple direct identifiers unsecured.
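As a rough illustration of the Safe Harbor idea, the sketch below drops listed identifier fields from a structured record and redacts two identifier patterns in free text. The field names in `SAFE_HARBOR_FIELDS`, the function names, and the two regular expressions are assumptions made for this example. They cover only part of the list above, and running code like this does not by itself establish HIPAA compliance.

```python
import re

# Hypothetical field names, loosely mapped from the identifier types listed above.
SAFE_HARBOR_FIELDS = {
    "name", "date_of_birth", "phone_number", "street_address", "fax_number",
    "ssn", "email", "bank_account_number", "medical_record_number",
    "health_plan_beneficiary_number", "ip_address", "device_serial_number",
}

# Simplified patterns for illustration; real-world formats vary widely.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def strip_identifiers(record: dict) -> dict:
    """Return a copy of `record` without any field named in SAFE_HARBOR_FIELDS."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}


def redact_text(text: str) -> str:
    """Replace email addresses and US SSN-shaped strings with a placeholder."""
    text = EMAIL_PATTERN.sub("[REDACTED]", text)
    return SSN_PATTERN.sub("[REDACTED]", text)


if __name__ == "__main__":
    patient = {
        "name": "Joan Smith",
        "ssn": "123-45-6789",
        "diagnosis_code": "E11.9",
        "visit_year": 2023,
    }
    print(strip_identifiers(patient))
    print(redact_text("Contact joan@example.com, SSN 123-45-6789."))
```

Dropping whole fields is what makes this approach simple, and also what can make it overly restrictive: any analytic value carried by a removed field is lost.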
Data de-identification should be differentiated from several other data privacy methodologies, including data sanitization, data anonymization, and data tokenization:
Sanitization
Also known as data cleansing or scrubbing, data sanitization detects, corrects, or removes personal or sensitive information from a dataset to keep unauthorized users from identifying specific people. Sanitization is typically the method of choice for data deletion or transfer (for example, in the recycling of a company computer).
Anonymization
Data anonymization removes or obfuscates sensitive values, replacing them with realistic fake data and creating a version of the dataset that can’t be decoded or reverse engineered. There are a few ways to accomplish this, including word or character replacement, shuffling, and encryption. Data anonymization tools are usually applied to direct identifiers like names and phone numbers – to make the information look consistent, remain usable, and appear real.
Tokenization
Tokenization substitutes personal data with a random token – often completely random numbers or tokens created by one-way functions like hashes. Although a link is frequently maintained between the original information and the token (in an offsite token vault), there is no direct mathematical relationship between them. So, when data tokenization tools are employed, the tokenized data can’t be deciphered or reverse engineered (without the keys to the vault).
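The vault-based variant described above can be sketched as follows. This is a minimal in-memory illustration: the `TokenVault` class and its method names are hypothetical, and a real vault would be a separate, access-controlled service rather than a pair of dictionaries. Tokens come from Python's `secrets` module, so they have no mathematical relationship to the original values.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault, for illustration only."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for `value`, or mint a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(16)  # 32 hex characters, generated at random
        while token in self._token_to_value:  # extremely unlikely collision
            token = secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only possible with access to the vault."""
        return self._token_to_value[token]


if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenize("4111-1111-1111-1111")
    print(token)                    # random on every run
    print(vault.detokenize(token))  # 4111-1111-1111-1111
```

Returning the same token for a repeated value keeps joins across tables working, at the cost of revealing that two records share a value. Whether that trade-off is acceptable depends on the use case.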
The most effective method of data de-identification relies on the business entity approach to data masking challenges – where a “business entity” is a given asset or attribute of the business itself: a customer, device, or invoice. The data associated with each business entity instance is aggregated and stored in an individually encrypted Micro-Database™. By basing data de-identification on business entities, organizations raise productivity without compromising compliance and customer privacy.