Re-identification of anonymized data occurs when an individual can be identified by linking masked data with public records or combined personal attributes.
Table of Contents
What is Re-Identification of Anonymized Data?
Why is Re-Identification of Anonymized Data a Growing Concern?
How Can Companies Prevent Re-Identification of Anonymized Data?
Preventing Re-Identification of Anonymized Data with Business Entity Masking
Re-identification of anonymized data happens when people (or other business entities) can be recognized based on the masked information stored in their respective datasets. While data anonymization aims to obscure or remove all Personally Identifiable Information (PII) in order to protect the privacy of the people associated with that data, it must be implemented correctly to avoid risk.
One of the main ways attackers can re-identify anonymized data is through linkage attacks, which cross-reference anonymized datasets with publicly available records. Another way is via inference attacks, which combine personal attributes, such as age, gender, or marital status, to infer the identity of the subjects.
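To make the linkage attack concrete, here is a minimal sketch in Python. All records, names, and field names (`zip`, `birth_date`, `sex`) are invented for illustration; it simply joins a name-free dataset to a public list on shared attributes and keeps only unambiguous matches.

```python
# Hypothetical "anonymized" health records: names removed, other attributes kept.
anonymized = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "M", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public records (for example, a voter roll) that include names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1945-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1962-01-15", "sex": "M"},
]

# Attributes present in both datasets (assumed for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(record):
    """Build the join key from the shared attributes."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def linkage_attack(anonymized_rows, public_rows):
    """Map each public name to an anonymized row when the match is unique."""
    index = {}
    for row in anonymized_rows:
        index.setdefault(key(row), []).append(row)

    matches = {}
    for person in public_rows:
        candidates = index.get(key(person), [])
        if len(candidates) == 1:  # exactly one candidate means re-identification
            matches[person["name"]] = candidates[0]
    return matches


reidentified = linkage_attack(anonymized, public)
```

In this toy data, only the person whose attribute combination is unique in the anonymized set is re-identified; the two records sharing the same combination stay ambiguous.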
Determined attackers are able to piece together the details they need to re-identify anonymized datasets with a very small amount of data. In fact, a 2019 study found that 99.98% of Americans could be correctly re-identified in any dataset using just 15 demographic attributes.
Data re-identification has become a growing concern as machine learning (ML) models become more adept at analyzing patterns in anonymized datasets, and as the overall amount of data on individuals accumulates on the web and in other publicly accessible places. As data mining and data linkage methods become more advanced, combining multiple datasets to perform re-identification will become even easier.
Re-identification of anonymized data can lead to privacy violations, identity theft, or other forms of malicious behavior. For example, if an insurance company were able to re-identify an individual in a healthcare dataset, it would have access to private medical information it could use to its advantage. Attackers can also use re-identified personal information to target individuals with scams or phishing attempts.
Organizations have a variety of options at their disposal for minimizing the risk of data re-identification. The first line of defense is often data minimization, which reduces the overall amount of individual data being collected and limits who can access it through access controls, data sharing agreements, and regular auditing and monitoring practices.
Such methods are based on the hypothesis that reducing the total amount of data collected, stored, and shared in an organization will better protect its users.
However, even if you limit access to specifically authorized individuals inside the organization, the data is still vulnerable to attacks from outside. Very little data is required to uniquely identify an individual: studies show that over 60% of the US population could be identified just by combining gender, date of birth, and zip code.
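One way to gauge this risk in your own data is to measure how many records are unique on a given combination of attributes. The following is a minimal sketch in Python; the function name `unique_fraction` and the sample rows are hypothetical, and a real assessment would run against the full dataset.

```python
from collections import Counter


def unique_fraction(rows, quasi_identifiers):
    """Return the share of rows whose attribute combination appears only once."""
    if not rows:
        return 0.0
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(rows)


# Invented sample rows, for illustration only.
rows = [
    {"sex": "F", "birth_date": "1980-03-02", "zip": "10001"},
    {"sex": "F", "birth_date": "1980-03-02", "zip": "10001"},
    {"sex": "M", "birth_date": "1975-11-20", "zip": "10001"},
    {"sex": "F", "birth_date": "1991-06-14", "zip": "94105"},
]

by_sex_only = unique_fraction(rows, ("sex",))
by_all_three = unique_fraction(rows, ("sex", "birth_date", "zip"))
```

In this sample, adding date of birth and zip code to gender raises the share of uniquely identifiable rows, which is the same effect the studies describe at population scale.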
It also matters what type of data is collected, because not all information is equally sensitive. There are different levels of identifiability. In decreasing order of identifiability, identifiers can be:
Direct, such as names and Social Security Numbers
Indirect, such as places of work and membership in organizations
Ambiguous, such as favorite restaurants and movies
There’s also data that can’t be linked to any single person, like aggregated census data or survey results, and data that doesn’t relate to individuals at all, like weather reports and geographic data. Of course, the more data is scrubbed of personal information, the less useful it becomes for research and analytics.
When it comes to data, there’s a constant tension between privacy and utility. Collecting less data may increase a user’s privacy, but it lessens an organization’s ability to equip its business with the data it needs. This means that enterprises must find sustainable, scalable ways to protect their users’ data, while also enabling different business domains to access and use data as needed.
Once data has been collected, organizations need to make it less vulnerable to hackers. They typically employ data masking tools – along with a portfolio of data masking techniques, including redaction, shuffling, scrambling, and synthetic data substitution – to make data less susceptible to attack.
Data masking software replaces real PII with fictitious, yet statistically equivalent, data, such as synthetic names, images, and contact information. Among the most secure data anonymization approaches, data masking is irreversible by design. Sensitive data is protected, while the masked data is still useful to business functions.
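As a rough illustration of three of these techniques (redaction, shuffling, and substitution), here is a minimal Python sketch. The function names and sample rows are hypothetical, and it is far simpler than a production masking tool, which would also have to preserve formats and relationships across tables.

```python
import random


def redact(value, visible=4, mask_char="*"):
    """Redaction: hide all but the last `visible` characters."""
    value = str(value)
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def shuffle_column(rows, column, rng):
    """Shuffling: reassign a column's real values across rows at random."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: v} for row, v in zip(rows, values)]


def substitute(rows, column, fake_values, rng):
    """Substitution: replace each real value with one from a synthetic list."""
    return [{**row, column: rng.choice(fake_values)} for row in rows]


# Invented sample rows, for illustration only.
rows = [
    {"name": "Jane Roe", "ssn": "123-45-6789", "salary": 52000},
    {"name": "John Doe", "ssn": "987-65-4321", "salary": 61000},
    {"name": "Ann Poe", "ssn": "555-12-3456", "salary": 48000},
]

rng = random.Random(0)  # seeded only so the sketch is repeatable
masked = [{**row, "ssn": redact(row["ssn"])} for row in rows]
masked = shuffle_column(masked, "salary", rng)
masked = substitute(masked, "name", ["Alex Sample", "Sam Placeholder"], rng)
```

Shuffling keeps the overall distribution of the column intact, because the same set of values is still present, while breaking the link between each value and its original row.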
One of the biggest risks when it comes to re-identification of anonymized data is that the entire database is vulnerable if one individual is exposed. When attackers can cross-reference specific data and match it to an individual, they’re often able to crack the entire database. Once they’ve figured out the masking algorithm, they can identify anyone in the database.
In order to avoid this, enterprises are now turning to entity-based data masking technology. With individually encrypted Micro-Databases™, authorized users can access data related to a specific business entity (such as a customer, payment, order, or device), but the entities are stored separately from one another, making it impossible for a mass data breach to occur.
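The general idea of isolating entities from one another can be sketched with per-entity keys. The Python example below is an assumption-laden illustration, not the vendor's implementation: the helpers `entity_key` and `mask_value`, the entity IDs, and the placeholder secret are all hypothetical. It derives a separate key for each entity and masks values with a keyed hash, so a mapping learned for one entity does not carry over to another.

```python
import hashlib
import hmac


def entity_key(master_secret: bytes, entity_id: str) -> bytes:
    """Derive a distinct key per business entity (hypothetical scheme)."""
    return hmac.new(master_secret, entity_id.encode("utf-8"), hashlib.sha256).digest()


def mask_value(key: bytes, value: str, length: int = 16) -> str:
    """Mask a value with a keyed, one-way hash truncated to `length` hex characters."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:length]


master = b"demo-only-secret"  # placeholder; a real secret belongs in a key vault

key_1 = entity_key(master, "customer-1001")
key_2 = entity_key(master, "customer-1002")

# The same input masks to different tokens under different entity keys,
# while staying consistent within a single entity.
token_a = mask_value(key_1, "jane@example.com")
token_b = mask_value(key_2, "jane@example.com")
token_a_again = mask_value(key_1, "jane@example.com")
```

Because each token depends on the entity's own key, exposing the masked-to-real mapping for one customer in this sketch reveals nothing directly about how another customer's values were masked.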
Additionally, entity-based data masking obscures unstructured data while maintaining relational consistency throughout the organization. Unstructured data, such as images, PDFs, receipts, and text files, is protected with static and dynamic data masking capabilities. For example, real photo IDs are replaced with fake ones, and digital versions of receipts, checks, and contracts can be synthetically generated for testing and analytics purposes.
The benefit of using a business entity approach to data masking is that even if one dataset were to be re-identified, the other business entities wouldn’t be affected. It’s a comprehensive solution for protecting sensitive information that uses data masking best practices, but also ensures the security of the data at large.