L Diversity reduces the risk of re-identification of sensitive data by ensuring that individual records in a dataset are not too similar to one another.
Table of Contents
What is L Diversity?
Backgrounder: What is K Anonymity?
How do K Anonymity and L Diversity Enhance Data Anonymization?
How does L Diversity Work?
L Diversity Challenges
L Diversity Based on Business Entities
An enhancement to the K Anonymity data masking model, the L Diversity extension was developed to reduce the granularity of data representation in a dataset. More specifically, L Diversity corrects some inherent weaknesses in the K Anonymity model. How? K Anonymity leverages generalization, suppression, and other techniques that map each specific record onto a minimum of “K minus 1” other records in the dataset. Yet protecting identities down to the level of K individuals cannot always protect the sensitive values that were masked, especially when those values are homogeneous within the dataset. To solve this, L Diversity promotes intra-group diversity of sensitive values as one of the key data masking best practices.
L Diversity ensures that an individual’s sensitive information cannot be singled out from that of at least L-1 other individuals in the same group of records – protecting sensitive attributes as well as general attributes. By way of example, in a dataset that contains sensitive attributes like prescription medicines or medical diagnoses, L Diversity ensures that each group of otherwise indistinguishable records spans at least L distinct sensitive values – thus protecting the identity of any specific individual.
Similar to K Anonymity, L Diversity can’t guarantee absolute privacy protection for anonymized data. And it’s much more difficult to implement than K Anonymity since identification and protection of sensitive attributes can only work if there are at least L distinct values for each sensitive attribute in the dataset.
In use since the late 1990s, K Anonymity is one of the key data anonymization techniques. It uses data generalization, data masking, or pseudonymization to protect Personally Identifiable Information (PII) in a dataset – to ensure that no single individual can be identified. Under the K Anonymity model, a dataset is anonymized if every record shares its quasi-identifying attributes with at least “K minus 1” other records.
What is the “K” in K Anonymity? It refers to a variable, similar to the “X” used in high school algebra. This K, however, refers to the number of times a combination of values appears in a dataset. So, if K=2, the data points have been masked so that there are at least two sets of every combination of data. This means, for example, that if a given dataset contains addresses and ages for a group of people, these attributes would need to be anonymized so that each address and age pair appears at least twice.
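To make this concrete, here is a minimal Python sketch of a K Anonymity check, assuming a toy dataset whose addresses have already been generalized to a city and whose ages have been generalized to 10-year bands. The field names and the K value are illustrative assumptions, not taken from any particular tool.

from collections import Counter

K = 2  # every quasi-identifier combination must appear at least K times

# Toy dataset: quasi-identifiers after generalization
# (street address generalized to city, exact age generalized to a 10-year band)
records = [
    {"city": "Springfield", "age_band": "30-39"},
    {"city": "Springfield", "age_band": "30-39"},
    {"city": "Shelbyville", "age_band": "40-49"},
    {"city": "Shelbyville", "age_band": "40-49"},
]

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

print(is_k_anonymous(records, ["city", "age_band"], K))  # True: each pair appears twice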
K Anonymity works on the principle that if you combine data with similar attributes, you can obscure identifying information about any individual contributing to that data. It’s basically the ability to disappear in a crowd – since a sensitive data attribute masked using K Anonymity could actually correspond to any single individual in the pooled dataset.
K Anonymity and L Diversity are important components of data anonymization tools, increasingly used by enterprises to:
Limit regulatory liability
Regulatory fines and penalties following breaches of privacy cannot be taken lightly, and data anonymization, enhanced with K Anonymity and L Diversity, helps meet tough compliance demands.
Achieve data consistency, governance, and digital transformation
Data anonymized with K Anonymity and L Diversity is cleaner and more compliant. This enables organizations to drive digital transformation, enjoy data-driven decision-making, and leverage big data analytics – all without compromising on privacy.
Reduce risk
Breaches don’t just compromise sensitive data, they affect market share and public trust. Masking data, with the help of K Anonymity and L Diversity, protects against the potential loss of faith in an organization following a breach, which can significantly damage both brand equity and revenue.
Mitigate insider threats
The anonymization of data, facilitated by K Anonymity and L Diversity, guards against accidental or malicious data mishaps, data misuse, or data exploitation by partners, employees, or third parties.
L Diversity addresses the inherent deficiencies in the K Anonymity model – notably mitigating the risk of a homogeneity attack or a background knowledge attack – by introducing further entropy (or diversity) into a dataset. The result is a significantly reduced risk of re-identification of anonymized data.
For example, if all users who share a similar Quasi-Identifier (QI) tuple (sequence of elements) fall into the same group, and all of the sensitive values within that group are identical, PII could be exposed in the event of a breach. L Diversity imposes diversity on the sensitive values associated with a more generalized tuple, enhancing privacy by making sure that each group contains at least L distinct sensitive values, corresponding to at least L users.
To drill down, let’s look at a sensitive data record that comprises 3 types of data – ID, key (quasi-identifier) attributes, and confidential outcome attributes. L Diversity extends the equivalence classes created using K Anonymity – formed by generalizing and masking the QI groupings – to the confidential attributes in the record, too. It requires that, for each QI grouping, at most 1/L of its tuples have an identical sensitive attribute value.
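As an illustration of this rule, here is a minimal Python sketch that groups records into equivalence classes by their quasi-identifiers and verifies both conditions: at least L distinct sensitive values per group, and no single value accounting for more than 1/L of the group’s tuples. The field names, the diagnosis attribute, and the L value are assumptions made for the example.

from collections import Counter, defaultdict

L = 2  # each equivalence class must carry at least L distinct sensitive values

records = [
    {"city": "Springfield", "age_band": "30-39", "diagnosis": "asthma"},
    {"city": "Springfield", "age_band": "30-39", "diagnosis": "diabetes"},
    {"city": "Shelbyville", "age_band": "40-49", "diagnosis": "flu"},
    {"city": "Shelbyville", "age_band": "40-49", "diagnosis": "asthma"},
]

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every QI group holds >= l distinct sensitive values and
    no single value exceeds 1/l of the group's tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row[sensitive])
    for values in groups.values():
        counts = Counter(values)
        if len(counts) < l:                      # fewer than L distinct values
            return False
        if max(counts.values()) > len(values) / l:  # one value dominates the group
            return False
    return True

print(is_l_diverse(records, ["city", "age_band"], "diagnosis", L))  # True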
L Diversity is not immune to attacks that end up disclosing sensitive information or adversely affecting the organization. Two possible threat vectors are:
Skewing attack
Even if a dataset satisfies L Diversity, an unwelcome intruder can sometimes link an individual in a group to an attribute that occurs with an unusually high probability within that group. In this case, L Diversity can actually result in information loss, since many non-sensitive records may have been dropped from the dataset to meet the diversity requirements. This can end up skewing data-driven analysis, causing an organization to err in the attacker’s favor.
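The sketch below illustrates the intuition, assuming a hypothetical equivalence class that satisfies 2-diversity for a test-result attribute while being heavily skewed relative to an assumed 1% population baseline. The values and field semantics are illustrative only.

# A group that satisfies 2-diversity (two distinct values, neither over half the group),
# yet linking a person to it reveals far more than the population baseline would.
group = ["positive"] * 25 + ["negative"] * 25   # 50% positive within the group
population_rate = 0.01                          # assumed 1% baseline across the population

group_rate = group.count("positive") / len(group)
print(f"In-group probability: {group_rate:.0%} vs. baseline {population_rate:.0%}")
print(f"Relative risk of inference: {group_rate / population_rate:.0f}x")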
Similarity attack
If sensitive attribute values are distinct, yet semantically similar, an attacker can still learn important information. For example, if a confidential attribute is numerical and the values within a group are L-diverse yet close together – an intruder can accurately estimate the confidential attribute value for an individual in that group.
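A small sketch of the idea, using hypothetical salary values: the group below is 3-diverse, yet the values sit so close together that an intruder can pin any member’s salary to a narrow band.

# An equivalence class with three distinct salaries (so L Diversity holds for L=3),
# but the spread is so small that the "estimate" is still highly revealing.
group_salaries = [30_000, 31_000, 32_000]

low, high = min(group_salaries), max(group_salaries)
print(f"Distinct values: {len(set(group_salaries))}")         # 3 -> L Diversity satisfied
print(f"Attacker's estimate: between {low:,} and {high:,}")   # a very tight range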
L Diversity can benefit from entity-based data masking technology, which enables faster and more efficient data anonymization. The business entity approach integrates and organizes diverse data from multiple sources according to data schemas, where each schema corresponds to a business entity.
A business entity, in this case, is any entity relevant to the business: a customer, service plan, invoice, and more. Entity-based data anonymization manages the data for each business entity in its own encrypted Micro-Database™, which is either stored or cached in memory. By enabling data anonymization in flight, entity-based data masking tools dramatically reduce time to value and total cost of ownership.