Data anonymisation and pseudonymisation are the two primary technical approaches for reducing personal data protection risk under GDPR, but they have fundamentally different legal consequences that data engineers and privacy counsel must understand. Anonymised data is outside GDPR scope entirely; pseudonymised data remains personal data subject to GDPR's full requirements. The distinction determines whether you need a legal basis to process the data, whether data subject rights apply, and whether cross-border transfer restrictions apply. This guide clarifies the difference and the practical implementation of each approach.
The Legal Distinction
GDPR Article 4 and Recital 26: The Key Distinction
Anonymised data: information that cannot be attributed to an identified or identifiable natural person, either directly or by combining it with other information reasonably available. Genuinely anonymised data is outside GDPR scope: no legal basis is required, data subject rights don't apply, and transfer restrictions don't apply. The key test in Recital 26 takes into account "all the means reasonably likely to be used" to identify the person. If any plausible combination of information could re-identify individuals, the data is NOT anonymised. The bar is high; most "anonymised" datasets are not truly anonymous.
Pseudonymised data: data where direct identifiers are replaced with pseudonyms, but re-identification is possible with additional information. Pseudonymised data remains personal data, so all GDPR obligations apply. The benefit: pseudonymisation is a recognised security measure that reduces risk and may affect the proportionality of regulatory responses to breaches.
Anonymisation and Pseudonymisation Techniques
| Technique | Type | Re-identification Risk | Data Utility |
|---|---|---|---|
| Data masking (field replacement) | Pseudonymisation | Medium: original can be reconstructed with key | High: realistic-looking data |
| Tokenisation | Pseudonymisation | Low if key is secure | High: token is a consistent reference |
| k-anonymity | Anonymisation (approach) | Medium: linkage attacks possible | Medium: some generalisation required |
| Differential privacy | Anonymisation (method) | Low: mathematical privacy guarantee | Medium: noise reduces precision |
| Data aggregation | Anonymisation if k ≥ 5 | Low if properly aggregated | Medium: individual-level data lost |
| Synthetic data generation | Anonymisation (if properly generated) | Low if DCR (distance to closest record) is high | High: retains statistical properties |
k-anonymity
k-anonymity requires every individual in a dataset to be indistinguishable from at least k-1 other individuals based on quasi-identifiers (such as age group, postal code, and gender). k=5 is a common minimum practical threshold, with k=10 or higher for sensitive datasets. Limitation: k-anonymity is still vulnerable to linkage attacks with external datasets, and to attribute disclosure when everyone in a group shares the same sensitive value. l-diversity and t-closeness are stronger extensions that target the second problem.
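To make the definition concrete, here is a minimal sketch of measuring a dataset's k in plain Python. The function name `k_anonymity`, the field names, and the sample records are all hypothetical and chosen for illustration; a dedicated tool would normally be used on real data.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the dataset's k: the size of its smallest equivalence class."""
    if not records:
        raise ValueError("records must not be empty")
    # Group records by their combination of quasi-identifier values.
    class_sizes = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return min(class_sizes.values())


# Hypothetical, already-generalised records (age bands, truncated postcodes).
records = [
    {"age_group": "30-39", "postcode": "SW1", "gender": "F", "diagnosis": "asthma"},
    {"age_group": "30-39", "postcode": "SW1", "gender": "F", "diagnosis": "diabetes"},
    {"age_group": "30-39", "postcode": "SW1", "gender": "F", "diagnosis": "asthma"},
    {"age_group": "40-49", "postcode": "N1", "gender": "M", "diagnosis": "flu"},
    {"age_group": "40-49", "postcode": "N1", "gender": "M", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age_group", "postcode", "gender"])
print(k)  # 2: the smallest group (40-49, N1, M) has two members
```

Note that the second group satisfies k=2 yet every member has the same diagnosis, so knowing someone is in that group reveals the sensitive value anyway. That is the kind of gap l-diversity is meant to close.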
Differential privacy
Often described as the gold standard for mathematical anonymisation: calibrated noise added to query results provides a formal privacy guarantee (epsilon-differential privacy). Used by Apple (iOS telemetry), Google (Chrome histograms), and the US Census Bureau. Unlike the best-effort techniques in the table above, it provides a mathematical proof of privacy protection, with the strength of the guarantee set by epsilon.
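For readers who want the guarantee stated precisely, the standard definition can be written as follows. The symbols are notation introduced here, not taken from the text above: M is the randomised mechanism, D and D' are any two datasets differing in one individual's record, S is any set of possible outputs, and f is the query being answered.

```latex
% epsilon-differential privacy: one person's record changes the
% probability of any output by at most a factor of e^epsilon.
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S]

% Laplace mechanism: noise scale is the query's sensitivity divided by epsilon.
\mathcal{M}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),
\qquad
\Delta f = \max_{D, D'} \lvert f(D) - f(D') \rvert
```

A smaller epsilon means the two probabilities must be closer together, which requires more noise. For a simple count, adding or removing one person changes the result by at most 1, so the sensitivity is 1.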
Re-identification
The primary anonymisation risk. Latanya Sweeney's landmark research estimated that 87% of Americans are uniquely identifiable by 5-digit ZIP code, gender, and date of birth. "Anonymised" datasets with quasi-identifiers routinely fail in practice when combined with publicly available data. Test re-identification risk before declaring data anonymous.
Pseudonymisation for Analytics
Pseudonymisation for analytics pipelines: replace direct identifiers (name, email, national ID) with consistent tokens (a SHA-256 hash of the identifier plus a secret salt, preferably computed as a keyed HMAC, or a UUID mapped in a secure key store). The token is consistent across systems, so you can join customer behaviour data by token without exposing the underlying identifier. Key management: store the secret key or the identifier-to-token mapping in a hardware security module (HSM) or KMS with restricted access. Separation of duties: the analytics team never has access to the key; only the key-management team can de-pseudonymise for legitimate purposes. This architecture supports GDPR's recognition of pseudonymisation as a security measure (Article 32) while maintaining analytical utility. The tokenised data is still personal data.
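The tokenisation step might look like the sketch below. It uses HMAC-SHA-256 from the Python standard library rather than a plain salted hash, which is a substitution made here because HMAC is the standard construction for keyed hashing. The function name `pseudonymise`, the `customer_token` field, the sample rows, and the hard-coded key are illustrative assumptions; a real pipeline would fetch the key from a KMS or HSM.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a deterministic token for an identifier using HMAC-SHA-256."""
    # Normalise so trivially different spellings map to the same token.
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


# Placeholder key for illustration only; never hard-code a real key.
SECRET_KEY = b"replace-with-key-from-kms"

# Hypothetical rows from two source systems that refer to the same customer.
orders = [{"email": "Ada@example.com", "total": 40}]
clicks = [{"email": "ada@example.com ", "page": "/pricing"}]

# Replace the direct identifier with the token before loading into analytics.
orders_tok = [
    {"customer_token": pseudonymise(o["email"], SECRET_KEY), "total": o["total"]}
    for o in orders
]
clicks_tok = [
    {"customer_token": pseudonymise(c["email"], SECRET_KEY), "page": c["page"]}
    for c in clicks
]

# The two systems can now be joined on the token alone.
print(orders_tok[0]["customer_token"] == clicks_tok[0]["customer_token"])  # True
```

Because the token is deterministic, anyone holding the key can recompute it from a known email address, which is why access to the key has to be restricted as tightly as a lookup table would be.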
Differential Privacy for Aggregate Reporting
Implement differential privacy for aggregate statistics published externally: add calibrated Laplace or Gaussian noise to query results so that no individual's data has a significant impact on the output. Python implementation with diffprivlib: from diffprivlib import mechanisms; mech = mechanisms.Laplace(epsilon=0.5, sensitivity=1); dp_count = mech.randomise(true_count). Google's DP library and Apple's Swift DP tools provide production-ready implementations. Use case: publish aggregate statistics (average income by age bracket, customer conversion rate by region) with a formal DP guarantee, giving a defensible position that the output reveals very little about any individual record. Epsilon must be calibrated: smaller values give stronger privacy but noisier results, and repeated queries consume a cumulative privacy budget. Consult privacy counsel on an appropriate epsilon for your use case.
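To show what the mechanism does internally, here is a minimal standard-library sketch of a Laplace-noised count. The names `laplace_noise` and `dp_count` and the example figures are hypothetical. This is a teaching sketch only: naive floating-point sampling like this has known weaknesses, so a maintained DP library should be used for anything published.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace distribution centred on 0."""
    # The difference of two i.i.d. exponential samples is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count: int, epsilon: float, rng: random.Random,
             sensitivity: float = 1.0) -> float:
    """Return a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


# Fixed seed only so the demo is repeatable; do not seed in production.
rng = random.Random(42)

# Hypothetical query: number of customers who converted in one region.
noisy = dp_count(true_count=1200, epsilon=0.5, rng=rng)
print(round(noisy, 1))
```

With epsilon=0.5 and sensitivity 1 the noise scale is 2, so a typical published value lands within a few units of the true count. Halving epsilon doubles the scale.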
Re-identification Risk Testing
Before declaring a dataset anonymised, test re-identification risk: (1) Identify the quasi-identifiers in the dataset, meaning combinations of age, gender, location, and temporal data that could identify individuals; (2) Apply a k-anonymity check with the ARX tool (arx.deidentifier.org) to determine the current k-value of the dataset; (3) Test the dataset against realistic external datasets (public records, social media) that could be combined with it in linkage attacks; (4) Document the re-identification risk assessment and the controls applied; (5) Obtain a legal opinion from privacy counsel confirming the anonymisation conclusion. Never declare a dataset anonymous without this assessment: GDPR enforcement has penalised organisations that incorrectly classified personal data as anonymous.
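Step (3) can be approximated with a simulated linkage attack: join the candidate release against an external dataset on the shared quasi-identifiers and count how many released records match exactly one named person. The sketch below assumes hypothetical function names, field names, and toy data; it is not a substitute for a full assessment.

```python
from collections import defaultdict


def linkage_matches(released, external, quasi_identifiers, id_field="name"):
    """Map each released record's index to the external identities it matches."""
    index = defaultdict(set)
    for person in external:
        key = tuple(person[qi] for qi in quasi_identifiers)
        index[key].add(person[id_field])
    return {
        i: index.get(tuple(record[qi] for qi in quasi_identifiers), set())
        for i, record in enumerate(released)
    }


def unique_reidentification_rate(released, external, quasi_identifiers):
    """Fraction of released records that link to exactly one external identity."""
    if not released:
        raise ValueError("released must not be empty")
    matches = linkage_matches(released, external, quasi_identifiers)
    unique = sum(1 for names in matches.values() if len(names) == 1)
    return unique / len(released)


QUASI_IDENTIFIERS = ["zip", "gender", "birth_year"]

# Hypothetical "anonymised" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "02138", "gender": "F", "birth_year": 1965, "diagnosis": "asthma"},
    {"zip": "02138", "gender": "M", "birth_year": 1970, "diagnosis": "flu"},
    {"zip": "02139", "gender": "F", "birth_year": 1980, "diagnosis": "diabetes"},
    {"zip": "02139", "gender": "F", "birth_year": 1980, "diagnosis": "asthma"},
]

# Hypothetical external dataset, for example a public register.
external = [
    {"name": "Person A", "zip": "02138", "gender": "F", "birth_year": 1965},
    {"name": "Person B", "zip": "02138", "gender": "M", "birth_year": 1970},
    {"name": "Person C", "zip": "02138", "gender": "M", "birth_year": 1970},
    {"name": "Person D", "zip": "02140", "gender": "F", "birth_year": 1990},
]

rate = unique_reidentification_rate(released, external, QUASI_IDENTIFIERS)
print(rate)  # 0.25: only the first record links to a single person
```

A non-zero rate against even one plausible external source is a strong sign that the dataset should be treated as pseudonymised rather than anonymised.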
Data Anonymisation Architecture
Production data anonymisation architecture for analytics environments: (1) Raw personal data in GDPR-compliant production database with full access controls; (2) ETL pipeline applies pseudonymisation (tokenisation of identifiers) and writes to the analytics database; (3) Analytics database contains only pseudonymised data; (4) Further aggregation layer applies k-anonymity or DP for any data shared externally; (5) Synthetic data generation for development/testing environments (Gretel.ai, MOSTLY AI). This architecture ensures: no direct identifiers in analytics systems (the pseudonymised data there is still personal data under GDPR), formal or threshold-based privacy protection for external sharing, and development databases that contain no real personal data. Connect to your data platform via standard ETL tooling.
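As one illustration of the aggregation layer in step (4), the sketch below groups pseudonymised rows and suppresses any group smaller than a minimum cell size before release. The function name, field names, sample rows, and the default threshold of 5 (mirroring the k ≥ 5 rule of thumb for aggregation) are assumptions made for this example.

```python
from collections import defaultdict

MIN_CELL_SIZE = 5  # assumed threshold, mirroring the k >= 5 rule of thumb


def aggregate_with_suppression(rows, group_field, value_field,
                               min_cell_size=MIN_CELL_SIZE):
    """Aggregate values per group and drop groups below the minimum cell size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])

    report = {}
    for group, values in groups.items():
        if len(values) < min_cell_size:
            continue  # suppress small cells that could single out individuals
        report[group] = {"count": len(values), "mean": sum(values) / len(values)}
    return report


# Hypothetical pseudonymised analytics rows (tokens shortened for readability).
rows = [
    {"customer_token": "t1", "region": "north", "spend": 10},
    {"customer_token": "t2", "region": "north", "spend": 20},
    {"customer_token": "t3", "region": "north", "spend": 30},
    {"customer_token": "t4", "region": "north", "spend": 40},
    {"customer_token": "t5", "region": "north", "spend": 50},
    {"customer_token": "t6", "region": "south", "spend": 70},
    {"customer_token": "t7", "region": "south", "spend": 90},
]

report = aggregate_with_suppression(rows, "region", "spend")
print(report)  # {'north': {'count': 5, 'mean': 30.0}}
```

The "south" group has only two customers, so it is withheld from the external report. Suppression alone gives no formal guarantee, which is why the architecture pairs it with differential privacy where stronger assurance is needed.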