The most expensive mistake in document privacy is treating "redacted" and "anonymous" as synonyms. They are not. Legally they sit either side of a line that decides whether the GDPR still applies to the file. Technically they sit either side of a more interesting line, one that the privacy research literature has been mapping for two decades. This piece walks that map.
The audience is people who already understand the legal frame and want the technical picture: data scientists, ML engineers, DPOs with statistics backgrounds, and the architects of redaction pipelines that need to do more than satisfy a checklist.
The reference event
In 2000, Latanya Sweeney published a short paper at Carnegie Mellon that has cast a long shadow over data anonymisation. Using 1990 US census data, she showed that 87% of the US population could be uniquely identified by the combination of 5-digit ZIP, gender, and full date of birth.[1] A follow-up by Golle in 2006, using 2000 census data and a methodological refinement, put the figure closer to 63%, which is still a high number for what most data publishers would call "anonymised demographics."
The point of the Sweeney result was not the precise percentage. It was the structural insight: identifiers are not only the things we label as identifiers. Combinations of innocuous fields can identify uniquely. The field of anonymisation research has spent the years since formalising that insight into a set of definitions, each correcting weaknesses in the last.
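The measurement behind results of this kind can be sketched in a few lines: count how many records are the only ones carrying their particular combination of fields. This is a minimal illustration, not Sweeney's method; the `unique_fraction` helper, the field names, and the four toy records are all hypothetical.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Hypothetical toy records (not real data): no single field is unique,
# but some combinations are.
people = [
    {"zip": "15213", "gender": "F", "dob": "1961-07-31"},
    {"zip": "15213", "gender": "M", "dob": "1961-07-31"},
    {"zip": "15217", "gender": "F", "dob": "1975-02-14"},
    {"zip": "15217", "gender": "M", "dob": "1975-02-14"},
]

print(unique_fraction(people, ["zip"]))            # 0.0
print(unique_fraction(people, ["gender"]))         # 0.0
print(unique_fraction(people, ["zip", "gender"]))  # 1.0
```

Each field alone leaves every person hidden in a pair; two fields together single out everyone.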
k-anonymity: the first formal definition
Sweeney's 2002 paper introduced k-anonymity.[3] A dataset is k-anonymous with respect to a set of quasi-identifiers if each record is indistinguishable from at least k-1 other records on those quasi-identifiers. Concretely, if the quasi-identifiers are {ZIP, age band, gender}, then in a 5-anonymous dataset every combination of ZIP, age band, and gender that appears is shared by at least five records.
The intuition is straightforward: an adversary who knows a target's quasi-identifiers can narrow the target down to a group of at least k indistinguishable records. The bigger k is, the weaker the inference.
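Checking the property is mechanical: group the records by their quasi-identifier values and take the size of the smallest group. The sketch below assumes records are plain dictionaries; the function name and the toy table are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Largest k for which `records` is k-anonymous on `quasi_identifiers`.

    This is the size of the smallest group of records that share every
    quasi-identifier value.
    """
    if not records:
        raise ValueError("k-anonymity is undefined for an empty dataset")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


QI = ("zip", "age_band", "gender")

# Hypothetical toy table: two groups of five records each.
table = (
    [{"zip": "152**", "age_band": "30-39", "gender": "F"}] * 5
    + [{"zip": "152**", "age_band": "40-49", "gender": "M"}] * 5
)
outlier = {"zip": "902**", "age_band": "80-89", "gender": "M"}

print(k_anonymity(table, QI))              # 5
print(k_anonymity(table + [outlier], QI))  # 1
```

A single outlier record drops the whole release to k = 1, which is why suppression is usually part of the process.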
k-anonymity is achieved by generalisation (broadening age bands, truncating ZIPs to three digits) and suppression (removing outlier records). The cost is loss of analytical resolution. The benefit is a property of the released dataset that can be checked mechanically before publication.
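A minimal sketch of the two operations, assuming ten-year age bands and three-digit ZIP prefixes as the generalisation rules. Real tools search over many possible generalisations to limit information loss; this one applies a single fixed rule. The function names, field names, and records are hypothetical.

```python
from collections import Counter

QI = ("zip", "age", "gender")


def generalise(record):
    """Coarsen quasi-identifiers: 3-digit ZIP prefix, 10-year age band."""
    low = (record["age"] // 10) * 10
    return {
        "zip": record["zip"][:3] + "**",
        "age": f"{low}-{low + 9}",
        "gender": record["gender"],
        "diagnosis": record["diagnosis"],
    }


def anonymise(records, k):
    """Generalise every record, then suppress groups smaller than k."""
    coarse = [generalise(r) for r in records]
    sizes = Counter(tuple(r[q] for q in QI) for r in coarse)
    return [r for r in coarse if sizes[tuple(r[q] for q in QI)] >= k]


# Hypothetical raw records.
raw = [
    {"zip": "15213", "age": 34, "gender": "F", "diagnosis": "asthma"},
    {"zip": "15217", "age": 38, "gender": "F", "diagnosis": "diabetes"},
    {"zip": "15232", "age": 31, "gender": "F", "diagnosis": "flu"},
    {"zip": "90210", "age": 82, "gender": "M", "diagnosis": "flu"},
]

released = anonymise(raw, k=3)
print(len(released))  # 3: the lone 80-89 record is suppressed
```

The released rows no longer distinguish ages 31, 34, and 38, which is exactly the loss of resolution described above.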
Where k-anonymity fails
Two attacks broke k-anonymity's claim to be sufficient.
First, the homogeneity attack. If a group of k indistinguishable records all share a sensitive attribute, the adversary learns that attribute regardless of which specific record is the target. Five records that all share "HIV positive" tell the adversary the target is HIV positive, even though they cannot identify which row.
Second, the background knowledge attack. An adversary with side information can rule out values within a group. If a 5-group has four "diabetes" records and one "tuberculosis" record, and the adversary knows the target does not have diabetes, identification collapses to a single record.
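Both attacks reduce to the same computation: the set of sensitive values that remain possible for the target once the adversary has located the target's group. A sketch, with a hypothetical `candidate_values` helper and the two groups from the examples above:

```python
def candidate_values(group, ruled_out=()):
    """Sensitive values the adversary still considers possible for the target."""
    return {r["diagnosis"] for r in group} - set(ruled_out)


# Homogeneity: all five records share the sensitive value.
homogeneous = [{"diagnosis": "HIV positive"}] * 5

# Background knowledge: the adversary knows the target does not have diabetes.
mixed = [{"diagnosis": "diabetes"}] * 4 + [{"diagnosis": "tuberculosis"}]

print(candidate_values(homogeneous))                   # {'HIV positive'}
print(candidate_values(mixed, ruled_out=["diabetes"])) # {'tuberculosis'}
```

In both cases the group is 5-anonymous and the adversary is left with a single candidate value.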
l-diversity: making the sensitive values heterogeneous
Machanavajjhala et al. proposed l-diversity in 2007 to address the homogeneity attack.[4] A dataset is l-diverse if each k-anonymous group has at least l "well-represented" values for every sensitive attribute. Different definitions of "well-represented" give different variants (distinct l-diversity, entropy l-diversity, recursive (c, l)-diversity).
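Two of the variants are easy to compute for a single group. The sketch below assumes one sensitive attribute per record and uses the entropy criterion in the form "entropy of the group is at least log(l)", so the exponential of the entropy can be read as an effective l. The function names are hypothetical.

```python
import math
from collections import Counter


def distinct_l(values):
    """Distinct l-diversity: number of different sensitive values in the group."""
    return len(set(values))


def entropy_l(values):
    """Effective l under entropy l-diversity: exp(entropy of the group).

    The group is entropy l-diverse when this is at least l.
    """
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return math.exp(entropy)


group = ["diabetes"] * 4 + ["tuberculosis"]

print(distinct_l(group))           # 2
print(round(entropy_l(group), 2))  # 1.65
```

The group passes distinct 2-diversity but fails entropy 2-diversity, because one value dominates it.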
l-diversity is stronger than k-anonymity. It is also harder to achieve while preserving utility, and its semantics depend on what counts as a sensitive attribute, which is context-specific.
Where l-diversity also fails
l-diversity does not control how similar the diverse values are to each other, nor how far a group's distribution departs from the population's. A 5-group with five different diabetes-related diagnoses is l-diverse by the distinct count, but the adversary still learns that the target has a diabetes-related diagnosis (the similarity attack). The skewness attack exploits the second gap: if the overall population has 1% prevalence of a condition but the target's group has 50%, the adversary learns that the target is far more likely than average to have the condition, without picking the row.
t-closeness: matching the population distribution
Li et al. proposed t-closeness in 2007 as a refinement.[5] A k-anonymous group has t-closeness if the distribution of sensitive attribute values inside the group is within distance t of the distribution in the overall population, under a chosen distance metric (Earth Mover's Distance is the standard).
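A sketch of the check for a categorical sensitive attribute. It assumes the simplest ground distance, where any two distinct values are at distance 1; under that assumption the Earth Mover's Distance reduces to half the L1 distance between the two distributions. Numeric or hierarchical attributes need a different ground distance. All names and data are hypothetical.

```python
from collections import Counter


def distribution(values, domain):
    """Relative frequency of each domain value, in domain order."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]


def emd_equal_distance(p, q):
    """EMD when all distinct values are equally far apart: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def satisfies_t_closeness(groups, t):
    """True if every group's distribution is within t of the overall one."""
    everyone = [v for g in groups for v in g]
    domain = sorted(set(everyone))
    overall = distribution(everyone, domain)
    return all(
        emd_equal_distance(distribution(g, domain), overall) <= t for g in groups
    )


# Hypothetical groups: overall the split is 50/50, each group is 75/25.
groups = [
    ["flu", "flu", "flu", "cancer"],
    ["flu", "cancer", "cancer", "cancer"],
]

print(satisfies_t_closeness(groups, t=0.3))  # True  (distance is 0.25)
print(satisfies_t_closeness(groups, t=0.2))  # False
```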
t-closeness is the strongest of the syntactic anonymisation definitions. It directly addresses the skewness attack. The cost is severe utility loss for highly skewed sensitive attributes, which is most of the interesting ones.
The structural problem with syntactic definitions
k-anonymity, l-diversity, and t-closeness all share a common shape: they define what a published dataset must look like and assume the adversary's background knowledge is bounded by what the data publisher anticipates. Real adversaries do not respect those assumptions. New auxiliary datasets become available; new linkage techniques emerge; the publisher's threat model goes stale.
The 2019 Nature Communications paper by Rocher, Hendrickx and de Montjoye made this concrete.[2] They built a generative model that estimates how likely it is that a record matching a target in an incomplete, heavily sampled dataset really belongs to that target. Using it, they found that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. The implication: standard syntactic anonymisation, even when applied competently, leaves residual re-identifiability under modern techniques.
Differential privacy: a different shape
Differential privacy, formalised by Dwork and colleagues starting in 2006 and consolidated in Dwork and Roth's 2014 textbook[6], takes a fundamentally different approach. Rather than defining a property of the output dataset, it defines a property of the release mechanism.
A randomised algorithm is ε-differentially private if, for any two datasets differing in a single record, the probability of any output of the algorithm changes by at most a factor of exp(ε). The ε parameter quantifies the privacy budget; smaller ε means stronger privacy.
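Written out, the condition is Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S] for every set of outputs S and every pair of datasets D, D′ differing in one record. The standard way to meet it for a numeric query is the Laplace mechanism: add Laplace noise with scale equal to the query's sensitivity divided by ε. The sketch below applies it to a counting query, whose sensitivity is 1. It is illustrative only: the function names are hypothetical, and naive floating-point noise sampling has known weaknesses, so a vetted library is the safer choice in production.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Epsilon-DP count of the records that satisfy `predicate`.

    Adding or removing one record changes the true count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


# Hypothetical data: 100 ages, 40 of them over 65.
ages = [70] * 40 + [30] * 60
rng = random.Random(0)  # seeded only to make the sketch repeatable

print(private_count(ages, lambda a: a > 65, epsilon=1.0, rng=rng))
print(private_count(ages, lambda a: a > 65, epsilon=0.1, rng=rng))
```

The second call uses a smaller ε, so its noise scale is ten times larger and the answer will typically sit further from 40.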
The structural advantage is composability. If two analyses each consume ε₁ and ε₂ of the budget, the combination consumes at most ε₁ + ε₂. Background knowledge of the adversary is not a free variable; the guarantee holds against any adversary. The privacy claim is auditable in a way the syntactic definitions are not.
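Sequential composition is what makes a privacy budget something a pipeline can enforce in code. A minimal accountant, assuming basic composition (the ε values simply add up) and a hypothetical `PrivacyBudget` class:

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record an analysis costing `epsilon`; refuse it if over budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # tolerance for float error
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent


budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first analysis
budget.charge(0.5)  # second analysis; 0.9 spent in total

try:
    budget.charge(0.2)  # would bring the total to 1.1
except RuntimeError as err:
    print(err)  # privacy budget exhausted
```

Tighter composition theorems exist and let the same budget stretch further; simple addition is the conservative baseline.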
The structural cost is utility. To get strong privacy on fine-grained queries you must add enough noise to wash out the signal. Aggregate statistics and many machine-learning training scenarios tolerate this well. Document-level redaction does not; differential privacy is not the right tool for "redact this PDF and hand it back unchanged."
Where each tool fits in a document pipeline
For a redaction pipeline operating on documents that will be shared with specific recipients, the calculus changes by use case.
Use case: regulatory disclosure
Documents go to a regulator, court, or specific named third party under contract. The threat model is well-defined: the recipient is known and bound. Pseudonymisation by careful redaction of direct identifiers plus quasi-identifiers, plus contractual restrictions on re-identification attempts, is usually sufficient. The GDPR treats the output as personal data; that is fine. The recipient has the lawful basis to receive it.
Use case: public release of redacted document
Documents go to the public (FOI requests, court filings, journalism). The threat model is unbounded. WP216's three-lens test (singling out, linkability, inference)[7] applies, and most pipelines do not pass it for non-trivial documents. The honest options are: release with the legal acceptance that the document is pseudonymised personal data and that the GDPR continues to apply, or do not release.
The temptation in this use case is to over-redact. Replacing every potentially-quasi-identifying phrase produces an uninformative document and still does not achieve formal anonymisation. The structural problem is the same as with k-anonymity: you are guessing the adversary's background knowledge.
Use case: training data for AI
You want to train or fine-tune a model on documents containing personal data without retaining individual-level information in the model. Differential privacy is the right framework here: train with a differentially private optimiser (DP-SGD), accept the utility cost, and ship a model with a quantified ε.
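The core of DP-SGD is a change to the gradient step: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to that norm, then average. The pure-Python sketch below shows only that step, under the assumption that per-example gradients are already available as lists of floats. It does not compute the resulting ε, which needs a separate privacy accountant, and the function and parameter names are hypothetical rather than any library's API.

```python
import math
import random


def clip(grad, max_norm):
    """Scale `grad` down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy gradient step: clip per example, sum, add noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    noisy_mean = []
    for j in range(len(weights)):
        total = sum(g[j] for g in clipped)
        # Gaussian noise with standard deviation noise_multiplier * max_norm
        total += rng.gauss(0.0, noise_multiplier * max_norm)
        noisy_mean.append(total / n)
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)  # seeded only to make the sketch repeatable
weights = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients

print(dp_sgd_step(weights, grads, lr=0.1, max_norm=1.0,
                  noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any one document can move the model; the noise hides what remains of its influence. Both are the source of the utility cost.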
This is the pattern behind the quantified privacy claims that major model providers make about training data. It is still rare in practice because DP training is slower, more memory-hungry, and produces models that lag their non-private counterparts on benchmark metrics. For sensitive document corpora, the trade-off is often justified.
Use case: statistical analysis of document corpora
You want to publish counts, distributions, or aggregate metrics derived from a sensitive corpus without releasing the documents themselves. Apply differential privacy to the published output, not to the documents. This is what the US Census Bureau did with its 2020 census, and what many statistical agencies and academic groups now do for sensitive-population work.
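For corpus statistics the usual building block is a noisy histogram. The sketch below assumes each document falls into exactly one category and that the category list is fixed in advance. Under those assumptions, adding or removing one document changes a single bin by 1, so Laplace noise with scale 1/ε on every bin gives ε-differential privacy for the whole histogram. The names and data are hypothetical.

```python
import random
from collections import Counter


def private_histogram(labels, domain, epsilon, rng):
    """Epsilon-DP counts per category, for a fixed, pre-declared `domain`.

    Every category in `domain` gets a noisy count, including empty ones,
    so the set of published bins reveals nothing by itself.
    """
    counts = Counter(labels)
    scale = 1 / epsilon
    return {
        c: counts[c] + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        for c in domain
    }


# Hypothetical corpus: one document-type label per document.
labels = ["complaint"] * 120 + ["referral"] * 30
domain = ["complaint", "referral", "discharge"]

rng = random.Random(0)  # seeded only to make the sketch repeatable
print(private_histogram(labels, domain, epsilon=0.5, rng=rng))
```

Noisy counts can come out negative or fractional. Rounding or clamping them afterwards is safe, because post-processing cannot weaken the guarantee.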
A pipeline architecture that takes both seriously
A document-AI pipeline that takes the technical and legal lines seriously tends to have the following shape:
- Direct-identifier redaction in the body of each document, applied automatically with a high-precision NER step and a reviewer for low-confidence detections.
- Quasi-identifier review, document-by-document, with a defined threat model: who will see this, with what side information, under what contractual restrictions.
- Separation of un-redacted source from redacted output, in line with the EDPB 2025 pseudonymisation-domain concept.
- Differential privacy applied to any aggregate statistics derived from the corpus.
- No claim of anonymisation for individual documents released to the public, unless the WP216 three-lens test has actually been run and passed.
The fifth bullet is where most pipelines fail. Saying "anonymised" because the names are removed is the failure mode the literature has been documenting for twenty-five years.
A short reading list
For anyone building a redaction pipeline who wants to understand the technical line:
- Sweeney 2000 for the structural insight on quasi-identifiers.[1]
- Sweeney 2002 for k-anonymity.[3]
- Machanavajjhala et al. 2007 for l-diversity.[4]
- Li et al. 2007 for t-closeness.[5]
- Dwork and Roth 2014 for differential privacy.[6]
- Rocher et al. 2019 for what modern re-identification looks like in practice.[2]
Read in sequence, they describe an arc from a hopeful syntactic notion to a probabilistic guarantee with rigorous semantics. The right tool for a given problem is rarely "all of them"; it is one or two, picked deliberately, against a threat model that has been written down.
The closing observation is the same one Sweeney's 2000 paper landed on, with more force every year since: people are easier to identify than we think, and the right response is to design the system on that assumption.
References
- [1] Latanya Sweeney, Simple Demographics Often Identify People Uniquely, Carnegie Mellon University, Data Privacy Working Paper 3 (2000)
- [2] Luc Rocher, Julien M. Hendrickx, Yves-Alexandre de Montjoye, Estimating the success of re-identifications in incomplete datasets using generative models, Nature Communications 10:3069 (2019)
- [3] Latanya Sweeney, k-Anonymity: A Model for Protecting Privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5):557-570 (2002)
- [4] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, Muthuramakrishnan Venkitasubramaniam, l-Diversity: Privacy Beyond k-Anonymity, ACM Transactions on Knowledge Discovery from Data 1(1) (2007)
- [5] Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, t-Closeness: Privacy Beyond k-Anonymity and l-Diversity, IEEE ICDE 2007
- [6] Cynthia Dwork, Aaron Roth, The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science 9(3-4):211-407 (2014)
- [7] Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques (WP216)