There is a small lie embedded in most data-protection conversations about redaction. The lie is that a black box drawn over a name on a PDF makes the document GDPR-safe. It does not. It can be a very useful step. It can also be the kind of half-measure that turns a routine disclosure into a reportable breach. The difference depends on details that almost nobody reads carefully until something goes wrong.
This piece is for people who have to read those details carefully: compliance leads, legal counsel, DPOs, and the engineers who build the workflows around them. The aim is to walk through what the GDPR actually requires when you redact personal data from a document, where pseudonymisation ends and anonymisation begins, and what the European Data Protection Board's recent guidelines[1] change about the day-to-day work.
The legal frame: what redaction is, and what it is not
Redaction, as a word, does not appear in the GDPR. The regulation works at a higher level of abstraction. It defines two relevant operations:
- Pseudonymisation is processing personal data so that it can no longer be attributed to a specific data subject without additional information, where that additional information is kept separately and protected.[2] The original mapping (who was who) still exists somewhere. The data is still personal data. The GDPR still applies.
- Anonymisation is the threshold at which the data can no longer be linked back to a natural person by any means reasonably likely to be used. Once you cross it, the data falls out of the GDPR's scope entirely. WP216, the 2014 Article 29 Working Party opinion that still serves as the working definition, set a high bar: resistance to singling out, linkability, and inference attacks.[4]
A redacted document sits somewhere on this spectrum. Where exactly is the question that matters.
Why a black box is rarely anonymisation
Take a contract where the personal details (name, address, national ID, signature) have been blacked out. To the human reading the page, the document looks anonymised. To a court, a regulator, or a determined adversary, it often is not.
Three failure modes recur:
- Layered PDFs. A "black box" added on top of text in a PDF viewer hides the text from view but leaves it intact in the underlying byte stream. Copying it out, opening it in a different reader, or running it through a text extractor reveals everything. This is not theoretical. It is the failure mode behind a series of disclosure incidents over the last decade, including widely reported cases involving redacted court filings and government memos.
- Quasi-identifiers. A document can name no individuals and still identify them. A medical report that says "the 47-year-old patient from Lillehammer treated at the regional hospital on 12 March 2024 with a rare lung condition" identifies one specific person, even though no name appears. WP216 calls these quasi-identifiers, and they are the central reason genuine anonymisation is hard. Latanya Sweeney's 2000 paper famously estimated that 87% of the US population could be uniquely identified by the combination of date of birth, five-digit ZIP code, and gender. A 2019 paper in Nature Communications estimated that 99.98% of Americans would be correctly re-identified in any dataset using fifteen demographic attributes.[5]
- Context. The same redacted document can be anonymous in one context and identifying in another. A redacted internal memo released to the public may be anonymous; the same memo handed to the company's competitors may be trivially re-identifiable because they already know the protagonists.
The practical consequence: most "redacted" documents are, in GDPR terms, pseudonymised, not anonymised. The data remains personal data. The legal obligations remain in force.
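The quasi-identifier problem can be checked mechanically when the data is tabular. The following is a minimal sketch, not a complete anonymisation test: it assumes records are plain dictionaries, and the column names (`age`, `town`, `treatment_date`) and the helper `small_groups` are hypothetical. It counts how many records share each combination of quasi-identifiers and reports the combinations shared by fewer than `k` records. This addresses singling out only; it says nothing about linkability or inference.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=2):
    """Return quasi-identifier combinations shared by fewer than k records."""
    combos = Counter(
        tuple(rec[col] for col in quasi_identifiers) for rec in records
    )
    return {combo: n for combo, n in combos.items() if n < k}


# Hypothetical records with no names, only quasi-identifiers.
records = [
    {"age": 47, "town": "Lillehammer", "treatment_date": "2024-03-12"},
    {"age": 47, "town": "Oslo", "treatment_date": "2024-03-12"},
    {"age": 52, "town": "Oslo", "treatment_date": "2024-03-12"},
    {"age": 52, "town": "Oslo", "treatment_date": "2024-03-12"},
]

# Combinations that occur only once single out one record each.
at_risk = small_groups(records, ["age", "town", "treatment_date"], k=2)
for combo, count in at_risk.items():
    print(combo, "appears", count, "time(s)")
```

In this toy data the first two records are unique on the three columns together, even though no single column is unique on its own. That is the pattern the Lillehammer example describes.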
What the EDPB's 2025 guidelines change
On 16 January 2025, the European Data Protection Board adopted Guidelines 01/2025 on Pseudonymisation, the first authoritative EU-level update to the practice since the GDPR took effect.[1] Two ideas in the guidelines are particularly relevant to anyone running a redaction workflow.
The "pseudonymisation domain"
The guidelines introduce the concept of a pseudonymisation domain: a defined boundary within which only pseudonymised data is processed, and where no person inside the boundary has access to the "additional information" that would re-link the data to identifiable individuals. The boundary can be a team, a system, a contract, or a network segment, but it has to be concrete and enforced.
For redaction work, the implication is direct. Producing a redacted document and then storing the un-redacted original on the same server, accessible by the same team, with the same credentials, fails the test. The "additional information" must be kept separately, both technically (different storage, different access controls) and organisationally (different people, different policies).
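One common way to engineer that separation is to replace identifiers with a keyed hash, so that the key becomes the "additional information" and can be held outside the domain. The sketch below is an illustration under stated assumptions, not a method prescribed by the guidelines: the function name `pseudonym`, the `P-` prefix, and the example key are all hypothetical, and a real key would come from a separately controlled secret store rather than source code.

```python
import hashlib
import hmac


def pseudonym(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; keep the full digest if collisions matter.
    return "P-" + digest.hexdigest()[:16]


# Assumption: this key is held outside the pseudonymisation domain.
# It is hard-coded here only so that the example runs.
key_held_outside_domain = b"example-key-not-for-production"

first = pseudonym("Kari Nordmann", key_held_outside_domain)
second = pseudonym("Kari Nordmann", key_held_outside_domain)
other_key = pseudonym("Kari Nordmann", b"a-different-key")

print(first == second)     # same key, same input: same pseudonym
print(first == other_key)  # different key: different pseudonym
```

The same person always maps to the same pseudonym, so records stay linkable. That is why the output remains pseudonymised personal data rather than anonymised data. Anyone who holds the key can recompute the mapping, which is why the key has to sit behind different storage and different access controls.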
The end of "pseudonymisation by accident"
The guidelines are explicit that pseudonymisation is not the absence of identifiers. It is the deliberate engineering of a state in which the data cannot be re-identified without specific separately-held information. Documents where names happen to be missing, but where the original is one click away, do not qualify. Pseudonymisation is a process, not a property of the output file.
A working framework: three questions before you redact
Out of this comes a checklist that holds up surprisingly well across legal, medical, public-sector, and commercial work.
1. What is the legal basis for the underlying processing?
Redaction is itself processing of personal data under GDPR Article 4(2). You need a lawful basis for the redaction operation, not only for the final use of the document.[2] In most regulated work the basis is straightforward (legal obligation, legitimate interest, contract), but it is worth naming explicitly in the record of processing activities.
2. What is the threat model?
Who might see the redacted document, and what other information do they already have? A redacted file released to the world has a different threat model from the same file delivered to a court under seal, which in turn differs from a file shared with one named third party under contract. Pretending the threat model is "everyone" when it is actually "two named recipients with an NDA" leads to over-redaction. Pretending it is "two named recipients" when the file is publicly downloadable is worse.
3. What is the residual risk after redaction?
Apply WP216's three lenses one at a time:
- Singling out. Can a single individual be picked out of the data?
- Linkability. Can two records about the same person be linked?
- Inference. Can attributes be inferred about an individual with significant probability?
If any of the three is yes, the document is pseudonymised, not anonymised. That is fine, in many contexts, as long as you act accordingly: apply Article 32 security measures[3], honour data-subject rights, log access, and budget for the possibility of a breach.
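The decision rule above is simple enough to record alongside each document. A minimal sketch, assuming a hypothetical `ResidualRisk` record filled in by a human reviewer; the code only encodes the rule and does not perform the assessment itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidualRisk:
    """Reviewer's answers to the three WP216 questions for one document."""
    singling_out: bool
    linkability: bool
    inference: bool

    def classification(self) -> str:
        # Any single "yes" keeps the document inside the GDPR's scope.
        if self.singling_out or self.linkability or self.inference:
            return "pseudonymised"
        return "candidate for anonymised"


memo = ResidualRisk(singling_out=False, linkability=True, inference=False)
print(memo.classification())
```

The label "candidate for anonymised" is deliberate: three "no" answers are only as good as the threat model they were answered against.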
Technical measures that actually matter
Guidance from Datatilsynet, the Norwegian data protection authority, on the fundamental data-protection principles[6] is clear that data minimisation is not a one-off act at the point of collection. It is a continuous obligation that runs through the lifecycle of the document. For a redaction pipeline, that translates into a small set of measures most organisations either implement or quietly skip:
- Burn the redaction into the raster, or remove the underlying text. A black box on a PDF layer is not redaction. Either rasterise the redacted regions, or strip the text content from the document object model. Modern redaction tools do this by default. Tools that "draw on top" do not.
- Audit the metadata. PDFs carry comment threads, author names, edit history, original file paths, embedded thumbnails, and EXIF data on attached images. All of it can leak identifiers. The audit step is non-negotiable.
- Treat the un-redacted original as restricted, not archived. The pseudonymisation-domain principle requires real separation. If the original lives in the same SharePoint folder as the redacted version, the redaction is decorative.
- Log every redaction. Who redacted what, when, against which rule set. The log is needed both for the accountability principle in Article 5(2) and for any later subject access request that touches the document.
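Two of the measures above, removing the underlying text and logging the redaction, can be illustrated on a toy document model. The sketch below is not a PDF library. It assumes a page is a list of text spans with bounding boxes, and the names `Span`, `overlaps`, and `redact` are hypothetical. The point is the shape of the operation: the redacted text is dropped from the data rather than covered, and the log records who, when, where, and under which rule set without storing the removed text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Span:
    """A run of text on a page with its bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(span, box):
    """True if the span's bounding box intersects the redaction box."""
    bx0, by0, bx1, by1 = box
    return span.x0 < bx1 and bx0 < span.x1 and span.y0 < by1 and by0 < span.y1


def redact(spans, boxes, actor, rule_set):
    """Drop every span touched by a box; return kept spans and a log entry."""
    kept = [s for s in spans if not any(overlaps(s, b) for b in boxes)]
    log_entry = {
        "actor": actor,
        "rule_set": rule_set,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "boxes": list(boxes),
        "spans_removed": len(spans) - len(kept),
        # Deliberately no copy of the removed text in the log.
    }
    return kept, log_entry


page = [
    Span("Tenant:", 10, 10, 60, 20),
    Span("Kari Nordmann", 65, 10, 160, 20),
]

kept, entry = redact(
    page, [(60, 5, 170, 25)], actor="reviewer-1", rule_set="names-v1"
)
extracted = " ".join(s.text for s in kept)
print(extracted)
print(entry["spans_removed"])
```

A "draw on top" tool would instead leave `page` unchanged and add a rectangle, so a text extractor would still return the name. With real PDFs the same principle applies, but the removal has to be done by a tool that rewrites the content stream or rasterises the region.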
Where AI changes the picture, and where it does not
Automated redaction has become genuinely useful in the last two years. Named-entity recognition models reach precision in the high 80s to low 90s (percent) on common entity types (names, addresses, IDs) across most document formats. For high-volume work, that is the difference between a process that takes weeks and one that takes hours.
What AI does not change is the legal calculus above. The model does not decide whether the resulting document is anonymised or pseudonymised. The threat model and the residual-risk analysis do. AI redaction tools that quietly claim "GDPR compliance" as a feature are selling a category error: the tool is one input into a compliance posture that the organisation, not the tool, is responsible for.
What to take away
Three things are worth carrying out of this piece into the next workflow review.
First, almost all redacted documents are pseudonymised, not anonymised. The GDPR still applies. Plan accordingly.
Second, the 2025 EDPB guidelines make the separation of the additional information legally significant in a way that retrofits badly to many existing setups. Redacted output sitting next to un-redacted source, accessible by the same people, is no longer defensible as a pseudonymisation strategy. It probably never was.
Third, AI tools change the throughput of the work, not its legal shape. The hard questions are still the same: what is the threat model, what is the residual risk, and what would survive a regulator's audit. Those questions belong to the organisation, not to a model.
The good news is that none of this is unworkable. Done well, redaction is one of the highest-leverage data-minimisation moves an organisation can make. Done carelessly, it is a way of looking compliant while being something else.
References
[1] European Data Protection Board, Guidelines 01/2025 on Pseudonymisation (adopted 16 January 2025)
[2] Regulation (EU) 2016/679 (GDPR), Article 4(5): definition of pseudonymisation
[3] Regulation (EU) 2016/679 (GDPR), Article 32: security of processing
[4] Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques (WP216)
[5] Rocher, Hendrickx, de Montjoye, Estimating the success of re-identifications in incomplete datasets using generative models, Nature Communications 10:3069 (2019)
[6] Datatilsynet (Norwegian Data Protection Authority), Veiledning om de grunnleggende personvernprinsippene (guidance on the fundamental data-protection principles)