Daniel Lopresti, a professor of computer science at Lehigh University who has studied redaction techniques, says the research is impressive. It “presents a comprehensive study of redaction tools and the ways in which they can be broken, including exploiting nearly invisible aspects of a document’s typography,” says Lopresti, who was not involved with the research. “The picture it paints is scary; too often redaction is done badly.”
The vast majority of the organizations impacted by real-world redaction failures highlighted in the research—including the US Department of Justice, the US courts system, the Office of Inspector General, and Adobe—did not respond to WIRED’s request for comment. Bland and the research paper say that many of the organizations have engaged with the team’s research.
Microsoft did not address data being leaked from Word documents that are converted to PDFs. “Customers can save a document as a PDF, but it is the role of the redaction tool to censor or obscure information,” says Jeff Jones, a senior director at Microsoft. Jones adds that people should “review” data and their files before converting them to a format that is going to be shared.
Meanwhile, Mike Lissner, executive director of the Free Law Project, a nonprofit that helps open up court data and provided access to legal documents for the research, says the organization has developed a system that can help identify badly redacted documents. “This works well, but by the time a document is published in a court’s filing system, the secret is out, so we’re working on tools that will integrate with document management systems that lawyers use,” Lissner says.
Digital document redaction has proved challenging for years, with countless examples of failures to properly secure sensitive information. Sometimes it is human error; other times, technical failings are at fault. “It’s hard to redact something as complicated as a PDF to completely remove the information,” Levchenko says. PDFs can contain text, images, tables, metadata, and more.
Multiple high-profile redaction failures have exposed information that someone wanted to keep secret. These have involved mistakes in the redaction process, failure to properly protect the information, and the inclusion of enough details to allow people to decipher what the redactions were meant to hide.
For instance, in 1991 researchers used a “desktop computer” to reverse engineer the Dead Sea Scrolls to reveal their full text and open the documents up to more people. Back in 2008, details about secret wiretapping agreements between the US government and telecoms firms could be accessed using copy and paste. In 2016, Edward Snowden was revealed as the target of US spying following a failure to redact his personal details. In October 2020, journalists were able to decipher redactions in Ghislaine Maxwell’s court deposition. And in February 2021, the European Commission published a version of its Covid-19 contract for the AstraZeneca vaccine that it didn’t properly redact.
When it comes to effectively redacting documents and protecting people’s information, the Illinois researchers hope their work will highlight another way PDFs can be attacked and encourage the creators of software to include measures that prevent hidden information from being leaked. They say that for now the NSA’s guidelines for redacting documents are perhaps the best way to protect redactions. The guide says if you redact Word documents, you should change the content of the original document before redacting the resulting PDF. Change someone’s name to a row of “x” characters or the word “redacted,” just to be safe.
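As a rough illustration of that advice, the Python sketch below swaps sensitive terms for a placeholder in the source text before any conversion to PDF. It is not taken from the NSA guide or the researchers' work. The function name `scrub`, the `[REDACTED]` placeholder, and the sample sentence are all hypothetical. It also assumes the document's text is available as a plain string, which sidesteps the details of editing a real Word file.

```python
import re

# Hypothetical placeholder. It has a fixed length, so the replacement
# does not hint at how long the original text was.
PLACEHOLDER = "[REDACTED]"


def scrub(text: str, sensitive_terms: list[str]) -> str:
    """Replace every occurrence of each sensitive term with PLACEHOLDER."""
    # Match longer terms first so "Jane Doe" wins over a shorter "Jane".
    terms = sorted((t for t in sensitive_terms if t), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile(
        "|".join(re.escape(t) for t in terms),
        flags=re.IGNORECASE,
    )
    return pattern.sub(PLACEHOLDER, text)


if __name__ == "__main__":
    draft = "Jane Doe met the informant. Later, jane doe filed a report."
    print(scrub(draft, ["Jane Doe"]))
```

Because the original characters are gone from the source before the PDF is created, there is nothing underneath a black box to copy, paste, or infer from spacing. A simple find-and-replace like this only catches the exact terms it is given, so a human review of the result is still needed.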