UMD Libraries Releases New Open Source Web Application Developed for White House Pool Reports Digital Collection
Custom-built email editor automates redaction of sensitive information in reports by journalists covering the president
The White House Correspondents’ Association (WHCA) was founded in 1914 to promote excellence in journalism, encourage robust reporting on the U.S. presidency, and support democracy through a free press. The White House Press Corps is made up of journalists credentialed by WHCA. This press pool provides reporting on the President’s daily activities and events. The UMD Libraries’ WHCA Pool Reports Collection consists of email pool reports, dating back to June 2020, created by journalists covering the U.S. President and Vice President. The collection is updated monthly and is available to anyone at https://whpool.lib.umd.edu/.
To address issues of privacy and security in making these emails publicly available, an in-house team of UMD Libraries’ designers and developers created an automated solution for redacting sensitive information. The resulting production tool, called SCUTES, has now been released as an open source web application for processing email and redacting personally identifiable information (PII) of journalists in the press pool. The source code is available at https://github.com/umd-lib/scutes.
Digitally archiving these born-digital public records raises four critical issues. First, the emails must display uniformly for end users regardless of the source email application. Second, the personal privacy and safety of journalists is an important concern, especially in the current political climate. Third, embedded links and images are ubiquitous in today's media, so sustaining an accurate historical record requires converting the links to stable URLs. Lastly, and perhaps most importantly, pool reports are a news product: reports documenting presidential activities and events continue daily, requiring continuous acquisition of new email messages. To produce a workflow that was not labor-intensive, the team sought automated solutions and, when none were available off the shelf, designed them in house.
When the Pool Reports Collection Project began, available open-source email editing and processing tools lacked automated PII redaction. Thus, software tools and processes needed to be designed and produced. Ultimately, the design team developed a four-step workflow: capture and download messages into an archive, extract and clean the files, redact PII, and convert the emails for import into the digital repository for public use.
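For readers curious what the redaction step of such a workflow might look like, the following is a minimal Python sketch. It is not the SCUTES implementation, whose rules are not described here: the function name `redact_pii`, the `[REDACTED]` placeholder, and both regular expressions are assumptions made purely for illustration, and they only cover simple email addresses and US-style phone numbers.

```python
import re

# Hypothetical patterns for illustration only; they are not taken from SCUTES.
# Matches simple email addresses such as jane.doe@example.com.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Matches US-style phone numbers such as (202) 555-0143 or 202-555-0143,
# with an optional leading country code of 1.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def redact_pii(text, placeholder="[REDACTED]"):
    """Replace email addresses and phone numbers in text with a placeholder."""
    text = EMAIL_RE.sub(placeholder, text)
    return PHONE_RE.sub(placeholder, text)


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or (202) 555-0143."
    print(redact_pii(sample))
```

A pattern-based pass like this will miss some PII and occasionally flag harmless text, which is one reason a review step by a curator remains useful after automated redaction.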
Prior to the release of SCUTES, the process was labor-intensive, time-consuming, and overly complex, leading to potential errors and an unsustainable workload. Individual message texts, any attached images or documents, and a CSV file containing metadata were edited manually, and processing the collection required multiple people and four separate manual uploads and downloads by staff. In addition, the steps required handling multiple files with different tools: a spreadsheet program, a text editor, command line tools, and a web browser.
SCUTES streamlines and automates the continuous acquisition and quality processing of new email content by reducing the number of times files are handled, reducing the number of programs needed to process the collection, and providing an efficient editor in which curators can check the quality of the redacted information. Processing time has been reduced from an average of 24 hours per batch to 4 hours per batch.
SCUTES was developed by UMD Libraries’ faculty and staff members Tim Kanke, Patti Cossard, and Ben Wallberg. Special thanks to Tim Kanke, contract research developer, for his work on analyzing existing workflows, evaluating available solutions, and designing and building SCUTES. We would also like to thank Cathy Merrill and the Merrill Foundation, Inc. for funding this important work.