BagIt File Packaging Format: file transfer and fixity

Fixity is important for digital content that has been selected for long-term preservation by the Library, where trust is critical in our efforts to ensure the authenticity, integrity, provenance, longevity and ongoing accessibility of our digital collections. We need to maintain fixity for our digitised and born-digital collections, which means ensuring the integrity of each file and verifying that it has not been corrupted or altered in an undocumented manner. We can establish and monitor fixity using technical methods including checksums. A checksum is often referred to as a “digital fingerprint”: any change to the bits in a file will result in a different checksum. The risk of data integrity errors is often greatest while files are being transferred, and checksums allow us to safeguard against this.
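To make the "digital fingerprint" idea concrete, here is a minimal sketch using Python's standard hashlib module. The function names and the choice of SHA-256 as the default are assumptions for illustration only, not a description of the tools the Library uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, expected_checksum, algorithm="sha256"):
    """Compare a file's current checksum with one recorded earlier."""
    return file_checksum(path, algorithm) == expected_checksum.lower()
```

Recording the result of file_checksum before a transfer and calling is_unchanged on the copy afterwards is the basic fixity check: a change to even one byte of the file produces a different digest.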

The Library has been utilising checksums with born-digital acquisitions for many years, but the methods used to both generate and store them varied. The use of diverse file types and data structures meant that we could not validate and maintain fixity on a large, automated scale. It also affected the acquisitions workflow, with multiple programs and steps needed to offload digital content to our internal network storage for preparation for ingestion into our digital preservation system.

How were checksums generated and stored? Generated with ExactFile, FTK Imager, ShotPut Pro and TeraCopy. Stored as XLSX, CSV, TXT, PNG, JPG.

In 2016 we began to investigate the BagIt File Packaging Format for use with our born-digital acquisitions. Designed to support storage and transfer of arbitrary digital content via both physical carrier and network transfers, BagIt is widely used for digital curation and digital preservation of content over time. A “bag” consists of a “payload” (the digital files) and “tags”, which are metadata intended to facilitate and document the storage and transfer of the bag (including checksums).
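The layout of a bag can be sketched with the Python standard library alone. The file names bagit.txt, manifest-sha256.txt and the data directory follow the BagIt specification. The function name, the choice of SHA-256 and the version number written to bagit.txt are assumptions for illustration, and the version a given tool writes may differ. Tools such as Bagger do this work themselves and also write further tag files (for example bag-info.txt), which this sketch omits.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write bagit.txt plus a SHA-256 manifest."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload_dir = bag_dir / "data"
    shutil.copytree(source_dir, payload_dir)  # the payload: the digital files

    # Bag declaration: the tag file that marks the directory as a bag.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )

    # Payload manifest: one "<checksum>  <path relative to bag root>" line per file.
    # read_bytes() loads each file whole, which is fine for a sketch but not for huge files.
    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        checksum = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{checksum}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

The resulting directory holds the payload under data and the tags beside it, so the checksums travel with the files they describe.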

There are many open source tools available that have been designed to create and manage bags in several programming environments. It was important for us to have a program with a graphical user interface (GUI), which led to the decision to use Bagger by the Library of Congress. We implemented the use of Bagger and the BagIt Format in May 2017 and immediately found that it eliminated multiple steps and tools in our previous workflow and democratised the process of offloading content by making it more user friendly. Another benefit of using Bagger is the ability to create a custom Project Profile, which provides a template for metadata fields and values (stored in a bag-info.txt file) to ensure that the same metadata is being generated and stored with each bag.

Diagram of workflow improvement, showing steps and processes where using Bagger removed two processes in the previous workflow

Born-digital acquisitions workflow before and after using Bagger.

Screenshot showing the BagIt structure used by the Library as well as the text files containing metadata about the payload.

An example of a bag created using Bagger and the SLNSW Project Profile.

Born-digital collections are stored in a bag on a network drive in preparation for ingestion into Rosetta, our digital preservation system. Our aim was to create bags that are as self-descriptive as possible, containing a catalogue record number as the bag name and utilising a Project Profile to record human readable information such as who transferred the files, where the physical carrier is stored and the corporate file number. We also create a metadata folder outside the payload (data) folder to store malware/virus scans and any other documents that are useful for the ingestion process. For born-digital photographs, this includes descriptive metadata spreadsheets from the photographer which are used to semi-automatically create item level records for each photograph.

Using Bagger at the point of acquisition, when we receive digital assets, allows us to establish fixity and validate it at multiple steps throughout our pre-ingest and ingest workflows. We do not use bags as a Submission Information Package (SIP) for ingestion into Rosetta, but we include the checksums from the manifest in the SIP. The checksums generated by Bagger are validated by and stored in Rosetta, so that no material of unverified integrity can enter preservation storage. During the ingestion workflow, Rosetta also generates additional checksums using different algorithms.
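As a rough illustration of carrying manifest checksums forward into another package, the sketch below parses a BagIt payload manifest into a Python dictionary. The function name is hypothetical, and how the values are then written into a Rosetta SIP is not shown because the source does not describe it. The sketch assumes the common manifest line layout of a checksum, whitespace, then a path relative to the bag root.

```python
from pathlib import Path


def read_manifest(manifest_path):
    """Parse a BagIt payload manifest into {relative_path: checksum}."""
    checksums = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first run of whitespace only, so paths may contain spaces.
        checksum, file_path = line.split(None, 1)
        checksums[file_path] = checksum.lower()
    return checksums
```

Each entry pairs a payload path with the checksum established at acquisition, which is the value a later system would need to validate against.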

Checksums are validated at acquisition, internal network transfer and ingestion

While the initial intention was for use with born-digital collections, we are beginning to use the BagIt Format for all digital collections. This includes outsourced digitisation projects where we added the requirement to receive deliveries in the BagIt Format this year. Digitised workflows are slightly different to born-digital, in that files and their checksums will change until they are ready for ingestion due to the injection of descriptive metadata into the files. Because of this, we can only use bags for sending and receiving deliveries and then re-establish fixity by creating a new bag once they have been prepared for ingestion.

Bagger is useful for creating and validating a single bag, but we use the command line tool bagit-python when working with larger numbers of bags. Command line tools can validate multiple bags inside a directory, which is useful when transferring large amounts of data. We are also using bagit-python in our automated workflow for the bulk ingestion of our digitised and born-digital backlog. Our bespoke bulk ingestion tool, Panda (Preservation aND Digital Access), will validate existing bags or create a new bag for legacy material that enters the process without the BagIt structure.
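The idea of validating every bag in a directory can be sketched as follows. This is a standard-library stand-in written for illustration, not the bagit-python API and not the Library's Panda tool, and the function names are hypothetical. It only recomputes payload checksums against each manifest. A full validator also checks tag manifests and flags payload files that no manifest lists.

```python
import hashlib
from pathlib import Path


def payload_errors(bag_dir):
    """Return a list of problems found when checking payload files against manifests."""
    bag_dir = Path(bag_dir)
    manifests = sorted(bag_dir.glob("manifest-*.txt"))
    if not manifests:
        return ["no payload manifest found"]
    errors = []
    for manifest in manifests:
        # The algorithm name comes from the file name, e.g. manifest-md5.txt -> md5.
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            expected, rel_path = line.split(None, 1)
            target = bag_dir / rel_path
            if not target.is_file():
                errors.append(f"missing: {rel_path}")
                continue
            digest = hashlib.new(algorithm)
            with open(target, "rb") as handle:
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                errors.append(f"checksum mismatch ({algorithm}): {rel_path}")
    return errors


def validate_bags(parent_dir):
    """Check each subdirectory that contains bagit.txt; return {bag_name: [errors]}."""
    results = {}
    for bag_dir in sorted(Path(parent_dir).iterdir()):
        if (bag_dir / "bagit.txt").is_file():
            results[bag_dir.name] = payload_errors(bag_dir)
    return results
```

An empty error list for a bag means every file listed in its manifests is present with a matching checksum. Any other result names the files that need attention before the transfer is accepted.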

Utilising the BagIt File Packaging Format has standardised our file packaging across multiple workflows and has allowed us to facilitate fixity checks from the point of acquisition. The use of checksums allows us to ensure fixity in both pre-ingest and preservation storage where Rosetta takes over the active management of digital content over time.

Matthew Burgess, Digital Collections Analyst.
