PanDA – digital asset ingestion at scale

Since 2012 the State Library of New South Wales has been undertaking mass digitisation of the Library's collections as part of the Digital Excellence Program (DEP). This digitisation program has produced over 11.8 million master files. A need for a digital preservation system was identified, and Rosetta was selected as part of the Integrated Collection Management System (ICMS) in 2016.

During the transition from the legacy digital asset management system to Rosetta (and going into production) a massive backlog of files built up. Rosetta follows the Open Archival Information System (OAIS) reference model and requires the creation of Submission Information Packages (SIPs) to deposit digital assets. A solution was needed to prepare, validate and package these files into SIPs efficiently and reliably.

Initial efforts focused on Python scripts as part of a semi-automated workflow. This initial solution was reliable but not very efficient. Scripts had to be executed manually, SIPs transferred manually using rsync, and deposits triggered manually in Rosetta. In addition, the scripts needed constant modification to handle the variety of workflows required for the different collections and file formats. It became clear that a better solution was required.

A project team was established including representatives from across digital systems and library collection and service teams. High-level requirements were gathered, a story backlog was created and PanDA (Preservation and Digital Access) was born. The project was implemented following agile practices.

The overarching goals of the project included:

  • Automate and streamline the entire ingestion and integration process;
  • Clear the digitisation backlog by the end of financial year (EOFY) 2017/18;
  • Support both digitised and born digital ingestions.

In order to achieve these goals, PanDA was designed to be flexible, automated, reliable, scalable, integrated and transparent.

Flexible

PanDA has the concept of tasks and jobs. A PanDA task is a discrete piece of work designed to achieve a very specific outcome. For example, removing unwanted files from folders, packaging files into the BagIt format, creating a METS metadata file and creating derivatives are all PanDA tasks.

In PanDA, tasks are grouped together to create a job, where the job is the objective and the tasks are the discrete pieces of work required to achieve that objective. For example, one such job ID is INGEST.ADLIB.BORNDIGITAL.PHOTOGRAPH.PROD, which translates to "ingest born digital photography into production".

A digital dashboard showing a list of task functions in a PanDA job.

PanDA born digital photographs job

This ability to group tasks makes PanDA extremely flexible, and you can create jobs in three different ways:

  • you can use a job template that contains all the tasks you require;
  • you can use a job template that contains most of the tasks you require and add one or more tasks to the job as required;
  • you can put together a unique job by selecting the required task(s) from the PanDA Task List (which is a list of all the available tasks in PanDA).
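The relationship between tasks, job templates and jobs can be illustrated with a minimal Python sketch. This is not PanDA's actual code: the registry, the task names, the task bodies and the contents of the template are all assumptions made for illustration. Only the job ID comes from the example above.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical task signature: each task takes an entity (a dict here) and returns it.
Task = Callable[[dict], dict]

# Hypothetical stand-in for the PanDA Task List: task name -> task function.
TASK_LIST: Dict[str, Task] = {}


def task(name: str) -> Callable[[Task], Task]:
    """Register a function in the task list under the given name."""
    def decorator(func: Task) -> Task:
        TASK_LIST[name] = func
        return func
    return decorator


@task("remove_unwanted_files")
def remove_unwanted_files(entity: dict) -> dict:
    unwanted = {"Thumbs.db", ".DS_Store"}  # illustrative list only
    entity["files"] = [f for f in entity["files"] if f not in unwanted]
    return entity


@task("create_mets")
def create_mets(entity: dict) -> dict:
    entity["files"] = entity["files"] + ["mets.xml"]  # placeholder for a real METS file
    return entity


@task("create_derivatives")
def create_derivatives(entity: dict) -> dict:
    entity["derivatives"] = [
        f.rsplit(".", 1)[0] + ".jpg" for f in entity["files"] if f.endswith(".tif")
    ]
    return entity


# Hypothetical template: the tasks listed for this job ID are invented.
JOB_TEMPLATES: Dict[str, List[str]] = {
    "INGEST.ADLIB.BORNDIGITAL.PHOTOGRAPH.PROD": ["remove_unwanted_files", "create_mets"],
}


def build_job(template: Optional[str] = None, extra_tasks: Iterable[str] = ()) -> List[Task]:
    """Build a job from a template, a template plus extra tasks, or tasks alone."""
    names = list(JOB_TEMPLATES.get(template, [])) + list(extra_tasks)
    return [TASK_LIST[name] for name in names]


def run_job(job: List[Task], entity: dict) -> dict:
    """Run each task in order, passing the entity from one task to the next."""
    for step in job:
        entity = step(entity)
    return entity
```

Calling build_job with only a template, with a template and extra_tasks, or with extra_tasks alone corresponds to the three ways of creating a job listed above.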

Automated

PanDA has the concept of an entity. An entity is something which needs to be processed and tracked. The most common example is a folder on the filesystem containing files.

The first and foremost goal of the system is to automate the processing of entities. PanDA has two main mechanisms to begin processing. Jobs can be configured to watch a hot folder for any new subfolders. These subfolders can be dropped into the hot folder manually by staff or automatically by other software. Alternatively, an API call can be used to send the location of a subfolder that is ready to be processed. The Library’s digitisation workflow software uses this API to trigger PanDA processing whenever a work order is ready to be ingested. In either case the subfolder then becomes a registered entity in the PanDA system and can be tracked as it is processed by the tasks in the job.
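As a rough illustration of the hot folder mechanism, the sketch below polls a folder and registers each new subfolder as an entity exactly once. The class and function names are hypothetical, and a real system would persist the registry rather than keep it in memory. An API-triggered entity could go through the same register method.

```python
from pathlib import Path
from typing import Dict, List


class EntityRegistry:
    """Hypothetical in-memory registry of entities, keyed by folder location."""

    def __init__(self) -> None:
        self.entities: Dict[str, dict] = {}

    def register(self, location: Path, job_id: str) -> bool:
        """Register a folder as an entity; return False if it is already known."""
        key = str(location.resolve())
        if key in self.entities:
            return False
        self.entities[key] = {"job_id": job_id, "status": "registered"}
        return True


def scan_hot_folder(hot_folder: Path, job_id: str, registry: EntityRegistry) -> List[Path]:
    """Return the subfolders that were newly registered during this scan."""
    newly_registered = []
    for child in sorted(hot_folder.iterdir()):
        # Only subfolders become entities; loose files are ignored.
        if child.is_dir() and registry.register(child, job_id):
            newly_registered.append(child)
    return newly_registered
```

Running scan_hot_folder on a timer gives the watching behaviour: a subfolder is picked up on the first scan after it appears and is skipped on every later scan.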

Reliable

In most cases staff will not be required to do anything else. PanDA will go ahead and process the entity, validate the bag, embed metadata into files, turn the entity into a SIP, upload it to deposit storage, trigger the deposit, track the ingestion, validate the success of the ingestion, integrate it with the catalogue and notify the workflow software that the entity has completed processing. Staff can sit back and relax, confident in the knowledge that PanDA has got it all covered!

Of course, in the real world things go wrong. To deal with this, error handling has been built into PanDA’s DNA. PanDA distinguishes between what it calls soft errors and hard errors. Soft errors are problems that are likely to be ephemeral. An example of a soft error is an API timeout. In this case PanDA will wait and try again a short time later. If three attempts fail, PanDA will then flag a hard error for this entity. Hard errors are problems which require a staff member to investigate and fix. An example of a hard error is when DROID cannot identify a file. Running DROID over the file again isn’t going to change anything; someone has to step in and do something. Another example of a hard error is anything that appears during the quality assurance checks after ingestion is complete.
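The soft and hard error distinction can be sketched in a few lines of Python. The exception classes and the run_with_retries helper are hypothetical names, not PanDA's real ones. The limit of three attempts comes from the description above, while the wait time is an assumption.

```python
import time
from typing import Any, Callable


class SoftError(Exception):
    """A problem that is likely to be ephemeral, such as an API timeout."""


class HardError(Exception):
    """A problem that needs a staff member to investigate and fix."""


MAX_ATTEMPTS = 3  # after three failed attempts a soft error becomes a hard error


def run_with_retries(task: Callable[[Any], Any], entity: Any, wait_seconds: float = 0.0) -> Any:
    """Run a task, retrying soft errors; hard errors propagate immediately."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return task(entity)
        except SoftError as exc:
            if attempt == MAX_ATTEMPTS:
                # Escalate: retrying has not helped, so flag for staff attention.
                raise HardError(
                    f"{entity}: gave up after {attempt} attempts: {exc}"
                ) from exc
            time.sleep(wait_seconds)  # wait, then try again a short time later
```

A task that raises HardError directly (for example, an unidentifiable file) is never retried, because only SoftError is caught inside the loop.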

An error log for a PanDA process, with fields including "ID", "Exception Code", "Action Taken", and "Message".

PanDA hard error - incorrect number of representations found in Rosetta IE

Scalable

PanDA needs to perform all these tasks on thousands of files every single day without interruption. PanDA achieves this using Celery, a distributed task queue, and the RabbitMQ message broker. Distributed worker processes pull tasks from the central queue. Currently we are running three virtual machines with four processes each, making a total of 12 worker processes. This way, at any given time PanDA can be doing 12 different things! If we want to speed things up, it’s just a matter of spinning up a new VM and deploying the PanDA worker. The new worker will register itself and begin pulling tasks from the queue.
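The pattern of workers pulling from a central queue can be illustrated using only the Python standard library. This sketch deliberately does not use Celery or RabbitMQ: threads stand in for worker processes and queue.Queue stands in for the broker, and all names are invented. The figures of three VMs with four processes each are the ones quoted above.

```python
import queue
import threading
from typing import Iterable, List, Tuple

VMS = 3               # virtual machines currently running
PROCESSES_PER_VM = 4  # worker processes per VM


def worker(name: str, tasks: queue.Queue, results: List[Tuple[str, object]],
           lock: threading.Lock) -> None:
    """Pull items from the shared queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        with lock:
            results.append((name, item))  # placeholder for real processing


def process_all(items: Iterable[object]) -> Tuple[List[str], List[Tuple[str, object]]]:
    """Process every item using VMS * PROCESSES_PER_VM workers."""
    tasks: queue.Queue = queue.Queue()
    results: List[Tuple[str, object]] = []
    lock = threading.Lock()
    names = [
        f"vm{vm}-proc{proc}"
        for vm in range(1, VMS + 1)
        for proc in range(1, PROCESSES_PER_VM + 1)
    ]
    threads = [
        threading.Thread(target=worker, args=(name, tasks, results, lock))
        for name in names
    ]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return names, results
```

Scaling out in this model means raising VMS: the new workers need no coordination with the existing ones, because each simply pulls the next item from the same queue.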

There is no hard limit on how far we can scale horizontally; in reality, however, bottlenecks do appear. Disk I/O and bandwidth eventually become bottlenecks, at which point scaling out compute just makes the problem worse. Even so, the scalable processing given to us by PanDA has massively increased our overall throughput.

A queue of PanDA processes displayed in a list in the PanDA process dashboard.

PanDA process dashboard

Integrated

PanDA is integrated with multiple systems. Rosetta integration is included to trigger and track ingestions. As described above, the Library’s workflow software 'talks' with PanDA to start off processing. We also integrate with the library management system Alma and the archival management system Adlib. From both of these systems PanDA retrieves metadata to be embedded in the files and included in the SIP. PanDA also writes to these systems to link each catalogue record with its digital assets.

Transparent

Every single entity is tracked as it goes through PanDA and every single operation is logged. Logs are automatically exported and archived in a structured JSON format. This ensures that at any point in the future we can easily pull up a detailed audit trail of any file which has been processed by PanDA. It also allows us to generate reports on ingest rates, error categories and so on.
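To show why structured JSON logs make audit trails and reports straightforward, here is a small sketch that writes one JSON record per operation and then queries the records. The field names (entity_id, task, status, error_category) are assumptions for illustration and are not PanDA's actual log schema.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import List, Optional


def log_operation(log_lines: List[str], entity_id: str, task: str, status: str,
                  error_category: Optional[str] = None) -> None:
    """Append one operation to the log as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "entity_id": entity_id,
        "task": task,
        "status": status,
        "error_category": error_category,
    }
    log_lines.append(json.dumps(record))


def audit_trail(log_lines: List[str], entity_id: str) -> List[dict]:
    """Return every logged operation for one entity, in the order logged."""
    records = (json.loads(line) for line in log_lines)
    return [record for record in records if record["entity_id"] == entity_id]


def error_report(log_lines: List[str]) -> Counter:
    """Count logged errors by category."""
    records = (json.loads(line) for line in log_lines)
    return Counter(
        record["error_category"] for record in records if record["status"] == "error"
    )
```

Because every record carries the same fields, both the audit trail and the report are simple filters over the archived lines, with no free-text parsing needed.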

PanDA tech stack

Some of the technologies used:

 

Peter Brotherton,
Systems and Applications Developer
