Description of the platform used for ingesting and storing the digitised newspapers

The system for receiving and ingesting the digitised newspapers from Ninestars is detailed below. The newspapers will be received in batches, each containing thousands of files; see the Batch Description.

It can be understood as a series of steps that must be done for each batch. Each batch will be either accepted or rejected by the steps of the validation process. These steps involve both human checking and automated systems. A rejected batch should be fixed and resubmitted by Ninestars or (in case of false rejection) the validation process should be updated and rerun, until the batch can be accepted. If a batch is neither accepted nor rejected within a given timeframe, then Ninestars are not obliged to fix the errors.

Throughout the process, the state of a Batch can be tracked by the Surveillance Interface.

What matters is not that there is a surveillance interface but that there is a monitor component that records a persistent state for each batch. - CSR

Add something like "The state of each batch is stored in DOMS and accessed through an API which queries a caching layer (e.g. a lucene index of batch objects in DOMS). So there will be some latency between updates to DOMS batch objects and the result of API queries."

Newspaper Digitisation Process Monitor

Each step in the process is handled by an autonomous component. Are these what are commonly called Autonomous Agents? - CSR

Autonomous Components

The autonomous components can be thought of as little robots working on an assembly line. They all know their place on the assembly line, relative to the other robots. That is to say, each robot knows which events must have transpired for a batch before the batch reaches it. Please note that the robots are virtual, so several robots can occupy the same spot on the assembly line, and several robots can work on the same batch simultaneously.

Most of their life is spent sleeping. Periodically, they open their eyes to check whether a new piece of work has arrived for them. This is what we call "The Polling Step". They poll the Newspaper Batch Event Framework, SBOI, to find all batches that have experienced the set of events corresponding to this robot's place.

If the robot finds that a batch is ready to work on, it starts working on this batch. The robot will not poll for more work while working.
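As a rough illustration of the polling step, the query a robot sends can be thought of as "all batches that have these events". The sketch below uses hypothetical stand-ins (`Batch`, `poll`, and the sample index are not the SBOI API). Skipping batches that already carry the robot's own event is an assumption, made so that a batch is not picked up twice.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical view of a batch as seen through the index."""
    batch_id: str
    events: set = field(default_factory=set)


def poll(index, required_events, produced_event):
    """Return the batches that are ready for one robot.

    A batch is ready when it has experienced every required event and
    (assumption) has not yet received the event this robot produces.
    """
    required = set(required_events)
    return [
        batch for batch in index
        if required <= batch.events and produced_event not in batch.events
    ]


# Hypothetical snapshot of what the index might hold.
index = [
    Batch("B1", {"Batch Object Creation"}),
    Batch("B2", {"Batch Object Creation", "Bitrepository Ingest"}),
    Batch("B3"),
]

# What the "Autonomous Bitrepository Ingester" would see when it wakes up.
ready = poll(index, ["Batch Object Creation"], "Bitrepository Ingest")
ready_ids = [batch.batch_id for batch in ready]
```

Here only B1 is returned: B2 has already been ingested and B3 has not yet been announced.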

Some of the robots will be able to multitask, and others will not. Multitasking is done in a slightly different way for robots than for humans. The multitasking robot will build a new little non-multitasking robot child, give it the batch to work on, and then go back to sleep. Next time it polls for work, it will have to remember not to start work on batches that have already been assigned to any of the robot's children. This also means that there must only ever be one instance of each type of multitasking robot running.
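A minimal sketch of the multitasking pattern, assuming the children are threads in a pool and the parent keeps its list of assigned batches in memory. The class and method names are hypothetical, not taken from the actual components.

```python
from concurrent.futures import ThreadPoolExecutor


class MultitaskingRobot:
    """Parent robot that hands each ready batch to a child worker."""

    def __init__(self, work, max_children=4):
        self._work = work  # the task a child performs on one batch
        self._pool = ThreadPoolExecutor(max_workers=max_children)
        self._assigned = {}  # batch id -> Future of the child working on it

    def on_poll(self, ready_batch_ids):
        """Start a child for every ready batch not already handed out."""
        started = []
        for batch_id in ready_batch_ids:
            if batch_id in self._assigned:
                continue  # one of our children already has this batch
            self._assigned[batch_id] = self._pool.submit(self._work, batch_id)
            started.append(batch_id)
        return started

    def wait_for_children(self):
        """Collect the children's results and release the pool."""
        results = {bid: fut.result() for bid, fut in self._assigned.items()}
        self._pool.shutdown()
        return results


robot = MultitaskingRobot(work=lambda batch_id: "done " + batch_id)
first = robot.on_poll(["B1", "B2"])         # both batches get a child
second = robot.on_poll(["B1", "B2", "B3"])  # only B3 is new
results = robot.wait_for_children()
```

Because the record of assigned batches lives only in the parent's memory in this sketch, a second parent of the same type would not see it and could start duplicate children, which is one way to read the single-instance requirement.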

When a robot has finished work on a batch, it must record this, so that the assembly line can move forward. It records this on the Batch object, stored in the Digital Object Management System (DOMS), in the form of an event somewhat like "I, <ROBOT>, did <THIS WORK> on batch <ID> with <THIS RESULT>". The SBOI will then periodically (and often) query DOMS for updates to the Batch objects. When the SBOI discovers a Batch object update, it updates the index, so that robots further along the assembly line can work on the batch.
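The event record and the delay between writing it and seeing it in the index could be sketched as follows. The field names, the `Doms` and `Index` classes and the explicit `refresh()` call are all assumptions for illustration; the real DOMS and SBOI interfaces are not described here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    """'I, <robot>, did <work> on batch <batch_id> with <result>'."""
    robot: str
    work: str
    batch_id: str
    result: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class Doms:
    """In-memory stand-in for the Batch objects held in DOMS."""

    def __init__(self):
        self.batch_events = {}

    def add_event(self, event):
        self.batch_events.setdefault(event.batch_id, []).append(event)


class Index:
    """Stand-in for the index: only as fresh as its last refresh."""

    def __init__(self, doms):
        self._doms = doms
        self._snapshot = {}

    def refresh(self):
        # In the real system this would happen periodically.
        self._snapshot = {
            batch_id: {event.work for event in events}
            for batch_id, events in self._doms.batch_events.items()
        }

    def events_for(self, batch_id):
        return self._snapshot.get(batch_id, set())


doms = Doms()
index = Index(doms)
doms.add_event(Event("Autonomous Bitrepository Ingester",
                     "Bitrepository Ingest", "B1", "success"))
before = index.events_for("B1")  # index has not caught up yet
index.refresh()
after = index.events_for("B1")   # now visible to robots further along
```

The gap between `before` and `after` is the latency a downstream robot experiences: it cannot start until the index has picked up the new event.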

Autonomous Dependency Graph

Ingest and Metadata creation

The first robot on the assembly line is not really a robot. It is the company called Ninestars. They:

  • Digitise a batch of newspaper microfilm
  • Upload the batch to our servers (by rsync)
  • Notify us that the receiving process can begin for this batch (the "Batch Object Creation" event)

There are some minor questions of detail here about who creates the initial object (e.g. a record in state "NEW" in a database). Is it:

  • Us, when we send the batch out to Ninestars
  • Ninestars, after they have successfully uploaded the files to us
  • Us, after we have received a message (in some form ...) from Ninestars that the files have been uploaded

Any of these could be made to work. - CSR

The first real robot is the "Autonomous Bitrepository Ingester". It polls for "Batch Object Creation" events, so it will receive batches right after Ninestars have uploaded them. For a batch, it will iterate over the jpeg2000 files and for each:

  • Ingest it into the bit repository
  • Generate a unique URL
  • Create a file object with this URL in DOMS

Finally, it will add the "Bitrepository Ingest" event to the batch object.
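The per-file loop of the ingester might look roughly like this. Everything here is a stand-in: the bit repository is a plain list, DOMS is a dictionary, the base URL is made up, and selecting files by a `.jp2` suffix is an assumption about how jpeg2000 files are recognised.

```python
import uuid


def ingest_batch(batch_id, filenames, bit_repository, doms,
                 base_url="https://bitrepository.example.invalid/"):
    """Ingest every jpeg2000 file of a batch, then record the event."""
    for name in filenames:
        if not name.lower().endswith(".jp2"):
            continue  # only the jpeg2000 files are handled by this robot
        bit_repository.append(name)         # 1. ingest into the bit repository
        url = base_url + uuid.uuid4().hex   # 2. generate a unique URL
        doms["files"][name] = {"url": url}  # 3. create the file object
    # Only after all files are done is the event added to the batch object.
    doms["events"].setdefault(batch_id, []).append("Bitrepository Ingest")


bit_repository = []
doms = {"files": {}, "events": {}}
ingest_batch(
    "B1",
    ["page-0001.jp2", "page-0001.xml", "page-0002.jp2"],  # hypothetical names
    bit_repository,
    doms,
)
urls = [file_object["url"] for file_object in doms["files"].values()]
```

Adding the event last means a batch that fails halfway through never looks ingested to the robots that poll for "Bitrepository Ingest".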

The next robot is the "Autonomous Doms Ingester". It polls for the "Bitrepository Ingest" event, so it will always run on batches after they have been ingested into the bit repository. It will create the metadata structure (batch->reel->newspaper->page) in DOMS with all the supplied metadata. When this task is done, we have no further need of the data in the Scratch storage, as it should all have been ingested into our preservation platform. Finally, it will add the "Metadata Ingest" event to the batch object. What is the story with content models? Does metadata-ingest include content-model validation? -CSR

Robots can occupy the same location on the assembly line. Here we have the first example of this. The "Autonomous JPylyzer" is a robot that, like the "Autonomous Doms Ingester", polls for "Bitrepository Ingest" events. The task of this robot is to run jpylyzer on the jpeg2000 files in the batch. The task will be done as a Hadoop job. This assumes that the ABI ensures that the jp2 files are in HDFS, or that the Autonomous Jpylyzer can bring them in if necessary. -CSR

  • As the map step, run JPylyzer on each jpeg2000 file
  • As the reduce step, add the output of this process to the file object in DOMS

Finally, it will add the "JPylyzed" event to the batch object. 
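The map and reduce steps can be illustrated in plain Python, without Hadoop. This is only a sketch of the data flow: `fake_jpylyzer` is a hypothetical placeholder that returns a made-up result instead of invoking the real tool, and the file objects are plain dictionaries.

```python
def map_step(path, analyse):
    """Map: run the analyser on one file and emit (path, output)."""
    return path, analyse(path)


def reduce_step(pairs, file_objects):
    """Reduce: attach each output to the matching file object."""
    for path, output in pairs:
        file_objects[path]["jpylyzer"] = output
    return file_objects


def fake_jpylyzer(path):
    # Hypothetical placeholder for running jpylyzer on the file.
    return {"file": path, "valid": True}


# Hypothetical file objects, keyed by file name.
file_objects = {"a.jp2": {}, "b.jp2": {}}

pairs = [map_step(path, fake_jpylyzer) for path in file_objects]
reduce_step(pairs, file_objects)
batch_events = ["JPylyzed"]  # added once every file has its output
```

Each map call is independent of the others, which is what makes the per-file analysis a natural fit for a distributed job.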


System overview


Automatic Quality checks

After the ingest and metadata creation steps, the automatic quality checks can begin. There is not really a technical distinction between the two phases, but the kind of tasks being completed are somewhat different (And the diagram will get VERY complex).

We have two robots that might end up working concurrently. The first is the "Autonomous Batch Structure Checker". This robot might in fact be a set of robots (TO BE DECIDED), but for now we can think of it as a single step. It polls for "Metadata Ingest" events in the SBOI (Batch Status API -CSR). It will perform a series of checks of the metadata in DOMS. When done, it will add the "Batch Structure Checked" event to the Batch object.

The other robot is the "Autonomous JPylyzer checker". As the name suggests, it will poll for the "JPylyzed" event. The task is, for each file in the batch, to validate the jpylyzer data against a specified profile. When done, it will add a "JPylyzer Checked" event to the batch object.

The final robot in this phase is the "Autonomous Automatic QA Batch Approver". The work of this robot is not hard. It polls for batches that have BOTH the events "JPylyzer Checked" and "Batch Structure Checked". For these batches, it sets the "Automatic QA Complete" event, to mark the batch as having passed the automatic quality checks.
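The robots described so far, with the events they wait for and the event they add, can be written down as a small table. The sketch below uses the robot and event names from the text; the `runnable` and `run_line` helpers are hypothetical and simply walk a batch along the line in memory.

```python
# robot name -> (events it polls for, event it adds when done)
ROBOTS = {
    "Autonomous Bitrepository Ingester":
        ({"Batch Object Creation"}, "Bitrepository Ingest"),
    "Autonomous Doms Ingester":
        ({"Bitrepository Ingest"}, "Metadata Ingest"),
    "Autonomous JPylyzer":
        ({"Bitrepository Ingest"}, "JPylyzed"),
    "Autonomous Batch Structure Checker":
        ({"Metadata Ingest"}, "Batch Structure Checked"),
    "Autonomous JPylyzer checker":
        ({"JPylyzed"}, "JPylyzer Checked"),
    "Autonomous Automatic QA Batch Approver":
        ({"JPylyzer Checked", "Batch Structure Checked"},
         "Automatic QA Complete"),
}


def runnable(events):
    """Robots whose required events are all present and whose own event is not."""
    return sorted(
        name for name, (needs, gives) in ROBOTS.items()
        if needs <= events and gives not in events
    )


def run_line(events):
    """Let every runnable robot add its event until nothing is left to do."""
    events = set(events)
    order = []
    while True:
        ready = runnable(events)
        if not ready:
            return order, events
        for name in ready:
            events.add(ROBOTS[name][1])
            order.append(name)


side_by_side = runnable({"Batch Object Creation", "Bitrepository Ingest"})
order, final_events = run_line({"Batch Object Creation"})
```

`side_by_side` contains both the Doms Ingester and the JPylyzer, the two robots that share a spot on the line, and the approver only appears in `order` once both checkers have added their events.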

Metadata Processing

Failing Batches

In the above sections, we did not handle the case of failing batches. If a batch cannot be validated, Ninestars should be notified, and they should resubmit the batch. However, and this is important, the automatic system cannot request a batch to be resubmitted without human interaction.

If an autonomous component fails to complete its task on a batch, it should note this in the event it sends to DOMS. So, if the "Autonomous Batch Structure Checker" finds that there are several violations of the requirements in the metadata, it should still mark the batch with the event "Batch Structure Checked", but it should include the violations it found in this event.
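A sketch of how a checker might report failure inside the event rather than by withholding it. The event layout (a dictionary with `success` and `violations`) and the `has_date` rule are assumptions for illustration only.

```python
def check_batch_structure(batch_id, pages, checks):
    """Run every check on every page and always return the event."""
    violations = []
    for page in pages:
        for check in checks:
            problem = check(page)
            if problem:
                violations.append(problem)
    return {
        "event": "Batch Structure Checked",  # recorded even on failure
        "batch": batch_id,
        "success": not violations,
        "violations": violations,            # reviewed later in Manual QA
    }


def has_date(page):
    # Hypothetical rule: every page must carry a date.
    if "date" not in page:
        return "page %s has no date" % page["id"]
    return None


event = check_batch_structure(
    "B1",
    [{"id": "p1", "date": "1921-05-01"}, {"id": "p2"}],  # made-up pages
    [has_date],
)
```

The batch still moves along the line, and the collected violations travel with it until a human decides whether the batch or the check is at fault.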

When the batch then reaches the Manual QA phase, the human evaluator will have to review the errors collected. If he finds that the error is in the batch, not in the process checking the batch, he will then mark the batch as failed and notify Ninestars. This notification will include information about the errors found. If the evaluator finds the error to be in the check, not the batch, he will notify the systems department, which should then fix the check. He will then mark the batch as failed, and request that the same copy be resubmitted.

Manual QA


Accepting Batches

When a batch is accepted, a little bit of cleanup sometimes needs to be done.

Since we do not delete a batch when failing it, there could be a number of failed versions of the batch in the system. These should now be deleted. Something like this process is followed:

  • Query the bit repository for any files pertaining to previous versions of the batch
  • If any are found, request the delete key
  • Delete the files
  • Query DOMS for any previous run-number objects. If any are found, purge them and their subtree.
  • Need to clarify what metadata we want to keep from failed batches, since it could be very useful for comparison with the new version. -CSR
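The cleanup steps could be sketched as below, using dictionaries in place of the bit repository and DOMS. The data layout, the function names and the way the delete key is requested are all hypothetical. This sketch purges everything from earlier runs, so it would need to change if some metadata from failed batches is to be kept.

```python
def clean_up(batch_id, accepted_run, bit_repository, doms, request_delete_key):
    """Remove files and DOMS objects left over from earlier runs of a batch."""
    # 1. Query the bit repository for files from previous versions.
    old_files = [
        file_id for file_id, (owner, run) in bit_repository.items()
        if owner == batch_id and run != accepted_run
    ]
    # 2. Only ask for the delete key when there is something to delete.
    if old_files:
        key = request_delete_key()
        # 3. Delete the files (a real repository would require `key` here).
        for file_id in old_files:
            del bit_repository[file_id]
    # 4. Purge previous run-number objects and their subtrees from DOMS.
    old_runs = [
        (owner, run) for (owner, run) in doms
        if owner == batch_id and run != accepted_run
    ]
    for run_object in old_runs:
        del doms[run_object]
    return len(old_files), len(old_runs)


# Hypothetical contents: file id -> (batch, run) and (batch, run) -> subtree.
bit_repository = {"f1": ("B1", 1), "f2": ("B1", 2), "f3": ("B2", 1)}
doms = {("B1", 1): ["reel", "page"], ("B1", 2): ["reel", "page"],
        ("B2", 1): ["reel"]}
keys_requested = []


def request_delete_key():
    keys_requested.append("key")
    return "key"


removed = clean_up("B1", 2, bit_repository, doms, request_delete_key)
```

After the call, only run 2 of batch B1 remains, and batch B2 is untouched.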

