Comment: Migrated to Confluence 5.3
Description of the platform used for ingesting and storing the digitised newspapers

The system for receiving and ingesting the digitised newspapers from Ninestars is detailed below. The newspapers will be received in batches containing thousands of files (see the Batch Description).

The system can be understood as a series of steps that must be carried out for each batch. Each batch will be either accepted or rejected by the steps of the validation process. These steps involve both human checking and automated systems. A rejected batch should be fixed and resubmitted by Ninestars or (in case of false rejection) the validation process should be updated and rerun, until the batch can be accepted. If a batch is neither accepted nor rejected within a given timeframe, then Ninestars are not obliged to fix the errors.

Throughout the process, the state of a batch can be tracked through the Surveillance Interface.

What matters is not that there is a surveillance interface but that there is a monitor component that records a persistent state for each batch. - CSR

Add something like "The state of each batch is stored in DOMS and accessed through an API which queries a caching layer (e.g. a Lucene index of batch objects in DOMS). So there will be some latency between updates to DOMS batch objects and the result of API queries."

Excerpt Include: Newspaper Digitisation Process Monitor

Each step in the process is handled by an autonomous component. Are these what are commonly called Autonomous Agents? - CSR

Include Page: Autonomous Components

Ingest and Metadata creation

The first robot on the assembly line is not really a robot. It is the company called Ninestars. They:

  • Digitise a batch of newspaper microfilm
  • Upload the batch to our servers (by rsync)
  • Notify us that the receiving process can begin for this batch (the "Batch Object Creation" event)

There are some minor questions of detail here about who creates the initial object (e.g. a record in state "NEW" in a database). Is it:

  • Us, when we send the batch out to Ninestars
  • Ninestars, after they have successfully uploaded the files to us
  • Us, after we have received a message (in some form ...) from Ninestars that the files have been uploaded

Any of these could be made to work. - CSR

The first real robot is the "Autonomous Bitrepository Ingester". It polls for "Batch Object Creation" events, so it will receive batches right after Ninestars have uploaded them. For a batch, it will iterate over the jpeg2000 files and for each:

  • Ingest it into the bit repository
  • Generate a unique URL
  • Create a file object with this URL in DOMS

Finally, it will add the "Bitrepository Ingest" event to the batch object.
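The steps of the "Autonomous Bitrepository Ingester" can be sketched as follows. This is only an illustration: the classes `InMemoryBitRepository` and `InMemoryDoms`, their methods, and the `bitrepo://` URL scheme are invented stand-ins, not the real bit repository or DOMS interfaces, and polling is reduced to a simple scan over a list of batches.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Batch:
    batch_id: str
    files: list                                   # paths of the jpeg2000 files
    events: list = field(default_factory=list)    # event names recorded so far


class InMemoryBitRepository:
    """Hypothetical stand-in for the bit repository client."""

    def __init__(self):
        self.stored = {}

    def ingest(self, path):
        # Store the file and hand back a unique URL (made-up scheme).
        url = f"bitrepo://newspapers/{uuid.uuid4()}"
        self.stored[url] = path
        return url


class InMemoryDoms:
    """Hypothetical stand-in for DOMS; it only records file objects."""

    def __init__(self):
        self.file_objects = {}

    def create_file_object(self, batch_id, path, url):
        self.file_objects[(batch_id, path)] = url


def run_bitrepository_ingester(batches, bitrepo, doms):
    """Ingest every batch that has been created but not yet ingested."""
    for batch in batches:
        if "Batch Object Creation" not in batch.events:
            continue
        if "Bitrepository Ingest" in batch.events:
            continue  # already handled on an earlier poll
        for path in batch.files:
            url = bitrepo.ingest(path)
            doms.create_file_object(batch.batch_id, path, url)
        batch.events.append("Bitrepository Ingest")
```

Checking for the robot's own event before starting keeps a repeated poll from ingesting the same batch twice.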

The next robot is the "Autonomous Doms Ingester". It polls for the "Bitrepository Ingest" event, so it will always run on batches after they have been ingested into the bit repository. It will create the metadata structure (batch->reel->newspaper->page) in DOMS with all the supplied metadata. When this task is done, we have no further need of the data in the Scratch storage, as it should all have been ingested into our preservation platform. Finally, it will add the "Metadata Ingest" event to the batch object. What is the story with content models? Does metadata-ingest include content-model validation? - CSR

Robots can occupy the same location on the assembly line. Here we have the first example of this. The "Autonomous JPylyzer" is a robot that, like the "Autonomous Doms Ingester", polls for "Bitrepository Ingest" events. The task of this robot is to run jpylyzer on the jpeg2000 files in the batch. The task will be done as a Hadoop job. This assumes that the ABI (the "Autonomous Bitrepository Ingester") ensures that the jp2 files are in HDFS, or that the Autonomous JPylyzer can bring them in if necessary. - CSR

  • As the map step, run JPylyzer on each jpeg2000 file
  • As the reduce step, add the output of this process to the file object in DOMS

Finally, it will add the "JPylyzed" event to the batch object. 
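The map and reduce steps above can be sketched in plain Python. This does not use the Hadoop API and does not invoke the real jpylyzer tool: `fake_jpylyzer` is a placeholder that returns a made-up report, and DOMS file objects are represented by ordinary dictionaries.

```python
def fake_jpylyzer(path):
    """Placeholder for running jpylyzer; returns a made-up report string."""
    return f"<report><file>{path}</file></report>"


def map_step(path, analyse=fake_jpylyzer):
    # Map: one jpeg2000 file in, one (file, report) pair out.
    return path, analyse(path)


def reduce_step(pairs, doms_file_objects):
    # Reduce: attach each report to the matching file object.
    for path, report in pairs:
        doms_file_objects[path]["jpylyzer"] = report


def run_jpylyzer_job(batch_files, doms_file_objects, batch_events):
    """Run both steps for one batch, then record the "JPylyzed" event."""
    pairs = [map_step(path) for path in batch_files]
    reduce_step(pairs, doms_file_objects)
    batch_events.append("JPylyzed")
```

Because each file is analysed independently in the map step, the work can be spread over as many nodes as are available.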


Gliffy Diagram: System overview


Automatic Quality checks

After the ingest and metadata creation steps, the automatic quality checks can begin. There is not really a technical distinction between the two phases, but the kinds of tasks being completed are somewhat different (and a single diagram covering both would get very complex).

We have two robots that might end up working concurrently. The first is the "Autonomous Batch Structure Checker". This robot might in fact be a set of robots (TO BE DECIDED), but for now we can think of it as a single step. It polls for "Metadata Ingest" events in the SBOI (Batch Status API - CSR). It will perform a series of checks of the metadata in DOMS. When done, it will add the "Batch Structure Checked" event to the batch object.

The other robot is the "Autonomous JPylyzer checker". As the name suggests, it will poll for the "JPylyzed" event. The task is, for each file in the batch, to validate the jpylyzer data against a specified profile. When done, it will add a "JPylyzer Checked" event to the batch object.
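Validation against a profile could look something like the sketch below. It assumes the jpylyzer output for a file has already been flattened into a dictionary and that the profile is a dictionary of required values. Both the representation and the property names in the usage example are invented for illustration.

```python
def validate_against_profile(properties, profile):
    """Return a list of mismatches between jpylyzer data and the profile.

    properties: flat dict of values taken from one file's jpylyzer output
    profile:    dict holding the required value for each checked key
    """
    problems = []
    for key, expected in sorted(profile.items()):
        actual = properties.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems
```

For example, with the made-up profile `{"isValid": True, "layers": 1}`, a file whose data is `{"isValid": True, "layers": 1}` yields an empty list, while any differing or missing value yields one entry per mismatch.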

The final robot in this phase is the "Autonomous Automatic QA Batch Approver". The work of this robot is not hard. It polls for batches that have BOTH the events "JPylyzer Checked" and "Batch Structure Checked". For these batches, it sets the "Automatic QA Complete" event, to mark the batch as having passed the automatic quality checks.
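A minimal sketch of this rule, assuming the batch state is available as a mapping from batch id to the list of event names recorded so far (the function and variable names are illustrative, not part of any existing API):

```python
REQUIRED_EVENTS = {"JPylyzer Checked", "Batch Structure Checked"}
COMPLETE_EVENT = "Automatic QA Complete"


def approve_ready_batches(batch_events):
    """Add the completion event to every batch that has both required events.

    batch_events maps a batch id to the list of events recorded for it.
    Returns the ids of the batches approved in this pass.
    """
    approved = []
    for batch_id, events in batch_events.items():
        if COMPLETE_EVENT in events:
            continue  # already approved on an earlier poll
        if REQUIRED_EVENTS.issubset(events):
            events.append(COMPLETE_EVENT)
            approved.append(batch_id)
    return approved
```

A batch with only one of the two events is simply skipped and picked up on a later poll, so the order in which the two checkers finish does not matter.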

Gliffy Diagram: Metadata Processing

Failing Batches

In the above sections, we did not handle the case of failing batches. If a batch cannot be validated, Ninestars should be notified, and they should resubmit the batch. However, and this is important, the automatic system cannot request a batch to be resubmitted without human interaction.

If an autonomous component fails to complete its task on a batch, it should note this in the event it sends to DOMS. So, if the "Autonomous Batch Structure Checker" finds that there are several violations of the requirements in the metadata, it should still mark the batch with the event "Batch Structure Checked", but it should include the violations it found in this event.
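One way to picture such an event is as a record carrying an outcome flag and a list of details. The sketch below is an assumption about shape only: the `Event` fields and the single rule checked (every page needs a non-empty `date`) are invented for illustration and are not taken from the actual requirements.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    success: bool
    details: list = field(default_factory=list)   # violations found, if any


def check_batch_structure(pages, batch_events):
    """Record a "Batch Structure Checked" event whether or not checks pass.

    pages maps a page id to its metadata dict.
    """
    violations = [
        f"{page_id}: missing date"
        for page_id, metadata in sorted(pages.items())
        if not metadata.get("date")
    ]
    event = Event("Batch Structure Checked",
                  success=not violations,
                  details=violations)
    batch_events.append(event)
    return event
```

The event is added in both cases, so the batch keeps moving along the assembly line and the collected violations are available to whoever reviews it later.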

When the batch then reaches the Manual QA phase, the human evaluator will have to review the errors collected. If he finds that the error is in the batch, not in the process checking the batch, he will then mark the batch as failed and notify Ninestars. This notification will include information about the errors found. If the evaluator finds the error to be in the check, not the batch, he will notify the systems department, which should then fix the check. He will then mark the batch as failed, and request that the same copy be resubmitted.

Manual QA


Accepting Batches

When a batch is accepted, a little bit of cleanup sometimes needs to be done.

Since we do not delete a batch when failing it, there could be a number of failed versions of the batch in the system. These should now be deleted. Something like the following process is used:

  • Query the bit repository for any files pertaining to previous versions of the batch
  • If any are found, request the delete key
  • Delete the files
  • Query DOMS for any previous run-number objects. If any are found, purge them and their subtrees.
  • Need to clarify what metadata we want to keep from failed batches, since it could be very useful for comparison with the new version. -CSR
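The cleanup steps above can be sketched as below. `FakeBitRepository` and its methods are hypothetical stand-ins rather than the real bit repository API, and the sketch assumes that "previous versions" means runs with a lower run number than the accepted one.

```python
class FakeBitRepository:
    """Hypothetical stand-in for the bit repository; not its real API."""

    def __init__(self, files):
        self.files = dict(files)      # url -> (batch_id, run_number)
        self.keys_issued = 0

    def find_files(self, batch_id, before_run):
        return [url for url, (b, run) in self.files.items()
                if b == batch_id and run < before_run]

    def request_delete_key(self):
        self.keys_issued += 1
        return f"delete-key-{self.keys_issued}"

    def delete(self, url, delete_key):
        if not delete_key:
            raise PermissionError("a delete key is required")
        del self.files[url]


def clean_up_failed_versions(batch_id, accepted_run, bitrepo, doms_runs):
    """Delete files and DOMS run objects left over from earlier runs.

    doms_runs maps (batch_id, run_number) to that run's metadata subtree.
    Returns the number of files and run objects removed.
    """
    old_files = bitrepo.find_files(batch_id, before_run=accepted_run)
    if old_files:
        # The delete key is requested only when there is something to delete.
        delete_key = bitrepo.request_delete_key()
        for url in old_files:
            bitrepo.delete(url, delete_key)
    old_runs = [key for key in doms_runs
                if key[0] == batch_id and key[1] < accepted_run]
    for key in old_runs:
        del doms_runs[key]   # drops the run object together with its subtree
    return len(old_files), len(old_runs)
```

This version purges everything from earlier runs. If some metadata from failed batches is to be kept for comparison, as raised in the last point above, the DOMS part would need to become selective.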
