dk.netarkivet.harvester.harvesting
Class HarvestDocumentation

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HarvestDocumentation

public class HarvestDocumentation
extends java.lang.Object

This class contains code for documenting a harvest. Metadata is read from the directories associated with a given harvest-job-attempt (i.e. one DoCrawlMessage sent to a harvest server). The collected metadata are written to a new metadata file that is managed by IngestableFiles. Temporary metadata files will be deleted after this metadata file has been written.


Constructor Summary
HarvestDocumentation()
           
 
Method Summary
static void documentHarvest(IngestableFiles ingestables)
          Documents the harvest found under the given directory by packaging its metadata into an ARC file placed in a 'metadata' directory under the current directory.
static java.net.URI getAlternateCDXURI(long jobID, java.lang.String filename)
          Generates a URI identifying CDX info for one harvested ARC file.
static java.net.URI getCDXURI(java.lang.String harvestID, java.lang.String jobID, java.lang.String filename)
          Generates a URI identifying CDX info for one harvested (W)ARC file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HarvestDocumentation

public HarvestDocumentation()
Method Detail

documentHarvest

public static void documentHarvest(IngestableFiles ingestables)
                            throws IOFailure
Documents the harvest found under the given directory by packaging its metadata into an ARC file placed in a 'metadata' directory under the current directory. Only files belonging to the given jobID are documented; the rest are moved to oldjobs. In the current implementation, the documentation consists of CDX indices over all ARC files (with one CDX record per harvested ARC file), plus packaging of log files. If this method finishes without an exception, the metadata is guaranteed to be ready for upload.
TODO: Place preharvestmetadata in IngestableFiles-defined area.
TODO: This method may be a good place to copy deduplication information from the crawl log to the CDX file.

Parameters:
ingestables - Information about the finished crawl (crawldir, jobId, harvestID).
Throws:
ArgumentNotValid - if crawlDir is null or does not exist, or if jobID or harvestID is negative.
IOFailure - if reading ARC files or temporary files fails, or if writing a file to arcFilesDir fails.
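
The rule that only files belonging to the given jobID are documented, while the rest are moved to oldjobs, can be illustrated with a minimal sketch. It is not the NetarchiveSuite implementation: the class JobFileSorter, the method keepFilesForJob, and the assumption that a file belongs to a job when its name starts with "<jobID>-" are all hypothetical, and UncheckedIOException stands in for IOFailure.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NetarchiveSuite.
class JobFileSorter {

    /**
     * Returns the regular files in crawlDir whose names start with "<jobID>-"
     * (an assumed naming convention) and moves every other regular file
     * into crawlDir/oldjobs.
     */
    static List<Path> keepFilesForJob(Path crawlDir, long jobID) {
        String prefix = jobID + "-";
        List<Path> kept = new ArrayList<>();
        List<Path> foreign = new ArrayList<>();
        try {
            // First pass: classify files without modifying the directory.
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(crawlDir)) {
                for (Path entry : entries) {
                    if (!Files.isRegularFile(entry)) {
                        continue; // skip subdirectories such as oldjobs itself
                    }
                    if (entry.getFileName().toString().startsWith(prefix)) {
                        kept.add(entry);
                    } else {
                        foreign.add(entry);
                    }
                }
            }
            // Second pass: move files from other jobs out of the way.
            Path oldJobs = Files.createDirectories(crawlDir.resolve("oldjobs"));
            for (Path file : foreign) {
                Files.move(file, oldJobs.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        } catch (IOException e) {
            // Stand-in for IOFailure: report the failure as an unchecked exception.
            throw new UncheckedIOException("Could not sort files in " + crawlDir, e);
        }
        return kept;
    }
}
```

Classifying first and moving afterwards keeps the directory unchanged while it is being iterated, so a failure during listing leaves the crawl directory untouched.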

getCDXURI

public static java.net.URI getCDXURI(java.lang.String harvestID,
                                     java.lang.String jobID,
                                     java.lang.String filename)
                              throws ArgumentNotValid,
                                     UnknownID
Generates a URI identifying CDX info for one harvested (W)ARC file. In Netarkivet, all of the parameters below are in the (W)ARC file's name.

Parameters:
harvestID - The number of the harvest that generated the (W)ARC file.
jobID - The number of the job that generated the (W)ARC file.
filename - The name of the ARC or WARC file behind the cdx-data
Returns:
A URI in the proprietary scheme "metadata".
Throws:
ArgumentNotValid - if any parameter is null.
UnknownID - if the URI cannot be constructed.
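
To make the contract above concrete, here is a minimal sketch of building a URI in a "metadata" scheme with java.net.URI. It does not reproduce the actual URI layout used by getCDXURI, which this page does not specify: the class CdxUriSketch, the method buildCdxUri, the host "example.org", the path "/cdx", and the query parameter names are all assumptions, and the standard exceptions stand in for ArgumentNotValid and UnknownID.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration, not the NetarchiveSuite implementation.
class CdxUriSketch {
    private static final String SCHEME = "metadata";
    private static final String HOST = "example.org"; // assumed authority
    private static final String PATH = "/cdx";        // assumed path

    static URI buildCdxUri(String harvestID, String jobID, String filename) {
        // Mirrors the documented null check (ArgumentNotValid in the real class).
        if (harvestID == null || jobID == null || filename == null) {
            throw new IllegalArgumentException("No parameter may be null");
        }
        // Assumed query layout; the real parameter names may differ.
        String query = "harvestid=" + harvestID
                + "&jobid=" + jobID
                + "&filename=" + filename;
        try {
            // The multi-argument constructor quotes illegal characters itself.
            return new URI(SCHEME, null, HOST, -1, PATH, query, null);
        } catch (URISyntaxException e) {
            // Mirrors the documented failure mode (UnknownID in the real class).
            throw new IllegalStateException("Could not build CDX URI", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildCdxUri("7", "42", "42-7-sample.arc"));
    }
}
```

Converting the checked URISyntaxException into an unchecked exception matches the documented signature, where callers see only ArgumentNotValid and UnknownID.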

getAlternateCDXURI

public static java.net.URI getAlternateCDXURI(long jobID,
                                              java.lang.String filename)
                                       throws ArgumentNotValid,
                                              UnknownID
Generates a URI identifying CDX info for one harvested ARC file.

Parameters:
jobID - The number of the job that generated the ARC file.
filename - The name of the ARC file behind the CDX data.
Returns:
A URI in the proprietary scheme "metadata".
Throws:
ArgumentNotValid - if filename is null (jobID is a primitive long and cannot be null).
UnknownID - if the URI cannot be constructed.