dk.netarkivet.harvester.harvesting
Class HarvestDocumentation

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HarvestDocumentation

public class HarvestDocumentation
extends java.lang.Object

This class contains code for documenting a harvest. Metadata is read from the directories associated with a given harvest-job-attempt (i.e. one DoCrawlMessage sent to a harvest server). The collected metadata are written to a new ARC file that is managed by IngestableFiles. Temporary metadata files will be deleted after this metadata-ARC file has been written.


Field Summary
static java.util.regex.Pattern metadataFilenamePattern
           
 
Constructor Summary
HarvestDocumentation()
           
 
Method Summary
static void documentHarvest(java.io.File crawlDir, long jobID, long harvestID)
          Documents the harvest under the given dir in a packaged metadata arc file in a directory 'metadata' under the current dir.
static java.util.List<java.io.File> documentOldJob(java.io.File crawlDir, long jobID, long harvestID)
          Document an old job from an oldjobs directory on the harvesters.
static java.net.URI getCDXURI(java.lang.String harvestID, java.lang.String jobID, java.lang.String timeStamp, java.lang.String serialNumber)
          Generates a URI identifying CDX info for one harvested ARC file.
static java.lang.String getMetadataARCFileName(java.lang.String jobID)
          Generates a name for an ARC file containing metadata regarding a given job.
static java.lang.String getPreharvestMetadataARCFileName(long jobID)
          Generates a name for an ARC file containing "preharvest" metadata regarding a given job (e.g. excluded aliases).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

metadataFilenamePattern

public static final java.util.regex.Pattern metadataFilenamePattern

Constructor Detail

HarvestDocumentation

public HarvestDocumentation()

Method Detail

documentHarvest

public static void documentHarvest(java.io.File crawlDir,
                                   long jobID,
                                   long harvestID)
                            throws IOFailure
Documents the harvest under the given directory in a packaged metadata ARC file, placed in a directory 'metadata' under the current directory. Only the files belonging to the given jobID are documented; the rest are moved to oldjobs. In the current implementation, the documentation consists of CDX indices over all ARC files (with one CDX record per harvested ARC file), plus packaging of log files. If this method finishes without an exception, it is guaranteed that metadata is ready for upload.

Parameters:
crawlDir - The directory the crawl was performed in
jobID - the ID of the job for this harvest
harvestID - the ID of the harvestdefinition which created this job.
Throws:
ArgumentNotValid - if crawlDir is null or does not exist, or if jobID or harvestID is negative.
IOFailure - if reading ARC files or temporary files fails, or if writing a file to arcFilesDir fails.
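
The argument checks listed under Throws can be sketched as below. This is a minimal illustration, not the actual implementation: the class and method names are hypothetical, and IllegalArgumentException stands in for ArgumentNotValid so that the sketch compiles without the NetarchiveSuite classes.

```java
import java.io.File;

class DocumentHarvestArgsSketch {

    /**
     * Mirrors the documented ArgumentNotValid conditions of documentHarvest.
     * IllegalArgumentException is a stand-in for ArgumentNotValid (assumption).
     */
    static void checkArguments(File crawlDir, long jobID, long harvestID) {
        if (crawlDir == null) {
            throw new IllegalArgumentException("crawlDir must not be null");
        }
        if (!crawlDir.exists()) {
            throw new IllegalArgumentException("crawlDir does not exist: " + crawlDir);
        }
        if (jobID < 0) {
            throw new IllegalArgumentException("jobID must not be negative: " + jobID);
        }
        if (harvestID < 0) {
            throw new IllegalArgumentException("harvestID must not be negative: " + harvestID);
        }
    }

    public static void main(String[] args) {
        File crawlDir = new File(".");      // the current directory always exists
        checkArguments(crawlDir, 42L, 7L);  // valid arguments pass silently
        try {
            checkArguments(crawlDir, -1L, 7L);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A caller would typically perform, or rely on, checks of this kind before treating a normal return from documentHarvest as a signal that the metadata is ready for upload.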

getCDXURI

public static java.net.URI getCDXURI(java.lang.String harvestID,
                                     java.lang.String jobID,
                                     java.lang.String timeStamp,
                                     java.lang.String serialNumber)
                              throws ArgumentNotValid,
                                     UnknownID
Generates a URI identifying CDX info for one harvested ARC file. In Netarkivet, all of the parameters below are in the ARC file's name.

Parameters:
harvestID - The number of the harvest that generated the ARC file.
jobID - The number of the job that generated the ARC file.
timeStamp - The timestamp in the name of the ARC file.
serialNumber - The serial no. in the name of the ARC file.
Returns:
A URI in the proprietary scheme "metadata".
Throws:
ArgumentNotValid - if any parameter is null.
UnknownID - if something goes terribly wrong in our URI construction.
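
As an illustration of how such a URI might be assembled with java.net.URI, consider the sketch below. Only the scheme name "metadata" and the four parameters come from this page; the authority "example.invalid", the path "/cdx", and the query parameter names are placeholders chosen for the example, and the class name is hypothetical. Standard exceptions stand in for ArgumentNotValid and UnknownID.

```java
import java.net.URI;
import java.net.URISyntaxException;

class CdxUriSketch {

    /**
     * Builds a URI in the "metadata" scheme from the four values found in
     * an ARC file's name. Authority, path and query keys are placeholders
     * (assumptions), not the layout used by the real getCDXURI.
     */
    static URI cdxUri(String harvestID, String jobID,
                      String timeStamp, String serialNumber) {
        if (harvestID == null || jobID == null
                || timeStamp == null || serialNumber == null) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("no parameter may be null");
        }
        String query = "harvestid=" + harvestID
                + "&jobid=" + jobID
                + "&timestamp=" + timeStamp
                + "&serialno=" + serialNumber;
        try {
            return new URI("metadata", "example.invalid", "/cdx", query, null);
        } catch (URISyntaxException e) {
            // Stand-in for UnknownID.
            throw new IllegalStateException("could not construct URI", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cdxUri("7", "42", "20080101120000", "00001"));
    }
}
```

Wrapping the checked URISyntaxException in an unchecked exception matches the documented contract, where a failure in URI construction surfaces as UnknownID rather than as a checked exception.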

getPreharvestMetadataARCFileName

public static java.lang.String getPreharvestMetadataARCFileName(long jobID)
                                                         throws ArgumentNotValid
Generates a name for an ARC file containing "preharvest" metadata regarding a given job (e.g. excluded aliases).

Parameters:
jobID - the number of the harvester job
Returns:
The file name to use for the preharvest metadata, as a String.
Throws:
ArgumentNotValid - if jobID is negative.

documentOldJob

public static java.util.List<java.io.File> documentOldJob(java.io.File crawlDir,
                                                          long jobID,
                                                          long harvestID)
Document an old job from an oldjobs directory on the harvesters. Generates a file named -metadata-2.arc. Note: This method sets "heritrixVersion" to the heritrix-version written in the user-agent. This version may not be correct!

Parameters:
crawlDir - the given crawlDir
jobID - the given job-identifier
harvestID - the given harvest-identifier
Returns:
a list of files added to the arcfile

getMetadataARCFileName

public static java.lang.String getMetadataARCFileName(java.lang.String jobID)
                                               throws ArgumentNotValid
Generates a name for an ARC file containing metadata regarding a given job.

Parameters:
jobID - The number of the job that generated the ARC file.
Returns:
A "flat" file name (i.e. no path) containing the jobID parameter and ending in "-metadata-N.arc", where N is the serial number of the metadata file for this job, e.g. "42-metadata-1.arc". Currently, only one file is ever made.
Throws:
ArgumentNotValid - if jobID is null.
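
The naming rule described under Returns can be sketched as follows. The class name, method name and constant are hypothetical; the serial number is fixed at 1 because the text above says only one file is currently made, and IllegalArgumentException stands in for ArgumentNotValid.

```java
class MetadataNameSketch {

    /** Serial number of the metadata file; fixed at 1 in this sketch. */
    static final int SERIAL_NUMBER = 1;

    /**
     * Returns a flat file name of the form "<jobID>-metadata-N.arc".
     * IllegalArgumentException is a stand-in for ArgumentNotValid (assumption).
     */
    static String metadataArcFileName(String jobID) {
        if (jobID == null) {
            throw new IllegalArgumentException("jobID must not be null");
        }
        return jobID + "-metadata-" + SERIAL_NUMBER + ".arc";
    }

    public static void main(String[] args) {
        System.out.println(metadataArcFileName("42")); // prints 42-metadata-1.arc
    }
}
```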