Class HarvestDocumentation
- java.lang.Object
  - dk.netarkivet.harvester.heritrix3.HarvestDocumentation
public class HarvestDocumentation extends java.lang.Object
This class contains code for documenting an H3 harvest. Metadata is read from the directories associated with a given harvest-job-attempt (i.e. one DoCrawlMessage sent to a harvest server). The collected metadata is written to a new metadata file that is managed by IngestableFiles. Temporary metadata files are deleted after this metadata file has been written.
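The ordering described above (write the combined metadata file first, delete the temporary files only afterwards) can be illustrated with a minimal sketch. This is not the NetarchiveSuite implementation: the class `MetadataPackagingSketch`, its methods, and the file names are hypothetical, and plain byte concatenation stands in for the real packaging format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of dk.netarkivet. */
public class MetadataPackagingSketch {

    /**
     * Appends every temporary file to the target file, then deletes the
     * temporary files. Deletion starts only after all writes have succeeded,
     * so a failed write leaves the temporary files in place.
     */
    public static void packageAndCleanUp(List<Path> tempFiles, Path target) {
        try {
            // Create the target, or truncate it if it already exists.
            Files.write(target, new byte[0]);
            for (Path temp : tempFiles) {
                Files.write(target, Files.readAllBytes(temp), StandardOpenOption.APPEND);
            }
            // Reached only if every write above completed without an exception.
            for (Path temp : tempFiles) {
                Files.delete(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the sketch on two small temporary files and reports the result. */
    public static String demo() {
        try {
            Path dir = Files.createTempDirectory("harvest-sketch");
            Path a = Files.write(dir.resolve("a.tmp"), "crawl log\n".getBytes(StandardCharsets.UTF_8));
            Path b = Files.write(dir.resolve("b.tmp"), "seeds\n".getBytes(StandardCharsets.UTF_8));
            Path target = dir.resolve("metadata.txt");
            packageAndCleanUp(Arrays.asList(a, b), target);
            boolean tempsGone = Files.notExists(a) && Files.notExists(b);
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8)
                    + "tempsGone=" + tempsGone;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Deleting only after the last write means that a failure partway through leaves the temporary inputs available for a retry.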
Constructor Summary
Constructor: HarvestDocumentation()
Method Summary
Modifier and Type: static void
Method: documentHarvest(IngestableFiles ingestables)
Description: Documents the harvest under the given dir in a packaged metadata arc file in a directory 'metadata' under the current dir.
Constructor Detail
HarvestDocumentation
public HarvestDocumentation()
Method Detail
documentHarvest
public static void documentHarvest(IngestableFiles ingestables) throws IOFailure
Documents the harvest under the given dir in a packaged metadata arc file in a directory 'metadata' under the current dir. Only the files belonging to the given jobID are documented; the rest are moved to oldjobs. In the current implementation, the documentation consists of CDX indices over all ARC files (with one CDX record per harvested ARC file), plus packaging of log files.
If this method finishes without an exception, it is guaranteed that metadata is ready for upload.
TODO: Place preharvestmetadata in IngestableFiles-defined area.
TODO: This method may be a good place to copy deduplicate information from the crawl log to the cdx file.
- Parameters:
ingestables - Information about the finished crawl (crawldir, jobId, harvestID).
- Throws:
ArgumentNotValid - if crawlDir is null or does not exist, or if jobID or harvestID is negative.
IOFailure - if reading ARC files or temporary files fails, or if writing a file to arcFilesDir fails.
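The argument checks listed under Throws can be sketched as below. This is a minimal illustration under stated assumptions, not the library's code: `HarvestArgsSketch`, `validate`, and `isValid` are hypothetical names, the nested `ArgumentNotValid` is a stand-in for the real exception class, and the `long` types for jobID and harvestID are assumed.

```java
import java.io.File;

/** Hypothetical illustration only; not part of dk.netarkivet. */
public class HarvestArgsSketch {

    /** Stand-in for the real ArgumentNotValid exception (assumed to be unchecked). */
    public static class ArgumentNotValid extends RuntimeException {
        public ArgumentNotValid(String message) {
            super(message);
        }
    }

    /** Applies the three documented preconditions, throwing on the first violation. */
    public static void validate(File crawlDir, long jobID, long harvestID) {
        if (crawlDir == null || !crawlDir.exists()) {
            throw new ArgumentNotValid("crawlDir is null or does not exist: " + crawlDir);
        }
        if (jobID < 0) {
            throw new ArgumentNotValid("jobID is negative: " + jobID);
        }
        if (harvestID < 0) {
            throw new ArgumentNotValid("harvestID is negative: " + harvestID);
        }
    }

    /** Convenience wrapper: true if validate() accepts the arguments. */
    public static boolean isValid(File crawlDir, long jobID, long harvestID) {
        try {
            validate(crawlDir, jobID, harvestID);
            return true;
        } catch (ArgumentNotValid e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(isValid(tmp, 42L, 7L));   // existing dir, non-negative IDs
        System.out.println(isValid(null, 42L, 7L));  // null crawlDir
        System.out.println(isValid(tmp, -1L, 7L));   // negative jobID
    }
}
```

A caller of documentHarvest would typically treat ArgumentNotValid as a programming error and IOFailure as a failure of the job's documentation step; how each is handled is up to the calling code.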