Class HarvestDocumentation


  • public class HarvestDocumentation
    extends java.lang.Object
    This class contains code for documenting an H3 harvest. Metadata is read from the directories associated with a given harvest-job-attempt (i.e. one DoCrawlMessage sent to a harvest server). The collected metadata is written to a new metadata file that is managed by IngestableFiles. Temporary metadata files are deleted after this metadata file has been written.
    • Method Summary

      Modifier and Type Method Description
      static void documentHarvest(IngestableFiles ingestables)
      Documents the harvest under the given crawl directory by packaging its metadata into an ARC file in a 'metadata' directory under the current directory.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • documentHarvest

        public static void documentHarvest(IngestableFiles ingestables)
                                    throws IOFailure
        Documents the harvest under the given crawl directory by packaging its metadata into an ARC file in a 'metadata' directory under the current directory. Only the files belonging to the given jobID are documented; the rest are moved to oldjobs.
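
        The split between files that belong to the job and files that are moved to oldjobs can be pictured with the following minimal sketch. It is not the NetarchiveSuite implementation: the class name JobFilePartitionSketch, the method keepFilesForJob, and the rule that a job's files start with "<jobId>-" are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of NetarchiveSuite.
public class JobFilePartitionSketch {

    /**
     * Keeps the regular files in crawlDir whose names start with "<jobId>-"
     * and moves every other regular file into crawlDir/oldjobs.
     * The "<jobId>-" prefix is an assumed naming convention.
     */
    static List<Path> keepFilesForJob(Path crawlDir, long jobId) throws IOException {
        String prefix = jobId + "-";
        List<Path> kept = new ArrayList<>();
        List<Path> toMove = new ArrayList<>();

        // First pass: classify files without modifying the directory.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(crawlDir)) {
            for (Path entry : entries) {
                if (!Files.isRegularFile(entry)) {
                    continue; // subdirectories (including oldjobs) are left alone
                }
                if (entry.getFileName().toString().startsWith(prefix)) {
                    kept.add(entry);
                } else {
                    toMove.add(entry);
                }
            }
        }

        // Second pass: move files of other jobs out of the way.
        if (!toMove.isEmpty()) {
            Path oldJobs = crawlDir.resolve("oldjobs");
            Files.createDirectories(oldJobs);
            for (Path file : toMove) {
                Files.move(file, oldJobs.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }

        Collections.sort(kept); // deterministic order for callers
        return kept;
    }

    public static void main(String[] args) throws IOException {
        Path crawlDir = Files.createTempDirectory("crawl");
        Files.createFile(crawlDir.resolve("42-example-00001.arc"));
        Files.createFile(crawlDir.resolve("41-example-00001.arc"));

        List<Path> kept = keepFilesForJob(crawlDir, 42L);
        System.out.println("Files kept for job 42: " + kept);
    }
}
```

        Classifying first and moving afterwards avoids changing the directory while it is still being iterated.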

        In the current implementation, the documentation consists of CDX indices over all ARC files (with one CDX record per harvested ARC file), plus packaging of log files.

        If this method finishes without an exception, it is guaranteed that metadata is ready for upload.
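
        A caller can rely on this guarantee by uploading only when the documentation step returns normally. The sketch below shows that pattern in isolation. The class DocumentThenUploadSketch and its method are hypothetical, and the sketch assumes the failures surface as unchecked exceptions; in real use the first argument might be a lambda such as () -> HarvestDocumentation.documentHarvest(ingestables).

```java
// Hypothetical illustration only; not part of NetarchiveSuite.
public class DocumentThenUploadSketch {

    /**
     * Runs the documentation step and, only if it completes without an
     * exception, runs the upload step.
     *
     * @return true if both steps ran, false if documentation failed
     */
    static boolean documentThenUpload(Runnable documentStep, Runnable uploadStep) {
        try {
            documentStep.run();
        } catch (RuntimeException e) {
            // Assumption: documentation failures are unchecked exceptions.
            // The metadata is not guaranteed to be complete, so skip the upload.
            System.err.println("Documentation failed, upload skipped: " + e.getMessage());
            return false;
        }
        uploadStep.run();
        return true;
    }

    public static void main(String[] args) {
        boolean uploaded = documentThenUpload(
                () -> System.out.println("documenting harvest (stand-in step)"),
                () -> System.out.println("uploading metadata (stand-in step)"));
        System.out.println("Upload attempted: " + uploaded);
    }
}
```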

        TODO: Place preharvestmetadata in the IngestableFiles-defined area.
        TODO: This method may be a good place to copy deduplication information from the crawl log to the CDX file.

        Parameters:
        ingestables - Information about the finished crawl (crawldir, jobId, harvestID).
        Throws:
        ArgumentNotValid - if crawlDir is null or does not exist, or if jobID or harvestID is negative.
        IOFailure - if reading ARC files or temporary files fails, or if writing a file to arcFilesDir fails.