Class IngestableFiles


  • public class IngestableFiles
    extends java.lang.Object
    Encapsulation of files to be ingested into the archive. These files are presently placed in subdirectories under the crawlDir.
    • Method Summary

      Modifier and Type Method Description
      void cleanup()
      Remove any temporary files.
      void closeMetadataFile()
      Marks generated metadata as final, closes the writer, and moves the temporary metadata file to its final position.
      void closeOpenFiles​(int waitSeconds)
      Close any ".open" files left by a crashed Heritrix.
      protected void closeOpenFiles​(java.lang.String archiveDirName, java.io.FilenameFilter filter)
      Given an archive sub-directory name and a filter to match against, this method tries to rename the matched files.
      java.util.List<java.io.File> getArcFiles()
      Get a list of all ARC files that should get ingested.
      java.io.File getArcsDir()  
      java.io.File getCrawlDir()  
      long getHarvestID()  
      java.lang.String getHarvestnamePrefix()  
      java.io.File getHeritrix3JobDir()  
      long getJobId()  
      java.util.List<java.io.File> getMetadataArcFiles()
      Gets the files containing the metadata.
      protected java.io.File getMetadataFile()
      Constructs the single metadata arc file from the crawlDir and the jobID.
      MetadataFileWriter getMetadataWriter()
      Get a MetadataFileWriter for the temporary metadata file.
      java.io.File getReportsDir()  
      java.io.File getTmpMetadataDir()
      Constructs the TEMPORARY metadata subdir from the crawlDir.
      java.util.List<java.io.File> getWarcFiles()
      Get a list of all WARC files that should get ingested.
      java.io.File getWarcsDir()  
      boolean isMetadataFailed()
      Return true if the metadata generation process is known to have failed.
      boolean isMetadataReady()
      Check whether the metadata file already exists.
      void setErrorState​(boolean isError)
      Set error state.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • IngestableFiles

        public IngestableFiles​(Heritrix3Files files)
        Constructor for this class. The Heritrix3Files argument contains information about the crawlDir, jobID, and harvestnamePrefix for a specific finished harvest job.
        Parameters:
        files - An instance of Heritrix3Files
        Throws:
        ArgumentNotValid - if null-arguments are given; if jobID < 1; if crawlDir does not exist
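
        The three documented failure conditions can be illustrated with a minimal stand-alone sketch. It does not use NetarchiveSuite: the class name, the `validate` and `rejects` helpers, and the local `ArgumentNotValid` stand-in are all hypothetical, and the real constructor takes a Heritrix3Files object rather than a directory and a job ID.

```java
import java.io.File;

public class IngestableFilesArgsSketch {
    /** Hypothetical stand-in for the ArgumentNotValid exception named above. */
    static class ArgumentNotValid extends RuntimeException {
        ArgumentNotValid(String message) {
            super(message);
        }
    }

    /** Applies the three documented checks; returns crawlDir when all pass. */
    public static File validate(File crawlDir, long jobId) {
        if (crawlDir == null) {
            throw new ArgumentNotValid("crawlDir must not be null");
        }
        if (jobId < 1) {
            throw new ArgumentNotValid("jobID must be >= 1, was " + jobId);
        }
        if (!crawlDir.exists()) {
            throw new ArgumentNotValid("crawlDir does not exist: " + crawlDir);
        }
        return crawlDir;
    }

    /** Returns true when validate rejects the given arguments. */
    public static boolean rejects(File crawlDir, long jobId) {
        try {
            validate(crawlDir, jobId);
            return false;
        } catch (ArgumentNotValid e) {
            return true;
        }
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("null dir rejected: " + rejects(null, 1));
        System.out.println("jobID 0 rejected: " + rejects(tmp, 0));
        System.out.println("existing dir, jobID 1 rejected: " + rejects(tmp, 1));
    }
}
```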
    • Method Detail

      • isMetadataReady

        public boolean isMetadataReady()
        Check whether the metadata file already exists. If it does, metadata has been successfully generated. If it does not, either metadata generation has not finished or an error occurred while generating it.
        Returns:
        true if the metadata file exists; false otherwise.
      • isMetadataFailed

        public boolean isMetadataFailed()
        Return true if the metadata generation process is known to have failed.
        Returns:
        True if metadata generation has finished without success; false if generation is still ongoing or has completed successfully.
      • closeMetadataFile

        public void closeMetadataFile()
        Marks generated metadata as final, closes the writer, and moves the temporary metadata file to its final position.
        Throws:
        PermissionDenied - If the metadata has already been marked as ready, or if no metadata file exists upon success.
        IOFailure - if there is an error marking the metadata as ready.
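
        The "write to a temporary file, then move it into place" pattern described here can be sketched as follows. This is an illustration under assumptions, not the NetarchiveSuite implementation: the class, the method `finalizeMetadata`, and the file names are hypothetical, IllegalStateException stands in for PermissionDenied, and the check for a missing temporary file is one possible reading of "no metadata file exists".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MetadataFinalizeSketch {
    /** Moves tmpFile to finalFile, refusing if the final file already exists. */
    public static void finalizeMetadata(Path tmpFile, Path finalFile) throws IOException {
        if (Files.exists(finalFile)) {
            // Stand-in for PermissionDenied: metadata was already marked as ready.
            throw new IllegalStateException("Metadata already ready: " + finalFile);
        }
        if (!Files.exists(tmpFile)) {
            // Assumed reading of "no metadata file exists".
            throw new IllegalStateException("No temporary metadata file: " + tmpFile);
        }
        Files.move(tmpFile, finalFile);
    }

    /** Runs the pattern once in a fresh temp dir and reports the observed states. */
    public static String demo() {
        try {
            Path dir = Files.createTempDirectory("metadata-sketch");
            Path tmp = dir.resolve("tmp-metadata.warc");   // hypothetical name
            Path done = dir.resolve("metadata.warc");      // hypothetical name
            Files.write(tmp, "record".getBytes(StandardCharsets.UTF_8));

            boolean readyBefore = Files.exists(done);
            finalizeMetadata(tmp, done);
            boolean readyAfter = Files.exists(done);

            boolean secondCallRejected = false;
            try {
                finalizeMetadata(tmp, done);
            } catch (IllegalStateException e) {
                secondCallRejected = true;
            }
            return readyBefore + "," + readyAfter + "," + secondCallRejected;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

        Because readiness is judged by the existence of the final file, a reader of the directory never sees a half-written metadata file under its final name.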
      • setErrorState

        public void setErrorState​(boolean isError)
        Set error state.
        Parameters:
        isError - true if an error occurred; false otherwise.
      • getMetadataWriter

        public MetadataFileWriter getMetadataWriter()
        Get a MetadataFileWriter for the temporary metadata file. Successive calls to this method on the same object return the same writer. Once the metadata has been finalized, calling this method fails.
        Returns:
        a MetadataFileWriter for the temporary metadata file.
        Throws:
        PermissionDenied - if metadata generation is already finished.
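
        The contract above (the same writer on every call, failure once metadata is finalized) amounts to a lazily created, cached resource guarded by a state flag. A minimal sketch, assuming a plain java.io.StringWriter in place of MetadataFileWriter and IllegalStateException in place of PermissionDenied; the class and method names are hypothetical:

```java
import java.io.StringWriter;
import java.io.Writer;

public class MetadataWriterHolderSketch {
    private Writer writer;      // created on first request, then reused
    private boolean finalized;  // set once the metadata has been closed

    /** Returns the cached writer, creating it on first use. */
    public Writer getWriter() {
        if (finalized) {
            // Stand-in for PermissionDenied.
            throw new IllegalStateException("Metadata generation already finished");
        }
        if (writer == null) {
            writer = new StringWriter();
        }
        return writer;
    }

    /** Marks the metadata as final; later getWriter() calls fail. */
    public void close() {
        finalized = true;
        writer = null;
    }

    public static void main(String[] args) {
        MetadataWriterHolderSketch holder = new MetadataWriterHolderSketch();
        System.out.println("same writer: " + (holder.getWriter() == holder.getWriter()));
        holder.close();
        try {
            holder.getWriter();
        } catch (IllegalStateException e) {
            System.out.println("after close: " + e.getMessage());
        }
    }
}
```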
      • getMetadataArcFiles

        public java.util.List<java.io.File> getMetadataArcFiles()
        Gets the files containing the metadata.
        Returns:
        the files in the metadata dir
        Throws:
        IllegalState - if the metadata file is not ready, either because generation is still going on or there was an error generating the metadata.
      • getMetadataFile

        protected java.io.File getMetadataFile()
        Constructs the single metadata arc file from the crawlDir and the jobID.
        Returns:
        metadata arc file as a File
      • getTmpMetadataDir

        public java.io.File getTmpMetadataDir()
        Constructs the TEMPORARY metadata subdir from the crawlDir.
        Returns:
        The tmp-metadata subdir as a File
      • getArcFiles

        public java.util.List<java.io.File> getArcFiles()
        Get a list of all ARC files that should get ingested. Any open files should be closed with closeOpenFiles first.
        Returns:
        The ARC files that are ready to get ingested.
      • getArcsDir

        public java.io.File getArcsDir()
        Returns:
        the arcs dir in our crawl directory.
      • getWarcsDir

        public java.io.File getWarcsDir()
        Returns:
        the warcs dir in our crawl directory.
      • getReportsDir

        public java.io.File getReportsDir()
        Returns:
        the reports dir in our crawl directory.
      • getWarcFiles

        public java.util.List<java.io.File> getWarcFiles()
        Get a list of all WARC files that should get ingested. Any open files should be closed with closeOpenFiles first.
        Returns:
        The WARC files that are ready to get ingested.
      • closeOpenFiles

        public void closeOpenFiles​(int waitSeconds)
        Close any ".open" files left by a crashed Heritrix. ARC and/or WARC files ending in .open indicate that Heritrix is still writing to them. If Heritrix has died, we can just rename them before we upload. This must not be done while harvesting is still in progress.
        Parameters:
        waitSeconds - How many seconds to wait before closing files. This may be done in order to allow Heritrix to finish writing before we close the files.
      • closeOpenFiles

        protected void closeOpenFiles​(java.lang.String archiveDirName,
                                      java.io.FilenameFilter filter)
        Given an archive sub-directory name and a filter to match against, this method tries to rename the matched files. Files that cannot be renamed generate a log message. The filter should always match files that end with ".open" as a minimum.
        Parameters:
        archiveDirName - archive directory name, currently "arc" or "warc"
        filter - filename filter used to select ".open" files to rename
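
        The filter-and-rename step can be sketched with the standard java.io API. The description only says that matched files are renamed; dropping the ".open" suffix to form the new name is an assumption here, as are the class name, the `demo` helper, the file names, and the use of System.err in place of a logger.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Arrays;

public class CloseOpenFilesSketch {
    static final String OPEN_SUFFIX = ".open";

    /** Renames files in dir accepted by filter by dropping ".open"; returns how many were renamed. */
    public static int closeOpenFiles(File dir, FilenameFilter filter) {
        File[] matches = dir.listFiles(filter);
        if (matches == null) {
            return 0; // dir is not a directory, or an I/O error occurred
        }
        int renamed = 0;
        for (File open : matches) {
            String name = open.getName();
            if (!name.endsWith(OPEN_SUFFIX)) {
                continue; // the filter matched something that is not an ".open" file
            }
            File closed = new File(dir, name.substring(0, name.length() - OPEN_SUFFIX.length()));
            if (open.renameTo(closed)) {
                renamed++;
            } else {
                System.err.println("Could not rename " + open + " to " + closed);
            }
        }
        return renamed;
    }

    /** Creates one open and one closed file, closes the open one, and lists the result. */
    public static String demo() {
        try {
            File dir = Files.createTempDirectory("warcs-sketch").toFile();
            new File(dir, "a.warc.open").createNewFile();
            new File(dir, "b.warc").createNewFile();
            int renamed = closeOpenFiles(dir, (d, n) -> n.endsWith(".warc" + OPEN_SUFFIX));
            String[] names = dir.list();
            Arrays.sort(names);
            return renamed + ":" + String.join(",", names);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

        File.renameTo reports failure through its return value rather than an exception, which fits the documented behaviour of logging files that cannot be renamed and carrying on with the rest.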
      • cleanup

        public void cleanup()
        Remove any temporary files.
      • getJobId

        public long getJobId()
        Returns:
        the jobID of the harvest job being processed.
      • getHarvestID

        public long getHarvestID()
        Returns:
        the harvestID of the harvest job being processed.
      • getHarvestnamePrefix

        public java.lang.String getHarvestnamePrefix()
        Returns:
        the harvestnamePrefix of the harvest job being processed.
      • getCrawlDir

        public java.io.File getCrawlDir()
        Returns:
        the crawlDir of the harvest job being processed.