Class HeritrixFiles


  • public class HeritrixFiles
    extends java.lang.Object
    This class encapsulates all the files that Heritrix gets from our system, and all files we read from Heritrix.
    • Constructor Detail

      • HeritrixFiles

        public HeritrixFiles(java.io.File crawlDir,
                             JobInfo harvestJob,
                             java.io.File jmxPasswordFile,
                             java.io.File jmxAccessFile)
        Create a new HeritrixFiles object for a job.
        Parameters:
        crawlDir - The directory where the crawl files are placed. Assumes that crawlDir already exists.
        harvestJob - The harvest job behind this instance of HeritrixFiles.
        jmxPasswordFile - The JMX password file to be used by Heritrix 1. The existence of this file is checked elsewhere.
        jmxAccessFile - The JMX access file to be used by Heritrix 1. The existence of this file is checked elsewhere.
        Throws:
        ArgumentNotValid - if crawlDir is null, or if jobID and harvestID are non-positive.
    • Method Detail

      • getCrawlDir

        public java.io.File getCrawlDir()
        Returns the directory that crawls are performed inside.
        Returns:
        A directory (that is created as part of harvest setup) that all of Heritrix' files live in.
      • getArchiveFilePrefix

        public java.lang.String getArchiveFilePrefix()
        Returns the prefix used to generate Archive files (ARC or WARC).
        Returns:
        The archive file prefix, currently jobID-harvestID.
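        The jobID-harvestID prefix format described above can be sketched as follows. This is only an illustration under stated assumptions: ArchivePrefixSketch and archiveFilePrefix are hypothetical names, not part of HeritrixFiles, and the real class may build the prefix differently.

```java
// Hypothetical stand-in for illustration; not the HeritrixFiles implementation.
final class ArchivePrefixSketch {

    /** Builds a prefix of the form jobID-harvestID, for example "42-7". */
    static String archiveFilePrefix(long jobId, long harvestId) {
        return jobId + "-" + harvestId;
    }

    public static void main(String[] args) {
        System.out.println(archiveFilePrefix(42L, 7L)); // prints "42-7"
    }
}
```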
      • getOrderXmlFile

        public java.io.File getOrderXmlFile()
        Returns the order.xml file object.
        Returns:
        A file object for the order.xml file (which may not have been written yet).
      • getSeedsTxtFile

        public java.io.File getSeedsTxtFile()
        Returns the seeds.txt file object.
        Returns:
        A file object for the seeds.txt file (which may not have been written yet).
      • getRecoverBackupGzFile

        public java.io.File getRecoverBackupGzFile()
        Returns the recoverbackup file object.
        Returns:
        A file object for the recoverbackup.gz file (which may or may not exist).
      • writeRecoverBackupfile

        public boolean writeRecoverBackupfile(java.io.InputStream recoverlog)
        Try to write the recover-backup file.
        Parameters:
        recoverlog - The recoverlog in the form of an InputStream
        Returns:
        true if the operation succeeds, otherwise false
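        A minimal sketch of the "return true on success, false on failure" contract, assuming the stream is simply copied to a target file. RecoverBackupSketch, writeRecoverBackup and the explicit target parameter are hypothetical; the real method resolves its own target file and may handle errors differently.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for illustration; not the HeritrixFiles implementation.
final class RecoverBackupSketch {

    /**
     * Copies the recover log stream to the target file.
     * Returns true if the copy succeeds and false if an I/O error occurs.
     */
    static boolean writeRecoverBackup(InputStream recoverlog, File target) {
        try {
            // Overwrite any earlier backup so a retry starts from a clean file.
            Files.copy(recoverlog, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            // Report failure through the return value instead of propagating.
            return false;
        }
    }
}
```

        Reporting failure through the return value lets the caller decide whether a missing backup is fatal for the job.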
      • writeSeedsTxt

        public void writeSeedsTxt(java.lang.String seeds)
        Writes the given content to the seeds.txt file.
        Parameters:
        seeds - The intended content of seeds.txt
        Throws:
        ArgumentNotValid - if seeds is null or empty
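        The validate-then-write behaviour can be sketched as below. This is an assumption-laden illustration: SeedsWriterSketch is a hypothetical class, IllegalArgumentException stands in for ArgumentNotValid, and the UTF-8 encoding and the explicit seedsFile parameter are choices made for this sketch only.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical stand-in for illustration; not the HeritrixFiles implementation.
final class SeedsWriterSketch {

    /** Writes the seeds to the given file, rejecting null or empty content. */
    static void writeSeedsTxt(File seedsFile, String seeds) {
        // IllegalArgumentException stands in for the library's ArgumentNotValid.
        if (seeds == null || seeds.isEmpty()) {
            throw new IllegalArgumentException("seeds must not be null or empty");
        }
        try {
            // UTF-8 is an assumption of this sketch.
            Files.write(seedsFile.toPath(), seeds.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```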
      • writeOrderXml

        public void writeOrderXml(HeritrixTemplate doc)
        Writes the given order.xml content to the order.xml file.
        Parameters:
        doc - The intended content of order.xml
      • getHeritrixOutput

        public java.io.File getHeritrixOutput()
        Get the file that contains output from Heritrix on stdout/stderr.
        Returns:
        File that contains output from Heritrix on stdout/stderr.
      • setIndexDir

        public void setIndexDir(java.io.File indexDir)
        Set the deduplication index directory.
        Parameters:
        indexDir - the cache directory containing unzipped files
        Throws:
        ArgumentNotValid - if indexDir is not a directory or is null
      • getIndexDir

        public java.io.File getIndexDir()
        Returns the index directory, if one has been set.
        Returns:
        the index directory or null if no index has been set.
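        The setIndexDir/getIndexDir pair behaves like a validated, initially unset field. A minimal sketch under that reading follows; IndexDirHolderSketch is a hypothetical class and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.io.File;

// Hypothetical stand-in for illustration; not the HeritrixFiles implementation.
final class IndexDirHolderSketch {

    private File indexDir; // stays null until setIndexDir succeeds

    /** Accepts only a non-null, existing directory. */
    void setIndexDir(File indexDir) {
        // IllegalArgumentException stands in for the library's ArgumentNotValid.
        if (indexDir == null || !indexDir.isDirectory()) {
            throw new IllegalArgumentException("indexDir must be an existing directory: " + indexDir);
        }
        this.indexDir = indexDir;
    }

    /** Returns the index directory, or null if none has been set. */
    File getIndexDir() {
        return indexDir;
    }
}
```

        Callers therefore need a null check on getIndexDir() before using the result.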
      • getDisposableFiles

        public java.io.File[] getDisposableFiles()
        Return an array of disposable Heritrix files. Currently the array consists of the file "state.job" and the directories "checkpoints", "state", and "scratch".
        Returns:
        an array of disposable Heritrix files.
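        As a rough sketch, the disposable entries named above could be assembled like this. It assumes all four entries sit directly under the crawl directory, which the description does not state; DisposableFilesSketch and disposableFiles are hypothetical names.

```java
import java.io.File;

// Hypothetical stand-in for illustration; not the HeritrixFiles implementation.
final class DisposableFilesSketch {

    /**
     * Lists the disposable entries, assuming (for this sketch only)
     * that each one lives directly under crawlDir.
     */
    static File[] disposableFiles(File crawlDir) {
        return new File[] {
            new File(crawlDir, "state.job"),   // file
            new File(crawlDir, "checkpoints"), // directory
            new File(crawlDir, "state"),       // directory
            new File(crawlDir, "scratch")      // directory
        };
    }
}
```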
      • getCrawlLog

        public java.io.File getCrawlLog()
        Retrieve the crawlLog as a File object.
        Returns:
        the crawlLog as a File object.
      • getProgressStatisticsLog

        public java.io.File getProgressStatisticsLog()
        Retrieve the progress statistics log as a File object.
        Returns:
        the progress statistics log as a File object.
      • getJobID

        public java.lang.Long getJobID()
        Get the job ID.
        Returns:
        The job ID that this HeritrixFiles object is for.
      • getHarvestID

        public java.lang.Long getHarvestID()
        Get the harvest ID.
        Returns:
        The harvest ID that this HeritrixFiles object is for.
      • cleanUpAfterHarvest

        public void cleanUpAfterHarvest(java.io.File oldJobsDir)
        Delete the state file etc. and move the crawl directory to oldJobsDir.
        Parameters:
        oldJobsDir - Directory to move the rest of any existing files to.
      • deleteFinalLogs

        public void deleteFinalLogs()
        Helper method to delete the crawl.log and progress statistics log. Will log errors but otherwise continue.
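        The "log errors but otherwise continue" behaviour can be sketched as a loop that never lets one failed deletion stop the rest. DeleteLogsSketch and deleteAll are hypothetical, and java.util.logging is used here only as a stand-in for whatever logging the real class uses.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for illustration; not the HeritrixFiles implementation.
final class DeleteLogsSketch {

    private static final Logger LOG = Logger.getLogger(DeleteLogsSketch.class.getName());

    /**
     * Tries to delete each file. A failure is logged and the loop continues
     * with the next file. Returns the number of files actually deleted.
     */
    static int deleteAll(File... files) {
        int deleted = 0;
        for (File f : files) {
            try {
                if (Files.deleteIfExists(f.toPath())) {
                    deleted++;
                }
            } catch (IOException e) {
                // Log and keep going; a failed delete must not abort cleanup.
                LOG.log(Level.WARNING, "Could not delete " + f, e);
            }
        }
        return deleted;
    }
}
```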
      • getArcsDir

        public java.io.File getArcsDir()
        Return the directory where Heritrix writes its ARC files.
        Returns:
        the directory where Heritrix writes its ARC files.
      • getWarcsDir

        public java.io.File getWarcsDir()
        Return the directory where Heritrix writes its WARC files.
        Returns:
        the directory where Heritrix writes its WARC files.
      • getJmxPasswordFile

        public java.io.File getJmxPasswordFile()
        Method for retrieving the jmxremote.password file.
        Returns:
        the jmxPasswordFile.
      • getJmxAccessFile

        public java.io.File getJmxAccessFile()
        Method for retrieving the jmxremote.access file.
        Returns:
        the jmxAccessFile.