Class Job

  • All Implemented Interfaces:
    JobInfo, Serializable

    public class Job
    extends Object
    implements Serializable, JobInfo
    This class represents one job to be run by Heritrix. It is based on a number of configurations that all use the same order.xml, with at most one configuration per domain. Each job consists of configurations of approximately the same size: the expected sizes of the smallest and the largest configuration differ by at most a factor of limMaxRelSize (differences smaller than limMinAbsSize are ignored). The total size of the job, in objects, is limited by limMaxTotalSize.

    A job may also be limited on bytes or objects, defined either by the configurations in the job or the harvest definition the job is generated by.

    The job contains the order file, the seedlist and the current status of the job, as well as the ID of the harvest definition that defined it and names of all the configurations it is based on.
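
    The size rule described above can be sketched as follows. This is an illustration only: the class name, the method name and the exact comparisons are assumptions, and the three limits are passed as plain arguments here rather than read from wherever the real class keeps them.

```java
final class JobSizeRuleSketch {
    /**
     * Hypothetical check: may a configuration with the expected size 'candidate'
     * join a job whose current smallest/largest expectations are min/max and
     * whose current total expectation is 'total'?
     */
    static boolean fits(long candidate, long min, long max, long total,
                        long limMaxRelSize, long limMinAbsSize, long limMaxTotalSize) {
        // The job as a whole must stay within the total object limit.
        if (total + candidate > limMaxTotalSize) {
            return false;
        }
        long newMin = Math.min(min, candidate);
        long newMax = Math.max(max, candidate);
        // Differences smaller than the absolute threshold are ignored.
        if (newMax - newMin < limMinAbsSize) {
            return true;
        }
        // Otherwise the largest expectation may be at most
        // limMaxRelSize times the smallest one.
        return newMax <= newMin * limMaxRelSize;
    }

    public static void main(String[] args) {
        // Similar size: accepted.
        System.out.println(fits(5_000, 2_000, 4_000, 6_000, 10, 100, 1_000_000));
        // More than a factor 10 larger than the smallest: rejected.
        System.out.println(fits(50_000, 2_000, 4_000, 6_000, 10, 100, 1_000_000));
    }
}
```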

    See Also:
    Serialized Form
    • Field Detail

      • origHarvestDefinitionID

        protected Long origHarvestDefinitionID
        The ID of the harvest definition that generated this job.
      • status

        protected JobStatus status
        The status of the job. See the JobStatus class for the possible states.
    • Constructor Detail

      • Job

        protected Job()
      • Job

        public Job​(Long harvestID,
                   DomainConfiguration cfg,
                   HeritrixTemplate orderXMLdoc,
                   HarvestChannel channel,
                   long forceMaxObjectsPerDomain,
                   long forceMaxBytesPerDomain,
                   long forceMaxJobRunningTime,
                   int harvestNum)
            throws ArgumentNotValid
        Constructor for common initialisation.
        Parameters:
        harvestID - the ID of the harvest definition
        cfg - the configuration to base the Job on
        orderXMLdoc - the Heritrix order.xml template to base the job on
        channel - the channel on which the job will be submitted.
        forceMaxObjectsPerDomain - the maximum number of objects harvested from a domain, overrides individual configuration settings. -1 means no limit
        forceMaxBytesPerDomain - the maximum number of bytes harvested from a domain, or -1 for no limit
        forceMaxJobRunningTime - The max time in seconds given to the harvester for this job
        harvestNum - the run number of the harvest definition
        Throws:
        ArgumentNotValid - if cfg or channel is null, if harvestID is invalid, or if any limit is less than -1
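
        The limit convention used by the constructor (-1 means no limit, anything below -1 is rejected) might be checked along these lines. This is a minimal sketch: LimitCheckSketch and checkLimit are hypothetical names, and IllegalArgumentException stands in for ArgumentNotValid so that the example compiles on its own.

```java
final class LimitCheckSketch {
    /** Value meaning "no limit" for the force* parameters, as documented above. */
    static final long NO_LIMIT = -1L;

    /** Returns the value unchanged if it is a legal limit, otherwise throws. */
    static long checkLimit(long value, String name) {
        if (value < NO_LIMIT) {
            // Stand-in for ArgumentNotValid in the real class.
            throw new IllegalArgumentException(
                    name + " must be -1 (no limit) or larger, but was " + value);
        }
        return value;
    }

    static boolean isUnlimited(long value) {
        return value == NO_LIMIT;
    }
}
```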
    • Method Detail

      • addConfiguration

        public void addConfiguration​(DomainConfiguration cfg)
        Adds a configuration to this Job. Seedlists and settings are updated accordingly.
        Parameters:
        cfg - the configuration to add
        Throws:
        ArgumentNotValid - if cfg is null, if cfg uses a different order.xml than this job, or if this job already contains a configuration for the domain of cfg.
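
        The three rejection cases listed above can be illustrated with a simplified model. The sketch below assumes a configuration can be reduced to three strings (domain name, configuration name, order.xml name); the class name is hypothetical, and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AddConfigurationSketch {
    private final String orderXmlName;
    // Domain name -> configuration name; at most one entry per domain.
    private final Map<String, String> domainToConfig = new LinkedHashMap<>();

    AddConfigurationSketch(String orderXmlName) {
        this.orderXmlName = orderXmlName;
    }

    void addConfiguration(String domain, String configName, String configOrderXmlName) {
        if (domain == null || configName == null) {
            throw new IllegalArgumentException("configuration must not be null");
        }
        if (!orderXmlName.equals(configOrderXmlName)) {
            throw new IllegalArgumentException("configuration uses a different order.xml");
        }
        if (domainToConfig.containsKey(domain)) {
            throw new IllegalArgumentException("job already has a configuration for " + domain);
        }
        domainToConfig.put(domain, configName);
    }

    int getCountDomains() {
        return domainToConfig.size();
    }
}
```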
      • getOrderXMLName

        public String getOrderXMLName()
        Get the name of the order XML file used by this Job.
        Returns:
        the name of the orderXML file
      • getActualStop

        public Date getActualStop()
        Get the actual time when this job was stopped/completed.
        Returns:
        the time as Date
      • getActualStart

        public Date getActualStart()
        Get the actual time when this job was started.
        Returns:
        the time as Date
      • getSubmittedDate

        public Date getSubmittedDate()
        Get the time when this job was submitted.
        Returns:
        the time as Date
      • getCreationDate

        public Date getCreationDate()
        Get the time when this job was created.
        Returns:
        the creation time as a Date
      • getSettingsXMLfiles

        public File[] getSettingsXMLfiles()
        Get a list of Heritrix settings.xml files. Note that these files have nothing to do with NetarchiveSuite settings files. They are files that supplement the Heritrix order.xml files, and contain overrides for specific domains.
        Returns:
        the list of Files as an array
      • getOrigHarvestDefinitionID

        public Long getOrigHarvestDefinitionID()
        Get the id of the HarvestDefinition from which this job originates.
        Specified by:
        getOrigHarvestDefinitionID in interface JobInfo
        Returns:
        the id as a Long
      • getJobID

        public Long getJobID()
        Get the id of this Job.
        Specified by:
        getJobID in interface JobInfo
        Returns:
        the id as a Long
      • setJobID

        public void setJobID​(Long id)
        Set the id of this Job.
        Parameters:
        id - The Id for this job.
      • getCountDomains

        public int getCountDomains()
        Gets the total number of different domains harvested by this job.
        Returns:
        the number of configurations added to this job
      • setActualStart

        public void setActualStart​(Date actualStart)
        Set the actual time when this job was started.

        Sends a notification if actualStart is set to a time after actualStop.

        Parameters:
        actualStart - A Date object representing the time when this job was started.
      • setActualStop

        public void setActualStop​(Date actualStop)
                           throws ArgumentNotValid
        Set the actual time when this job was stopped/completed. Sends a notification if actualStop is set to a time before actualStart.
        Parameters:
        actualStop - A Date object representing the time when this job was stopped.
        Throws:
        ArgumentNotValid
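
        The consistency check performed by setActualStart and setActualStop can be pictured as below. This is a sketch under stated assumptions: the notification mechanism is not described here, so a plain list of messages stands in for it, and null handling in the real class may differ.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

final class JobTimesSketch {
    private Date actualStart;
    private Date actualStop;
    // Stand-in for the real notification mechanism (an assumption).
    final List<String> notifications = new ArrayList<>();

    void setActualStart(Date start) {
        // Notify if the new start time lies after an already known stop time.
        if (start != null && actualStop != null && start.after(actualStop)) {
            notifications.add("Start time " + start + " is after stop time " + actualStop);
        }
        actualStart = start;
    }

    void setActualStop(Date stop) {
        // Notify if the new stop time lies before an already known start time.
        if (stop != null && actualStart != null && stop.before(actualStart)) {
            notifications.add("Stop time " + stop + " is before start time " + actualStart);
        }
        actualStop = stop;
    }
}
```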
      • setOrderXMLDoc

        public void setOrderXMLDoc​(HeritrixTemplate doc)
        Set the order.xml template for this job.
        Parameters:
        doc - the order.xml template to be used by this job
      • getOrderXMLdoc

        public HeritrixTemplate getOrderXMLdoc()
        Gets a document representation of the order.xml associated with this Job.
        Returns:
        the order.xml as a HeritrixTemplate
      • setSeedList

        public void setSeedList​(String seedList)
        Set the seedlist of the job from the seedList argument. Individual seeds are separated by a '\n' character. Duplicate seeds are removed.
        Parameters:
        seedList - List of seeds as one String
      • getSeedListAsString

        public String getSeedListAsString()
        Get the seedlist as a String. The individual seeds are separated by the character '\n'. The order of the seeds is unspecified.
        Returns:
        the seedlist as a String
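
        The behaviour of setSeedList and getSeedListAsString (split on '\n', drop duplicates, join on '\n') could look roughly like this. The class name is hypothetical, and trimming whitespace and skipping blank lines are assumptions of this sketch, not documented behaviour.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class SeedListSketch {
    // A set removes duplicate seeds automatically.
    private final Set<String> seeds = new LinkedHashSet<>();

    void setSeedList(String seedList) {
        seeds.clear();
        for (String seed : seedList.split("\n")) {
            String trimmed = seed.trim();
            // Assumption: blank lines are not seeds.
            if (!trimmed.isEmpty()) {
                seeds.add(trimmed);
            }
        }
    }

    String getSeedListAsString() {
        return String.join("\n", seeds);
    }
}
```

        Callers should not rely on the order of the returned seeds; the insertion order kept by this sketch is a side effect of LinkedHashSet, not something the documentation promises.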
      • getStatus

        public JobStatus getStatus()
        Get the current status of this Job.
        Returns:
        the status as a JobStatus
      • setStatus

        public void setStatus​(JobStatus newStatus)
        Sets status of this job.
        Parameters:
        newStatus - Must be one of the values STATUS_NEW, ..., STATUS_FAILED
        Throws:
        ArgumentNotValid - in case of invalid status argument or invalid status change
      • getDomainConfigurationMap

        public Map<String,​String> getDomainConfigurationMap()
        Returns a map of domain names and name of their corresponding configuration.

        The returned Map cannot be changed.

        Returns:
        a read-only Map<String, String> from domain name to configuration name
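
        A read-only view of this kind is commonly produced with Collections.unmodifiableMap. Whether Job does exactly this is an assumption; the sketch below, with hypothetical class and method names, only shows what callers can expect from a map that "cannot be changed".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ReadOnlyMapSketch {
    private final Map<String, String> domainConfigurationMap = new HashMap<>();

    // Hypothetical helper for filling the map in this sketch.
    void put(String domain, String configName) {
        domainConfigurationMap.put(domain, configName);
    }

    Map<String, String> getDomainConfigurationMap() {
        // Callers get a view: reads work, writes throw
        // UnsupportedOperationException.
        return Collections.unmodifiableMap(domainConfigurationMap);
    }
}
```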
      • getMaxObjectsPerDomain

        public long getMaxObjectsPerDomain()
        Gets the maximum number of objects harvested per domain.
        Returns:
        The maximum number of objects harvested per domain. 0 means no limit.
      • getMaxBytesPerDomain

        public long getMaxBytesPerDomain()
        Gets the maximum number of bytes harvested per domain.
        Returns:
        The maximum number of bytes harvested per domain. -1 means no limit.
      • setHarvestChannel

        public void setHarvestChannel​(HarvestChannel harvestChannel)
      • setChannel

        public void setChannel​(String channel)
        Sets the associated HarvestChannel name.
        Parameters:
        channel - the channel name
      • isSnapshot

        public boolean isSnapshot()
        Returns:
        true if the job belongs to a snapshot harvest, false if it belongs to a focused harvest.
      • setSnapshot

        public void setSnapshot​(boolean isSnapshot)
        Sets whether job belongs to a snapshot or focused harvest.
        Parameters:
        isSnapshot - true if the job belongs to a snapshot harvest, false if it belongs to a focused harvest.
      • getForceMaxObjectsPerDomain

        public long getForceMaxObjectsPerDomain()
        Returns:
        Returns the forceMaxObjectsPerDomain. 0 means no limit.
      • setMaxObjectsPerDomain

        protected void setMaxObjectsPerDomain​(long maxObjectsPerDomain)
        Sets the maxObjectsPerDomain value.
        Parameters:
        maxObjectsPerDomain - The maxObjectsPerDomain to set. 0 means no limit.
        Throws:
        IOFailure - Thrown from auxiliary method editOrderXML_maxObjectsPerDomain.
      • setMaxBytesPerDomain

        protected void setMaxBytesPerDomain​(long maxBytesPerDomain)
        Set the maxbytes per domain value.
        Parameters:
        maxBytesPerDomain - The maxBytesPerDomain to set, or -1 for no limit.
      • setMaxJobRunningTime

        protected void setMaxJobRunningTime​(long maxJobRunningTime)
        Set the maxJobRunningTime value.
        Parameters:
        maxJobRunningTime - The maxJobRunningTime in seconds to set, or 0 for no limit.
      • getMaxJobRunningTime

        public long getMaxJobRunningTime()
        Returns:
        Returns the MaxJobRunningTime. 0 means no limit.
      • getHarvestNum

        public int getHarvestNum()
        Get the harvestNum for this job. The number reflects which run of the harvest definition this is.
        Returns:
        the harvestNum for this job.
      • setHarvestNum

        public void setHarvestNum​(int harvestNum)
        Set the harvestNum for this job. The number reflects which run of the harvest definition this is. ONLY TO BE USED IN THE CONSTRUCTION PHASE.
        Parameters:
        harvestNum - a given harvestNum
      • getHarvestErrors

        public String getHarvestErrors()
        Get the list of harvest errors for this job. If there are no harvest errors, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).
        Returns:
        the harvest errors for this job or null if no harvest errors.
      • appendHarvestErrors

        public void appendHarvestErrors​(String harvestErrors)
        Append to the list of harvest errors for this job. Nothing happens if the argument harvestErrors is null.
        Parameters:
        harvestErrors - a string containing harvest errors (may be null)
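
        The get/append pairs for harvest and upload errors all follow the same pattern: the getter returns null until something has been appended, and appending null is a no-op. A minimal sketch of that pattern is shown below; the class name is hypothetical, and the newline used as separator between appended strings is an assumption.

```java
final class ErrorLogSketch {
    // Stays null until the first non-null error text is appended.
    private String harvestErrors;

    void appendHarvestErrors(String errors) {
        if (errors == null) {
            return; // documented: nothing happens for a null argument
        }
        // Assumption: entries are separated by a newline.
        harvestErrors = (harvestErrors == null) ? errors : harvestErrors + "\n" + errors;
    }

    String getHarvestErrors() {
        return harvestErrors;
    }
}
```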
      • getHarvestErrorDetails

        public String getHarvestErrorDetails()
        Get the list of harvest error details for this job. If there are no harvest error details, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).
        Returns:
        the list of harvest error details for this job or null if no harvest error details.
      • appendHarvestErrorDetails

        public void appendHarvestErrorDetails​(String harvestErrorDetails)
        Append to the list of harvest error details for this job. Nothing happens if the argument harvestErrorDetails is null.
        Parameters:
        harvestErrorDetails - a string containing harvest error details.
      • getUploadErrors

        public String getUploadErrors()
        Get the list of upload errors. If there are no upload errors, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).
        Returns:
        the list of upload errors as String, or null if no upload errors.
      • appendUploadErrors

        public void appendUploadErrors​(String uploadErrors)
        Append to the list of upload errors. Nothing happens if the argument uploadErrors is null.
        Parameters:
        uploadErrors - a string containing upload errors.
      • getUploadErrorDetails

        public String getUploadErrorDetails()
        Get the list of upload error details. If there are no upload error details, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).
        Returns:
        the list of upload error details as String, or null if no upload error details
      • appendUploadErrorDetails

        public void appendUploadErrorDetails​(String uploadErrorDetails)
        Append to the list of upload error details. Nothing happens if the argument uploadErrorDetails is null.
        Parameters:
        uploadErrorDetails - a string containing upload error details.
      • getResubmittedAsJob

        public Long getResubmittedAsJob()
        Get the ID for the job which this job was resubmitted as. If null, this job has not been resubmitted.
        Returns:
        the ID of the job this job was resubmitted as, or null if it has not been resubmitted.
      • setSubmittedDate

        public void setSubmittedDate​(Date submittedDate)
        Set the Date for when this job was submitted. If null, this job has not been submitted.
        Parameters:
        submittedDate - The date when this was submitted
      • setCreationDate

        public void setCreationDate​(Date creationDate)
        Set the Date for when this job was created. If null, this job has not been created.
        Parameters:
        creationDate - The date when this was created
      • setResubmittedAsJob

        public void setResubmittedAsJob​(Long resubmittedAsJob)
        Set the ID for the job which this job was resubmitted as.
        Parameters:
        resubmittedAsJob - An Id for a new job.
      • getContinuationOf

        public Long getContinuationOf()
        Returns:
        the ID of the job that this job is supposed to continue using the Heritrix recover log, or null if it starts from scratch.
      • getHarvestFilenamePrefix

        public String getHarvestFilenamePrefix()
        Description copied from interface: JobInfo
        Get the harvestFilename prefix.
        Specified by:
        getHarvestFilenamePrefix in interface JobInfo
        Returns:
        the harvestFilename prefix.
      • setHarvestFilenamePrefix

        public void setHarvestFilenamePrefix​(String prefix)
        Parameters:
        prefix - the harvestFilename prefix to set
      • getForceMaxBytesPerDomain

        public long getForceMaxBytesPerDomain()
        Returns:
        the forceMaxBytesPerDomain
      • isConfigurationSetsObjectLimit

        public boolean isConfigurationSetsObjectLimit()
        Returns:
        the configurationSetsObjectLimit
      • isConfigurationSetsByteLimit

        public boolean isConfigurationSetsByteLimit()
        Returns:
        the configurationSetsByteLimit
      • getMinCountObjects

        public long getMinCountObjects()
        Returns:
        the minCountObjects
      • getMaxCountObjects

        public long getMaxCountObjects()
        Returns:
        the maxCountObjects
      • getTotalCountObjects

        public long getTotalCountObjects()
        Returns:
        the totalCountObjects
      • getHarvestAudience

        public String getHarvestAudience()
        Returns:
        the harvest-audience.
      • setHarvestAudience

        public void setHarvestAudience​(String theAudience)
        Set the harvest audience for this job. Taken from the harvestdefinition that generated this job.
        Parameters:
        theAudience - the harvest-audience.
      • getSortedSeedList

        public List<String> getSortedSeedList()
        Returns a list of sorted seeds for this job. The seeds are sorted by domain and, within each domain, by URL.
        Returns:
        a list of sorted seeds for this job.
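
        A two-level ordering like the one described for getSortedSeedList can be expressed with a chained Comparator. In the sketch below, the host name returned by java.net.URI stands in for the domain, which is an assumption: the real class may derive the domain differently. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

final class SortedSeedsSketch {
    /** Stand-in for a domain lookup: the URI host, or "" if none can be parsed. */
    static String hostOf(String seed) {
        try {
            String host = new URI(seed).getHost();
            return host == null ? "" : host;
        } catch (URISyntaxException e) {
            return "";
        }
    }

    static List<String> sortSeeds(Collection<String> seeds) {
        List<String> sorted = new ArrayList<>(seeds);
        // First key: the host; second key: the full URL string.
        Comparator<String> byHost = Comparator.comparing(SortedSeedsSketch::hostOf);
        Comparator<String> byUrl = Comparator.naturalOrder();
        sorted.sort(byHost.thenComparing(byUrl));
        return sorted;
    }
}
```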