dk.netarkivet.harvester.datamodel
Class Job

java.lang.Object
  extended by dk.netarkivet.harvester.datamodel.Job
All Implemented Interfaces:
JobInfo, java.io.Serializable

public class Job
extends java.lang.Object
implements java.io.Serializable, JobInfo

This class represents one job to be run by Heritrix. A job is based on a number of configurations, all of which use the same order.xml, with at most one configuration per domain. The configurations in a job are of approximately the same size: the expected sizes of the smallest and the largest configuration differ by at most a factor of limMaxRelSize (differences smaller than limMinAbsSize are ignored). The total size of the job, counted in objects, is limited by limMaxTotalSize. A job may also be limited in bytes or objects, as defined either by the configurations in the job or by the harvest definition that generated the job. The job contains the order file, the seedlist, and the current status of the job, as well as the ID of the harvest definition that defined it and the names of all the configurations it is based on.
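
The size rule described above can be read as a simple compatibility check. The following is a minimal sketch of one possible reading, not the NetarchiveSuite implementation: the class JobSizeRule, the method fits, and the way the three limits are combined are all assumptions made for illustration.

```java
/** Hypothetical illustration of the job size rule; not NetarchiveSuite code. */
final class JobSizeRule {
    private JobSizeRule() {
    }

    /**
     * Decides whether a set of configurations is "approximately the same size".
     *
     * @param minExpected     smallest expected configuration size, in objects
     * @param maxExpected     largest expected configuration size, in objects
     * @param totalExpected   sum of all expected sizes, in objects
     * @param limMaxRelSize   largest allowed factor between largest and smallest
     * @param limMinAbsSize   absolute differences below this value are ignored
     * @param limMaxTotalSize upper bound on the total size of the job
     */
    static boolean fits(long minExpected, long maxExpected, long totalExpected,
                        double limMaxRelSize, long limMinAbsSize, long limMaxTotalSize) {
        // The total size limit applies regardless of the relative sizes.
        if (totalExpected > limMaxTotalSize) {
            return false;
        }
        // Small absolute differences are ignored (assumed reading of limMinAbsSize).
        if (maxExpected - minExpected < limMinAbsSize) {
            return true;
        }
        // Otherwise the largest size must be within the allowed factor of the smallest.
        return maxExpected <= minExpected * limMaxRelSize;
    }
}
```

With the made-up limits limMaxRelSize = 2.0, limMinAbsSize = 10 and limMaxTotalSize = 1000, expected sizes of 100 and 150 fit together, while 100 and 500 do not.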

See Also:
Serialized Form

Field Summary
(package private)  boolean configsChanged
          A hint to the DAO that configurations have changed.
 
Constructor Summary
Job(java.lang.Long harvestID, DomainConfiguration cfg, HarvestChannel channel, long forceMaxObjectsPerDomain, long forceMaxBytesPerDomain, long forceMaxJobRunningTime, int harvestNum)
          Package private constructor for common initialisation.
Job(java.lang.Long harvestID, java.util.Map<java.lang.String,java.lang.String> configurations, java.lang.String channel, boolean snapshot, long forceMaxObjectsPerDomain, long forceMaxBytesPerDomain, long forceMaxJobRunningTime, JobStatus status, java.lang.String orderXMLname, org.dom4j.Document orderXMLdoc, java.lang.String seedlist, int harvestNum, java.lang.Long continuationOf)
          Create a new Job object from basic information storable in the DAO.
 
Method Summary
 void addConfiguration(DomainConfiguration cfg)
          Adds a configuration to this Job.
 void appendHarvestErrorDetails(java.lang.String harvestErrorDetails)
          Append to the list of harvest error details for this job.
 void appendHarvestErrors(java.lang.String harvestErrors)
          Append to the list of harvest errors for this job.
 void appendUploadErrorDetails(java.lang.String uploadErrorDetails)
          Append to the list of upload error details.
 void appendUploadErrors(java.lang.String uploadErrors)
          Append to the list of upload errors.
static Job createJob(java.lang.Long harvestID, HarvestChannel channel, DomainConfiguration cfg, int harvestNum)
          Create new Job configured according to the properties of the supplied DomainConfiguration.
static Job createSnapShotJob(java.lang.Long harvestID, HarvestChannel channel, DomainConfiguration cfg, long maxObjectsPerDomain, long maxBytesPerDomain, long maxJobRunningTime, int harvestNum)
          Create new instance of Job suitable for snapshot harvesting.
 java.util.Date getActualStart()
          Get the actual time when this job was started.
 java.util.Date getActualStop()
          Get the actual time when this job was stopped/completed.
 java.lang.String getChannel()
           
 java.lang.Long getContinuationOf()
           
 int getCountDomains()
          Gets the total number of different domains harvested by this job.
 java.util.Date getCreationDate()
          Get the time when this job was created.
 java.util.Map<java.lang.String,java.lang.String> getDomainConfigurationMap()
          Returns a map of domain names and name of their corresponding configuration.
(package private)  long getEdition()
          Get the edition number.
 long getForceMaxBytesPerDomain()
           
 long getForceMaxObjectsPerDomain()
           
 java.lang.String getHarvestAudience()
           
 java.lang.String getHarvestErrorDetails()
          Get the list of harvest error details for this job.
 java.lang.String getHarvestErrors()
          Get the list of harvest errors for this job.
 java.lang.String getHarvestFilenamePrefix()
          Get the harvestFilename prefix.
 int getHarvestNum()
          Get the harvestNum for this job.
 java.util.List<AliasInfo> getJobAliasInfo()
          Get a list of AliasInfo objects for all the domains included in the job.
 java.lang.Long getJobID()
          Get the id of this Job.
 long getMaxBytesPerDomain()
          Gets the maximum number of bytes harvested per domain.
 long getMaxCountObjects()
           
 long getMaxJobRunningTime()
           
 long getMaxObjectsPerDomain()
          Gets the maximum number of objects harvested per domain.
 long getMinCountObjects()
           
 org.dom4j.Document getOrderXMLdoc()
          Gets a document representation of the order.xml associated with this Job.
 java.lang.String getOrderXMLName()
          Get the name of the order XML file used by this Job.
 java.lang.Long getOrigHarvestDefinitionID()
          Get the id of the HarvestDefinition from which this job originates.
 java.lang.Long getResubmittedAsJob()
          Get the ID for the job which this job was resubmitted as.
 java.lang.String getSeedListAsString()
          Get the seedlist as a String.
 org.dom4j.Document[] getSettingsXMLdocs()
          Gets a list of document representations of the settings.xml's associated with this Job.
 java.io.File[] getSettingsXMLfiles()
          Get a list of Heritrix settings.xml files.
 java.util.List<java.lang.String> getSortedSeedList()
          Returns a list of sorted seeds for this job.
 JobStatus getStatus()
          Get the current status of this Job.
 java.util.Date getSubmittedDate()
          Get the time when this job was submitted.
 long getTotalCountObjects()
           
 java.lang.String getUploadErrorDetails()
          Get the list of upload error details.
 java.lang.String getUploadErrors()
          Get the list of upload errors.
 boolean isConfigurationSetsByteLimit()
           
 boolean isConfigurationSetsObjectLimit()
           
 boolean isSnapshot()
           
 void setActualStart(java.util.Date actualStart)
          Set the actual time when this job was started.
 void setActualStop(java.util.Date actualStop)
          Set the actual time when this job was stopped/completed.
 void setChannel(java.lang.String channel)
          Sets the associated HarvestChannel name.
 void setCreationDate(java.util.Date creationDate)
          Set the Date for when this job was created.
(package private)  void setDefaultHarvestNamePrefix()
           
(package private)  void setEdition(long edition)
          Set the edition number.
 void setHarvestAudience(java.lang.String theAudience)
          Set the harvest audience for this job.
 void setHarvestFilenamePrefix(java.lang.String prefix)
           
 void setHarvestNum(int harvestNum)
          Set the harvestNum for this job.
 void setJobID(java.lang.Long id)
          Set the id of this Job.
 void setOrderXMLDoc(org.dom4j.Document doc)
          Set the orderxml for this job.
 void setResubmittedAsJob(java.lang.Long resubmittedAsJob)
          Set the ID for the job which this job was resubmitted as.
 void setSeedList(java.lang.String seedList)
          Set the seedlist of the job from the seedList argument.
 void setSnapshot(boolean isSnapshot)
          Sets whether job belongs to a snapshot or focused harvest.
 void setStatus(JobStatus newStatus)
          Sets status of this job.
 void setSubmittedDate(java.util.Date submittedDate)
          Set the Date for when this job was submitted.
 java.lang.String toString()
          toString method for the Job class.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

configsChanged

boolean configsChanged
A hint to the DAO that configurations have changed. Since configurations are large, the DAO can skip updating the configuration list when this field is false. The DAO can set it to false after saving configurations.
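
The hint works like a dirty flag. The sketch below shows the general pattern only, using hypothetical stand-in classes (JobLike and DaoLike); it does not reproduce the actual Job or DAO code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical dirty-flag illustration; JobLike and DaoLike are stand-ins. */
final class DirtyFlagSketch {
    private DirtyFlagSketch() {
    }

    static final class JobLike {
        boolean configsChanged = false;
        final List<String> configNames = new ArrayList<>();

        void addConfig(String name) {
            configNames.add(name);
            configsChanged = true; // tell the DAO the list needs saving
        }
    }

    static final class DaoLike {
        int configWrites = 0; // counts how often the (large) list was written

        void update(JobLike job) {
            if (job.configsChanged) {
                configWrites++;             // stands in for rewriting the config list
                job.configsChanged = false; // reset after saving
            }
            // Other, cheaper fields would be saved unconditionally here.
        }
    }
}
```

Calling update twice after a single addConfig writes the configuration list only once.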

Constructor Detail

Job

Job(java.lang.Long harvestID,
    DomainConfiguration cfg,
    HarvestChannel channel,
    long forceMaxObjectsPerDomain,
    long forceMaxBytesPerDomain,
    long forceMaxJobRunningTime,
    int harvestNum)
throws ArgumentNotValid
Package private constructor for common initialisation.

Parameters:
harvestID - the id of the harvestdefinition
cfg - the configuration to base the Job on
channel - the channel on which the job will be submitted.
forceMaxObjectsPerDomain - the maximum number of objects harvested from a domain, overrides individual configuration settings. -1 means no limit
forceMaxBytesPerDomain - The maximum number of bytes harvested from a domain, or -1 for no limit.
forceMaxJobRunningTime - The max time in seconds given to the harvester for this job
harvestNum - the run number of the harvest definition
Throws:
ArgumentNotValid - if cfg or priority is null or harvestID is invalid, or if any limit < -1
UnknownID - If the priority is invalid.

Job

Job(java.lang.Long harvestID,
    java.util.Map<java.lang.String,java.lang.String> configurations,
    java.lang.String channel,
    boolean snapshot,
    long forceMaxObjectsPerDomain,
    long forceMaxBytesPerDomain,
    long forceMaxJobRunningTime,
    JobStatus status,
    java.lang.String orderXMLname,
    org.dom4j.Document orderXMLdoc,
    java.lang.String seedlist,
    int harvestNum,
    java.lang.Long continuationOf)
Create a new Job object from basic information storable in the DAO.

Parameters:
harvestID - the id of the harvestdefinition
configurations - the configurations to base the Job on
channel - the name of the channel on which the job will be submitted.
snapshot - whether the job belongs to a snapshot harvest
forceMaxObjectsPerDomain - the maximum number of objects harvested from a domain, overrides individual configuration settings. 0 means no limit.
forceMaxBytesPerDomain - The maximum number of bytes harvested from a domain, or -1 for no limit.
forceMaxJobRunningTime - The max time in seconds given to the harvester for this job
status - the current status of the job.
orderXMLname - the name of the order template used.
orderXMLdoc - the (possibly modified) template
seedlist - the combined seedlist from all configs.
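harvestNum - the run number of the harvest definition
continuationOf - the id of the job that this job is supposed to continue using the Heritrix recover-log, or null if it starts from scratch.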
Method Detail

createJob

public static Job createJob(java.lang.Long harvestID,
                            HarvestChannel channel,
                            DomainConfiguration cfg,
                            int harvestNum)
Create new Job configured according to the properties of the supplied DomainConfiguration.

Parameters:
harvestID - the id of the harvestdefinition
channel - the HarvestChannel
cfg - the configuration to base the Job on
harvestNum - Which run of the harvest definition this is.
Returns:
newly created Job.
Throws:
ArgumentNotValid - if cfg is null or harvestID is invalid

createSnapShotJob

public static Job createSnapShotJob(java.lang.Long harvestID,
                                    HarvestChannel channel,
                                    DomainConfiguration cfg,
                                    long maxObjectsPerDomain,
                                    long maxBytesPerDomain,
                                    long maxJobRunningTime,
                                    int harvestNum)
                             throws ArgumentNotValid
Create new instance of Job suitable for snapshot harvesting. This job is configured according to the properties of the supplied DomainConfiguration. The maximum number of objects retrieved from any domain added to this job is determined by maxObjectsPerDomain, which overrides the configuration settings.

Parameters:
harvestID - the id of the harvestdefinition
channel - the channel for the job
cfg - the configuration to base the Job on
maxObjectsPerDomain - the maximum number of objects to harvest from a domain, overrides individual configuration settings unless the domain has overrideLimits set. 0 means no limit.
maxBytesPerDomain - the maximum number of bytes to harvest from a domain, overrides individual configuration settings unless the domain has overrideLimits set. -1 means no limit.
maxJobRunningTime - The maximum number of seconds that the harvester may spend on the job. 0 means no limit.
harvestNum - Which run of the harvest definition this is (should always be 1).
Returns:
the newly created snapshot Job.
Throws:
ArgumentNotValid - if cfg is null or harvestID is invalid

addConfiguration

public void addConfiguration(DomainConfiguration cfg)
Adds a configuration to this Job. Seedlists and settings are updated accordingly.

Parameters:
cfg - the configuration to add
Throws:
ArgumentNotValid - if cfg is null or cfg uses a different orderxml than this job or if this job already contains a configuration associated with domain of configuration cfg.
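
The three rejection conditions listed above can be sketched as follows. This is an illustration under stated assumptions, not the real method: ConfigurationRules is a hypothetical class, configurations are reduced to plain strings, and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the addConfiguration checks. */
final class ConfigurationRules {
    private final String orderXmlName;
    private final Map<String, String> domainToConfig = new HashMap<>();

    ConfigurationRules(String orderXmlName) {
        this.orderXmlName = orderXmlName;
    }

    /** Adds a configuration, rejecting the cases the documentation lists. */
    void add(String domain, String configName, String configOrderXmlName) {
        if (domain == null || configName == null || configOrderXmlName == null) {
            throw new IllegalArgumentException("configuration must not be null");
        }
        if (!orderXmlName.equals(configOrderXmlName)) {
            throw new IllegalArgumentException("configuration uses a different order.xml");
        }
        if (domainToConfig.containsKey(domain)) {
            throw new IllegalArgumentException("job already has a configuration for " + domain);
        }
        domainToConfig.put(domain, configName);
    }

    /** One configuration per domain, so this is also the domain count. */
    int countDomains() {
        return domainToConfig.size();
    }
}
```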

getOrderXMLName

public java.lang.String getOrderXMLName()
Get the name of the order XML file used by this Job.

Returns:
the name of the orderXML file

getActualStop

public java.util.Date getActualStop()
Get the actual time when this job was stopped/completed.

Returns:
the time as Date

getActualStart

public java.util.Date getActualStart()
Get the actual time when this job was started.

Returns:
the time as Date

getSubmittedDate

public java.util.Date getSubmittedDate()
Get the time when this job was submitted.

Returns:
the time as Date

getCreationDate

public java.util.Date getCreationDate()
Get the time when this job was created.

Returns:
the creation time as a Date

getSettingsXMLfiles

public java.io.File[] getSettingsXMLfiles()
Get a list of Heritrix settings.xml files. Note that these files have nothing to do with NetarchiveSuite settings files. They are files that supplement the Heritrix order.xml files, and contain overrides for specific domains.

Returns:
the list of Files as an array

getOrigHarvestDefinitionID

public java.lang.Long getOrigHarvestDefinitionID()
Get the id of the HarvestDefinition from which this job originates.

Specified by:
getOrigHarvestDefinitionID in interface JobInfo
Returns:
the id as a Long

getJobID

public java.lang.Long getJobID()
Get the id of this Job.

Specified by:
getJobID in interface JobInfo
Returns:
the id as a Long

setJobID

public void setJobID(java.lang.Long id)
Set the id of this Job.

Parameters:
id - The Id for this job.

getCountDomains

public int getCountDomains()
Gets the total number of different domains harvested by this job.

Returns:
the number of configurations added to this job

setActualStart

public void setActualStart(java.util.Date actualStart)
Set the actual time when this job was started. Sends a notification if actualStart is set to a time after actualStop.

Parameters:
actualStart - A Date object representing the time when this job was started.

setActualStop

public void setActualStop(java.util.Date actualStop)
Set the actual time when this job was stopped/completed. Sends a notification if actualStop is set to a time before actualStart.

Parameters:
actualStop - A Date object representing the time when this job was stopped.
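
The consistency check shared by setActualStart and setActualStop can be sketched as below. StartStopSketch is a hypothetical class; collecting messages in a list stands in for whatever notification mechanism the real class uses, and storing the value even when it is inconsistent is an assumption.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/** Hypothetical illustration of the start/stop ordering check. */
final class StartStopSketch {
    private Date actualStart;
    private Date actualStop;
    /** Stand-in for a real notification channel. */
    final List<String> notifications = new ArrayList<>();

    void setActualStart(Date start) {
        if (start != null && actualStop != null && start.after(actualStop)) {
            notifications.add("start " + start + " is after stop " + actualStop);
        }
        actualStart = start; // assumption: the value is stored either way
    }

    void setActualStop(Date stop) {
        if (stop != null && actualStart != null && stop.before(actualStart)) {
            notifications.add("stop " + stop + " is before start " + actualStart);
        }
        actualStop = stop; // assumption: the value is stored either way
    }
}
```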

setOrderXMLDoc

public void setOrderXMLDoc(org.dom4j.Document doc)
Set the orderxml for this job.

Parameters:
doc - An orderxml to be used by this job

getOrderXMLdoc

public org.dom4j.Document getOrderXMLdoc()
Gets a document representation of the order.xml associated with this Job.

Returns:
the XML as a org.dom4j.Document

getSettingsXMLdocs

public org.dom4j.Document[] getSettingsXMLdocs()
Gets a list of document representations of the settings.xml's associated with this Job.

Returns:
the XML as an array of org.dom4j.Document

getSortedSeedList

public java.util.List<java.lang.String> getSortedSeedList()
Returns a list of sorted seeds for this job. The sorting is by domain, and inside each domain, the list is sorted by URL.

Returns:
a list of sorted seeds for this job.
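
A two-level ordering like the one described can be expressed with a chained Comparator. The sketch below is an assumption-laden illustration, not the actual method: SeedSortSketch is a hypothetical class, and it groups seeds by the host part of the URL, which is a simplification of whatever domain extraction NetarchiveSuite applies.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration: sort seeds by host, then by URL. */
final class SeedSortSketch {
    private SeedSortSketch() {
    }

    /** Returns the host of a seed URL, or "" if it cannot be determined. */
    static String hostOf(String seed) {
        try {
            String host = URI.create(seed).getHost();
            return host == null ? "" : host;
        } catch (IllegalArgumentException e) {
            return ""; // unparsable seeds sort first in this sketch
        }
    }

    /** Returns a sorted copy; the input list is left untouched. */
    static List<String> sorted(List<String> seeds) {
        Comparator<String> byHost = Comparator.comparing(SeedSortSketch::hostOf);
        Comparator<String> byHostThenUrl = byHost.thenComparing(Comparator.<String>naturalOrder());
        List<String> result = new ArrayList<>(seeds);
        result.sort(byHostThenUrl);
        return result;
    }
}
```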

setSeedList

public void setSeedList(java.lang.String seedList)
Set the seedlist of the job from the seedList argument. Individual seeds are separated by a '\n' character. Duplicate seeds are removed.

Parameters:
seedList - List of seeds as one String
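
Splitting on '\n' and dropping duplicates can be sketched with a set. SeedListSketch is a hypothetical class; trimming whitespace, skipping empty lines, and keeping first-seen order are assumptions of this sketch and are not stated in the documentation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical illustration of newline-separated seed handling. */
final class SeedListSketch {
    private SeedListSketch() {
    }

    /** Splits the string on '\n' and removes duplicate seeds. */
    static Set<String> parse(String seedList) {
        Set<String> seeds = new LinkedHashSet<>();
        for (String line : seedList.split("\n")) {
            String seed = line.trim(); // assumption: surrounding whitespace is dropped
            if (!seed.isEmpty()) {     // assumption: empty lines are skipped
                seeds.add(seed);       // a set silently ignores duplicates
            }
        }
        return seeds;
    }

    /** Joins the seeds back into one '\n'-separated string. */
    static String join(Set<String> seeds) {
        return String.join("\n", seeds);
    }
}
```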

getSeedListAsString

public java.lang.String getSeedListAsString()
Get the seedlist as a String. The individual seeds are separated by the character '\n'. The order of the seeds is unspecified.

Returns:
the seedlist as a String

getStatus

public JobStatus getStatus()
Get the current status of this Job.

Returns:
the status as a JobStatus.

setStatus

public void setStatus(JobStatus newStatus)
Sets status of this job.

Parameters:
newStatus - Must be one of the values STATUS_NEW, ..., STATUS_FAILED
Throws:
ArgumentNotValid - in case of invalid status argument or invalid status change

getDomainConfigurationMap

public java.util.Map<java.lang.String,java.lang.String> getDomainConfigurationMap()
Returns a map of domain names and name of their corresponding configuration. The returned Map cannot be changed.

Returns:
a read-only Map (domain name, configuration name)
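
A read-only map is commonly returned as an unmodifiable view. The sketch below shows that standard-library pattern with a hypothetical class, ReadOnlyMapSketch; whether Job uses exactly this call is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of returning a read-only map. */
final class ReadOnlyMapSketch {
    private final Map<String, String> domainConfigurationMap = new HashMap<>();

    void put(String domainName, String configurationName) {
        domainConfigurationMap.put(domainName, configurationName);
    }

    /** Callers can read the map, but any attempt to modify it throws. */
    Map<String, String> getDomainConfigurationMap() {
        return Collections.unmodifiableMap(domainConfigurationMap);
    }
}
```

Calling put on the returned map throws UnsupportedOperationException, so callers must go through the owning object to change the mapping.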

getMaxObjectsPerDomain

public long getMaxObjectsPerDomain()
Gets the maximum number of objects harvested per domain.

Returns:
The maximum number of objects harvested per domain. 0 means no limit.

getMaxBytesPerDomain

public long getMaxBytesPerDomain()
Gets the maximum number of bytes harvested per domain.

Returns:
The maximum number of bytes harvested per domain. -1 means no limit.

getEdition

long getEdition()
Get the edition number.

Returns:
The edition number

setEdition

void setEdition(long edition)
Set the edition number.

Parameters:
edition - the new edition number

getChannel

public java.lang.String getChannel()
Returns:
the associated HarvestChannel name.

setChannel

public void setChannel(java.lang.String channel)
Sets the associated HarvestChannel name.

Parameters:
channel - the channel name

isSnapshot

public boolean isSnapshot()
Returns:
true if the job belongs to a snapshot harvest, false if it belongs to a focused harvest.

setSnapshot

public void setSnapshot(boolean isSnapshot)
Sets whether job belongs to a snapshot or focused harvest.

Parameters:
isSnapshot - true if the job belongs to a snapshot harvest, false if it belongs to a focused harvest.

toString

public java.lang.String toString()
toString method for the Job class.

Overrides:
toString in class java.lang.Object
Returns:
a human readable string representing this object.
See Also:
Object.toString()

getForceMaxObjectsPerDomain

public long getForceMaxObjectsPerDomain()
Returns:
the forceMaxObjectsPerDomain. 0 means no limit.

getMaxJobRunningTime

public long getMaxJobRunningTime()
Returns:
the maxJobRunningTime. 0 means no limit.

getHarvestNum

public int getHarvestNum()
Get the harvestNum for this job. The number reflects which run of the harvest definition this is.

Returns:
the harvestNum for this job.

setHarvestNum

public void setHarvestNum(int harvestNum)
Set the harvestNum for this job. The number reflects which run of the harvest definition this is. ONLY TO BE USED IN THE CONSTRUCTION PHASE.

Parameters:
harvestNum - a given harvestNum

getHarvestErrors

public java.lang.String getHarvestErrors()
Get the list of harvest errors for this job. If there are no harvest errors, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the harvest errors for this job or null if no harvest errors.

appendHarvestErrors

public void appendHarvestErrors(java.lang.String harvestErrors)
Append to the list of harvest errors for this job. Nothing happens if the argument harvestErrors is null.

Parameters:
harvestErrors - a string containing harvest errors (may be null)
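
The null handling shared by the append and get methods for errors can be sketched as below. ErrorLogSketch is a hypothetical class, and joining entries with '\n' is an assumption; the documentation does not state which separator is used.

```java
/** Hypothetical illustration of the append/get contract for error strings. */
final class ErrorLogSketch {
    /** Stays null until something non-null has been appended. */
    private String harvestErrors;

    void appendHarvestErrors(String errors) {
        if (errors == null) {
            return; // documented behaviour: nothing happens
        }
        // Assumption: entries are separated by a newline.
        harvestErrors = (harvestErrors == null) ? errors : harvestErrors + "\n" + errors;
    }

    /** Returns the accumulated errors, or null if there are none. */
    String getHarvestErrors() {
        return harvestErrors;
    }
}
```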

getHarvestErrorDetails

public java.lang.String getHarvestErrorDetails()
Get the list of harvest error details for this job. If there are no harvest error details, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the list of harvest error details for this job or null if no harvest error details.

appendHarvestErrorDetails

public void appendHarvestErrorDetails(java.lang.String harvestErrorDetails)
Append to the list of harvest error details for this job. Nothing happens if the argument harvestErrorDetails is null.

Parameters:
harvestErrorDetails - a string containing harvest error details.

getUploadErrors

public java.lang.String getUploadErrors()
Get the list of upload errors. If there are no upload errors, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the list of upload errors as String, or null if no upload errors.

appendUploadErrors

public void appendUploadErrors(java.lang.String uploadErrors)
Append to the list of upload errors. Nothing happens if the argument uploadErrors is null.

Parameters:
uploadErrors - a string containing upload errors.

getUploadErrorDetails

public java.lang.String getUploadErrorDetails()
Get the list of upload error details. If there are no upload error details, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the list of upload error details as String, or null if no upload error details

appendUploadErrorDetails

public void appendUploadErrorDetails(java.lang.String uploadErrorDetails)
Append to the list of upload error details. Nothing happens if the argument uploadErrorDetails is null.

Parameters:
uploadErrorDetails - a string containing upload error details.

getJobAliasInfo

public java.util.List<AliasInfo> getJobAliasInfo()
Get a list of AliasInfo objects for all the domains included in the job.

Returns:
a list of AliasInfo objects for all the domains included in the job.

getResubmittedAsJob

public java.lang.Long getResubmittedAsJob()
Get the ID for the job which this job was resubmitted as. If null, this job has not been resubmitted.

Returns:
this ID.

setSubmittedDate

public void setSubmittedDate(java.util.Date submittedDate)
Set the Date for when this job was submitted. If null, this job has not been submitted.

Parameters:
submittedDate - The date when this was submitted

setCreationDate

public void setCreationDate(java.util.Date creationDate)
Set the Date for when this job was created. If null, this job has not been created.

Parameters:
creationDate - The date when this was created

setResubmittedAsJob

public void setResubmittedAsJob(java.lang.Long resubmittedAsJob)
Set the ID for the job which this job was resubmitted as.

Parameters:
resubmittedAsJob - An Id for a new job.

getContinuationOf

public java.lang.Long getContinuationOf()
Returns:
id of the job that this job is supposed to continue using Heritrix recover-log or null if it starts from scratch.

getHarvestFilenamePrefix

public java.lang.String getHarvestFilenamePrefix()
Description copied from interface: JobInfo
Get the harvestFilename prefix.

Specified by:
getHarvestFilenamePrefix in interface JobInfo
Returns:
the harvestFilename prefix.

setHarvestFilenamePrefix

public void setHarvestFilenamePrefix(java.lang.String prefix)
Parameters:
prefix - the harvestFilename prefix

getForceMaxBytesPerDomain

public long getForceMaxBytesPerDomain()
Returns:
the forceMaxBytesPerDomain

isConfigurationSetsObjectLimit

public boolean isConfigurationSetsObjectLimit()
Returns:
the configurationSetsObjectLimit

isConfigurationSetsByteLimit

public boolean isConfigurationSetsByteLimit()
Returns:
the configurationSetsByteLimit

getMinCountObjects

public long getMinCountObjects()
Returns:
the minCountObjects

getMaxCountObjects

public long getMaxCountObjects()
Returns:
the maxCountObjects

getTotalCountObjects

public long getTotalCountObjects()
Returns:
the totalCountObjects

setDefaultHarvestNamePrefix

void setDefaultHarvestNamePrefix()

getHarvestAudience

public java.lang.String getHarvestAudience()
Returns:
the harvestaudience.

setHarvestAudience

public void setHarvestAudience(java.lang.String theAudience)
Set the harvest audience for this job. Taken from the harvestdefinition that generated this job.

Parameters:
theAudience - the harvestaudience.