dk.netarkivet.harvester.datamodel
Class Job

java.lang.Object
  extended by dk.netarkivet.harvester.datamodel.Job
All Implemented Interfaces:
java.io.Serializable

public class Job
extends java.lang.Object
implements java.io.Serializable

This class represents one job to be run by Heritrix. A job is based on a number of configurations, all of which use the same order.xml, with at most one configuration per domain. The configurations in a job are of approximately the same size: the expected size of the largest configuration is within a factor limMaxRelSize of the expected size of the smallest (although differences smaller than limMinAbsSize are ignored). The total size of the job, measured in objects, is limited by limMaxTotalSize. A job may also be limited on bytes or objects, as defined either by the configurations in the job or by the harvest definition that generated the job. The job contains the order file, the seedlist and the current status of the job, as well as the ID of the harvest definition that defined it and the names of all the configurations it is based on.
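
The size rule in the description above can be illustrated with a minimal sketch. The class JobSizeRule and the method sizesCompatible are hypothetical names, and reading limMaxRelSize as a multiplicative factor is an assumption; this is not the NetarchiveSuite implementation.

```java
public class JobSizeRule {
    /**
     * Hypothetical check: returns true if the smallest and largest expected
     * configuration sizes are close enough to share one job.
     * Overflow for very large values is ignored in this sketch.
     */
    public static boolean sizesCompatible(long minExpected, long maxExpected,
                                          long limMaxRelSize, long limMinAbsSize) {
        if (minExpected < 0 || maxExpected < minExpected) {
            throw new IllegalArgumentException(
                    "need 0 <= minExpected <= maxExpected");
        }
        // Differences smaller than limMinAbsSize are ignored.
        if (maxExpected - minExpected < limMinAbsSize) {
            return true;
        }
        // Otherwise the largest expectation must be within a factor
        // limMaxRelSize of the smallest (assumed interpretation).
        return maxExpected <= minExpected * limMaxRelSize;
    }
}
```

Under these assumptions, two configurations with expectations 100 and 150 share a job whenever limMinAbsSize is above 50, whatever the relative factor is.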

See Also:
Serialized Form

Field Summary
(package private)  boolean configsChanged
          A hint to the DAO that configurations have changed.
 
Constructor Summary
Job(java.lang.Long harvestID, DomainConfiguration cfg, JobPriority priority, long forceMaxObjectsPerDomain, long forceMaxBytesPerDomain, long forceMaxJobRunningTime, int harvestNum)
          Package private constructor for common initialisation.
Job(java.lang.Long harvestID, java.util.Map<java.lang.String,java.lang.String> configurations, JobPriority priority, long forceMaxObjectsPerDomain, long forceMaxBytesPerDomain, long forceMaxJobRunningTime, JobStatus status, java.lang.String orderXMLname, org.dom4j.Document orderXMLdoc, java.lang.String seedlist, int harvestNum)
          Create a new Job object from basic information storable in the DAO.
 
Method Summary
 void addConfiguration(DomainConfiguration cfg)
          Adds a configuration to this Job.
 void appendHarvestErrorDetails(java.lang.String harvestErrorDetails)
          Append to the list of harvest error details for this job.
 void appendHarvestErrors(java.lang.String harvestErrors)
          Append to the list of harvest errors for this job.
 void appendUploadErrorDetails(java.lang.String uploadErrorDetails)
          Append to the list of upload error details.
 void appendUploadErrors(java.lang.String uploadErrors)
          Append to the list of upload errors.
 boolean canAccept(DomainConfiguration cfg)
          Tests if a configuration fits into this Job.
static Job createJob(java.lang.Long harvestID, DomainConfiguration cfg, int harvestNum)
          Create new Job configured according to the properties of the supplied DomainConfiguration.
static Job createSnapShotJob(java.lang.Long harvestID, DomainConfiguration cfg, long maxObjectsPerDomain, long maxBytesPerDomain, long maxJobRunningTime, int harvestNum)
          Create new instance of Job suitable for snapshot harvesting.
 java.util.Date getActualStart()
          Get the actual time when this job was started.
 java.util.Date getActualStop()
          Get the actual time when this job was stopped/completed.
 int getCountDomains()
          Gets the total number of different domains harvested by this job.
 java.util.Map<java.lang.String,java.lang.String> getDomainConfigurationMap()
          Returns a map of domain names and name of their corresponding configuration.
(package private)  long getEdition()
          Get the edition number.
(package private)  long getForceMaxObjectsPerDomain()
           
 java.lang.String getHarvestErrorDetails()
          Get the list of harvest error details for this job.
 java.lang.String getHarvestErrors()
          Get the list of harvest errors for this job.
 int getHarvestNum()
          Get the harvestNum for this job.
 java.util.List<AliasInfo> getJobAliasInfo()
          Get a list of AliasInfo objects for all the domains included in the job.
 java.lang.Long getJobID()
          Get the id of this Job.
 long getMaxBytesPerDomain()
          Gets the maximum number of bytes harvested per domain.
 long getMaxJobRunningTime()
           
 long getMaxObjectsPerDomain()
          Gets the maximum number of objects harvested per domain.
 org.dom4j.Document getOrderXMLdoc()
          Gets a document representation of the order.xml associated with this Job.
 java.lang.String getOrderXMLName()
          Get the name of the order XML file used by this Job.
 java.lang.Long getOrigHarvestDefinitionID()
          Get the id of the HarvestDefinition from which this job originates.
 JobPriority getPriority()
          Get the priority of this job.
 java.lang.Long getResubmittedAsJob()
          Get the ID for the job which this job was resubmitted as.
 java.lang.String getSeedListAsString()
          Get the seedlist as a String.
 org.dom4j.Document[] getSettingsXMLdocs()
          Gets a list of document representations of the settings.xml files associated with this Job.
 java.io.File[] getSettingsXMLfiles()
          Get a list of Heritrix settings.xml files.
 java.util.List<java.lang.String> getSortedSeedList()
          Returns a list of sorted seeds for this job.
 JobStatus getStatus()
          Get the current status of this Job.
 java.util.Date getSubmittedDate()
          Get the time when this job was submitted.
 java.lang.String getUploadErrorDetails()
          Get the list of upload error details.
 java.lang.String getUploadErrors()
          Get the list of upload errors.
 void setActualStart(java.util.Date actualStart)
          Set the actual time when this job was started.
 void setActualStop(java.util.Date actualStop)
          Set the actual time when this job was stopped/completed.
(package private)  void setEdition(long edition)
          Set the edition number.
 void setHarvestNum(int harvestNum)
          Set the harvestNum for this job.
 void setJobID(java.lang.Long id)
          Set the id of this Job.
 void setOrderXMLDoc(org.dom4j.Document doc)
          Set the orderxml for this job.
 void setResubmittedAsJob(java.lang.Long resubmittedAsJob)
          Set the ID for the job which this job was resubmitted as.
 void setSeedList(java.lang.String seedList)
          Set the seedlist from a String in which the individual seeds are separated by a '\n' character.
 void setStatus(JobStatus newStatus)
          Sets status of this job.
 void setSubmittedDate(java.util.Date submittedDate)
          Set the Date for when this job was submitted.
 java.lang.String toString()
          toString method for the Job class.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

configsChanged

boolean configsChanged
A hint to the DAO that configurations have changed. Since configurations are large, the DAO can skip updating the config list when this is false. The DAO can set it to false after saving configurations.

Constructor Detail

Job

Job(java.lang.Long harvestID,
    DomainConfiguration cfg,
    JobPriority priority,
    long forceMaxObjectsPerDomain,
    long forceMaxBytesPerDomain,
    long forceMaxJobRunningTime,
    int harvestNum)
throws ArgumentNotValid
Package private constructor for common initialisation.

Parameters:
harvestID - the id of the harvestdefinition
cfg - the configuration to base the Job on
priority - the priority of the job
forceMaxObjectsPerDomain - the maximum number of objects harvested from a domain, overrides individual configuration settings. -1 means no limit
forceMaxBytesPerDomain - The maximum number of bytes harvested from a domain, or -1 for no limit.
forceMaxJobRunningTime - The max time in seconds given to the harvester for this job
harvestNum - the run number of the harvest definition
Throws:
ArgumentNotValid - if cfg or priority is null or harvestID is invalid, or if any limit < -1
UnknownID - If the priority is invalid.

Job

Job(java.lang.Long harvestID,
    java.util.Map<java.lang.String,java.lang.String> configurations,
    JobPriority priority,
    long forceMaxObjectsPerDomain,
    long forceMaxBytesPerDomain,
    long forceMaxJobRunningTime,
    JobStatus status,
    java.lang.String orderXMLname,
    org.dom4j.Document orderXMLdoc,
    java.lang.String seedlist,
    int harvestNum)
Create a new Job object from basic information storable in the DAO.

Parameters:
harvestID - the id of the harvestdefinition
configurations - the configurations to base the Job on
priority - the priority of the job
forceMaxObjectsPerDomain - the maximum number of objects harvested from a domain, overrides individual configuration settings. 0 means no limit.
forceMaxBytesPerDomain - The maximum number of bytes harvested from a domain, or -1 for no limit.
forceMaxJobRunningTime - The max time in seconds given to the harvester for this job
status - the current status of the job.
orderXMLname - the name of the order template used.
orderXMLdoc - the (possibly modified) template
seedlist - the combined seedlist from all configs.
harvestNum - the run number of the harvest definition
Method Detail

createJob

public static Job createJob(java.lang.Long harvestID,
                            DomainConfiguration cfg,
                            int harvestNum)
Create new Job configured according to the properties of the supplied DomainConfiguration.

Parameters:
harvestID - the id of the harvestdefinition
cfg - the configuration to base the Job on
harvestNum - Which run of the harvest definition this is.
Returns:
newly created Job.
Throws:
ArgumentNotValid - if cfg is null or harvestID is invalid

createSnapShotJob

public static Job createSnapShotJob(java.lang.Long harvestID,
                                    DomainConfiguration cfg,
                                    long maxObjectsPerDomain,
                                    long maxBytesPerDomain,
                                    long maxJobRunningTime,
                                    int harvestNum)
                             throws ArgumentNotValid
Create new instance of Job suitable for snapshot harvesting. This job is configured according to the properties of the supplied DomainConfiguration. The maximum number of objects retrieved from each domain added to this job is determined by maxObjectsPerDomain, which overrides the individual configuration settings.

Parameters:
harvestID - the id of the harvestdefinition
cfg - the configuration to base the Job on
maxObjectsPerDomain - the maximum number of objects to harvest from a domain, overrides individual configuration settings unless the domain has overrideLimits set. 0 means no limit.
maxBytesPerDomain - the maximum number of bytes to harvest from a domain, overrides individual configuration settings unless the domain has overrideLimits set. -1 means no limit.
maxJobRunningTime - The maximum number of seconds the harvester may spend on this job. 0 means no limit.
harvestNum - Which run of the harvest definition this is (should always be 1).
Returns:
SnapShotJob
Throws:
ArgumentNotValid - if cfg is null or harvestID is invalid

addConfiguration

public void addConfiguration(DomainConfiguration cfg)
Adds a configuration to this Job. Seedlists and settings are updated accordingly.

Parameters:
cfg - the configuration to add
Throws:
ArgumentNotValid - if cfg is null or cfg uses a different orderxml than this job or if this job already contains a configuration associated with domain of configuration cfg.

canAccept

public boolean canAccept(DomainConfiguration cfg)
Tests if a configuration fits into this Job. First tests whether the configuration uses the right type of order template and whether its byte limit is right for the job. The Job limits are then compared against the configuration estimates; true is returned if no limits are exceeded, otherwise false.

Parameters:
cfg - the configuration to check
Returns:
true if adding the configuration to this Job does not exceed any of the Job limits.
Throws:
ArgumentNotValid - if cfg is null
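
How canAccept and addConfiguration might be combined when distributing configurations over jobs can be sketched with toy stand-ins. Config and MiniJob below are hypothetical simplifications of DomainConfiguration and Job: they only model the "same order template" and "at most one configuration per domain" rules, and leave out the size and byte-limit checks that the real canAccept is documented to perform. The greedy pack loop is likewise an assumption, not the actual job generation code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class JobPackingSketch {
    /** Hypothetical stand-in for DomainConfiguration. */
    public static final class Config {
        public final String domain;
        public final String orderXmlName;

        public Config(String domain, String orderXmlName) {
            this.domain = domain;
            this.orderXmlName = orderXmlName;
        }
    }

    /** Hypothetical stand-in for Job. */
    public static final class MiniJob {
        public final String orderXmlName;
        public final Set<String> domains = new HashSet<>();

        public MiniJob(Config first) {
            this.orderXmlName = first.orderXmlName;
            this.domains.add(first.domain);
        }

        /** True if cfg uses the same order template and its domain is new. */
        public boolean canAccept(Config cfg) {
            return orderXmlName.equals(cfg.orderXmlName)
                    && !domains.contains(cfg.domain);
        }

        /** Adds cfg, rejecting configurations that do not fit. */
        public void addConfiguration(Config cfg) {
            if (!canAccept(cfg)) {
                throw new IllegalArgumentException(
                        "configuration does not fit: " + cfg.domain);
            }
            domains.add(cfg.domain);
        }
    }

    /** Greedy packing: use the first job that accepts cfg, else start a new one. */
    public static List<MiniJob> pack(List<Config> configs) {
        List<MiniJob> jobs = new ArrayList<>();
        for (Config cfg : configs) {
            MiniJob target = null;
            for (MiniJob job : jobs) {
                if (job.canAccept(cfg)) {
                    target = job;
                    break;
                }
            }
            if (target == null) {
                jobs.add(new MiniJob(cfg));
            } else {
                target.addConfiguration(cfg);
            }
        }
        return jobs;
    }
}
```

Calling canAccept before addConfiguration avoids the exception path that addConfiguration is documented to take for configurations with a different orderxml or an already present domain.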

getOrderXMLName

public java.lang.String getOrderXMLName()
Get the name of the order XML file used by this Job.

Returns:
the name of the orderXML file

getActualStop

public java.util.Date getActualStop()
Get the actual time when this job was stopped/completed.

Returns:
the time as Date

getActualStart

public java.util.Date getActualStart()
Get the actual time when this job was started.

Returns:
the time as Date

getSubmittedDate

public java.util.Date getSubmittedDate()
Get the time when this job was submitted.

Returns:
the time as Date

getSettingsXMLfiles

public java.io.File[] getSettingsXMLfiles()
Get a list of Heritrix settings.xml files. Note that these files have nothing to do with NetarchiveSuite settings files. They are files that supplement the Heritrix order.xml files, and contain overrides for specific domains.

Returns:
the list of Files as an array

getOrigHarvestDefinitionID

public java.lang.Long getOrigHarvestDefinitionID()
Get the id of the HarvestDefinition from which this job originates.

Returns:
the id as a Long

getJobID

public java.lang.Long getJobID()
Get the id of this Job.

Returns:
the id as a Long

setJobID

public void setJobID(java.lang.Long id)
Set the id of this Job.

Parameters:
id - The Id for this job.

getCountDomains

public int getCountDomains()
Gets the total number of different domains harvested by this job.

Returns:
the number of configurations added to this job

setActualStart

public void setActualStart(java.util.Date actualStart)
Set the actual time when this job was started. Sends a notification if actualStart is set to a time after actualStop.

Parameters:
actualStart - A Date object representing the time when this job was started.

setActualStop

public void setActualStop(java.util.Date actualStop)
Set the actual time when this job was stopped/completed. Sends a notification if actualStop is set to a time before actualStart.

Parameters:
actualStop - A Date object representing the time when this job was stopped.
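
The consistency condition that setActualStart and setActualStop check can be sketched as a small predicate. JobTimesSketch and isConsistent are hypothetical names; treating a missing start or stop time as consistent is an assumption, and the notification mechanism itself is not shown because the documentation does not describe it.

```java
import java.util.Date;

public class JobTimesSketch {
    /**
     * Returns false only when both times are known and the stop time
     * lies before the start time, i.e. the case that would trigger
     * a notification according to the documentation above.
     */
    public static boolean isConsistent(Date actualStart, Date actualStop) {
        if (actualStart == null || actualStop == null) {
            // Assumption: nothing to compare until both times are set.
            return true;
        }
        return !actualStop.before(actualStart);
    }
}
```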

setOrderXMLDoc

public void setOrderXMLDoc(org.dom4j.Document doc)
Set the orderxml for this job.

Parameters:
doc - An orderxml to be used by this job

getOrderXMLdoc

public org.dom4j.Document getOrderXMLdoc()
Gets a document representation of the order.xml associated with this Job.

Returns:
the XML as a org.dom4j.Document

getSettingsXMLdocs

public org.dom4j.Document[] getSettingsXMLdocs()
Gets a list of document representations of the settings.xml files associated with this Job.

Returns:
the XML as an array of org.dom4j.Document

getSortedSeedList

public java.util.List<java.lang.String> getSortedSeedList()
Returns a list of sorted seeds for this job. The sorting is by domain, and inside each domain, the list is sorted by URL.

Returns:
a list of sorted seeds for this job.
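
A two-level ordering of this kind can be sketched with a chained Comparator. SortedSeedsSketch, hostOf and sortSeeds are hypothetical names, and the host part of the URL is used here as a stand-in for the domain, which is an assumption: NetarchiveSuite may derive the domain differently.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedSeedsSketch {
    /** Approximates the domain of a seed by its host name ("" if unknown). */
    static String hostOf(String seed) {
        try {
            String host = URI.create(seed).getHost();
            return host == null ? "" : host;
        } catch (IllegalArgumentException e) {
            // Seeds that are not valid URIs sort first in this sketch.
            return "";
        }
    }

    /** Returns a new list sorted by host, then by the full URL string. */
    public static List<String> sortSeeds(List<String> seeds) {
        Comparator<String> byHost = Comparator.comparing(SortedSeedsSketch::hostOf);
        List<String> sorted = new ArrayList<>(seeds);
        sorted.sort(byHost.thenComparing(Comparator.naturalOrder()));
        return sorted;
    }
}
```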

setSeedList

public void setSeedList(java.lang.String seedList)
Set the seedlist from a String in which the individual seeds are separated by a '\n' character. Duplicate seeds are removed.

Parameters:
seedList - List of seeds as one String
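
The newline-separated format and the removal of duplicates can be illustrated as follows. SeedListSketch, parseSeeds and joinSeeds are hypothetical names; trimming whitespace and skipping empty lines are assumptions that the documentation does not confirm.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SeedListSketch {
    /** Splits a '\n'-separated seed list and drops duplicate seeds. */
    public static Set<String> parseSeeds(String seedList) {
        // A LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> seeds = new LinkedHashSet<>();
        for (String line : seedList.split("\n")) {
            String seed = line.trim();
            if (!seed.isEmpty()) {
                seeds.add(seed);
            }
        }
        return seeds;
    }

    /** Joins seeds back into one String, one seed per line. */
    public static String joinSeeds(Set<String> seeds) {
        return String.join("\n", seeds);
    }
}
```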

getSeedListAsString

public java.lang.String getSeedListAsString()
Get the seedlist as a String. The individual seeds are separated by the character '\n'. The order of the seeds is not specified.

Returns:
the seedlist as a String

getStatus

public JobStatus getStatus()
Get the current status of this Job.

Returns:
the status as a JobStatus.

setStatus

public void setStatus(JobStatus newStatus)
Sets status of this job.

Parameters:
newStatus - Must be one of the values STATUS_NEW, ..., STATUS_FAILED
Throws:
ArgumentNotValid - in case of invalid status argument or invalid status change

getDomainConfigurationMap

public java.util.Map<java.lang.String,java.lang.String> getDomainConfigurationMap()
Returns a map of domain names and name of their corresponding configuration. The returned Map cannot be changed.

Returns:
a read-only Map (domain name, configuration name)
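
One common way to hand out a map that callers cannot change is an unmodifiable view. The sketch below assumes this technique; ReadOnlyMapSketch and its addConfiguration method are hypothetical, and the documentation does not say how Job itself makes the map read-only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ReadOnlyMapSketch {
    // Maps domain name to the name of its configuration.
    private final Map<String, String> domainConfigurationMap = new HashMap<>();

    /** Hypothetical mutator, standing in for Job.addConfiguration. */
    public void addConfiguration(String domainName, String configName) {
        domainConfigurationMap.put(domainName, configName);
    }

    /** Returns a view that throws UnsupportedOperationException on writes. */
    public Map<String, String> getDomainConfigurationMap() {
        return Collections.unmodifiableMap(domainConfigurationMap);
    }
}
```

Because the result is a view rather than a copy, later additions through the owning object remain visible to callers that hold the returned map.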

getMaxObjectsPerDomain

public long getMaxObjectsPerDomain()
Gets the maximum number of objects harvested per domain.

Returns:
The maximum number of objects harvested per domain. 0 means no limit.

getMaxBytesPerDomain

public long getMaxBytesPerDomain()
Gets the maximum number of bytes harvested per domain.

Returns:
The maximum number of bytes harvested per domain. -1 means no limit.
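
A caller interpreting this value has to treat -1 as a sentinel rather than as a real limit. The following sketch shows one way to do that; LimitSketch and byteLimitReached are hypothetical names, and treating "equal to the limit" as reached is an assumption.

```java
public class LimitSketch {
    /** Sentinel documented for getMaxBytesPerDomain: no limit. */
    public static final long NO_BYTE_LIMIT = -1L;

    /** True if a real limit is set and harvestedBytes has reached it. */
    public static boolean byteLimitReached(long harvestedBytes,
                                           long maxBytesPerDomain) {
        if (maxBytesPerDomain == NO_BYTE_LIMIT) {
            return false; // unlimited: the limit can never be reached
        }
        return harvestedBytes >= maxBytesPerDomain;
    }
}
```

Note that getMaxObjectsPerDomain documents 0, not -1, as its "no limit" value, so the same helper cannot be reused for object limits without changing the sentinel.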

getEdition

long getEdition()
Get the edition number.

Returns:
The edition number

setEdition

void setEdition(long edition)
Set the edition number.

Parameters:
edition - the new edition number

toString

public java.lang.String toString()
toString method for the Job class.

Overrides:
toString in class java.lang.Object
Returns:
a human readable string representing this object.
See Also:
Object.toString()

getForceMaxObjectsPerDomain

long getForceMaxObjectsPerDomain()
Returns:
Returns the forceMaxObjectsPerDomain. 0 means no limit.

getMaxJobRunningTime

public long getMaxJobRunningTime()
Returns:
Returns the MaxJobRunningTime. 0 means no limit.

getPriority

public JobPriority getPriority()
Get the priority of this job.

Returns:
The priority. The return value can only be one of the defined priorities: LOWPRIORITY or HIGHPRIORITY

getHarvestNum

public int getHarvestNum()
Get the harvestNum for this job. The number reflects which run of the harvest definition this is.

Returns:
the harvestNum for this job.

setHarvestNum

public void setHarvestNum(int harvestNum)
Set the harvestNum for this job. The number reflects which run of the harvest definition this is. ONLY TO BE USED IN THE CONSTRUCTION PHASE.

Parameters:
harvestNum - a given harvestNum

getHarvestErrors

public java.lang.String getHarvestErrors()
Get the list of harvest errors for this job. If there are no harvest errors, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the harvest errors for this job or null if no harvest errors.

appendHarvestErrors

public void appendHarvestErrors(java.lang.String harvestErrors)
Append to the list of harvest errors for this job. Nothing happens if the argument harvestErrors is null.

Parameters:
harvestErrors - a string containing harvest errors (may be null)
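
The null handling shared by the append and get methods for errors can be sketched as below. ErrorLogSketch is a hypothetical class, and the newline separator between appended entries is an assumption; the documentation only states that a null argument is ignored and that the getter returns null when nothing was recorded.

```java
public class ErrorLogSketch {
    // Stays null until the first non-null append, so the getter can
    // report "no errors" as null.
    private String errors;

    /** Appends more to the recorded errors; a null argument is ignored. */
    public void append(String more) {
        if (more == null) {
            return;
        }
        // Assumption: entries are separated by a newline.
        errors = (errors == null) ? more : errors + "\n" + more;
    }

    /** Returns the recorded errors, or null if there are none. */
    public String get() {
        return errors;
    }
}
```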

getHarvestErrorDetails

public java.lang.String getHarvestErrorDetails()
Get the list of harvest error details for this job. If there are no harvest error details, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the list of harvest error details for this job or null if no harvest error details.

appendHarvestErrorDetails

public void appendHarvestErrorDetails(java.lang.String harvestErrorDetails)
Append to the list of harvest error details for this job. Nothing happens if the argument harvestErrorDetails is null.

Parameters:
harvestErrorDetails - a string containing harvest error details.

getUploadErrors

public java.lang.String getUploadErrors()
Get the list of upload errors. If there are no upload errors, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the list of upload errors as String, or null if no upload errors.

appendUploadErrors

public void appendUploadErrors(java.lang.String uploadErrors)
Append to the list of upload errors. Nothing happens if the argument uploadErrors is null.

Parameters:
uploadErrors - a string containing upload errors.

getUploadErrorDetails

public java.lang.String getUploadErrorDetails()
Get the list of upload error details. If there are no upload error details, null is returned. This value is not meaningful until the job is finished (FAILED, DONE, RESUBMITTED).

Returns:
the list of upload error details as String, or null if no upload error details

appendUploadErrorDetails

public void appendUploadErrorDetails(java.lang.String uploadErrorDetails)
Append to the list of upload error details. Nothing happens if the argument uploadErrorDetails is null.

Parameters:
uploadErrorDetails - a string containing upload error details.

getJobAliasInfo

public java.util.List<AliasInfo> getJobAliasInfo()
Get a list of AliasInfo objects for all the domains included in the job.

Returns:
a list of AliasInfo objects for all the domains included in the job.

getResubmittedAsJob

public java.lang.Long getResubmittedAsJob()
Get the ID for the job which this job was resubmitted as. If null, this job has not been resubmitted.

Returns:
this ID.

setSubmittedDate

public void setSubmittedDate(java.util.Date submittedDate)
Set the Date for when this job was submitted. If null, this job has not been submitted.

Parameters:
submittedDate - The date when this was submitted

setResubmittedAsJob

public void setResubmittedAsJob(java.lang.Long resubmittedAsJob)
Set the ID for the job which this job was resubmitted as.

Parameters:
resubmittedAsJob - An Id for a new job.