Class HeritrixTemplate
- java.lang.Object
-
- dk.netarkivet.harvester.datamodel.HeritrixTemplate
-
- All Implemented Interfaces:
java.io.Serializable
- Direct Known Subclasses:
H1HeritrixTemplate, H3HeritrixTemplate
public abstract class HeritrixTemplate extends java.lang.Object implements java.io.Serializable
Abstract class for manipulating Heritrix templates.
- See Also:
- Serialized Form
-
-
Field Summary
Modifier and Type, Field, Description
protected static java.lang.String
HARVESTINFO_AUDIENCE
protected static java.lang.String
HARVESTINFO_CHANNEL
protected static java.lang.String
HARVESTINFO_HARVESTFILENAMEPREFIX
protected static java.lang.String
HARVESTINFO_HARVESTNUM
protected static java.lang.String
HARVESTINFO_JOBID
protected static java.lang.String
HARVESTINFO_JOBSUBMITDATE
protected static java.lang.String
HARVESTINFO_MAXBYTESPERDOMAIN
protected static java.lang.String
HARVESTINFO_MAXOBJECTSPERDOMAIN
protected static java.lang.String
HARVESTINFO_OPERATOR
protected static java.lang.String
HARVESTINFO_ORDERXMLDESCRIPTION
protected static java.lang.String
HARVESTINFO_ORDERXMLNAME
protected static java.lang.String
HARVESTINFO_ORDERXMLUPDATEDATE
protected static java.lang.String
HARVESTINFO_ORIGHARVESTDEFINITIONCOMMENTS
protected static java.lang.String
HARVESTINFO_ORIGHARVESTDEFINITIONID
protected static java.lang.String
HARVESTINFO_ORIGHARVESTDEFINITIONNAME
protected static java.lang.String
HARVESTINFO_PERFORMER
protected static java.lang.String
HARVESTINFO_SCHEDULENAME
protected static java.lang.String
HARVESTINFO_VERSION
protected static java.lang.String
HARVESTINFO_VERSION_NUMBER
long
template_id
We need the persistent template id if we want to attach any attributes to it.
-
Constructor Summary
Constructor, Description
HeritrixTemplate()
-
Method Summary
Modifier and Type, Method, Description
abstract void
configureQuotaEnforcer(boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
Activates or deactivates the quota enforcer, depending on the budget definition.
void
editOrderXMLAddPerDomainCrawlerTraps(DomainConfiguration cfg)
Updates the order.xml to include a MatchesListRegExpDecideRule for each crawler trap associated with the given DomainConfiguration.
abstract void
enableOrDisableDeduplication(boolean enabled)
abstract java.lang.Long
getMaxBytesPerDomain()
abstract java.lang.Long
getMaxObjectsPerDomain()
static HeritrixTemplate
getTemplateFromString(long template_id, java.lang.String templateAsString)
Construct an H1HeritrixTemplate or H3HeritrixTemplate based on the signature of the given string.
abstract java.lang.String
getXML()
abstract boolean
hasContent()
abstract void
insertAttributes(java.util.List<EAV.AttributeAndType> attributesAndTypes)
Try to insert the given list of attributes into the template.
abstract void
insertCrawlerTraps(java.lang.String elementName, java.util.List<java.lang.String> crawlertraps)
Method to add a list of crawler traps with a given element name.
abstract void
insertUmbrabean(java.lang.String jobName, java.lang.String rabbitMQUrl, java.lang.String limitSearchRegEx)
Inserts all necessary Umbra-related beans in this template.
abstract void
insertWarcInfoMetadata(Job ajob, java.lang.String origHarvestdefinitionName, java.lang.String origHarvestdefinitionComments, java.lang.String scheduleName, java.lang.String performer)
Method to add settings to the WARCWriterProcessor, so that it can generate a proper WARCINFO record.
boolean
isActive()
abstract boolean
IsDeduplicationEnabled()
abstract boolean
isValid()
static HeritrixTemplate
read(long template_id, java.io.Reader orderTemplateReader)
Read the template using the given Reader.
static HeritrixTemplate
read(java.io.File orderXmlFile)
Read the given template from file.
abstract void
removeDeduplicatorIfPresent()
Try to remove the deduplicator, if present in the template.
abstract void
setArchiveFilePrefix(java.lang.String archiveFilePrefix)
abstract void
setArchiveFormat(java.lang.String archiveFormat)
Make sure that Heritrix will archive its data in the chosen archiveFormat.
abstract void
setDeduplicationIndexLocation(java.lang.String absolutePath)
abstract void
setDiskPath(java.lang.String absolutePath)
void
setIsActive(boolean isActive)
abstract void
setMaxBytesPerDomain(java.lang.Long maxbytesL)
abstract void
setMaxJobRunningTime(java.lang.Long maxJobRunningTimeSecondsL)
Set the maximum running time for the harvest.
abstract void
setMaxObjectsPerDomain(java.lang.Long maxobjectsL)
abstract void
setRecoverlogNode(java.io.File recoverlogGzFile)
abstract void
setSeedsFilePath(java.lang.String absolutePath)
abstract void
writeTemplate(java.io.OutputStream os)
abstract void
writeTemplate(javax.servlet.jsp.JspWriter out)
abstract void
writeToFile(java.io.File orderXmlFile)
-
-
-
Field Detail
-
HARVESTINFO_VERSION_NUMBER
protected static final java.lang.String HARVESTINFO_VERSION_NUMBER
- See Also:
- Constant Field Values
-
HARVESTINFO_VERSION
protected static final java.lang.String HARVESTINFO_VERSION
- See Also:
- Constant Field Values
-
HARVESTINFO_JOBID
protected static final java.lang.String HARVESTINFO_JOBID
- See Also:
- Constant Field Values
-
HARVESTINFO_CHANNEL
protected static final java.lang.String HARVESTINFO_CHANNEL
- See Also:
- Constant Field Values
-
HARVESTINFO_HARVESTNUM
protected static final java.lang.String HARVESTINFO_HARVESTNUM
- See Also:
- Constant Field Values
-
HARVESTINFO_ORIGHARVESTDEFINITIONID
protected static final java.lang.String HARVESTINFO_ORIGHARVESTDEFINITIONID
- See Also:
- Constant Field Values
-
HARVESTINFO_MAXBYTESPERDOMAIN
protected static final java.lang.String HARVESTINFO_MAXBYTESPERDOMAIN
- See Also:
- Constant Field Values
-
HARVESTINFO_MAXOBJECTSPERDOMAIN
protected static final java.lang.String HARVESTINFO_MAXOBJECTSPERDOMAIN
- See Also:
- Constant Field Values
-
HARVESTINFO_ORDERXMLNAME
protected static final java.lang.String HARVESTINFO_ORDERXMLNAME
- See Also:
- Constant Field Values
-
HARVESTINFO_ORDERXMLUPDATEDATE
protected static final java.lang.String HARVESTINFO_ORDERXMLUPDATEDATE
- See Also:
- Constant Field Values
-
HARVESTINFO_ORDERXMLDESCRIPTION
protected static final java.lang.String HARVESTINFO_ORDERXMLDESCRIPTION
- See Also:
- Constant Field Values
-
HARVESTINFO_ORIGHARVESTDEFINITIONNAME
protected static final java.lang.String HARVESTINFO_ORIGHARVESTDEFINITIONNAME
- See Also:
- Constant Field Values
-
HARVESTINFO_ORIGHARVESTDEFINITIONCOMMENTS
protected static final java.lang.String HARVESTINFO_ORIGHARVESTDEFINITIONCOMMENTS
- See Also:
- Constant Field Values
-
HARVESTINFO_SCHEDULENAME
protected static final java.lang.String HARVESTINFO_SCHEDULENAME
- See Also:
- Constant Field Values
-
HARVESTINFO_HARVESTFILENAMEPREFIX
protected static final java.lang.String HARVESTINFO_HARVESTFILENAMEPREFIX
- See Also:
- Constant Field Values
-
HARVESTINFO_JOBSUBMITDATE
protected static final java.lang.String HARVESTINFO_JOBSUBMITDATE
- See Also:
- Constant Field Values
-
HARVESTINFO_PERFORMER
protected static final java.lang.String HARVESTINFO_PERFORMER
- See Also:
- Constant Field Values
-
HARVESTINFO_OPERATOR
protected static final java.lang.String HARVESTINFO_OPERATOR
- See Also:
- Constant Field Values
-
HARVESTINFO_AUDIENCE
protected static final java.lang.String HARVESTINFO_AUDIENCE
- See Also:
- Constant Field Values
-
template_id
public long template_id
We need the persistent template id if we want to attach any attributes to it.
-
-
Constructor Detail
-
HeritrixTemplate
public HeritrixTemplate()
-
-
Method Detail
-
configureQuotaEnforcer
public abstract void configureQuotaEnforcer(boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
Activates or deactivates the quota enforcer, depending on the budget definition. The object limit can be defined either by the queue-total-budget property or by the quota enforcer; the value of maxObjectsIsSetByQuotaEnforcer selects which. The quota enforcer is therefore configured as follows:
- If the object limit is not set by the quota enforcer, the quota enforcer is disabled only if there is no byte limit.
- If the object limit is set by the quota enforcer, the quota enforcer is enabled whenever a byte limit or an object limit is set.
- Parameters:
maxObjectsIsSetByQuotaEnforcer
- Whether the object limit is set by the quota enforcer.
forceMaxBytesPerDomain
- The maximum number of bytes per domain to enforce (can be no limit).
forceMaxObjectsPerDomain
- The maximum number of objects per domain to enforce (can be no limit).
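The two rules above can be read as a small truth table. The following is a minimal sketch of that decision only, not the actual implementation in the subclasses. The class `QuotaEnforcerSketch`, its method names, and the convention that a negative value means "no limit" are all assumptions made for illustration.

```java
/** Hypothetical helper for illustration; not part of HeritrixTemplate. */
final class QuotaEnforcerSketch {

    /** Assumption: a negative value encodes "no limit". */
    static boolean hasLimit(long value) {
        return value >= 0;
    }

    /** Returns true if the quota enforcer should be enabled under the two rules described above. */
    static boolean quotaEnforcerEnabled(boolean maxObjectsIsSetByQuotaEnforcer,
                                        long forceMaxBytesPerDomain,
                                        long forceMaxObjectsPerDomain) {
        boolean byteLimitSet = hasLimit(forceMaxBytesPerDomain);
        if (!maxObjectsIsSetByQuotaEnforcer) {
            // Objects are capped via queue-total-budget, so the quota
            // enforcer is only needed when a byte limit exists.
            return byteLimitSet;
        }
        // The quota enforcer handles both limits: enable it if either is set.
        return byteLimitSet || hasLimit(forceMaxObjectsPerDomain);
    }
}
```

With these assumptions, `quotaEnforcerEnabled(false, -1, 500)` is false because the object limit is handled elsewhere and no byte limit exists, while `quotaEnforcerEnabled(true, -1, 500)` is true.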
-
isActive
public boolean isActive()
-
setIsActive
public void setIsActive(boolean isActive)
-
setMaxBytesPerDomain
public abstract void setMaxBytesPerDomain(java.lang.Long maxbytesL)
-
getMaxBytesPerDomain
public abstract java.lang.Long getMaxBytesPerDomain()
-
setMaxObjectsPerDomain
public abstract void setMaxObjectsPerDomain(java.lang.Long maxobjectsL)
-
getMaxObjectsPerDomain
public abstract java.lang.Long getMaxObjectsPerDomain()
-
IsDeduplicationEnabled
public abstract boolean IsDeduplicationEnabled()
- Returns:
- true, if deduplication is enabled in the template (used to determine whether or not to request a deduplication index from the indexserver)
-
isValid
public abstract boolean isValid()
- Returns:
- true, if the template is valid, otherwise false
-
getXML
public abstract java.lang.String getXML()
- Returns:
- the XML behind this template
-
insertCrawlerTraps
public abstract void insertCrawlerTraps(java.lang.String elementName, java.util.List<java.lang.String> crawlertraps)
Method to add a list of crawler traps with a given element name. It is used both to add per-domain traps and global traps.
- Parameters:
elementName
- The name of the added element.
crawlertraps
- A list of crawler trap regular expressions to add to this job.
-
setArchiveFormat
public abstract void setArchiveFormat(java.lang.String archiveFormat)
Make sure that Heritrix will archive its data in the chosen archiveFormat.
- Parameters:
archiveFormat
- the chosen archive format ('arc' or 'warc' supported)
- Throws:
ArgumentNotValid
- If the chosen archiveFormat is not supported.
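The contract above (accept 'arc' or 'warc', reject everything else) could be checked along the following lines. This is a sketch under stated assumptions: `ArchiveFormatSketch` is a hypothetical class, `IllegalArgumentException` stands in for the project's `ArgumentNotValid`, and the case-insensitive, whitespace-tolerant matching is an assumption rather than documented behaviour.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of HeritrixTemplate. */
final class ArchiveFormatSketch {

    /**
     * Returns the normalized format ("arc" or "warc").
     * IllegalArgumentException is a stand-in for ArgumentNotValid.
     */
    static String normalizeArchiveFormat(String archiveFormat) {
        if (archiveFormat == null) {
            throw new IllegalArgumentException("archiveFormat must not be null");
        }
        // Assumption: matching ignores case and surrounding whitespace.
        String normalized = archiveFormat.trim().toLowerCase(Locale.ROOT);
        if (normalized.equals("arc") || normalized.equals("warc")) {
            return normalized;
        }
        throw new IllegalArgumentException(
                "Unsupported archive format: '" + archiveFormat + "'");
    }
}
```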
-
setMaxJobRunningTime
public abstract void setMaxJobRunningTime(java.lang.Long maxJobRunningTimeSecondsL)
Set the maximum running time for the harvest.
- Parameters:
maxJobRunningTimeSecondsL
- Limit the harvest to this number of seconds
-
insertAttributes
public abstract void insertAttributes(java.util.List<EAV.AttributeAndType> attributesAndTypes)
Try to insert the given list of attributes into the template.
- Parameters:
attributesAndTypes
-
-
editOrderXMLAddPerDomainCrawlerTraps
public void editOrderXMLAddPerDomainCrawlerTraps(DomainConfiguration cfg)
Updates the order.xml to include a MatchesListRegExpDecideRule for each crawler trap associated with the given DomainConfiguration. The added nodes have the form
REJECT OR theFirstRegexp theSecondRegexp
- Parameters:
cfg
- The DomainConfiguration for which to generate crawler trap decide rules
- Throws:
IllegalState
- If unable to update order.xml due to wrong order.xml format
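To illustrate what a REJECT rule combined with OR over a list of regular expressions amounts to, here is a minimal, self-contained sketch. It does not touch order.xml and is not Heritrix code. `CrawlerTrapSketch` is a hypothetical class, and the use of whole-string matching (`Matcher.matches()`) is an assumption about how the decide rule compares a URI against each expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of REJECT/OR semantics; not Heritrix code. */
final class CrawlerTrapSketch {

    private final List<Pattern> traps = new ArrayList<>();

    /** Compiles every trap up front; a malformed regex throws PatternSyntaxException. */
    CrawlerTrapSketch(List<String> crawlerTrapRegexps) {
        for (String regexp : crawlerTrapRegexps) {
            traps.add(Pattern.compile(regexp));
        }
    }

    /** OR semantics: the URI is rejected if any single trap matches it entirely. */
    boolean rejects(String uri) {
        for (Pattern trap : traps) {
            if (trap.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Because the match is against the whole URI in this sketch, a trap for a path segment needs leading and trailing `.*`, as in `.*/calendar/.*`.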
-
setDeduplicationIndexLocation
public abstract void setDeduplicationIndexLocation(java.lang.String absolutePath)
-
setSeedsFilePath
public abstract void setSeedsFilePath(java.lang.String absolutePath)
-
setArchiveFilePrefix
public abstract void setArchiveFilePrefix(java.lang.String archiveFilePrefix)
-
setDiskPath
public abstract void setDiskPath(java.lang.String absolutePath)
-
writeTemplate
public abstract void writeTemplate(java.io.OutputStream os) throws java.io.IOException, ArgumentNotValid
- Throws:
java.io.IOException
ArgumentNotValid
-
writeTemplate
public abstract void writeTemplate(javax.servlet.jsp.JspWriter out)
-
hasContent
public abstract boolean hasContent()
-
writeToFile
public abstract void writeToFile(java.io.File orderXmlFile)
-
setRecoverlogNode
public abstract void setRecoverlogNode(java.io.File recoverlogGzFile)
-
getTemplateFromString
public static HeritrixTemplate getTemplateFromString(long template_id, java.lang.String templateAsString)
Construct an H1HeritrixTemplate or H3HeritrixTemplate based on the signature of the given string.
- Parameters:
template_id
- The id of the template
templateAsString
- The template as a String object
- Returns:
- a HeritrixTemplate based on the signature of the given string.
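The page does not say which signature is inspected. As a rough sketch of signature-based dispatch, the example below assumes that a Heritrix 1 order.xml can be recognised by its `crawl-order` root element and a Heritrix 3 crawler-beans file by its Spring `beans` root element. Both markers, the class `TemplateSignatureSketch`, and the `Kind` enum are assumptions for illustration; the real method returns a template object rather than an enum.

```java
/** Hypothetical illustration of signature-based dispatch; not the library's code. */
final class TemplateSignatureSketch {

    enum Kind { H1, H3, UNKNOWN }

    /**
     * Assumed markers: "<crawl-order" for Heritrix 1 order.xml,
     * "<beans" for Heritrix 3 Spring configuration.
     */
    static Kind detect(String templateAsString) {
        if (templateAsString == null) {
            return Kind.UNKNOWN;
        }
        if (templateAsString.contains("<crawl-order")) {
            return Kind.H1;
        }
        if (templateAsString.contains("<beans")) {
            return Kind.H3;
        }
        return Kind.UNKNOWN;
    }
}
```

A factory built on this idea would construct the H1 or H3 subclass for the first two cases and report an error for the third.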
-
read
public static HeritrixTemplate read(java.io.File orderXmlFile)
Read the given template from file.
- Parameters:
orderXmlFile
- a given HeritrixTemplate (H1 or H3) as a File
- Returns:
- the given HeritrixTemplate (H1 or H3) as a HeritrixTemplate object
-
read
public static HeritrixTemplate read(long template_id, java.io.Reader orderTemplateReader)
Read the template using the given Reader.
- Parameters:
template_id
- The id of the template
orderTemplateReader
- A given Reader to read a template
- Returns:
- a HeritrixTemplate object
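One plausible way to get from a Reader to a template is to drain the Reader into a String first, so the content can be inspected as a whole. Whether read does this internally is not stated here, so the following is only a sketch; `ReaderSketch` and `readAll` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration; not part of HeritrixTemplate. */
final class ReaderSketch {

    /** Reads all remaining characters from the Reader into a String. */
    static String readAll(Reader reader) {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int charsRead;
            // read() returns -1 once the end of the stream is reached.
            while ((charsRead = reader.read(buffer)) != -1) {
                content.append(buffer, 0, charsRead);
            }
        } catch (IOException e) {
            // Wrapped so that callers do not need a checked-exception clause.
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }
}
```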
-
removeDeduplicatorIfPresent
public abstract void removeDeduplicatorIfPresent()
Try to remove the deduplicator, if present in the template.
-
enableOrDisableDeduplication
public abstract void enableOrDisableDeduplication(boolean enabled)
-
insertWarcInfoMetadata
public abstract void insertWarcInfoMetadata(Job ajob, java.lang.String origHarvestdefinitionName, java.lang.String origHarvestdefinitionComments, java.lang.String scheduleName, java.lang.String performer)
Method to add settings to the WARCWriterProcessor, so that it can generate a proper WARCINFO record.
- Parameters:
ajob
- a HarvestJob
origHarvestdefinitionName
- The name of the harvestdefinition behind this job
scheduleName
- The name of the schedule used. (Will be null if the job is not a selectiveHarvest.)
performer
- The name of the organisation/person doing this harvest
-
insertUmbrabean
public abstract void insertUmbrabean(java.lang.String jobName, java.lang.String rabbitMQUrl, java.lang.String limitSearchRegEx)
Inserts all necessary Umbra-related beans in this template.
- Parameters:
jobName
- a String representing the job; must be unique within this NAS environment for all time
rabbitMQUrl
- the URL of the RabbitMQ socket connection (amqp://) to which Umbra requests are to be sent
limitSearchRegEx
- the regular expression used to limit the Heritrix search path of URLs to be sent to Umbra.
-
-