Class H3HeritrixTemplate
- java.lang.Object
-
- dk.netarkivet.harvester.datamodel.HeritrixTemplate
-
- dk.netarkivet.harvester.datamodel.H3HeritrixTemplate
-
- All Implemented Interfaces:
java.io.Serializable
public class H3HeritrixTemplate extends HeritrixTemplate implements java.io.Serializable
Class encapsulating the Heritrix crawler-beans.cxml file. Heritrix3 has a new model based on Spring, so XPath is not suitable for processing the template. Instead we use placeholders marked by %{..} rather than ${..}, which Heritrix3 already uses for its own purposes. The template is an H3 template if it contains the string: "xmlns="http://www.springframework.org/...."
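The placeholder convention described above can be illustrated with a minimal, self-contained sketch. It is not the NetarchiveSuite implementation: the class PlaceholderSketch, the method fill, and the placeholder name EXAMPLE_PLACEHOLDER are hypothetical and chosen only for illustration. The sketch shows that a %{..} placeholder is replaced while a ${..} expression, which Heritrix3 resolves itself, is left alone.

```java
import java.util.Collections;
import java.util.Map;

/** Illustrative sketch only; not the actual H3HeritrixTemplate code. */
final class PlaceholderSketch {

    /**
     * Replaces each %{NAME} whose NAME is a key in values.
     * Expressions of the form ${..} are not touched.
     */
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // Hypothetical placeholder syntax: %{NAME}
            result = result.replace("%{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "value=%{EXAMPLE_PLACEHOLDER}, launch=${launchId}";
        Map<String, String> values = Collections.singletonMap("EXAMPLE_PLACEHOLDER", "3600");
        // Prints: value=3600, launch=${launchId}
        System.out.println(fill(template, values));
    }
}
```

Plain string replacement is used here because, as the description notes, the template is not processed with XPath.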
- See Also:
- Serialized Form
-
-
Nested Class Summary
Modifier and Type Class Description
static class
H3HeritrixTemplate.MetadataInfo
-
Field Summary
Modifier and Type Field Description
static java.lang.String
ARCHIVE_FILE_PREFIX_PLACEHOLDER
static java.lang.String
CRAWLERTRAPS_PLACEHOLDER
static java.util.regex.Pattern
DEDUPLICATION_BEAN_PATTERN
static java.util.regex.Pattern
DEDUPLICATION_BEAN_REFERENCE_PATTERN
static java.lang.String
DEDUPLICATION_ENABLED_PLACEHOLDER
static java.lang.String
DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER
static java.lang.String
FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER
static java.lang.String
MAX_TIME_SECONDS_PLACEHOLDER
static java.lang.String
METADATA_ITEMS_PLACEHOLDER
java.util.Map<H3HeritrixTemplate.MetadataInfo,java.lang.String>
metadataInfoMap
static java.lang.String
QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER
static java.lang.String
QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER
static java.lang.String
UMBRA_BEAN_REF_PLACEHOLDER
static java.lang.String
UMBRA_PUBLISH_BEAN_PLACEHOLDER
static java.lang.String
UMBRA_RECEIVE_BEAN_PLACEHOLDER
static java.lang.String
UMBRA_SIMPLEOVERRIDES_PLACEHOLDER
-
Fields inherited from class dk.netarkivet.harvester.datamodel.HeritrixTemplate
HARVESTINFO_AUDIENCE, HARVESTINFO_CHANNEL, HARVESTINFO_HARVESTFILENAMEPREFIX, HARVESTINFO_HARVESTNUM, HARVESTINFO_JOBID, HARVESTINFO_JOBSUBMITDATE, HARVESTINFO_MAXBYTESPERDOMAIN, HARVESTINFO_MAXOBJECTSPERDOMAIN, HARVESTINFO_OPERATOR, HARVESTINFO_ORDERXMLDESCRIPTION, HARVESTINFO_ORDERXMLNAME, HARVESTINFO_ORDERXMLUPDATEDATE, HARVESTINFO_ORIGHARVESTDEFINITIONCOMMENTS, HARVESTINFO_ORIGHARVESTDEFINITIONID, HARVESTINFO_ORIGHARVESTDEFINITIONNAME, HARVESTINFO_PERFORMER, HARVESTINFO_SCHEDULENAME, HARVESTINFO_VERSION, HARVESTINFO_VERSION_NUMBER, template_id
-
-
Constructor Summary
Constructor Description
H3HeritrixTemplate(long template_id, java.lang.String template)
Constructor for HeritrixTemplate class.
-
Method Summary
Modifier and Type Method Description
void
configureQuotaEnforcer(boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
Configuring the quota-enforcer, depending on budget definition.
void
enableOrDisableDeduplication(boolean enabled)
java.lang.String
getAmqpUrlreceiverPlaceholder()
AMQP URL receiver text that will replace AMQP_URLRECEIVER_PLACEHOLDER in the template.
java.lang.String
getCallUmbrabean()
Call of the Umbra bean text that will replace CALL_UMBRABEAN_PLACEHOLDER in the template.
java.lang.Long
getMaxBytesPerDomain()
java.lang.Long
getMaxObjectsPerDomain()
java.lang.String
getMetadataInfo(H3HeritrixTemplate.MetadataInfo info)
HeritrixTemplate
getTemplate()
Returns the template.
java.lang.String
getUmbraBeanInformationInSimpleoverridesBean(java.lang.String jobName, java.lang.String rabbitMQUrl, java.lang.String limitSearchRegEx)
Umbrabean text from the current harvest job that will replace the placeholder in the Simpleoverride bean.
java.lang.String
getUmbrabeanPlaceholder()
Umbrabean text that will replace UMBRA_BEAN_PLACEHOLDER in the template.
java.lang.String
getXML()
Return HeritrixTemplate as XML.
boolean
hasContent()
void
insertAttributes(java.util.List<EAV.AttributeAndType> attributesAndTypes)
Try to insert the given list of attributes into the template.
void
insertCrawlerTraps(java.lang.String elementName, java.util.List<java.lang.String> crawlertraps)
Method to add a list of crawler traps with a given element name.
void
insertUmbrabean(java.lang.String jobName, java.lang.String rabbitMQUrl, java.lang.String limitSearchRegEx)
Inserts all necessary Umbra-related beans in this template.
void
insertWarcInfoMetadata(Job ajob, java.lang.String origHarvestdefinitionName, java.lang.String origHarvestdefinitionComments, java.lang.String scheduleName, java.lang.String performer)
Method to add settings to the WARCWriterProcessor, so that it can generate a proper WARCINFO record.
boolean
IsDeduplicationEnabled()
boolean
isValid()
boolean
isVerified()
Has the template been verified?
void
removeDeduplicatorIfPresent()
Try to remove the deduplicator, if present in the template.
void
removePlaceholders()
Hack to remove existing placeholders that are still present after template manipulation is completed.
void
setArchiveFilePrefix(java.lang.String archiveFilePrefix)
void
setArchiveFormat(java.lang.String archiveFormat)
Make sure that Heritrix will archive its data in the chosen archiveFormat.
void
setDeduplicationIndexLocation(java.lang.String absolutePath)
void
setDiskPath(java.lang.String absolutePath)
void
setMaxBytesPerDomain(java.lang.Long maxbytesL)
void
setMaxJobRunningTime(java.lang.Long maxJobRunningTimeSecondsL)
Update the maxTimeSeconds property in the Heritrix3 template, if possible.
void
setMaxObjectsPerDomain(java.lang.Long maxobjectsL)
void
setRecoverlogNode(java.io.File recoverlogGzFile)
void
setSeedsFilePath(java.lang.String absolutePath)
void
writeTemplate(java.io.OutputStream os)
void
writeTemplate(javax.servlet.jsp.JspWriter out)
void
writeToFile(java.io.File orderXmlFile)
-
Methods inherited from class dk.netarkivet.harvester.datamodel.HeritrixTemplate
editOrderXMLAddPerDomainCrawlerTraps, getTemplateFromString, isActive, read, read, setIsActive
-
-
-
-
Field Detail
-
METADATA_ITEMS_PLACEHOLDER
public static final java.lang.String METADATA_ITEMS_PLACEHOLDER
- See Also:
- Constant Field Values
-
MAX_TIME_SECONDS_PLACEHOLDER
public static final java.lang.String MAX_TIME_SECONDS_PLACEHOLDER
- See Also:
- Constant Field Values
-
CRAWLERTRAPS_PLACEHOLDER
public static final java.lang.String CRAWLERTRAPS_PLACEHOLDER
- See Also:
- Constant Field Values
-
DEDUPLICATION_BEAN_REFERENCE_PATTERN
public static final java.util.regex.Pattern DEDUPLICATION_BEAN_REFERENCE_PATTERN
-
DEDUPLICATION_BEAN_PATTERN
public static final java.util.regex.Pattern DEDUPLICATION_BEAN_PATTERN
-
DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER
public static final java.lang.String DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER
- See Also:
- Constant Field Values
-
ARCHIVE_FILE_PREFIX_PLACEHOLDER
public static final java.lang.String ARCHIVE_FILE_PREFIX_PLACEHOLDER
- See Also:
- Constant Field Values
-
FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER
public static final java.lang.String FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER
- See Also:
- Constant Field Values
-
QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER
public static final java.lang.String QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER
- See Also:
- Constant Field Values
-
QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER
public static final java.lang.String QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER
- See Also:
- Constant Field Values
-
DEDUPLICATION_ENABLED_PLACEHOLDER
public static final java.lang.String DEDUPLICATION_ENABLED_PLACEHOLDER
- See Also:
- Constant Field Values
-
UMBRA_SIMPLEOVERRIDES_PLACEHOLDER
public static final java.lang.String UMBRA_SIMPLEOVERRIDES_PLACEHOLDER
- See Also:
- Constant Field Values
-
UMBRA_PUBLISH_BEAN_PLACEHOLDER
public static final java.lang.String UMBRA_PUBLISH_BEAN_PLACEHOLDER
- See Also:
- Constant Field Values
-
UMBRA_RECEIVE_BEAN_PLACEHOLDER
public static final java.lang.String UMBRA_RECEIVE_BEAN_PLACEHOLDER
- See Also:
- Constant Field Values
-
UMBRA_BEAN_REF_PLACEHOLDER
public static final java.lang.String UMBRA_BEAN_REF_PLACEHOLDER
- See Also:
- Constant Field Values
-
metadataInfoMap
public java.util.Map<H3HeritrixTemplate.MetadataInfo,java.lang.String> metadataInfoMap
-
-
Constructor Detail
-
H3HeritrixTemplate
public H3HeritrixTemplate(long template_id, java.lang.String template)
Constructor for HeritrixTemplate class.
- Parameters:
template_id
- The persistent id of the template in the database
template
- The template as String object
- Throws:
ArgumentNotValid
- if template is null.
-
-
Method Detail
-
getTemplate
public HeritrixTemplate getTemplate()
Returns the template.
- Returns:
- the template
-
isVerified
public boolean isVerified()
Has the template been verified?
- Returns:
- true, if verified on construction, otherwise false
-
getXML
public java.lang.String getXML()
Return HeritrixTemplate as XML.
- Specified by:
getXML
in class HeritrixTemplate
- Returns:
- HeritrixTemplate as XML
-
setMaxJobRunningTime
public void setMaxJobRunningTime(java.lang.Long maxJobRunningTimeSecondsL)
Update the maxTimeSeconds property in the Heritrix3 template, if possible.
- Specified by:
setMaxJobRunningTime
in class HeritrixTemplate
- Parameters:
maxJobRunningTimeSecondsL
- Force the harvest job to end after this number of seconds. This is a property of the org.archive.crawler.framework.CrawlLimitEnforcer
-
setMaxBytesPerDomain
public void setMaxBytesPerDomain(java.lang.Long maxbytesL)
- Specified by:
setMaxBytesPerDomain
in class HeritrixTemplate
-
getMaxBytesPerDomain
public java.lang.Long getMaxBytesPerDomain()
- Specified by:
getMaxBytesPerDomain
in class HeritrixTemplate
-
setMaxObjectsPerDomain
public void setMaxObjectsPerDomain(java.lang.Long maxobjectsL)
- Specified by:
setMaxObjectsPerDomain
in class HeritrixTemplate
-
getMaxObjectsPerDomain
public java.lang.Long getMaxObjectsPerDomain()
- Specified by:
getMaxObjectsPerDomain
in class HeritrixTemplate
-
isValid
public boolean isValid()
- Specified by:
isValid
in class HeritrixTemplate
- Returns:
- true, if the template is valid, otherwise false
-
insertUmbrabean
public void insertUmbrabean(java.lang.String jobName, java.lang.String rabbitMQUrl, java.lang.String limitSearchRegEx)
Inserts all necessary Umbra-related beans in this template.
- Specified by:
insertUmbrabean
in class HeritrixTemplate
- Parameters:
jobName
- a String representing the job; must be unique within this NAS environment for all time
rabbitMQUrl
- the URL of the RabbitMQ socket connection (amqp://) to which Umbra requests are to be sent
limitSearchRegEx
- the regular expression used to limit the Heritrix search path of URLs to be sent to Umbra.
-
getUmbraBeanInformationInSimpleoverridesBean
public java.lang.String getUmbraBeanInformationInSimpleoverridesBean(java.lang.String jobName, java.lang.String rabbitMQUrl, java.lang.String limitSearchRegEx)
Umbrabean text from the current harvest job that will replace the placeholder in the Simpleoverride bean.
- Parameters:
jobName
- a String representing the job; must be unique within this NAS environment for all time
rabbitMQUrl
- the URL of the RabbitMQ socket connection (amqp://) to which Umbra requests are to be sent
limitSearchRegEx
- the regular expression used to limit the Heritrix search path of URLs to be sent to Umbra.
-
getUmbrabeanPlaceholder
public java.lang.String getUmbrabeanPlaceholder()
Umbrabean text that will replace UMBRA_BEAN_PLACEHOLDER in the template.
-
getAmqpUrlreceiverPlaceholder
public java.lang.String getAmqpUrlreceiverPlaceholder()
AMQP URL receiver text that will replace AMQP_URLRECEIVER_PLACEHOLDER in the template.
-
getCallUmbrabean
public java.lang.String getCallUmbrabean()
Call of the Umbra bean text that will replace CALL_UMBRABEAN_PLACEHOLDER in the template.
-
IsDeduplicationEnabled
public boolean IsDeduplicationEnabled()
- Specified by:
IsDeduplicationEnabled
in class HeritrixTemplate
- Returns:
- true, if deduplication is enabled in the template (used to determine whether or not to request a deduplication index from the indexserver)
-
configureQuotaEnforcer
public void configureQuotaEnforcer(boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
Configures the quota enforcer, depending on the budget definition. The object limit can be defined either by the queue-total-budget property or by the quota enforcer; the value of the argument maxObjectsIsSetByQuotaEnforcer decides which. If all values in the quota enforcer are infinite, it is in effect disabled. The quota enforcer is therefore set as follows:
- If the object limit is not set by the quota enforcer, it is disabled only if there is no byte limit.
- If the object limit is set by the quota enforcer, it is enabled if a byte or object limit is set.
- Specified by:
configureQuotaEnforcer
in class HeritrixTemplate
- Parameters:
maxObjectsIsSetByQuotaEnforcer
- Decides whether the object limit is set by the quota enforcer or not.
forceMaxBytesPerDomain
- The maximum number of bytes per domain enforced (can be no limit)
forceMaxObjectsPerDomain
- The maximum number of objects per domain enforced (can be no limit)
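The enabling rule described for configureQuotaEnforcer can be written as a small decision function. This is a sketch under stated assumptions, not the library code: the class QuotaEnforcerSketch and the method quotaEnforcerEnabled are hypothetical, and the value -1 used for "no limit" is an assumption, since this page does not state the sentinel.

```java
/** Illustrative sketch only; not the actual H3HeritrixTemplate code. */
final class QuotaEnforcerSketch {

    /** Assumed sentinel meaning "no limit"; the real constant is not given on this page. */
    static final long NO_LIMIT = -1L;

    /**
     * Returns true if the quota enforcer should be enabled.
     * - Object limit not set by quota enforcer: enabled only if there is a byte limit.
     * - Object limit set by quota enforcer: enabled if a byte or an object limit is set.
     */
    static boolean quotaEnforcerEnabled(boolean maxObjectsIsSetByQuotaEnforcer,
            long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain) {
        boolean hasByteLimit = forceMaxBytesPerDomain != NO_LIMIT;
        boolean hasObjectLimit = forceMaxObjectsPerDomain != NO_LIMIT;
        if (maxObjectsIsSetByQuotaEnforcer) {
            return hasByteLimit || hasObjectLimit;
        }
        // The object limit is handled elsewhere (queue-total-budget), so only bytes matter here.
        return hasByteLimit;
    }

    public static void main(String[] args) {
        // Object limit given, but not enforced by the quota enforcer and no byte limit: disabled.
        System.out.println(quotaEnforcerEnabled(false, NO_LIMIT, 100L));
        // Same limits, but the quota enforcer owns the object limit: enabled.
        System.out.println(quotaEnforcerEnabled(true, NO_LIMIT, 100L));
    }
}
```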
-
setArchiveFormat
public void setArchiveFormat(java.lang.String archiveFormat)
Make sure that Heritrix will archive its data in the chosen archiveFormat.
- Specified by:
setArchiveFormat
in class HeritrixTemplate
- Parameters:
archiveFormat
- the chosen archive format ('arc' or 'warc' supported)
- Throws:
ArgumentNotValid
- If the chosen archiveFormat is not supported.
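The argument check implied by the setArchiveFormat contract might look like the following sketch. It is not the library code: the class ArchiveFormatSketch and the method normalizeArchiveFormat are hypothetical, IllegalArgumentException stands in for ArgumentNotValid so that the example compiles on its own, and the case-insensitive comparison is an assumption.

```java
import java.util.Locale;

/** Illustrative sketch only; not the actual H3HeritrixTemplate code. */
final class ArchiveFormatSketch {

    /**
     * Returns the format in lower case if it is 'arc' or 'warc'.
     * Throws IllegalArgumentException (a stand-in for ArgumentNotValid) otherwise.
     */
    static String normalizeArchiveFormat(String archiveFormat) {
        if (archiveFormat == null) {
            throw new IllegalArgumentException("archiveFormat must not be null");
        }
        // Assumption: the comparison ignores case and surrounding whitespace.
        String format = archiveFormat.trim().toLowerCase(Locale.ROOT);
        if (format.equals("arc") || format.equals("warc")) {
            return format;
        }
        throw new IllegalArgumentException("Unsupported archive format: " + archiveFormat);
    }

    public static void main(String[] args) {
        System.out.println(normalizeArchiveFormat("WARC"));
    }
}
```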
-
insertCrawlerTraps
public void insertCrawlerTraps(java.lang.String elementName, java.util.List<java.lang.String> crawlertraps)
Description copied from class:HeritrixTemplate
Method to add a list of crawler traps with a given element name. It is used both to add per-domain traps and global traps.
- Specified by:
insertCrawlerTraps
in class HeritrixTemplate
- Parameters:
elementName
- The name of the added element.
crawlertraps
- A list of crawler trap regular expressions to add to this job.
-
getMetadataInfo
public java.lang.String getMetadataInfo(H3HeritrixTemplate.MetadataInfo info)
-
writeTemplate
public void writeTemplate(java.io.OutputStream os) throws IOFailure
- Specified by:
writeTemplate
in class HeritrixTemplate
- Throws:
IOFailure
-
hasContent
public boolean hasContent()
- Specified by:
hasContent
in class HeritrixTemplate
-
writeToFile
public void writeToFile(java.io.File orderXmlFile)
- Specified by:
writeToFile
in class HeritrixTemplate
-
setRecoverlogNode
public void setRecoverlogNode(java.io.File recoverlogGzFile)
- Specified by:
setRecoverlogNode
in class HeritrixTemplate
-
setDeduplicationIndexLocation
public void setDeduplicationIndexLocation(java.lang.String absolutePath)
- Specified by:
setDeduplicationIndexLocation
in class HeritrixTemplate
-
setSeedsFilePath
public void setSeedsFilePath(java.lang.String absolutePath)
- Specified by:
setSeedsFilePath
in class HeritrixTemplate
-
setArchiveFilePrefix
public void setArchiveFilePrefix(java.lang.String archiveFilePrefix)
- Specified by:
setArchiveFilePrefix
in class HeritrixTemplate
-
setDiskPath
public void setDiskPath(java.lang.String absolutePath)
- Specified by:
setDiskPath
in class HeritrixTemplate
-
removeDeduplicatorIfPresent
public void removeDeduplicatorIfPresent()
Description copied from class:HeritrixTemplate
Try to remove the deduplicator, if present in the template.
- Specified by:
removeDeduplicatorIfPresent
in class HeritrixTemplate
-
enableOrDisableDeduplication
public void enableOrDisableDeduplication(boolean enabled)
- Specified by:
enableOrDisableDeduplication
in class HeritrixTemplate
-
insertWarcInfoMetadata
public void insertWarcInfoMetadata(Job ajob, java.lang.String origHarvestdefinitionName, java.lang.String origHarvestdefinitionComments, java.lang.String scheduleName, java.lang.String performer)
Description copied from class:HeritrixTemplate
Method to add settings to the WARCWriterProcessor, so that it can generate a proper WARCINFO record.
- Specified by:
insertWarcInfoMetadata
in class HeritrixTemplate
- Parameters:
ajob
- a HarvestJob
origHarvestdefinitionName
- The name of the harvest definition behind this job
scheduleName
- The name of the schedule used. (Will be null if the job is not a selective harvest.)
performer
- The name of the organisation or person doing this harvest
-
insertAttributes
public void insertAttributes(java.util.List<EAV.AttributeAndType> attributesAndTypes)
Description copied from class:HeritrixTemplate
Try to insert the given list of attributes into the template.
- Specified by:
insertAttributes
in class HeritrixTemplate
-
writeTemplate
public void writeTemplate(javax.servlet.jsp.JspWriter out) throws IOFailure
- Specified by:
writeTemplate
in class HeritrixTemplate
- Throws:
IOFailure
-
removePlaceholders
public void removePlaceholders()
Hack to remove existing placeholders that are still present after template manipulation is completed.
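One way such a cleanup could work is a regular-expression pass over the template text. The sketch below is an assumption about the technique, not the actual implementation: the class RemovePlaceholdersSketch, the method removeLeftovers, and the pattern are hypothetical. It strips any remaining %{..} token and leaves ${..} expressions in place for Heritrix3.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; not the actual H3HeritrixTemplate code. */
final class RemovePlaceholdersSketch {

    /** Matches "%{", then any characters except "}", then "}". */
    private static final Pattern LEFTOVER_PLACEHOLDER = Pattern.compile("%\\{[^}]*\\}");

    /** Removes every leftover %{..} placeholder from the template text. */
    static String removeLeftovers(String template) {
        return LEFTOVER_PLACEHOLDER.matcher(template).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints: xy${b}z
        System.out.println(removeLeftovers("x%{A}y${b}z%{C_D}"));
    }
}
```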
-
-