public class H1HeritrixTemplate extends HeritrixTemplate implements Serializable
This class assumes the order.xml format used to configure Heritrix version 1.10+. Information about the Heritrix crawler and its processes and modules can be found in the Heritrix developer and user manuals at http://crawler.archive.org
Modifier and Type | Field | Description |
---|---|---|
static String | ARC_ARCHIVER_PATH_XPATH | Xpath to check that all templates use the same ARC archiver path, Constants.ARCDIRECTORY_NAME. |
static String | ARCHIVEFILE_PREFIX_XPATH | Xpath for the arcfile 'prefix' in the order.xml. |
static String | ARCS_ENABLED_XPATH | |
static String | ARCSDIR_XPATH | Xpath for the ARCs dir in the order.xml. |
static String | ARCWRITERPROCESSOR_XPATH | |
static String | DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH | Xpath needed by Job.editOrderXML_crawlerTraps(). |
static String | DECIDERULES_MAP_XPATH | Xpath needed by Job.editOrderXML_crawlerTraps(). |
static String | DECIDINGSCOPE_XPATH | Xpath to check that all templates use the DecidingScope. |
static String | DEDUPLICATOR_ENABLED | Xpath for the boolean telling if the deduplicator is enabled in order.xml documents. |
static String | DEDUPLICATOR_INDEX_LOCATION_XPATH | Xpath for the deduplicator index directory node in order.xml documents. |
static String | DEDUPLICATOR_XPATH | Xpath for the deduplicator node in order.xml documents. |
static String | DISK_PATH_XPATH | Xpath for the 'disk-path' in the order.xml. |
static String | GROUP_MAX_ALL_KB_XPATH | Xpath needed by Job.editOrderXML_maxBytesPerDomain(). |
static String | GROUP_MAX_FETCH_SUCCESS_XPATH | Xpath needed by Job.editOrderXML_maxObjectsPerDomain(). |
static String | HERITRIX_FROM_XPATH | Xpath checked by Heritrix for correct mail address. |
static String | HERITRIX_USER_AGENT_XPATH | Xpath checked by Heritrix for correct user-agent field in requests. |
static String | MAXTIMESEC_PATH_XPATH | Xpath to check that all templates have the max-time-sec attribute. |
static String | METADATA_ITEMS_XPATH | Xpath for the WARC metadata in the order.xml. |
static String | QUEUE_TOTAL_BUDGET_XPATH | Xpath needed by Job.editOrderXML_maxObjectsPerDomain(). |
static String | QUOTA_ENFORCER_ENABLED_XPATH | Xpath needed by Job.editOrderXML_maxBytesPerDomain(). |
static String | SEEDS_FILE_XPATH | Xpath for the 'seedsfile' in the order.xml. |
static String | WARC_ARCHIVER_PATH_XPATH | Xpath to check that all templates use the same WARC archiver path, Constants.WARCDIRECTORY_NAME. |
static String | WARCS_ENABLED_XPATH | Xpath for the WARCs dir in the order.xml. |
static String | WARCS_SKIP_IDENTICAL_DIGESTS_XPATH | |
static String | WARCS_WRITE_METADATA_OUTLINKS_XPATH | |
static String | WARCS_WRITE_METADATA_XPATH | |
static String | WARCS_WRITE_REQUESTS_XPATH | |
static String | WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH | |
static String | WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH | |
static String | WARCSDIR_XPATH | Xpath for the WARCs dir in the order.xml. |
static String | WARCWRITERPROCESSOR_XPATH | |
HARVESTINFO_AUDIENCE, HARVESTINFO_CHANNEL, HARVESTINFO_HARVESTFILENAMEPREFIX, HARVESTINFO_HARVESTNUM, HARVESTINFO_JOBID, HARVESTINFO_JOBSUBMITDATE, HARVESTINFO_MAXBYTESPERDOMAIN, HARVESTINFO_MAXOBJECTSPERDOMAIN, HARVESTINFO_ORDERXMLNAME, HARVESTINFO_ORIGHARVESTDEFINITIONID, HARVESTINFO_ORIGHARVESTDEFINITIONNAME, HARVESTINFO_PERFORMER, HARVESTINFO_SCHEDULENAME, HARVESTINFO_VERSION, HARVESTINFO_VERSION_NUMBER, template_id
Constructor | Description |
---|---|
H1HeritrixTemplate(org.dom4j.Document doc) | Alternate constructor, which always verifies the given document. |
H1HeritrixTemplate(org.dom4j.Document doc, boolean verify) | Constructor for HeritrixTemplate class. |
H1HeritrixTemplate(long template_id, String templateAsString) | |
Modifier and Type | Method | Description |
---|---|---|
void | configureQuotaEnforcer(boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain) | Activates or deactivates the quota-enforcer, depending on budget definition. |
static void | editOrderXML_configureQuotaEnforcer(org.dom4j.Document orderXMLdoc, boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain) | Activates or deactivates the quota-enforcer, depending on budget definition. |
static void | editOrderXML_maxObjectsPerDomain(org.dom4j.Document orderXMLdoc, long forceMaxObjectsPerDomain, boolean maxObjectsIsSetByQuotaEnforcer) | Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of objects to be retrieved per domain. |
static void | editOrderXMLAddCrawlerTraps(org.dom4j.Document orderXMLdoc, String elementName, List<String> crawlerTraps) | Method to add a list of crawler traps with a given element name. |
Long | getMaxBytesPerDomain() | |
Long | getMaxObjectsPerDomain() | |
org.dom4j.Document | getTemplate() | Returns the template. |
String | getText() | Only available for H1 templates. |
String | getXML() | Returns the HeritrixTemplate as XML. |
boolean | hasContent() | |
void | insertAttributes(List<EAV.AttributeAndType> attributesAndTypes) | Try to insert the given list of attributes into the template. |
void | insertCrawlerTraps(String elementName, List<String> crawlerTraps) | Method to add a list of crawler traps with a given element name. |
void | insertWarcInfoMetadata(Job ajob, String origHarvestdefinitionName, String scheduleName, String performer) | Method to add settings to the WARCWriterProcessor, so that it can generate a proper WARCINFO record. |
boolean | IsDeduplicationEnabled() | Returns true if the template file has deduplication enabled. |
boolean | isValid() | |
boolean | isVerified() | Has Template been verified? |
void | removeDeduplicatorIfPresent() | Try to remove the deduplicator, if present in the template. |
void | setArchiveFilePrefix(String archiveFilePrefix) | |
void | setArchiveFormat(String archiveFormat) | Make sure that Heritrix will archive its data in the chosen archiveFormat. |
void | setDeduplicationIndexLocation(String absolutePath) | |
void | setDiskPath(String absolutePath) | |
void | setMaxBytesPerDomain(Long forceMaxBytesPerDomain) | Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of bytes to retrieve per domain. |
void | setMaxJobRunningTime(Long maxJobRunningTimeSecondsL) | Sets the maximum running time for the harvest. |
void | setMaxObjectsPerDomain(Long maxobjectsL) | |
void | setRecoverlogNode(File recoverlogGzFile) | |
void | setSeedsFilePath(String absolutePath) | |
void | writeTemplate(javax.servlet.jsp.JspWriter out) | |
void | writeTemplate(OutputStream os) | |
void | writeToFile(File orderXmlFile) | |
editOrderXMLAddPerDomainCrawlerTraps, getTemplateFromString, isActive, read, read, setIsActive
public static final String QUOTA_ENFORCER_ENABLED_XPATH
public static final String GROUP_MAX_ALL_KB_XPATH
public static final String GROUP_MAX_FETCH_SUCCESS_XPATH
public static final String QUEUE_TOTAL_BUDGET_XPATH
public static final String DECIDERULES_MAP_XPATH
public static final String DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH
public static final String HERITRIX_USER_AGENT_XPATH
public static final String HERITRIX_FROM_XPATH
public static final String DECIDINGSCOPE_XPATH
public static final String DEDUPLICATOR_XPATH
public static final String ARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same ARC archiver path, Constants.ARCDIRECTORY_NAME. The archive path tells Heritrix which directory to write its ARC files to.
public static final String WARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same WARC archiver path, Constants.WARCDIRECTORY_NAME. The archive path tells Heritrix which directory to write its WARC files to.
public static final String DEDUPLICATOR_INDEX_LOCATION_XPATH
public static final String DEDUPLICATOR_ENABLED
public static final String DISK_PATH_XPATH
public static final String ARCHIVEFILE_PREFIX_XPATH
public static final String ARCSDIR_XPATH
public static final String WARCWRITERPROCESSOR_XPATH
public static final String ARCWRITERPROCESSOR_XPATH
public static final String WARCSDIR_XPATH
public static final String SEEDS_FILE_XPATH
public static final String ARCS_ENABLED_XPATH
public static final String WARCS_ENABLED_XPATH
public static final String WARCS_WRITE_REQUESTS_XPATH
public static final String WARCS_WRITE_METADATA_XPATH
public static final String WARCS_WRITE_METADATA_OUTLINKS_XPATH
public static final String WARCS_SKIP_IDENTICAL_DIGESTS_XPATH
public static final String WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH
public static final String WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH
public static final String METADATA_ITEMS_XPATH
public static final String MAXTIMESEC_PATH_XPATH
public H1HeritrixTemplate(org.dom4j.Document doc, boolean verify)
doc
- the order.xml
verify
- If true, verifies that the given dom4j Document contains the elements required by our software.
ArgumentNotValid
- if doc is null, or verify is true and doc does not obey the constraints required by our software.
public H1HeritrixTemplate(org.dom4j.Document doc)
doc
-
public H1HeritrixTemplate(long template_id, String templateAsString) throws org.dom4j.DocumentException
org.dom4j.DocumentException
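The verify flag described above checks that a document contains the elements the software requires. A minimal sketch of that kind of check, written against the JDK's own DOM and XPath APIs rather than dom4j, might look like the following. The class TemplateVerifySketch, its methods, the sample document, and the two XPath strings are illustrative assumptions; the actual values of constants such as DISK_PATH_XPATH and SEEDS_FILE_XPATH are not given on this page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of H1HeritrixTemplate. */
class TemplateVerifySketch {

    /** Parses an order.xml string into a JDK DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Template is not well-formed XML", e);
        }
    }

    /** Returns the XPath expressions that match no node in the document. */
    static List<String> missingNodes(Document doc, List<String> requiredXpaths) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> missing = new ArrayList<>();
        for (String expr : requiredXpaths) {
            try {
                Node node = (Node) xpath.evaluate(expr, doc, XPathConstants.NODE);
                if (node == null) {
                    missing.add(expr);
                }
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Invalid XPath: " + expr, e);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed, simplified order.xml layout; a real template is much larger.
        String orderXml = "<crawl-order><controller>"
                + "<string name='disk-path'>/tmp/job</string>"
                + "</controller></crawl-order>";
        List<String> required = Arrays.asList(
                "/crawl-order/controller/string[@name='disk-path']",
                "/crawl-order/controller/string[@name='seedsfile']");
        // An empty list would mean every required node was found.
        System.out.println(missingNodes(parse(orderXml), required));
    }
}
```

Returning the list of unmatched expressions, instead of a plain boolean, lets a caller report exactly which required element a template lacks.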
public org.dom4j.Document getTemplate()
public boolean isVerified()
public String getXML()
getXML
in class HeritrixTemplate
public static void editOrderXMLAddCrawlerTraps(org.dom4j.Document orderXMLdoc, String elementName, List<String> crawlerTraps)
elementName
- The name of the added element.
crawlerTraps
- A list of crawler trap regular expressions to add to this job.
public static void editOrderXML_maxObjectsPerDomain(org.dom4j.Document orderXMLdoc, long forceMaxObjectsPerDomain, boolean maxObjectsIsSetByQuotaEnforcer)
orderXMLdoc
-
forceMaxObjectsPerDomain
- The maximum number of objects to retrieve per domain, or 0 for no limit.
PermissionDenied
- If unable to replace the frontier node of the orderXMLdoc Document
IOFailure
- If the group-max-fetch-success element is not found in the orderXml. TODO The group-max-fetch-success check should also be performed in TemplateDAO.create, TemplateDAO.update
public static void editOrderXML_configureQuotaEnforcer(org.dom4j.Document orderXMLdoc, boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
orderXMLdoc
- the template to modify
maxObjectsIsSetByQuotaEnforcer
- Whether the maximum number of objects is set by the quota enforcer.
forceMaxBytesPerDomain
- The number of max bytes per domain enforced (can be no limit)
forceMaxObjectsPerDomain
- The number of max objects per domain enforced (can be no limit)
public boolean isValid()
isValid
in class HeritrixTemplate
public void configureQuotaEnforcer(boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
HeritrixTemplate
configureQuotaEnforcer
in class HeritrixTemplate
maxObjectsIsSetByQuotaEnforcer
- Whether the maximum number of objects is set by the quota enforcer.
forceMaxBytesPerDomain
- The number of max bytes per domain enforced (can be no limit)
forceMaxObjectsPerDomain
- The number of max objects per domain enforced (can be no limit)
public void setMaxBytesPerDomain(Long forceMaxBytesPerDomain)
setMaxBytesPerDomain
in class HeritrixTemplate
forceMaxBytesPerDomain
- The maximum number of bytes to retrieve per domain, or -1 for no limit. Note that the number is divided by 1024 before being inserted into the orderXml, as Heritrix expects KB.
PermissionDenied
- If unable to replace the QuotaEnforcer node of the orderXMLdoc Document
IOFailure
- If the group-max-all-kb element cannot be found. TODO This group-max-all-kb check should also be performed in TemplateDAO.create, TemplateDAO.update
public Long getMaxBytesPerDomain()
getMaxBytesPerDomain
in class HeritrixTemplate
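The setMaxBytesPerDomain description above notes that the byte limit is divided by 1024 before it is written to the order.xml, with -1 meaning no limit. The conversion can be sketched as below. QuotaKbSketch and bytesToKb are hypothetical names, and two behaviours are assumptions rather than documented facts: that -1 is passed through unchanged, and that the division truncates rather than rounds.

```java
/** Hypothetical helper for illustration; not part of H1HeritrixTemplate. */
class QuotaKbSketch {

    /** Marker for "no limit", as stated in the parameter description. */
    static final long NO_LIMIT = -1L;

    /**
     * Converts a per-domain byte limit to the KB value Heritrix expects.
     * Assumption: the no-limit marker is written as-is, and any other
     * value is divided by 1024 using integer (truncating) division.
     */
    static long bytesToKb(long maxBytes) {
        if (maxBytes == NO_LIMIT) {
            return NO_LIMIT;
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("Negative byte limit: " + maxBytes);
        }
        return maxBytes / 1024;
    }

    public static void main(String[] args) {
        System.out.println(bytesToKb(10L * 1024 * 1024)); // a 10 MiB limit expressed in KB
        System.out.println(bytesToKb(NO_LIMIT));          // the no-limit marker
    }
}
```

With truncating division, any limit below 1024 bytes becomes 0 KB, so a caller relying on very small limits would need to check how the real class handles that case.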
public void setMaxObjectsPerDomain(Long maxobjectsL)
setMaxObjectsPerDomain
in class HeritrixTemplate
public Long getMaxObjectsPerDomain()
getMaxObjectsPerDomain
in class HeritrixTemplate
public boolean IsDeduplicationEnabled()
IsDeduplicationEnabled
in class HeritrixTemplate
public void setArchiveFormat(String archiveFormat)
HeritrixTemplate
setArchiveFormat
in class HeritrixTemplate
archiveFormat
- the chosen archive format ('arc' or 'warc' supported)
ArgumentNotValid
- If the chosen archiveFormat is not supported.
public void setMaxJobRunningTime(Long maxJobRunningTimeSecondsL)
HeritrixTemplate
setMaxJobRunningTime
in class HeritrixTemplate
maxJobRunningTimeSecondsL
- Limit the harvest to this number of seconds
public void writeTemplate(OutputStream os) throws IOException, ArgumentNotValid
writeTemplate
in class HeritrixTemplate
IOException
ArgumentNotValid
public void insertCrawlerTraps(String elementName, List<String> crawlerTraps)
HeritrixTemplate
insertCrawlerTraps
in class HeritrixTemplate
elementName
- The name of the added element.
crawlerTraps
- A list of crawler trap regular expressions to add to this job.
public boolean hasContent()
hasContent
in class HeritrixTemplate
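Both insertCrawlerTraps and editOrderXMLAddCrawlerTraps, described earlier, take an element name and a list of crawler trap regular expressions. As a rough sketch of the idea, the code below validates each expression and then appends a named list element to a parent node, using the JDK DOM API instead of dom4j. The class CrawlerTrapSketch, the element names stringList and string, and the sample trap patterns are assumptions made for illustration; this page does not show the markup the real methods generate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper for illustration; not part of H1HeritrixTemplate. */
class CrawlerTrapSketch {

    /** Creates an empty JDK DOM document. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Appends a list element named elementName to parent, holding one child
     * per crawler trap. Every expression is compiled first, so an invalid
     * regular expression throws PatternSyntaxException before the document
     * is modified.
     */
    static Element addTraps(Document doc, Element parent, String elementName, List<String> traps) {
        for (String trap : traps) {
            Pattern.compile(trap);
        }
        Element list = doc.createElement("stringList"); // assumed element name
        list.setAttribute("name", elementName);
        for (String trap : traps) {
            Element item = doc.createElement("string"); // assumed element name
            item.setTextContent(trap);
            list.appendChild(item);
        }
        parent.appendChild(list);
        return list;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element rules = doc.createElement("map");
        doc.appendChild(rules);
        Element added = addTraps(doc, rules, "exampleTraps",
                Arrays.asList(".*calendar.*", ".*sessionid=.*"));
        System.out.println(added.getChildNodes().getLength() + " traps added");
    }
}
```

Compiling all expressions before touching the document keeps the template unchanged when any single trap is malformed.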
public void writeToFile(File orderXmlFile)
writeToFile
in class HeritrixTemplate
public void setRecoverlogNode(File recoverlogGzFile)
setRecoverlogNode
in class HeritrixTemplate
public void setDeduplicationIndexLocation(String absolutePath)
setDeduplicationIndexLocation
in class HeritrixTemplate
public void setSeedsFilePath(String absolutePath)
setSeedsFilePath
in class HeritrixTemplate
public void setArchiveFilePrefix(String archiveFilePrefix)
setArchiveFilePrefix
in class HeritrixTemplate
public void setDiskPath(String absolutePath)
setDiskPath
in class HeritrixTemplate
public void removeDeduplicatorIfPresent()
HeritrixTemplate
removeDeduplicatorIfPresent
in class HeritrixTemplate
public void insertWarcInfoMetadata(Job ajob, String origHarvestdefinitionName, String scheduleName, String performer)
HeritrixTemplate
insertWarcInfoMetadata
in class HeritrixTemplate
ajob
- a HarvestJob
origHarvestdefinitionName
- The name of the harvestdefinition behind this job
scheduleName
- The name of the schedule used. (Will be null if the job is not a selective harvest.)
performer
- The name of the organisation/person doing this harvest
public void insertAttributes(List<EAV.AttributeAndType> attributesAndTypes)
HeritrixTemplate
insertAttributes
in class HeritrixTemplate
public void writeTemplate(javax.servlet.jsp.JspWriter out) throws IOFailure
writeTemplate
in class HeritrixTemplate
IOFailure
Copyright © 2005–2016 The Royal Danish Library, the Danish State and University Library, the National Library of France and the Austrian National Library. All rights reserved.