java.lang.Object
  dk.netarkivet.harvester.datamodel.HeritrixTemplate
public class HeritrixTemplate
Class encapsulating the Heritrix order.xml. Enables verification that a dom4j Document obeys the constraints required by our software, specifically the Job class. The class assumes the type of order.xml used in configuring Heritrix version 1.10+. Information about the Heritrix crawler and its processes and modules can be found in the Heritrix developer and user manuals at http://crawler.archive.org
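The kind of verification described above can be illustrated with a minimal sketch. It is not the HeritrixTemplate implementation: it uses the JDK's javax.xml APIs instead of dom4j so that it runs without extra libraries, and the class name `OrderXmlCheckSketch`, the sample document, and the XPath strings are all hypothetical stand-ins for the real constants.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Sketch only: reports which required XPaths select nothing in a document. */
final class OrderXmlCheckSketch {

    /** Parses an XML string; checked exceptions are wrapped for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    /** Returns the XPath expressions that match no node in the document. */
    static List<String> missingPaths(Document doc, List<String> requiredXpaths) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> missing = new ArrayList<>();
        for (String expr : requiredXpaths) {
            try {
                Node node = (Node) xpath.evaluate(expr, doc, XPathConstants.NODE);
                if (node == null) {
                    missing.add(expr);
                }
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Bad XPath: " + expr, e);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily simplified stand-in for an order.xml.
        String xml = "<crawl-order><controller>"
                + "<string name=\"disk-path\">/tmp/job</string>"
                + "</controller></crawl-order>";
        // Hypothetical XPaths, not the actual values of the constants below.
        List<String> required = Arrays.asList(
                "/crawl-order/controller/string[@name='disk-path']",
                "/crawl-order/controller/string[@name='seedsfile']");
        System.out.println("Missing: " + missingPaths(parse(xml), required));
    }
}
```

A document would count as verified in this sketch when `missingPaths` returns an empty list. The real class may apply further checks that are not listed on this page.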
Field Summary | |
---|---|
static java.lang.String |
ARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same ARC archiver path, Constants.ARCDIRECTORY_NAME. |
static java.lang.String |
ARCHIVEFILE_PREFIX_XPATH
Xpath for the arcfile 'prefix' in the order.xml. |
static java.lang.String |
ARCS_ENABLED_XPATH
|
static java.lang.String |
ARCSDIR_XPATH
Xpath for the ARCs dir in the order.xml. |
static java.lang.String |
DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH
Xpath needed by Job.editOrderXML_crawlerTraps(). |
static java.lang.String |
DECIDERULES_MAP_XPATH
Xpath needed by Job.editOrderXML_crawlerTraps(). |
static java.lang.String |
DECIDINGSCOPE_XPATH
Xpath to check that all templates use the DecidingScope. |
static java.lang.String |
DEDUPLICATOR_ENABLED
Xpath for the boolean telling if the deduplicator is enabled in order.xml documents. |
static java.lang.String |
DEDUPLICATOR_INDEX_LOCATION_XPATH
Xpath for the deduplicator index directory node in order.xml documents. |
static java.lang.String |
DEDUPLICATOR_XPATH
Xpath for the deduplicator node in order.xml documents. |
static java.lang.String |
DISK_PATH_XPATH
Xpath for the 'disk-path' in the order.xml. |
static java.lang.String |
GROUP_MAX_ALL_KB_XPATH
Xpath needed by Job.editOrderXML_maxBytesPerDomain(). |
static java.lang.String |
GROUP_MAX_FETCH_SUCCESS_XPATH
Xpath needed by Job.editOrderXML_maxObjectsPerDomain(). |
static java.lang.String |
HERITRIX_FROM_XPATH
Xpath checked by Heritrix for correct mail address. |
static java.lang.String |
HERITRIX_USER_AGENT_XPATH
Xpath checked by Heritrix for correct user-agent field in requests. |
static java.lang.String |
MAXTIMESEC_PATH_XPATH
Xpath to check that all templates have the max-time-sec attribute. |
static java.lang.String |
QUEUE_TOTAL_BUDGET_XPATH
Xpath needed by Job.editOrderXML_maxObjectsPerDomain(). |
static java.lang.String |
QUOTA_ENFORCER_ENABLED_XPATH
Xpath needed by Job.editOrderXML_maxBytesPerDomain(). |
static java.lang.String |
SEEDS_FILE_XPATH
Xpath for the 'seedsfile' in the order.xml. |
static java.lang.String |
WARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same WARC archiver path, Constants.WARCDIRECTORY_NAME. |
static java.lang.String |
WARCS_ENABLED_XPATH
Xpath for the WARCs dir in the order.xml. |
static java.lang.String |
WARCS_SKIP_IDENTICAL_DIGESTS_XPATH
|
static java.lang.String |
WARCS_WRITE_METADATA_OUTLINKS_XPATH
|
static java.lang.String |
WARCS_WRITE_METADATA_XPATH
|
static java.lang.String |
WARCS_WRITE_REQUESTS_XPATH
|
static java.lang.String |
WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH
|
static java.lang.String |
WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH
|
static java.lang.String |
WARCSDIR_XPATH
Xpath for the WARCs dir in the order.xml. |
Constructor Summary | |
---|---|
HeritrixTemplate(org.dom4j.Document doc)
Alternate constructor, which always verifies the given document. |
|
HeritrixTemplate(org.dom4j.Document doc,
boolean verify)
Constructor for HeritrixTemplate class. |
Method Summary | |
---|---|
static void |
editOrderXML_ArchiveFormat(org.dom4j.Document orderXML,
java.lang.String archiveFormat)
Make sure that Heritrix will archive its data in the chosen archiveFormat. |
static void |
editOrderXML_configureQuotaEnforcer(org.dom4j.Document orderXMLdoc,
boolean maxObjectsIsSetByQuotaEnforcer,
long forceMaxBytesPerDomain,
long forceMaxObjectsPerDomain)
Activates or deactivates the quota-enforcer, depending on budget definition. |
static void |
editOrderXML_maxBytesPerDomain(org.dom4j.Document orderXMLdoc,
long forceMaxBytesPerDomain)
Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of bytes to retrieve per domain. |
static void |
editOrderXML_maxJobRunningTime(org.dom4j.Document orderXMLdoc,
long maxJobRunningTime)
|
static void |
editOrderXML_maxObjectsPerDomain(org.dom4j.Document orderXMLdoc,
long forceMaxObjectsPerDomain,
boolean maxObjectsIsSetByQuotaEnforcer)
Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of objects to be retrieved per domain. |
static void |
editOrderXMLAddCrawlerTraps(org.dom4j.Document orderXMLdoc,
java.lang.String elementName,
java.util.List<java.lang.String> crawlerTraps)
Method to add a list of crawler traps with a given element name. |
static void |
editOrderXMLAddPerDomainCrawlerTraps(org.dom4j.Document orderXmlDoc,
DomainConfiguration cfg)
Updates the order.xml to include a MatchesListRegExpDecideRule for each crawlertrap associated with the given DomainConfiguration. |
org.dom4j.Document |
getTemplate()
Returns the template. |
java.lang.String |
getXML()
Return HeritrixTemplate as XML. |
static boolean |
isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
Return true if the given order.xml file has deduplication enabled. |
boolean |
isVerified()
Has Template been verified? |
static void |
makeOrderfileReadyForHeritrix(HeritrixFiles files)
This method prepares the orderfile used by the Heritrix crawler. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
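As an illustration of what editOrderXML_ArchiveFormat is described as doing, the sketch below switches a document between 'arc' and 'warc' by setting two enabled flags. It is a guess at the general idea, not the library's code: it uses JDK DOM instead of dom4j, throws IllegalArgumentException where the real method throws ArgumentNotValid, and the class name, document layout, and XPath values are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Sketch only: enables exactly one of two archive writers in a document. */
final class ArchiveFormatSketch {

    // Hypothetical XPaths; the real ARCS_ENABLED_XPATH and
    // WARCS_ENABLED_XPATH values are not shown on this page.
    static final String ARCS_ENABLED = "//writer[@name='arc']/enabled";
    static final String WARCS_ENABLED = "//writer[@name='warc']/enabled";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    /** Finds the single node selected by expr, or fails if there is none. */
    static Node find(Document doc, String expr) {
        Node node;
        try {
            node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
        if (node == null) {
            throw new IllegalStateException("Node not found: " + expr);
        }
        return node;
    }

    /** Enables the chosen writer and disables the other one. */
    static void chooseArchiveFormat(Document doc, String format) {
        boolean arc = "arc".equals(format);
        boolean warc = "warc".equals(format);
        if (!arc && !warc) {
            throw new IllegalArgumentException("Unsupported archive format: " + format);
        }
        find(doc, ARCS_ENABLED).setTextContent(String.valueOf(arc));
        find(doc, WARCS_ENABLED).setTextContent(String.valueOf(warc));
    }

    public static void main(String[] args) {
        // Hypothetical, simplified document with one flag per writer.
        Document doc = parse("<order>"
                + "<writer name=\"arc\"><enabled>true</enabled></writer>"
                + "<writer name=\"warc\"><enabled>false</enabled></writer>"
                + "</order>");
        chooseArchiveFormat(doc, "warc");
        System.out.println("arc enabled: " + find(doc, ARCS_ENABLED).getTextContent());
        System.out.println("warc enabled: " + find(doc, WARCS_ENABLED).getTextContent());
    }
}
```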
Field Detail |
---|
public static final java.lang.String QUOTA_ENFORCER_ENABLED_XPATH
public static final java.lang.String GROUP_MAX_ALL_KB_XPATH
public static final java.lang.String GROUP_MAX_FETCH_SUCCESS_XPATH
public static final java.lang.String QUEUE_TOTAL_BUDGET_XPATH
public static final java.lang.String DECIDERULES_MAP_XPATH
public static final java.lang.String DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH
public static final java.lang.String HERITRIX_USER_AGENT_XPATH
public static final java.lang.String HERITRIX_FROM_XPATH
public static final java.lang.String DECIDINGSCOPE_XPATH
public static final java.lang.String DEDUPLICATOR_XPATH
public static final java.lang.String ARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same ARC archiver path, Constants.ARCDIRECTORY_NAME. The archive path tells Heritrix to which directory it shall write its arc files.
public static final java.lang.String WARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same WARC archiver path, Constants.WARCDIRECTORY_NAME. The archive path tells Heritrix to which directory it shall write its warc files.
public static final java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
public static final java.lang.String DEDUPLICATOR_ENABLED
public static final java.lang.String DISK_PATH_XPATH
public static final java.lang.String ARCHIVEFILE_PREFIX_XPATH
public static final java.lang.String ARCSDIR_XPATH
public static final java.lang.String WARCSDIR_XPATH
public static final java.lang.String SEEDS_FILE_XPATH
public static final java.lang.String ARCS_ENABLED_XPATH
public static final java.lang.String WARCS_ENABLED_XPATH
public static final java.lang.String WARCS_WRITE_REQUESTS_XPATH
public static final java.lang.String WARCS_WRITE_METADATA_XPATH
public static final java.lang.String WARCS_WRITE_METADATA_OUTLINKS_XPATH
public static final java.lang.String WARCS_SKIP_IDENTICAL_DIGESTS_XPATH
public static final java.lang.String WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH
public static final java.lang.String WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH
public static final java.lang.String MAXTIMESEC_PATH_XPATH
Constructor Detail |
---|
public HeritrixTemplate(org.dom4j.Document doc, boolean verify)
Parameters:
doc - the order.xml
verify - If true, verifies that the given dom4j Document contains the elements required by our software.
Throws:
ArgumentNotValid - if doc is null, or verify is true and doc does not obey the constraints required by our software.
public HeritrixTemplate(org.dom4j.Document doc)
Parameters:
doc - the order.xml
Method Detail |
---|
public org.dom4j.Document getTemplate()
public boolean isVerified()
public java.lang.String getXML()
public static void editOrderXMLAddCrawlerTraps(org.dom4j.Document orderXMLdoc, java.lang.String elementName, java.util.List<java.lang.String> crawlerTraps)
Parameters:
elementName - The name of the added element.
crawlerTraps - A list of crawler trap regular expressions to add to this job.
public static void editOrderXMLAddPerDomainCrawlerTraps(org.dom4j.Document orderXmlDoc, DomainConfiguration cfg)
Parameters:
cfg - The DomainConfiguration for which to generate crawler trap deciderules
Throws:
IllegalState - If unable to update order.xml due to wrong order.xml format
public static void editOrderXML_ArchiveFormat(org.dom4j.Document orderXML, java.lang.String archiveFormat)
Parameters:
orderXML - the specific heritrix template to modify.
archiveFormat - the chosen archiveformat ('arc' or 'warc' supported)
Throws:
ArgumentNotValid - If the chosen archiveFormat is not supported.
public static void editOrderXML_maxJobRunningTime(org.dom4j.Document orderXMLdoc, long maxJobRunningTime)
Parameters:
maxJobRunningTime - Force the harvestjob to end after maxJobRunningTime
public static void editOrderXML_maxObjectsPerDomain(org.dom4j.Document orderXMLdoc, long forceMaxObjectsPerDomain, boolean maxObjectsIsSetByQuotaEnforcer)
Parameters:
orderXMLdoc - the Document to modify
forceMaxObjectsPerDomain - The maximum number of objects to retrieve per domain, or 0 for no limit.
Throws:
PermissionDenied - If unable to replace the frontier node of the orderXMLdoc Document
IOFailure - If the group-max-fetch-success element is not found in the orderXml.
TODO The group-max-fetch-success check should also be performed in TemplateDAO.create, TemplateDAO.update
public static void editOrderXML_configureQuotaEnforcer(org.dom4j.Document orderXMLdoc, boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
Parameters:
orderXMLdoc - the template to modify
maxObjectsIsSetByQuotaEnforcer - Whether the maximum number of objects is set by the quota enforcer.
forceMaxBytesPerDomain - The number of max bytes per domain enforced (can be no limit)
forceMaxObjectsPerDomain - The number of max objects per domain enforced (can be no limit)
public static void editOrderXML_maxBytesPerDomain(org.dom4j.Document orderXMLdoc, long forceMaxBytesPerDomain)
Parameters:
forceMaxBytesPerDomain - The maximum number of bytes to retrieve per domain, or -1 for no limit. Note that the number is divided by 1024 before being inserted into the orderXml, as Heritrix expects KB.
Throws:
PermissionDenied - If unable to replace the QuotaEnforcer node of the orderXMLdoc Document
IOFailure - If the group-max-all-kb element cannot be found.
TODO This group-max-all-kb check should also be performed in TemplateDAO.create, TemplateDAO.update
public static boolean isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
Parameters:
doc - An order.xml document
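A check like isDeduplicationEnabledInTemplate can be pictured as reading one boolean node through an XPath. The following is a minimal sketch under stated assumptions: it uses the JDK XPath API instead of dom4j, the class name and the XPath value are hypothetical, and treating a missing node as "disabled" is this sketch's choice, not documented behaviour.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch only: reads a boolean flag from an XML document via XPath. */
final class DedupFlagSketch {

    // Hypothetical XPath; the real DEDUPLICATOR_ENABLED value is not shown here.
    static final String DEDUP_ENABLED =
            "//newObject[@name='DeDuplicator']/boolean[@name='enabled']";

    /** Returns true only if the selected node exists and its text is "true". */
    static boolean isEnabled(String orderXml, String xpathExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(orderXml)));
            // The String form of evaluate yields "" when nothing matches.
            String value = XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr, doc);
            return Boolean.parseBoolean(value.trim());
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot evaluate " + xpathExpr, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, simplified fragment of an order.xml.
        String xml = "<crawl-order><newObject name=\"DeDuplicator\">"
                + "<boolean name=\"enabled\">true</boolean>"
                + "</newObject></crawl-order>";
        System.out.println("Deduplication enabled: " + isEnabled(xml, DEDUP_ENABLED));
    }
}
```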
public static void makeOrderfileReadyForHeritrix(HeritrixFiles files) throws IOFailure
Throws:
IOFailure - When the orderfile could not be saved to disk, when a specific node is not found in the XML document, or when the SAXReader cannot parse the XML
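The description of editOrderXML_maxBytesPerDomain states that the byte limit is divided by 1024 before insertion because Heritrix expects KB, and that -1 means no limit. A minimal sketch of that arithmetic follows. The class and method names are hypothetical, and two details are assumptions of this sketch rather than documented behaviour: -1 is passed through unchanged, and the division truncates instead of rounding.

```java
/** Sketch only: converts a per-domain byte limit to the KB value for order.xml. */
final class QuotaKbSketch {

    /** Sentinel for "no limit", as stated for forceMaxBytesPerDomain. */
    static final long NO_LIMIT = -1L;

    /** Returns the limit in KB, or NO_LIMIT if no limit was requested. */
    static long toGroupMaxAllKb(long maxBytesPerDomain) {
        if (maxBytesPerDomain == NO_LIMIT) {
            return NO_LIMIT; // assumption: the sentinel is kept as-is
        }
        if (maxBytesPerDomain < 0) {
            throw new IllegalArgumentException(
                    "Negative byte limit: " + maxBytesPerDomain);
        }
        return maxBytesPerDomain / 1024; // integer division truncates
    }

    public static void main(String[] args) {
        System.out.println(toGroupMaxAllKb(10L * 1024 * 1024)); // prints 10240
        System.out.println(toGroupMaxAllKb(NO_LIMIT));          // prints -1
    }
}
```

Under truncating division, any limit below 1024 bytes becomes 0 KB. How the real method treats such small limits is not stated on this page.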