dk.netarkivet.harvester.datamodel
Class HeritrixTemplate

java.lang.Object
  extended by dk.netarkivet.harvester.datamodel.HeritrixTemplate

public class HeritrixTemplate
extends java.lang.Object

Class encapsulating the Heritrix order.xml. Enables verification that a dom4j Document obeys the constraints required by our software, specifically the Job class. The class assumes the type of order.xml used to configure Heritrix version 1.10+. Information about the Heritrix crawler and its processes and modules can be found in the Heritrix developer and user manuals at http://crawler.archive.org.


Field Summary
static java.lang.String ARC_ARCHIVER_PATH_XPATH
          Xpath to check that all templates use the same ARC archiver path, Constants.ARCDIRECTORY_NAME.
static java.lang.String ARCHIVEFILE_PREFIX_XPATH
          Xpath for the arcfile 'prefix' in the order.xml.
static java.lang.String ARCS_ENABLED_XPATH
           
static java.lang.String ARCSDIR_XPATH
          Xpath for the ARCs dir in the order.xml.
static java.lang.String DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH
          Xpath needed by Job.editOrderXML_crawlerTraps().
static java.lang.String DECIDERULES_MAP_XPATH
          Xpath needed by Job.editOrderXML_crawlerTraps().
static java.lang.String DECIDINGSCOPE_XPATH
          Xpath to check that all templates use the DecidingScope.
static java.lang.String DEDUPLICATOR_ENABLED
          Xpath for the boolean telling if the deduplicator is enabled in order.xml documents.
static java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
          Xpath for the deduplicator index directory node in order.xml documents.
static java.lang.String DEDUPLICATOR_XPATH
          Xpath for the deduplicator node in order.xml documents.
static java.lang.String DISK_PATH_XPATH
          Xpath for the 'disk-path' in the order.xml.
static java.lang.String GROUP_MAX_ALL_KB_XPATH
          Xpath needed by Job.editOrderXML_maxBytesPerDomain().
static java.lang.String GROUP_MAX_FETCH_SUCCESS_XPATH
          Xpath needed by Job.editOrderXML_maxObjectsPerDomain().
static java.lang.String HERITRIX_FROM_XPATH
          Xpath checked by Heritrix for correct mail address.
static java.lang.String HERITRIX_USER_AGENT_XPATH
          Xpath checked by Heritrix for correct user-agent field in requests.
static java.lang.String MAXTIMESEC_PATH_XPATH
          Xpath to check that all templates have the max-time-sec attribute.
static java.lang.String QUEUE_TOTAL_BUDGET_XPATH
          Xpath needed by Job.editOrderXML_maxObjectsPerDomain().
static java.lang.String QUOTA_ENFORCER_ENABLED_XPATH
          Xpath needed by Job.editOrderXML_maxBytesPerDomain().
static java.lang.String SEEDS_FILE_XPATH
          Xpath for the 'seedsfile' in the order.xml.
static java.lang.String WARC_ARCHIVER_PATH_XPATH
          Xpath to check that all templates use the same WARC archiver path, Constants.WARCDIRECTORY_NAME.
static java.lang.String WARCS_ENABLED_XPATH
          Xpath for the WARCs dir in the order.xml.
static java.lang.String WARCS_SKIP_IDENTICAL_DIGESTS_XPATH
           
static java.lang.String WARCS_WRITE_METADATA_OUTLINKS_XPATH
           
static java.lang.String WARCS_WRITE_METADATA_XPATH
           
static java.lang.String WARCS_WRITE_REQUESTS_XPATH
           
static java.lang.String WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH
           
static java.lang.String WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH
           
static java.lang.String WARCSDIR_XPATH
          Xpath for the WARCs dir in the order.xml.
 
Constructor Summary
HeritrixTemplate(org.dom4j.Document doc)
          Alternate constructor, which always verifies the given document.
HeritrixTemplate(org.dom4j.Document doc, boolean verify)
          Constructor for HeritrixTemplate class.
 
Method Summary
static void editOrderXML_ArchiveFormat(org.dom4j.Document orderXML, java.lang.String archiveFormat)
          Make sure that Heritrix will archive its data in the chosen archiveFormat.
static void editOrderXML_configureQuotaEnforcer(org.dom4j.Document orderXMLdoc, boolean maxObjectsIsSetByQuotaEnforcer, long forceMaxBytesPerDomain, long forceMaxObjectsPerDomain)
          Activates or deactivates the quota enforcer, depending on the budget definition.
static void editOrderXML_maxBytesPerDomain(org.dom4j.Document orderXMLdoc, long forceMaxBytesPerDomain)
          Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of bytes to retrieve per domain.
static void editOrderXML_maxJobRunningTime(org.dom4j.Document orderXMLdoc, long maxJobRunningTime)
           
static void editOrderXML_maxObjectsPerDomain(org.dom4j.Document orderXMLdoc, long forceMaxObjectsPerDomain, boolean maxObjectsIsSetByQuotaEnforcer)
          Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of objects to be retrieved per domain.
static void editOrderXMLAddCrawlerTraps(org.dom4j.Document orderXMLdoc, java.lang.String elementName, java.util.List<java.lang.String> crawlerTraps)
          Method to add a list of crawler traps with a given element name.
static void editOrderXMLAddPerDomainCrawlerTraps(org.dom4j.Document orderXmlDoc, DomainConfiguration cfg)
          Updates the order.xml to include a MatchesListRegExpDecideRule for each crawler trap associated with the given DomainConfiguration.
 org.dom4j.Document getTemplate()
          Returns the template.
 java.lang.String getXML()
          Return HeritrixTemplate as XML.
static boolean isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
          Return true if the given order.xml file has deduplication enabled.
 boolean isVerified()
          Has Template been verified?
static void makeOrderfileReadyForHeritrix(HeritrixFiles files)
          This method prepares the orderfile used by the Heritrix crawler.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

QUOTA_ENFORCER_ENABLED_XPATH

public static final java.lang.String QUOTA_ENFORCER_ENABLED_XPATH
Xpath needed by Job.editOrderXML_maxBytesPerDomain().

See Also:
Constant Field Values

GROUP_MAX_ALL_KB_XPATH

public static final java.lang.String GROUP_MAX_ALL_KB_XPATH
Xpath needed by Job.editOrderXML_maxBytesPerDomain().

See Also:
Constant Field Values

GROUP_MAX_FETCH_SUCCESS_XPATH

public static final java.lang.String GROUP_MAX_FETCH_SUCCESS_XPATH
Xpath needed by Job.editOrderXML_maxObjectsPerDomain().

See Also:
Constant Field Values

QUEUE_TOTAL_BUDGET_XPATH

public static final java.lang.String QUEUE_TOTAL_BUDGET_XPATH
Xpath needed by Job.editOrderXML_maxObjectsPerDomain().

See Also:
Constant Field Values

DECIDERULES_MAP_XPATH

public static final java.lang.String DECIDERULES_MAP_XPATH
Xpath needed by Job.editOrderXML_crawlerTraps().

See Also:
Constant Field Values

DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH

public static final java.lang.String DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH
Xpath needed by Job.editOrderXML_crawlerTraps().

See Also:
Constant Field Values

HERITRIX_USER_AGENT_XPATH

public static final java.lang.String HERITRIX_USER_AGENT_XPATH
Xpath checked by Heritrix for correct user-agent field in requests.

See Also:
Constant Field Values

HERITRIX_FROM_XPATH

public static final java.lang.String HERITRIX_FROM_XPATH
Xpath checked by Heritrix for correct mail address.

See Also:
Constant Field Values

DECIDINGSCOPE_XPATH

public static final java.lang.String DECIDINGSCOPE_XPATH
Xpath to check that all templates use the DecidingScope.


DEDUPLICATOR_XPATH

public static final java.lang.String DEDUPLICATOR_XPATH
Xpath for the deduplicator node in order.xml documents.

See Also:
Constant Field Values

ARC_ARCHIVER_PATH_XPATH

public static final java.lang.String ARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same ARC archiver path, Constants.ARCDIRECTORY_NAME. The archive path tells Heritrix which directory to write its ARC files to.

See Also:
Constant Field Values

WARC_ARCHIVER_PATH_XPATH

public static final java.lang.String WARC_ARCHIVER_PATH_XPATH
Xpath to check that all templates use the same WARC archiver path, Constants.WARCDIRECTORY_NAME. The archive path tells Heritrix which directory to write its WARC files to.

See Also:
Constant Field Values

DEDUPLICATOR_INDEX_LOCATION_XPATH

public static final java.lang.String DEDUPLICATOR_INDEX_LOCATION_XPATH
Xpath for the deduplicator index directory node in order.xml documents.

See Also:
Constant Field Values

DEDUPLICATOR_ENABLED

public static final java.lang.String DEDUPLICATOR_ENABLED
Xpath for the boolean telling if the deduplicator is enabled in order.xml documents.

See Also:
Constant Field Values

DISK_PATH_XPATH

public static final java.lang.String DISK_PATH_XPATH
Xpath for the 'disk-path' in the order.xml.

See Also:
Constant Field Values

ARCHIVEFILE_PREFIX_XPATH

public static final java.lang.String ARCHIVEFILE_PREFIX_XPATH
Xpath for the arcfile 'prefix' in the order.xml.

See Also:
Constant Field Values

ARCSDIR_XPATH

public static final java.lang.String ARCSDIR_XPATH
Xpath for the ARCs dir in the order.xml.

See Also:
Constant Field Values

WARCSDIR_XPATH

public static final java.lang.String WARCSDIR_XPATH
Xpath for the WARCs dir in the order.xml.

See Also:
Constant Field Values

SEEDS_FILE_XPATH

public static final java.lang.String SEEDS_FILE_XPATH
Xpath for the 'seedsfile' in the order.xml.

See Also:
Constant Field Values

ARCS_ENABLED_XPATH

public static final java.lang.String ARCS_ENABLED_XPATH
See Also:
Constant Field Values

WARCS_ENABLED_XPATH

public static final java.lang.String WARCS_ENABLED_XPATH
Xpath for the WARCs dir in the order.xml.

See Also:
Constant Field Values

WARCS_WRITE_REQUESTS_XPATH

public static final java.lang.String WARCS_WRITE_REQUESTS_XPATH
See Also:
Constant Field Values

WARCS_WRITE_METADATA_XPATH

public static final java.lang.String WARCS_WRITE_METADATA_XPATH
See Also:
Constant Field Values

WARCS_WRITE_METADATA_OUTLINKS_XPATH

public static final java.lang.String WARCS_WRITE_METADATA_OUTLINKS_XPATH
See Also:
Constant Field Values

WARCS_SKIP_IDENTICAL_DIGESTS_XPATH

public static final java.lang.String WARCS_SKIP_IDENTICAL_DIGESTS_XPATH
See Also:
Constant Field Values

WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH

public static final java.lang.String WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH
See Also:
Constant Field Values

WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH

public static final java.lang.String WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH
See Also:
Constant Field Values

MAXTIMESEC_PATH_XPATH

public static final java.lang.String MAXTIMESEC_PATH_XPATH
Xpath to check that all templates have the max-time-sec attribute.

See Also:
Constant Field Values
Constructor Detail

HeritrixTemplate

public HeritrixTemplate(org.dom4j.Document doc,
                        boolean verify)
Constructor for HeritrixTemplate class.

Parameters:
doc - the order.xml
verify - If true, verifies if the given dom4j Document contains the elements required by our software.
Throws:
ArgumentNotValid - if doc is null, or verify is true and doc does not obey the constraints required by our software.
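
As a rough illustration of what verification on construction could involve, the sketch below checks that a parsed order.xml has a node at each required XPath. It uses the JDK's own DOM and XPath classes instead of dom4j so that it runs without extra libraries. The class name TemplateCheckSketch and its methods are hypothetical, and callers supply their own XPath strings; nothing here reproduces the actual values of the *_XPATH constants or the real checks performed by HeritrixTemplate.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of NetarchiveSuite.
final class TemplateCheckSketch {

    /** Parses an XML string into a JDK DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Returns the XPaths from 'required' that select no node in 'doc'. */
    static List<String> missingXPaths(Document doc, List<String> required) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> missing = new ArrayList<>();
        for (String expr : required) {
            try {
                // With XPathConstants.NODE the result is null when nothing matches.
                if (xpath.evaluate(expr, doc, XPathConstants.NODE) == null) {
                    missing.add(expr);
                }
            } catch (javax.xml.xpath.XPathExpressionException e) {
                throw new IllegalArgumentException("Bad XPath: " + expr, e);
            }
        }
        return missing;
    }
}
```

A verifying constructor could then reject the document whenever the returned list is non-empty, which corresponds to the ArgumentNotValid case described above.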

HeritrixTemplate

public HeritrixTemplate(org.dom4j.Document doc)
Alternate constructor, which always verifies the given document.

Parameters:
doc - the order.xml
Method Detail

getTemplate

public org.dom4j.Document getTemplate()
Returns the template.

Returns:
the template

isVerified

public boolean isVerified()
Has Template been verified?

Returns:
true, if verified on construction, otherwise false

getXML

public java.lang.String getXML()
Return HeritrixTemplate as XML.

Returns:
HeritrixTemplate as XML

editOrderXMLAddCrawlerTraps

public static void editOrderXMLAddCrawlerTraps(org.dom4j.Document orderXMLdoc,
                                               java.lang.String elementName,
                                               java.util.List<java.lang.String> crawlerTraps)
Method to add a list of crawler traps with a given element name. It is used both to add per-domain traps and global traps.

Parameters:
elementName - The name of the added element.
crawlerTraps - A list of crawler trap regular expressions to add to this job.

editOrderXMLAddPerDomainCrawlerTraps

public static void editOrderXMLAddPerDomainCrawlerTraps(org.dom4j.Document orderXmlDoc,
                                                        DomainConfiguration cfg)
Updates the order.xml to include a MatchesListRegExpDecideRule for each crawler trap associated with the given DomainConfiguration. Each added node specifies the decision REJECT, the list logic OR, and the regular expressions themselves (theFirstRegexp, theSecondRegexp, and so on).

Parameters:
cfg - The DomainConfiguration for which to generate crawler trap deciderules
Throws:
IllegalState - If unable to update order.xml due to wrong order.xml format
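
To make the shape of such a rule more concrete, here is a minimal sketch that builds one REJECT/OR rule holding a list of regular expressions. It uses the JDK DOM API rather than dom4j so it is self-contained. The class CrawlerTrapSketch is hypothetical, and the element and attribute names (map, newObject, string, stringList, decision, list-logic, regexp-list) are assumptions chosen for illustration; they are not confirmed by this page.

```java
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of NetarchiveSuite.
final class CrawlerTrapSketch {

    /** Creates an empty document with an assumed <map name="rules"> root. */
    static Document newRulesDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element rules = doc.createElement("map");
            rules.setAttribute("name", "rules");
            doc.appendChild(rules);
            return doc;
        } catch (javax.xml.parsers.ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends one REJECT/OR rule containing all given regular expressions. */
    static Element addTraps(Document doc, String elementName, List<String> traps) {
        Element rule = doc.createElement("newObject"); // assumed element name
        rule.setAttribute("name", elementName);
        rule.appendChild(namedString(doc, "decision", "REJECT"));
        rule.appendChild(namedString(doc, "list-logic", "OR"));
        Element list = doc.createElement("stringList"); // assumed element name
        list.setAttribute("name", "regexp-list");
        for (String trap : traps) {
            Element regexp = doc.createElement("string");
            regexp.setTextContent(trap);
            list.appendChild(regexp);
        }
        rule.appendChild(list);
        doc.getDocumentElement().appendChild(rule);
        return rule;
    }

    /** Builds <string name="...">value</string>. */
    private static Element namedString(Document doc, String name, String value) {
        Element e = doc.createElement("string");
        e.setAttribute("name", name);
        e.setTextContent(value);
        return e;
    }
}
```

Taking the element name as a parameter mirrors how editOrderXMLAddCrawlerTraps is described as serving both per-domain and global traps.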

editOrderXML_ArchiveFormat

public static void editOrderXML_ArchiveFormat(org.dom4j.Document orderXML,
                                              java.lang.String archiveFormat)
Make sure that Heritrix will archive its data in the chosen archiveFormat.

Parameters:
orderXML - the specific heritrix template to modify.
archiveFormat - the chosen archive format ('arc' or 'warc' supported).
Throws:
ArgumentNotValid - If the chosen archiveFormat is not supported.
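
The argument check can be pictured with the following minimal sketch. The class ArchiveFormatSketch and its enum are hypothetical, IllegalArgumentException stands in for ArgumentNotValid, and accepting upper-case spellings is an assumption rather than documented behaviour.

```java
import java.util.Locale;

// Hypothetical helper, not part of NetarchiveSuite.
final class ArchiveFormatSketch {

    enum Format { ARC, WARC }

    /** Maps 'arc' or 'warc' to a Format and rejects anything else. */
    static Format parse(String archiveFormat) {
        if (archiveFormat == null) {
            throw new IllegalArgumentException("archiveFormat must not be null");
        }
        // Assumption: the comparison ignores case.
        switch (archiveFormat.toLowerCase(Locale.ROOT)) {
            case "arc":
                return Format.ARC;
            case "warc":
                return Format.WARC;
            default:
                throw new IllegalArgumentException(
                        "Unsupported archive format: " + archiveFormat);
        }
    }
}
```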

editOrderXML_maxJobRunningTime

public static void editOrderXML_maxJobRunningTime(org.dom4j.Document orderXMLdoc,
                                                  long maxJobRunningTime)
Parameters:
maxJobRunningTime - Force the harvestjob to end after maxJobRunningTime

editOrderXML_maxObjectsPerDomain

public static void editOrderXML_maxObjectsPerDomain(org.dom4j.Document orderXMLdoc,
                                                    long forceMaxObjectsPerDomain,
                                                    boolean maxObjectsIsSetByQuotaEnforcer)
Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of objects to be retrieved per domain. This method updates the 'group-max-fetch-success' element of the QuotaEnforcer pre-fetch processor node (org.archive.crawler.frontier.BdbFrontier) with the value of the argument forceMaxObjectsPerDomain.

Parameters:
orderXMLdoc - the template to modify
forceMaxObjectsPerDomain - The maximum number of objects to retrieve per domain, or 0 for no limit.
Throws:
PermissionDenied - If unable to replace the frontier node of the orderXMLdoc Document
IOFailure - If the group-max-fetch-success element is not found in the orderXml. TODO: The group-max-fetch-success check should also be performed in TemplateDAO.create and TemplateDAO.update.

editOrderXML_configureQuotaEnforcer

public static void editOrderXML_configureQuotaEnforcer(org.dom4j.Document orderXMLdoc,
                                                       boolean maxObjectsIsSetByQuotaEnforcer,
                                                       long forceMaxBytesPerDomain,
                                                       long forceMaxObjectsPerDomain)
Activates or deactivates the quota enforcer, depending on the budget definition. The object limit can be defined either by the queue-total-budget property or by the quota enforcer; the value of the argument maxObjectsIsSetByQuotaEnforcer determines which is used. So quota enforcer is set as follows:

Parameters:
orderXMLdoc - the template to modify
maxObjectsIsSetByQuotaEnforcer - Whether the object limit is set by the quota enforcer (true) or by the queue-total-budget property (false).
forceMaxBytesPerDomain - The number of max bytes per domain enforced (can be no limit)
forceMaxObjectsPerDomain - The number of max objects per domain enforced (can be no limit)

editOrderXML_maxBytesPerDomain

public static void editOrderXML_maxBytesPerDomain(org.dom4j.Document orderXMLdoc,
                                                  long forceMaxBytesPerDomain)
Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of bytes to retrieve per domain. This method updates the 'group-max-all-kb' element of the 'QuotaEnforcer' node, which in turn is a subelement of the 'pre-fetch-processors' node, with the value of the argument forceMaxBytesPerDomain.

Parameters:
forceMaxBytesPerDomain - The maximum number of bytes to retrieve per domain, or -1 for no limit. Note that the number is divided by 1024 before being inserted into the orderXml, as Heritrix expects KB.
Throws:
PermissionDenied - If unable to replace the QuotaEnforcer node of the orderXMLdoc Document
IOFailure - If the group-max-all-kb element cannot be found. TODO: This group-max-all-kb check should also be performed in TemplateDAO.create and TemplateDAO.update.
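
The unit conversion mentioned for forceMaxBytesPerDomain can be sketched as below. The class QuotaUnitsSketch is hypothetical. Two points are assumptions rather than documented behaviour: that -1 ("no limit") is passed through unchanged instead of being divided, and that the division is plain integer division.

```java
// Hypothetical helper, not part of NetarchiveSuite.
final class QuotaUnitsSketch {

    /** Sentinel documented above as meaning "no limit". */
    static final long NO_LIMIT = -1L;

    /** Converts a per-domain byte limit to the KB value Heritrix expects. */
    static long bytesToHeritrixKb(long forceMaxBytesPerDomain) {
        if (forceMaxBytesPerDomain == NO_LIMIT) {
            return NO_LIMIT; // assumption: the sentinel is kept as-is
        }
        // Integer division drops any remainder below one KB.
        return forceMaxBytesPerDomain / 1024;
    }
}
```

Under these assumptions a limit of 1,000,000 bytes is written as 976 KB.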

isDeduplicationEnabledInTemplate

public static boolean isDeduplicationEnabledInTemplate(org.dom4j.Document doc)
Return true if the given order.xml file has deduplication enabled.

Parameters:
doc - An order.xml document
Returns:
True if Deduplicator is enabled.

makeOrderfileReadyForHeritrix

public static void makeOrderfileReadyForHeritrix(HeritrixFiles files)
                                          throws IOFailure
This method prepares the orderfile used by the Heritrix crawler.

1. alters the orderfile in the following way (overriding whatever is in the orderfile):
  1. sets the disk-path to the outputdir specified in HeritrixFiles.
  2. sets the seedsfile to the seedsfile specified in HeritrixFiles.
  3. sets the prefix of the arcfiles to the unique prefix defined in HeritrixFiles.
  4. checks that the arcs-file dir is 'arcs', to ensure that we know where the arc-files are when the crawl finishes.
  5. if deduplication is enabled, sets the node pointing to the index directory for deduplication (see step 3 below).
2. saves the orderfile back to disk

3. if deduplication is enabled in the order.xml, it writes the absolute path of the lucene index used by the deduplication processor.

Throws:
IOFailure - When the orderfile could not be saved to disk; when a specific node is not found in the XML document; when the SAXReader cannot parse the XML.
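
The overriding part of the steps above can be sketched roughly as follows, using the JDK DOM and XPath classes instead of dom4j so the example runs on its own. The class OrderFileSketch, its methods, and the three XPath strings are hypothetical stand-ins; the real paths are the values of DISK_PATH_XPATH, SEEDS_FILE_XPATH and ARCHIVEFILE_PREFIX_XPATH, which this page does not show. Reading from and saving to disk, the 'arcs' directory check and the deduplication index are left out, and IllegalStateException stands in for IOFailure.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of NetarchiveSuite.
final class OrderFileSketch {

    // Assumed XPaths, for illustration only.
    static final String DISK_PATH = "//string[@name='disk-path']";
    static final String SEEDS_FILE = "//string[@name='seedsfile']";
    static final String PREFIX = "//string[@name='prefix']";

    /** Parses an XML string into a JDK DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Overwrites the text of the first node at 'xpath'; fails if it is absent. */
    static void setText(Document doc, String xpath, String value) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
            if (node == null) {
                throw new IllegalStateException("Node not found: " + xpath);
            }
            node.setTextContent(value);
        } catch (javax.xml.xpath.XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    /** Returns the string value of the first node at 'xpath' ("" if absent). */
    static String getText(Document doc, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (javax.xml.xpath.XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    /** Applies the first three overrides: disk-path, seedsfile, prefix. */
    static void prepare(Document doc, String outputDir, String seedsFile, String prefix) {
        setText(doc, DISK_PATH, outputDir);
        setText(doc, SEEDS_FILE, seedsFile);
        setText(doc, PREFIX, prefix);
    }
}
```

Failing as soon as a node is missing matches the documented behaviour that an absent node is reported as an error instead of being silently skipped.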