Class H1HeritrixTemplate

  • All Implemented Interfaces:
    Serializable

    public class H1HeritrixTemplate
    extends HeritrixTemplate
    implements Serializable
    Class encapsulating the Heritrix order.xml. Enables verification that a dom4j Document obeys the constraints required by our software, specifically the Job class.

    The class assumes the type of order.xml used to configure Heritrix version 1.10+. Information about the Heritrix crawler, its processes, and its modules can be found in the Heritrix developer and user manuals at http://crawler.archive.org

    See Also:
    Serialized Form
    • Field Detail

      • QUOTA_ENFORCER_ENABLED_XPATH

        public static final String QUOTA_ENFORCER_ENABLED_XPATH
        Xpath needed by Job.editOrderXML_maxBytesPerDomain().
        See Also:
        Constant Field Values
      • GROUP_MAX_ALL_KB_XPATH

        public static final String GROUP_MAX_ALL_KB_XPATH
        Xpath needed by Job.editOrderXML_maxBytesPerDomain().
        See Also:
        Constant Field Values
      • GROUP_MAX_FETCH_SUCCESS_XPATH

        public static final String GROUP_MAX_FETCH_SUCCESS_XPATH
        Xpath needed by Job.editOrderXML_maxObjectsPerDomain().
        See Also:
        Constant Field Values
      • QUEUE_TOTAL_BUDGET_XPATH

        public static final String QUEUE_TOTAL_BUDGET_XPATH
        Xpath needed by Job.editOrderXML_maxObjectsPerDomain().
        See Also:
        Constant Field Values
      • DECIDERULES_MAP_XPATH

        public static final String DECIDERULES_MAP_XPATH
        Xpath needed by Job.editOrderXML_crawlerTraps().
        See Also:
        Constant Field Values
      • DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH

        public static final String DECIDERULES_ACCEPT_IF_PREREQUISITE_XPATH
        Xpath needed by Job.editOrderXML_crawlerTraps().
        See Also:
        Constant Field Values
      • HERITRIX_USER_AGENT_XPATH

        public static final String HERITRIX_USER_AGENT_XPATH
        Xpath checked by Heritrix for correct user-agent field in requests.
        See Also:
        Constant Field Values
      • HERITRIX_FROM_XPATH

        public static final String HERITRIX_FROM_XPATH
        Xpath checked by Heritrix for correct mail address.
        See Also:
        Constant Field Values
      • DECIDINGSCOPE_XPATH

        public static final String DECIDINGSCOPE_XPATH
        Xpath used to check that all templates use the DecidingScope.
        See Also:
        Constant Field Values
      • DEDUPLICATOR_XPATH

        public static final String DEDUPLICATOR_XPATH
        Xpath for the deduplicator node in order.xml documents.
        See Also:
        Constant Field Values
      • ARC_ARCHIVER_PATH_XPATH

        public static final String ARC_ARCHIVER_PATH_XPATH
        Xpath used to check that all templates use the same ARC archiver path, Constants.ARCDIRECTORY_NAME. The archiver path tells Heritrix which directory to write its ARC files to.
        See Also:
        Constant Field Values
      • WARC_ARCHIVER_PATH_XPATH

        public static final String WARC_ARCHIVER_PATH_XPATH
        Xpath used to check that all templates use the same WARC archiver path, Constants.WARCDIRECTORY_NAME. The archiver path tells Heritrix which directory to write its WARC files to.
        See Also:
        Constant Field Values
      • DEDUPLICATOR_INDEX_LOCATION_XPATH

        public static final String DEDUPLICATOR_INDEX_LOCATION_XPATH
        Xpath for the deduplicator index directory node in order.xml documents.
        See Also:
        Constant Field Values
      • DEDUPLICATOR_ENABLED

        public static final String DEDUPLICATOR_ENABLED
        Xpath for the boolean telling if the deduplicator is enabled in order.xml documents.
        See Also:
        Constant Field Values
      • DISK_PATH_XPATH

        public static final String DISK_PATH_XPATH
        Xpath for the 'disk-path' in the order.xml.
        See Also:
        Constant Field Values
      • ARCHIVEFILE_PREFIX_XPATH

        public static final String ARCHIVEFILE_PREFIX_XPATH
        Xpath for the arcfile 'prefix' in the order.xml.
        See Also:
        Constant Field Values
      • SEEDS_FILE_XPATH

        public static final String SEEDS_FILE_XPATH
        Xpath for the 'seedsfile' in the order.xml.
        See Also:
        Constant Field Values
      • WARCS_ENABLED_XPATH

        public static final String WARCS_ENABLED_XPATH
        Xpath for the WARCs dir in the order.xml.
        See Also:
        Constant Field Values
      • WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH

        public static final String WARCS_WRITE_REVISIT_FOR_IDENTICAL_DIGESTS_XPATH
        See Also:
        Constant Field Values
      • WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH

        public static final String WARCS_WRITE_REVISIT_FOR_NOT_MODIFIED_XPATH
        See Also:
        Constant Field Values
      • METADATA_ITEMS_XPATH

        public static final String METADATA_ITEMS_XPATH
        Xpath for the WARC metadata in the order.xml.
        See Also:
        Constant Field Values
      • MAXTIMESEC_PATH_XPATH

        public static final String MAXTIMESEC_PATH_XPATH
        Xpath used to check that all templates have the max-time-sec attribute.
        See Also:
        Constant Field Values
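The constants above are XPath expressions that are evaluated against the order.xml document. As a rough illustration of what such a lookup involves, the sketch below uses only the JDK's own XML and XPath APIs rather than dom4j. The class name, the helper methods, and the XPath string are hypothetical; the real expressions are the ones listed under Constant Field Values.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathLookupSketch {
    // Hypothetical expression, for illustration only; not the documented constant value.
    static final String EXAMPLE_DISK_PATH_XPATH =
            "/crawl-order/controller/string[@name='disk-path']";

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Returns the text of the first node selected by xpath, or null if no node matches. */
    static String textAt(Document doc, String xpath) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<crawl-order><controller>"
                + "<string name=\"disk-path\">/tmp/job</string>"
                + "</controller></crawl-order>");
        // A null result would mean the template lacks the required node.
        System.out.println(textAt(doc, EXAMPLE_DISK_PATH_XPATH));
    }
}
```

A verification step of the kind described for the constructor could, under these assumptions, treat a null result from `textAt` as a missing required element.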
    • Constructor Detail

      • H1HeritrixTemplate

        public H1HeritrixTemplate​(org.dom4j.Document doc,
                                  boolean verify)
        Constructor for the H1HeritrixTemplate class.
        Parameters:
        doc - the order.xml
        verify - If true, verifies if the given dom4j Document contains the elements required by our software.
        Throws:
        ArgumentNotValid - if doc is null, or verify is true and doc does not obey the constraints required by our software.
      • H1HeritrixTemplate

        public H1HeritrixTemplate​(org.dom4j.Document doc)
        Alternate constructor, which always verifies the given document.
        Parameters:
        doc - the order.xml
      • H1HeritrixTemplate

        public H1HeritrixTemplate​(long template_id,
                                  String templateAsString)
                           throws org.dom4j.DocumentException
        Throws:
        org.dom4j.DocumentException
    • Method Detail

      • getTemplate

        public org.dom4j.Document getTemplate()
        Returns the template.
        Returns:
        the template
      • isVerified

        public boolean isVerified()
        Has Template been verified?
        Returns:
        true, if verified on construction, otherwise false
      • getXML

        public String getXML()
        Return HeritrixTemplate as XML.
        Specified by:
        getXML in class HeritrixTemplate
        Returns:
        HeritrixTemplate as XML
      • editOrderXMLAddCrawlerTraps

        public static void editOrderXMLAddCrawlerTraps​(org.dom4j.Document orderXMLdoc,
                                                       String elementName,
                                                       List<String> crawlerTraps)
        Method to add a list of crawler traps with a given element name. It is used both to add per-domain traps and global traps.
        Parameters:
        elementName - The name of the added element.
        crawlerTraps - A list of crawler trap regular expressions to add to this job.
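As a minimal sketch of the idea, the following JDK-only example appends a named container element holding one child per crawler trap expression. It does not use dom4j, and the tag names `newObject` and `string`, the class name, and both helper methods are assumptions made for illustration; the element structure that the real method writes into the order.xml is not shown on this page.

```java
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CrawlerTrapSketch {
    /** Creates an empty DOM document. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Appends a container named elementName under parent, with one child per trap.
     * Tag names are hypothetical, not taken from the real order.xml schema.
     */
    static Element addTraps(Document doc, Element parent, String elementName,
                            List<String> crawlerTraps) {
        // Validate every expression first so the document is not left half-modified.
        for (String trap : crawlerTraps) {
            Pattern.compile(trap); // throws PatternSyntaxException on an invalid regex
        }
        Element container = doc.createElement("newObject");
        container.setAttribute("name", elementName);
        for (String trap : crawlerTraps) {
            Element entry = doc.createElement("string");
            entry.setTextContent(trap);
            container.appendChild(entry);
        }
        parent.appendChild(container);
        return container;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element rules = doc.createElement("map");
        doc.appendChild(rules);
        Element added = addTraps(doc, rules, "global-traps",
                List.of(".*calendar.*", ".*sessionid=.*"));
        System.out.println(added.getChildNodes().getLength());
    }
}
```

Because the same routine takes the element name as a parameter, it can be called once for per-domain traps and once for global traps, which matches the dual use described above.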
      • editOrderXML_maxObjectsPerDomain

        public static void editOrderXML_maxObjectsPerDomain​(org.dom4j.Document orderXMLdoc,
                                                            long forceMaxObjectsPerDomain,
                                                            boolean maxObjectsIsSetByQuotaEnforcer)
        Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of objects to be retrieved per domain. This method updates the 'group-max-fetch-success' element of the QuotaEnforcer pre-fetch processor node (org.archive.crawler.frontier.BdbFrontier) with the value of the argument forceMaxObjectsPerDomain.
        Parameters:
        orderXMLdoc - the order.xml document to modify
        forceMaxObjectsPerDomain - The maximum number of objects to retrieve per domain, or 0 for no limit.
        Throws:
        PermissionDenied - If unable to replace the frontier node of the orderXMLdoc Document
        IOFailure - If the group-max-fetch-success element is not found in the orderXml. TODO The group-max-fetch-success check should also be performed in TemplateDAO.create and TemplateDAO.update.
      • editOrderXML_configureQuotaEnforcer

        public static void editOrderXML_configureQuotaEnforcer​(org.dom4j.Document orderXMLdoc,
                                                               boolean maxObjectsIsSetByQuotaEnforcer,
                                                               long forceMaxBytesPerDomain,
                                                               long forceMaxObjectsPerDomain)
        Activates or deactivates the quota enforcer, depending on the budget definition. The object limit can be defined either by using the queue-total-budget property or by using the quota enforcer. The argument maxObjectsIsSetByQuotaEnforcer selects between the two. The quota enforcer is therefore set as follows:
        • If the object limit is not set by the quota enforcer, the quota enforcer is disabled only if there is no byte limit.
        • If the object limit is set by the quota enforcer, the quota enforcer is enabled whenever a byte or object limit is set.
        Parameters:
        orderXMLdoc - the template to modify
        maxObjectsIsSetByQuotaEnforcer - Whether the object limit is set by the quota enforcer (true) or by the queue-total-budget property (false).
        forceMaxBytesPerDomain - The number of max bytes per domain enforced (can be no limit)
        forceMaxObjectsPerDomain - The number of max objects per domain enforced (can be no limit)
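The two rules above reduce to a small boolean decision. The sketch below is one way to express it, assuming a hypothetical sentinel `NO_LIMIT` of -1 for both limits; the class name, method name, and sentinel are illustrative and may differ from the constants the real implementation uses.

```java
class QuotaEnforcerSketch {
    // Assumed sentinel meaning "no limit"; the real code may use a different constant.
    static final long NO_LIMIT = -1L;

    /** Decides whether the quota enforcer should be enabled, following the two documented rules. */
    static boolean shouldEnable(boolean maxObjectsIsSetByQuotaEnforcer,
                                long forceMaxBytesPerDomain,
                                long forceMaxObjectsPerDomain) {
        boolean hasByteLimit = forceMaxBytesPerDomain != NO_LIMIT;
        boolean hasObjectLimit = forceMaxObjectsPerDomain != NO_LIMIT;
        if (!maxObjectsIsSetByQuotaEnforcer) {
            // Object limit is handled elsewhere, so only a byte limit requires the enforcer.
            return hasByteLimit;
        }
        // Object limit is enforced here, so either kind of limit requires the enforcer.
        return hasByteLimit || hasObjectLimit;
    }

    public static void main(String[] args) {
        System.out.println(shouldEnable(false, NO_LIMIT, 5000L)); // false
        System.out.println(shouldEnable(true, NO_LIMIT, 5000L));  // true
    }
}
```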
      • isValid

        public boolean isValid()
        Specified by:
        isValid in class HeritrixTemplate
        Returns:
        true, if the template is valid, otherwise false
      • configureQuotaEnforcer

        public void configureQuotaEnforcer​(boolean maxObjectsIsSetByQuotaEnforcer,
                                           long forceMaxBytesPerDomain,
                                           long forceMaxObjectsPerDomain)
        Description copied from class: HeritrixTemplate
        Activates or deactivates the quota enforcer, depending on the budget definition. The object limit can be defined either by using the queue-total-budget property or by using the quota enforcer. The argument maxObjectsIsSetByQuotaEnforcer selects between the two. The quota enforcer is therefore set as follows:
        • If the object limit is not set by the quota enforcer, the quota enforcer is disabled only if there is no byte limit.
        • If the object limit is set by the quota enforcer, the quota enforcer is enabled whenever a byte or object limit is set.
        Specified by:
        configureQuotaEnforcer in class HeritrixTemplate
        Parameters:
        maxObjectsIsSetByQuotaEnforcer - Whether the object limit is set by the quota enforcer (true) or by the queue-total-budget property (false).
        forceMaxBytesPerDomain - The number of max bytes per domain enforced (can be no limit)
        forceMaxObjectsPerDomain - The number of max objects per domain enforced (can be no limit)
      • setMaxBytesPerDomain

        public void setMaxBytesPerDomain​(Long forceMaxBytesPerDomain)
        Auxiliary method to modify the orderXMLdoc Document with respect to setting the maximum number of bytes to retrieve per domain. This method updates the 'group-max-all-kb' element of the 'QuotaEnforcer' node, which in turn is a subelement of the 'pre-fetch-processors' node, with the value of the argument forceMaxBytesPerDomain.
        Specified by:
        setMaxBytesPerDomain in class HeritrixTemplate
        Parameters:
        forceMaxBytesPerDomain - The maximum number of bytes to retrieve per domain, or -1 for no limit. Note that the number is divided by 1024 before being inserted into the orderXml, as Heritrix expects KB.
        Throws:
        PermissionDenied - If unable to replace the QuotaEnforcer node of the orderXMLdoc Document
        IOFailure - If the group-max-all-kb element cannot be found. TODO This group-max-all-kb check should also be performed in TemplateDAO.create and TemplateDAO.update.
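The byte-to-KB conversion mentioned for forceMaxBytesPerDomain can be sketched as follows. The class and method names are hypothetical, and two details are assumptions rather than documented behaviour: that the -1 "no limit" value is passed through unchanged, and that the division truncates. The real implementation may round differently.

```java
class ByteLimitSketch {
    static final long NO_LIMIT = -1L;      // documented "no limit" value for the byte limit
    static final long BYTES_PER_KB = 1024L;

    /** Converts a per-domain byte limit into the KB value expected by Heritrix. */
    static long toHeritrixKb(long forceMaxBytesPerDomain) {
        if (forceMaxBytesPerDomain == NO_LIMIT) {
            return NO_LIMIT; // assumption: the sentinel is not divided
        }
        return forceMaxBytesPerDomain / BYTES_PER_KB; // assumption: truncating division
    }

    public static void main(String[] args) {
        System.out.println(toHeritrixKb(10L * 1024L * 1024L)); // 10240
    }
}
```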
      • IsDeduplicationEnabled

        public boolean IsDeduplicationEnabled()
        Returns true if the template file has deduplication enabled.
        Specified by:
        IsDeduplicationEnabled in class HeritrixTemplate
        Returns:
        True if Deduplicator is enabled.
      • setArchiveFormat

        public void setArchiveFormat​(String archiveFormat)
        Description copied from class: HeritrixTemplate
        Make sure that Heritrix will archive its data in the chosen archiveFormat.
        Specified by:
        setArchiveFormat in class HeritrixTemplate
        Parameters:
        archiveFormat - the chosen archive format ('arc' or 'warc' supported)
        Throws:
        ArgumentNotValid - If the chosen archiveFormat is not supported.
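A minimal sketch of the argument check implied here, using only the JDK: the enum, class, and method names are hypothetical, `IllegalArgumentException` stands in for ArgumentNotValid, and case-insensitive matching is an assumption that the real method may not share.

```java
import java.util.Locale;

class ArchiveFormatSketch {
    enum Format { ARC, WARC }

    /** Maps the archiveFormat string to a supported format, or rejects it. */
    static Format parse(String archiveFormat) {
        if (archiveFormat == null) {
            throw new IllegalArgumentException("archiveFormat must not be null");
        }
        // Assumption: matching ignores case; only 'arc' and 'warc' are supported.
        switch (archiveFormat.toLowerCase(Locale.ROOT)) {
            case "arc":
                return Format.ARC;
            case "warc":
                return Format.WARC;
            default:
                throw new IllegalArgumentException(
                        "Unsupported archive format: " + archiveFormat);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("warc")); // WARC
    }
}
```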
      • setMaxJobRunningTime

        public void setMaxJobRunningTime​(Long maxJobRunningTimeSecondsL)
        Description copied from class: HeritrixTemplate
        Sets the maximum running time for the harvest.
        Specified by:
        setMaxJobRunningTime in class HeritrixTemplate
        Parameters:
        maxJobRunningTimeSecondsL - Limit the harvest to this number of seconds
      • getText

        public String getText()
        Only available for H1 templates.
        Returns:
        the template as a String.
      • insertCrawlerTraps

        public void insertCrawlerTraps​(String elementName,
                                       List<String> crawlerTraps)
        Description copied from class: HeritrixTemplate
        Method to add a list of crawler traps with a given element name. It is used both to add per-domain traps and global traps.
        Specified by:
        insertCrawlerTraps in class HeritrixTemplate
        Parameters:
        elementName - The name of the added element.
        crawlerTraps - A list of crawler trap regular expressions to add to this job.
      • insertWarcInfoMetadata

        public void insertWarcInfoMetadata​(Job ajob,
                                           String origHarvestdefinitionName,
                                           String origHarvestdefinitionComments,
                                           String scheduleName,
                                           String performer)
        Description copied from class: HeritrixTemplate
        Method to add settings to the WARCWriterProcessor, so that it can generate a proper WARCINFO record.
        Specified by:
        insertWarcInfoMetadata in class HeritrixTemplate
        Parameters:
        ajob - a HarvestJob
        origHarvestdefinitionName - The name of the harvestdefinition behind this job
        scheduleName - The name of the schedule used. (Will be null, if the job is not a selectiveHarvest).
        performer - The name of organisation/person doing this harvest
      • insertUmbrabean

        public void insertUmbrabean​(String jobName,
                                    String rabbitMQUrl,
                                    String limitSearchRegEx)
        Description copied from class: HeritrixTemplate
        Inserts all necessary Umbra-related beans in this template.
        Specified by:
        insertUmbrabean in class HeritrixTemplate
        Parameters:
        jobName - a String representing the job; it must be unique within this NAS environment for all time
        rabbitMQUrl - the URL of the RabbitMQ socket connection (amqp://) to which Umbra requests are to be sent
        limitSearchRegEx - the regular expression used to limit the Heritrix search path of URLs to be sent to Umbra.