Class HeritrixTemplate

    • Method Detail

      • configureQuotaEnforcer

        public abstract void configureQuotaEnforcer​(boolean maxObjectsIsSetByQuotaEnforcer,
                                                    long forceMaxBytesPerDomain,
                                                    long forceMaxObjectsPerDomain)
        Activates or deactivates the quota enforcer, depending on the budget definition. The object limit can be defined either by the queue-total-budget property or by the quota enforcer; the value of maxObjectsIsSetByQuotaEnforcer selects which of the two is used. The quota enforcer is therefore set as follows:
        • If the object limit is not set by the quota enforcer, the quota enforcer is disabled only if there is no byte limit.
        • If the object limit is set by the quota enforcer, the quota enforcer is enabled if either a byte limit or an object limit is set.
        Parameters:
        maxObjectsIsSetByQuotaEnforcer - Whether the object limit is set by the quota enforcer (true) or by the queue-total-budget property (false).
        forceMaxBytesPerDomain - The number of max bytes per domain enforced (can be no limit)
        forceMaxObjectsPerDomain - The number of max objects per domain enforced (can be no limit)
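        The two cases above reduce to a small boolean decision. The following is a minimal sketch of that decision, not the NetarchiveSuite implementation: the class QuotaEnforcerSketch, the method shouldEnable, and the convention that a negative value means "no limit" are all assumptions made for illustration.

```java
/** Illustrative sketch only; not part of HeritrixTemplate. */
final class QuotaEnforcerSketch {

    /** Assumption for this sketch: any negative value means "no limit". */
    static final long NO_LIMIT = -1L;

    /**
     * Decides whether the quota enforcer should be enabled.
     * Hypothetical helper mirroring the two documented cases.
     */
    static boolean shouldEnable(boolean maxObjectsIsSetByQuotaEnforcer,
                                long forceMaxBytesPerDomain,
                                long forceMaxObjectsPerDomain) {
        boolean hasByteLimit = forceMaxBytesPerDomain >= 0;
        boolean hasObjectLimit = forceMaxObjectsPerDomain >= 0;
        if (maxObjectsIsSetByQuotaEnforcer) {
            // The enforcer handles both limits, so either one enables it.
            return hasByteLimit || hasObjectLimit;
        }
        // The object limit is handled elsewhere; only a byte limit enables it.
        return hasByteLimit;
    }

    public static void main(String[] args) {
        System.out.println(shouldEnable(false, NO_LIMIT, 100L)); // false
        System.out.println(shouldEnable(false, 1000L, NO_LIMIT)); // true
        System.out.println(shouldEnable(true, NO_LIMIT, 100L));  // true
        System.out.println(shouldEnable(true, NO_LIMIT, NO_LIMIT)); // false
    }
}
```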
      • setIsActive

        public void setIsActive​(boolean isActive)
      • IsDeduplicationEnabled

        public abstract boolean IsDeduplicationEnabled()
        Returns:
        true, if deduplication is enabled in the template (used to determine whether or not to request a deduplication index from the indexserver)
      • isValid

        public abstract boolean isValid()
        Returns:
        true, if the template is valid, otherwise false
      • getXML

        public abstract java.lang.String getXML()
        Returns:
        the XML behind this template
      • insertCrawlerTraps

        public abstract void insertCrawlerTraps​(java.lang.String elementName,
                                                java.util.List<java.lang.String> crawlertraps)
        Method to add a list of crawler traps with a given element name. It is used both to add per-domain traps and global traps.
        Parameters:
        elementName - The name of the added element.
        crawlertraps - A list of crawler trap regular expressions to add to this job.
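        Because each crawler trap is a regular expression, it can be useful to check a list before passing it in. The following is a minimal sketch under stated assumptions: the class CrawlerTrapSketch and its methods are hypothetical helpers, not part of this API, and full-string matching against the URL is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Illustrative sketch only; not part of HeritrixTemplate. */
final class CrawlerTrapSketch {

    /** Compiles every trap, rejecting the list if any expression is malformed. */
    static List<Pattern> compileTraps(List<String> crawlertraps) {
        List<Pattern> compiled = new ArrayList<>();
        for (String trap : crawlertraps) {
            try {
                compiled.add(Pattern.compile(trap));
            } catch (PatternSyntaxException e) {
                throw new IllegalArgumentException("Invalid crawler trap: " + trap, e);
            }
        }
        return compiled;
    }

    /** Returns true if the whole URL matches at least one trap (assumed semantics). */
    static boolean isTrapped(String url, List<Pattern> traps) {
        for (Pattern trap : traps) {
            if (trap.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

        A list validated this way could then be passed as the crawlertraps argument, together with the element name chosen by the caller.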
      • setArchiveFormat

        public abstract void setArchiveFormat​(java.lang.String archiveFormat)
        Make sure that Heritrix will archive its data in the chosen archiveFormat.
        Parameters:
        archiveFormat - the chosen archive format ('arc' or 'warc' supported)
        Throws:
        ArgumentNotValid - If the chosen archiveFormat is not supported.
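        The check on the supported formats can be sketched as follows. This is an illustration, not the library's code: ArchiveFormatSketch and normalizeArchiveFormat are hypothetical names, IllegalArgumentException stands in for ArgumentNotValid so the example is self-contained, and accepting the value case-insensitively is an assumption.

```java
import java.util.Locale;

/** Illustrative sketch only; not part of HeritrixTemplate. */
final class ArchiveFormatSketch {

    /**
     * Returns the format in lower case if it is 'arc' or 'warc'.
     * Throws IllegalArgumentException (a stand-in for ArgumentNotValid) otherwise.
     */
    static String normalizeArchiveFormat(String archiveFormat) {
        if (archiveFormat == null) {
            throw new IllegalArgumentException("archiveFormat must not be null");
        }
        // Assumption: surrounding whitespace and letter case are ignored.
        String format = archiveFormat.trim().toLowerCase(Locale.ROOT);
        if (format.equals("arc") || format.equals("warc")) {
            return format;
        }
        throw new IllegalArgumentException("Unsupported archive format: " + archiveFormat);
    }
}
```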
      • setMaxJobRunningTime

        public abstract void setMaxJobRunningTime​(java.lang.Long maxJobRunningTimeSecondsL)
        Sets the maximum running time for the harvest.
        Parameters:
        maxJobRunningTimeSecondsL - Limit the harvest to this number of seconds
      • insertAttributes

        public abstract void insertAttributes​(java.util.List<EAV.AttributeAndType> attributesAndTypes)
        Try to insert the given list of attributes into the template.
        Parameters:
        attributesAndTypes - The list of attributes and their types to insert into the template
      • editOrderXMLAddPerDomainCrawlerTraps

        public void editOrderXMLAddPerDomainCrawlerTraps​(DomainConfiguration cfg)
        Updates the order.xml to include a MatchesListRegExpDecideRule for each crawler trap associated with the given DomainConfiguration.

        The added nodes have the form

        REJECT OR theFirstRegexp theSecondRegexp

        Parameters:
        cfg - The DomainConfiguration for which to generate crawler trap deciderules
        Throws:
        IllegalState - If unable to update order.xml due to wrong order.xml format
      • setSeedsFilePath

        public abstract void setSeedsFilePath​(java.lang.String absolutePath)
      • setArchiveFilePrefix

        public abstract void setArchiveFilePrefix​(java.lang.String archiveFilePrefix)
      • setDiskPath

        public abstract void setDiskPath​(java.lang.String absolutePath)
      • writeTemplate

        public abstract void writeTemplate​(javax.servlet.jsp.JspWriter out)
      • hasContent

        public abstract boolean hasContent()
      • writeToFile

        public abstract void writeToFile​(java.io.File orderXmlFile)
      • setRecoverlogNode

        public abstract void setRecoverlogNode​(java.io.File recoverlogGzFile)
      • getTemplateFromString

        public static HeritrixTemplate getTemplateFromString​(long template_id,
                                                             java.lang.String templateAsString)
        Construct a H1HeritrixTemplate or H3HeritrixTemplate based on the signature of the given string.
        Parameters:
        template_id - The id of the template
        templateAsString - The template as a String object
        Returns:
        a HeritrixTemplate based on the signature of the given string.
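        How the signature is recognised is not stated here. One plausible approach, sketched below purely as an assumption, is to look at the root element: Heritrix 1 order files use a crawl-order root element, while Heritrix 3 configurations are Spring bean files with a beans root element. The class TemplateSignatureSketch, its Kind enum, and the detect method are hypothetical and not part of this API.

```java
/** Illustrative sketch only; not part of HeritrixTemplate. */
final class TemplateSignatureSketch {

    /** Hypothetical classification of a template string. */
    enum Kind { H1, H3, UNKNOWN }

    /**
     * Guesses the template kind from the root element found in the text.
     * Assumption: a simple substring check is sufficient for illustration.
     */
    static Kind detect(String templateAsString) {
        if (templateAsString == null) {
            return Kind.UNKNOWN;
        }
        if (templateAsString.contains("<crawl-order")) {
            return Kind.H1; // Heritrix 1 order.xml root element
        }
        if (templateAsString.contains("<beans")) {
            return Kind.H3; // Heritrix 3 Spring beans root element
        }
        return Kind.UNKNOWN;
    }
}
```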
      • read

        public static HeritrixTemplate read​(java.io.File orderXmlFile)
        Read the given template from file.
        Parameters:
        orderXmlFile - a given HeritrixTemplate (H1 or H3) as a File
        Returns:
        the given HeritrixTemplate (H1 or H3) as a HeritrixTemplate object
      • read

        public static HeritrixTemplate read​(long template_id,
                                            java.io.Reader orderTemplateReader)
        Read the template using the given Reader.
        Parameters:
        template_id - The id of the template
        orderTemplateReader - A given Reader to read a template
        Returns:
        a HeritrixTemplate object
      • removeDeduplicatorIfPresent

        public abstract void removeDeduplicatorIfPresent()
        Try to remove the deduplicator, if present in the template.
      • insertWarcInfoMetadata

        public abstract void insertWarcInfoMetadata​(Job ajob,
                                                    java.lang.String origHarvestdefinitionName,
                                                    java.lang.String origHarvestdefinitionComments,
                                                    java.lang.String scheduleName,
                                                    java.lang.String performer)
        Method to add settings to the WARCWriterProcessor, so that it can generate a proper WARCINFO record.
        Parameters:
        ajob - a HarvestJob
        origHarvestdefinitionName - The name of the harvestdefinition behind this job
        scheduleName - The name of the schedule used. (Will be null, if the job is not a selectiveHarvest).
        performer - The name of organisation/person doing this harvest
      • insertUmbrabean

        public abstract void insertUmbrabean​(java.lang.String jobName,
                                             java.lang.String rabbitMQUrl,
                                             java.lang.String limitSearchRegEx)
        Inserts all necessary umbra-related beans in this template.
        Parameters:
        jobName - a String representing the job; it must be unique within this NAS environment for all time
        rabbitMQUrl - the URL of the rabbitMQ socket connection (amqp://) to which umbra requests are to be sent
        limitSearchRegEx - the regular expression used to limit the Heritrix search-path of URLs to be sent to Umbra.