Appendix B - Managing Heritrix Harvest Templates (order.xml)



The NetarchiveSuite software uses Heritrix 1.14.4 to harvest webpages. A harvest done by Heritrix is specified with a harvest template (conventionally named order.xml), which describes how much to harvest and from where. Furthermore, a seedlist is always associated with a given order.xml.

The standard harvest template used by NetarchiveSuite follows the order.xml standard of Heritrix 1.10+.

Our default harvest template can be seen here in full: default_orderxml.xml

If you intend to build your own templates, it is recommended to use this template as a baseline.

Mandatory elements in NetarchiveSuite harvest templates and their role

A number of elements in the order.xml are required in all NetarchiveSuite harvest templates:

A. The QuotaEnforcer

The QuotaEnforcer is used to restrict the number of bytes harvested from each domain in the seedlist. A value of -1 for any of the limits below means that the limit is disabled.

Code Block
<newObject name="QuotaEnforcer" class="org.archive.crawler.prefetch.QuotaEnforcer">
        <boolean name="force-retire">false</boolean>
        <boolean name="enabled">true</boolean>
        <newObject name="QuotaEnforcer#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
        </newObject>
        <long name="server-max-fetch-successes">-1</long>
        <long name="server-max-success-kb">-1</long>
        <long name="server-max-fetch-responses">-1</long>
        <long name="server-max-all-kb">-1</long>
        <long name="host-max-fetch-successes">-1</long>
        <long name="host-max-success-kb">-1</long>
        <long name="host-max-fetch-responses">-1</long>
        <long name="host-max-all-kb">-1</long>
        <long name="group-max-fetch-successes">-1</long>
        <long name="group-max-success-kb">-1</long>
        <long name="group-max-fetch-responses">-1</long>
        <long name="group-max-all-kb">-1</long>
        <boolean name="use-sparse-range-filter">true</boolean>
</newObject>
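All the quota values in the template above are -1, i.e. disabled. The byte-quota idea can be sketched as follows; this is an illustrative Python sketch of the decision logic only, not the actual Heritrix implementation, and all names are made up for the example:

```python
# Conceptual sketch of per-group byte-quota enforcement (illustrative only;
# not the actual org.archive.crawler.prefetch.QuotaEnforcer code).

class QuotaEnforcer:
    def __init__(self, group_max_all_kb=-1):
        # -1 mirrors the template default: the limit is disabled.
        self.group_max_all_kb = group_max_all_kb
        self.harvested_kb = {}  # KB harvested so far, per group (e.g. domain)

    def record_fetch(self, group, size_bytes):
        """Account for a fetched object of size_bytes from the given group."""
        self.harvested_kb[group] = self.harvested_kb.get(group, 0) + size_bytes / 1024

    def should_fetch(self, group):
        """Allow further fetches from this group while it is under quota."""
        if self.group_max_all_kb < 0:  # quota disabled, never blocks
            return True
        return self.harvested_kb.get(group, 0) < self.group_max_all_kb

enforcer = QuotaEnforcer(group_max_all_kb=100)
enforcer.record_fetch("example.com", 50 * 1024)
print(enforcer.should_fetch("example.com"))  # True: 50 KB < 100 KB
enforcer.record_fetch("example.com", 60 * 1024)
print(enforcer.should_fetch("example.com"))  # False: 110 KB >= 100 KB
```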

B. The DeDuplicator

The DeDuplicator is a module authored by Kristinn Sigurdsson from the National Library of Iceland. It is part of the write-processor chain and enables us to avoid saving duplicates in our storage. It does this by looking up the URL of the potential duplicate object in the index associated with this module. If the URL is found in the index, and the checksum for the URL in the index is unaltered, the object is not stored; instead, a reference to where the object is already stored is written to the crawl log. If the URL for the object is not found in the index, the object is stored normally. Note that only non-text objects are examined by this module, i.e. objects whose mimetype does not match "^text/.*" (like text/html or text/plain). Note also that deduplication is disabled if either the DeDuplicator element in the harvest template is disabled (its "enabled" element is set to false), or the general setting settings.harvester.harvesting.deduplication.enabled is set to false. NetarchiveSuite uses version 0.4.0 of the DeDuplicator.

Code Block
<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
        <boolean name="enabled">true</boolean>
        <map name="filters">
        </map>
        <string name="index-location"/>
        <string name="matching-method">By URL</string>
        <boolean name="try-equivalent">true</boolean>
        <boolean name="change-content-size">false</boolean>
        <string name="mime-filter">^text/.*</string>
        <string name="filter-mode">Blacklist</string>
        <string name="analysis-mode">Timestamp</string>
        <string name="log-level">SEVERE</string>
        <string name="origin"/>
        <string name="origin-handling">Use index information</string>
        <boolean name="stats-per-host">true</boolean>
</newObject>
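The decision logic described above can be sketched as follows. This is an illustrative Python sketch, not the actual DeDuplicator code; the index structure and names are assumptions made for the example:

```python
# Conceptual sketch of the deduplication decision (illustrative only; not
# is.hi.bok.deduplicator.DeDuplicator). The index maps URL -> checksum and
# a reference to where the object is already stored.
import re

TEXT_MIME = re.compile(r"^text/.*")  # mirrors the template's mime-filter

def is_duplicate(url, checksum, mimetype, index):
    # Text objects (text/html, text/plain, ...) are never deduplicated.
    if TEXT_MIME.match(mimetype):
        return False
    entry = index.get(url)
    # Duplicate only if the URL is indexed AND its checksum is unchanged.
    return entry is not None and entry["checksum"] == checksum

index = {"http://example.com/logo.png": {"checksum": "abc123", "origin": "arcfile-42"}}
print(is_duplicate("http://example.com/logo.png", "abc123", "image/png", index))   # True
print(is_duplicate("http://example.com/logo.png", "def456", "image/png", index))   # False: content changed
print(is_duplicate("http://example.com/page.html", "abc123", "text/html", index))  # False: text object
```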

C. The "http-headers" element

This element describes how Heritrix presents itself to the webservers when fetching data. By default it points to the nonexistent webpage http://my_website.com/my_infopage.html and the equally nonexistent mail address my_email@my_website.com. Please update these to your own institution's information page and email address.

Code Block
        <map name="http-headers">
            <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.14.3 +http://my_website.com/my_infopage.html)</string>
            <string name="from">my_email@my_website.com</string>
        </map>

D. The Archiver element

This element does the actual writing of the fetched objects to an ARC file. In the future we may want to write to WARC files instead, which can easily be done. Heritrix allows you to have multiple 'writers' in use at the same time; for instance, you can write your objects to both ARC and WARC files simultaneously, as well as writing them to a database.

Code Block
<newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
                <boolean name="enabled">true</boolean>
                <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                        <map name="rules">
                        </map>
                </newObject>
                <boolean name="compress">false</boolean>
                <string name="prefix">IAH</string>
                <string name="suffix">${HOSTNAME}</string>
                <integer name="max-size-bytes">100000000</integer>
                <stringList name="path">
                    <string>arcs</string>
                </stringList>
                <integer name="pool-max-active">5</integer>
                <integer name="pool-max-wait">300000</integer>
                <long name="total-bytes-to-write">0</long>
                <boolean name="skip-identical-digests">false</boolean>
    </newObject>

E. The ContentSize element

For harvest statistics to be correct when a job finishes and its results are written back into the database, all templates in NetarchiveSuite require a special content-size annotation post-processor. If this element is not present, the harvested size will always be recorded as 0 in the database for harvests done without it in the template:

Code Block
<newObject name="ContentSize"
class="dk.netarkivet.harvester.harvesting.ContentSizeAnnotationPostProcessor">
                <boolean name="enabled">true</boolean>
                <newObject name="ContentSize#decide-rules"
class="org.archive.crawler.deciderules.DecideRuleSequence">
                        <map name="rules">
                        </map>
                </newObject>
        </newObject>

F. The Scope element

The scope element decides which URLs to harvest and which not to harvest. Before release 3.6.0, we used the following three scopes:

  1. DomainScope. The standard NetarchiveSuite scope allows the harvester to fetch all objects coming from any of the 2nd-level domains represented by the seeds. Embedded objects, like images and stylesheets, are always fetched, even when they come from other domains.
  2. HostScope. This scope is restricted to fetching objects from the hosts represented by the seeds.
  3. PathScope. This scope is restricted to fetching objects from the URL paths represented by the seeds.

These three scopes were all deprecated from Heritrix 1.10.0, and all NetarchiveSuite templates are now required to use the DecidingScope instead. This type of scope uses a sequence of DecideRules to define the scope of the harvest. We now emulate the three old scopes by adding a specific DecideRule to the DecidingScope. In the case of DomainScope, this required designing our own DecideRule (dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule). So for DomainScope-type scopes, you add the following element:

Code Block
<newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file">seeds.txt</string>
                                <boolean name="seeds-as-surt-prefixes">false</boolean>
                                <string name="surts-dump-file"/>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>
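The 2nd-level-domain matching that such a rule performs can be sketched roughly as follows. This is an illustrative Python sketch; the real OnNSDomainsDecideRule works on SURT prefixes and knows about multi-part TLDs such as .co.uk, which this naive version does not handle:

```python
# Naive sketch of DomainScope matching: accept a URI if its 2nd-level
# domain equals that of any seed (illustrative only; not the actual
# dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule).
from urllib.parse import urlparse

def second_level_domain(url):
    """Return the last two labels of the host, e.g. 'netarkivet.dk'."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def in_domain_scope(candidate, seeds):
    """ACCEPT-style check: does the candidate share a 2nd-level domain with a seed?"""
    seed_domains = {second_level_domain(s) for s in seeds}
    return second_level_domain(candidate) in seed_domains

seeds = ["http://www.netarkivet.dk/"]
print(in_domain_scope("http://blog.netarkivet.dk/post/1", seeds))  # True: same 2nd-level domain
print(in_domain_scope("http://example.com/", seeds))               # False: other domain
```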

Emulating the HostScope requires adding the OnHostsDecideRule element:

Code Block
<newObject name="acceptIfOnSeedsHosts" class="org.archive.crawler.deciderules.OnHostsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-dump-file"></string>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
                        </newObject>

Emulating the PathScope requires adding the SurtPrefixedDecideRule element:

Code Block
<newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file"></string>
                                <boolean name="seeds-as-surt-prefixes">true</boolean>
                                <string name="surts-dump-file"></string>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
                        </newObject>

An example of a complete DecidingScope element is shown below.

Code Block
        <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <!-- DecideRuleSequence. Multiple DecideRules applied in order with last non-PASS the resulting decision -->
            <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                        <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
                        <newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file"></string>
                                <boolean name="seeds-as-surt-prefixes">true</boolean>
                                <string name="surts-dump-file"/>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
                        </newObject>
                        <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
                                <integer name="max-hops">25</integer>
                        </newObject>
                        <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
                                <integer name="max-repetitions">3</integer>
                        </newObject>
                        <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
                                <integer name="max-trans-hops">25</integer>
                                <integer name="max-speculative-hops">1</integer>
                        </newObject>
                        <newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
                                <integer name="max-path-depth">20</integer>
                        </newObject>
                        <newObject name="global_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
                             <string name="decision">REJECT</string>
                             <string name="list-logic">OR</string>
                             <stringList name="regexp-list">
                             <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                             <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                             <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                             <string>.*act=calendar&amp;cal_id=.*</string>
                             <string>.*advCalendar_pi.*</string>
                             <string>.*cal\.asp\?date=.*</string>
                             <string>.*cal\.asp\?view=monthly&amp;date=.*</string>
                             <string>.*cal\.asp\?view=weekly&amp;date=.*</string>
                             <string>.*cal\.asp\?view=yearly&amp;date=.*</string>
                             <!-- ... (further crawler-trap regular expressions elided) ... -->
                             <string>.*index\.php\?iDate=.*</string>
                             <string>.*index\.php\?module=PostCalendar&amp;func=view.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_day&amp;year=.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_detail&amp;year=.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_month&amp;year=.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_week&amp;year=.*</string>
                        </stringList>
                    </newObject>
                </map> <!-- end rules -->
            </newObject> <!-- end decide-rules -->
        </newObject> <!-- End DecidingScope -->
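The sequence semantics noted in the comment above (multiple DecideRules applied in order, with the last non-PASS answer as the resulting decision) can be sketched as follows. This is an illustrative Python sketch with toy rules, not the Heritrix implementation:

```python
# Conceptual sketch of DecideRuleSequence semantics (illustrative only):
# each rule returns ACCEPT, REJECT, or PASS, and the last non-PASS
# answer wins.

def decide(uri, rules):
    decision = "REJECT"  # nothing is in scope until some rule accepts it
    for rule in rules:
        verdict = rule(uri)
        if verdict != "PASS":
            decision = verdict  # later rules override earlier ones
    return decision

# Toy rules loosely mirroring the sequence above.
def reject_by_default(uri):
    return "REJECT"

def accept_seed_domain(uri):
    return "ACCEPT" if "netarkivet.dk" in uri else "PASS"

def reject_deep_paths(uri):
    # Stand-in for the hop/path-depth rules: reject very deep URIs.
    return "REJECT" if uri.count("/") > 20 else "PASS"

rules = [reject_by_default, accept_seed_domain, reject_deep_paths]
print(decide("http://www.netarkivet.dk/page", rules))  # ACCEPT
print(decide("http://example.com/page", rules))        # REJECT
```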
The anatomy of a DecidingScope

Finally, we describe the remaining components of a DecidingScope element.

The header
Code Block
<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <!-- DecideRuleSequence. Multiple DecideRules applied in order with last non-PASS the resulting decision -->
            <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                  <map name="rules">
                        <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
The defining deciderule

Next comes the DecideRule that defines the scope as either a DomainScope, a HostScope, or a PathScope (one of the three rule elements shown earlier).

Standard harvest rules

These rules add more restrictions to the scope:

Code Block
                        <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
                                <integer name="max-hops">25</integer>
                        </newObject>
                        <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
                                <integer name="max-repetitions">3</integer>
                        </newObject>
                        <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
                                <integer name="max-trans-hops">25</integer>
                                <integer name="max-speculative-hops">1</integer>
                        </newObject>
                        <newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
                                <integer name="max-path-depth">20</integer>
                        </newObject>
Define general crawlertraps to be avoided

Lists of crawler traps to be avoided are defined with a MatchesListRegExpDecideRule. Here we list all crawler traps, each defined by a regular expression. If an object matches any of these regular expressions, the object is not fetched (unless a later rule in the sequence overrides the decision).

Code Block
<newObject name="global_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
                             <string name="decision">REJECT</string>
                             <string name="list-logic">OR</string>
                             <stringList name="regexp-list">
                               <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                             </stringList>
</newObject>
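The OR list logic can be sketched as follows; again an illustrative Python sketch, not the Heritrix implementation:

```python
# Conceptual sketch of MatchesListRegExpDecideRule with OR list-logic
# (illustrative only): a URI matching ANY regexp on the list gets the
# REJECT decision; otherwise the rule PASSes and earlier decisions stand.
import re

TRAPS = [re.compile(p) for p in [
    r".*core\.UserAdmin.*core\.UserLogin.*",
    r".*cal\.asp\?date=.*",
]]

def trap_decision(uri):
    if any(t.match(uri) for t in TRAPS):
        return "REJECT"
    return "PASS"  # let the earlier rules' decision stand

print(trap_decision("http://example.com/cal.asp?date=2010-01-01"))  # REJECT
print(trap_decision("http://example.com/index.html"))               # PASS
```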

When a new harvest job is created, another MatchesListRegExpDecideRule specifying the crawler traps to be avoided is added to the harvest template.

The HarvestTemplateApplication tool

You can upload and download templates using our GUI, as described in Harvester Templates. But you can also manage templates from the command line with the HarvestTemplateApplication, which allows you to create, download, and update templates, and to list them all. We have made a script to make this application easier to use: HarvestTemplateApplication.sh.txt

Code Block
java dk.netarkivet.harvester.tools.HarvestTemplateApplication <command> <args>
create <template-name> <xml-file for this template>
download [<template-name>]
update <template-name> <xml-file to replace this template>
showall

Predefined harvest templates

All our templates fall into three categories depending on the scope defined in the template. Note that our templates generally do not obey robots.txt, because Danish legislation allows us to ignore the constraints dictated by robots.txt. However, there are two exceptions to this rule:

  • default_obeyrobots.xml
  • default_obeyrobots_withforms.xml

Even though DomainScope, HostScope, and PathScope are now emulated using DecidingScope, these categories are still useful:

Templates w/ DomainScope

  1. default_orderxml.xml (standard template)
  2. default_withforms.xml (standard template that can handle forms)
  3. default_obeyrobots.xml (standard template that obeys robots.txt)
  4. default_obeyrobots_withforms.xml (standard template that obeys robots.txt and handles forms)
  5. default_orderxml_low_bandwidth.xml (standard template for sites with low bandwidth)
  6. frontpages.xml (harvest template that only harvest the seeds and associated stylesheets and images)
  7. frontpages_plus_1level.xml (the above plus one extra level)
  8. frontpages_plus_2levels.xml (the above plus two extra levels)

Templates w/ HostScope

  1. host_10levels_orderxml.xml (harvest the hosts of the seeds up to 10 levels from seeds)
  2. host_100levels_orderxml.xml (harvest the hosts of the seeds up to 100 levels from seeds)

Templates w/ PathScope

  1. path_10levels_orderxml.xml (harvest the paths of the seeds up to 10 levels from seeds)
  2. path_100levels_orderxml.xml (harvest the paths of the seeds up to 100 levels from seeds)