Release Date: 4th November 2016


Java 8

NetarchiveSuite now requires a Java 8 runtime for all components.

New Settings

  • ChecksumFileApplication


    Code Block
    * <b>settings.archive.checksum.usePrecomputedChecksumDuringUpload</b>: This decides whether or not to use the pre-computed checksum sent as part of the StoreMessage and UploadMessage.
    * The default is false.
        public static String CHECKSUM_USE_PRECOMPUTED_CHECKSUM_DURING_UPLOAD= "settings.archive.checksum.usePrecomputedChecksumDuringUpload";

    This boolean can be used to optimise the upload process to the bitarchives.


  • GUIApplication, HarvestJobManager

    Code Block
     * <b>settings.common.topLevelDomains.tld</b>: <br>
     * Extra valid top level domain, like .dk or .org, not part of the current embedded public_suffix_list.dat file
     * in common/common-core/src/main/resources/dk/netarkivet/common/utils/public_suffix_list.dat
     * downloaded from
    public static String TLDS = "settings.common.topLevelDomains.tld";
  • HarvestControllerApplication

    Code Block
     * The version number which goes in metadata file names like 12345-metadata-&lt;version number&gt;.warc.gz
    public static String METADATA_FILE_VERSION_NUMBER = "settings.harvester.harvesting.metadata.filename.versionnumber";

    This parameter allows for the definition of different generations of metadata file.

    Code Block
     * <b>settings.harvester.harvesting.metadata.compression</b> Do we compress the
     * metadata associated with a given harvest job. 
     * default: false 
    public static String METADATA_COMPRESSION = "settings.harvester.harvesting.metadata.compression";

    Controls whether metadata files are generated in compressed (warc.gz) format.

  • ViewerproxyApplication, IndexServerApplication, WaybackIndexerApplication

    Code Block
     * Specifies the suffix of a regex which can identify valid metadata files by job number. Thus preceding
     * the value of this setting with .* will find all metadata files.
    public static String METADATAFILE_REGEX_SUFFIX = "settings.common.metadata.fileregexsuffix";

    This parameter allows one to determine which metadata files to include in indexing (for Viewerproxy or Wayback). The full regex string to be searched consists of the string <jobid>-<harvestid> followed by this suffix. The default value is -metadata-[0-9]+.(w)?arc(.gz)? which matches all metadata files using the standard NetarchiveSuite naming scheme.

  • GUIApplication

    Code Block
         * <b>settings.harvester.viewerproxy.allowFileDownloads</b> If set to false, there will be no links to
         * allow download of warcfiles via the Viewerproxy GUI.
        public static String ALLOW_FILE_DOWNLOADS = "settings.harvester.viewerproxy.allowFileDownloads";

    A simple security feature to hinder operators from easily downloading harvested archive files. (default: true)

    Code Block
       public static String HERITRIX3_MONITOR_TEMP_PATH = "settings.harvester.harvesting.monitor.tempPath";

    Path to a directory which the new Heritrix3 monitor feature can use for caching. The setting is empty by default, in which case the system-wide temporary directory (usually /tmp) is used.
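
As an illustration of the METADATAFILE_REGEX_SUFFIX setting described above, the following minimal sketch shows how the default suffix could be combined with a prefix to select metadata files. The class, the method names and the sample file names are hypothetical and not part of NetarchiveSuite; only the default suffix value and the `.*` prefix come from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only, not NetarchiveSuite code.
class MetadataRegexSketch {
    // Default value of settings.common.metadata.fileregexsuffix, as quoted above.
    static final String DEFAULT_SUFFIX = "-metadata-[0-9]+.(w)?arc(.gz)?";

    // Pattern for the metadata files of one job: a literal prefix followed by the suffix.
    static Pattern forPrefix(String prefix) {
        return Pattern.compile(Pattern.quote(prefix) + DEFAULT_SUFFIX);
    }

    // Pattern for all metadata files: the suffix preceded by ".*".
    static Pattern forAllJobs() {
        return Pattern.compile(".*" + DEFAULT_SUFFIX);
    }

    public static void main(String[] args) {
        // Made-up file names: two metadata files and one ordinary archive file.
        String[] names = {
            "12345-1-metadata-1.warc.gz",
            "12345-1-metadata-2.arc",
            "12345-1-00001.warc.gz"
        };
        for (String name : names) {
            System.out.println(name
                    + " job=" + forPrefix("12345-1").matcher(name).matches()
                    + " all=" + forAllJobs().matcher(name).matches());
        }
    }
}
```

Note that the dots in the default suffix are unescaped, so each one matches any single character rather than only a literal ".".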

Control Heritrix from NetarchiveSuite (beta)

In earlier versions of NetarchiveSuite, there was limited monitoring of running Heritrix harvests in the NetarchiveSuite GUI, but management of running jobs required opening the Heritrix3 console itself. From NetarchiveSuite 5.2, much of the Heritrix3 console functionality has been moved into NetarchiveSuite. It is now possible, from NAS itself, to:

  • pause, unpause or terminate running Heritrix jobs
  • inspect reports on running jobs
  • show the crawl-log of a running job, either in its entirety or filtered by regex
  • show and manipulate the Heritrix frontier

These extensive new features are experimental in NAS 5.2, and the developers welcome feedback, bug reports, and code patches.

Top-Level Domains Can Be Defined Externally

From NAS 5.2, all ICANN-recognized domains are accepted as valid in NAS. NAS contains an embedded copy of the public suffix list (public_suffix_list.dat), but this may be overridden, if necessary, by placing an alternative copy at the hard-coded path conf/public_suffix_list.dat in the installation on the machine where the GUIApplication and HarvestJobManager run.
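
The override rule described above can be sketched as a simple "external file first, embedded copy otherwise" check. This is a minimal sketch, not the NetarchiveSuite implementation: the class and method names are hypothetical, and the installation directory is a temporary stand-in. The override path and the classpath location of the embedded file are the ones named in these release notes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only, not NetarchiveSuite code.
class SuffixListSourceSketch {
    // Override location named in the release notes, relative to the installation directory.
    static final String OVERRIDE_PATH = "conf/public_suffix_list.dat";
    // Classpath location of the embedded copy, as given in the settings documentation.
    static final String EMBEDDED_RESOURCE = "dk/netarkivet/common/utils/public_suffix_list.dat";

    // Returns a description of which copy would be used for the given installation directory.
    static String chooseSource(Path installDir) {
        Path override = installDir.resolve(OVERRIDE_PATH);
        if (Files.isRegularFile(override) && Files.isReadable(override)) {
            return "external:" + override;
        }
        return "embedded:" + EMBEDDED_RESOURCE;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in installation directory; a real installation path would be used instead.
        Path installDir = Files.createTempDirectory("nas-install");
        System.out.println(chooseSource(installDir)); // no override file present yet

        Files.createDirectories(installDir.resolve("conf"));
        Files.write(installDir.resolve(OVERRIDE_PATH), "example\n".getBytes());
        System.out.println(chooseSource(installDir)); // override file now present
    }
}
```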

warc.gz metadata files

NAS now supports compression of metadata files (warc.gz format) via the setting settings.harvester.harvesting.metadata.compression.
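
To illustrate how the compression setting and the metadata version number might together shape a file name, here is a minimal sketch based on the pattern 12345-metadata-<version number>.warc.gz quoted in the settings list. The helper is hypothetical and not NetarchiveSuite code, and the uncompressed form ending in .warc is an assumption.

```java
// Hypothetical helper for illustration only, not NetarchiveSuite code.
class MetadataFileNameSketch {
    // Builds a name of the form <job id>-metadata-<version number>.warc, adding .gz when compressing.
    // The uncompressed ".warc" form is an assumption; the release notes only show the ".warc.gz" form.
    static String metadataFileName(long jobId, int version, boolean compress) {
        String name = jobId + "-metadata-" + version + ".warc";
        return compress ? name + ".gz" : name;
    }

    public static void main(String[] args) {
        System.out.println(metadataFileName(12345L, 1, true));
        System.out.println(metadataFileName(12345L, 1, false));
    }
}
```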

Warc Revisit Records

NAS now generates WARC revisit records when using the is.hi.bok.deduplicator.DeDuplicator deduplicator.


The web GUI now uses an embedded Tomcat, rather than Jetty, as a servlet container. This changeover should be invisible to the end user.

New Heritrix Version

NAS now uses the most recent (unofficial) Heritrix release from Kristinn Sigurðsson at the National Library of Iceland (version 3.3.0-LBS-2016-02).

RSS Crawling

The Heritrix crawl-rss extension from Kristinn Sigurðsson at the National Library of Iceland now also comes bundled with NAS and is therefore available for use in NAS crawls (see RSS Harvests for documentation).

GUI Styling

The styling of the web interface has been improved.

Full list of issues resolved in this release

JQL query: project = NAS AND issuetype in standardIssueTypes() AND fixVersion = 5.2 AND NOT component = Test ORDER BY priority DESC, created ASC

Known issues

JQL query: project = NAS AND issuetype = Bug AND affectedVersion = 5.2 ORDER BY priority DESC, cf[10010] ASC, fixVersion ASC