
5.2.2 Release Date: 25th November 2016

5.2.1 Release Date: 23rd November 2016

5.2 Release Date: 4th November 2016


Highlights in 5.2.2

NAS 5.2.2 restores functionality that had been missing since the upgrade to Heritrix 3: deduplication can again be switched on or off through a setting of the HarvestJobManager component. The setting, harvester.harvesting.deduplication.enabled in settings_HarvestJobManagerApplication.xml, is boolean-valued. It is applied to harvests generated using any crawler template which includes the DeDuplicator bean and which specifies the appropriate placeholder, for example as follows:

Code Block
  <bean id="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
    <!-- DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER is replaced by path on harvest-server -->
    <property name="indexLocation" value="%{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}"/>
    <property name="matchingMethod" value="URL"/>
    <property name="tryEquivalent" value="TRUE"/>
    <property name="changeContentSize" value="false"/>
    <property name="mimeFilter" value="^text/.*"/>
    <property name="filterMode" value="BLACKLIST"/>
    <property name="origin" value=""/>
    <property name="originHandling" value="INDEX"/>
    <property name="statsPerHost" value="true"/>
    <property name="enabled" value="%{DEDUPLICATION_ENABLED_PLACEHOLDER}" />
  </bean>

The %{DEDUPLICATION_ENABLED_PLACEHOLDER} placeholder is replaced with the current value of the setting when jobs are generated. The placeholder is optional: deduplication is enabled by default for any template which includes the DeDuplicator in its disposition chain and for which the "enabled" property is not explicitly defined.
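The substitution step can be pictured with a minimal Java sketch. This is only an illustration under stated assumptions, not the NetarchiveSuite implementation: the class TemplatePlaceholderSketch and its fill method are hypothetical names, and the only thing taken from these notes is the %{NAME} placeholder syntax.

```java
import java.util.Map;

/** Hypothetical helper (not part of NetarchiveSuite): fills %{NAME} placeholders in a template. */
final class TemplatePlaceholderSketch {

    private TemplatePlaceholderSketch() {
    }

    /**
     * Replaces every occurrence of %{NAME} in the template with the value
     * registered for NAME. Placeholders without a value are left untouched.
     */
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // String.replace does literal (non-regex) replacement of all occurrences.
            result = result.replace("%{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }
}
```

Calling fill on the "enabled" property line of the bean above, with DEDUPLICATION_ENABLED_PLACEHOLDER mapped to the current setting value, would yield the line that ends up in the generated job configuration.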

Highlights in 5.2.1

NAS 5.2.1 is a bugfix release addressing an issue in wayback-indexing of deduplicated records.

Highlights in 5.2

Java 8


NetarchiveSuite now requires a Java 8 runtime for all components.

New Settings

  • ChecksumFileApplication

    Code Block
    /**
     * <b>settings.archive.checksum.usePrecomputedChecksumDuringUpload</b>: Whether or not to use the
     * pre-computed checksum sent as part of the StoreMessage and UploadMessage.
     * The default is false.
     */
    public static String CHECKSUM_USE_PRECOMPUTED_CHECKSUM_DURING_UPLOAD = "settings.archive.checksum.usePrecomputedChecksumDuringUpload";

    This boolean setting can be used to optimise the upload process to the bitarchives.

  • GUIApplication, HarvestJobManager

    Code Block
    /**
     * <b>settings.common.topLevelDomains.tld</b>: <br>
     * Extra valid top-level domains, like .co.uk, .dk, .org, which are not part of the currently embedded
     * public_suffix_list.dat file in
     * common/common-core/src/main/resources/dk/netarkivet/common/utils/public_suffix_list.dat,
     * downloaded from https://www.publicsuffix.org/list/public_suffix_list.dat
     */
    public static String TLDS = "settings.common.topLevelDomains.tld";

  • HarvestControllerApplication

    Code Block
    /**
     * The version number which goes in metadata file names like 12345-metadata-&lt;version number&gt;.warc.gz
     */
    public static String METADATA_FILE_VERSION_NUMBER = "settings.harvester.harvesting.metadata.filename.versionnumber";

    This parameter allows for the definition of different generations of metadata file.

    Code Block
    /**
     * <b>settings.harvester.harvesting.metadata.compression</b> Whether to compress the
     * metadata associated with a given harvest job.
     * Default: false
     */
    public static String METADATA_COMPRESSION = "settings.harvester.harvesting.metadata.compression";

    Controls whether metadata files are generated in compressed (warc.gz) format.

  • ViewerproxyApplication, IndexServerApplication, WaybackIndexerApplication

    Code Block
    /**
     * Specifies the suffix of a regex which can identify valid metadata files by job number. Thus preceding
     * the value of this setting with .* will find all metadata files.
     */
    public static String METADATAFILE_REGEX_SUFFIX = "settings.common.metadata.fileregexsuffix";

    This parameter allows one to determine which metadata files to include in indexing (for Viewerproxy or Wayback). The full regex string to be searched consists of the string <jobid>-<harvestid> followed by this suffix. The default value is -metadata-[0-9]+.(w)?arc(.gz)? which matches all metadata files using the standard NetarchiveSuite naming scheme.

  • GUIApplication

    Code Block
    /**
     * <b>settings.harvester.viewerproxy.allowFileDownloads</b> If set to false, there will be no links to
     * allow download of warcfiles via the Viewerproxy GUI.
     */
    public static String ALLOW_FILE_DOWNLOADS = "settings.harvester.viewerproxy.allowFileDownloads";

    A simple security feature to hinder operators from easily downloading harvested archive files. (default: true)

    Code Block
    public static String HERITRIX3_MONITOR_TEMP_PATH = "settings.harvester.harvesting.monitor.tempPath";

    Path to a directory which the new Heritrix3 monitor feature can use for caching. This is empty by default, and falls back to the system-wide temporary directory (usually /tmp).
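The fallback behaviour described for the monitor temp path can be sketched as follows. This is an illustrative assumption about how such a fallback might look, not the NetarchiveSuite code: MonitorTempPathSketch and resolveTempDir are hypothetical names, and the system-wide temporary directory is taken here from the standard java.io.tmpdir system property.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of NetarchiveSuite): resolves the monitor cache directory. */
final class MonitorTempPathSketch {

    private MonitorTempPathSketch() {
    }

    /**
     * Returns the configured directory, or the JVM's temporary directory
     * when the configured value is null, empty, or only whitespace.
     */
    static Path resolveTempDir(String configuredPath) {
        if (configuredPath == null || configuredPath.trim().isEmpty()) {
            // java.io.tmpdir is usually /tmp on Linux systems.
            return Paths.get(System.getProperty("java.io.tmpdir"));
        }
        return Paths.get(configuredPath);
    }
}
```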

Control Heritrix from NetarchiveSuite (beta)


In earlier versions of NetarchiveSuite, there was limited monitoring of running Heritrix harvests in the NetarchiveSuite GUI, but management of running jobs required opening the Heritrix3 console itself. From NetarchiveSuite 5.2, much of the Heritrix3 console functionality has been moved into NetarchiveSuite. It is now possible, from NAS itself, to:

  • pause, unpause or terminate running Heritrix jobs
  • inspect reports on running jobs
  • show the crawl log of a running job, either in its entirety or filtered by regex
  • show and manipulate the Heritrix frontier

These extensive new features are experimental in NAS 5.2, and the developers welcome feedback, bug reports, and code patches.

Top-Level Domains Can Be Defined Externally

From NAS 5.2, all ICANN-recognized domains are recognized as valid in NAS. NAS contains an embedded copy of https://publicsuffix.org/list/public_suffix_list.dat, but this may be overridden, if necessary, by placing an alternative copy at the hard-coded path conf/public_suffix_list.dat in the installation on the machine where the GUIApplication and HarvestJobManager run. 
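The override rule can be illustrated with a short Java sketch. It is written under assumptions: PublicSuffixOverrideSketch and findOverride are hypothetical names, and the actual lookup in NetarchiveSuite may differ. Only the relative path conf/public_suffix_list.dat is taken from the text above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper (not part of NetarchiveSuite): looks for an external public suffix list. */
final class PublicSuffixOverrideSketch {

    /** Relative path of the override file, as stated in these release notes. */
    static final String OVERRIDE_PATH = "conf/public_suffix_list.dat";

    private PublicSuffixOverrideSketch() {
    }

    /**
     * Returns the override file if it exists under the installation directory.
     * An empty result means the embedded copy of the list should be used.
     */
    static Optional<Path> findOverride(Path installDir) {
        Path candidate = installDir.resolve(OVERRIDE_PATH);
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}
```

Returning an Optional keeps the "use the embedded copy" case explicit for the caller instead of signalling it with null.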

warc.gz metadata files

NAS now supports compression of metadata files (warc.gz format) via the setting settings.harvester.harvesting.metadata.compression.
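To show how the compression setting interacts with the metadata file naming and the regex suffix listed under New Settings, here is a minimal Java sketch. It is an assumption-laden illustration rather than NetarchiveSuite code: MetadataFileSketch and its methods are hypothetical, the file name pattern follows the example 12345-metadata-<version number>.warc.gz quoted in the settings documentation, and the regex is the default suffix quoted there, prefixed with .* as that documentation suggests.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of NetarchiveSuite): builds and recognises metadata file names. */
final class MetadataFileSketch {

    /** Default value of settings.common.metadata.fileregexsuffix, as quoted in these notes. */
    static final String DEFAULT_REGEX_SUFFIX = "-metadata-[0-9]+.(w)?arc(.gz)?";

    private MetadataFileSketch() {
    }

    /** Builds a name like 12345-metadata-1.warc, with .gz appended when compression is on. */
    static String metadataFileName(long jobId, int versionNumber, boolean compress) {
        String name = jobId + "-metadata-" + versionNumber + ".warc";
        return compress ? name + ".gz" : name;
    }

    /** Checks a file name against ".*" followed by the default regex suffix. */
    static boolean isMetadataFile(String fileName) {
        return Pattern.matches(".*" + DEFAULT_REGEX_SUFFIX, fileName);
    }
}
```

Both the compressed and the uncompressed name produced by metadataFileName satisfy isMetadataFile, which is why the default suffix does not need to change when compression is switched on.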

WARC Revisit Records

NAS now generates WARC revisit records when using the is.hi.bok.deduplicator.DeDuplicator deduplicator.

Tomcat

The web GUI now uses an embedded Tomcat, rather than Jetty, as a servlet container. This changeover should be invisible to the end user.

New Heritrix Version

NAS now uses the most recent (unofficial) Heritrix release from Kristinn Sigurðsson at the National Library of Iceland (version 3.3.0-LBS-2016-02).

RSS Crawling

The Heritrix crawl-rss extension from Kristinn Sigurðsson at the National Library of Iceland now also comes bundled with NAS, and is therefore available for use in NAS crawls. (See RSS Harvests for documentation).

GUI Styling

The styling of the web interface has been improved.


Most-recent updates for 5.2.x:

Issues resolved in release 5.2.1

JIRA query (SBForge): project = NAS AND issuetype in standardIssueTypes() AND fixVersion = 5.2.1 AND NOT component = Test ORDER BY priority DESC, created ASC

Issues resolved in release 5.2

JIRA query (SBForge): project = NAS AND issuetype in standardIssueTypes() AND fixVersion = 5.2 AND NOT component = Test ORDER BY priority DESC, created ASC

Known issues

JIRA query (SBForge): project = NAS AND issuetype = Bug AND affectedVersion = 5.2 ORDER BY priority DESC, cf[10010] ASC, fixVersion ASC