NetarchiveSuite now requires a Java 8 runtime for all components.
* <b>settings.archive.checksum.usePrecomputedChecksumDuringUpload</b>: This decides whether or not to use the pre-computed checksum sent as part of the StoreMessage and UploadMessage
* The default is false
public static String CHECKSUM_USE_PRECOMPUTED_CHECKSUM_DURING_UPLOAD = "settings.archive.checksum.usePrecomputedChecksumDuringUpload";
This boolean can be used to optimise the upload process to the bitarchives.
* <b>settings.common.topLevelDomains.tld</b>: <br>
* Extra valid top-level domain, like .co.uk, .dk, or .org, that is not part of the currently embedded public_suffix_list.dat file
* in common/common-core/src/main/resources/dk/netarkivet/common/utils/public_suffix_list.dat
* downloaded from https://www.publicsuffix.org/list/public_suffix_list.dat
public static String TLDS = "settings.common.topLevelDomains.tld";
* The version number which goes in metadata file names like 12345-metadata-<version number>.warc.gz
public static String METADATA_FILE_VERSION_NUMBER = "settings.harvester.harvesting.metadata.filename.versionnumber";
This parameter allows for the definition of different generations of metadata file.
* <b>settings.harvester.harvesting.metadata.compression</b> Whether to compress the
* metadata associated with a given harvest job.
* default: false
public static String METADATA_COMPRESSION = "settings.harvester.harvesting.metadata.compression";
Controls whether metadata files are generated in compressed (warc.gz) format.
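As an illustration of how the version number and compression settings could combine into a file name, here is a minimal sketch. The class and method names are hypothetical and not part of NetarchiveSuite; the naming pattern is taken from the example 12345-metadata-<version number>.warc.gz above, and it is an assumption that the uncompressed form simply drops the .gz extension.

```java
/** Hypothetical helper for illustration only; not part of the NetarchiveSuite API. */
final class MetadataFileNameSketch {

    /**
     * Builds a name of the form jobId-metadata-version.warc, with ".gz"
     * appended only when compression is enabled (assumed behaviour).
     */
    static String metadataFileName(long jobId, int versionNumber, boolean compress) {
        String base = jobId + "-metadata-" + versionNumber + ".warc";
        return compress ? base + ".gz" : base;
    }

    public static void main(String[] args) {
        // Compression switched on, as with metadata.compression = true.
        System.out.println(metadataFileName(12345L, 1, true));
        // Compression switched off (the documented default).
        System.out.println(metadataFileName(12345L, 1, false));
    }
}
```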
* Specifies the suffix of a regex which can identify valid metadata files by job number. Thus preceding
* the value of this setting with .* will find all metadata files.
public static String METADATAFILE_REGEX_SUFFIX = "settings.common.metadata.fileregexsuffix";
This parameter allows one to determine which metadata files to include in indexing (for Viewerproxy or Wayback). The full regex string to be searched consists of the string <jobid>-<harvestid> followed by this suffix. The default value is -metadata-[0-9]+.(w)?arc(.gz)? which matches all metadata files using the standard NetarchiveSuite naming scheme.
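The matching behaviour of the default suffix can be checked with a small sketch using java.util.regex. The class and method names here are hypothetical, and the file names are made-up examples; only the suffix string itself is taken from the text above. Note that the dots in the default value are unescaped, so they match any single character, not only a literal dot.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the NetarchiveSuite API. */
final class MetadataRegexSketch {

    // Default value of settings.common.metadata.fileregexsuffix as quoted above.
    static final String DEFAULT_SUFFIX = "-metadata-[0-9]+.(w)?arc(.gz)?";

    /** Prefixing the suffix with ".*" gives a pattern that matches any metadata file name. */
    static boolean isMetadataFile(String fileName) {
        return Pattern.matches(".*" + DEFAULT_SUFFIX, fileName);
    }

    /** Matches only names that consist of the given literal prefix followed by the suffix. */
    static boolean matchesWithPrefix(String prefix, String fileName) {
        return Pattern.matches(Pattern.quote(prefix) + DEFAULT_SUFFIX, fileName);
    }

    public static void main(String[] args) {
        System.out.println(isMetadataFile("12345-metadata-1.warc.gz"));
        System.out.println(isMetadataFile("12345-67-20160101120000-00000-host.warc.gz"));
        System.out.println(matchesWithPrefix("12345", "12345-metadata-1.warc"));
    }
}
```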
* <b>settings.harvester.viewerproxy.allowFileDownloads</b> If set to false, there will be no links to
* allow download of warcfiles via the Viewerproxy GUI.
public static String ALLOW_FILE_DOWNLOADS = "settings.harvester.viewerproxy.allowFileDownloads";
A simple security feature to hinder operators from easily downloading harvested archive files. (default: true)
public static String HERITRIX3_MONITOR_TEMP_PATH = "settings.harvester.harvesting.monitor.tempPath";
Path to a directory which the new Heritrix3 monitor feature can use for caching. This is empty by default, and falls back to the system-wide temporary directory (usually /tmp).
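The fallback described here can be pictured roughly as follows. This is a sketch under assumptions, not the actual NetarchiveSuite code: the class and method names are hypothetical, and it assumes the system-wide temporary directory is the one reported by the standard java.io.tmpdir system property.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration only; not part of the NetarchiveSuite API. */
final class MonitorTempPathSketch {

    /**
     * Returns the configured directory, or the JVM's temporary directory
     * when the setting is missing or empty.
     */
    static Path resolveTempPath(String configuredValue) {
        if (configuredValue == null || configuredValue.trim().isEmpty()) {
            // Assumed fallback: the java.io.tmpdir system property (often /tmp on Linux).
            return Paths.get(System.getProperty("java.io.tmpdir"));
        }
        return Paths.get(configuredValue.trim());
    }

    public static void main(String[] args) {
        System.out.println(resolveTempPath(""));
        System.out.println(resolveTempPath("/var/cache/h3monitor")); // made-up example path
    }
}
```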
Control Heritrix from NetarchiveSuite (beta)
In earlier versions of NetarchiveSuite, there was limited monitoring of running Heritrix harvests in the NetarchiveSuite GUI, but management of running jobs required opening the Heritrix3 console itself. From NetarchiveSuite 5.2, much of the Heritrix3 console functionality has been moved into NetarchiveSuite. It is now possible, from NAS itself, to:
pause, unpause or terminate running Heritrix jobs
inspect reports on running jobs
show the crawl-log of a running job, either in its entirety or filtered by regex
show and manipulate the Heritrix frontier
These extensive new features are experimental in NAS 5.2, and the developers welcome feedback, bug reports, and code patches.
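To give an idea of what filtering a crawl-log by regex amounts to, here is a minimal sketch. It is not the NAS implementation: the class and method names are hypothetical, the log lines are simplified made-up examples, and it assumes a line is kept when the regex matches anywhere in it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the NetarchiveSuite API. */
final class CrawlLogFilterSketch {

    /** Returns all lines when the regex is empty, otherwise only lines containing a match. */
    static List<String> filter(List<String> lines, String regex) {
        if (regex == null || regex.isEmpty()) {
            return new ArrayList<>(lines);
        }
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // find() matches anywhere in the line (an assumption about the filter semantics).
            if (pattern.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "200 http://example.org/ text/html",
                "404 http://example.org/missing text/html",
                "200 http://example.com/logo.png image/png");
        // Show only the lines that start with a 404 status.
        filter(log, "^404 ").forEach(System.out::println);
    }
}
```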
Top-Level Domains Can Be Defined Externally
From NAS 5.2, all ICANN-recognized domains are accepted as valid in NAS. NAS contains an embedded copy of https://publicsuffix.org/list/public_suffix_list.dat, but this may be overridden, if necessary, by placing an alternative copy at the hard-coded path conf/public_suffix_list.dat in the installation on the machine where the GUIApplication and HarvestJobManager run.
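The override rule can be sketched as a simple file check. This is an illustration under assumptions rather than the actual NAS code: the class and method names are hypothetical, and the installation directory is passed in explicitly because the text does not say how NAS locates it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of the NetarchiveSuite API. */
final class SuffixListLocationSketch {

    // Hard-coded relative path named in the text above.
    static final String OVERRIDE_PATH = "conf/public_suffix_list.dat";

    /**
     * Returns the override file if one exists under the installation directory.
     * An empty result stands for "use the embedded copy".
     */
    static Optional<Path> findOverride(Path installDir) {
        Path candidate = installDir.resolve(OVERRIDE_PATH);
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        Path installDir = Paths.get(args.length > 0 ? args[0] : ".");
        Optional<Path> override = findOverride(installDir);
        System.out.println(override.isPresent()
                ? "Using override: " + override.get()
                : "Using embedded public_suffix_list.dat");
    }
}
```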
warc.gz metadata files
NAS now supports compression of metadata files (warc.gz format) via the setting settings.harvester.harvesting.metadata.compression.
Warc Revisit Records
NAS now generates WARC revisit records when using the is.hi.bok.deduplicator.DeDuplicator deduplicator.
The web GUI now uses an embedded Tomcat, rather than Jetty, as a servlet container. This changeover should be invisible to the end user.
New Heritrix Version
NAS now uses the most recent (unofficial) Heritrix release from Kristinn Sigurðsson at the National Library of Iceland (version 3.3.0-LBS-2016-02).
The Heritrix crawl-rss extension from Kristinn Sigurðsson at the National Library of Iceland now also comes bundled with NAS, and is therefore available for use in NAS crawls. (See RSS Harvests for documentation).
The styling of the web interface has been improved.