This release addresses an issue which resulted in many URLs receiving crawl-status code -50 in some harvests. It is only relevant for users of SeedUriDomainnameQueueAssignmentPolicy. The fix is in two parts:
A new QuotaEnforcer implementation, dk.netarkivet.harvester.harvesting.PrerequisiteIgnoringQuotaEnforcer, which can be used in a crawler-bean harvest template and which never enforces harvesting quotas on prerequisite URLs (typically DNS lookups and robots.txt), and
An alteration to SeedUriDomainnameQueueAssignmentPolicy to ensure that DNS queries are queued on the same queue as other URLs for the same seed. This appears to work around an undocumented race condition in Heritrix which was causing many crawl failures.
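As an illustration of the first part, the new enforcer could be enabled by changing the class of the quota enforcer bean in the harvest template. The fragment below is a sketch only: the bean id `quotaenforcer` is an assumption based on the default Heritrix 3 profile, and your template may use a different id and will carry its own quota properties.

```xml
<!-- Fragment of a crawler-beans harvest template (Spring XML). -->
<!-- Assumption: the template's quota enforcer bean has the id "quotaenforcer". -->
<bean id="quotaenforcer"
      class="dk.netarkivet.harvester.harvesting.PrerequisiteIgnoringQuotaEnforcer">
  <!-- Keep the quota properties from the template's existing quota enforcer bean here. -->
</bean>
```

Only the class name is taken from these release notes; everything else should be carried over unchanged from the template you already use.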
BugFix Release 5.4.1
NAS 5.4.1 is a bug-fix release addressing some issues found during the Acceptance Test phase of NAS 5.4. The issues addressed are:
A memory leak introduced by a new feature in NAS 5.4 to manage the number of jobs on the JMS queues, and
An error in the functionality for searching/browsing in the frontier of running jobs
Introduction of a new setting (settings.harvester.indexserver.tryToMigrateDuplicationRecords), a switch to disable new functionality associated with the Danish netarchive's project to compress its archive. This functionality caused an unnecessary slowdown in indexing, and is now disabled by default.
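For reference, the switch could be set explicitly in a settings file as sketched below. This assumes the usual NetarchiveSuite convention that a dotted setting name maps to nested XML elements, and that the value `false` corresponds to the "disabled" default described above; namespace declarations are omitted.

```xml
<!-- Sketch of a settings file fragment; element nesting is inferred from the dotted setting name. -->
<settings>
  <harvester>
    <indexserver>
      <!-- Assumption: false keeps the migration functionality switched off (the stated default). -->
      <tryToMigrateDuplicationRecords>false</tryToMigrateDuplicationRecords>
    </indexserver>
  </harvester>
</settings>
```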
The functionality for browsing in the Heritrix frontier is still somewhat experimental and is in need of a usability overhaul. This is a priority for a future release.
NetarchiveSuite now ships with a customised version of Heritrix 3, forked from the version maintained by Kristinn Sigurdsson at the National Library of Iceland.
The integration between the NetarchiveSuite Web interface and Heritrix 3 has been much improved, both in regard to scaling and usability.
There is significant improvement to the job generation algorithm, so that the production of spurious duplicate jobs is now largely eliminated.
Support for Heritrix 1 has now been removed from the distribution.
You can now limit how many jobs are submitted to each jobchannel simultaneously by setting settings.harvester.scheduler.limitSubmittedJobsInQueue to true. When the limit is enabled, the default is one job at a time; you can change this by overriding settings.harvester.scheduler.submittedJobsInQueueLimit. The latter setting is ignored if limitSubmittedJobsInQueue is false, which is the default.
The setting settings.harvester.scheduler.jobgenerationperiode has been renamed settings.harvester.scheduler.jobgenerationperiod (the default value is still 60, i.e. 1 minute).
A new setting, settings.webinterface.runningjobsFilteringMethod, chooses between filtering methods on History/Harveststatus-running.jsp (default: database; alternative: cachedLogs).
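The scheduler and web interface settings named above might be combined in a settings file roughly as follows. This is a sketch under assumptions: the element nesting is inferred from the dotted setting names, namespace declarations are omitted, and the limit of 2 is an arbitrary illustrative value.

```xml
<!-- Sketch of a settings file fragment; adapt to your own installation. -->
<settings>
  <harvester>
    <scheduler>
      <!-- Enable the per-jobchannel limit (default: false). -->
      <limitSubmittedJobsInQueue>true</limitSubmittedJobsInQueue>
      <!-- Illustrative value; the default when the limit is enabled is 1. -->
      <submittedJobsInQueueLimit>2</submittedJobsInQueueLimit>
      <!-- Renamed from jobgenerationperiode; 60 is the default. -->
      <jobgenerationperiod>60</jobgenerationperiod>
    </scheduler>
  </harvester>
  <webinterface>
    <!-- Either "database" (default) or "cachedLogs". -->
    <runningjobsFilteringMethod>cachedLogs</runningjobsFilteringMethod>
  </webinterface>
</settings>
```

If an existing settings file still uses the old element name jobgenerationperiode, it would need to be renamed for the value to take effect.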
Upgrading from previous releases of NetarchiveSuite
Upgrading the database: After finishing the installation of NetarchiveSuite and starting it for the first time, please go to the server where GUIApplication and HarvestJobManager are installed and run:
Please examine the INSTALLDIR/update_external_harvest_database.log for any errors.