5.4.1 Release Date: 2018-05-28
BugFix Release 5.4.2 (pending)
This release addresses issue
- A new QuotaEnforcer implementation, dk.netarkivet.harvester.harvesting.PrerequisiteIgnoringQuotaEnforcer, which can be used in a crawler-bean harvest template and which never enforces harvesting quotas on prerequisite URLs (typically DNS lookups and robots.txt), and
- An alteration to SeedUriDomainnameQueueAssignmentPolicy to ensure that DNS queries are placed on the same queue as other URLs for the same seed. This appears to work around an undocumented race condition in Heritrix which was causing many crawl failures.
BugFix Release 5.4.1
NAS 5.4.1 is a bug-fix release addressing some issues found during the Acceptance Test phase of NAS 5.4. The issues addressed are:
- A memory leak introduced by a new feature in NAS 5.4 (NAS-2614) to manage the number of jobs on the JMS queues
- An error in the functionality for searching/browsing in the frontier of running jobs
- Introduction of a new setting (settings.harvester.indexserver.tryToMigrateDuplicationRecords), a switch to disable new functionality associated with the Danish netarchive's project to compress their archive. This functionality caused an unnecessary slowdown in indexing, but is now disabled by default.
The functionality for browsing in the Heritrix frontier is still somewhat experimental and is in need of a usability overhaul. This is a priority for a future release.
Highlights in 5.4
- NetarchiveSuite now ships with a customised version of Heritrix 3, forked from the version maintained by Kristinn Sigurdsson.
- The integration between the NetarchiveSuite Web interface and Heritrix 3 has been much improved, both in regard to scaling and usability.
- There is significant improvement to the job generation algorithm, so that the production of spurious duplicate jobs is now largely eliminated.
- Support for Heritrix 1 has now been removed from the distribution.
- You can now limit how many jobs are submitted to each job channel simultaneously by setting settings.harvester.scheduler.limitSubmittedJobsInQueue to true. When this is enabled, the default limit is one job at a time; you can change it by overriding settings.harvester.scheduler.submittedJobsInQueueLimit. The latter setting is ignored if limitSubmittedJobsInQueue is false, which is the default.
- The setting settings.harvester.scheduler.jobgenerationperiode has been renamed to settings.harvester.scheduler.jobgenerationperiod (the default value is still 60, i.e. 1 minute).
- A new setting, settings.webinterface.runningjobsFilteringMethod, selects the filtering method used on History/Harveststatus-running.jsp (default: database; alternative: cachedLogs).
Upgrading from previous releases of NetarchiveSuite
Upgrading the database: After finishing the installation of NetarchiveSuite and starting it for the first time, go to the server where GUIApplication and HarvestJobManager are installed and run:
cd NAS_INSTALLDIR/conf
bash update_external_harvest_database.sh
Please examine the INSTALLDIR/update_external_harvest_database.log for any errors.
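The update-and-check step above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names run_update and check_update_log are hypothetical helpers, INSTALLDIR is taken to be the same directory as NAS_INSTALLDIR, and the pattern used to spot problems (the words "error" or "exception", in any case) is a guess about what the log contains rather than a documented format.

```shell
#!/usr/bin/env bash
# Sketch: run the database update script, then scan its log for problems.
# Assumptions (not from the release notes): the helper names, the idea that
# the log sits directly under the install directory passed in, and the
# "error|exception" search pattern.

# Return 0 if the log exists and contains no suspicious lines, 1 otherwise.
check_update_log() {
    local log_file="$1"
    if [ ! -f "$log_file" ]; then
        echo "Log file not found: $log_file" >&2
        return 1
    fi
    # -i: ignore case, -n: print line numbers, -E: extended regex.
    # grep exits 0 when it finds a match, which here means a problem.
    if grep -inE 'error|exception' "$log_file"; then
        return 1
    fi
    return 0
}

# Run the update script from the conf directory, then check the log.
run_update() {
    local install_dir="$1"
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$install_dir/conf" && bash update_external_harvest_database.sh ) || return 1
    check_update_log "$install_dir/update_external_harvest_database.log"
}
```

After sourcing these functions, a call such as `run_update "$NAS_INSTALLDIR"` would print any matching log lines and return a non-zero status if the script failed or the log looks suspicious. A clean result from this check is not a substitute for reading the log itself.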
Issues resolved in release 5.4