Update on NAS latest tests and developments
Feedback on the latest work regarding NetarchiveSuite 5.6 (cookie issue)
Feedback on the latest work regarding the integration of the IIPC H3 release
BnF discovered that NAS 5.6 no longer sends cookies with any request, and that this was affecting the quality of their harvests. The bug was easily reproducible and was found to have been introduced between versions 5.4.2 and 5.5, so the Netarkivet production system is also affected.
After a considerable amount of detective work, we discovered that the bug came not from a change in NAS code or Heritrix code but from a change in one of the third-party libraries: we had somehow come to downgrade the version of Guava shipped with NAS. Simply substituting a more recent Guava version in the bundler zip makes the issue go away. What remains open:
- We don't know exactly why the build started packaging only the older Guava version.
- We don't know what it is about the older Guava version that causes this behaviour.
- Most importantly, we haven't released a patch. We should really release patches to both the 5.5 and 5.6 branches.
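To confirm which Guava version a given bundle actually ships, the zip can be inspected directly. The following is a minimal sketch, assuming the Guava jar follows the usual `guava-<version>.jar` naming somewhere inside the zip; the class name, method names and the zip path passed on the command line are hypothetical, and the layout of the bundler zip is not taken from the NAS build.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the Guava versions found in a zip bundle.
final class GuavaVersionCheck {
    // Assumption: the jar is named guava-<version>.jar, at any depth in the zip.
    private static final Pattern GUAVA_JAR =
            Pattern.compile("(?:^|/)guava-([0-9][^/]*)\\.jar$");

    // Extracts the version part from every entry name that looks like a Guava jar.
    static List<String> guavaVersions(List<String> entryNames) {
        List<String> versions = new ArrayList<>();
        for (String name : entryNames) {
            Matcher m = GUAVA_JAR.matcher(name);
            if (m.find()) {
                versions.add(m.group(1));
            }
        }
        return versions;
    }

    // Reads all entry names from the zip file at the given path.
    static List<String> entryNames(String zipPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the bundle to inspect (supplied by the caller).
        for (String version : guavaVersions(entryNames(args[0]))) {
            System.out.println("guava " + version);
        }
    }
}
```

Running this against the 5.4.2, 5.5 and 5.6 bundles would show at which release the packaged version changed, and a check of this kind could be added to the release procedure to catch similar downgrades.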
Colin and Clara have spent some time analysing the NAS modifications to Heritrix to see what would be needed to get a community version of Heritrix that we could use in NAS. There seem to be three changes we would want included:
- Adding a timeout to the crawler-trap regex test
- Adding a filter to prevent inline images being interpreted as links
- Modifying the frontier to add additional methods to browse the queued URLs (including upgrading Berkeley DB - for which Andy Jackson already has a pull request).
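As an illustration of the first change, one common way to put a time limit on a Java regex match is to wrap the input in a `CharSequence` that checks a deadline on every character access. This is only a sketch of that general technique, not the code in the NAS or Heritrix branches; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: regex matching with a time budget.
final class TimedRegex {
    /** Thrown when a match exceeds its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** Wraps the input and aborts matching once the deadline has passed. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads the input through charAt, so a runaway
            // match is interrupted here once the deadline has passed.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns whether the whole input matches, or throws RegexTimeoutException. */
    static boolean matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
    }

    public static void main(String[] args) {
        // Example URL and pattern are made up for illustration.
        Pattern trap = Pattern.compile(".*/calendar/.*/calendar/.*");
        System.out.println(matchesWithin(trap, "http://example.org/calendar/a/calendar/b", 1000));
    }
}
```

The caller decides what a timeout means for the URL in question (for example, treating it as a suspected trap or logging and skipping the test). The advantage of this approach over running the match in a separate thread is that no thread needs to be interrupted or abandoned.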
Colin has reimplemented each of these in separate branches on top of the current IIPC/IA master branch (including Andy's pull request for the third change), and created a fourth branch (https://github.com/netarchivesuite/heritrix3/tree/h3.4-merge) in which all three modifications are merged. There is also a NetarchiveSuite branch (https://github.com/netarchivesuite/netarchivesuite/tree/h3.4) which can be built against this Heritrix version once it has been installed locally with Maven. What we need to do now is:
- Basic functional testing of nas/h3.4 (i.e. the release candidate for NAS 5.7)
- More extensive (acceptance) testing of nas/h3.4
- Make a series of pull requests to try to get our code into the main Heritrix repository.
Even if we aren't able to get our pull requests accepted quickly, we should still base future releases on this work, as we would then have a Heritrix version very close to the community version, making it easy to pull in future upstream changes.
Status of the production sites