Agenda for the joint NetarchiveSuite tele-conference 2019-12-03, 13:00-14:00.
- BNF: Clara, Sara, Géraldine
- ONB: -
- KB/DK - Copenhagen: Tue, Stephen, Anders, Kristian
- KB/DK - Aarhus: Colin, Sabine, Knud Åge
- BNE: Alicia, María, Manuel, José
- KB/Sweden: Par, Thomas, Peter
Join from PC, Mac, Linux, iOS or Android:
Or an H.323/SIP room system:
Meeting ID: 104 443 571
Or Skype for Business (Lync):
Denmark: +45 89 88 37 88 or +45 32 71 31 57
United Kingdom: +44 203 051 2874 or +44 203 481 5237 or +44 203 966 3809 or +44 131 460 1196
Finland: +358 9 4245 1488 or +358 3 4109 2129
Sweden: +46 850 539 728 or +46 8 4468 2488
Norway: +47 7349 4877 or +47 2396 0588
US: +1 669 900 6833 or +1 646 558 8656
Meeting ID: 104 443 571
International numbers available: https://zoom.us/u/acRu0MV3xJ
You can join a meeting using the app on a PC, a tablet or a smartphone, or you can use the browser-based version (it works with newer versions of Chrome or Firefox).
Update on NAS latest tests and developments
BnF discovered that NAS 5.6 no longer sends cookies with any requests, and this was affecting the quality of their harvests. The bug was easily reproducible and was found to have been introduced between versions 5.4.2 and 5.5, so the Netarkivet production system is also affected.
After a considerable amount of detective work, we discovered that the bug came not from a change in NAS code or Heritrix code but from a change in one of the third-party libraries: we had somehow come to downgrade the version of guava shipped with NAS. Simply substituting a more recent guava version in the bundler zip makes the issue go away. What remains open:
- We don't know exactly why the build started packaging only the older guava version
- We don't know what it is in the older guava version that causes this behaviour, and, most importantly,
- We haven't released a patch. We should really release patches to both 5.5 and 5.6 branches.
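As a quick check before and after patching, the guava version(s) packaged in a bundler zip can be listed with a short script. This is only a minimal sketch: the function name `find_guava_jars` is hypothetical, and it assumes the jars are stored directly as entries in the zip under the usual Maven naming (`guava-<version>.jar`), which has not been verified against the actual bundler layout.

```python
import re
import zipfile

# Maven-style artifact name, e.g. "lib/guava-28.1-jre.jar".
# Assumption: jars sit directly in the zip, not inside a nested archive.
GUAVA_JAR = re.compile(r"(?:^|/)guava-(\d[^/]*)\.jar$")


def find_guava_jars(zip_path):
    """Return a sorted list of (entry name, version string) for each guava jar."""
    found = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            match = GUAVA_JAR.search(name)
            if match:
                found.append((name, match.group(1)))
    return sorted(found)
```

Running it on the bundler zip from a 5.4.2 build and from a 5.5 build should show where the packaged version changed; more than one entry in the result would indicate that two guava versions are being shipped side by side.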
Colin and Clara have spent some time analysing the NAS modifications to Heritrix to see what would be needed to get a community version of Heritrix that we could use in NAS. There seem to be three changes we would want included:
- Adding a timeout to the crawler-trap regex test
- Adding a filter to prevent inline images being interpreted as links
- Modifying the frontier to add additional methods to browse the queued URLs (including upgrading Berkeley DB - for which Andy Jackson already has a pull request).
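To illustrate the idea behind the second change in the list above, here is a minimal sketch of such a filter. It assumes that "inline images" means `data:` URIs embedded in the page; the function names are hypothetical, and this is not the actual NAS/Heritrix code, which is written in Java.

```python
def is_inline_data_uri(candidate):
    """True if the candidate link is a data: URI (embedded content, nothing to fetch)."""
    return candidate.strip().lower().startswith("data:")


def drop_inline_data_uris(candidates):
    """Keep only candidates that are not data: URIs, preserving their order."""
    return [c for c in candidates if not is_inline_data_uri(c)]
```

Applied to the links extracted from a page, this would discard the embedded images before they are queued, so they never reach the frontier as if they were URLs to harvest.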
Colin has reimplemented each of these in separate branches on top of the current IIPC/IA master branch (including Andy's pull request for the third change), and created a fourth branch (https://github.com/netarchivesuite/heritrix3/tree/h3.4-merge) in which all three modifications are merged. There is also a NetarchiveSuite branch (https://github.com/netarchivesuite/netarchivesuite/tree/h3.4) which can be built against this Heritrix once it has been installed locally with Maven. What we need to do now is:
- Basic functional testing of nas/h3.4 (i.e. the release candidate for NAS 5.7)
- More extensive (acceptance) testing of nas/h3.4
- Make a series of pull requests to try to get our code into the main Heritrix repository.
Even if we aren't able to get our pull requests accepted quickly, we should still base future releases on this work, as we would then have a Heritrix version very close to the community version, making it easy to pull in future upstream changes.
Status of the production sites
- December 3
- January 7, 2020
Any other business?