...

NetarchiveSuite 5 / Heritrix 3 is now used in production by almost all community members. A demo of the latest features and a discussion of the migration challenges gave BNE precise insight into how to move from NAS4 to NAS5. We all gratefully thanked (and thank again) the Netarkivet team for supporting the development effort and defining the first H3 template files, which we all used as a starting point. KB estimates that H3 integration took about two developer-years, plus one more year from a technical expert to redefine all the templates. BnF contributed to the NAS 5.3 release (mainly the H3 Remote Access feature).

The next two releases were discussed:

NAS 5.3.1: https://sbforge.org/jira/projects/NAS/versions/12945 (end of May, lead DK)
It should be a bug-fix-only release to resolve the issues Netarkivet encountered in its last broad crawl (job generation, H3 crawllog caching, and byte limits being reached with the seedQueueAssignementPolicy). This release will be used to launch the next DK broad crawl in June. BnF will use it in July to run its first broad crawl tests and as a basis for the 5.4 developments. The DK development team asked for help with the release tests.

NAS 5.4: https://sbforge.org/jira/projects/NAS/versions/12944 (beginning of September, lead BnF)
This release will include fixes from BnF (marked as P1 in this document https://sbforge.org/download/attachments/23101500/BnF-NASbugsandfeatures-april2017.docx?version=1&modificationDate=1493134139845&api=v2).
Two features will be discussed within the community:

  1. How to improve the design of the H3 job page (it contains too much information and is not easy to use)

...

  2. How to improve the H3 caching feature. The first points raised during the discussion: there is no need to cache entire crawllogs from all running jobs, as this is both infeasible (in terms of performance) and unnecessary (only the latest lines are useful for making decisions); it would therefore be useful to be able to configure the number of lines to extract.

BnF will carry most of the development effort but will need participation from Nicholas to work on the H3 crawllog feature.
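The caching approach discussed above, keeping only the newest lines of a crawllog with a configurable count, can be sketched as a simple bounded tail. This is an illustrative sketch only; the function name and parameters are hypothetical and not part of the NAS or Heritrix API:

```python
from collections import deque

def tail_lines(path, n=500, encoding="utf-8"):
    """Return the last n lines of a (possibly very large) log file.

    Caching an entire crawllog from a long-running job is expensive;
    a deque with maxlen keeps only the newest n lines as the file is
    streamed, so memory use is bounded regardless of log size.
    """
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return list(deque(f, maxlen=n))
```

For example, `tail_lines("crawl.log", n=200)` would return only the 200 most recent entries, which is the kind of configurable limit suggested in the discussion.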

Procedures for contributing to the NAS code were discussed again to ease the integration and test processes at all levels. All community developers should work directly in the netarchivesuite repository. We should then create a staging branch from master (e.g. bnf-staging) to include the features and bug fixes described in Jira. Colin reminded everyone that there is documentation for newcomers: https://sbforge.org/display/NAS/Development. It is important to keep new code consistent with the unit tests (access to Jenkins and how to use it will be discussed in the next development phase).

NAS 6.0 (no date, no lead)
We discussed what we would like to include in the next major release:

  • introduction of a login system,

...

  • structuring of crawl documentation and traceability of user actions,

...

  • validation of new and existing seeds,

...

  • easy management of TLDs.

Above all, the most important feature would be to make NAS work with an additional crawler, to improve the harvesting of JavaScript-heavy websites and streamline the harvesting of video platforms (Netarkivet has successfully tested Internet Archive's Brozzler: https://github.com/internetarchive/brozzler). There are several options for this:

  • Option 1: make NAS completely modular and extensible to other, even not-yet-existing, crawlers. This would require a complete refactoring of the NAS code (the introduction of H3 offered a basis but is clearly not enough) and the definition of crawler APIs (the WASAPI group is currently defining data-transfer APIs: https://github.com/WASAPI-Community/data-transfer-apis, but to our knowledge there is no existing crawler API).

...

  • Option 2: use Umbra (https://github.com/internetarchive/umbra) or an Umbra-like messaging system to keep Heritrix as our main harvester, with complementary tools (such as Chromium, PhantomJS or youtube-dl) identifying and extracting complex URLs and feeding them back to Heritrix.

Option 1 is more ambitious and more satisfactory; option 2 would also require significant development. All members need to check with their institutions whether resources could be allocated to this topic in 2018. In the meantime, members will keep testing new tools in small time slots and sharing the results.

...

In 2017, the community members, both developers and curators, should focus on the following:

  • work collaboratively on the features and bug fixes included in NAS 5.3.1 and NAS 5.4, and participate in the release tests,

...

  • keep updating the curator roadmap and make it a discussion tool for exchanging needs and ideas on the different features,

...

  • discuss NAS direction regarding the integration of other crawlers,

...

  • move forward in making BCweb an open-source project,

...