Summary of H3 curator discussion
The discussions focused on the current NAS and Heritrix features used by curators to monitor and QA crawls.
- important figures for following crawl progress are included in the running jobs page. On the Heritrix console, curators also look at the job status, the number of active threads, the progress bar and the percentage.
- important Heritrix logs/reports used by curators are:
  - crawl-report: BnF
  - seed-report: BnF, KB, ONB
  - host-report: BnF, KB
  - source-report: BnF
  - mimetype-report: BnF, KB
  - responsecode-report: BnF
  - frontier-report: BnF, KB/SB, BNE, EST, ONB
  - crawl-log: BnF, KB/SB, BNE, EST, ONB
  - toe-threads-report: KB
  - order.xml: BnF (KB views it in NAS)
- important Heritrix features used by curators are:
  - search the crawl-log and the frontier using regular expressions: BnF, KB
  - view/search/delete URLs from the frontier using regular expressions: KB, BnF
  - see/add new filters to exclude URLs in the crawl settings and the frontier: BnF
- documentation on response codes and regular expressions is useful to KB curators.
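
As an illustration of the regular-expression searches on the crawl-log mentioned above, here is a minimal Python sketch. It assumes the usual whitespace-separated crawl.log layout, with the timestamp, fetch status code, size and URI as the first four fields. The function name `search_crawl_log`, the `CrawlLogEntry` type and the sample lines are hypothetical and were not part of the discussion.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class CrawlLogEntry(NamedTuple):
    """First four fields of a crawl.log line (assumed layout)."""
    timestamp: str
    status: int
    size: str
    uri: str


def search_crawl_log(lines: Iterable[str], pattern: str) -> Iterator[CrawlLogEntry]:
    """Yield the entries whose URI matches the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])
        except ValueError:
            continue  # skip lines whose second field is not a status code
        if regex.search(fields[3]):
            yield CrawlLogEntry(fields[0], status, fields[2], fields[3])


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    "2014-05-12T09:15:02.123Z 200 5120 http://example.org/index.html L http://example.org/ text/html",
    "2014-05-12T09:15:03.456Z 404 312 http://example.org/calendar?month=13 LL http://example.org/index.html text/html",
    "2014-05-12T09:15:04.789Z -9998 - http://example.org/private/ L http://example.org/ unknown",
]

if __name__ == "__main__":
    for entry in search_crawl_log(SAMPLE_LOG, r"calendar\?"):
        print(entry.status, entry.uri)
```

The same pattern syntax could be reused when searching the frontier or writing exclusion filters, although Heritrix itself uses Java regular expressions, whose syntax differs from Python's in some details.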
Summary of H3 coding discussion
Mikis and Soren presented the new code structure. The code base needs to be fixed before it can receive contributions. Soren and Nicholas will also break the structure down further to see which parts could be developed by other institutions.
Summary of WARC discussion
ISO has opened a revision process, which makes it possible to propose adaptations. Tue wondered about the differences between WARC files produced with Archive-It and those produced with NAS. The conclusions are:
- warcinfo record: harvest description produced by NAS is more structured, no changes needed.
- request and metadata record: it should be possible to configure templates to generate request and/or metadata records.
- revisit record: NAS should generate revisit records; deduplication information is currently stored only in log files. This point is the most important. There is a proposal within the IIPC to add the WARC-Target-URI and the WARC-Date of the previously harvested document, to facilitate the indexing or any other processing of this information. As the proposal is officially supported by the IIPC, we could include these fields in NAS from now on; the change of Heritrix version is a good opportunity to change the format.
There is nothing urgent for the 5.0 release, but the format has to be consistent with previous H1 releases.
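
To make the revisit-record point concrete, here is a minimal Python sketch of what such a record could look like. The notes do not name the header fields for the previously harvested document; the sketch assumes `WARC-Refers-To-Target-URI` and `WARC-Refers-To-Date`, as defined in WARC 1.1, together with the WARC 1.0 identical-payload-digest profile. The function `build_revisit_record` is hypothetical and is not NAS or Heritrix code, and the empty record block is a simplification.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

# Revisit profile URI for a payload identical to an earlier capture (WARC 1.0).
REVISIT_PROFILE = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"


def build_revisit_record(
    target_uri: str,
    payload_digest: str,
    original_uri: str,
    original_date: str,
    record_date: Optional[str] = None,
    record_id: Optional[str] = None,
) -> bytes:
    """Serialize a revisit record with an empty block (illustrative only)."""
    if record_date is None:
        record_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    if record_id is None:
        record_id = f"<urn:uuid:{uuid.uuid4()}>"
    headers = [
        ("WARC-Type", "revisit"),
        ("WARC-Record-ID", record_id),
        ("WARC-Date", record_date),
        ("WARC-Target-URI", target_uri),
        ("WARC-Profile", REVISIT_PROFILE),
        ("WARC-Payload-Digest", payload_digest),
        # Assumed field names pointing at the previously harvested document.
        ("WARC-Refers-To-Target-URI", original_uri),
        ("WARC-Refers-To-Date", original_date),
        ("Content-Length", "0"),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers)
    # A blank line ends the header; two CRLFs end the record.
    return (head + "\r\n" + "\r\n\r\n").encode("utf-8")


if __name__ == "__main__":
    record = build_revisit_record(
        target_uri="http://example.org/logo.png",
        payload_digest="sha1:EXAMPLEDIGESTEXAMPLEDIGESTEXAMPLE",  # placeholder value
        original_uri="http://example.org/logo.png",
        original_date="2013-11-02T10:00:00Z",
    )
    print(record.decode("utf-8"))
```

With both fields present in the record itself, an indexer could resolve a duplicate to its original capture without consulting the crawl logs.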