BnF migration to NAS 5 + H3 Update

Feedback from BnF.

  • installation of NAS 5.2-snapshot (development + stage/pre-production environment)
    • correcting BnF's deploy scripts (nas-deploy) for NAS 5
    • database migration (we have started preparing SQL scripts to migrate to the new database schema, but for the moment we test NAS 5 with an empty database)

  • migrating BnF developments from to
  • minor correction of date format
  • migrating host + domain profiles from the old order.xml format to the crawler-beans.cxml format
    • done by crawl engineer Sébastien Pivain-Leroy

  • correcting BnF's statistical tool for NAS (nas-qual) so that it handles both H1 and H3 report formats

  • pending: generate WARC revisit records in WARC 1.1 format
  • pending: archivefiles-report.txt is missing GMT dates and the closing date
    • JIRA
    • we can only correct the date format; the opened date cannot be retrieved
    • in dk.netarkivet.harvester.heritrix3.HarvestDocumentation, there is this comment:
      // Generate an arcfiles-report.txt if configured to do so.
      // This is not possible to extract from the crawl.log, but we will make one from just listing the files harvested by Heritrix3

      boolean genArcFilesReport = Settings.getBoolean(Heritrix3Settings.METADATA_GENERATE_ARCHIVE_FILES_REPORT);


  • pending: attempt to launch the Heritrix instance with another version of Java
    • for instance Java 9 for new implementation of for https, but keep Java 7 for NAS
    • This does not look easy to do (see classes HeritrixLauncher & Heritrix3Wrapper)
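
For the last pending item, one possible approach is to start the crawler as a child process using a different java binary than the one running NAS. The following is only a minimal sketch using the standard ProcessBuilder API. It does not reflect how HeritrixLauncher or Heritrix3Wrapper actually work; the class name ChildJvmCommand, the JDK path, the classpath and the main class shown here are all hypothetical placeholders.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NAS: builds a command line that
// runs a main class with the java binary found under a given JDK home.
class ChildJvmCommand {

    // Returns e.g. [<javaHome>/bin/java, -cp, <classpath>, <mainClass>, args...]
    static List<String> build(String javaHome, String classpath,
                              String mainClass, List<String> args) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(javaHome + File.separator + "bin" + File.separator + "java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        cmd.addAll(args);
        return cmd;
    }

    // Prepares (but does not start) the child process. JAVA_HOME is set
    // for the child only, so the parent JVM keeps its own Java version.
    static ProcessBuilder prepare(String javaHome, String classpath,
                                  String mainClass, List<String> args) {
        ProcessBuilder pb = new ProcessBuilder(build(javaHome, classpath, mainClass, args));
        pb.environment().put("JAVA_HOME", javaHome);
        pb.redirectErrorStream(true); // merge stderr into stdout
        return pb;
    }
}
```

Calling prepare("/opt/jdk9", "lib/*", "org.example.CrawlerMain", args).start() would then launch the child with the Java 9 binary while the calling process stays on Java 7. The sketch is written to compile with Java 7.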

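For the archivefiles-report.txt item above, the date-format part of the fix can be sketched as follows. This is an illustration only: the class GmtDates is hypothetical, the ISO 8601 pattern is an assumption (the format NAS expects in the report is not stated here), and using the file's last-modified time as a stand-in for the closing date is also an assumption. As noted above, the opened date cannot be recovered this way.

```java
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of NAS.
class GmtDates {

    // Formats a timestamp (milliseconds since the epoch) in GMT.
    // The pattern is an assumed ISO 8601 layout, not the confirmed report format.
    static String formatGmt(long epochMillis) {
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat fmt =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(epochMillis));
    }

    // Approximates the closing date of an archive file by its
    // last-modified time (assumption: the file is not touched after closing).
    static String closedDate(File archiveFile) {
        return formatGmt(archiveFile.lastModified());
    }
}
```

For example, GmtDates.formatGmt(0L) returns "1970-01-01T00:00:00Z" regardless of the default time zone of the machine.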

Status of the production sites