Details


      Install NetarchiveSuite with the setting "settings.harvester.harvesting.metadata.metadataFormat" set to "warc".
      Then run a single selective harvest.

      Then go to the page QA/QA-getfiles.jsp?jobid=X.
      This page now shows that a metadata WARC file has been created.
      You can view it by clicking on it (this requires the viewerproxy to be enabled). Compare it with the metadata WARC file uploaded to this issue. They should be similar.

      The page QA/QA-getreports.jsp?jobid=X also shows that the reports in the metadata WARC file are the same as before.
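
      As a rough illustration of the setting used in the first step, the sketch below turns the dotted setting name into a nested XML fragment. It assumes that each dot-separated part of the name corresponds to one nested element in the settings file; the helper name dotted_setting_to_xml is hypothetical, and a real settings file may carry a namespace and many other elements that are omitted here.

```python
import xml.etree.ElementTree as ET


def dotted_setting_to_xml(key: str, value: str) -> str:
    """Build a nested XML fragment from a dotted setting name.

    Assumption: every dot-separated part of the key maps to one
    nested XML element, and the value is the text of the innermost one.
    """
    parts = key.split(".")
    root = ET.Element(parts[0])
    node = root
    for name in parts[1:]:
        node = ET.SubElement(node, name)
    node.text = value
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


fragment = dotted_setting_to_xml(
    "settings.harvester.harvesting.metadata.metadataFormat", "warc"
)
print(fragment)
```

      The printed fragment is only meant to show where the value "warc" would sit; it would need to be merged by hand into the settings file actually used for the installation.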


    Description

      We need to define how to identify metadata records in the metadata WARC file.
      With the existing metadata ARC file, each kind of metadata (logs, reports, setup, cdx) has its own unique URI, for example:

      metadata://netarkivet.dk/crawl/setup/duplicatereductionjobs?majorversion=1&minorversion=0&harvestid=1&harvestnum=2&jobid=3
      metadata://netarkivet.dk/crawl/setup/crawl-manifest.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/setup/order.xml?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/setup/seeds.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/arcfiles-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/crawl-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/frontier-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/hosts-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/mimetype-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/responsecode-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/reports/seeds-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/logs/crawl.log?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/logs/heritrix.out?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/logs/heritrix_dmesg.log?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/logs/local-errors.log?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/logs/progress-statistics.log?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/logs/runtime-errors.log?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/logs/uri-errors.log?heritrixVersion=1.14.4&harvestid=1&jobid=3
      metadata://netarkivet.dk/crawl/index/cdx?majorversion=1&minorversion=0&harvestid=1&jobid=3&timestamp=20120329112720&serialno=00000

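      Note that heritrixVersion, harvestid and jobid are carried in the URI as query parameters, while the kind of metadata (setup, reports, logs or index) and the file name are encoded in the URI path.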
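
      To make the URI layout concrete, here is a minimal sketch of how such a metadata URI could be split into its kind, name and parameters. It uses only the Python standard library; the function name parse_metadata_uri and the returned dictionary layout are illustrative assumptions and are not part of NetarchiveSuite.

```python
from urllib.parse import urlsplit, parse_qs


def parse_metadata_uri(uri: str) -> dict:
    """Split a metadata:// URI into kind, name and query parameters.

    Assumed path layout, based on the examples in this issue:
    /crawl/<kind>/<name>, where <kind> is setup, reports, logs or index.
    """
    parts = urlsplit(uri)
    if parts.scheme != "metadata":
        raise ValueError(f"not a metadata URI: {uri}")
    segments = parts.path.strip("/").split("/")
    if len(segments) < 3 or segments[0] != "crawl":
        raise ValueError(f"unexpected metadata path: {parts.path}")
    # parse_qs returns a list per key; each key occurs once in the examples.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.netloc,
        "kind": segments[1],
        "name": "/".join(segments[2:]),
        "params": params,
    }


record = parse_metadata_uri(
    "metadata://netarkivet.dk/crawl/logs/crawl.log"
    "?heritrixVersion=1.14.4&harvestid=1&jobid=3"
)
print(record["kind"], record["name"], record["params"]["jobid"])
```

      All parameter values are kept as strings here; converting harvestid or jobid to integers is left to the caller, since the cdx and duplicatereductionjobs URIs carry a different set of parameters than the other records.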

            People

              nicl@kb.dk Nicholas Clarke (Inactive)
              svc Søren Vejrup Carlsen (Inactive)
