  1. NetarchiveSuite
  2. NAS-1720 Enable WARC file writing and handling in the NetarchiveSuite
  3. NAS-1960

Extend our BatchJob framework to handle WARC-files on record level

Details

    • Rough

      This is tested by unit tests.
      Also verify that CDX is generated after a harvest.

    Description

      Currently our batch framework handles only ARC files at record level.

      The only abstract class for record-level processing is ARCBatchJob, which handles ARCRecords. It has these concrete implementations:

      • ExtractCDXJob
      • HarvestedUrlsForDomainBatchJob (also assumes that crawl.log is stored in an ARC file under the URL "metadata://netarkivet.dk/crawl/logs/crawl.log")

      ARCBatchJob could, and probably should, be generalized to handle ArchiveRecords instead of ARCRecords. I have a prototype for such a generalization in the trunk: https://sbforge.org/svn/netarchivesuite/trunk/tests/dk/netarkivet/common/utils/cdx/ArchiveBatchJob.java
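
      The idea of the generalization can be sketched as follows. This is a minimal illustration only, not the prototype linked above and not the NetarchiveSuite or Heritrix API: the names ArchiveRecordView, SimpleRecord, ArchiveBatchJobSketch and CrawlLogLinesJob are hypothetical, and records are modelled as plain in-memory objects rather than being read from real ARC or WARC files. The point is that a job written against a format-neutral record type works unchanged for both container formats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArchiveBatchSketch {

    /** Hypothetical format-neutral view of one archive record. */
    interface ArchiveRecordView {
        String format();   // "ARC" or "WARC"; informational only
        String url();
        String payload();
    }

    /** Hypothetical in-memory record standing in for a real ARC or WARC record. */
    static final class SimpleRecord implements ArchiveRecordView {
        private final String format;
        private final String url;
        private final String payload;

        SimpleRecord(String format, String url, String payload) {
            this.format = format;
            this.url = url;
            this.payload = payload;
        }

        public String format() { return format; }
        public String url() { return url; }
        public String payload() { return payload; }
    }

    /** Record-level job that never inspects the container format. */
    abstract static class ArchiveBatchJobSketch {
        /** Called once per record; results are appended to out. */
        abstract void processRecord(ArchiveRecordView record, List<String> out);

        List<String> run(List<? extends ArchiveRecordView> records) {
            List<String> out = new ArrayList<>();
            for (ArchiveRecordView record : records) {
                processRecord(record, out);
            }
            return out;
        }
    }

    /** Emits the non-empty lines of the record stored under the crawl.log URL. */
    static class CrawlLogLinesJob extends ArchiveBatchJobSketch {
        static final String CRAWL_LOG_URL =
                "metadata://netarkivet.dk/crawl/logs/crawl.log";

        @Override
        void processRecord(ArchiveRecordView record, List<String> out) {
            if (!CRAWL_LOG_URL.equals(record.url())) {
                return; // not the crawl log; skip
            }
            for (String line : record.payload().split("\n")) {
                if (!line.trim().isEmpty()) {
                    out.add(line);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Made-up sample data: one ordinary record and one crawl.log record.
        List<SimpleRecord> records = Arrays.asList(
                new SimpleRecord("ARC", "http://example.org/", "<html></html>"),
                new SimpleRecord("WARC", CrawlLogLinesJob.CRAWL_LOG_URL,
                        "first log line\nsecond log line\n"));
        for (String line : new CrawlLogLinesJob().run(records)) {
            System.out.println(line);
        }
    }
}
```

      In this sketch the job only depends on the record's URL and payload, so the same subclass could be reused once WARC records are exposed through the same interface.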

            People

              nicl@kb.dk Nicholas Clarke (Inactive)
              mss Mikis Seth Sørensen (Inactive)
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: 35h
                  Remaining: 35h
                  Logged: Not Specified