dk.netarkivet.harvester.harvesting
Class ArchiveFilesReportGenerator

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.ArchiveFilesReportGenerator

 class ArchiveFilesReportGenerator
extends java.lang.Object

This class generates a report that lists ARC/WARC files (depending on the configured archive format) along with the opening date, the closing date (if the file was properly closed), and the size in bytes. Here is a sample of such a file:

[ARCHIVEFILE] [Opened] [Closed] [Size]
5-1-20100720161253-00000-bnf_test.arc.gz "2010-07-20 16:12:53.698" "2010-07-20 16:14:31.792" 162928

The file is named "archivefiles-report.txt" and is generated by parsing the "heritrix.out" file located in the crawl directory. Useful lines match the following examples:

2010-07-20 16:12:53.698 INFO thread-14 org.archive.io.WriterPoolMember.createFile() Opened /somepath/jobs/current/high/5_1279642368951/arcs/5-1-20100720161253-00000.arc.gz.open

and

2010-07-20 16:14:31.792 INFO thread-29 org.archive.io.WriterPoolMember.close() Closed /somepath/jobs/current/high/5_1279642368951/arcs/5-1-20100720161253-00000-bnf_test.arc.gz, size 162928

For such messages to be written to heritrix.out, the "heritrix.properties" file must contain the following line, uncommented:

org.archive.io.arc.ARCWriter.level = INFO

Note that these strings changed between Heritrix versions 1.14.3 and 1.14.4, so they might change again in the future.
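As a rough illustration of how lines like the two examples could be matched with java.text.MessageFormat, here is a minimal sketch. The class name HeritrixOutLineSketch, the constants OPEN and CLOSE, and both pattern strings are assumptions modelled on the sample lines; they are not necessarily the values held by FILE_OPEN_FORMAT and FILE_CLOSE_FORMAT.

```java
import java.text.MessageFormat;
import java.text.ParseException;

// Illustrative sketch only; all names and patterns here are hypothetical.
class HeritrixOutLineSketch {

    // Assumed pattern for an "Opened" line: {0} timestamp, {1} thread number, {2} file path.
    static final MessageFormat OPEN = new MessageFormat(
            "{0} INFO thread-{1} org.archive.io.WriterPoolMember.createFile() Opened {2}.open");

    // Assumed pattern for a "Closed" line: as above, plus {3} size in bytes.
    static final MessageFormat CLOSE = new MessageFormat(
            "{0} INFO thread-{1} org.archive.io.WriterPoolMember.close() Closed {2}, size {3}");

    /** Returns {timestamp, thread, path}, or null if the line is not an "Opened" line. */
    static String[] parseOpen(String line) {
        return parse(OPEN, line);
    }

    /** Returns {timestamp, thread, path, size}, or null if the line is not a "Closed" line. */
    static String[] parseClose(String line) {
        return parse(CLOSE, line);
    }

    private static String[] parse(MessageFormat format, String line) {
        try {
            Object[] parts;
            synchronized (format) { // MessageFormat instances are not thread-safe
                parts = format.parse(line);
            }
            String[] values = new String[parts.length];
            for (int i = 0; i < parts.length; i++) {
                values[i] = (String) parts[i]; // unformatted arguments parse as String
            }
            return values;
        } catch (ParseException e) {
            return null; // the line does not match the pattern, so it is not useful
        }
    }
}
```

Returning null for non-matching lines lets a caller skip everything in heritrix.out that is unrelated to archive files. Because the patterns are plain literals, they would need updating if the Heritrix log strings change again.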


Nested Class Summary
(package private) static class ArchiveFilesReportGenerator.ArchiveFileStatus
          Stores the opening date, closing date and size of an ARC/WARC file.
 
Field Summary
static java.text.MessageFormat FILE_CLOSE_FORMAT
          Format used to parse and extract values from lines of heritrix.out pertaining to an ARC/WARC file closing.
static java.text.MessageFormat FILE_OPEN_FORMAT
          Format used to parse and extract values from lines of heritrix.out pertaining to an ARC/WARC file opening.
static java.lang.String REPORT_FILE_HEADER
          The header line of the report file.
static java.lang.String REPORT_FILE_NAME
          The name of the report file.
 
Constructor Summary
ArchiveFilesReportGenerator(java.io.File crawlDir)
          Builds an ARC files report generator, given the Heritrix crawl directory.
 
Method Summary
protected  java.io.File generateReport()
          Parses heritrix.out and generates the ARC/WARC files report.
protected  java.util.Map<java.lang.String,ArchiveFilesReportGenerator.ArchiveFileStatus> parseHeritrixOut()
          Parses the heritrix.out file and maps every ARC/WARC file found to an ArchiveFilesReportGenerator.ArchiveFileStatus instance.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FILE_OPEN_FORMAT

public static final java.text.MessageFormat FILE_OPEN_FORMAT
Format used to parse and extract values from lines of heritrix.out pertaining to an ARC/WARC file opening.


FILE_CLOSE_FORMAT

public static final java.text.MessageFormat FILE_CLOSE_FORMAT
Format used to parse and extract values from lines of heritrix.out pertaining to an ARC/WARC file closing.


REPORT_FILE_NAME

public static final java.lang.String REPORT_FILE_NAME
The name of the report file. It will be generated in the crawl directory.


REPORT_FILE_HEADER

public static final java.lang.String REPORT_FILE_HEADER
The header line of the report file.

Constructor Detail

ArchiveFilesReportGenerator

ArchiveFilesReportGenerator(java.io.File crawlDir)
Builds an ARC files report generator, given the Heritrix crawl directory.

Parameters:
crawlDir - the Heritrix crawl directory.
Method Detail

generateReport

protected java.io.File generateReport()
Parses heritrix.out and generates the ARC/WARC files report.

Returns:
the generated report file.
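
The report layout described for this class (a header line followed by one line per archive file) could be produced along these lines. This is a sketch under stated assumptions: ArchiveFilesReportSketch, HEADER, reportLine and report are hypothetical names, and the header text is copied from the sample report rather than from REPORT_FILE_HEADER. How a file that was never closed is rendered is not documented here, so the sketch leaves that to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the actual generateReport() implementation.
class ArchiveFilesReportSketch {

    // Header text taken from the sample report in the class description.
    static final String HEADER = "[ARCHIVEFILE] [Opened] [Closed] [Size]";

    /** Formats one entry: file name, quoted opening and closing dates, then size in bytes. */
    static String reportLine(String fileName, String opened, String closed, long size) {
        return fileName + " \"" + opened + "\" \"" + closed + "\" " + size;
    }

    /** Returns the full report content: the header followed by the given entry lines. */
    static List<String> report(List<String> entryLines) {
        List<String> lines = new ArrayList<>();
        lines.add(HEADER);
        lines.addAll(entryLines);
        return lines;
    }
}
```

The resulting list of lines could then be written to "archivefiles-report.txt" in the crawl directory with any standard file-writing call.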

parseHeritrixOut

protected java.util.Map<java.lang.String,ArchiveFilesReportGenerator.ArchiveFileStatus> parseHeritrixOut()
Parses the heritrix.out file and maps every ARC/WARC file found to an ArchiveFilesReportGenerator.ArchiveFileStatus instance.

Returns:
a map from each ARC/WARC file found to its related ArchiveFileStatus.
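
To illustrate the kind of map this method returns, the following sketch groups "Opened" and "Closed" lines by file name. It is not the actual implementation: it uses regular expressions instead of the class's MessageFormat fields to stay self-contained, and ArchiveFileStatusSketch, Status, collect and the -1 "unknown size" marker are all hypothetical. It also assumes that the open and close lines name the same file (apart from the ".open" suffix) and that paths contain no whitespace.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and patterns are assumptions.
class ArchiveFileStatusSketch {

    /** Hypothetical stand-in for ArchiveFileStatus. */
    static final class Status {
        String opened;   // null until an "Opened" line is seen
        String closed;   // stays null if the file was never properly closed
        long size = -1L; // -1 is a hypothetical "unknown size" marker
    }

    // Groups: (1) timestamp, (2) path without the ".open" suffix.
    private static final Pattern OPENED = Pattern.compile(
            "^(\\S+ \\S+) INFO \\S+ \\S+createFile\\(\\) Opened (\\S+)\\.open$");

    // Groups: (1) timestamp, (2) path, (3) size in bytes.
    private static final Pattern CLOSED = Pattern.compile(
            "^(\\S+ \\S+) INFO \\S+ \\S+close\\(\\) Closed (\\S+), size (\\d+)$");

    /** Maps each archive file name found in the lines to its status, in order of appearance. */
    static Map<String, Status> collect(List<String> lines) {
        Map<String, Status> byFile = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher open = OPENED.matcher(line);
            Matcher close = CLOSED.matcher(line);
            if (open.matches()) {
                status(byFile, open.group(2)).opened = open.group(1);
            } else if (close.matches()) {
                Status s = status(byFile, close.group(2));
                s.closed = close.group(1);
                s.size = Long.parseLong(close.group(3));
            }
            // Any other line is ignored.
        }
        return byFile;
    }

    // Looks up (or creates) the status entry keyed by the last path segment.
    private static Status status(Map<String, Status> byFile, String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return byFile.computeIfAbsent(name, k -> new Status());
    }
}
```

A file with an "Opened" line but no "Closed" line keeps a null closing date here, which corresponds to the "if file was properly closed" caveat in the class description.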