dk.netarkivet.harvester.harvesting
Class ArcFilesReportGenerator

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.ArcFilesReportGenerator

 class ArcFilesReportGenerator
extends java.lang.Object

This class generates a report that lists ARC files along with their opening date, closing date (if the file was properly closed), and size in bytes. Here is a sample of such a file:

[ARCFILE] [Opened] [Closed] [Size]
5-1-20100720161253-00000-bnf_test.arc.gz "2010-07-20 16:12:53.698" "2010-07-20 16:14:31.792" 162928

The file is named "arcfiles-report.txt" and is generated by parsing the "heritrix.out" file located in the crawl directory. Useful lines match the following examples:

2010-07-20 16:12:53.698 INFO thread-14 org.archive.io.WriterPoolMember.createFile() Opened /somepath/jobs/current/high/5_1279642368951/arcs/5-1-20100720161253-00000.arc.gz.open

and

2010-07-20 16:14:31.792 INFO thread-29 org.archive.io.WriterPoolMember.close() Closed /somepath/jobs/current/high/5_1279642368951/arcs/5-1-20100720161253-00000-bnf_test.arc.gz, size 162928

In order to have such messages output to heritrix.out, the "heritrix.properties" file must contain the following, uncommented line:

org.archive.io.arc.ARCWriter.level = INFO

Note that these strings changed between Heritrix versions 1.14.3 and 1.14.4, so they might change again in the future.
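As an illustration of how lines like the samples in the class description can be picked apart with java.text.MessageFormat, here is a minimal sketch. The class name ArcLogLineSketch, the helper tryParse and both patterns are hypothetical: they are modelled on the sample lines only, and the actual values of ARC_OPEN_FORMAT and ARC_CLOSE_FORMAT are not shown on this page and may differ.

```java
import java.text.MessageFormat;
import java.text.ParseException;

public class ArcLogLineSketch {

    // Hypothetical patterns modelled on the sample heritrix.out lines;
    // the real ARC_OPEN_FORMAT and ARC_CLOSE_FORMAT may be defined differently.
    static final MessageFormat OPEN = new MessageFormat(
        "{0} INFO thread-{1} org.archive.io.WriterPoolMember.createFile() Opened {2}");
    static final MessageFormat CLOSE = new MessageFormat(
        "{0} INFO thread-{1} org.archive.io.WriterPoolMember.close() Closed {2}, size {3}");

    /** Returns the extracted fields, or null if the line does not match the format. */
    static Object[] tryParse(MessageFormat format, String line) {
        try {
            return format.parse(line);
        } catch (ParseException e) {
            return null; // not a line of the expected kind
        }
    }

    public static void main(String[] args) {
        String openLine = "2010-07-20 16:12:53.698 INFO thread-14 "
            + "org.archive.io.WriterPoolMember.createFile() Opened "
            + "/somepath/jobs/current/high/5_1279642368951/arcs/"
            + "5-1-20100720161253-00000.arc.gz.open";
        String closeLine = "2010-07-20 16:14:31.792 INFO thread-29 "
            + "org.archive.io.WriterPoolMember.close() Closed "
            + "/somepath/jobs/current/high/5_1279642368951/arcs/"
            + "5-1-20100720161253-00000-bnf_test.arc.gz, size 162928";

        Object[] open = tryParse(OPEN, openLine);
        if (open != null) {
            // Field 0 is the timestamp, field 2 the path of the opened file.
            System.out.println("opened at " + open[0] + ": " + open[2]);
        }

        Object[] close = tryParse(CLOSE, closeLine);
        if (close != null) {
            // Untyped MessageFormat arguments are parsed as strings.
            long size = Long.parseLong((String) close[3]);
            System.out.println("closed at " + close[0] + ", " + size + " bytes");
        }
    }
}
```

Because the patterns are tied to the exact log wording, a change in the Heritrix messages would make tryParse return null for every line, which is one reason to treat unmatched lines as ignorable rather than as errors.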


Nested Class Summary
(package private) static class ArcFilesReportGenerator.ArcFileStatus
          Stores the opening date, closing date and size of an ARC file.
 
Field Summary
static java.text.MessageFormat ARC_CLOSE_FORMAT
          Format used to parse and extract values from lines of heritrix.out pertaining to an ARC file closing.
static java.text.MessageFormat ARC_OPEN_FORMAT
          Format used to parse and extract values from lines of heritrix.out pertaining to an ARC file opening.
static java.lang.String REPORT_FILE_NAME
          The name of the report file.
 
Constructor Summary
ArcFilesReportGenerator(java.io.File crawlDir)
          Builds an ARC files report generator, given the Heritrix crawl directory.
 
Method Summary
protected  java.io.File generateReport()
          Parses heritrix.out and generates the ARC files report.
protected  java.util.Map<java.lang.String,ArcFilesReportGenerator.ArcFileStatus> parseHeritrixOut()
          Parses the heritrix.out file and maps each ARC file found to an ArcFilesReportGenerator.ArcFileStatus instance.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ARC_OPEN_FORMAT

public static final java.text.MessageFormat ARC_OPEN_FORMAT
Format used to parse and extract values from lines of heritrix.out pertaining to an ARC file opening.


ARC_CLOSE_FORMAT

public static final java.text.MessageFormat ARC_CLOSE_FORMAT
Format used to parse and extract values from lines of heritrix.out pertaining to an ARC file closing.


REPORT_FILE_NAME

public static final java.lang.String REPORT_FILE_NAME
The name of the report file. It will be generated in the crawl directory.

See Also:
Constant Field Values
Constructor Detail

ArcFilesReportGenerator

ArcFilesReportGenerator(java.io.File crawlDir)
Builds an ARC files report generator, given the Heritrix crawl directory.

Parameters:
crawlDir - the Heritrix crawl directory.
Method Detail

generateReport

protected java.io.File generateReport()
Parses heritrix.out and generates the ARC files report.

Returns:
the generated report file.

parseHeritrixOut

protected java.util.Map<java.lang.String,ArcFilesReportGenerator.ArcFileStatus> parseHeritrixOut()
Parses the heritrix.out file and maps each ARC file found to an ArcFilesReportGenerator.ArcFileStatus instance.
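
The mapping step can be sketched roughly as follows. This is not the actual implementation: the class ArcStatusMapSketch, its Status type (a stand-in for ArcFilesReportGenerator.ArcFileStatus), the two simplified patterns, the choice of the bare file name as map key, and the use of an empty closing date and a size of -1 for files that were never closed are all assumptions made for illustration. The input lines are adapted from the class description and assume that the opened and closed names differ only by the ".open" suffix.

```java
import java.text.MessageFormat;
import java.text.ParseException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ArcStatusMapSketch {

    /** Hypothetical stand-in for ArcFilesReportGenerator.ArcFileStatus. */
    static class Status {
        String opened = "";
        String closed = "";  // stays empty if no close line is seen (assumption)
        long size = -1;      // -1 marks an unknown size (assumption)
    }

    // Simplified, hypothetical patterns; the real formats may differ.
    static final MessageFormat OPEN =
        new MessageFormat("{0} INFO {1} Opened {2}.open");
    static final MessageFormat CLOSE =
        new MessageFormat("{0} INFO {1} Closed {2}, size {3}");

    static Object[] tryParse(MessageFormat format, String line) {
        try {
            return format.parse(line);
        } catch (ParseException e) {
            return null; // line is not of this kind
        }
    }

    /** Strips the directory part of a path, keeping only the file name. */
    static String baseName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Builds a map from ARC file name to status, in order of first appearance. */
    static Map<String, Status> parse(List<String> lines) {
        Map<String, Status> files = new LinkedHashMap<>();
        for (String line : lines) {
            Object[] open = tryParse(OPEN, line);
            if (open != null) {
                Status s = files.computeIfAbsent(baseName((String) open[2]), k -> new Status());
                s.opened = (String) open[0];
                continue;
            }
            Object[] close = tryParse(CLOSE, line);
            if (close != null) {
                Status s = files.computeIfAbsent(baseName((String) close[2]), k -> new Status());
                s.closed = (String) close[0];
                s.size = Long.parseLong(((String) close[3]).trim());
            }
            // Any other line is ignored.
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2010-07-20 16:12:53.698 INFO thread-14 org.archive.io.WriterPoolMember.createFile() "
                + "Opened /somepath/arcs/5-1-20100720161253-00000-bnf_test.arc.gz.open",
            "2010-07-20 16:14:31.792 INFO thread-29 org.archive.io.WriterPoolMember.close() "
                + "Closed /somepath/arcs/5-1-20100720161253-00000-bnf_test.arc.gz, size 162928");

        System.out.println("[ARCFILE] [Opened] [Closed] [Size]");
        for (Map.Entry<String, Status> e : parse(lines).entrySet()) {
            Status s = e.getValue();
            System.out.println(e.getKey() + " \"" + s.opened + "\" \"" + s.closed + "\" " + s.size);
        }
    }
}
```

A LinkedHashMap is used here so that report rows come out in the order the files first appear in the log; whether the real class preserves that order is not stated on this page.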