dk.netarkivet.harvester.harvesting
Class HeritrixFiles

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HeritrixFiles

public class HeritrixFiles
extends java.lang.Object

This class encapsulates all the files that Heritrix gets from our system, and all files we read from Heritrix.


Constructor Summary
HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob)
          Alternate constructor that by default reads the jmxPasswordFile and jmxAccessFile from the current settings.
HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob, java.io.File jmxPasswordFile, java.io.File jmxAccessFile)
          Create a new HeritrixFiles object for a job.
 
Method Summary
 void cleanUpAfterHarvest(java.io.File oldJobsDir)
          Delete statefile etc. and move crawl directory to oldjobs.
 void deleteFinalLogs()
          Helper method to delete the crawl.log and progress statistics log.
 java.lang.String getArchiveFilePrefix()
          Returns the prefix used to generate Archive files (ARC or WARC).
 java.io.File getArcsDir()
          Return the directory where Heritrix writes its arcfiles.
 java.io.File getCrawlDir()
          Returns the directory that crawls are performed inside.
 java.io.File getCrawlLog()
          Retrieve the crawlLog as a File object.
 java.io.File[] getDisposableFiles()
          Return a list of disposable heritrix-files.
 java.lang.Long getHarvestID()
          Get the harvest ID.
 java.io.File getHeritrixOutput()
          Get the file that contains output from Heritrix on stdout/stderr.
 java.io.File getIndexDir()
          Returns the index directory, if one has been set.
 java.io.File getJmxAccessFile()
          Method for retrieving the jmxremote.access file.
 java.io.File getJmxPasswordFile()
          Method for retrieving the jmxremote.password file.
 java.lang.Long getJobID()
          Get the job ID.
 java.io.File getOrderXmlFile()
          Returns the order.xml file object.
 java.io.File getProgressStatisticsLog()
          Retrieve the progress statistics log as a File object.
 java.io.File getRecoverBackupGzFile()
          Returns the recoverbackup file object.
 java.io.File getSeedsTxtFile()
          Returns the seeds.txt file object.
 java.io.File getWarcsDir()
          Return the directory where Heritrix writes its warcfiles.
 void setIndexDir(java.io.File indexDir)
          Set the deduplicate index dir.
 void writeOrderXml(org.dom4j.Document doc)
          Writes the given order.xml content to the order.xml file.
 boolean writeRecoverBackupfile(java.io.InputStream recoverlog)
          Try to write the recover-backup file.
 void writeSeedsTxt(java.lang.String seeds)
          Writes the given content to the seeds.txt file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HeritrixFiles

public HeritrixFiles(java.io.File crawlDir,
                     JobInfo harvestJob,
                     java.io.File jmxPasswordFile,
                     java.io.File jmxAccessFile)
Create a new HeritrixFiles object for a job.

Parameters:
crawlDir - The directory where the crawl files are placed. Assumes that crawlDir already exists.
harvestJob - The harvest job behind this instance of HeritrixFiles
jmxPasswordFile - The JMX password file to be used by Heritrix. The existence of this file is checked elsewhere.
jmxAccessFile - The JMX access file to be used by Heritrix. The existence of this file is checked elsewhere.
Throws:
ArgumentNotValid - if crawlDir is null, or if jobID and harvestID are non-positive.

HeritrixFiles

public HeritrixFiles(java.io.File crawlDir,
                     JobInfo harvestJob)
Alternate constructor that by default reads the jmxPasswordFile and jmxAccessFile from the current settings.

Parameters:
crawlDir - The directory where the crawl files are placed
harvestJob - The harvest job behind this instance of HeritrixFiles

Method Detail

getCrawlDir

public java.io.File getCrawlDir()
Returns the directory that crawls are performed inside.

Returns:
A directory (created as part of harvest setup) in which all of Heritrix's files live.

getArchiveFilePrefix

public java.lang.String getArchiveFilePrefix()
Returns the prefix used to generate Archive files (ARC or WARC).

Returns:
The archive file prefix, currently jobID-harvestID.
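
As an illustration of the documented "jobID-harvestID" form, the prefix can be sketched as a simple join of the two IDs. This is a minimal sketch, not the NetarchiveSuite implementation: the class ArchivePrefixSketch and its method are hypothetical stand-ins, and the real class may format the IDs differently in other versions.

```java
/** Hypothetical stand-in for illustration; not part of NetarchiveSuite. */
final class ArchivePrefixSketch {
    private ArchivePrefixSketch() {
    }

    /** Joins the job ID and harvest ID with a hyphen, as in "jobID-harvestID". */
    static String archiveFilePrefix(Long jobID, Long harvestID) {
        return jobID + "-" + harvestID;
    }

    public static void main(String[] args) {
        // Example values only; real IDs come from the harvest job.
        System.out.println(archiveFilePrefix(42L, 7L)); // prints "42-7"
    }
}
```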

getOrderXmlFile

public java.io.File getOrderXmlFile()
Returns the order.xml file object.

Returns:
A file object for the order.xml file (which may not have been written yet).

getSeedsTxtFile

public java.io.File getSeedsTxtFile()
Returns the seeds.txt file object.

Returns:
A file object for the seeds.txt file (which may not have been written yet).

getRecoverBackupGzFile

public java.io.File getRecoverBackupGzFile()
Returns the recoverbackup file object.

Returns:
A file object for the recoverbackup.gz file (which may or may not exist).

writeRecoverBackupfile

public boolean writeRecoverBackupfile(java.io.InputStream recoverlog)
Try to write the recover-backup file.

Parameters:
recoverlog - The recoverlog in the form of an InputStream
Returns:
true, if operation succeeds, otherwise false
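
The "return false instead of throwing" contract described here can be sketched with plain java.nio calls. This is only an illustration under stated assumptions: RecoverBackupSketch and its target parameter are hypothetical, the stream is copied byte for byte without any compression step, and the real method may behave differently.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical stand-in for illustration; not part of NetarchiveSuite. */
final class RecoverBackupSketch {
    private RecoverBackupSketch() {
    }

    /**
     * Copies the recover log to the target file.
     * Returns true on success and false if any I/O error occurs.
     */
    static boolean writeRecoverBackup(File target, InputStream recoverlog) {
        try {
            // Assumption: the stream is stored as-is, replacing any older copy.
            Files.copy(recoverlog, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            System.err.println("Could not write " + target + ": " + e);
            return false;
        }
    }
}
```

A caller would typically check the boolean result and decide whether the crawl can proceed without a recover file.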

writeSeedsTxt

public void writeSeedsTxt(java.lang.String seeds)
Writes the given content to the seeds.txt file.

Parameters:
seeds - The intended content of seeds.txt
Throws:
ArgumentNotValid - if seeds is null or empty
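
The validate-then-write behaviour described above might look roughly like the following. This is a sketch under assumptions, not the library code: SeedsWriterSketch is hypothetical, IllegalArgumentException stands in for NetarchiveSuite's ArgumentNotValid, and UTF-8 is an assumed encoding that the source does not specify.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical stand-in for illustration; not part of NetarchiveSuite. */
final class SeedsWriterSketch {
    private SeedsWriterSketch() {
    }

    /** Rejects null or empty content, then writes it to the given seeds file. */
    static void writeSeedsTxt(File seedsFile, String seeds) {
        if (seeds == null || seeds.isEmpty()) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("seeds must not be null or empty");
        }
        try {
            // Assumption: UTF-8; the encoding is not stated in the documentation.
            Files.write(seedsFile.toPath(), seeds.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```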

writeOrderXml

public void writeOrderXml(org.dom4j.Document doc)
Writes the given order.xml content to the order.xml file.

Parameters:
doc - The intended content of order.xml
Throws:
ArgumentNotValid - if doc is null or empty

getHeritrixOutput

public java.io.File getHeritrixOutput()
Get the file that contains output from Heritrix on stdout/stderr.

Returns:
File that contains output from Heritrix on stdout/stderr.

setIndexDir

public void setIndexDir(java.io.File indexDir)
Set the deduplicate index dir.

Parameters:
indexDir - the cache dir containing unzipped files
Throws:
ArgumentNotValid - if indexDir is not a directory or is null

getIndexDir

public java.io.File getIndexDir()
Returns the index directory, if one has been set.

Returns:
the index directory or null if no index has been set.
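
The setter/getter pair above implies a field that stays null until a valid directory is supplied. A minimal sketch of that contract, assuming a hypothetical class IndexDirSketch and using IllegalArgumentException as a stand-in for ArgumentNotValid:

```java
import java.io.File;

/** Hypothetical stand-in for illustration; not part of NetarchiveSuite. */
final class IndexDirSketch {
    /** Remains null until setIndexDir succeeds. */
    private File indexDir;

    /** Accepts only a non-null, existing directory. */
    void setIndexDir(File indexDir) {
        if (indexDir == null || !indexDir.isDirectory()) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("indexDir must be an existing directory: " + indexDir);
        }
        this.indexDir = indexDir;
    }

    /** Returns the index directory, or null if none has been set. */
    File getIndexDir() {
        return indexDir;
    }
}
```

Callers of getIndexDir() therefore need a null check before using the result.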

getDisposableFiles

public java.io.File[] getDisposableFiles()
Return a list of disposable heritrix-files. Currently the list consists of the file "state.job" and the directories "checkpoints", "state", and "scratch".

Returns:
a list of disposable heritrix-files.
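
To make the listed names concrete, the array could be built as below. This is a sketch only: DisposableFilesSketch is hypothetical, and it assumes that all four entries sit directly under the crawl directory, which the documentation does not state.

```java
import java.io.File;

/** Hypothetical stand-in for illustration; not part of NetarchiveSuite. */
final class DisposableFilesSketch {
    private DisposableFilesSketch() {
    }

    /** Builds File objects for the four documented names; nothing is created on disk. */
    static File[] disposableFiles(File crawlDir) {
        // Assumption: the entries are direct children of crawlDir.
        return new File[] {
            new File(crawlDir, "state.job"),
            new File(crawlDir, "checkpoints"),
            new File(crawlDir, "state"),
            new File(crawlDir, "scratch")
        };
    }

    public static void main(String[] args) {
        // "crawl" is an example directory name only.
        for (File f : disposableFiles(new File("crawl"))) {
            System.out.println(f.getPath());
        }
    }
}
```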

getCrawlLog

public java.io.File getCrawlLog()
Retrieve the crawlLog as a File object.

Returns:
the crawlLog as a File object.

getProgressStatisticsLog

public java.io.File getProgressStatisticsLog()
Retrieve the progress statistics log as a File object.

Returns:
the progress statistics log as a File object.

getJobID

public java.lang.Long getJobID()
Get the job ID.

Returns:
The job ID this HeritrixFiles object is for.

getHarvestID

public java.lang.Long getHarvestID()
Get the harvest ID.

Returns:
The harvest ID this HeritrixFiles object is for.

cleanUpAfterHarvest

public void cleanUpAfterHarvest(java.io.File oldJobsDir)
Delete statefile etc. and move crawl directory to oldjobs.

Parameters:
oldJobsDir - Directory to move the rest of any existing files to.
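
The two steps described here (delete the disposable entries, then move what remains) can be sketched with standard file operations. This is an illustration under assumptions rather than the actual implementation: CleanUpSketch and its parameters are hypothetical, the disposable names are passed in by the caller, and the crawl directory is assumed to be moved to a child of oldJobsDir that keeps its original name.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for illustration; not part of NetarchiveSuite. */
final class CleanUpSketch {
    private CleanUpSketch() {
    }

    /** Deletes the named entries under crawlDir, then moves crawlDir into oldJobsDir. */
    static File cleanUp(File crawlDir, File oldJobsDir, String... disposableNames) {
        for (String name : disposableNames) {
            deleteRecursively(new File(crawlDir, name));
        }
        try {
            Files.createDirectories(oldJobsDir.toPath());
            // Assumption: the moved directory keeps its name under oldJobsDir.
            Path target = oldJobsDir.toPath().resolve(crawlDir.getName());
            Files.move(crawlDir.toPath(), target);
            return target.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Removes a file or a directory tree; missing paths are ignored. */
    private static void deleteRecursively(File f) {
        File[] children = f.listFiles(); // null for plain files and missing paths
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        if (f.exists() && !f.delete()) {
            System.err.println("Could not delete " + f);
        }
    }
}
```

Note that Files.move on a non-empty directory generally only succeeds when source and target are on the same file system; a move across file systems would need a copy-and-delete step instead.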

deleteFinalLogs

public void deleteFinalLogs()
Helper method to delete the crawl.log and progress statistics log. Will log errors but otherwise continue.
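
The "log errors but otherwise continue" behaviour can be sketched as a loop that catches each failure individually. This is a minimal sketch with assumptions: FinalLogsSketch is hypothetical, the log files are passed in by the caller, and java.util.logging stands in for whatever logging framework the real class uses.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for illustration; not part of NetarchiveSuite. */
final class FinalLogsSketch {
    private static final Logger LOG = Logger.getLogger(FinalLogsSketch.class.getName());

    private FinalLogsSketch() {
    }

    /**
     * Tries to delete every given file. A failure is logged and the loop
     * moves on to the next file. Returns the number of files deleted.
     */
    static int deleteLogs(File... logs) {
        int deleted = 0;
        for (File log : logs) {
            try {
                Files.delete(log.toPath());
                deleted++;
            } catch (IOException e) {
                // Log and continue, so one failure does not block the rest.
                LOG.log(Level.WARNING, "Unable to delete " + log, e);
            }
        }
        return deleted;
    }
}
```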


getArcsDir

public java.io.File getArcsDir()
Return the directory where Heritrix writes its arcfiles.

Returns:
the directory where Heritrix writes its arcfiles.

getWarcsDir

public java.io.File getWarcsDir()
Return the directory where Heritrix writes its warcfiles.

Returns:
the directory where Heritrix writes its warcfiles.

getJmxPasswordFile

public java.io.File getJmxPasswordFile()
Method for retrieving the jmxremote.password file.

Returns:
the jmxPasswordFile.

getJmxAccessFile

public java.io.File getJmxAccessFile()
Method for retrieving the jmxremote.access file.

Returns:
the jmxAccessFile.