dk.netarkivet.harvester.harvesting
Class HeritrixFiles

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.HeritrixFiles

public class HeritrixFiles
extends java.lang.Object

This class encapsulates all the files that Heritrix gets from our system, and all files we read from Heritrix.


Constructor Summary
HeritrixFiles(java.io.File crawlDir, long jobID, long harvestID)
          Alternate constructor that reads the jmxPasswordFile and jmxAccessFile from the current settings.
HeritrixFiles(java.io.File crawlDir, long jobID, long harvestID, java.io.File jmxPasswordFile, java.io.File jmxAccessFile)
          Create a new HeritrixFiles object for a job.
 
Method Summary
 void cleanUpAfterHarvest(java.io.File oldJobsDir)
          Delete statefile etc. and move crawl directory to oldjobs.
 void deleteFinalLogs()
          Helper method to delete the crawl.log and progress statistics log.
 java.lang.String getArcFilePrefix()
          Returns the prefix used to generate ARC files.
 java.io.File getArcsDir()
          Return the directory where Heritrix writes its ARC files.
 java.io.File getCrawlDir()
          Returns the directory that crawls are performed inside.
 java.io.File getCrawlLog()
          Retrieve the crawlLog as a File object.
 java.io.File[] getDisposableFiles()
          Return an array of disposable Heritrix files.
 java.lang.Long getHarvestID()
          Get the harvest ID.
 java.io.File getHeritrixOutput()
          Get the file that contains output from Heritrix on stdout/stderr.
 java.io.File getIndexDir()
          Returns the index directory, if one has been set.
 java.io.File getJmxAccessFile()
          Method for retrieving the jmxremote.access file.
 java.io.File getJmxPasswordFile()
          Method for retrieving the jmxremote.password file.
 java.lang.Long getJobID()
          Get the job ID.
 java.io.File getOrderXmlFile()
          Returns the order.xml file object.
 java.io.File getProgressStatisticsLog()
          Retrieve the progress statistics log as a File object.
 java.io.File getSeedsTxtFile()
          Returns the seeds.txt file object.
 void setIndexDir(java.io.File indexDir)
          Set the deduplication index directory.
 void writeOrderXml(org.dom4j.Document doc)
          Writes the given order.xml content to the order.xml file.
 void writeSeedsTxt(java.lang.String seeds)
          Writes the given content to the seeds.txt file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HeritrixFiles

public HeritrixFiles(java.io.File crawlDir,
                     long jobID,
                     long harvestID,
                     java.io.File jmxPasswordFile,
                     java.io.File jmxAccessFile)
Create a new HeritrixFiles object for a job.

Parameters:
crawlDir - The directory where the crawl files are placed. It is assumed that crawlDir already exists.
jobID - The JobID of this crawl.
harvestID - The harvestID of this crawl.
jmxPasswordFile - The JMX password file to be used by Heritrix. The existence of this file is checked elsewhere.
jmxAccessFile - The JMX access file to be used by Heritrix. The existence of this file is checked elsewhere.
Throws:
ArgumentNotValid - if crawlDir is null, or if jobID or harvestID is not positive.
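
The argument checks described above can be illustrated with a minimal sketch. It is not the NetarchiveSuite implementation: CrawlArgsSketch and its check method are hypothetical names, IllegalArgumentException stands in for ArgumentNotValid, and the sketch assumes that each ID must be positive on its own.

```java
import java.io.File;

// Hypothetical stand-in for the constructor's argument checks; not part of NetarchiveSuite.
final class CrawlArgsSketch {
    private CrawlArgsSketch() {
    }

    /**
     * Rejects a null crawlDir and non-positive IDs.
     * IllegalArgumentException is used here in place of ArgumentNotValid.
     */
    static void check(File crawlDir, long jobID, long harvestID) {
        if (crawlDir == null) {
            throw new IllegalArgumentException("crawlDir must not be null");
        }
        if (jobID <= 0) {
            throw new IllegalArgumentException("jobID must be positive, was " + jobID);
        }
        if (harvestID <= 0) {
            throw new IllegalArgumentException("harvestID must be positive, was " + harvestID);
        }
    }
}
```

Note that the sketch, like the documented constructor, does not verify that crawlDir exists on disk.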

HeritrixFiles

public HeritrixFiles(java.io.File crawlDir,
                     long jobID,
                     long harvestID)
Alternate constructor that reads the jmxPasswordFile and jmxAccessFile from the current settings.

Parameters:
crawlDir - The directory where the crawl files are placed.
jobID - The JobID of this crawl.
harvestID - The harvestID of this crawl.

Method Detail

getCrawlDir

public java.io.File getCrawlDir()
Returns the directory that crawls are performed inside.

Returns:
A directory, created as part of harvest setup, in which all of Heritrix's files live.

getArcFilePrefix

public java.lang.String getArcFilePrefix()
Returns the prefix used to generate ARC files.

Returns:
The ARC file prefix, currently jobID-harvestID.
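
As an illustration of the documented "jobID-harvestID" format, a minimal sketch follows. ArcPrefixSketch and arcFilePrefix are hypothetical names, not the library's code, and the sketch assumes the two IDs are joined by a single hyphen with no padding.

```java
// Hypothetical illustration of the documented prefix format; not part of NetarchiveSuite.
final class ArcPrefixSketch {
    private ArcPrefixSketch() {
    }

    /** Joins the job ID and harvest ID with a hyphen, e.g. job 42 in harvest 7. */
    static String arcFilePrefix(long jobID, long harvestID) {
        return jobID + "-" + harvestID;
    }
}
```

Because the documentation says "currently", callers should treat the prefix as opaque rather than parse it.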

getOrderXmlFile

public java.io.File getOrderXmlFile()
Returns the order.xml file object.

Returns:
A file object for the order.xml file (which may not have been written yet).

getSeedsTxtFile

public java.io.File getSeedsTxtFile()
Returns the seeds.txt file object.

Returns:
A file object for the seeds.txt file (which may not have been written yet).

writeSeedsTxt

public void writeSeedsTxt(java.lang.String seeds)
Writes the given content to the seeds.txt file.

Parameters:
seeds - The intended content of seeds.txt
Throws:
ArgumentNotValid - if seeds is null or empty
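
A minimal sketch of this kind of write is shown below. It is not the NetarchiveSuite implementation: SeedsWriterSketch is a hypothetical name, IllegalArgumentException stands in for ArgumentNotValid, and both the location of seeds.txt directly under the crawl directory and the UTF-8 encoding are assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical stand-in; file location and encoding are assumptions.
final class SeedsWriterSketch {
    private SeedsWriterSketch() {
    }

    /** Validates the seeds, then writes them to seeds.txt inside crawlDir. */
    static File writeSeedsTxt(File crawlDir, String seeds) {
        if (seeds == null || seeds.isEmpty()) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("seeds must not be null or empty");
        }
        File seedsFile = new File(crawlDir, "seeds.txt");
        try {
            // Creates the file, or replaces its content if it already exists.
            Files.write(seedsFile.toPath(), seeds.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seedsFile;
    }
}
```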

writeOrderXml

public void writeOrderXml(org.dom4j.Document doc)
Writes the given order.xml content to the order.xml file.

Parameters:
doc - The intended content of order.xml
Throws:
ArgumentNotValid - if doc is null or empty

getHeritrixOutput

public java.io.File getHeritrixOutput()
Get the file that contains output from Heritrix on stdout/stderr.

Returns:
File that contains output from Heritrix on stdout/stderr.

setIndexDir

public void setIndexDir(java.io.File indexDir)
Set the deduplication index directory.

Parameters:
indexDir - the cache dir containing unzipped files
Throws:
ArgumentNotValid - if indexDir is not a directory or is null

getIndexDir

public java.io.File getIndexDir()
Returns the index directory, if one has been set.

Returns:
the index directory or null if no index has been set.

getDisposableFiles

public java.io.File[] getDisposableFiles()
Return an array of disposable Heritrix files. Currently the array consists of the file "state.job" and the directories "checkpoints", "state", and "scratch".

Returns:
an array of disposable Heritrix files.
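
The entries named above could be assembled roughly as follows. This is a sketch, not the library's code: DisposableFilesSketch is a hypothetical name, and it assumes that all four entries sit directly under the crawl directory.

```java
import java.io.File;

// Hypothetical illustration; the placement directly under crawlDir is an assumption.
final class DisposableFilesSketch {
    private DisposableFilesSketch() {
    }

    /** Builds File objects for the documented disposable entries; none of them need exist. */
    static File[] disposableFiles(File crawlDir) {
        return new File[] {
            new File(crawlDir, "state.job"),   // plain file
            new File(crawlDir, "checkpoints"), // directory
            new File(crawlDir, "state"),       // directory
            new File(crawlDir, "scratch")      // directory
        };
    }
}
```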

getCrawlLog

public java.io.File getCrawlLog()
Retrieve the crawlLog as a File object.

Returns:
the crawlLog as a File object.

getProgressStatisticsLog

public java.io.File getProgressStatisticsLog()
Retrieve the progress statistics log as a File object.

Returns:
the progress statistics log as a File object.

getJobID

public java.lang.Long getJobID()
Get the job ID.

Returns:
The job ID this HeritrixFiles object is for.

getHarvestID

public java.lang.Long getHarvestID()
Get the harvest ID.

Returns:
The harvest ID this HeritrixFiles object is for.

cleanUpAfterHarvest

public void cleanUpAfterHarvest(java.io.File oldJobsDir)
Delete statefile etc. and move crawl directory to oldjobs.

Parameters:
oldJobsDir - Directory to move any remaining files to.
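
One way such a cleanup could work is sketched below. It is not the NetarchiveSuite implementation: CleanUpSketch is a hypothetical name, and the sketch assumes that "statefile etc." means the entries listed under getDisposableFiles and that the crawl directory is moved into oldJobsDir under its own name.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical sketch of a delete-then-move cleanup; all details are assumptions.
final class CleanUpSketch {
    // Assumed to match the entries documented for getDisposableFiles().
    private static final String[] DISPOSABLE = {"state.job", "checkpoints", "state", "scratch"};

    private CleanUpSketch() {
    }

    /** Deletes the disposable entries, then moves crawlDir into oldJobsDir. Returns the new location. */
    static File cleanUp(File crawlDir, File oldJobsDir) {
        for (String name : DISPOSABLE) {
            deleteRecursively(new File(crawlDir, name));
        }
        try {
            Files.createDirectories(oldJobsDir.toPath());
            File destination = new File(oldJobsDir, crawlDir.getName());
            // A plain move of a non-empty directory only works within one file system.
            Files.move(crawlDir.toPath(), destination.toPath());
            return destination;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void deleteRecursively(File file) {
        File[] children = file.listFiles(); // null for plain files and missing paths
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        file.delete(); // returns false if the path is missing; ignored in this sketch
    }
}
```

Deleting first keeps the large temporary directories out of the move, so only files worth keeping end up in the old jobs directory.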

deleteFinalLogs

public void deleteFinalLogs()
Helper method to delete the crawl.log and progress statistics log. Will log errors but otherwise continue.


getArcsDir

public java.io.File getArcsDir()
Return the directory where Heritrix writes its ARC files.

Returns:
the directory where Heritrix writes its ARC files.

getJmxPasswordFile

public java.io.File getJmxPasswordFile()
Method for retrieving the jmxremote.password file.

Returns:
the jmxPasswordFile.

getJmxAccessFile

public java.io.File getJmxAccessFile()
Method for retrieving the jmxremote.access file.

Returns:
the jmxAccessFile.