Class HeritrixFiles
- java.lang.Object
-
- dk.netarkivet.harvester.harvesting.HeritrixFiles
-
public class HeritrixFiles extends java.lang.Object
This class encapsulates all the files that Heritrix gets from our system, and all files we read from Heritrix.
-
-
Nested Class Summary
static class HeritrixFiles.Version
-
Constructor Summary
HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob, java.io.File jmxPasswordFile, java.io.File jmxAccessFile)
Create a new HeritrixFiles object for a job.

HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob, java.io.File jmxPasswordFile, java.io.File jmxAccessFile, HeritrixFiles.Version version)
-
Method Summary
void cleanUpAfterHarvest(java.io.File oldJobsDir)
Delete statefile etc. and move crawl directory to oldjobs.

void deleteFinalLogs()
Helper method to delete the crawl.log and progress statistics log.

java.lang.String getArchiveFilePrefix()
Returns the prefix used to generate Archive files (ARC or WARC).

java.io.File getArcsDir()
Returns the directory where Heritrix writes its ARC files.

java.io.File getCrawlDir()
Returns the directory that crawls are performed inside.

java.io.File getCrawlLog()
Retrieve the crawlLog as a File object.

java.io.File[] getDisposableFiles()
Return a list of disposable heritrix-files.

static HeritrixFiles getH1HeritrixFilesWithDefaultJmxFiles(java.io.File crawlDir, JobInfo harvestJob)

static HeritrixFiles getH3HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob)

java.lang.Long getHarvestID()
Get the harvest ID.

java.io.File getHeritrixOutput()
Get the file that contains output from Heritrix on stdout/stderr.

java.io.File getIndexDir()
Returns the index directory, if one has been set.

java.io.File getJmxAccessFile()
Method for retrieving the jmxremote.access file.

java.io.File getJmxPasswordFile()
Method for retrieving the jmxremote.password file.

java.lang.Long getJobID()
Get the job ID.

java.io.File getOrderXmlFile()
Returns the order.xml file object.

java.io.File getProgressStatisticsLog()
Retrieve the progress statistics log as a File object.

java.io.File getRecoverBackupGzFile()
Returns the recoverbackup file object.

java.io.File getSeedsTxtFile()
Returns the seeds.txt file object.

java.io.File getWarcsDir()
Returns the directory where Heritrix writes its WARC files.

void setIndexDir(java.io.File indexDir)
Set the deduplicate index dir.

void writeOrderXml(HeritrixTemplate doc)
Writes the given order.xml content to the order.xml file.

boolean writeRecoverBackupfile(java.io.InputStream recoverlog)
Try to write the recover-backup file.

void writeSeedsTxt(java.lang.String seeds)
Writes the given content to the seeds.txt file.
-
-
-
Constructor Detail
-
HeritrixFiles
public HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob, java.io.File jmxPasswordFile, java.io.File jmxAccessFile)
Create a new HeritrixFiles object for a job.
Parameters:
crawlDir - The directory where the crawl files are placed. Assumes that crawlDir already exists.
harvestJob - The harvest job behind this instance of HeritrixFiles.
jmxPasswordFile - The JMX password file to be used by Heritrix 1. The existence of this file is checked elsewhere.
jmxAccessFile - The JMX access file to be used by Heritrix 1. The existence of this file is checked elsewhere.
Throws:
ArgumentNotValid - if crawlDir is null, or if jobID and harvestID are non-positive.
-
HeritrixFiles
public HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob, java.io.File jmxPasswordFile, java.io.File jmxAccessFile, HeritrixFiles.Version version)
-
-
Method Detail
-
getH1HeritrixFilesWithDefaultJmxFiles
public static HeritrixFiles getH1HeritrixFilesWithDefaultJmxFiles(java.io.File crawlDir, JobInfo harvestJob)
-
getH3HeritrixFiles
public static HeritrixFiles getH3HeritrixFiles(java.io.File crawlDir, JobInfo harvestJob)
-
getCrawlDir
public java.io.File getCrawlDir()
Returns the directory that crawls are performed inside.
Returns:
A directory (that is created as part of harvest setup) that all of Heritrix' files live in.
-
getArchiveFilePrefix
public java.lang.String getArchiveFilePrefix()
Returns the prefix used to generate Archive files (ARC or WARC).
Returns:
The archive file prefix, currently jobID-harvestID.
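The jobID-harvestID format mentioned above can be illustrated with a minimal sketch. This is not the NetarchiveSuite implementation; the class ArchivePrefixSketch and the method buildPrefix are hypothetical names chosen for the example, and only the "jobID-harvestID" layout comes from the documentation.

```java
// Hypothetical sketch; not the actual HeritrixFiles code.
final class ArchivePrefixSketch {

    /** Joins the two IDs with a hyphen, giving the documented jobID-harvestID form. */
    static String buildPrefix(long jobId, long harvestId) {
        return jobId + "-" + harvestId;
    }

    public static void main(String[] args) {
        // Job 42 belonging to harvest 7.
        System.out.println(buildPrefix(42L, 7L)); // prints 42-7
    }
}
```

Because the prefix depends on the current format, callers are probably better off treating the returned string as opaque rather than parsing it.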
-
getOrderXmlFile
public java.io.File getOrderXmlFile()
Returns the order.xml file object.
Returns:
A file object for the order.xml file (which may not have been written yet).
-
getSeedsTxtFile
public java.io.File getSeedsTxtFile()
Returns the seeds.txt file object.
Returns:
A file object for the seeds.txt file (which may not have been written yet).
-
getRecoverBackupGzFile
public java.io.File getRecoverBackupGzFile()
Returns the recoverbackup file object.
Returns:
A file object for the recoverbackup.gz file (which may or may not exist).
-
writeRecoverBackupfile
public boolean writeRecoverBackupfile(java.io.InputStream recoverlog)
Try to write the recover-backup file.
Parameters:
recoverlog - The recoverlog in the form of an InputStream.
Returns:
true if the operation succeeds, otherwise false.
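The boolean-returning contract described above (true on success, false otherwise) could look roughly like the following. This is a sketch under assumptions: RecoverBackupSketch and writeBackup are hypothetical names, the target file is passed in explicitly, and mapping an I/O error to false is one plausible reading of "otherwise false", not a confirmed detail of HeritrixFiles.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch; names are illustrative and not part of NetarchiveSuite.
final class RecoverBackupSketch {

    /**
     * Copies the stream into the target file, replacing any existing file.
     * Returns true if the copy succeeds and false if an I/O error occurs.
     */
    static boolean writeBackup(InputStream recoverlog, File target) {
        try {
            Files.copy(recoverlog, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            // Assumption: failures are reported through the return value only.
            return false;
        }
    }
}
```

A caller would check the result before relying on the file returned by getRecoverBackupGzFile(), since that file may or may not exist.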
-
writeSeedsTxt
public void writeSeedsTxt(java.lang.String seeds)
Writes the given content to the seeds.txt file.
Parameters:
seeds - The intended content of seeds.txt.
Throws:
ArgumentNotValid - if seeds is null or empty.
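The validate-then-write behaviour described above might be sketched as follows. Several details are assumptions: SeedsWriterSketch and writeSeeds are hypothetical names, IllegalArgumentException stands in for ArgumentNotValid (which belongs to NetarchiveSuite and is not reproduced here), and the UTF-8 encoding and the handling of I/O errors are not specified by the documentation.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch; not the actual HeritrixFiles code.
final class SeedsWriterSketch {

    /** Rejects null or empty content, then writes the seeds to the given file. */
    static void writeSeeds(File seedsFile, String seeds) {
        if (seeds == null || seeds.isEmpty()) {
            // Stand-in for ArgumentNotValid in the real class.
            throw new IllegalArgumentException("seeds must not be null or empty");
        }
        try {
            // Assumption: UTF-8; the documentation does not name an encoding.
            Files.write(seedsFile.toPath(), seeds.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Assumption: I/O failures surface as unchecked exceptions.
            throw new UncheckedIOException(e);
        }
    }
}
```

Checking the argument before touching the file system means an invalid call leaves any existing seeds.txt untouched.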
-
writeOrderXml
public void writeOrderXml(HeritrixTemplate doc)
Writes the given order.xml content to the order.xml file.
Parameters:
doc - The intended content of order.xml.
-
getHeritrixOutput
public java.io.File getHeritrixOutput()
Get the file that contains output from Heritrix on stdout/stderr.
Returns:
File that contains output from Heritrix on stdout/stderr.
-
setIndexDir
public void setIndexDir(java.io.File indexDir)
Set the deduplicate index dir.
Parameters:
indexDir - the cache dir containing unzipped files.
Throws:
ArgumentNotValid - if indexDir is not a directory or is null.
-
getIndexDir
public java.io.File getIndexDir()
Returns the index directory, if one has been set.
Returns:
the index directory or null if no index has been set.
-
getDisposableFiles
public java.io.File[] getDisposableFiles()
Return a list of disposable heritrix-files. Currently the list consists of the file "state.job" and the directories "checkpoints", "state", and "scratch".
Returns:
a list of disposable heritrix-files.
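The list described above can be pictured with a short sketch. The four names come from the documentation; that they sit directly under the crawl directory is an assumption, and DisposableFilesSketch and disposableFiles are hypothetical names.

```java
import java.io.File;

// Hypothetical sketch; not the actual HeritrixFiles code.
final class DisposableFilesSketch {

    /**
     * Builds File objects for the documented disposable entries.
     * Assumption: all four are resolved relative to the crawl directory.
     */
    static File[] disposableFiles(File crawlDir) {
        return new File[] {
            new File(crawlDir, "state.job"),   // the one plain file
            new File(crawlDir, "checkpoints"), // directory
            new File(crawlDir, "state"),       // directory
            new File(crawlDir, "scratch")      // directory
        };
    }
}
```

Note that the returned File objects only describe paths; whether each entry actually exists on disk has to be checked by the caller.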
-
getCrawlLog
public java.io.File getCrawlLog()
Retrieve the crawlLog as a File object.
Returns:
the crawlLog as a File object.
-
getProgressStatisticsLog
public java.io.File getProgressStatisticsLog()
Retrieve the progress statistics log as a File object.
Returns:
the progress statistics log as a File object.
-
getJobID
public java.lang.Long getJobID()
Get the job ID.
Returns:
Job ID this heritrix files object is for.
-
getHarvestID
public java.lang.Long getHarvestID()
Get the harvest ID.
Returns:
Harvest ID this heritrix files object is for.
-
cleanUpAfterHarvest
public void cleanUpAfterHarvest(java.io.File oldJobsDir)
Delete statefile etc. and move crawl directory to oldjobs.
Parameters:
oldJobsDir - Directory to move the rest of any existing files to.
-
deleteFinalLogs
public void deleteFinalLogs()
Helper method to delete the crawl.log and progress statistics log. Will log errors but otherwise continue.
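The "log errors but otherwise continue" behaviour could be sketched as below. This is an illustration only: DeleteLogsSketch and deleteAll are hypothetical names, java.util.logging is used as a stand-in for whatever logging framework the real class uses, and treating a missing file as a non-error is an assumption.

```java
import java.io.File;
import java.util.logging.Logger;

// Hypothetical sketch; not the actual HeritrixFiles code.
final class DeleteLogsSketch {

    private static final Logger LOG = Logger.getLogger(DeleteLogsSketch.class.getName());

    /**
     * Tries to delete every given file. A failed deletion is logged as a
     * warning and the loop moves on to the next file.
     * Returns the number of files that could not be deleted.
     */
    static int deleteAll(File... files) {
        int failures = 0;
        for (File f : files) {
            // Assumption: a file that does not exist is not an error.
            if (f.exists() && !f.delete()) {
                LOG.warning("Could not delete " + f.getAbsolutePath());
                failures++;
            }
        }
        return failures;
    }
}
```

Returning the failure count is a convenience for this example; the documented method returns void and reports problems only through the log.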
-
getArcsDir
public java.io.File getArcsDir()
Returns the directory where Heritrix writes its ARC files.
Returns:
the directory where Heritrix writes its ARC files.
-
getWarcsDir
public java.io.File getWarcsDir()
Returns the directory where Heritrix writes its WARC files.
Returns:
the directory where Heritrix writes its WARC files.
-
getJmxPasswordFile
public java.io.File getJmxPasswordFile()
Method for retrieving the jmxremote.password file.
Returns:
the jmxPasswordFile.
-
getJmxAccessFile
public java.io.File getJmxAccessFile()
Method for retrieving the jmxremote.access file.
Returns:
the jmxAccessFile.
-
-