public class HeritrixFiles extends Object
Modifier and Type | Class and Description |
---|---|
static class | HeritrixFiles.Version |
Constructor and Description |
---|
HeritrixFiles(File crawlDir, JobInfo harvestJob, File jmxPasswordFile, File jmxAccessFile): Create a new HeritrixFiles object for a job. |
HeritrixFiles(File crawlDir, JobInfo harvestJob, File jmxPasswordFile, File jmxAccessFile, HeritrixFiles.Version version) |
Modifier and Type | Method and Description |
---|---|
void | cleanUpAfterHarvest(File oldJobsDir): Delete statefile etc. |
void | deleteFinalLogs(): Helper method to delete the crawl.log and the progress statistics log. |
String | getArchiveFilePrefix(): Returns the prefix used to generate archive files (ARC or WARC). |
File | getArcsDir(): Returns the directory where Heritrix writes its ARC files. |
File | getCrawlDir(): Returns the directory that crawls are performed inside. |
File | getCrawlLog(): Retrieves the crawl log as a File object. |
File[] | getDisposableFiles(): Returns a list of disposable Heritrix files. |
static HeritrixFiles | getH1HeritrixFilesWithDefaultJmxFiles(File crawlDir, JobInfo harvestJob) |
static HeritrixFiles | getH3HeritrixFiles(File crawlDir, JobInfo harvestJob) |
Long | getHarvestID(): Gets the harvest ID. |
File | getHeritrixOutput(): Gets the file that contains output from Heritrix on stdout/stderr. |
File | getIndexDir(): Returns the index directory, if one has been set. |
File | getJmxAccessFile(): Retrieves the jmxremote.access file. |
File | getJmxPasswordFile(): Retrieves the jmxremote.password file. |
Long | getJobID(): Gets the job ID. |
File | getOrderXmlFile(): Returns the order.xml file object. |
File | getProgressStatisticsLog(): Retrieves the progress statistics log as a File object. |
File | getRecoverBackupGzFile(): Returns the recoverbackup file object. |
File | getSeedsTxtFile(): Returns the seeds.txt file object. |
File | getWarcsDir(): Returns the directory where Heritrix writes its WARC files. |
void | setIndexDir(File indexDir): Sets the deduplication index directory. |
void | writeOrderXml(HeritrixTemplate doc): Writes the given order.xml content to the order.xml file. |
boolean | writeRecoverBackupfile(InputStream recoverlog): Tries to write the recover-backup file. |
void | writeSeedsTxt(String seeds): Writes the given content to the seeds.txt file. |
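The contract of writeSeedsTxt (write the given content to seeds.txt, and reject null or empty input) can be illustrated without the NetarchiveSuite classes. The following is a minimal sketch using only the Java standard library, not the actual implementation. The class name SeedsFileSketch is hypothetical, IllegalArgumentException stands in for ArgumentNotValid, and placing seeds.txt directly under the crawl directory with UTF-8 encoding is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in; not part of NetarchiveSuite.
public class SeedsFileSketch {

    /**
     * Writes the given seeds to a seeds.txt file inside crawlDir.
     * IllegalArgumentException stands in for ArgumentNotValid here.
     */
    public static Path writeSeedsTxt(Path crawlDir, String seeds) throws IOException {
        if (seeds == null || seeds.isEmpty()) {
            throw new IllegalArgumentException("seeds must not be null or empty");
        }
        // Assumption: seeds.txt lives directly under the crawl directory.
        Path seedsFile = crawlDir.resolve("seeds.txt");
        // Assumption: UTF-8 encoding; any existing file is overwritten.
        Files.write(seedsFile, seeds.getBytes(StandardCharsets.UTF_8));
        return seedsFile;
    }

    public static void main(String[] args) throws IOException {
        Path crawlDir = Files.createTempDirectory("crawl");
        Path written = writeSeedsTxt(crawlDir, "http://example.org/\n");
        System.out.println("Wrote " + written);
    }
}
```

Validating the argument before touching the file system means a rejected call leaves any existing seeds.txt unchanged.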
public HeritrixFiles(File crawlDir, JobInfo harvestJob, File jmxPasswordFile, File jmxAccessFile)
Parameters:
crawlDir
- The directory where the crawl files are placed. Assumes that crawlDir already exists.
harvestJob
- The harvest job behind this instance of HeritrixFiles.
jmxPasswordFile
- The JMX password file to be used by Heritrix 1. The existence of this file is checked elsewhere.
jmxAccessFile
- The JMX access file to be used by Heritrix 1. The existence of this file is checked elsewhere.
Throws:
ArgumentNotValid
- if crawlDir is null, or jobID and harvestID are non-positive.
public HeritrixFiles(File crawlDir, JobInfo harvestJob, File jmxPasswordFile, File jmxAccessFile, HeritrixFiles.Version version)
public static HeritrixFiles getH1HeritrixFilesWithDefaultJmxFiles(File crawlDir, JobInfo harvestJob)
public static HeritrixFiles getH3HeritrixFiles(File crawlDir, JobInfo harvestJob)
public File getCrawlDir()
public String getArchiveFilePrefix()
public File getOrderXmlFile()
public File getSeedsTxtFile()
public File getRecoverBackupGzFile()
public boolean writeRecoverBackupfile(InputStream recoverlog)
Parameters:
recoverlog
- The recover log in the form of an InputStream.
public void writeSeedsTxt(String seeds)
Parameters:
seeds
- The intended content of seeds.txt.
Throws:
ArgumentNotValid
- if seeds is null or empty.
public void writeOrderXml(HeritrixTemplate doc)
Parameters:
doc
- The intended content of order.xml.
Throws:
ArgumentNotValid
- if doc is null or empty.
public File getHeritrixOutput()
public void setIndexDir(File indexDir)
Parameters:
indexDir
- The cache directory containing unzipped files.
Throws:
ArgumentNotValid
- if indexDir is null or is not a directory.
public File getIndexDir()
public File[] getDisposableFiles()
public File getCrawlLog()
public File getProgressStatisticsLog()
public Long getHarvestID()
public void cleanUpAfterHarvest(File oldJobsDir)
Parameters:
oldJobsDir
- Directory to move the rest of any existing files to.
public void deleteFinalLogs()
public File getArcsDir()
public File getWarcsDir()
public File getJmxPasswordFile()
public File getJmxAccessFile()
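The setIndexDir and getIndexDir pair described above (reject a null or non-directory argument; return the index directory if one has been set) could look roughly like the following. This is a sketch under stated assumptions rather than the library's code: IndexDirSketch is a hypothetical class, IllegalArgumentException stands in for ArgumentNotValid, and returning null when no directory has been set is an assumption.

```java
import java.io.File;

// Hypothetical stand-in; not part of NetarchiveSuite.
public class IndexDirSketch {

    // Assumption: null means that no index directory has been set yet.
    private File indexDir;

    /** Stores the index directory after checking that it is an existing directory. */
    public void setIndexDir(File indexDir) {
        if (indexDir == null || !indexDir.isDirectory()) {
            // Stand-in for ArgumentNotValid.
            throw new IllegalArgumentException("indexDir must be an existing directory");
        }
        this.indexDir = indexDir;
    }

    /** Returns the index directory, or null if none has been set (assumed behavior). */
    public File getIndexDir() {
        return indexDir;
    }

    public static void main(String[] args) {
        IndexDirSketch files = new IndexDirSketch();
        files.setIndexDir(new File(System.getProperty("java.io.tmpdir")));
        System.out.println("Index dir: " + files.getIndexDir());
    }
}
```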
Copyright © 2005–2016 The Royal Danish Library, the Danish State and University Library, the National Library of France and the Austrian National Library. All rights reserved.