Class IngestableFiles
- java.lang.Object
-
- dk.netarkivet.harvester.heritrix3.IngestableFiles
-
public class IngestableFiles extends java.lang.Object
Encapsulation of files to be ingested into the archive. These files are presently placed in subdirectories under the crawldir.
-
-
Field Summary
Modifier and Type | Field | Description
static java.lang.String | METADATA_FILENAME_FORMAT |
protected static java.lang.String | METADATA_SUB_DIR | Subdir with final metadata file in it.
-
Constructor Summary
Constructor | Description
IngestableFiles(Heritrix3Files files) | Constructor for this class.
-
Method Summary
Modifier and Type | Method | Description
void | cleanup() | Remove any temporary files.
void | closeMetadataFile() | Marks generated metadata as final, closes the writer, and moves the temporary metadata file to its final position.
void | closeOpenFiles(int waitSeconds) | Close any ".open" files left by a crashed Heritrix.
protected void | closeOpenFiles(java.lang.String archiveDirName, java.io.FilenameFilter filter) | Given an archive sub-directory name and a filter to match against, this method tries to rename the matched files.
java.util.List<java.io.File> | getArcFiles() | Get a list of all ARC files that should get ingested.
java.io.File | getArcsDir() |
java.io.File | getCrawlDir() |
long | getHarvestID() |
java.lang.String | getHarvestnamePrefix() |
java.io.File | getHeritrix3JobDir() |
long | getJobId() |
java.util.List<java.io.File> | getMetadataArcFiles() | Gets the files containing the metadata.
protected java.io.File | getMetadataFile() | Constructs the single metadata arc file from the crawlDir and the jobID.
MetadataFileWriter | getMetadataWriter() | Get a MetadataFileWriter for the temporary metadata file.
java.io.File | getReportsDir() |
java.io.File | getTmpMetadataDir() | Constructs the TEMPORARY metadata subdir from the crawlDir.
java.util.List<java.io.File> | getWarcFiles() | Get a list of all WARC files that should get ingested.
java.io.File | getWarcsDir() |
boolean | isMetadataFailed() | Return true if the metadata generation process is known to have failed.
boolean | isMetadataReady() | Check whether the metadata file already exists.
void | setErrorState(boolean isError) | Set error state.
-
-
-
Field Detail
-
METADATA_SUB_DIR
protected static final java.lang.String METADATA_SUB_DIR
Subdir with final metadata file in it.
- See Also:
- Constant Field Values
-
METADATA_FILENAME_FORMAT
public static final java.lang.String METADATA_FILENAME_FORMAT
-
-
Constructor Detail
-
IngestableFiles
public IngestableFiles(Heritrix3Files files)
Constructor for this class. The Heritrix3Files instance contains information about the crawlDir, jobID, and harvestnamePrefix for a specific finished harvest job.
- Parameters:
files
- An instance of Heritrix3Files
- Throws:
ArgumentNotValid
- if null arguments are given; if jobID < 1; if crawlDir does not exist
-
-
Method Detail
-
isMetadataReady
public boolean isMetadataReady()
Check whether the metadata file already exists. If this is true, metadata has been successfully generated. If false, either the metadata has not finished being generated, or there was an error generating it.
- Returns:
- true if the metadata file exists; false otherwise.
-
isMetadataFailed
public boolean isMetadataFailed()
Return true if the metadata generation process is known to have failed.
- Returns:
- true if metadata generation finished without success; false if generation is still ongoing or completed successfully.
-
closeMetadataFile
public void closeMetadataFile()
Marks generated metadata as final, closes the writer, and moves the temporary metadata file to its final position.
- Throws:
PermissionDenied
- If the metadata has already been marked as ready, or if no metadata file exists upon success.
IOFailure
- if there is an error marking the metadata as ready.
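The "move the temporary file to its final position" step can be illustrated with a minimal sketch. It is not the NetarchiveSuite implementation: the class `MetadataLifecycleSketch`, its fields, and its method names are hypothetical, and `IllegalStateException` and `UncheckedIOException` stand in for the documented `PermissionDenied` and `IOFailure`. Only the success path is modelled.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

/** Hypothetical stand-in for the metadata lifecycle described above. */
class MetadataLifecycleSketch {
    private final File tmpFile;    // temporary metadata file being written
    private final File finalFile;  // final position of the metadata file

    public MetadataLifecycleSketch(File tmpFile, File finalFile) {
        this.tmpFile = tmpFile;
        this.finalFile = finalFile;
    }

    /** Assumption: "ready" simply means the final file exists. */
    public boolean isReady() {
        return finalFile.isFile();
    }

    /** Moves the temporary file to its final position (success path only). */
    public void close() {
        if (isReady()) {
            // Stand-in for PermissionDenied: metadata already marked as ready.
            throw new IllegalStateException("Metadata already marked as ready");
        }
        if (!tmpFile.isFile()) {
            // Stand-in for PermissionDenied: no metadata file to finalize.
            throw new IllegalStateException("No temporary metadata file: " + tmpFile);
        }
        try {
            Files.move(tmpFile.toPath(), finalFile.toPath());
        } catch (IOException e) {
            // Stand-in for IOFailure: the move itself failed.
            throw new UncheckedIOException("Could not finalize metadata", e);
        }
    }
}
```

In this sketch a second call to `close()` is rejected, mirroring the documented rule that metadata cannot be marked as ready twice.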
-
setErrorState
public void setErrorState(boolean isError)
Set error state.
- Parameters:
isError
- true if in an error state, otherwise false.
-
getMetadataWriter
public MetadataFileWriter getMetadataWriter()
Get a MetadataFileWriter for the temporary metadata file. Successive calls to this method on the same object will return the same writer. Once the metadata has been finalized, calling this method will fail.
- Returns:
- a MetadataFileWriter for the temporary metadata file.
- Throws:
PermissionDenied
- if metadata generation is already finished.
-
getMetadataArcFiles
public java.util.List<java.io.File> getMetadataArcFiles()
Gets the files containing the metadata.
- Returns:
- the files in the metadata dir
- Throws:
IllegalState
- if the metadata file is not ready, either because generation is still going on or there was an error generating the metadata.
-
getMetadataFile
protected java.io.File getMetadataFile()
Constructs the single metadata arc file from the crawlDir and the jobID.
- Returns:
- metadata arc file as a File
-
getTmpMetadataDir
public java.io.File getTmpMetadataDir()
Constructs the TEMPORARY metadata subdir from the crawlDir.
- Returns:
- The tmp-metadata subdir as a File
-
getArcFiles
public java.util.List<java.io.File> getArcFiles()
Get a list of all ARC files that should get ingested. Any open files should be closed with closeOpenFiles first.
- Returns:
- The ARC files that are ready to get ingested.
-
getArcsDir
public java.io.File getArcsDir()
- Returns:
- the arcs dir in our crawl directory.
-
getWarcsDir
public java.io.File getWarcsDir()
- Returns:
- the warcs dir in our crawl directory.
-
getReportsDir
public java.io.File getReportsDir()
- Returns:
- the reports dir in our crawl directory.
-
getWarcFiles
public java.util.List<java.io.File> getWarcFiles()
Get a list of all WARC files that should get ingested. Any open files should be closed with closeOpenFiles first.
- Returns:
- The WARC files that are ready to get ingested.
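The call order implied by "Any open files should be closed with closeOpenFiles first" can be sketched as follows. The nested interface `IngestSource` is a hypothetical subset of the documented method signatures, declared only so the example compiles on its own; `collectArchiveFiles` is likewise an illustrative helper, not part of the class.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class IngestOrderSketch {
    /** Hypothetical subset of the documented IngestableFiles methods. */
    interface IngestSource {
        void closeOpenFiles(int waitSeconds);
        List<File> getArcFiles();
        List<File> getWarcFiles();
    }

    /** Closes leftover ".open" files first, then gathers ARC and WARC files. */
    static List<File> collectArchiveFiles(IngestSource files, int waitSeconds) {
        files.closeOpenFiles(waitSeconds);
        List<File> result = new ArrayList<>(files.getArcFiles());
        result.addAll(files.getWarcFiles());
        return result;
    }
}
```

As the closeOpenFiles documentation notes, this ordering is only appropriate once harvesting has finished.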
-
getHeritrix3JobDir
public java.io.File getHeritrix3JobDir()
-
closeOpenFiles
public void closeOpenFiles(int waitSeconds)
Close any ".open" files left by a crashed Heritrix. ARC and/or WARC files ending in .open indicate that Heritrix is still writing to them. If Heritrix has died, we can just rename them before we upload. This must not be done while harvesting is still in progress.
- Parameters:
waitSeconds
- How many seconds to wait before closing files. This may be done in order to allow Heritrix to finish writing before we close the files.
-
closeOpenFiles
protected void closeOpenFiles(java.lang.String archiveDirName, java.io.FilenameFilter filter)
Given an archive sub-directory name and a filter to match against, this method tries to rename the matched files. Files that cannot be renamed generate a log message. The filter should always match files that end with ".open" as a minimum.
- Parameters:
archiveDirName
- archive directory name, currently "arc" or "warc"
filter
- filename filter used to select ".open" files to rename
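A minimal sketch of the rename step, assuming that "closing" a file means stripping the ".open" suffix. The class `OpenFileRenamerSketch` is hypothetical. Unlike the documented method, it takes the directory as a `File` and returns the files it could not rename, which makes the behaviour easier to check; the documented method returns void and only logs.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.List;

class OpenFileRenamerSketch {
    static final String OPEN_SUFFIX = ".open";

    /**
     * Renames each file in dir accepted by filter by stripping ".open".
     * Returns the files that could not be renamed.
     */
    static List<File> closeOpenFiles(File dir, FilenameFilter filter) {
        List<File> failed = new ArrayList<>();
        File[] matches = dir.listFiles(filter);
        if (matches == null) {
            // dir does not exist or is not a directory: nothing to rename
            return failed;
        }
        for (File open : matches) {
            String name = open.getName();
            if (!name.endsWith(OPEN_SUFFIX)) {
                continue; // guard against a filter that matches too much
            }
            String closedName = name.substring(0, name.length() - OPEN_SUFFIX.length());
            File closed = new File(dir, closedName);
            if (!open.renameTo(closed)) {
                // Stand-in for the log message mentioned in the documentation.
                System.err.println("Could not rename " + open + " to " + closed);
                failed.add(open);
            }
        }
        return failed;
    }
}
```

A filter such as `(d, name) -> name.endsWith(".open")` satisfies the stated minimum requirement that the filter match ".open" files.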
-
cleanup
public void cleanup()
Remove any temporary files.
-
getJobId
public long getJobId()
- Returns:
- the jobID of the harvest job being processed.
-
getHarvestID
public long getHarvestID()
- Returns:
- the harvestID of the harvest job being processed.
-
getHarvestnamePrefix
public java.lang.String getHarvestnamePrefix()
- Returns:
- the harvestnamePrefix of the harvest job being processed.
-
getCrawlDir
public java.io.File getCrawlDir()
- Returns:
- the crawlDir of the harvest job being processed.
-
-