dk.netarkivet.harvester.harvesting
Class IngestableFiles

java.lang.Object
  extended by dk.netarkivet.harvester.harvesting.IngestableFiles

public class IngestableFiles
extends java.lang.Object

Encapsulation of files to be ingested into the archive. These files are presently placed in subdirectories under the crawldir.


Field Summary
static java.lang.String METADATA_FILENAME_FORMAT
           
protected static java.lang.String METADATA_SUB_DIR
          Subdir with final metadata file in it.
 
Constructor Summary
IngestableFiles(HeritrixFiles files)
          Constructor for this class.
 
Method Summary
 void cleanup()
          Remove any temporary files.
 void closeOpenFiles(int waitSeconds)
          Close any ".open" files left by a crashed Heritrix.
protected  void closeOpenFiles(java.lang.String archiveDirName, java.io.FilenameFilter filter)
          Given an archive sub-directory name and a filter to match against, this method tries to rename the matched files.
 java.util.List<java.io.File> getArcFiles()
          Get a list of all ARC files that should get ingested.
 java.io.File getArcsDir()
           
 java.io.File getCrawlDir()
           
 long getHarvestID()
           
 java.lang.String getHarvestnamePrefix()
           
 long getJobId()
           
 java.util.List<java.io.File> getMetadataArcFiles()
          Gets the files containing the metadata.
protected  java.io.File getMetadataFile()
          Constructs the single metadata arc file from the crawlDir and the jobID.
 MetadataFileWriter getMetadataWriter()
          Get a MetadataFileWriter for the temporary metadata file.
 java.io.File getTmpMetadataDir()
          Constructs the TEMPORARY metadata subdir from the crawlDir.
 java.util.List<java.io.File> getWarcFiles()
          Get a list of all WARC files that should get ingested.
 java.io.File getWarcsDir()
           
 boolean isMetadataFailed()
          Return true if the metadata generation process is known to have failed.
 boolean isMetadataReady()
          Check whether the metadata file already exists.
 void setMetadataGenerationSucceeded(boolean success)
          Marks generated metadata as final, closes the writer, and moves the temporary metadata file to its final position, if successful.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

METADATA_SUB_DIR

protected static final java.lang.String METADATA_SUB_DIR
Subdir with final metadata file in it.

See Also:
Constant Field Values

METADATA_FILENAME_FORMAT

public static final java.lang.String METADATA_FILENAME_FORMAT
Constructor Detail

IngestableFiles

public IngestableFiles(HeritrixFiles files)
Constructor for this class. HeritrixFiles contains information about crawlDir, jobId, and harvestnamePrefix for a specific finished harvest job.

Parameters:
files - An instance of HeritrixFiles
Throws:
ArgumentNotValid - if null-arguments are given; if jobID < 1; if crawlDir does not exist
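
The three failure conditions listed above can be sketched as plain argument checks. This is only an illustration under stated assumptions: the class CrawlArgsCheck and its validate method are hypothetical, the method takes crawlDir and jobId directly instead of a HeritrixFiles instance, and IllegalArgumentException stands in for ArgumentNotValid.

```java
import java.io.File;

/** Hypothetical stand-in for the documented constructor checks; not the NetarchiveSuite code. */
final class CrawlArgsCheck {

    /**
     * Rejects the same three situations the constructor documents:
     * a null argument, a jobID below 1, and a crawlDir that does not exist.
     * IllegalArgumentException is used here in place of ArgumentNotValid.
     */
    static void validate(File crawlDir, long jobId) {
        if (crawlDir == null) {
            throw new IllegalArgumentException("crawlDir must not be null");
        }
        if (jobId < 1) {
            throw new IllegalArgumentException("jobID must be >= 1, was " + jobId);
        }
        if (!crawlDir.exists()) {
            throw new IllegalArgumentException("crawlDir does not exist: " + crawlDir);
        }
    }
}
```
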
Method Detail

isMetadataReady

public boolean isMetadataReady()
Check whether the metadata file already exists. If this is true, metadata has been successfully generated. If false, either metadata has not finished being generated, or there was an error generating it.

Returns:
true if it does exist; false otherwise.

isMetadataFailed

public boolean isMetadataFailed()
Return true if the metadata generation process is known to have failed.

Returns:
True if metadata generation is finished without success, false if generation is still ongoing or has been successfully done.
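
Taken together, isMetadataReady() and isMetadataFailed() describe three states: ready, failed, and still in progress. A minimal sketch of how a caller might combine the two return values follows; the MetadataStatus class and its State enum are hypothetical names introduced for illustration.

```java
/** Hypothetical helper showing how the two boolean queries combine into three states. */
final class MetadataStatus {

    enum State { READY, FAILED, IN_PROGRESS }

    /**
     * @param metadataReady  the value returned by isMetadataReady()
     * @param metadataFailed the value returned by isMetadataFailed()
     */
    static State of(boolean metadataReady, boolean metadataFailed) {
        if (metadataReady) {
            return State.READY;          // the final metadata file exists
        }
        // Not ready: either generation is known to have failed or it is still running.
        return metadataFailed ? State.FAILED : State.IN_PROGRESS;
    }
}
```

A caller would pass in the two query results, for example `MetadataStatus.of(files.isMetadataReady(), files.isMetadataFailed())`, where `files` is an IngestableFiles instance.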

setMetadataGenerationSucceeded

public void setMetadataGenerationSucceeded(boolean success)
Marks generated metadata as final, closes the writer, and moves the temporary metadata file to its final position, if successful.

Parameters:
success - True if metadata was successfully generated, false otherwise.
Throws:
PermissionDenied - If the metadata has already been marked as ready, or if no metadata file exists upon success.
IOFailure - if there is an error marking the metadata as ready.
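
The contract above (refuse if already ready, require a temporary file on success, then move it into place) can be sketched with java.nio.file. This is an illustration only, not the actual implementation: MetadataFinalizer is a hypothetical class, IllegalStateException stands in for PermissionDenied, UncheckedIOException stands in for IOFailure, and closing the writer is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of the "move temporary file to final position" step. */
final class MetadataFinalizer {
    private final Path tmpFile;    // where metadata is written during generation
    private final Path finalFile;  // where the finished metadata file must end up
    private boolean failed;

    MetadataFinalizer(Path tmpFile, Path finalFile) {
        this.tmpFile = tmpFile;
        this.finalFile = finalFile;
    }

    /** Ready means the final file exists. */
    boolean isReady() {
        return Files.isRegularFile(finalFile);
    }

    boolean isFailed() {
        return failed;
    }

    void setSucceeded(boolean success) {
        if (isReady()) {
            // Stand-in for PermissionDenied: metadata was already marked as ready.
            throw new IllegalStateException("metadata already marked as ready");
        }
        if (!success) {
            failed = true;
            return;
        }
        if (!Files.isRegularFile(tmpFile)) {
            // Stand-in for PermissionDenied: success reported but nothing was written.
            throw new IllegalStateException("no metadata file exists: " + tmpFile);
        }
        try {
            Path parent = finalFile.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.move(tmpFile, finalFile);
        } catch (IOException e) {
            // Stand-in for IOFailure.
            throw new UncheckedIOException("could not move metadata file", e);
        }
    }
}
```
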

getMetadataWriter

public MetadataFileWriter getMetadataWriter()
Get a MetadataFileWriter for the temporary metadata file. Successive calls to this method on the same object will return the same writer. Once the metadata has been finalized, calling this method will fail.

Returns:
a MetadataFileWriter for the temporary metadata file.
Throws:
PermissionDenied - if metadata generation is already finished.
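
The behaviour described above is a lazily created, cached writer that becomes unavailable once generation is finished. A minimal sketch of that pattern follows, assuming a java.io.StringWriter as a stand-in for MetadataFileWriter and IllegalStateException as a stand-in for PermissionDenied; CachedWriterHolder and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical sketch of a cached writer that is refused after finalization. */
final class CachedWriterHolder {
    private Writer writer;       // created on the first request, then reused
    private boolean finished;    // set once metadata generation is finished

    Writer getWriter() {
        if (finished) {
            // Stand-in for PermissionDenied.
            throw new IllegalStateException("metadata generation is already finished");
        }
        if (writer == null) {
            writer = new StringWriter();  // stand-in for opening the temporary metadata file
        }
        return writer;
    }

    /** Closes the writer, if one was handed out, and blocks further requests. */
    void finish() {
        finished = true;
        if (writer != null) {
            try {
                writer.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```
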

getMetadataArcFiles

public java.util.List<java.io.File> getMetadataArcFiles()
Gets the files containing the metadata.

Returns:
the files in the metadata dir
Throws:
PermissionDenied - if the metadata file is not ready, either because generation is still going on or there was an error generating the metadata.

getMetadataFile

protected java.io.File getMetadataFile()
Constructs the single metadata arc file from the crawlDir and the jobID.

Returns:
metadata arc file as a File

getTmpMetadataDir

public java.io.File getTmpMetadataDir()
Constructs the TEMPORARY metadata subdir from the crawlDir.

Returns:
The tmp-metadata subdir as a File

getArcFiles

public java.util.List<java.io.File> getArcFiles()
Get a list of all ARC files that should get ingested. Any open files should be closed with closeOpenFiles first.

Returns:
The ARC files that are ready to get ingested.
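
As a rough illustration of why open files need closing first, the sketch below lists only files whose names end in ".arc" or ".arc.gz", so a file still carrying a trailing ".open" is skipped. The naming convention, the ArcFileLister class, and the sorted result are assumptions made for this example, not details taken from the implementation.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: list the ARC files in a directory that look ready for ingest. */
final class ArcFileLister {

    /** Assumed naming: finished files end in .arc or .arc.gz; unfinished ones end in .open. */
    static final FilenameFilter READY_ARC =
            (dir, name) -> name.endsWith(".arc") || name.endsWith(".arc.gz");

    static List<File> listReady(File arcsDir) {
        File[] found = arcsDir.listFiles(READY_ARC);
        if (found == null) {
            // listFiles returns null if arcsDir is not a directory or cannot be read.
            return new ArrayList<>();
        }
        Arrays.sort(found);  // stable, predictable order for callers
        return new ArrayList<>(Arrays.asList(found));
    }
}
```
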

getArcsDir

public java.io.File getArcsDir()
Returns:
the arcs dir in our crawl directory.

getWarcsDir

public java.io.File getWarcsDir()
Returns:
the warcs dir in our crawl directory.

getWarcFiles

public java.util.List<java.io.File> getWarcFiles()
Get a list of all WARC files that should get ingested. Any open files should be closed with closeOpenFiles first.

Returns:
The WARC files that are ready to get ingested.

closeOpenFiles

public void closeOpenFiles(int waitSeconds)
Close any ".open" files left by a crashed Heritrix. ARC and/or WARC files ending in .open indicate that Heritrix is still writing to them. If Heritrix has died, we can just rename them before we upload. This must not be done while harvesting is still in progress.

Parameters:
waitSeconds - How many seconds to wait before closing files. This may be done in order to allow Heritrix to finish writing before we close the files.

closeOpenFiles

protected void closeOpenFiles(java.lang.String archiveDirName,
                              java.io.FilenameFilter filter)
Given an archive sub-directory name and a filter to match against, this method tries to rename the matched files. Files that cannot be renamed generate a log message. The filter should always match files that end with ".open" as a minimum.

Parameters:
archiveDirName - archive directory name, currently "arc" or "warc"
filter - filename filter used to select ".open" files to rename
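
A minimal sketch of the rename step follows, assuming that "closing" a file means stripping the trailing ".open" from its name. OpenFileCloser is a hypothetical class; it takes the directory as a File rather than a sub-directory name, and it returns the files that could not be renamed where the documented method writes a log message.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of renaming ".open" files selected by a FilenameFilter. */
final class OpenFileCloser {
    static final String OPEN_SUFFIX = ".open";

    /** The minimal filter: anything ending in ".open". */
    static final FilenameFilter OPEN_FILES = (dir, name) -> name.endsWith(OPEN_SUFFIX);

    /**
     * Renames every matched file by dropping the ".open" suffix (an assumption
     * about what the rename does).
     *
     * @return the files that could not be renamed
     */
    static List<File> closeOpenFiles(File archiveDir, FilenameFilter filter) {
        List<File> failures = new ArrayList<>();
        File[] matched = archiveDir.listFiles(filter);
        if (matched == null) {
            return failures;  // not a directory, or it could not be read
        }
        for (File open : matched) {
            String name = open.getName();
            if (!name.endsWith(OPEN_SUFFIX)) {
                continue;  // the filter matched something that is not an open file
            }
            String closedName = name.substring(0, name.length() - OPEN_SUFFIX.length());
            File closed = new File(open.getParentFile(), closedName);
            if (!open.renameTo(closed)) {
                failures.add(open);  // the documented method logs this case
            }
        }
        return failures;
    }
}
```

A caller honouring the waitSeconds parameter of the public overload could sleep for that long before invoking a method like this one, giving Heritrix time to finish writing.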

cleanup

public void cleanup()
Remove any temporary files.


getJobId

public long getJobId()
Returns:
the jobID of the harvest job being processed.

getHarvestID

public long getHarvestID()
Returns:
the harvestID of the harvest job being processed.

getHarvestnamePrefix

public java.lang.String getHarvestnamePrefix()
Returns:
the harvestnamePrefix of the harvest job being processed.

getCrawlDir

public java.io.File getCrawlDir()
Returns:
the crawlDir of the harvest job being processed.