dk.netarkivet.common.utils
Class FileUtils

java.lang.Object
  extended by dk.netarkivet.common.utils.FileUtils

public class FileUtils
extends java.lang.Object

Misc. handy file utilities.


Nested Class Summary
static class FileUtils.FilenameParser
          A class for parsing an ARC filename as generated by our runs of Heritrix and retrieving components like harvestID and jobID.
 
Field Summary
static java.lang.String ARC_EXTENSION
          Extension used for ARC files, including separator .
static java.lang.String ARC_GZIPPED_EXTENSION
          Extension used for gzipped ARC files, including separator .
static java.lang.String ARC_PATTERN
          Pattern matching ARC files, including separator.
static java.io.FilenameFilter ARCS_FILTER
          A filter that matches ARC files, that is, any file whose name ends with .arc or .arc.gz in any case.
static java.lang.String CDX_EXTENSION
          Extension used for CDX files, including separator .
static java.io.FilenameFilter CDX_FILE_FILTER
          A FilenameFilter accepting a file if and only if its name (transformed to lower case) ends with ".cdx".
static org.apache.commons.logging.Log log
          The logger for this class.
static int MAX_IDS_IN_FILENAME
          Maximum number of IDs we will put in a filename.
static java.lang.String OPEN_ARC_PATTERN
          Pattern matching open ARC files, including separator .
static java.io.FilenameFilter OPEN_ARCS_FILTER
          A filter that matches files left open by a crashed Heritrix process.
static java.lang.String WARC_GZIPPED_EXTENSION
          Extension used for gzipped WARC files, including separator .
static java.lang.String WARC_PATTERN
          Pattern matching WARC files, including separator.
static java.io.FilenameFilter WARCS_FILTER
          A filter that matches WARC files, that is, any file whose name ends with .warc or .warc.gz in any case.
 
Constructor Summary
FileUtils()
           
 
Method Summary
static void appendToFile(java.io.File file, java.lang.String... lines)
          Append the given lines to a file.
static void copyDirectory(java.io.File from, java.io.File to)
          Copy an entire directory from one location to another.
static void copyFile(java.io.File from, java.io.File to)
          Copy file from one location to another.
static long countLines(java.io.File file)
          Count the number of lines in a file.
static boolean createDir(java.io.File dir)
          Check if the directory exists and is writable and create it if needed.
static java.io.File createUniqueTempDir(java.io.File inDir, java.lang.String prefix)
          Creates a new temporary directory with a unique name.
static java.lang.String formatFilename(java.lang.String filename)
          Returns a valid filename for most filesystems.
static
<T extends java.lang.Comparable<T>>
java.lang.String
generateFileNameFromSet(java.util.Set<T> IDs, java.lang.String suffix)
          Given a set, generate a reasonable file name from the set.
static long getBytesFree(java.io.File f)
          Returns the number of bytes free on the file system, as reported by the FreeSpaceProvider class defined by the setting CommonSettings.FREESPACE_PROVIDER_CLASS (a.k.a. settings.common.freespaceprovider.class).
static java.io.InputStream getEphemeralInputStream(java.io.File file)
          Create an InputStream that reads from a file but removes the file when all data has been read.
static java.util.List<java.io.File> getFilesRecursively(java.lang.String dir, java.util.List<java.io.File> files, java.lang.String type)
          Retrieves all files whose names end with 'type' from directory 'dir' and all its subdirectories.
static java.io.File getResourceFileFromClassPath(java.lang.String filePath)
          Loads a file from the class path (for retrieving a file from '.jar').
static java.io.File getTempDir()
          Get the location of the standard temporary directory.
static java.io.FilenameFilter getXmlFilesFilter()
          Return a filter that only accepts XML files (ending with .xml), irrespective of their location.
static void makeSortedFile(java.io.File unsortedFile, java.io.File sortedOutput)
          Sort a file into another.
static java.io.File makeValidFileFromExisting(java.lang.String filename)
          Makes a valid file from filename passed in String.
static void moveFile(java.io.File fromFile, java.io.File toFile)
          Attempt to move a file using rename, and if that fails, move the file by copy-and-delete.
static byte[] readBinaryFile(java.io.File file)
          Read an entire file, byte by byte, into a byte array, ignoring any locale issues.
static java.lang.String readFile(java.io.File file)
          Load file content into text string.
static java.lang.String readLastLine(java.io.File file)
          Read the last line in a file.
static java.util.List<java.lang.String> readListFromFile(java.io.File file)
          Read all lines from a file into a list of strings.
static java.lang.String relativeTo(java.io.File theFile, java.io.File theDir)
          Returns the path of theFile relative to theDir.
static boolean remove(java.io.File f)
          Remove a file.
static void removeLineFromFile(java.lang.String line, java.io.File file)
          Remove a line from a given file.
static boolean removeRecursively(java.io.File f)
          Remove a file and any subfiles in case of directories.
static void sortCDX(java.io.File file, java.io.File toFile)
          Sort a CDX file according to our standard for CDX file sorting.
static void sortCrawlLog(java.io.File file, java.io.File toFile)
          Sort a crawl.log file according to URL.
static void writeBinaryFile(java.io.File file, byte[] b)
          Write an entire byte array to a file, ignoring any locale issues.
static void writeCollectionToFile(java.io.File file, java.util.Collection<java.lang.String> collection)
          Writes a collection of strings to a file, each string on one line.
static void writeFileToStream(java.io.File f, java.io.OutputStream out)
          Write the entire contents of a file to a stream.
static void writeStreamToFile(java.io.InputStream in, java.io.File f)
          Write the contents of a stream into a file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CDX_EXTENSION

public static final java.lang.String CDX_EXTENSION
Extension used for CDX files, including separator .

See Also:
Constant Field Values

ARC_EXTENSION

public static final java.lang.String ARC_EXTENSION
Extension used for ARC files, including separator .

See Also:
Constant Field Values

ARC_GZIPPED_EXTENSION

public static final java.lang.String ARC_GZIPPED_EXTENSION
Extension used for gzipped ARC files, including separator .

See Also:
Constant Field Values

WARC_GZIPPED_EXTENSION

public static final java.lang.String WARC_GZIPPED_EXTENSION
Extension used for gzipped WARC files, including separator .

See Also:
Constant Field Values

ARC_PATTERN

public static final java.lang.String ARC_PATTERN
Pattern matching ARC files, including separator. Note: (?i) makes the match case insensitive, (\\.gz)? matches an optional .gz, and $ matches end-of-line. Thus this pattern will match file.arc.gz, file.ARC, and file.aRc.GZ, but not file.ARC.open.

See Also:
Constant Field Values

OPEN_ARC_PATTERN

public static final java.lang.String OPEN_ARC_PATTERN
Pattern matching open ARC files, including separator . Note: (?i) makes the match case insensitive, (\\.gz)? matches an optional .gz, and $ matches end-of-line. Thus this pattern will match file.arc.gz.open, file.ARC.open, and file.arc.GZ.OpEn, but not file.ARC.open.txt.

See Also:
Constant Field Values

WARC_PATTERN

public static final java.lang.String WARC_PATTERN
Pattern matching WARC files, including separator. Note: (?i) makes the match case insensitive, (\\.gz)? matches an optional .gz, and $ matches end-of-line. Thus this pattern will match file.warc.gz, file.WARC, and file.WaRc.GZ, but not file.WARC.open.

See Also:
Constant Field Values
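
The behaviour described for ARC_PATTERN, OPEN_ARC_PATTERN and WARC_PATTERN can be tried out with plain java.util.regex. The sketch below is only an illustration: the actual constant values are not shown on this page, so the three expressions are assumptions reconstructed from the descriptions above, and ArchivePatternSketch and its matches helper are hypothetical names.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-ins for the pattern constants; the real values are not shown here. */
final class ArchivePatternSketch {
    // Assumed values, reconstructed from the field descriptions.
    static final String ARC = "(?i)\\.arc(\\.gz)?$";
    static final String OPEN_ARC = "(?i)\\.arc(\\.gz)?\\.open$";
    static final String WARC = "(?i)\\.warc(\\.gz)?$";

    /** True if the pattern occurs in the name; the trailing $ ties it to the end of the name. */
    static boolean matches(String pattern, String name) {
        return Pattern.compile(pattern).matcher(name).find();
    }

    public static void main(String[] args) {
        String[] names = {"file.arc.gz", "file.ARC", "file.ARC.open", "file.WaRc.GZ"};
        for (String name : names) {
            System.out.println(name
                    + " arc=" + matches(ARC, name)
                    + " open=" + matches(OPEN_ARC, name)
                    + " warc=" + matches(WARC, name));
        }
    }
}
```

Note that find() is used rather than String.matches(), because the assumed expressions describe only the end of the file name, not the whole name.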

log

public static final org.apache.commons.logging.Log log
The logger for this class.


CDX_FILE_FILTER

public static final java.io.FilenameFilter CDX_FILE_FILTER
A FilenameFilter accepting a file if and only if its name (transformed to lower case) ends with ".cdx".


OPEN_ARCS_FILTER

public static final java.io.FilenameFilter OPEN_ARCS_FILTER
A filter that matches files left open by a crashed Heritrix process. Don't work on these files while Heritrix is still working on them.


ARCS_FILTER

public static final java.io.FilenameFilter ARCS_FILTER
A filter that matches ARC files, that is, any file whose name ends with .arc or .arc.gz in any case.


WARCS_FILTER

public static final java.io.FilenameFilter WARCS_FILTER
A filter that matches WARC files, that is, any file whose name ends with .warc or .warc.gz in any case.


MAX_IDS_IN_FILENAME

public static final int MAX_IDS_IN_FILENAME
Maximum number of IDs we will put in a filename. Above this number, a checksum of the ids is generated instead. This is done to protect us from getting filenames too long for the filesystem.

See Also:
Constant Field Values
Constructor Detail

FileUtils

public FileUtils()
Method Detail

removeRecursively

public static boolean removeRecursively(java.io.File f)
Remove a file and any subfiles in case of directories.

Parameters:
f - A file to completely and utterly remove.
Returns:
true if the file did exist, false otherwise.
Throws:
java.lang.SecurityException - If a security manager exists and its SecurityManager.checkDelete(java.lang.String) method denies delete access to the file

remove

public static boolean remove(java.io.File f)
Remove a file.

Parameters:
f - A file to completely and utterly remove.
Returns:
true if the file did exist, false otherwise.
Throws:
ArgumentNotValid - if f is null.
java.lang.SecurityException - If a security manager exists and its SecurityManager.checkDelete(java.lang.String) method denies delete access to the file

formatFilename

public static java.lang.String formatFilename(java.lang.String filename)
Returns a valid filename for most filesystems. Exchanges the following characters:

" " -> "_"
":" -> "_"
"+" -> "_"

Parameters:
filename - the filename to format correctly
Returns:
a new formatted filename
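
The three substitutions listed above amount to a chain of character replacements. A minimal sketch, assuming nothing beyond those three rules (FormatFilenameSketch is a hypothetical name, not part of this class):

```java
/** Hypothetical illustration of the three character substitutions listed above. */
final class FormatFilenameSketch {
    /** Replaces space, colon and plus with an underscore; all other characters are kept. */
    static String formatFilename(String filename) {
        return filename.replace(' ', '_').replace(':', '_').replace('+', '_');
    }

    public static void main(String[] args) {
        System.out.println(formatFilename("harvest 12:30+extra.txt"));
    }
}
```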

getFilesRecursively

public static java.util.List<java.io.File> getFilesRecursively(java.lang.String dir,
                                                               java.util.List<java.io.File> files,
                                                               java.lang.String type)
Retrieves all files whose names end with 'type' from directory 'dir' and all its subdirectories.

Parameters:
dir - Path of base directory
files - Initially, an empty list (e.g. an ArrayList)
type - The extension/ending of the files to retrieve (e.g. ".xml", ".ARC")
Returns:
A list of files from directory 'dir' and all its subdirectories
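
The calling convention, where the caller passes in an initially empty list that is filled and returned, is typical of a recursive accumulator. The following is a sketch of that idea using only java.io.File, not the actual implementation; RecursiveListingSketch is a hypothetical name, and the case-sensitive suffix comparison is an assumption.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a recursive listing that accumulates into a caller-supplied list. */
final class RecursiveListingSketch {
    static List<File> collect(String dir, List<File> files, String type) {
        File[] children = new File(dir).listFiles();
        if (children == null) {
            return files; // dir is not a directory, or it could not be read
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collect(child.getPath(), files, type); // descend, reusing the same list
            } else if (child.getName().endsWith(type)) { // assumed: case-sensitive comparison
                files.add(child);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<File> xmlFiles = collect(".", new ArrayList<File>(), ".xml");
        System.out.println(xmlFiles.size() + " XML file(s) found");
    }
}
```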

readFile

public static java.lang.String readFile(java.io.File file)
                                 throws java.io.IOException
Load file content into text string.

Parameters:
file - The file to load
Returns:
file content loaded into text string
Throws:
java.io.IOException - If any IO trouble occurs while reading the file, or the file cannot be found.

copyFile

public static void copyFile(java.io.File from,
                            java.io.File to)
Copy file from one location to another. Will silently overwrite an already existing file.

Parameters:
from - original to copy
to - destination of copy
Throws:
IOFailure - if an io error occurs while copying file, or the original file does not exist.

copyDirectory

public static void copyDirectory(java.io.File from,
                                 java.io.File to)
                          throws IOFailure
Copy an entire directory from one location to another. Note that this will silently overwrite old files, just like copyFile().

Parameters:
from - Original directory (or file, for that matter) to copy.
to - Destination directory, i.e. the 'new name' of the copy of the from directory.
Throws:
IOFailure - On IO trouble copying files.

readBinaryFile

public static byte[] readBinaryFile(java.io.File file)
                             throws IOFailure,
                                    java.lang.IndexOutOfBoundsException
Read an entire file, byte by byte, into a byte array, ignoring any locale issues.

Parameters:
file - A file to be read.
Returns:
A byte array with the contents of the file.
Throws:
IOFailure - on IO trouble reading the file, or the file does not exist
java.lang.IndexOutOfBoundsException - If the file is too large to be in an array.

writeBinaryFile

public static void writeBinaryFile(java.io.File file,
                                   byte[] b)
Write an entire byte array to a file, ignoring any locale issues.

Parameters:
file - The file to write the data to
b - The byte array to write to the file
Throws:
IOFailure - If an exception occurs during the writing.

getXmlFilesFilter

public static java.io.FilenameFilter getXmlFilesFilter()
Return a filter that only accepts XML files (ending with .xml), irrespective of their location.

Returns:
A new filter for XML files.

readListFromFile

public static java.util.List<java.lang.String> readListFromFile(java.io.File file)
Read all lines from a file into a list of strings.

Parameters:
file - The file to read from.
Returns:
The list of lines.
Throws:
IOFailure - on trouble reading the file, or if the file does not exist

writeCollectionToFile

public static void writeCollectionToFile(java.io.File file,
                                         java.util.Collection<java.lang.String> collection)
Writes a collection of strings to a file, each string on one line.

Parameters:
file - A file to write to. The contents of this file will be overwritten.
collection - The collection to write. The order it will be written in is unspecified.
Throws:
IOFailure - if any error occurs writing to the file.
ArgumentNotValid - if file or collection is null.

makeSortedFile

public static void makeSortedFile(java.io.File unsortedFile,
                                  java.io.File sortedOutput)
Sort a file into another. The current implementation slurps all lines into memory. This will not scale forever.

Parameters:
unsortedFile - A file to sort
sortedOutput - The file to sort into
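
To make the memory caveat concrete, here is a sketch of the "slurp, sort, write" approach described above. It is an illustration rather than the actual code: SortFileSketch is a hypothetical name, and the UTF-8 encoding and natural String ordering are assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of sorting a file by reading every line into memory. */
final class SortFileSketch {
    static void sortLines(File unsortedFile, File sortedOutput) throws IOException {
        // The whole file is held in this list, so memory use grows with file size.
        List<String> lines = new ArrayList<>(
                Files.readAllLines(unsortedFile.toPath(), StandardCharsets.UTF_8));
        Collections.sort(lines); // assumed: natural String ordering
        Files.write(sortedOutput.toPath(), lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File in = File.createTempFile("unsorted", ".txt");
        File out = File.createTempFile("sorted", ".txt");
        Files.write(in.toPath(), "pear\napple\nfig\n".getBytes(StandardCharsets.UTF_8));
        sortLines(in, out);
        System.out.println(Files.readAllLines(out.toPath(), StandardCharsets.UTF_8));
    }
}
```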

removeLineFromFile

public static void removeLineFromFile(java.lang.String line,
                                      java.io.File file)
Remove a line from a given file.

Parameters:
line - The full line to remove
file - The file to remove the line from. This file will be rewritten in full, and the entire contents will be kept in memory
Throws:
UnknownID - If the file does not exist

createDir

public static boolean createDir(java.io.File dir)
                         throws PermissionDenied
Check if the directory exists and is writable and create it if needed. The complete path down to the directory is created. If the directory creation fails a PermissionDenied exception is thrown.

Parameters:
dir - The directory to create
Returns:
true if dir created.
Throws:
ArgumentNotValid - If dir is null or its name is the empty string
PermissionDenied - If directory cannot be created for any reason, or is not writable.

getBytesFree

public static long getBytesFree(java.io.File f)
Returns the number of bytes free on the file system, as reported by the FreeSpaceProvider class defined by the setting CommonSettings.FREESPACE_PROVIDER_CLASS (a.k.a. settings.common.freespaceprovider.class).

Parameters:
f - a given file
Returns:
the number of bytes free, as reported by the FreeSpaceProvider class configured in settings.xml

relativeTo

public static java.lang.String relativeTo(java.io.File theFile,
                                          java.io.File theDir)
Return the path of theFile relative to theDir.

Parameters:
theFile - A file to make relative
theDir - A directory
Returns:
the path of theFile relative to theDir; null if theFile is not relative to theDir, or if theDir is not a directory.
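
One way to read the return contract is sketched below with java.nio.file.Path. This is an interpretation, not the actual implementation: RelativePathSketch is a hypothetical name, and "not relative to theDir" is assumed to mean "not located under theDir".

```java
import java.io.File;
import java.nio.file.Path;

/** Hypothetical sketch of computing a file's path relative to a directory. */
final class RelativePathSketch {
    static String relativeTo(File theFile, File theDir) {
        if (!theDir.isDirectory()) {
            return null; // second null case from the contract
        }
        Path dir = theDir.toPath().toAbsolutePath().normalize();
        Path file = theFile.toPath().toAbsolutePath().normalize();
        if (!file.startsWith(dir)) {
            return null; // assumed meaning of "not relative to theDir"
        }
        return dir.relativize(file).toString();
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(relativeTo(new File(dir, "sub/data.cdx"), dir));
    }
}
```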

countLines

public static long countLines(java.io.File file)
Count the number of lines in a file.

Parameters:
file - the file to read
Returns:
the number of lines in the file
Throws:
IOFailure - If an error occurred while reading the file

getEphemeralInputStream

public static java.io.InputStream getEphemeralInputStream(java.io.File file)
Create an InputStream that reads from a file but removes the file when all data has been read.

Parameters:
file - A file to read. This file will be deleted when the inputstream is closed, finalized, reaches end-of-file, or when the VM closes.
Returns:
An InputStream containing the file's contents.
Throws:
IOFailure - If an error occurs in creating the ephemeral input stream
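
The "delete when done" behaviour can be approximated by a FileInputStream subclass. The sketch below covers only two of the triggers named above, closing the stream and VM exit; it does not delete at end-of-file or on finalization. EphemeralInputStreamSketch is a hypothetical name, not the class used here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

/** Hypothetical sketch: an input stream that removes its backing file when closed. */
final class EphemeralInputStreamSketch extends FileInputStream {
    private final File file;

    EphemeralInputStreamSketch(File file) throws FileNotFoundException {
        super(file);
        this.file = file;
        file.deleteOnExit(); // fallback in case the stream is never closed
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            file.delete(); // a repeated close() simply finds nothing left to delete
        }
    }

    public static void main(String[] args) throws IOException {
        File temp = File.createTempFile("ephemeral", ".bin");
        try (EphemeralInputStreamSketch in = new EphemeralInputStreamSketch(temp)) {
            while (in.read() != -1) {
                // consume the (empty) file
            }
        }
        System.out.println("still exists: " + temp.exists());
    }
}
```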

makeValidFileFromExisting

public static java.io.File makeValidFileFromExisting(java.lang.String filename)
                                              throws IOFailure
Makes a valid file from filename passed in String. Ensures that the File object returned is not null, and that isFile() returns true.

Parameters:
filename - The file to create the File object from
Returns:
A valid, non-null File object.
Throws:
IOFailure - if file cannot be created.

writeFileToStream

public static void writeFileToStream(java.io.File f,
                                     java.io.OutputStream out)
Write the entire contents of a file to a stream.

Parameters:
f - A file to write to the stream.
out - The stream to write to.
Throws:
IOFailure - If any error occurs while writing the file to a stream

writeStreamToFile

public static void writeStreamToFile(java.io.InputStream in,
                                     java.io.File f)
Write the contents of a stream into a file.

Parameters:
in - A stream to read from. This stream is not closed by this method.
f - The file to write the stream contents into.
Throws:
IOFailure - If any error occurs while writing the stream to a file

getTempDir

public static java.io.File getTempDir()
Get the location of the standard temporary directory. The existence of this directory should be ensured at the start of every application.

Returns:
The directory that should be used for temporary files.

moveFile

public static void moveFile(java.io.File fromFile,
                            java.io.File toFile)
Attempt to move a file using rename, and if that fails, move the file by copy-and-delete.

Parameters:
fromFile - The source
toFile - The target
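
The rename-then-fallback strategy can be sketched with standard JDK calls. This is an illustration under assumptions, not the actual code: MoveFileSketch is a hypothetical name, and the sketch reports failures as IOException, which need not match what moveFile does.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of "try rename, otherwise copy and delete". */
final class MoveFileSketch {
    static void move(File fromFile, File toFile) throws IOException {
        if (fromFile.renameTo(toFile)) {
            return; // cheap path: typically works within a single file system
        }
        // Fallback, e.g. when source and target are on different file systems.
        Files.copy(fromFile.toPath(), toFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        Files.delete(fromFile.toPath());
    }

    public static void main(String[] args) throws IOException {
        File from = File.createTempFile("move-source", ".txt");
        File to = new File(from.getParentFile(), from.getName() + ".moved");
        move(from, to);
        System.out.println("moved to " + to);
        to.deleteOnExit();
    }
}
```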

generateFileNameFromSet

public static <T extends java.lang.Comparable<T>> java.lang.String generateFileNameFromSet(java.util.Set<T> IDs,
                                                                                           java.lang.String suffix)
Given a set, generate a reasonable file name from the set.

Type Parameters:
T - The type of objects, that the Set IDs argument contains.
Parameters:
IDs - A set of IDs.
suffix - A suffix. May be empty string.
Returns:
A reasonable file name.
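
The description of MAX_IDS_IN_FILENAME says that a checksum replaces the IDs once there are too many of them. The sketch below shows one way such a scheme could look. Everything specific in it is an assumption made for illustration: the limit of 4, the "-" separator, the use of MD5, and the name FileNameFromSetSketch. The real naming format is not documented on this page.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of building a bounded-length file name from a set of IDs. */
final class FileNameFromSetSketch {
    static final int MAX_IDS = 4; // assumed limit, for illustration only

    static <T extends Comparable<T>> String nameFor(Set<T> ids, String suffix) {
        List<T> sorted = new ArrayList<>(ids);
        Collections.sort(sorted); // sorting makes the name independent of iteration order
        StringBuilder joined = new StringBuilder();
        for (T id : sorted) {
            if (joined.length() > 0) {
                joined.append('-'); // assumed separator
            }
            joined.append(id);
        }
        if (sorted.size() <= MAX_IDS) {
            return joined + suffix;
        }
        return md5Hex(joined.toString()) + suffix; // fixed length, however many IDs there are
    }

    private static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

The Comparable bound on T is what allows the IDs to be sorted before they are joined or hashed, so two equal sets always produce the same name.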

sortCrawlLog

public static void sortCrawlLog(java.io.File file,
                                java.io.File toFile)
Sort a crawl.log file according to URL. This method depends on the Unix sort command.

Parameters:
file - The file containing the unsorted data.
toFile - The file that the sorted data can be put into.
Throws:
IOFailure - if there were errors running the sort process, or if the file does not exist.

sortCDX

public static void sortCDX(java.io.File file,
                           java.io.File toFile)
Sort a CDX file according to our standard for CDX file sorting. This method depends on the Unix sort command.

Parameters:
file - The raw unsorted CDX file.
toFile - The file that the result will be put into.
Throws:
IOFailure - If the file does not exist, or could not be sorted

createUniqueTempDir

public static java.io.File createUniqueTempDir(java.io.File inDir,
                                               java.lang.String prefix)
Creates a new temporary directory with a unique name. This directory will be deleted automatically at the end of the VM (though behaviour if there are files in it is undefined). This method will try a limited number of times to create a directory, using a randomly generated suffix, before giving up.

Parameters:
inDir - The directory where the temporary directory should be created.
prefix - The prefix of the directory name, for identification purposes.
Returns:
A newly created directory that no other call to createUniqueTempDir returns.
Throws:
ArgumentNotValid - if inDir is not an existing directory that can be written to.
IOFailure - if a free name couldn't be found within a reasonable number of tries.
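
The retry loop described above can be sketched as follows. This is an illustration only: UniqueTempDirSketch is a hypothetical name, the limit of 10 tries is an assumption, and standard JDK exceptions stand in for ArgumentNotValid and IOFailure.

```java
import java.io.File;
import java.util.Random;

/** Hypothetical sketch of creating a uniquely named directory with a bounded number of tries. */
final class UniqueTempDirSketch {
    static final int MAX_TRIES = 10; // assumed limit, for illustration only

    static File create(File inDir, String prefix) {
        if (!inDir.isDirectory() || !inDir.canWrite()) {
            throw new IllegalArgumentException("Not a writable directory: " + inDir);
        }
        Random random = new Random();
        for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
            File candidate = new File(inDir, prefix + Long.toHexString(random.nextLong()));
            // mkdir() returns false if the name is already taken, so success means it is ours.
            if (candidate.mkdir()) {
                candidate.deleteOnExit(); // only removes the directory if it is empty at exit
                return candidate;
            }
        }
        throw new IllegalStateException("No free name found after " + MAX_TRIES + " tries");
    }

    public static void main(String[] args) {
        File parent = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(create(parent, "job-"));
    }
}
```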

readLastLine

public static java.lang.String readLastLine(java.io.File file)
Read the last line in a file. Note this method is not UTF-8 safe.

Parameters:
file - input file to read last line from.
Returns:
The last line in the file (a trailing newline is ignored), or an empty string if the file is empty.
Throws:
ArgumentNotValid - on null argument, or file is not a readable file.
IOFailure - on IO trouble reading file.
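
A common way to read only the last line is to scan backwards from the end of the file one byte at a time, which also shows why such a method is not safe for multi-byte encodings. The sketch below is an assumption about the general technique, not the actual implementation; LastLineSketch is a hypothetical name, and bytes are decoded as ISO-8859-1 so that each byte maps to exactly one character.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of reading the last line by scanning backwards byte by byte. */
final class LastLineSketch {
    static String lastLine(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long end = raf.length();
            // Ignore one trailing line terminator ("\n" or "\r\n").
            if (end > 0 && byteAt(raf, end - 1) == '\n') {
                end--;
                if (end > 0 && byteAt(raf, end - 1) == '\r') {
                    end--;
                }
            }
            // Walk back to just after the previous newline, or to the start of the file.
            long start = end;
            while (start > 0 && byteAt(raf, start - 1) != '\n') {
                start--;
            }
            byte[] line = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(line);
            return new String(line, StandardCharsets.ISO_8859_1);
        }
    }

    private static int byteAt(RandomAccessFile raf, long position) throws IOException {
        raf.seek(position);
        return raf.read();
    }
}
```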

appendToFile

public static void appendToFile(java.io.File file,
                                java.lang.String... lines)
Append the given lines to a file. Each line is terminated by a newline.

Parameters:
file - A file to append to.
lines - The lines to write.

getResourceFileFromClassPath

public static java.io.File getResourceFileFromClassPath(java.lang.String filePath)
                                                 throws IOFailure
Loads a file from the class path (for retrieving a file from '.jar').

Parameters:
filePath - The path of the file.
Returns:
The file from the class path.
Throws:
IOFailure - If resource cannot be retrieved from the class path.