Class FileUtils

  • public class FileUtils
    extends Object
    Misc. handy file utilities.
    • Field Detail


        public static final String CDX_EXTENSION
        Extension used for CDX files, including the "." separator.
        See Also:
        Constant Field Values

        public static final String ARC_EXTENSION
        Extension used for ARC files, including the "." separator.
        See Also:
        Constant Field Values

        public static final String ARC_GZIPPED_EXTENSION
        Extension used for gzipped ARC files, including the "." separator.
        See Also:
        Constant Field Values

        public static final String WARC_EXTENSION
        Extension used for WARC files, including the "." separator.
        See Also:
        Constant Field Values

        public static final String WARC_GZIPPED_EXTENSION
        Extension used for gzipped WARC files, including the "." separator.
        See Also:
        Constant Field Values

        public static final String ARC_PATTERN
        Pattern matching ARC files, including separator. Note: (?i) means case insensitive, (\\.gz)? means .gz is optionally matched, and $ means matches end-of-line. Thus this pattern will match file.arc.gz, file.ARC, file.aRc.GZ, but not
        See Also:
        Constant Field Values

        public static final String OPEN_ARC_PATTERN
        Pattern matching open ARC files, including separator . Note: (?i) means case insensitive, (\\.gz)? means .gz is optionally matched, and $ means matches end-of-line. Thus this pattern will match,, file.arc.GZ.OpEn, but not
        See Also:
        Constant Field Values

        public static final String WARC_PATTERN
        Pattern matching WARC files, including separator. Note: (?i) means case insensitive, (\\.gz)? means .gz is optionally matched, and $ means matches end-of-line. Thus this pattern will match file.warc.gz, file.WARC, file.WaRc.GZ, but not
        See Also:
        Constant Field Values

        public static final String OPEN_WARC_PATTERN
        Pattern matching open WARC files, including separator . Note: (?i) means case insensitive, (\\.gz)? means .gz is optionally matched, and $ means matches end-of-line. Thus this pattern will match,, file.warc.GZ.OpEn, but not
        See Also:
        Constant Field Values

        public static final String WARC_ARC_PATTERN
        Pattern matching WARC and ARC files, including separator. Note: (?i) means case insensitive, (\\.gz)? means .gz is optionally matched, and $ means matches end-of-line. Thus this pattern will match file.warc.gz, file.WARC, file.WaRc.GZ, file.arc.gz, file.ARC, file.aRc.GZ but not or
        See Also:
        Constant Field Values
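        The notes on the pattern constants above can be tried out with java.util.regex. The following is a minimal sketch; the two pattern strings are assumptions reconstructed from the description ((?i), an optional \\.gz group, and a trailing $), not the actual constant values, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ArchivePatternSketch {
    // Assumed pattern strings; the real values are listed under "Constant Field Values".
    static final String ARC_PATTERN_SKETCH = "(?i)\\.arc(\\.gz)?$";
    static final String WARC_PATTERN_SKETCH = "(?i)\\.warc(\\.gz)?$";

    /** Returns true if the pattern is found in the name; the trailing $ anchors it to the end. */
    static boolean matches(String pattern, String filename) {
        return Pattern.compile(pattern).matcher(filename).find();
    }

    public static void main(String[] args) {
        String[] names = {"file.arc.gz", "file.ARC", "file.aRc.GZ", "file.arc.gz.open", "file.warc"};
        for (String name : names) {
            System.out.println(name + " -> " + matches(ARC_PATTERN_SKETCH, name));
        }
    }
}
```

        Note that find() is used rather than String.matches(), because the patterns describe only the end of the file name, not the whole name.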

        public static final FilenameFilter CDX_FILE_FILTER
        A FilenameFilter accepting a file if and only if its name (transformed to lower case) ends with ".cdx".

        public static final FilenameFilter OPEN_ARCS_FILTER
        A filter that matches files left open by a crashed Heritrix process. Don't work on these files while Heritrix is still working on them.

        public static final FilenameFilter OPEN_WARCS_FILTER
        A filter that matches WARC files left open by a crashed Heritrix process. Don't work on these files while Heritrix is still working on them.

        public static final FilenameFilter ARCS_FILTER
        A filter that matches ARC files, that is, any file whose name ends with .arc or .arc.gz, ignoring case.

        public static final FilenameFilter WARCS_FILTER
        A filter that matches WARC files, that is, any file whose name ends with .warc or .warc.gz, ignoring case.

        public static final FilenameFilter WARCS_ARCS_FILTER
        A filter that matches WARC and ARC files, that is, any file whose name ends with .warc, .warc.gz, .arc or .arc.gz, ignoring case.
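        A case-insensitive extension filter of the kind described above can be sketched with a plain FilenameFilter. This is an illustration only: the class, method and field names are hypothetical, and the real filters may well be built from the pattern constants instead.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

class ArchiveFilterSketch {
    /** Accepts a name if, lower-cased, it ends with one of the given lower-case extensions. */
    static FilenameFilter endsWithAnyIgnoreCase(String... extensions) {
        return (dir, name) -> {
            String lower = name.toLowerCase(Locale.ROOT);
            for (String extension : extensions) {
                if (lower.endsWith(extension)) {
                    return true;
                }
            }
            return false;
        };
    }

    // Hypothetical stand-ins for CDX_FILE_FILTER and WARCS_ARCS_FILTER.
    static final FilenameFilter CDX_SKETCH = endsWithAnyIgnoreCase(".cdx");
    static final FilenameFilter WARCS_ARCS_SKETCH =
            endsWithAnyIgnoreCase(".warc", ".warc.gz", ".arc", ".arc.gz");

    public static void main(String[] args) {
        // List the archive files in the current directory; listFiles returns null on error.
        File[] found = new File(".").listFiles(WARCS_ARCS_SKETCH);
        System.out.println(found == null ? 0 : found.length);
    }
}
```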

        public static final int MAX_IDS_IN_FILENAME
        Maximum number of IDs we will put in a filename. Above this number, a checksum of the IDs is generated instead. This protects us from generating filenames that are too long for the filesystem.
        See Also:
        Constant Field Values
    • Constructor Detail

      • FileUtils

        public FileUtils()
    • Method Detail

      • removeRecursively

        public static boolean removeRecursively​(File f)
        Remove a file and any subfiles in case of directories.
        f - A file to completely and utterly remove.
        true if the file did exist, false otherwise.
        SecurityException - If a security manager exists and its SecurityManager.checkDelete(java.lang.String) method denies delete access to the file
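        The contract of removeRecursively (delete children first, report whether the file existed) can be sketched with java.io.File alone. This is not the library's implementation; the class name and the demo helper are hypothetical, and the sketch ignores the result of each delete() call.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class RemoveRecursivelySketch {
    /** Deletes f and everything below it; returns true if f existed. */
    static boolean removeRecursively(File f) {
        if (!f.exists()) {
            return false;
        }
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                removeRecursively(child);
            }
        }
        f.delete(); // result ignored in this sketch
        return true;
    }

    /** Builds a small tree and removes it twice: {first result, second result, still exists}. */
    static boolean[] demo() {
        try {
            Path root = Files.createTempDirectory("remove-sketch");
            Files.createFile(Files.createDirectory(root.resolve("sub")).resolve("data.txt"));
            boolean first = removeRecursively(root.toFile());
            boolean second = removeRecursively(root.toFile());
            return new boolean[] {first, second, root.toFile().exists()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```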
      • formatFilename

        public static String formatFilename​(String filename)
        Returns a valid filename for most filesystems. Exchanges the following characters:

        " " -> "_"
        ":" -> "_"
        "+" -> "_"

        filename - the filename to format correctly
        a new formatted filename
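        The three substitutions listed for formatFilename amount to a chain of character replacements. A minimal sketch, with a hypothetical class name and sample input:

```java
class FormatFilenameSketch {
    /** Replaces space, colon and plus with an underscore, as listed above. */
    static String formatFilename(String filename) {
        return filename.replace(' ', '_').replace(':', '_').replace('+', '_');
    }

    public static void main(String[] args) {
        // prints "crawl_log_2010_01_02.txt"
        System.out.println(formatFilename("crawl log 2010:01+02.txt"));
    }
}
```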
      • getFilesRecursively

        public static List<File> getFilesRecursively​(String dir,
                                                     List<File> files,
                                                     String type)
        Retrieves all files whose names end with 'type' from directory 'dir' and all its subdirectories.
        dir - Path of base directory
        files - Initially, an empty list (e.g. an ArrayList)
        type - The extension/ending of the files to retrieve (e.g. ".xml", ".ARC")
        A list of files from directory 'dir' and all its subdirectories
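        The accumulator-style signature of getFilesRecursively suggests a simple recursive walk. The sketch below assumes a case-sensitive suffix match and silently skips unreadable directories; neither behaviour is confirmed by the description above, and the class name and demo helper are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class GetFilesRecursivelySketch {
    /** Adds every file below dir whose name ends with type to files, and returns files. */
    static List<File> getFilesRecursively(String dir, List<File> files, String type) {
        File[] entries = new File(dir).listFiles();
        if (entries == null) {
            return files; // dir is not a directory or cannot be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                getFilesRecursively(entry.getPath(), files, type);
            } else if (entry.getName().endsWith(type)) { // assumed case-sensitive
                files.add(entry);
            }
        }
        return files;
    }

    /** Creates a.xml, sub/b.xml and sub/c.txt, then counts the .xml files found. */
    static int demo() {
        try {
            Path root = Files.createTempDirectory("files-sketch");
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.createFile(root.resolve("a.xml"));
            Files.createFile(sub.resolve("b.xml"));
            Files.createFile(sub.resolve("c.txt"));
            return getFilesRecursively(root.toString(), new ArrayList<>(), ".xml").size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```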
      • readFile

        public static String readFile​(File file)
                               throws IOException
        Load file content into text string.
        file - The file to load
        file content loaded into text string
        IOException - If any IO trouble occurs while reading the file, or the file cannot be found.
      • copyFile

        public static void copyFile​(File from,
                                    File to)
        Copy file from one location to another. Will silently overwrite an already existing file.
        from - original to copy
        to - destination of copy
        IOFailure - if an io error occurs while copying file, or the original file does not exist.
      • copyDirectory

        public static void copyDirectory​(File from,
                                         File to)
                                  throws IOFailure
        Copy an entire directory from one location to another. Note that this will silently overwrite old files, just like copyFile().
        from - Original directory (or file, for that matter) to copy.
        to - Destination directory, i.e. the 'new name' of the copy of the from directory.
        IOFailure - On IO trouble copying files.
      • readBinaryFile

        public static byte[] readBinaryFile​(File file)
                                     throws IOFailure,
                                            IndexOutOfBoundsException
        Read an entire file, byte by byte, into a byte array, ignoring any locale issues.
        file - A file to be read.
        A byte array with the contents of the file.
        IOFailure - on IO trouble reading the file, or the file does not exist
        IndexOutOfBoundsException - If the file is too large to be in an array.
      • writeBinaryFile

        public static void writeBinaryFile​(File file,
                                           byte[] b)
        Write an entire byte array to a file, ignoring any locale issues.
        file - The file to write the data to
        b - The byte array to write to the file
        IOFailure - If an exception occurs during the writing.
      • getXmlFilesFilter

        public static FilenameFilter getXmlFilesFilter()
        Return a filter that only accepts XML files (ending with .xml), irrespective of their location.
        A new filter for XML files.
      • readListFromFile

        public static List<String> readListFromFile​(File file)
        Read all lines from a file into a list of strings.
        file - The file to read from.
        The list of lines.
        IOFailure - on trouble reading the file, or if the file does not exist
      • writeCollectionToFile

        public static void writeCollectionToFile​(File file,
                                                 Collection<String> collection)
        Writes a collection of strings to a file, each string on one line.
        file - A file to write to. The contents of this file will be overwritten.
        collection - The collection to write. The order it will be written in is unspecified.
        IOFailure - if any error occurs writing to the file.
        ArgumentNotValid - if file or collection is null.
      • makeSortedFile

        public static void makeSortedFile​(File unsortedFile,
                                          File sortedOutput)
        Sort a file into another. The current implementation slurps all lines into memory. This will not scale forever.
        unsortedFile - A file to sort
        sortedOutput - The file to sort into
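        The "slurp all lines into memory" approach mentioned for makeSortedFile can be sketched as follows. The sort order (natural String order) and the UTF-8 encoding are assumptions, UncheckedIOException stands in for the library's own exception type, and the class name and demo helper are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MakeSortedFileSketch {
    /** Reads every line into memory, sorts the lines, and writes them to sortedOutput. */
    static void makeSortedFile(File unsortedFile, File sortedOutput) {
        try {
            List<String> lines =
                    new ArrayList<>(Files.readAllLines(unsortedFile.toPath(), StandardCharsets.UTF_8));
            Collections.sort(lines); // assumed ordering: natural String order
            Files.write(sortedOutput.toPath(), lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sorts a three-line temporary file and returns the lines of the result. */
    static List<String> demo() {
        try {
            File in = File.createTempFile("unsorted", ".txt");
            File out = File.createTempFile("sorted", ".txt");
            Files.write(in.toPath(), Arrays.asList("pear", "apple", "fig"), StandardCharsets.UTF_8);
            makeSortedFile(in, out);
            return Files.readAllLines(out.toPath(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

        Memory use grows with the size of the input, which is why this approach does not scale to very large files.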
      • removeLineFromFile

        public static void removeLineFromFile​(String line,
                                              File file)
        Remove a line from a given file.
        line - The full line to remove
        file - The file to remove the line from. This file will be rewritten in full, and the entire contents will be kept in memory
        UnknownID - If the file does not exist
      • createDir

        public static boolean createDir​(File dir)
                                 throws PermissionDenied
        Check if the directory exists, and create it if needed. The complete path down to the directory is created. If directory creation fails, a PermissionDenied exception is thrown. If the directory is not writable, a warning is logged.
        dir - The directory to create
        true if dir created.
        ArgumentNotValid - If dir is null or its name is the empty string
        PermissionDenied - If directory cannot be created for any reason
      • getBytesFree

        public static long getBytesFree​(File f)
        Returns the number of bytes free on the file system. The value is obtained from the FreeSpaceProvider class named by the setting CommonSettings.FREESPACE_PROVIDER_CLASS (a.k.a. settings.common.freespaceprovider.class).
        f - a given file
        the number of bytes free, as reported by the provider configured in settings.xml
      • relativeTo

        public static String relativeTo​(File theFile,
                                        File theDir)
        theFile - A file to make relative
        theDir - A directory
        the file path of theFile relative to theDir; null if theFile is not relative to theDir, or if theDir is not a directory.
      • countLines

        public static long countLines​(File file)
        Count the number of lines in a file.
        file - the file to read
        the number of lines in the file
        IOFailure - If an error occurred while reading the file
      • getEphemeralInputStream

        public static InputStream getEphemeralInputStream​(File file)
        Create an InputStream that reads from a file but removes the file when all data has been read.
        file - A file to read. This file will be deleted when the input stream is closed, finalized, reaches end-of-file, or when the VM closes.
        An InputStream containing the file's contents.
        IOFailure - If an error occurs in creating the ephemeral input stream
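        One way to get a stream that removes its file is to wrap a FileInputStream and delete the file in close(). The sketch below covers only the close and VM-exit cases (not finalization or end-of-file), uses UncheckedIOException as a stand-in for the library's exception type, and uses hypothetical class and helper names.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class EphemeralInputStreamSketch {
    /** Returns a stream over file that deletes the file when the stream is closed. */
    static InputStream getEphemeralInputStream(File file) {
        try {
            file.deleteOnExit(); // fallback if the stream is never closed
            return new FilterInputStream(new FileInputStream(file)) {
                @Override
                public void close() throws IOException {
                    try {
                        super.close();
                    } finally {
                        file.delete();
                    }
                }
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a temporary file through the stream; returns "<content>:<file still exists>". */
    static String demo() {
        try {
            File f = File.createTempFile("ephemeral", ".txt");
            Files.write(f.toPath(), "hello".getBytes(StandardCharsets.UTF_8));
            String content;
            try (InputStream in = getEphemeralInputStream(f)) {
                content = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
            return content + ":" + f.exists();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```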
      • makeValidFileFromExisting

        public static File makeValidFileFromExisting​(String filename)
                                              throws IOFailure
        Makes a valid file from filename passed in String. Ensures that the File object returned is not null, and that isFile() returns true.
        filename - The file to create the File object from
        A valid, non-null File object.
        IOFailure - if file cannot be created.
      • writeFileToStream

        public static void writeFileToStream​(File f,
                                             OutputStream out)
        Write the entire contents of a file to a stream.
        f - A file to write to the stream.
        out - The stream to write to.
        IOFailure - If any error occurs while writing the file to a stream
      • writeStreamToFile

        public static void writeStreamToFile​(InputStream in,
                                             File f)
        Write the contents of a stream into a file.
        in - A stream to read from. This stream is not closed by this method.
        f - The file to write the stream contents into.
        IOFailure - If any error occurs while writing the stream to a file
      • getTempDir

        public static File getTempDir()
        Get the location of the standard temporary directory. The existence of this directory should be ensured at the start of every application.
        The directory that should be used for temporary files.
      • moveFile

        public static void moveFile​(File fromFile,
                                    File toFile)
        Attempt to move a file using rename, and if that fails, move the file by copy-and-delete.
        fromFile - The source
        toFile - The target
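        The rename-then-fallback strategy described for moveFile can be sketched as below. Overwriting an existing target during the fallback copy is an assumption, UncheckedIOException stands in for the library's exception type, and the class name and demo helper are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class MoveFileSketch {
    /** Tries a cheap rename first; if that fails, copies the file and deletes the source. */
    static void moveFile(File fromFile, File toFile) {
        if (fromFile.renameTo(toFile)) {
            return;
        }
        try {
            // Typical reason for reaching this point: source and target are on different file systems.
            Files.copy(fromFile.toPath(), toFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
            Files.delete(fromFile.toPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Moves a temporary file; returns "<content of target>:<source still exists>". */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("move-sketch");
            File from = dir.resolve("from.txt").toFile();
            File to = dir.resolve("to.txt").toFile();
            Files.write(from.toPath(), "payload".getBytes(StandardCharsets.UTF_8));
            moveFile(from, to);
            return new String(Files.readAllBytes(to.toPath()), StandardCharsets.UTF_8) + ":" + from.exists();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```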
      • generateFileNameFromSet

        public static <T extends Comparable<T>> String generateFileNameFromSet​(Set<T> IDs,
                                                                               String suffix)
        Given a set, generate a reasonable file name from the set.
        Type Parameters:
        T - The type of objects, that the Set IDs argument contains.
        IDs - A set of IDs.
        suffix - A suffix. May be empty string.
        A reasonable file name.
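        Together with MAX_IDS_IN_FILENAME, the description of generateFileNameFromSet suggests a scheme where a few IDs are written out and many IDs are replaced by a checksum. The sketch below is one possible reading: the "-" separator, the MD5 checksum, and the extra maxIds parameter are all assumptions made for illustration, as are the class and helper names.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class FileNameFromSetSketch {
    /** Joins the sorted IDs, or a checksum of them when there are more than maxIds, with suffix. */
    static <T extends Comparable<T>> String generateFileNameFromSet(Set<T> ids, String suffix, int maxIds) {
        List<T> sorted = new ArrayList<>(ids);
        Collections.sort(sorted); // a stable order gives the same name for the same set
        String joined = sorted.stream().map(Object::toString).collect(Collectors.joining("-"));
        if (sorted.size() > maxIds) {
            joined = md5Hex(joined); // fixed length regardless of the number of IDs
        }
        return joined + suffix;
    }

    /** Hypothetical helper: MD5 of the UTF-8 bytes of s as 32 lower-case hex digits. */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    public static void main(String[] args) {
        Set<Integer> ids = new TreeSet<Integer>(Arrays.asList(3, 1, 2));
        System.out.println(generateFileNameFromSet(ids, ".cdx", 4));
        System.out.println(generateFileNameFromSet(ids, ".cdx", 2));
    }
}
```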
      • sortCrawlLog

        public static void sortCrawlLog​(File file,
                                        File toFile)
        Sort a crawl.log file according to the url.
        file - The file containing the unsorted data.
        toFile - The file that the sorted data can be put into.
        IOFailure - if there were errors running the sort process, or if the file does not exist.
      • sortCrawlLogOnTimestamp

        public static void sortCrawlLogOnTimestamp​(File file,
                                                   File toFile)
        Sort a crawl.log file according to the timestamp.
        file - The file containing the unsorted data.
        toFile - The file that the sorted data can be put into.
        IOFailure - if there were errors running the sort process, or if the file does not exist.
      • sortCDX

        public static void sortCDX​(File file,
                                   File toFile)
        Sort a CDX file according to our standard for CDX file sorting. This method depends on the Unix sort command.
        file - The raw unsorted CDX file.
        toFile - The file that the result will be put into.
        IOFailure - If the file does not exist, or could not be sorted
      • sortFile

        public static void sortFile​(File file,
                                    File toFile)
        Sort a file using UNIX sort.
        file - the file that you want to sort.
        toFile - The destination file.
      • createUniqueTempDir

        public static File createUniqueTempDir​(File inDir,
                                               String prefix)
        Creates a new temporary directory with a unique name. This directory will be deleted automatically at the end of the VM (though behaviour if there are files in it is undefined). This method will try a limited number of times to create a directory, using a randomly generated suffix, before giving up.
        inDir - The directory where the temporary directory should be created.
        prefix - The prefix of the directory name, for identification purposes.
        A newly created directory that no other call to createUniqueTempDir returns.
        ArgumentNotValid - if inDir is not an existing directory that can be written to.
        IOFailure - if a free name couldn't be found within a reasonable number of tries.
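        The retry loop described for createUniqueTempDir can be sketched with File.mkdir(), which returns false when the name is already taken. The retry limit, the hexadecimal suffix format, and the standard exceptions used in place of ArgumentNotValid and IOFailure are assumptions for illustration; the class name is hypothetical.

```java
import java.io.File;
import java.util.Random;

class UniqueTempDirSketch {
    private static final int MAX_TRIES = 10; // hypothetical limit
    private static final Random RANDOM = new Random();

    /** Creates a fresh directory named prefix plus a random suffix inside inDir. */
    static File createUniqueTempDir(File inDir, String prefix) {
        if (!inDir.isDirectory() || !inDir.canWrite()) {
            throw new IllegalArgumentException("Not a writable directory: " + inDir);
        }
        for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
            File candidate = new File(inDir, prefix + Long.toHexString(RANDOM.nextLong()));
            if (candidate.mkdir()) { // false if the directory could not be created, e.g. name in use
                candidate.deleteOnExit(); // only removes the directory if it is empty at exit
                return candidate;
            }
        }
        throw new IllegalStateException("No free name found after " + MAX_TRIES + " tries");
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(createUniqueTempDir(base, "sketch-"));
    }
}
```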
      • readLastLine

        public static String readLastLine​(File file)
        Read the last line in a file. Note this method is not UTF-8 safe.
        file - input file to read last line from.
        The last line in the file (ending newline is irrelevant), returns an empty string if file is empty.
        ArgumentNotValid - on null argument, or file is not a readable file.
        IOFailure - on IO trouble reading file.
      • appendToFile

        public static void appendToFile​(File file,
                                        String... lines)
        Append the given lines to a file. Each line is terminated by a newline.
        file - A file to append to.
        lines - The lines to write.
      • getResourceFileFromClassPath

        public static File getResourceFileFromClassPath​(String filePath)
                                                 throws IOFailure
        Loads a file from the class path (for retrieving a file from a .jar file).
        filePath - The path of the file.
        The file from the class path.
        IOFailure - If resource cannot be retrieved from the class path.
      • getHumanReadableFileSize

        public static String getHumanReadableFileSize​(File aFile)
        Get a human-readable representation of the file size. If the file is a directory, the size is the sum of the sizes of the files in the directory; subdirectories are ignored. The number is given with 2 decimals.
        aFile - a File object
        a human-readable representation of the file size (rounded)
      • hasFiles

        public static boolean hasFiles​(File aDir)
        aDir - A directory
        true, if the given directory contains files; else returns false