Class FileUtils


  • public class FileUtils
    extends Object
    Misc. handy file utilities.
    • Field Detail

      • CDX_EXTENSION

        public static final String CDX_EXTENSION
        Extension used for CDX files, including the "." separator.
        See Also:
        Constant Field Values
      • ARC_EXTENSION

        public static final String ARC_EXTENSION
        Extension used for ARC files, including the "." separator.
        See Also:
        Constant Field Values
      • ARC_GZIPPED_EXTENSION

        public static final String ARC_GZIPPED_EXTENSION
        Extension used for gzipped ARC files, including the "." separator.
        See Also:
        Constant Field Values
      • WARC_EXTENSION

        public static final String WARC_EXTENSION
        Extension used for WARC files, including the "." separator.
        See Also:
        Constant Field Values
      • WARC_GZIPPED_EXTENSION

        public static final String WARC_GZIPPED_EXTENSION
        Extension used for gzipped WARC files, including the "." separator.
        See Also:
        Constant Field Values
      • ARC_PATTERN

        public static final String ARC_PATTERN
        Pattern matching ARC files, including the separator. Note: (?i) makes the match case-insensitive, (\\.gz)? optionally matches .gz, and $ matches at the end of the line. Thus this pattern matches file.arc.gz, file.ARC and file.aRc.GZ, but not file.ARC.open.
        See Also:
        Constant Field Values
      • OPEN_ARC_PATTERN

        public static final String OPEN_ARC_PATTERN
        Pattern matching open ARC files, including the separator. Note: (?i) makes the match case-insensitive, (\\.gz)? optionally matches .gz, and $ matches at the end of the line. Thus this pattern matches file.arc.gz.open, file.ARC.open and file.arc.GZ.OpEn, but not file.ARC.open.txt.
        See Also:
        Constant Field Values
      • WARC_PATTERN

        public static final String WARC_PATTERN
        Pattern matching WARC files, including the separator. Note: (?i) makes the match case-insensitive, (\\.gz)? optionally matches .gz, and $ matches at the end of the line. Thus this pattern matches file.warc.gz, file.WARC and file.WaRc.GZ, but not file.WARC.open.
        See Also:
        Constant Field Values
      • OPEN_WARC_PATTERN

        public static final String OPEN_WARC_PATTERN
        Pattern matching open WARC files, including the separator. Note: (?i) makes the match case-insensitive, (\\.gz)? optionally matches .gz, and $ matches at the end of the line. Thus this pattern matches file.warc.gz.open, file.WARC.open and file.warc.GZ.OpEn, but not file.wARC.open.txt.
        See Also:
        Constant Field Values
      • WARC_ARC_PATTERN

        public static final String WARC_ARC_PATTERN
        Pattern matching WARC and ARC files, including the separator. Note: (?i) makes the match case-insensitive, (\\.gz)? optionally matches .gz, and $ matches at the end of the line. Thus this pattern matches file.warc.gz, file.WARC, file.WaRc.GZ, file.arc.gz, file.ARC and file.aRc.GZ, but not file.WARC.open or file.ARC.open.
        See Also:
        Constant Field Values
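        The regex elements named in the pattern descriptions above can be tried out in isolation. The sketch below is only an illustration: the three regexes are assumptions assembled from those elements, not the actual constant values, and the class name ArchivePatternSketch is hypothetical.

```java
import java.util.regex.Pattern;

final class ArchivePatternSketch {
    // Assumed regexes, built from the elements the field descriptions mention.
    static final Pattern ARC = Pattern.compile("(?i).*\\.arc(\\.gz)?$");
    static final Pattern OPEN_ARC = Pattern.compile("(?i).*\\.arc(\\.gz)?\\.open$");
    static final Pattern WARC_OR_ARC = Pattern.compile("(?i).*\\.w?arc(\\.gz)?$");

    static boolean isArc(String name) {
        return ARC.matcher(name).matches();
    }

    static boolean isOpenArc(String name) {
        return OPEN_ARC.matcher(name).matches();
    }

    static boolean isWarcOrArc(String name) {
        return WARC_OR_ARC.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {"file.arc.gz", "file.ARC", "file.ARC.open", "file.WaRc.GZ"};
        for (String name : names) {
            System.out.println(name + " arc=" + isArc(name)
                    + " openArc=" + isOpenArc(name)
                    + " warcOrArc=" + isWarcOrArc(name));
        }
    }
}
```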
      • CDX_FILE_FILTER

        public static final FilenameFilter CDX_FILE_FILTER
        A FilenameFilter that accepts a file if and only if its name, converted to lower case, ends with ".cdx".
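        A filter with the behaviour described for CDX_FILE_FILTER could look roughly like the following. This is a stand-alone sketch using only the JDK; the class name CdxFilterSketch and the constant CDX are hypothetical, and the real filter may be implemented differently.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

final class CdxFilterSketch {
    // Accepts a name if its lower-cased form ends with ".cdx".
    static final FilenameFilter CDX =
            (dir, name) -> name.toLowerCase(Locale.ROOT).endsWith(".cdx");

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list(CDX); // null if dir is not a readable directory
        if (names != null) {
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}
```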
      • OPEN_ARCS_FILTER

        public static final FilenameFilter OPEN_ARCS_FILTER
        A filter that matches files left open by a crashed Heritrix process. Don't work on these files while Heritrix is still working on them.
      • OPEN_WARCS_FILTER

        public static final FilenameFilter OPEN_WARCS_FILTER
        A filter that matches WARC files left open by a crashed Heritrix process. Don't work on these files while Heritrix is still working on them.
      • ARCS_FILTER

        public static final FilenameFilter ARCS_FILTER
        A filter that matches ARC files, that is, any file whose name ends with .arc or .arc.gz, regardless of case.
      • WARCS_FILTER

        public static final FilenameFilter WARCS_FILTER
        A filter that matches WARC files, that is, any file whose name ends with .warc or .warc.gz, regardless of case.
      • WARCS_ARCS_FILTER

        public static final FilenameFilter WARCS_ARCS_FILTER
        A filter that matches WARC and ARC files, that is, any file whose name ends with .warc, .warc.gz, .arc or .arc.gz, regardless of case.
      • MAX_IDS_IN_FILENAME

        public static final int MAX_IDS_IN_FILENAME
        Maximum number of IDs we will put in a filename. Above this number, a checksum of the ids is generated instead. This is done to protect us from getting filenames too long for the filesystem.
        See Also:
        Constant Field Values
    • Constructor Detail

      • FileUtils

        public FileUtils()
    • Method Detail

      • removeRecursively

        public static boolean removeRecursively​(File f)
        Remove a file or, if it is a directory, the directory and everything below it.
        Parameters:
        f - A file to completely and utterly remove.
        Returns:
        true if the file did exist, false otherwise.
        Throws:
        SecurityException - If a security manager exists and its SecurityManager.checkDelete(java.lang.String) method denies delete access to the file
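        The recursive removal described above can be sketched with plain java.io.File calls. This is an illustrative stand-in, not the library's implementation; the class name RemoveRecursivelySketch is hypothetical, and the sketch does not treat symbolic links specially.

```java
import java.io.File;

final class RemoveRecursivelySketch {
    /** Returns true if f existed before the call, false otherwise. */
    static boolean removeRecursively(File f) {
        if (!f.exists()) {
            return false;
        }
        File[] children = f.listFiles(); // null when f is a regular file
        if (children != null) {
            for (File child : children) {
                removeRecursively(child);
            }
        }
        f.delete(); // the directory is empty by now, or f is a plain file
        return true;
    }
}
```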
      • formatFilename

        public static String formatFilename​(String filename)
        Returns a valid filename for most filesystems. Exchanges the following characters:

        " " -> "_"
        ":" -> "_"
        "+" -> "_"

        Parameters:
        filename - the filename to format correctly
        Returns:
        a new formatted filename
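        The three substitutions listed for formatFilename amount to a chain of character replacements. A minimal sketch, assuming nothing beyond those three mappings (the class name FormatFilenameSketch is hypothetical):

```java
final class FormatFilenameSketch {
    // Replaces space, colon and plus with an underscore.
    static String formatFilename(String filename) {
        return filename.replace(' ', '_').replace(':', '_').replace('+', '_');
    }

    public static void main(String[] args) {
        // prints crawl_log_2024_01_02.txt
        System.out.println(formatFilename("crawl log 2024:01+02.txt"));
    }
}
```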
      • getFilesRecursively

        public static List<File> getFilesRecursively​(String dir,
                                                     List<File> files,
                                                     String type)
        Retrieves all files whose names end with 'type' from directory 'dir' and all its subdirectories.
        Parameters:
        dir - Path of base directory
        files - Initially, an empty list (e.g. an ArrayList)
        type - The extension/ending of the files to retrieve (e.g. ".xml", ".ARC")
        Returns:
        A list of files from directory 'dir' and all its subdirectories
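        The accumulator-style signature of getFilesRecursively suggests a recursive directory walk that appends to the list passed in. The sketch below follows that shape; it assumes a case-sensitive suffix match, which the description does not state, and the class name is hypothetical.

```java
import java.io.File;
import java.util.List;

final class GetFilesRecursivelySketch {
    static List<File> getFilesRecursively(String dir, List<File> files, String type) {
        File[] entries = new File(dir).listFiles();
        if (entries == null) {
            return files; // dir is not a directory or cannot be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                getFilesRecursively(entry.getPath(), files, type);
            } else if (entry.getName().endsWith(type)) { // assumed case-sensitive
                files.add(entry);
            }
        }
        return files;
    }
}
```

        A caller would typically pass a fresh ArrayList as the files argument and use the returned list.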
      • readFile

        public static String readFile​(File file)
                               throws IOException
        Load file content into text string.
        Parameters:
        file - The file to load
        Returns:
        file content loaded into text string
        Throws:
        IOException - If any IO trouble occurs while reading the file, or the file cannot be found.
      • copyFile

        public static void copyFile​(File from,
                                    File to)
        Copy file from one location to another. Will silently overwrite an already existing file.
        Parameters:
        from - original to copy
        to - destination of copy
        Throws:
        IOFailure - if an io error occurs while copying file, or the original file does not exist.
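        Copy-with-overwrite semantics like those described for copyFile can be reproduced with java.nio.file.Files. In this sketch UncheckedIOException stands in for the IOFailure exception named above, and the class name is hypothetical; the actual method may copy the data differently.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

final class CopyFileSketch {
    static void copyFile(File from, File to) {
        try {
            // REPLACE_EXISTING silently overwrites an existing target file.
            Files.copy(from.toPath(), to.toPath(), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Also reached when the source file does not exist.
            throw new UncheckedIOException("Could not copy " + from + " to " + to, e);
        }
    }
}
```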
      • copyDirectory

        public static void copyDirectory​(File from,
                                         File to)
                                  throws IOFailure
        Copy an entire directory from one location to another. Note that this will silently overwrite old files, just like copyFile().
        Parameters:
        from - Original directory (or file, for that matter) to copy.
        to - Destination directory, i.e. the 'new name' of the copy of the from directory.
        Throws:
        IOFailure - On IO trouble copying files.
      • readBinaryFile

        public static byte[] readBinaryFile​(File file)
                                     throws IOFailure,
                                            IndexOutOfBoundsException
        Read an entire file, byte by byte, into a byte array, ignoring any locale issues.
        Parameters:
        file - A file to be read.
        Returns:
        A byte array with the contents of the file.
        Throws:
        IOFailure - on IO trouble reading the file, or the file does not exist
        IndexOutOfBoundsException - If the file is too large to be in an array.
      • writeBinaryFile

        public static void writeBinaryFile​(File file,
                                           byte[] b)
        Write an entire byte array to a file, ignoring any locale issues.
        Parameters:
        file - The file to write the data to
        b - The byte array to write to the file
        Throws:
        IOFailure - If an exception occurs during the writing.
      • getXmlFilesFilter

        public static FilenameFilter getXmlFilesFilter()
        Return a filter that only accepts XML files (ending with .xml), irrespective of their location.
        Returns:
        A new filter for XML files.
      • readListFromFile

        public static List<String> readListFromFile​(File file)
        Read all lines from a file into a list of strings.
        Parameters:
        file - The file to read from.
        Returns:
        The list of lines.
        Throws:
        IOFailure - on trouble reading the file, or if the file does not exist
      • writeCollectionToFile

        public static void writeCollectionToFile​(File file,
                                                 Collection<String> collection)
        Writes a collection of strings to a file, each string on one line.
        Parameters:
        file - A file to write to. The contents of this file will be overwritten.
        collection - The collection to write. The order it will be written in is unspecified.
        Throws:
        IOFailure - if any error occurs writing to the file.
        ArgumentNotValid - if file or collection is null.
      • makeSortedFile

        public static void makeSortedFile​(File unsortedFile,
                                          File sortedOutput)
        Sort a file into another. The current implementation slurps all lines into memory. This will not scale forever.
        Parameters:
        unsortedFile - A file to sort
        sortedOutput - The file to sort into
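        The in-memory approach mentioned for makeSortedFile can be illustrated as follows. The sketch assumes UTF-8 text and natural String ordering, neither of which is stated above; the class name is hypothetical and UncheckedIOException is used as a generic failure type.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MakeSortedFileSketch {
    static void makeSortedFile(File unsortedFile, File sortedOutput) {
        try {
            // All lines are held in memory at once, so this only suits moderate file sizes.
            List<String> lines = new ArrayList<>(
                    Files.readAllLines(unsortedFile.toPath(), StandardCharsets.UTF_8));
            Collections.sort(lines); // assumed ordering: natural String order
            Files.write(sortedOutput.toPath(), lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```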
      • removeLineFromFile

        public static void removeLineFromFile​(String line,
                                              File file)
        Remove a line from a given file.
        Parameters:
        line - The full line to remove
        file - The file to remove the line from. This file will be rewritten in full, and the entire contents will be kept in memory
        Throws:
        UnknownID - If the file does not exist
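        Removing a line by rewriting the whole file, as described above, might be sketched like this. The sketch assumes UTF-8 text and removes every line equal to the given one (the description does not say whether only the first match is removed); IllegalArgumentException stands in for UnknownID, and the class name is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

final class RemoveLineSketch {
    static void removeLineFromFile(String line, File file) {
        if (!file.exists()) {
            throw new IllegalArgumentException("No such file: " + file); // stand-in for UnknownID
        }
        try {
            List<String> kept = new ArrayList<>();
            for (String current : Files.readAllLines(file.toPath(), StandardCharsets.UTF_8)) {
                if (!current.equals(line)) { // only full-line matches are dropped
                    kept.add(current);
                }
            }
            Files.write(file.toPath(), kept, StandardCharsets.UTF_8); // rewrite in full
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```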
      • createDir

        public static boolean createDir​(File dir)
                                 throws PermissionDenied
        Check if the directory exists, and create it if needed. The complete path down to the directory is created. If the directory creation fails, a PermissionDenied exception is thrown. If the directory is not writable, a warning is logged.
        Parameters:
        dir - The directory to create
        Returns:
        true if dir created.
        Throws:
        ArgumentNotValid - If dir is null or its name is the empty string
        PermissionDenied - If directory cannot be created for any reason
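        The check-then-create contract of createDir can be approximated with File.mkdirs. In this sketch IllegalArgumentException and IllegalStateException stand in for ArgumentNotValid and PermissionDenied, the warning goes to System.err instead of a logger, and the class name is hypothetical.

```java
import java.io.File;

final class CreateDirSketch {
    /** Returns true only if this call created the directory. */
    static boolean createDir(File dir) {
        if (dir == null || dir.getName().isEmpty()) {
            throw new IllegalArgumentException("dir must be non-null and non-empty");
        }
        boolean created = false;
        if (!dir.isDirectory()) {
            if (!dir.mkdirs()) { // creates missing parent directories too
                throw new IllegalStateException("Could not create " + dir);
            }
            created = true;
        }
        if (!dir.canWrite()) {
            System.err.println("Warning: directory is not writable: " + dir);
        }
        return created;
    }
}
```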
      • getBytesFree

        public static long getBytesFree​(File f)
        Returns the number of bytes free on the file system by calling the FreeSpaceProvider class defined by the setting CommonSettings.FREESPACE_PROVIDER_CLASS (a.k.a. settings.common.freespaceprovider.class).
        Parameters:
        f - a given file
        Returns:
        the number of bytes free, as reported by the FreeSpaceProvider defined in settings.xml
      • relativeTo

        public static String relativeTo​(File theFile,
                                        File theDir)
        Parameters:
        theFile - A file to make relative
        theDir - A directory
        Returns:
        the file path of theFile relative to theDir; null if theFile is not relative to theDir, or if theDir is not a directory.
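        The return contract above (a relative path, or null in the two failure cases) can be sketched with java.nio.file.Path. The sketch compares normalized absolute paths, which is an assumption; the actual method may resolve paths differently, and the class name is hypothetical.

```java
import java.io.File;
import java.nio.file.Path;

final class RelativeToSketch {
    static String relativeTo(File theFile, File theDir) {
        if (!theDir.isDirectory()) {
            return null; // theDir is not a directory
        }
        Path dirPath = theDir.toPath().toAbsolutePath().normalize();
        Path filePath = theFile.toPath().toAbsolutePath().normalize();
        if (!filePath.startsWith(dirPath)) {
            return null; // theFile does not lie under theDir
        }
        return dirPath.relativize(filePath).toString();
    }
}
```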
      • countLines

        public static long countLines​(File file)
        Count the number of lines in a file.
        Parameters:
        file - the file to read
        Returns:
        the number of lines in the file
        Throws:
        IOFailure - If an error occurred while reading the file
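        Counting lines needs only a streaming read, so the file never has to fit in memory. A minimal sketch, assuming ISO-8859-1 decoding so that arbitrary bytes never cause a decoding error; the class name is hypothetical and UncheckedIOException stands in for IOFailure.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class CountLinesSketch {
    static long countLines(File file) {
        try (BufferedReader reader =
                Files.newBufferedReader(file.toPath(), StandardCharsets.ISO_8859_1)) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count; // a final line without a trailing newline still counts
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```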
      • getEphemeralInputStream

        public static InputStream getEphemeralInputStream​(File file)
        Create an InputStream that reads from a file but removes the file when all data has been read.
        Parameters:
        file - A file to read. This file will be deleted when the input stream is closed, finalized, reaches end-of-file, or when the VM closes.
        Returns:
        An InputStream containing the file's contents.
        Throws:
        IOFailure - If an error occurs in creating the ephemeral input stream
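        A self-deleting stream of this kind can be approximated by overriding close() on a FileInputStream. The sketch below covers only two of the triggers listed above, close() and VM exit; it does not delete at end-of-file or on finalization. The class name is hypothetical and UncheckedIOException stands in for IOFailure.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class EphemeralStreamSketch {
    static InputStream getEphemeralInputStream(File file) {
        final File target = file;
        target.deleteOnExit(); // fallback if the stream is never closed
        try {
            return new FileInputStream(target) {
                @Override
                public void close() throws IOException {
                    super.close();
                    target.delete(); // harmless if the file is already gone
                }
            };
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

        Used with try-with-resources, the file disappears as soon as the block exits.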
      • makeValidFileFromExisting

        public static File makeValidFileFromExisting​(String filename)
                                              throws IOFailure
        Makes a valid file from filename passed in String. Ensures that the File object returned is not null, and that isFile() returns true.
        Parameters:
        filename - The file to create the File object from
        Returns:
        A valid, non-null File object.
        Throws:
        IOFailure - if file cannot be created.
      • writeFileToStream

        public static void writeFileToStream​(File f,
                                             OutputStream out)
        Write the entire contents of a file to a stream.
        Parameters:
        f - A file to write to the stream.
        out - The stream to write to.
        Throws:
        IOFailure - If any error occurs while writing the file to a stream
      • writeStreamToFile

        public static void writeStreamToFile​(InputStream in,
                                             File f)
        Write the contents of a stream into a file.
        Parameters:
        in - A stream to read from. This stream is not closed by this method.
        f - The file to write the stream contents into.
        Throws:
        IOFailure - If any error occurs while writing the stream to a file
      • getTempDir

        public static File getTempDir()
        Get the location of the standard temporary directory. The existence of this directory should be ensured at the start of every application.
        Returns:
        The directory that should be used for temporary files.
      • moveFile

        public static void moveFile​(File fromFile,
                                    File toFile)
        Attempt to move a file using rename, and if that fails, move the file by copy-and-delete.
        Parameters:
        fromFile - The source
        toFile - The target
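        The rename-then-fallback strategy described for moveFile is sketched below. The fallback here uses java.nio.file.Files and overwrites an existing target, which is an assumption; the class name is hypothetical and UncheckedIOException is used as a generic failure type.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

final class MoveFileSketch {
    static void moveFile(File fromFile, File toFile) {
        if (fromFile.renameTo(toFile)) {
            return; // cheap path: typically works within one file system
        }
        try {
            // Fallback: copy the data, then delete the source.
            Files.copy(fromFile.toPath(), toFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
            Files.delete(fromFile.toPath());
        } catch (IOException e) {
            throw new UncheckedIOException("Could not move " + fromFile + " to " + toFile, e);
        }
    }
}
```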
      • generateFileNameFromSet

        public static <T extends Comparable<T>> String generateFileNameFromSet​(Set<T> IDs,
                                                                               String suffix)
        Given a set, generate a reasonable file name from the set.
        Type Parameters:
        T - The type of objects, that the Set IDs argument contains.
        Parameters:
        IDs - A set of IDs.
        suffix - A suffix. May be empty string.
        Returns:
        A reasonable file name.
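        Together with MAX_IDS_IN_FILENAME, this method is described as listing the IDs in the name when there are few of them and falling back to a checksum when there are many. The sketch below illustrates that idea only: the "-" separator, the CRC32 checksum, the limit of 4 and the class name are all assumptions, not the library's actual format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.CRC32;

final class FileNameFromSetSketch {
    static final int MAX_IDS_IN_FILENAME = 4; // hypothetical value

    static <T extends Comparable<T>> String generateFileNameFromSet(Set<T> ids, String suffix) {
        // Sort first so that equal sets always give the same name.
        StringBuilder joined = new StringBuilder();
        for (T id : new TreeSet<>(ids)) {
            if (joined.length() > 0) {
                joined.append('-');
            }
            joined.append(id);
        }
        if (ids.size() <= MAX_IDS_IN_FILENAME) {
            return joined + suffix;
        }
        // Too many IDs: use a checksum of the joined IDs to bound the name length.
        CRC32 crc = new CRC32();
        crc.update(joined.toString().getBytes(StandardCharsets.UTF_8));
        return Long.toHexString(crc.getValue()) + suffix;
    }
}
```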
      • sortCrawlLog

        public static void sortCrawlLog​(File file,
                                        File toFile)
        Sort a crawl.log file according to the url.
        Parameters:
        file - The file containing the unsorted data.
        toFile - The file that the sorted data can be put into.
        Throws:
        IOFailure - if there were errors running the sort process, or if the file does not exist.
      • sortCrawlLogOnTimestamp

        public static void sortCrawlLogOnTimestamp​(File file,
                                                   File toFile)
        Sort a crawl.log file according to the timestamp.
        Parameters:
        file - The file containing the unsorted data.
        toFile - The file that the sorted data can be put into.
        Throws:
        IOFailure - if there were errors running the sort process, or if the file does not exist.
      • sortCDX

        public static void sortCDX​(File file,
                                   File toFile)
        Sort a CDX file according to our standard for CDX file sorting. This method depends on the Unix sort command.
        Parameters:
        file - The raw unsorted CDX file.
        toFile - The file that the result will be put into.
        Throws:
        IOFailure - If the file does not exist, or could not be sorted
      • sortFile

        public static void sortFile​(File file,
                                    File toFile)
        Sort a file using UNIX sort.
        Parameters:
        file - the file that you want to sort.
        toFile - The destination file.
      • createUniqueTempDir

        public static File createUniqueTempDir​(File inDir,
                                               String prefix)
        Creates a new temporary directory with a unique name. This directory will be deleted automatically at the end of the VM (though behaviour if there are files in it is undefined). This method will try a limited number of times to create a directory, using a randomly generated suffix, before giving up.
        Parameters:
        inDir - The directory where the temporary directory should be created.
        prefix - The prefix of the directory name, for identification purposes.
        Returns:
        A newly created directory that no other call to createUniqueTempDir returns.
        Throws:
        ArgumentNotValid - if inDir is not an existing directory that can be written to.
        IOFailure - if a free name couldn't be found within a reasonable number of tries.
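        The retry loop described above can be sketched with File.mkdir, which returns true only when this call actually created the directory. The retry limit of 10, the hexadecimal suffix and the class name are assumptions; IllegalArgumentException and IllegalStateException stand in for ArgumentNotValid and IOFailure.

```java
import java.io.File;
import java.util.concurrent.ThreadLocalRandom;

final class UniqueTempDirSketch {
    static final int MAX_TRIES = 10; // hypothetical limit

    static File createUniqueTempDir(File inDir, String prefix) {
        if (!inDir.isDirectory() || !inDir.canWrite()) {
            throw new IllegalArgumentException("Not a writable directory: " + inDir);
        }
        for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
            String suffix = Long.toHexString(ThreadLocalRandom.current().nextLong());
            File candidate = new File(inDir, prefix + suffix);
            if (candidate.mkdir()) { // false if the name is already taken
                candidate.deleteOnExit(); // only effective if the directory is empty at exit
                return candidate;
            }
        }
        throw new IllegalStateException("No free name found after " + MAX_TRIES + " tries");
    }
}
```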
      • readLastLine

        public static String readLastLine​(File file)
        Read the last line in a file. Note this method is not UTF-8 safe.
        Parameters:
        file - input file to read last line from.
        Returns:
        The last line in the file (ending newline is irrelevant), returns an empty string if file is empty.
        Throws:
        ArgumentNotValid - on null argument, or file is not a readable file.
        IOFailure - on IO trouble reading file.
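        One way to read the last line without loading the whole file is to scan backwards from the end with RandomAccessFile. The sketch below does this one byte at a time and decodes the bytes as ISO-8859-1; whether this matches the reason the real method is not UTF-8 safe is an assumption. The class name is hypothetical and UncheckedIOException stands in for IOFailure.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReadLastLineSketch {
    static String readLastLine(File file) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long end = raf.length();
            // Ignore one trailing line terminator ("\n" or "\r\n").
            if (end > 0 && byteAt(raf, end - 1) == '\n') {
                end--;
                if (end > 0 && byteAt(raf, end - 1) == '\r') {
                    end--;
                }
            }
            // Walk back to just after the previous newline, or to the start of the file.
            long start = end;
            while (start > 0 && byteAt(raf, start - 1) != '\n') {
                start--;
            }
            byte[] buffer = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(buffer);
            return new String(buffer, StandardCharsets.ISO_8859_1); // one char per byte
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static int byteAt(RandomAccessFile raf, long position) throws IOException {
        raf.seek(position);
        return raf.read();
    }
}
```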
      • appendToFile

        public static void appendToFile​(File file,
                                        String... lines)
        Append the given lines to a file. Each line is terminated by a newline.
        Parameters:
        file - A file to append to.
        lines - The lines to write.
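        Appending lines through a varargs parameter can be sketched with a writer opened in append mode. UTF-8 encoding, "\n" as the line terminator and creation of a missing file are assumptions of this sketch; the class name is hypothetical.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;

final class AppendToFileSketch {
    static void appendToFile(File file, String... lines) {
        try (BufferedWriter writer = Files.newBufferedWriter(file.toPath(), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n'); // every line gets its own terminator
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```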
      • getResourceFileFromClassPath

        public static File getResourceFileFromClassPath​(String filePath)
                                                 throws IOFailure
        Loads a file from the class path (for retrieving a file from a '.jar').
        Parameters:
        filePath - The path of the file.
        Returns:
        The file from the class path.
        Throws:
        IOFailure - If resource cannot be retrieved from the class path.
      • getHumanReadableFileSize

        public static String getHumanReadableFileSize​(File aFile)
        Get a human-readable representation of the file size. If the file is a directory, the size is the aggregate of the files in the directory, except that subdirectories are ignored. The number is given with 2 decimals.
        Parameters:
        aFile - a File object
        Returns:
        a human-readable representation of the file size (rounded)
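        The size rules above (sum the plain files of a directory, ignore subdirectories, print two decimals) can be sketched as follows. The unit names and the 1024 step between units are assumptions, since the description does not specify the output format; the class and helper names are hypothetical.

```java
import java.io.File;
import java.util.Locale;

final class HumanReadableSizeSketch {
    private static final String[] UNITS = {"bytes", "KB", "MB", "GB", "TB"}; // assumed units

    /** Size of a file, or of the plain files directly inside a directory. */
    static long sizeOf(File f) {
        if (!f.isDirectory()) {
            return f.length();
        }
        long total = 0;
        File[] entries = f.listFiles();
        if (entries != null) {
            for (File entry : entries) {
                if (entry.isFile()) { // subdirectories are ignored
                    total += entry.length();
                }
            }
        }
        return total;
    }

    static String format(long bytes) {
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format(Locale.ROOT, "%.2f %s", value, UNITS[unit]);
    }

    static String getHumanReadableFileSize(File aFile) {
        return format(sizeOf(aFile));
    }

    public static void main(String[] args) {
        System.out.println(format(1536)); // prints 1.50 KB
    }
}
```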
      • hasFiles

        public static boolean hasFiles​(File aDir)
        Parameters:
        aDir - A directory
        Returns:
        true, if the given directory contains files; else returns false