Class CrawlLogExtractionStrategy

  • All Implemented Interfaces:
    HadoopJobStrategy

    public class CrawlLogExtractionStrategy
    extends Object
    implements HadoopJobStrategy
    Strategy for running a Hadoop job that extracts crawl log lines matching a regular expression from metadata files. The mapper reads the regex from the job's Configuration; if none is set, it falls back to a pattern that matches every line. This type of job is the Hadoop counterpart to running CrawlLogLinesMatchingRegexp.
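
The fallback to an all-matching pattern described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the mapper's actual code: a Map stands in for the Hadoop Configuration, the key name REGEX_KEY is hypothetical, and full-line matching with matches() is an assumption about how lines are compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch only. A plain Map stands in for the Hadoop Configuration,
// and REGEX_KEY is a hypothetical key name, not the one the real mapper reads.
class CrawlLogRegexFallbackSketch {
    static final String REGEX_KEY = "example.crawllog.regex"; // hypothetical
    static final String MATCH_ALL = ".*";

    /** Returns the configured pattern, or an all-matching one if no regex is set. */
    static Pattern patternFrom(Map<String, String> conf) {
        return Pattern.compile(conf.getOrDefault(REGEX_KEY, MATCH_ALL));
    }

    /** Keeps the lines that the pattern matches in full (an assumption). */
    static List<String> matchingLines(List<String> lines, Pattern pattern) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).matches()) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "200 http://example.org/a text/html",
            "404 http://example.net/b text/html");

        // No regex configured: every line is kept.
        System.out.println(matchingLines(lines, patternFrom(Map.of())));

        // Regex configured: only lines mentioning example.org are kept.
        Map<String, String> conf = Map.of(REGEX_KEY, ".*example\\.org.*");
        System.out.println(matchingLines(lines, patternFrom(conf)));
    }
}
```

With no regex in the map, both sample lines are kept; with the example.org pattern, only the first one is.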
    • Constructor Detail

      • CrawlLogExtractionStrategy

        public CrawlLogExtractionStrategy(long jobID,
                                          org.apache.hadoop.fs.FileSystem fileSystem)
        Creates a strategy for the given job ID and Hadoop FileSystem.
        Parameters:
        jobID - The ID for the job.
        fileSystem - The Hadoop FileSystem used.
    • Method Detail

      • runJob

        public int runJob(org.apache.hadoop.fs.Path jobInputFile,
                          org.apache.hadoop.fs.Path jobOutputDir)
        Description copied from interface: HadoopJobStrategy
        Runs a Hadoop job (HadoopJobTool) as specified by this strategy.
        Specified by:
        runJob in interface HadoopJobStrategy
        Parameters:
        jobInputFile - The Path specifying the job's input file.
        jobOutputDir - The Path specifying the job's output directory.
        Returns:
        An exit code for the job.
      • createJobInputFile

        public org.apache.hadoop.fs.Path createJobInputFile(UUID uuid)
        Description copied from interface: HadoopJobStrategy
        Creates the job input file, with a name derived from a UUID.
        Specified by:
        createJobInputFile in interface HadoopJobStrategy
        Parameters:
        uuid - The UUID to create a unique name from.
        Returns:
        Path specifying where the input file is located.
      • createJobOutputDir

        public org.apache.hadoop.fs.Path createJobOutputDir(UUID uuid)
        Description copied from interface: HadoopJobStrategy
        Creates the job output directory, with a name derived from a UUID.
        Specified by:
        createJobOutputDir in interface HadoopJobStrategy
        Parameters:
        uuid - The UUID to create a unique name from.
        Returns:
        Path specifying where the output directory is located.
      • getJobType

        public String getJobType()
        Description copied from interface: HadoopJobStrategy
        Returns a string identifying the kind of job being run.
        Specified by:
        getJobType in interface HadoopJobStrategy
        Returns:
        String specifying the job's type.
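
The four methods above suggest a call order: create the input file and output directory from a UUID, then run the job on them. The sketch below illustrates that order under several assumptions. JobStrategySketch is a stand-in that mirrors the documented method names but uses java.nio.file.Path instead of org.apache.hadoop.fs.Path, DummyStrategy is a hypothetical implementation that runs no Hadoop job, and reusing one UUID for both paths and treating exit code 0 as success are assumptions rather than documented behaviour.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Stand-in for HadoopJobStrategy: same method names as documented above,
// but with java.nio.file.Path so the sketch runs without Hadoop.
interface JobStrategySketch {
    int runJob(Path jobInputFile, Path jobOutputDir);
    Path createJobInputFile(UUID uuid);
    Path createJobOutputDir(UUID uuid);
    String getJobType();
}

// Hypothetical implementation: records its arguments and reports success.
class DummyStrategy implements JobStrategySketch {
    Path lastInput;
    Path lastOutput;

    public Path createJobInputFile(UUID uuid) {
        return Paths.get("input", uuid + ".txt");
    }

    public Path createJobOutputDir(UUID uuid) {
        return Paths.get("output", uuid.toString());
    }

    public int runJob(Path jobInputFile, Path jobOutputDir) {
        lastInput = jobInputFile;
        lastOutput = jobOutputDir;
        return 0; // assumption: 0 signals success
    }

    public String getJobType() {
        return "DUMMY";
    }
}

class StrategyCallOrderSketch {
    /** Creates the input file and output directory, runs the job, returns its exit code. */
    static int runOnce(JobStrategySketch strategy) {
        UUID uuid = UUID.randomUUID(); // assumption: one UUID names both paths
        Path input = strategy.createJobInputFile(uuid);
        Path output = strategy.createJobOutputDir(uuid);
        return strategy.runJob(input, output);
    }

    public static void main(String[] args) {
        JobStrategySketch strategy = new DummyStrategy();
        int exitCode = runOnce(strategy);
        System.out.println(strategy.getJobType() + " job exit code: " + exitCode);
    }
}
```

Because the caller depends only on the interface, the same runOnce helper would work for any strategy, which is presumably the point of having CrawlLogExtractionStrategy implement HadoopJobStrategy.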