Class CrawlLogExtractionMapper


  • public class CrawlLogExtractionMapper
    extends org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.NullWritable,org.apache.hadoop.io.Text>
    Hadoop Mapper for extracting crawl log lines from metadata files. Expects the job's Configuration to have a regex set, which is used to select the relevant lines. If no regex is set, a regex that matches every line is used.
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Mapper

        org.apache.hadoop.mapreduce.Mapper.Context
    • Constructor Detail

      • CrawlLogExtractionMapper

        public CrawlLogExtractionMapper()
    • Method Detail

      • map

        protected void map(org.apache.hadoop.io.LongWritable linenumber,
                           org.apache.hadoop.io.Text archiveFilePath,
                           org.apache.hadoop.mapreduce.Mapper.Context context)
                    throws IOException,
                           InterruptedException
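        Mapping method. Extracts the crawl log lines from the given archive file and writes those matching the configured regex to the context.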
        Overrides:
        map in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.NullWritable,org.apache.hadoop.io.Text>
        Parameters:
        linenumber - The line number of the input record. Ignored.
        archiveFilePath - The path to the archive file.
        context - Context used for writing output.
        Throws:
        IOException - If extracting the crawl log lines from the archive file fails.
        InterruptedException
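
The filtering behaviour described for this class (a configured regex, with a match-everything fallback when none is set) might look roughly like the sketch below. It is plain Java with no Hadoop dependency and is not the mapper's actual code: the class name `CrawlLogLineFilterSketch`, the method `filterLines`, the `MATCH_ALL` constant, the use of full-line matching, and the sample lines are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the documented API.
final class CrawlLogLineFilterSketch {
    // Assumed fallback: a pattern that matches any single line.
    static final String MATCH_ALL = ".*";

    // Returns the lines that fully match the regex.
    // A null or empty regex falls back to MATCH_ALL (assumed behaviour).
    static List<String> filterLines(List<String> lines, String regex) {
        String effective = (regex == null || regex.isEmpty()) ? MATCH_ALL : regex;
        Pattern pattern = Pattern.compile(effective);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not the real crawl log format.
        List<String> lines = Arrays.asList(
                "200 http://example.org/ text/html",
                "404 http://example.org/missing text/html");
        // Keep only lines starting with "200 ".
        System.out.println(filterLines(lines, "200 .*"));
        // No regex set: every line is kept.
        System.out.println(filterLines(lines, null));
    }
}
```

In a real job, the regex would be read from the Configuration available through the mapper's context, and each kept line would be written out with a NullWritable key and a Text value, matching the output types in the class signature.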