Class NasWARCProcessor

  • All Implemented Interfaces:
    org.archive.checkpointing.Checkpointable, org.archive.io.warc.WARCWriterPoolSettings, org.archive.io.WriterPoolSettings, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

    public class NasWARCProcessor
    extends org.archive.modules.writer.WARCWriterProcessor
    Custom NAS WARCWriterProcessor that adds NetarchiveSuite metadata to the WARCInfo records written by Heritrix, simply by extending org.archive.modules.writer.WARCWriterProcessor. This was not possible in Heritrix 1 (H1).
    Author:
    svc
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected Map<String,String> metadataMap
      Metadata items.
      • Fields inherited from class org.archive.modules.writer.BaseWARCWriterProcessor

        generator, stats, urlsWritten
      • Fields inherited from class org.archive.modules.writer.WriterPoolProcessor

        ANNOTATION_UNWRITTEN, compress, directory, frequentFlushes, maxFileSizeBytes, maxTotalBytesToWrite, maxWaitForIdleMs, poolMaxActive, prefix, serverCache, skipIdenticalDigests, startNewFilesOnCheckpoint, storePaths, template, writeBufferSize
      • Fields inherited from class org.archive.modules.Processor

        beanName, isRunning, kp, recoveryCheckpoint, uriCount
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      Map<String,String> getFormItems()  
      List<String> getMetadata()  
      boolean getWriteMetadataOutlinks()  
      void setMetadataItems(Map<String,String> metadataItems)  
      void setWriteMetadataOutlinks(boolean writeMetadataOutlinks)  
      protected URI writeMetadata(org.archive.io.warc.WARCWriter w, String timestamp, URI baseid, org.archive.modules.CrawlURI curi, org.archive.util.anvl.ANVLRecord namedFields)
      Modifies the default writeMetadata method to control whether outlinks are written to the metadata record.
      • Methods inherited from class org.archive.modules.writer.WARCWriterProcessor

        fromCheckpointJson, getWriteMetadata, getWriteRequests, innerProcessResult, qualifyRecordID, saveHeader, setWriteMetadata, setWriteRequests, setWriteRevisitForIdenticalDigests, setWriteRevisitForNotModified, toCheckpointJson, write, writeDnsRecords, writeFtpControlConversation, writeFtpRecords, writeHttpRecords, writeRequest, writeResource, writeResponse, writeRevisit, writeRevisit, writeWhoisRecords
      • Methods inherited from class org.archive.modules.writer.BaseWARCWriterProcessor

        addIfNotBlank, addStats, copyStats, getDefaultMaxFileSize, getDefaultStorePaths, getRecordID, getRecordIDGenerator, getStats, report, setRecordIDGenerator, setupPool, updateMetadataAfterWrite
      • Methods inherited from class org.archive.modules.writer.WriterPoolProcessor

        calcOutputDirs, checkBytesWritten, copyForwardWriteTagIfDupe, doCheckpoint, getCompress, getDirectory, getFrequentFlushes, getHostAddress, getMaxFileSizeBytes, getMaxTotalBytesToWrite, getMaxWaitForIdleMs, getMetadataProvider, getPool, getPoolMaxActive, getPrefix, getSerialNo, getServerCache, getSkipIdenticalDigests, getStartNewFilesOnCheckpoint, getStorePaths, getTemplate, getTotalBytesWritten, getWriteBufferSize, innerProcess, innerRejectProcess, setCompress, setDirectory, setFrequentFlushes, setMaxFileSizeBytes, setMaxTotalBytesToWrite, setMaxWaitForIdleMs, setMetadataProvider, setPool, setPoolMaxActive, setPrefix, setServerCache, setSkipIdenticalDigests, setStartNewFilesOnCheckpoint, setStorePaths, setTemplate, setTotalBytesWritten, setWriteBufferSize, shouldProcess, shouldWrite, start, stop
      • Methods inherited from class org.archive.modules.Processor

        finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, startCheckpoint
      • Methods inherited from interface org.archive.checkpointing.Checkpointable

        finishCheckpoint, setRecoveryCheckpoint, startCheckpoint
      • Methods inherited from interface org.springframework.context.Lifecycle

        isRunning
      • Methods inherited from interface org.archive.io.warc.WARCWriterPoolSettings

        getRecordIDGenerator
      • Methods inherited from interface org.archive.io.WriterPoolSettings

        calcOutputDirs, getCompress, getFrequentFlushes, getMaxFileSizeBytes, getPrefix, getTemplate, getWriteBufferSize
    • Field Detail

      • metadataMap

        protected Map<String,String> metadataMap
        Metadata items. Add to the WARCProcessor bean as ...
    • Constructor Detail

      • NasWARCProcessor

        public NasWARCProcessor()
    • Method Detail

      • getWriteMetadataOutlinks

        public boolean getWriteMetadataOutlinks()
      • setWriteMetadataOutlinks

        public void setWriteMetadataOutlinks(boolean writeMetadataOutlinks)
      • setMetadataItems

        public void setMetadataItems(Map<String,String> metadataItems)
      • getMetadata

        public List<String> getMetadata()
        Specified by:
        getMetadata in interface org.archive.io.WriterPoolSettings
        Overrides:
        getMetadata in class org.archive.modules.writer.BaseWARCWriterProcessor
      • writeMetadata

        protected URI writeMetadata(org.archive.io.warc.WARCWriter w,
                                    String timestamp,
                                    URI baseid,
                                    org.archive.modules.CrawlURI curi,
                                    org.archive.util.anvl.ANVLRecord namedFields)
                             throws IOException
        Modifies the default writeMetadata method to control whether outlinks are written to the metadata record.
        Overrides:
        writeMetadata in class org.archive.modules.writer.WARCWriterProcessor
        Throws:
        IOException
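
The interplay between setMetadataItems, getMetadata, and the writeMetadataOutlinks flag can be illustrated with a minimal, dependency-free sketch. It is not the NasWARCProcessor source and does not use any Heritrix classes: the class name MetadataSketch, the helper buildMetadataFields, the extra parameter on getMetadata, the "key: value" rendering, and all keys and URLs are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for the pattern described above.
 * It only mirrors the setter/getter names from this page; the
 * behavior is an assumption, not the real implementation.
 */
public class MetadataSketch {
    // Extra items to append to the warcinfo-style metadata lines.
    private final Map<String, String> metadataItems = new LinkedHashMap<>();
    // When false, outlinks are left out of the per-URI metadata fields.
    private boolean writeMetadataOutlinks = false;

    public void setMetadataItems(Map<String, String> items) {
        metadataItems.clear();
        metadataItems.putAll(items);
    }

    public boolean getWriteMetadataOutlinks() {
        return writeMetadataOutlinks;
    }

    public void setWriteMetadataOutlinks(boolean writeMetadataOutlinks) {
        this.writeMetadataOutlinks = writeMetadataOutlinks;
    }

    /**
     * Returns the inherited lines followed by one "key: value" line per
     * configured item. The real getMetadata() takes no arguments; the
     * parameter here stands in for whatever the superclass provides.
     */
    public List<String> getMetadata(List<String> inherited) {
        List<String> result = new ArrayList<>(inherited);
        for (Map.Entry<String, String> e : metadataItems.entrySet()) {
            result.add(e.getKey() + ": " + e.getValue());
        }
        return result;
    }

    /** Builds per-URI metadata fields, adding outlinks only if enabled. */
    public List<String> buildMetadataFields(String via, List<String> outlinks) {
        List<String> fields = new ArrayList<>();
        if (via != null) {
            fields.add("via: " + via);
        }
        if (writeMetadataOutlinks) {
            for (String outlink : outlinks) {
                fields.add("outlink: " + outlink);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        MetadataSketch sketch = new MetadataSketch();
        sketch.setMetadataItems(Map.of("exampleKey", "exampleValue"));
        System.out.println(sketch.getMetadata(List.of("software: example")));
        sketch.setWriteMetadataOutlinks(true);
        System.out.println(
            sketch.buildMetadataFields("http://a.example/", List.of("http://b.example/")));
    }
}
```

In this sketch the flag defaults to false, so outlinks are omitted unless explicitly enabled; the default used by the actual class is not stated on this page.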