Class ExtractorOAI

  • All Implemented Interfaces:
    org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

    public class ExtractorOAI
    extends org.archive.modules.extractor.ContentExtractor
    This is a link extractor for use with Heritrix. It finds the resumptionToken in the response to an OAI-PMH list query (for example ListRecords) and constructs the link for the next page of results. It does not extract any other links, so if the OAI metadata contains additional URLs, an additional extractor should be used for them. Typically this means that the extractor chain in the order template ends with this extractor followed by the additional one, each enabled.
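    As a rough illustration of what constructing the next-page link involves, the sketch below builds a follow-up request URL from the current URL and a token. It relies on the OAI-PMH rule that resumptionToken is an exclusive argument, so only the verb and the token are sent. The class name OaiNextPageSketch, the method nextPageUrl, the example.org endpoint, and the choice to URL-encode the token are all assumptions made for this example; they are not taken from ExtractorOAI itself.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ExtractorOAI.
final class OaiNextPageSketch {

    /**
     * Builds the follow-up request URL for a resumption token.
     * The query string of the current URL is dropped because an OAI-PMH
     * request that carries a resumptionToken may carry no other argument
     * besides the verb.
     */
    static String nextPageUrl(String currentUrl, String verb, String token) {
        int q = currentUrl.indexOf('?');
        String base = (q >= 0) ? currentUrl.substring(0, q) : currentUrl;
        // Assumption: the token is percent-encoded before being appended.
        String encoded = URLEncoder.encode(token, StandardCharsets.UTF_8);
        return base + "?verb=" + verb + "&resumptionToken=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(nextPageUrl(
            "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc",
            "ListRecords",
            "oai_dc/421315/56151148/100/0/292/x/x/x"));
    }
}
```

    Whether the real extractor encodes the token or reuses the verb of the current request is not stated in this documentation, so treat both as open design choices.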
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static String EXTENDED_RESUMPTION_TOKEN_MATCH
      Regular expression matching the extended resumptionToken, which carries attributes.
      static String SIMPLE_RESUMPTION_TOKEN_MATCH
      Regular expression matching the simple resumptionToken.
      • Fields inherited from class org.archive.modules.extractor.Extractor

        DEFAULT_PARAMETERS, extractorParameters, loggerModule
      • Fields inherited from class org.archive.modules.Processor

        beanName, isRunning, kp, recoveryCheckpoint, uriCount
    • Constructor Summary

      Constructors 
      Constructor Description
      ExtractorOAI()
      Constructor for this extractor.
    • Method Summary

      Modifier and Type Method Description
      protected boolean innerExtract(org.archive.modules.CrawlURI curi)
      Perform the link extraction on the current crawl URI.
      boolean processXml(org.archive.modules.CrawlURI curi, CharSequence cs)
      Searches for a resumption token and adds a link if one is found.
      String report()
      Return a report from this processor.
      protected boolean shouldExtract(org.archive.modules.CrawlURI curi)
      • Methods inherited from class org.archive.modules.extractor.ContentExtractor

        extract, shouldProcess
      • Methods inherited from class org.archive.modules.extractor.Extractor

        add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, setExtractorParameters, setLoggerModule, toCheckpointJson
      • Methods inherited from class org.archive.modules.Processor

        doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
    • Field Detail

      • SIMPLE_RESUMPTION_TOKEN_MATCH

        public static final String SIMPLE_RESUMPTION_TOKEN_MATCH
        Regular expression matching the simple resumptionToken, whose value looks like this: oai_dc/421315/56151148/100/0/292/x/x/x
        See Also:
        Constant Field Values
      • EXTENDED_RESUMPTION_TOKEN_MATCH

        public static final String EXTENDED_RESUMPTION_TOKEN_MATCH
        Regular expression matching the extended resumptionToken, an element that carries attributes and whose value looks like this: oai_dc/421315/56151148/100/0/292/x/x/x. This form is seen in OAI targets used by PURE.
        See Also:
        Constant Field Values
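    The difference between the two token forms can be sketched with a pair of regular expressions. The patterns below are hypothetical and are not the values of SIMPLE_RESUMPTION_TOKEN_MATCH or EXTENDED_RESUMPTION_TOKEN_MATCH, which this page does not show. The attribute names cursor and completeListSize in the usage example come from the OAI-PMH specification; the class and method names are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical patterns for illustration; NOT the constants of ExtractorOAI.
final class ResumptionTokenPatterns {

    // Simple form: an element with no attributes.
    static final Pattern SIMPLE = Pattern.compile(
        "<resumptionToken>\\s*([^<\\s]+)\\s*</resumptionToken>",
        Pattern.CASE_INSENSITIVE);

    // Extended form: the opening tag carries one or more attributes.
    static final Pattern EXTENDED = Pattern.compile(
        "<resumptionToken\\s+[^>]*>\\s*([^<\\s]+)\\s*</resumptionToken>",
        Pattern.CASE_INSENSITIVE);

    /** Returns the token text, or null when no non-empty token is present. */
    static String findToken(CharSequence xml) {
        for (Pattern p : new Pattern[] {SIMPLE, EXTENDED}) {
            Matcher m = p.matcher(xml);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findToken(
            "<resumptionToken>oai_dc/421315/56151148/100/0/292/x/x/x</resumptionToken>"));
        System.out.println(findToken(
            "<resumptionToken cursor=\"0\" completeListSize=\"421315\">"
            + "oai_dc/421315/56151148/100/0/292/x/x/x</resumptionToken>"));
    }
}
```

    In this sketch an empty resumptionToken element, which OAI-PMH uses to mark the last page of a list, yields null because both patterns require at least one token character.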
    • Constructor Detail

      • ExtractorOAI

        public ExtractorOAI()
        Constructor for this extractor.
    • Method Detail

      • innerExtract

        protected boolean innerExtract(org.archive.modules.CrawlURI curi)
        Perform the link extraction on the current crawl URI. This method does not set linkExtractorFinished() on the current CrawlURI, so subsequent extractors in the chain can find more links.
        Specified by:
        innerExtract in class org.archive.modules.extractor.ContentExtractor
        Parameters:
        curi - the CrawlURI from which to extract the link.
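        The effect of leaving the "finished" flag unset can be pictured with a toy model of an extractor chain. Everything in the sketch below (the Uri class, the Extractor interface, the linkExtractionFinished field, and runChain) is a hypothetical stand-in written for this example; it does not reproduce Heritrix's CrawlURI or its processor chain.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; these types are NOT Heritrix classes.
final class ExtractorChainSketch {

    /** Hypothetical per-URI state: collected links plus a "finished" flag. */
    static final class Uri {
        final List<String> outlinks = new ArrayList<>();
        boolean linkExtractionFinished = false;
    }

    /** Hypothetical extractor: may add links and may set the flag. */
    interface Extractor {
        void extract(Uri uri);
    }

    /** Runs extractors in order and stops once one marks extraction finished. */
    static void runChain(Uri uri, List<Extractor> chain) {
        for (Extractor e : chain) {
            if (uri.linkExtractionFinished) {
                break;
            }
            e.extract(uri);
        }
    }

    public static void main(String[] args) {
        List<Extractor> chain = new ArrayList<>();
        // Behaves like the extractor described above: adds a link, leaves the flag alone.
        chain.add(u -> u.outlinks.add("next-page"));
        // A later extractor still runs and can add further links.
        chain.add(u -> u.outlinks.add("other-link"));

        Uri uri = new Uri();
        runChain(uri, chain);
        System.out.println(uri.outlinks);
    }
}
```

        Had the first extractor in this model set the flag, the second would have been skipped, which is the situation the method avoids.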
      • processXml

        public boolean processXml(org.archive.modules.CrawlURI curi,
                                  CharSequence cs)
        Searches for a resumption token and adds a link if one is found. Returns true if and only if a link is added.
        Parameters:
        curi - the CrawlURI.
        cs - the character sequence in which to search.
        Returns:
        true if and only if a resumptionToken is found and a link is added.
      • report

        public String report()
        Return a report from this processor.
        Overrides:
        report in class org.archive.modules.extractor.Extractor
        Returns:
        the report.
      • shouldExtract

        protected boolean shouldExtract(org.archive.modules.CrawlURI curi)
        Specified by:
        shouldExtract in class org.archive.modules.extractor.ContentExtractor