Class IcelandicExtractorJS

  • All Implemented Interfaces:
    org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

    public class IcelandicExtractorJS
    extends org.archive.modules.extractor.ExtractorJS
    Processes JavaScript files for strings that are likely to be crawlable URIs.

    NOTE: This processor may open a ReplayCharSequence from the CrawlURI's Recorder without closing it, so that later processors in the sequence can reuse it. In the usual (Heritrix) case, a call to the Recorder's endReplays() method after all processing ensures timely closing of any reused ReplayCharSequences. Anyone reusing this processor elsewhere should ensure that a similar cleanup call to Recorder.endReplays() occurs.

    TODO: Replace with a system that actually executes JavaScript in a browser-workalike DOM, such as HtmlUnit or a remote-controlled browser engine.
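    The cleanup requirement in the note above can be sketched as a try/finally around the processor chain. This is only an illustration: StubRecorder, StubProcessor and runChain are hypothetical stand-ins, not Heritrix classes, and only the endReplays() name is taken from the description above.

```java
import java.util.Arrays;
import java.util.List;

public class ReplayCleanupSketch {

    /** Hypothetical stand-in for a Recorder; it only counts open replays. */
    public static class StubRecorder {
        private int openReplays = 0;

        /** Simulates opening a replay that is deliberately left open for reuse. */
        public CharSequence openReplay(String content) {
            openReplays++;
            return content;
        }

        /** Mirrors the cleanup call named in the note: closes every open replay. */
        public void endReplays() {
            openReplays = 0;
        }

        public int openReplayCount() {
            return openReplays;
        }
    }

    /** Hypothetical stand-in for one processor in a chain. */
    public interface StubProcessor {
        void process(StubRecorder recorder, String content);
    }

    /** Runs every processor, then always performs the cleanup call. */
    public static void runChain(StubRecorder recorder, String content,
                                List<StubProcessor> chain) {
        try {
            for (StubProcessor p : chain) {
                p.process(recorder, content);
            }
        } finally {
            // Runs even if a processor throws, so no replay stays open.
            recorder.endReplays();
        }
    }

    public static void main(String[] args) {
        StubRecorder recorder = new StubRecorder();
        // An extractor-like processor that opens a replay and does not close it.
        StubProcessor extractor = (rec, content) -> rec.openReplay(content);
        runChain(recorder, "var next = '/scripts/app.js';", Arrays.asList(extractor));
        System.out.println("open replays after chain: " + recorder.openReplayCount());
    }
}
```

    Placing the cleanup in a finally block is one way to get the "timely close" the note asks for when the processor is reused outside Heritrix; the real Recorder and CrawlURI types would take the place of the stubs.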
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static String[] EXTRACTOR_URI_EXCEPTIONS  
      protected long numberOfCURIsHandled  
      • Fields inherited from class org.archive.modules.extractor.Extractor

        DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
      • Fields inherited from class org.archive.modules.Processor

        beanName, isRunning, kp, recoveryCheckpoint, uriCount
    • Method Summary

      Modifier and Type Method Description
      long considerStrings(org.archive.modules.extractor.Extractor ext, org.archive.modules.CrawlURI curi, CharSequence cs, boolean handlingJSFile)
      List<Pattern> getRejectRelativeMatchingRegexList()
      protected boolean innerExtract(org.archive.modules.CrawlURI curi)
      String report()
      void setRejectRelativeMatchingRegexList(List<Pattern> patterns)
      protected boolean shouldExtract(org.archive.modules.CrawlURI uri)
      • Methods inherited from class org.archive.modules.extractor.ExtractorJS

        considerString, considerStrings, considerStrings
      • Methods inherited from class org.archive.modules.extractor.ContentExtractor

        extract, shouldProcess
      • Methods inherited from class org.archive.modules.extractor.Extractor

        add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, setExtractorParameters, setLoggerModule, toCheckpointJson
      • Methods inherited from class org.archive.modules.Processor

        doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
    • Field Detail

      • numberOfCURIsHandled

        protected long numberOfCURIsHandled
      • EXTRACTOR_URI_EXCEPTIONS

        protected static final String[] EXTRACTOR_URI_EXCEPTIONS
    • Constructor Detail

      • IcelandicExtractorJS

        public IcelandicExtractorJS()
        Constructor.
    • Method Detail

      • getRejectRelativeMatchingRegexList

        public List<Pattern> getRejectRelativeMatchingRegexList()
      • setRejectRelativeMatchingRegexList

        public void setRejectRelativeMatchingRegexList(List<Pattern> patterns)
      • shouldExtract

        protected boolean shouldExtract(org.archive.modules.CrawlURI uri)
        Overrides:
        shouldExtract in class org.archive.modules.extractor.ExtractorJS
      • innerExtract

        protected boolean innerExtract(org.archive.modules.CrawlURI curi)
        Overrides:
        innerExtract in class org.archive.modules.extractor.ExtractorJS
      • considerStrings

        public long considerStrings(org.archive.modules.extractor.Extractor ext,
                                    org.archive.modules.CrawlURI curi,
                                    CharSequence cs,
                                    boolean handlingJSFile)
        Overrides:
        considerStrings in class org.archive.modules.extractor.ExtractorJS
      • report

        public String report()
        Overrides:
        report in class org.archive.modules.extractor.Extractor
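
The getRejectRelativeMatchingRegexList / setRejectRelativeMatchingRegexList pair above takes a List<Pattern>, but this page does not say how the extractor applies the patterns. The sketch below shows one plausible reading, assuming a candidate string is dropped when any pattern matches it in full. The class RejectRelativeSketch, the helper isRejected, and the sample patterns are hypothetical and are not part of IcelandicExtractorJS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RejectRelativeSketch {

    /**
     * Assumed semantics: reject the candidate if any pattern matches the
     * whole string. The real extractor may apply its list differently.
     */
    public static boolean isRejected(List<Pattern> rejectPatterns, CharSequence candidate) {
        for (Pattern p : rejectPatterns) {
            if (p.matcher(candidate).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical pattern: strings shaped like MIME types, which often
        // appear in JavaScript and can be mistaken for relative paths.
        List<Pattern> patterns = Arrays.asList(Pattern.compile("[a-z]+/[a-z]+"));

        for (String candidate : Arrays.asList("text/javascript", "scripts/app.js")) {
            System.out.println(candidate + " rejected: " + isRejected(patterns, candidate));
        }
    }
}
```

A list built this way has the type the setter expects, so under these assumptions it could be passed as setRejectRelativeMatchingRegexList(patterns).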