Class IcelandicExtractorJS
- java.lang.Object
  - org.archive.modules.Processor
    - org.archive.modules.extractor.Extractor
      - org.archive.modules.extractor.ContentExtractor
        - org.archive.modules.extractor.ExtractorJS
          - dk.netarkivet.harvester.harvesting.extractor.IcelandicExtractorJS
All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
public class IcelandicExtractorJS extends org.archive.modules.extractor.ExtractorJS
Processes JavaScript files for strings that are likely to be crawlable URIs.

NOTE: This processor may open a ReplayCharSequence from the CrawlURI's Recorder without closing it, so that later processors in the sequence can reuse it. In the usual (Heritrix) case, a call to the Recorder's endReplays() method after all processing ensures that any reused ReplayCharSequences are closed promptly. Code that reuses this processor elsewhere should make a similar cleanup call to Recorder.endReplays().

TODO: Replace with a system that actually executes JavaScript in a browser-workalike DOM, such as HtmlUnit or a remote-controlled browser engine.
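The cleanup obligation in the NOTE can be illustrated with a minimal, self-contained sketch. StubRecorder and StubProcessor are hypothetical stand-ins invented for this example; they are not the Heritrix Recorder or Processor classes, and only the name endReplays() is taken from the description above. The sketch shows one way a caller outside Heritrix might guarantee the cleanup call with try/finally.

```java
import java.util.Arrays;
import java.util.List;

public class ReplayCleanupSketch {

    /** Hypothetical stand-in for a recorder; NOT the Heritrix Recorder class. */
    static class StubRecorder {
        private boolean replayOpen = false;
        private int timesOpened = 0;

        /** Opens the shared replay on first use and reuses it afterwards. */
        CharSequence getReplay() {
            if (!replayOpen) {
                replayOpen = true;
                timesOpened++;
            }
            return "var next = '/page2.html';";
        }

        /** Closes any replay left open; plays the role of endReplays() in the NOTE. */
        void endReplays() {
            replayOpen = false;
        }

        boolean isReplayOpen() { return replayOpen; }

        int getTimesOpened() { return timesOpened; }
    }

    /** Hypothetical stand-in for one processor in a chain. */
    interface StubProcessor {
        void process(StubRecorder recorder);
    }

    /** Runs every processor, then always releases the shared replay. */
    static void runChain(StubRecorder recorder, List<StubProcessor> chain) {
        try {
            for (StubProcessor processor : chain) {
                processor.process(recorder);
            }
        } finally {
            // Runs even if a processor throws, so the replay is never leaked.
            recorder.endReplays();
        }
    }

    public static void main(String[] args) {
        StubRecorder recorder = new StubRecorder();
        List<StubProcessor> chain = Arrays.asList(
                r -> r.getReplay(),   // first processor opens the replay and leaves it open
                r -> r.getReplay());  // second processor reuses the same replay
        runChain(recorder, chain);
        System.out.println("opened " + recorder.getTimesOpened()
                + " time(s), still open: " + recorder.isReplayOpen());
    }
}
```

Placing the cleanup in a finally block is a design choice of this sketch: the replay is released once per URI regardless of how many processors touched it or whether one of them failed.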
-
Field Summary
Modifier and Type            Field
protected static String[]    EXTRACTOR_URI_EXCEPTIONS
protected long               numberOfCURIsHandled
-
Constructor Summary
Constructor               Description
IcelandicExtractorJS()    Constructor.
-
Method Summary
Modifier and Type    Method
long                 considerStrings(org.archive.modules.extractor.Extractor ext, org.archive.modules.CrawlURI curi, CharSequence cs, boolean handlingJSFile)
List<Pattern>        getRejectRelativeMatchingRegexList()
protected boolean    innerExtract(org.archive.modules.CrawlURI curi)
String               report()
void                 setRejectRelativeMatchingRegexList(List<Pattern> patterns)
protected boolean    shouldExtract(org.archive.modules.CrawlURI uri)
-
Methods inherited from class org.archive.modules.extractor.ExtractorJS
considerString, considerStrings, considerStrings
-
Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, setExtractorParameters, setLoggerModule, toCheckpointJson
-
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
-
Field Detail
-
numberOfCURIsHandled
protected long numberOfCURIsHandled
-
EXTRACTOR_URI_EXCEPTIONS
protected static final String[] EXTRACTOR_URI_EXCEPTIONS
-
Method Detail
-
setRejectRelativeMatchingRegexList
public void setRejectRelativeMatchingRegexList(List<Pattern> patterns)
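This page does not document how the reject patterns are applied, so the following is only a sketch under stated assumptions: a candidate relative string is assumed to be rejected when any pattern in the list matches the whole string. The class RejectListSketch, the helper isRejected, and the two example patterns are hypothetical and are not part of IcelandicExtractorJS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RejectListSketch {

    /**
     * Hypothetical helper: returns true if any pattern matches the entire candidate.
     * Whole-string matching is an assumption of this sketch, not documented behavior.
     */
    static boolean isRejected(List<Pattern> rejectPatterns, CharSequence candidate) {
        for (Pattern pattern : rejectPatterns) {
            if (pattern.matcher(candidate).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative patterns only: bare alphabetic words, and strings containing whitespace.
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("[a-zA-Z]+"),
                Pattern.compile(".*\\s.*"));

        // A list like this is what would be passed to
        // setRejectRelativeMatchingRegexList(List<Pattern> patterns).
        for (String candidate : new String[] {"function", "hello world", "img/logo.png"}) {
            System.out.println(candidate + " rejected: " + isRejected(patterns, candidate));
        }
    }
}
```

Compiling the patterns once and passing the resulting list to the setter avoids recompiling each regular expression for every candidate string.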
-
shouldExtract
protected boolean shouldExtract(org.archive.modules.CrawlURI uri)
Overrides: shouldExtract in class org.archive.modules.extractor.ExtractorJS
-
innerExtract
protected boolean innerExtract(org.archive.modules.CrawlURI curi)
Overrides: innerExtract in class org.archive.modules.extractor.ExtractorJS
-
considerStrings
public long considerStrings(org.archive.modules.extractor.Extractor ext, org.archive.modules.CrawlURI curi, CharSequence cs, boolean handlingJSFile)
Overrides: considerStrings in class org.archive.modules.extractor.ExtractorJS
-
report
public String report()
Overrides: report in class org.archive.modules.extractor.Extractor
-