Class ExtractorOAI
java.lang.Object
  org.archive.modules.Processor
    org.archive.modules.extractor.Extractor
      org.archive.modules.extractor.ContentExtractor
        dk.netarkivet.harvester.harvesting.extractor.ExtractorOAI

All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
public class ExtractorOAI extends org.archive.modules.extractor.ContentExtractor
This is a link extractor for use with Heritrix. It finds the resumptionToken in the response to an OAI-PMH listMetadata query and constructs the link for the next page of results. It does not extract any other links, so if the OAI metadata contains additional URLs, an additional extractor should be used for them. Typically this means that the extractor chain in the order template will end with this extractor followed by that additional extractor, both enabled.
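The paging step described above can be sketched in plain Java, without any Heritrix classes. This is an illustration only, not the code of ExtractorOAI: the class name, the two regular expressions, the choice to URL-encode the token, and the example.org URL are all assumptions made for the sketch. It relies on the OAI-PMH rule that a follow-up request carries only the verb and the resumptionToken.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OaiNextPageSketch {
    // Hypothetical pattern, NOT the value of either ExtractorOAI constant:
    // matches a non-empty resumptionToken element, with or without attributes.
    private static final Pattern TOKEN = Pattern.compile(
            "<resumptionToken(?:\\s[^>]*)?>\\s*([^<\\s]+)\\s*</resumptionToken>");

    // Hypothetical pattern that picks the verb out of the current query string.
    private static final Pattern VERB = Pattern.compile("[?&]verb=([^&]+)");

    /** Returns the URL of the next result page, or empty if there is no token. */
    static Optional<String> nextPageUrl(String currentUrl, CharSequence body) {
        Matcher token = TOKEN.matcher(body);
        Matcher verb = VERB.matcher(currentUrl);
        int queryStart = currentUrl.indexOf('?');
        if (queryStart < 0 || !token.find() || !verb.find()) {
            return Optional.empty();
        }
        String baseUrl = currentUrl.substring(0, queryStart);
        // Assumption: the token is percent-encoded before it goes into the URL.
        String encoded = URLEncoder.encode(token.group(1), StandardCharsets.UTF_8);
        return Optional.of(baseUrl + "?verb=" + verb.group(1)
                + "&resumptionToken=" + encoded);
    }

    public static void main(String[] args) {
        String body = "<ListRecords><resumptionToken>"
                + "oai_dc/421315/56151148/100/0/292/x/x/x"
                + "</resumptionToken></ListRecords>";
        String url = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc";
        System.out.println(nextPageUrl(url, body).orElse("no further page"));
    }
}
```

An empty or self-closing resumptionToken element, which marks the last page of a list, does not match the pattern, so the sketch returns an empty result and no link would be added.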
Field Summary
Fields (modifier and type, field, description):
static String EXTENDED_RESUMPTION_TOKEN_MATCH
    Regular expression matching the extended resumptionToken with attributes.
static String SIMPLE_RESUMPTION_TOKEN_MATCH
    Regular expression matching the simple resumptionToken.
-
Constructor Summary
Constructors (constructor, description):
ExtractorOAI()
    Constructor for this extractor.
-
Method Summary
Methods (modifier and type, method, description):
protected boolean innerExtract(org.archive.modules.CrawlURI curi)
    Perform the link extraction on the current crawl URI.
boolean processXml(org.archive.modules.CrawlURI curi, CharSequence cs)
    Searches for a resumption token and adds a link if one is found.
String report()
    Return a report from this processor.
protected boolean shouldExtract(org.archive.modules.CrawlURI curi)
-
Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, setExtractorParameters, setLoggerModule, toCheckpointJson
-
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
Field Detail
-
SIMPLE_RESUMPTION_TOKEN_MATCH
public static final String SIMPLE_RESUMPTION_TOKEN_MATCH
Regular expression matching the simple resumptionToken, whose value looks like this: oai_dc/421315/56151148/100/0/292/x/x/x
See Also: Constant Field Values
-
EXTENDED_RESUMPTION_TOKEN_MATCH
public static final String EXTENDED_RESUMPTION_TOKEN_MATCH
Regular expression matching the extended resumptionToken, that is, a resumptionToken element that carries attributes, with a value like this: oai_dc/421315/56151148/100/0/292/x/x/x. This form is seen in OAI targets used by PURE.
See Also: Constant Field Values
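To make the difference between the two forms concrete, here is a small sketch with two stand-in patterns. They are hypothetical and are not the values of SIMPLE_RESUMPTION_TOKEN_MATCH or EXTENDED_RESUMPTION_TOKEN_MATCH (those are listed under Constant Field Values). The attribute names cursor and completeListSize are optional resumptionToken attributes defined by OAI-PMH; the values used here are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResumptionTokenPatternsSketch {
    // Hypothetical stand-in for the simple form: an element without attributes.
    static final String SIMPLE =
            "<resumptionToken>\\s*([^<\\s]+)\\s*</resumptionToken>";

    // Hypothetical stand-in for the extended form: an element with attributes.
    static final String EXTENDED =
            "<resumptionToken\\s[^>]*>\\s*([^<\\s]+)\\s*</resumptionToken>";

    /** Tries the simple form first, then the extended one; null if neither matches. */
    static String findToken(CharSequence xml) {
        for (String regex : new String[] {SIMPLE, EXTENDED}) {
            Matcher m = Pattern.compile(regex).matcher(xml);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String simple = "<resumptionToken>oai_dc/421315/56151148/100/0/292/x/x/x"
                + "</resumptionToken>";
        String extended = "<resumptionToken cursor=\"0\" completeListSize=\"421315\">"
                + "oai_dc/421315/56151148/100/0/292/x/x/x</resumptionToken>";
        System.out.println(findToken(simple));
        System.out.println(findToken(extended));
    }
}
```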
Method Detail
-
innerExtract
protected boolean innerExtract(org.archive.modules.CrawlURI curi)
Perform the link extraction on the current crawl URI. This method does not set linkExtractorFinished() on the current CrawlURI, so subsequent extractors in the chain can find more links.
Specified by: innerExtract in class org.archive.modules.extractor.ContentExtractor
Parameters:
curi - the CrawlURI from which to extract the link.
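Why leaving the URI unfinished matters can be shown with a toy extractor chain. Everything in this sketch is a stand-in: Uri, LinkExtractor and runChain are hypothetical types, not Heritrix classes, and the sketch assumes the convention that an extractor returning true marks the URI as finished, after which later extractors skip it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtractorChainSketch {
    /** Stand-in for a crawl URI: collected links plus a "finished" flag. */
    static final class Uri {
        final List<String> links = new ArrayList<>();
        boolean finished;
    }

    /** Stand-in for an extractor; returning true marks the URI as finished. */
    interface LinkExtractor {
        boolean innerExtract(Uri uri);
    }

    /** Runs each extractor in order, skipping the rest once the URI is finished. */
    static void runChain(Uri uri, List<LinkExtractor> chain) {
        for (LinkExtractor extractor : chain) {
            if (uri.finished) {
                continue;
            }
            if (extractor.innerExtract(uri)) {
                uri.finished = true;
            }
        }
    }

    /** Returns the links collected when the OAI-like step does or does not claim "finished". */
    static List<String> demo(boolean oaiClaimsFinished) {
        Uri uri = new Uri();
        LinkExtractor oai = u -> { u.links.add("next-page"); return oaiClaimsFinished; };
        LinkExtractor xml = u -> { u.links.add("metadata-link"); return true; };
        runChain(uri, Arrays.asList(oai, xml));
        return uri.links;
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // both extractors contribute links
        System.out.println(demo(true));  // the second extractor is skipped
    }
}
```

With the OAI-like step returning false, the second extractor still runs and contributes its link, which is the behaviour the description above relies on.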
-
processXml
public boolean processXml(org.archive.modules.CrawlURI curi, CharSequence cs)
Searches for a resumption token and adds a link if one is found. Returns true iff a link is added.
Parameters:
curi - the CrawlURI.
cs - the character sequence in which to search.
Returns:
true iff a resumptionToken is found and a link added.
-
report
public String report()
Return a report from this processor.
Overrides: report in class org.archive.modules.extractor.Extractor
Returns:
the report.
-
shouldExtract
protected boolean shouldExtract(org.archive.modules.CrawlURI curi)
Specified by: shouldExtract in class org.archive.modules.extractor.ContentExtractor