ExtractorOAI (NetarchiveSuite 5.4 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor
      - dk.netarkivet.harvester.harvesting.extractor.ExtractorOAI

All Implemented Interfaces:

org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorOAI
extends org.archive.modules.extractor.ContentExtractor
```
This is a link extractor for use with Heritrix. It finds the resumptionToken in the response to an OAI-PMH listMetadata query and constructs the link to the next page of results. It does not extract any other links, so if the OAI metadata contains additional URLs, a further extractor should be used for them. Typically this means that the extractor chain in the order template ends with two enabled extractors: this one and the additional extractor.
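
As a rough illustration of the "construct the link for the next page" step, the following minimal sketch builds a follow-up request URL from the current request URL and a resumptionToken. The class name `NextPageLinkSketch`, the method `nextPageUrl`, the `example.org` address, the use of the `ListRecords` verb and the URL-encoding of the token are all assumptions made for this example; they are not taken from the ExtractorOAI source and may differ from what the extractor actually does.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NetarchiveSuite or Heritrix.
final class NextPageLinkSketch {

    // Builds the URL of the next result page: the base URL of the current
    // request plus a query holding only the verb and the resumptionToken.
    // Assumption: the verb is ListRecords and the token is URL-encoded.
    static String nextPageUrl(String currentUrl, String token) {
        int queryStart = currentUrl.indexOf('?');
        String base = (queryStart >= 0) ? currentUrl.substring(0, queryStart) : currentUrl;
        return base + "?verb=ListRecords&resumptionToken="
                + URLEncoder.encode(token, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String current = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc";
        String token = "oai_dc/421315/56151148/100/0/292/x/x/x";
        // prints http://example.org/oai?verb=ListRecords&resumptionToken=oai_dc%2F421315%2F56151148%2F100%2F0%2F292%2Fx%2Fx%2Fx
        System.out.println(nextPageUrl(current, token));
    }
}
```

The original query string is dropped in this sketch because OAI-PMH treats resumptionToken as an exclusive argument: a follow-up request carries only the verb and the token.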

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`EXTENDED_RESUMPTION_TOKEN_MATCH` Regular expression matching the extended resumptionToken, which carries attributes.
`static String`	`SIMPLE_RESUMPTION_TOKEN_MATCH` Regular expression matching the simple resumptionToken, which has no attributes.

Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule

Fields inherited from class org.archive.modules.Processor
beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description
`ExtractorOAI()` Constructor for this extractor.

Method Summary

Modifier and Type	Method and Description
`protected boolean`	`innerExtract(org.archive.modules.CrawlURI curi)` Perform the link extraction on the current crawl uri.
`boolean`	`processXml(org.archive.modules.CrawlURI curi, CharSequence cs)` Searches for resumption token and adds link if it is found.
`String`	`report()` Return a report from this processor.
`protected boolean`	`shouldExtract(org.archive.modules.CrawlURI curi)`

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - SIMPLE_RESUMPTION_TOKEN_MATCH
```
public static final String SIMPLE_RESUMPTION_TOKEN_MATCH
```
    Regular expression matching the simple resumptionToken, for example one whose value is oai_dc/421315/56151148/100/0/292/x/x/x
    
    See Also:
    
    Constant Field Values
  - EXTENDED_RESUMPTION_TOKEN_MATCH
```
public static final String EXTENDED_RESUMPTION_TOKEN_MATCH
```
    Regular expression matching the extended resumptionToken, in which the element carries attributes around a value such as oai_dc/421315/56151148/100/0/292/x/x/x. This form is seen in OAI targets used by PURE.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - ExtractorOAI
```
public ExtractorOAI()
```
    Constructor for this extractor.
- Method Detail
  - innerExtract
```
protected boolean innerExtract(org.archive.modules.CrawlURI curi)
```
    Perform the link extraction on the current crawl uri. This method does not set linkExtractorFinished() on the current crawlURI, so subsequent extractors in the chain can find more links.
    
    Specified by:
    
    innerExtract in class org.archive.modules.extractor.ContentExtractor
    
    Parameters:
    
    curi - the CrawlURI from which to extract the link.
  - processXml
```
public boolean processXml(org.archive.modules.CrawlURI curi,
                          CharSequence cs)
```
    Searches for resumption token and adds link if it is found. Returns true iff a link is added.
    
    Parameters:
    
    curi - the CrawlURI.
    
    cs - the character sequence in which to search.
    
    Returns:
    
    true iff a resumptionToken is found and a link added.
  - report
```
public String report()
```
    Return a report from this processor.
    
    Overrides:
    
    report in class org.archive.modules.extractor.Extractor
    
    Returns:
    
    the report.
  - shouldExtract
```
protected boolean shouldExtract(org.archive.modules.CrawlURI curi)
```
    Specified by:
    
    shouldExtract in class org.archive.modules.extractor.ContentExtractor
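
The token search described for processXml, together with the simple and extended token forms named in the field details, can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the two patterns are hypothetical stand-ins and are not the values of SIMPLE_RESUMPTION_TOKEN_MATCH or EXTENDED_RESUMPTION_TOKEN_MATCH, the class `ResumptionTokenSketch` and method `findToken` are invented for the example, and the `cursor` and `completeListSize` attributes in the sample input are taken from the OAI-PMH specification rather than from this page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NetarchiveSuite or Heritrix.
final class ResumptionTokenSketch {

    // Assumed pattern for a token element without attributes.
    static final Pattern SIMPLE = Pattern.compile(
            "(?i)<resumptionToken>\\s*(.*?)\\s*</resumptionToken>");

    // Assumed pattern for a token element that carries attributes.
    static final Pattern EXTENDED = Pattern.compile(
            "(?i)<resumptionToken\\s+[^>]*>\\s*(.*?)\\s*</resumptionToken>");

    // Returns the first non-empty token found, trying the simple form first.
    // An empty element yields no token: OAI-PMH uses it to mark the last page.
    static Optional<String> findToken(CharSequence cs) {
        for (Pattern pattern : new Pattern[] {SIMPLE, EXTENDED}) {
            Matcher matcher = pattern.matcher(cs);
            if (matcher.find() && !matcher.group(1).isEmpty()) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String token = "oai_dc/421315/56151148/100/0/292/x/x/x";
        String simple = "<resumptionToken>" + token + "</resumptionToken>";
        String extended = "<resumptionToken cursor=\"0\" completeListSize=\"421315\">"
                + token + "</resumptionToken>";
        System.out.println(findToken(simple).isPresent());    // prints true
        System.out.println(findToken(extended).isPresent());  // prints true
        System.out.println(findToken("<resumptionToken></resumptionToken>").isPresent()); // prints false
    }
}
```

A caller in the style of processXml would add an outlink and return true only when `findToken` yields a value, and return false otherwise.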
