ExtractorOAI (NetarchiveSuite 5.0 API)

java.lang.Object
- javax.management.Attribute
  - org.archive.crawler.settings.Type
    - org.archive.crawler.settings.ComplexType
      - org.archive.crawler.settings.ModuleType
        - org.archive.crawler.framework.Processor
          - org.archive.crawler.extractor.Extractor
            - dk.netarkivet.harvester.harvesting.extractor.ExtractorOAI

All Implemented Interfaces:

Serializable, DynamicMBean
```
public class ExtractorOAI
extends org.archive.crawler.extractor.Extractor
```
This is a link extractor for use with Heritrix. It finds the resumptionToken in the response to an OAI-PMH listMetadata query and constructs the link for the next page of results. It does not extract any other links, so if the OAI metadata contains additional URLs, an additional extractor should be used for those. Typically this means that the extractor chain in the order template will end: true true
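
The paging step described above can be sketched in plain Java, without any Heritrix classes. This is only an illustration under stated assumptions: the class `OaiNextPageSketch`, the method `nextPageUrl`, and both regular expressions are hypothetical and are not taken from ExtractorOAI. The request shape (the verb plus `resumptionToken` as the only other argument, with the token URL-encoded) follows the OAI-PMH flow-control rules. The sketch needs Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OaiNextPageSketch {
    // Hypothetical pattern: captures the text of a non-empty resumptionToken
    // element, with or without attributes. Not the constant used by ExtractorOAI.
    private static final Pattern TOKEN = Pattern.compile(
            "<resumptionToken[^>]*>\\s*([^<\\s]+)\\s*</resumptionToken>");

    // Picks the verb argument out of the current request URL.
    private static final Pattern VERB = Pattern.compile("[?&]verb=([^&]+)");

    /** Returns the URL of the next result page, or empty if no token is found. */
    static Optional<String> nextPageUrl(String currentUrl, CharSequence xml) {
        Matcher token = TOKEN.matcher(xml);
        Matcher verb = VERB.matcher(currentUrl);
        if (!token.find() || !verb.find()) {
            return Optional.empty();
        }
        // Drop the old query string; a resumption request carries only
        // the verb and the token.
        int q = currentUrl.indexOf('?');
        String base = q < 0 ? currentUrl : currentUrl.substring(0, q);
        String encoded = URLEncoder.encode(token.group(1), StandardCharsets.UTF_8);
        return Optional.of(base + "?verb=" + verb.group(1)
                + "&resumptionToken=" + encoded);
    }

    public static void main(String[] args) {
        // example.org is a placeholder host, not a real OAI target.
        String url = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc";
        String xml = "<ListRecords><resumptionToken>"
                + "oai_dc/421315/56151148/100/0/292/x/x/x"
                + "</resumptionToken></ListRecords>";
        System.out.println(nextPageUrl(url, xml).orElse("no further pages"));
    }
}
```

An empty or self-closing resumptionToken element marks the last page in OAI-PMH, which is why the sketch's pattern requires at least one non-whitespace character of token text.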

See Also:

Serialized Form

Nested Class Summary
- Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
  org.archive.crawler.settings.ComplexType.MBeanAttributeInfoIterator

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`EXTENDED_RESUMPTION_TOKEN_MATCH` Regular expression matching the extended resumptionToken, which carries attributes.
`static String`	`SIMPLE_RESUMPTION_TOKEN_MATCH` Regular expression matching the simple resumptionToken.

Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules

Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap

Constructor Summary

Constructors
Constructor and Description
`ExtractorOAI(String name)` Constructor for this extractor.

Method Summary

Modifier and Type	Method and Description
`protected void`	`extract(org.archive.crawler.datamodel.CrawlURI curi)` Perform the link extraction on the current crawl uri.
`boolean`	`processXml(org.archive.crawler.datamodel.CrawlURI curi, CharSequence cs)` Searches for resumption token and adds link if it is found.
`String`	`report()` Return a report from this processor.

Methods inherited from class org.archive.crawler.extractor.Extractor
innerProcess

Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn

Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles

Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute

Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient

Methods inherited from class javax.management.Attribute
getName, hashCode

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - SIMPLE_RESUMPTION_TOKEN_MATCH
```
public static final String SIMPLE_RESUMPTION_TOKEN_MATCH
```
    Regular expression matching the simple form of resumptionToken, an element without attributes whose content looks like this: oai_dc/421315/56151148/100/0/292/x/x/x
    
    See Also:
    
    Constant Field Values
  - EXTENDED_RESUMPTION_TOKEN_MATCH
```
public static final String EXTENDED_RESUMPTION_TOKEN_MATCH
```
    Regular expression matching the extended form of resumptionToken, an element carrying attributes, whose content looks like this: oai_dc/421315/56151148/100/0/292/x/x/x. This form is seen in OAI targets used by PURE.
    
    See Also:
    
    Constant Field Values
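
To illustrate the difference between the two token forms, here is a minimal sketch with two stand-in patterns. The class `ResumptionTokenPatterns`, the names `SIMPLE` and `EXTENDED`, and the regular expressions themselves are assumptions made for this example; the actual values of the two constants are listed under Constant Field Values and may differ.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResumptionTokenPatterns {
    // Hypothetical stand-in for the simple form: element with no attributes.
    static final String SIMPLE =
            "<resumptionToken>\\s*([^<\\s]+)\\s*</resumptionToken>";

    // Hypothetical stand-in for the extended form: element with one or
    // more attributes before the closing angle bracket.
    static final String EXTENDED =
            "<resumptionToken\\s+[^>]*>\\s*([^<\\s]+)\\s*</resumptionToken>";

    /** Tries the simple form first, then the extended form. */
    static Optional<String> findToken(CharSequence xml) {
        for (String regex : new String[] {SIMPLE, EXTENDED}) {
            Matcher m = Pattern.compile(regex).matcher(xml);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String simple = "<resumptionToken>"
                + "oai_dc/421315/56151148/100/0/292/x/x/x</resumptionToken>";
        // Attribute names come from OAI-PMH; the values are made up.
        String extended = "<resumptionToken cursor=\"0\" completeListSize=\"421315\">"
                + "oai_dc/421315/56151148/100/0/292/x/x/x</resumptionToken>";
        System.out.println(findToken(simple).orElse("none"));
        System.out.println(findToken(extended).orElse("none"));
    }
}
```

Keeping the two forms as separate patterns mirrors the two constants documented here; a single pattern with an optional attribute section would also work in this sketch.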
- Constructor Detail
  - ExtractorOAI
```
public ExtractorOAI(String name)
```
    Constructor for this extractor.
    
    Parameters:
    
    name - the name of this extractor
- Method Detail
  - extract
```
protected void extract(org.archive.crawler.datamodel.CrawlURI curi)
```
    Perform the link extraction on the current crawl URI. This method does not set linkExtractorFinished() on the current CrawlURI, so subsequent extractors in the chain can find more links.
    
    Specified by:
    
    extract in class org.archive.crawler.extractor.Extractor
    
    Parameters:
    
    curi - the CrawlURI from which to extract the link.
  - processXml
```
public boolean processXml(org.archive.crawler.datamodel.CrawlURI curi,
                          CharSequence cs)
```
    Searches for resumption token and adds link if it is found. Returns true iff a link is added.
    
    Parameters:
    
    curi - the CrawlURI.
    
    cs - the character sequence in which to search.
    
    Returns:
    
    true iff a resumptionToken is found and a link added.
  - report
```
public String report()
```
    Return a report from this processor.
    
    Overrides:
    
    report in class org.archive.crawler.framework.Processor
    
    Returns:
    
    the report.
