dk.netarkivet.harvester.harvesting.extractor
Class ExtractorOAI

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.Extractor
                          extended by dk.netarkivet.harvester.harvesting.extractor.ExtractorOAI
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean

public class ExtractorOAI
extends org.archive.crawler.extractor.Extractor

This is a link extractor for use with Heritrix. It finds the resumptionToken in the response to an OAI-PMH listMetadata query and constructs the link to the next page of results. It does not extract any other links, so if the OAI metadata contains additional URLs, an additional extractor should be used for them. Typically this means that the extractor chain in the order template ends with two enabled extractors: this one and the additional extractor.
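
The link construction described above can be sketched as follows. This is a minimal illustration, not the code of ExtractorOAI itself: the class OaiNextPageSketch and the method nextPageUrl are hypothetical names, and the sketch assumes the OAI-PMH convention that a follow-up request carries only the verb and the resumptionToken.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of NetarchiveSuite or Heritrix. */
final class OaiNextPageSketch {

    // Finds the verb argument anywhere in the query string of the current request.
    private static final Pattern VERB = Pattern.compile("[?&]verb=([^&]+)");

    /**
     * Builds the URL of the next result page from the URL of the current
     * page and the resumption token found in its response.
     */
    static String nextPageUrl(String currentUrl, String token) {
        Matcher verb = VERB.matcher(currentUrl);
        if (!verb.find()) {
            throw new IllegalArgumentException("No verb argument in " + currentUrl);
        }
        // A verb was found, so the URL is guaranteed to contain a '?'.
        String base = currentUrl.substring(0, currentUrl.indexOf('?'));
        // Keep only the verb; all other arguments are replaced by the token.
        return base + "?verb=" + verb.group(1)
                + "&resumptionToken=" + URLEncoder.encode(token, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(nextPageUrl(
                "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc", "abc/123"));
    }
}
```

The token is URL-encoded because resumption tokens may contain characters that are reserved in query strings.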

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ComplexType.MBeanAttributeInfoIterator
 
Field Summary
(package private)  org.apache.commons.logging.Log log
          The class logger.
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
ExtractorOAI(java.lang.String name)
          Constructor for this extractor.
 
Method Summary
protected  void extract(org.archive.crawler.datamodel.CrawlURI curi)
          Perform the link extraction on the current crawl URI.
 boolean processXml(org.archive.crawler.datamodel.CrawlURI curi, java.lang.CharSequence cs)
          Searches for resumption token and adds link if it is found.
 java.lang.String report()
          Return a report from this processor.
 
Methods inherited from class org.archive.crawler.extractor.Extractor
innerProcess
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

log

final org.apache.commons.logging.Log log
The class logger.

Constructor Detail

ExtractorOAI

public ExtractorOAI(java.lang.String name)
Constructor for this extractor.

Parameters:
name - the name of this extractor

Method Detail

extract

protected void extract(org.archive.crawler.datamodel.CrawlURI curi)
Perform the link extraction on the current crawl URI. This method does not call linkExtractorFinished() on the current CrawlURI, so subsequent extractors in the chain can find more links.

Specified by:
extract in class org.archive.crawler.extractor.Extractor
Parameters:
curi - the CrawlURI from which to extract the link.
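
Why leaving the "finished" flag unset matters can be illustrated with a small stand-alone model of an extractor chain. Everything here is hypothetical: Page, LinkExtractor and the two sample extractors are simplified stand-ins and do not reproduce the Heritrix classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of an extractor chain; not Heritrix code. */
final class ExtractorChainSketch {

    /** Simplified stand-in for a crawl URI. */
    static final class Page {
        final List<String> links = new ArrayList<>();
        boolean extractionFinished;
    }

    interface LinkExtractor {
        void extract(Page page);
    }

    /** Adds one link and leaves the finished flag unset, as ExtractorOAI is documented to do. */
    static final LinkExtractor NON_FINAL = page -> page.links.add("next-page");

    /** Adds one link and marks extraction as finished. */
    static final LinkExtractor FINAL = page -> {
        page.links.add("other-link");
        page.extractionFinished = true;
    };

    /** Runs the extractors in order, skipping the rest once the page is marked finished. */
    static List<String> run(Page page, List<LinkExtractor> chain) {
        for (LinkExtractor extractor : chain) {
            if (page.extractionFinished) {
                break;
            }
            extractor.extract(page);
        }
        return page.links;
    }
}
```

In this model, a chain of NON_FINAL followed by FINAL collects both links, whereas the reverse order collects only the link from FINAL.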

processXml

public boolean processXml(org.archive.crawler.datamodel.CrawlURI curi,
                          java.lang.CharSequence cs)
Searches for resumption token and adds link if it is found. Returns true iff a link is added.

Parameters:
curi - the CrawlURI.
cs - the character sequence in which to search.
Returns:
true iff a resumptionToken is found and a link added.
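
The "true iff a link is added" contract might look like the sketch below. It is an assumption-laden illustration rather than the actual method body: the class name, the method addLinkIfTokenFound, the linkPrefix parameter and the regular expression are all hypothetical, and a plain list stands in for the CrawlURI's outgoing links.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of NetarchiveSuite or Heritrix. */
final class ResumptionTokenSearchSketch {

    // Matches a resumptionToken element with non-empty content, with or without attributes.
    private static final Pattern TOKEN = Pattern.compile(
            "<resumptionToken(?:\\s[^>]*)?>\\s*([^<\\s]+)\\s*</resumptionToken>");

    /**
     * Searches cs for a resumption token. If one is found, appends it to
     * linkPrefix, adds the result to links and returns true; otherwise
     * leaves links untouched and returns false.
     */
    static boolean addLinkIfTokenFound(List<String> links, String linkPrefix, CharSequence cs) {
        Matcher matcher = TOKEN.matcher(cs);
        if (!matcher.find()) {
            return false;
        }
        // The token is appended as-is for brevity; real code should URL-encode it.
        links.add(linkPrefix + matcher.group(1));
        return true;
    }
}
```

An empty resumptionToken element, which OAI-PMH uses to mark the last page of a list, does not match the pattern, so no link is added and the method returns false.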

report

public java.lang.String report()
Return a report from this processor.

Overrides:
report in class org.archive.crawler.framework.Processor
Returns:
the report.