dk.netarkivet.harvester.harvesting.extractor
Class ExtractorJS

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.Extractor
                          extended by dk.netarkivet.harvester.harvesting.extractor.ExtractorJS
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, org.archive.crawler.datamodel.CoreAttributeConstants

public class ExtractorJS
extends org.archive.crawler.extractor.Extractor
implements org.archive.crawler.datamodel.CoreAttributeConstants

Processes JavaScript files for strings that are likely to be crawlable URIs.
Contributors: gojomo, szznax, svc
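The idea of scanning script text for URI-like strings can be illustrated with a minimal, self-contained sketch. It is not the ExtractorJS implementation: the class name `JsStringSketch`, the method `likelyUris`, and both regular expressions are assumptions made for illustration, and the heuristic shown (a quoted literal with no whitespace that contains a dot or a slash) is far cruder than what a real crawler would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch only; names and patterns are not taken from ExtractorJS. */
public class JsStringSketch {

    // A single- or double-quoted literal on one line (escaped quotes are not handled).
    private static final Pattern QUOTED = Pattern.compile("([\"'])(.*?)\\1");

    // Assumed heuristic: no whitespace or angle brackets, and at least one '.' or '/'.
    private static final Pattern LIKELY_URI = Pattern.compile("[^\\s<>]*[./][^\\s<>]*");

    /** Returns the quoted literals in the script text that look like URIs. */
    public static List<String> likelyUris(CharSequence script) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            if (LIKELY_URI.matcher(candidate).matches()) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String js = "var a = \"http://example.com/a.js\"; var b = 'hello world'; load('img/logo.png');";
        System.out.println(likelyUris(js));
    }
}
```

In this sketch the literal `'hello world'` is skipped because it contains whitespace, while the absolute URL and the relative path are both kept as candidates.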

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ComplexType.MBeanAttributeInfoIterator
 
Field Summary
protected static java.lang.String[] EXTRACTOR_URI_EXCEPTIONS
           
(package private) static java.lang.String JAVASCRIPT_STRING_EXTRACTOR
           
protected  long numberOfCURIsHandled
           
protected static long numberOfLinksExtracted
           
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
ExtractorJS(java.lang.String name)
           
 
Method Summary
static long considerStrings(org.archive.crawler.datamodel.CrawlURI curi, java.lang.CharSequence cs, org.archive.crawler.framework.CrawlController controller, boolean handlingJSFile)
           
 void extract(org.archive.crawler.datamodel.CrawlURI curi)
           
 java.lang.String report()
           
 
Methods inherited from class org.archive.crawler.extractor.Extractor
innerProcess
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

JAVASCRIPT_STRING_EXTRACTOR

static final java.lang.String JAVASCRIPT_STRING_EXTRACTOR
See Also:
Constant Field Values

numberOfCURIsHandled

protected long numberOfCURIsHandled

numberOfLinksExtracted

protected static long numberOfLinksExtracted

EXTRACTOR_URI_EXCEPTIONS

protected static final java.lang.String[] EXTRACTOR_URI_EXCEPTIONS
Constructor Detail

ExtractorJS

public ExtractorJS(java.lang.String name)
Parameters:
name - the name of this processor module
Method Detail

extract

public void extract(org.archive.crawler.datamodel.CrawlURI curi)
Specified by:
extract in class org.archive.crawler.extractor.Extractor

considerStrings

public static long considerStrings(org.archive.crawler.datamodel.CrawlURI curi,
                                   java.lang.CharSequence cs,
                                   org.archive.crawler.framework.CrawlController controller,
                                   boolean handlingJSFile)

report

public java.lang.String report()
Overrides:
report in class org.archive.crawler.framework.Processor
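
As a rough illustration of how counters such as numberOfCURIsHandled and numberOfLinksExtracted could feed a textual report, here is a standalone sketch. It does not depend on Heritrix and does not reproduce the real report() output: the class `CounterReportSketch`, the `record` method, and the report layout are all assumptions. Both counters are instance fields here for simplicity, whereas numberOfLinksExtracted is declared static in the class documented above.

```java
/** Hypothetical sketch only; the report layout is invented for illustration. */
public class CounterReportSketch {

    private long numberOfCURIsHandled = 0;
    private long numberOfLinksExtracted = 0;

    /** Records one processed document and the number of links found in it. */
    public void record(long linksFound) {
        numberOfCURIsHandled++;
        numberOfLinksExtracted += linksFound;
    }

    /** Builds a short, human-readable summary of the counters. */
    public String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(getClass().getName()).append('\n');
        sb.append("  CrawlURIs handled: ").append(numberOfCURIsHandled).append('\n');
        sb.append("  Links extracted:   ").append(numberOfLinksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        CounterReportSketch sketch = new CounterReportSketch();
        sketch.record(2);
        sketch.record(3);
        System.out.print(sketch.report());
    }
}
```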