Fixed API calls to new dnsjava

Replaced transitive dnsjava with explicit import (completely purged now)

Replaced transitive dnsjava with explicit import

Updated Heritrix snapshot version

Updated to support new protocol-agnostic server-ip attribute in Heritrix

Enabled configurable url-matching and extraction for sitemaps.

docs: Add decide rules to bean reference

Everywhere: Fix dangling JavaDoc comments

JavaDoc comments need to be directly above a class, method or field

declaration to be used by IDEs and generated documentation. We had some

doc comments that were attached to instance initialization blocks and

thus were being ignored. This invalid doc comment positioning was

presumably an accidental consequence of the conversion of fields to

KeyedProperties.

This change moves most of the dangling doc comments to setter methods

where they can be seen by tools. There were a couple of dangling doc

comments above package statements. These are moved above the class

declaration or removed entirely when empty.

  1. … 34 more files in changeset.
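
To illustrate the problem this commit describes, here is a minimal sketch. `ExampleBean`, `maxRetries`, and the plain `HashMap` standing in for KeyedProperties are hypothetical and are not taken from the Heritrix source; the only point is where the doc comment sits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bean; a plain HashMap stands in for Heritrix's KeyedProperties.
class ExampleBean {
    private final Map<String, Object> kp = new HashMap<>();

    // Before: a doc comment placed directly above this initializer block
    // is attached to no declaration, so javadoc and IDEs ignore it.
    {
        setMaxRetries(3);
    }

    int getMaxRetries() {
        return (Integer) kp.get("maxRetries");
    }

    // After: the doc comment sits on the setter, where tools can see it.
    /**
     * Maximum number of retries for a failed fetch. Default is 3.
     */
    void setMaxRetries(int maxRetries) {
        kp.put("maxRetries", maxRetries);
    }
}
```

The behaviour of the bean is unchanged by moving the comment; only the visibility of the documentation to tooling differs.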
docs: Fallback to field javadoc comments when setter javadoc unavailable

This enables us to generate documentation for more bean properties,

although a number of beans have javadoc on initializer code blocks which

makes it hard to access. This affects javadoc and IDE contextual

documentation too so should probably be fixed in the source code itself.
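
The fallback order can be sketched roughly as follows. This is an illustration in Java under assumed names (`PropertyDocs`, `describe`, and the two lookup maps are hypothetical), not the actual documentation generator.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; not the actual bean reference generator.
class PropertyDocs {
    /**
     * Returns the setter's doc comment when present and non-blank,
     * otherwise falls back to the doc comment on the backing field.
     */
    static Optional<String> describe(String property,
                                     Map<String, String> setterDocs,
                                     Map<String, String> fieldDocs) {
        String doc = setterDocs.get(property);
        if (doc == null || doc.trim().isEmpty()) {
            // Assumed fallback: use the field's comment instead.
            doc = fieldDocs.get(property);
        }
        if (doc == null || doc.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(doc.trim());
    }
}
```

A property documented only on an initializer block would still come back empty here, which matches the limitation noted in the commit message.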

Merge pull request #411 from internetarchive/chrome-request-capturing

ExtractorChrome: Capture requests made by the browser

docs: Add most of the default config beans to the bean reference

Notably Decide Rules are still to be done.

docs: Strip @link and @code javadoc directives from bean reference
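
As a rough illustration of what stripping these directives involves, here is a small Java sketch. The class and method names are hypothetical and this is not the project's docs tooling; it assumes the directive text contains no nested braces.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not the actual docs tooling.
class JavadocDirectives {
    // Matches {@link X} and {@code X}; nested braces are not handled.
    private static final Pattern INLINE_TAG =
            Pattern.compile("\\{@(?:link|code)\\s+([^}]*)\\}");

    // Replaces each inline directive with the text it wraps.
    static String strip(String javadoc) {
        return INLINE_TAG.matcher(javadoc).replaceAll("$1");
    }
}
```

For example, `JavadocDirectives.strip("See {@link CrawlURI} and {@code true}.")` returns `See CrawlURI and true.`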

docs: Add remaining link extractor to the bean reference

docs: Remove duplicate word 'Documentation' from page titles

Also include '3' to distinguish from the Heritrix 1 manual which is

still prevalent in search results.

docs: Fix links accidentally using Markdown syntax

docs: Start a 'Bean Reference' document generated from the source code

This is meant to complement the javadoc by providing a reference

more suitable for users trying to configure crawls rather than

developers writing new modules.

The doc generation could still do with some improving and some of the

source javadoc comments need fixing up but this is already useful so I'm

committing what I have so far.

    /docs/bean-reference.rst (+107, -0)
docs: Fix incorrectly indented line causing rst warning

docs: Add sections on FTP, SFTP and WHOIS to config guide

docs: Add a plugin for basic auto-generation of bean examples

    /docs/_ext/beandoc.py (+81, -0)
docs: s/Most/More/ documentation lives on the wiki

A large amount of core documentation has now been migrated here. There's

still a lot to go but I don't think 'most' applies anymore.

docs: Remove inaccurate 2018 date from documentation footer

Large portions of the docs were first published much earlier in other

places and some have been updated since then. Rather than trying to

keep a date range up to date let's just remove the date. We don't use

dates in the source code boilerplate and my understanding is copyright

notices are not mandatory in almost all countries due to the Berne

Convention anyway so it's really just informational.

docs: Add subdirectory explanation from wiki install page

docs: Add an operating guide based on the contents of the wiki

    /docs/operating.rst (+749, -0)
docs: add getting-started.rst, configuring-jobs.rst and glossary.rst

Compiled from the wiki with some restructuring, reformatting and

updating for the current version of Heritrix.

    /docs/configuring-jobs.rst (+418, -0)
    /docs/getting-started.rst (+129, -0)
    /docs/glossary.rst (+344, -0)
Updated javadoc configuration

[maven-release-plugin] prepare for next development iteration

  1. … 24 more files in changeset.
[maven-release-plugin] prepare release netarchivesuite-7.1

  1. … 24 more files in changeset.
Updated complete settings.

ExtractorChrome: Capture requests made by the browser

This adds a `captureRequests` flag to ExtractorChrome which is enabled

by default and causes requests made by the browser to be captured via

the devtools Network domain. Captured browser requests are sent to the

disposition chain for WARC writing and also to the statistics tracker and

crawl log.

Browser requests are given the annotation "browser" so they can be

easily distinguished in the log from normal requests.

There are quite a few limitations that will be addressed in followup

work:

* The frontier is entirely unaware of browser requests. This means they

bypass quotas, ignore scope rules and politeness, and don't count towards

the statistics tracked by the frontier itself.

* There's no replay of previously saved resources so duplicate requests

for the same URL end up being made.

* Heritrix's extractors do not currently process browser requests.

* Various error and failure cases likely need improving.

Merge pull request #410 from internetarchive/warc-writer-stats-fixes

Warc writer stats fixes