alex osborne <aosborne@nla.gov.au> in heritrix3

Merge pull request #431 from internetarchive/extractor-chrome-bug-fixes

ExtractorChrome bug fixes

ExtractorChrome: Warn instead of throwing when response headers missing

Colin reported this exception. I'm uncertain how it can occur, though, as FetchHTTP should populate the response headers. Perhaps a different fetch module was used?
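A minimal sketch of the guard pattern this commit describes, logging a warning and skipping extraction instead of throwing. HeaderGuard, checkHeaders() and the message text are illustrative stand-ins, not Heritrix's actual API:

```java
import java.util.logging.Logger;

class HeaderGuard {
    private static final Logger logger =
            Logger.getLogger(HeaderGuard.class.getName());

    /** Returns true if extraction should proceed; logs a warning and
     *  returns false when the response headers are absent. */
    static boolean checkHeaders(Object responseHeaders, String uri) {
        if (responseHeaders == null) {
            logger.warning("Response headers unavailable for " + uri
                    + " (was a fetch module other than FetchHTTP used?)");
            return false;
        }
        return true;
    }
}
```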

ChromeWindow: Handle raw headers or headersText being unavailable

Colin encountered headersText being unavailable. (Maybe HTTP/2?)

ExtractorChrome: Don't capture data: URIs

They are already captured as part of their containing document. There's

no need to record them separately.

#430

Merge pull request #424 from internetarchive/ui-cleanup

UI: Refactor duplicate template rendering code

ChromeClient: increase RPC timeout from 10 to 60 seconds

We hit a timeout during CI. The timeout is just a safety measure in case the browser hangs, so it doesn't hurt to have it higher. A higher value will hopefully help if the system temporarily stalls for some reason (garbage collection, IO issues, VM migration, etc.).

Merge pull request #423 from internetarchive/dont-extract-data-uris

Don't extract data URIs

UI: Pull duplicate getEngine() methods up to BaseResource

Each direct subclass of BaseResource defines an identical getEngine() method, so let's pull it up to BaseResource. We also no longer need the type cast, as BaseResource.getApplication() does it for us.
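The shape of the pull-up can be sketched with stand-in types (Engine, UiApplication and the *Sketch classes below are hypothetical; only the structure mirrors the real change):

```java
class Engine {
}

class UiApplication {
    private final Engine engine = new Engine();
    Engine getEngine() { return engine; }
}

abstract class BaseResourceSketch {
    private final UiApplication application = new UiApplication();

    // getApplication() already returns the concrete application type,
    // so getEngine() needs no cast.
    UiApplication getApplication() { return application; }

    // Defined once here instead of identically in every direct subclass.
    Engine getEngine() { return getApplication().getEngine(); }
}

class EngineResourceSketch extends BaseResourceSketch {
    // Inherits getEngine() rather than redefining it.
}
```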

UI: Refactor duplicated template code into a common render() helper

We remove the calls to setCharacterSet(UTF_8) since

WriterRepresentation's constructor does that anyway.
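A hedged sketch of the consolidation, assuming a shared template engine behind a single render() helper. TemplateEngine here stands in for the Freemarker configuration, and the String return stands in for the Restlet WriterRepresentation (whose constructor already sets the UTF-8 character set):

```java
import java.io.StringWriter;
import java.util.Map;

// Stand-in for the shared Freemarker configuration.
interface TemplateEngine {
    void process(String templateName, Map<String, Object> model, StringWriter out);
}

class TemplateRenderer {
    private final TemplateEngine engine;

    TemplateRenderer(TemplateEngine engine) { this.engine = engine; }

    // The single helper that replaces the per-resource copies of the
    // template-rendering code.
    String render(String templateName, Map<String, Object> model) {
        StringWriter out = new StringWriter();
        engine.process(templateName, model, out);
        return out.toString();
    }
}
```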

UI: Use a single instance of Freemarker for the whole application

So we don't need to configure it separately in every resource class that

uses HTML templates.

Merge pull request #421 from internetarchive/toe-thread-interrupt-fix

ToeThread: ensure currentCuri is finished before exiting

Merge pull request #418 from internetarchive/fix-keytool-on-jdk16

JDK 16 compatibility

ExtractorSitemap: Use logUriError() helper like other extractors

ExtractorHTML: Avoid allocating strings for data: URIs when possible

Data URIs can be very large. ExtractorHTML mostly works with off-heap CharSequences, so by delaying the conversion of outlinks to strings until after filtering out data URIs we can potentially avoid some very large String allocations.
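The idea can be sketched as follows: test the "data:" prefix on the CharSequence directly, so toString() (and its potentially huge allocation) only runs for non-data URIs. DataUriFilter and its method names are illustrative, not ExtractorHTML's actual API:

```java
class DataUriFilter {
    // Case-insensitive prefix check that never materializes a String.
    static boolean isDataUri(CharSequence cs) {
        String prefix = "data:";
        if (cs.length() < prefix.length()) {
            return false;
        }
        for (int i = 0; i < prefix.length(); i++) {
            if (Character.toLowerCase(cs.charAt(i)) != prefix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Returns the outlink as a String, or null when it should be skipped;
     *  the conversion happens only after the cheap filter. */
    static String toOutlink(CharSequence candidate) {
        return isDataUri(candidate) ? null : candidate.toString();
    }
}
```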

Extractor: ignore data URIs when adding outlinks

ExtractorPDFContext, ExtractorYoutubeDL: use addOutlink() helper method

Merge pull request #416 from internetarchive/extractor-chrome-replay-responses

ExtractorChrome: reduce request duplication between browser and frontier

ToeThread: ensure currentCuri is finished before exiting

Thread interruption and certain other exceptions can cause a toe thread

to exit without informing the frontier that the current CrawlURI is

finished. This causes the job to get permanently stuck in the STOPPING

state.

This change adds a section to the finally block that will finish any

unfinished CrawlURI.

We also move the continueCheck() call after setCurrentCuri() to ensure

there's no window where InterruptedException can be thrown after the

frontier returns the next CrawlURI but before it gets assigned to

currentCuri.

Fixes #420
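The two parts of the fix (moving the interruption check after the assignment, and the finally-block safety net) can be sketched with simplified stand-in types; FrontierSketch, CrawlUriSketch and ToeThreadSketch are illustrative, not Heritrix's real classes:

```java
class CrawlUriSketch {
    boolean finished = false;
}

class FrontierSketch {
    int finishedCount = 0;

    CrawlUriSketch next() {
        return new CrawlUriSketch();
    }

    void finished(CrawlUriSketch curi) {
        if (!curi.finished) {
            curi.finished = true;
            finishedCount++;
        }
    }
}

class ToeThreadSketch implements Runnable {
    private final FrontierSketch frontier;
    private CrawlUriSketch currentCuri;

    ToeThreadSketch(FrontierSketch frontier) { this.frontier = frontier; }

    @Override
    public void run() {
        try {
            currentCuri = frontier.next();
            // The interruption check now runs *after* the assignment, so
            // an interrupt can no longer strand a CrawlURI between the
            // frontier returning it and it being assigned to currentCuri.
            if (Thread.interrupted()) {
                throw new InterruptedException();
            }
            // ... fetch and process currentCuri ...
            frontier.finished(currentCuri);
            currentCuri = null;
        } catch (InterruptedException e) {
            // thread was asked to stop
        } finally {
            // New safety net: finish any leftover CrawlURI so the job
            // can't get stuck in the STOPPING state.
            if (currentCuri != null) {
                frontier.finished(currentCuri);
                currentCuri = null;
            }
        }
    }
}
```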

Upgrade Groovy to latest stable version (3.0.8) for JDK 16 compatibility

Fixes #419

GitHub actions: run test suite on JDK 16 too

KeyTool wrapper: fallback to running keytool as a subprocess on JDK 16+

JDK 16 defaults to --illegal-access=deny, which means trying to call KeyTool via reflection now throws IllegalAccessException.

Fixes #417
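A sketch of the fallback strategy under stated assumptions: runViaReflection() stands in for the reflective call into the JDK's keytool class, and on IllegalAccessException we build a subprocess command from the running JDK's own keytool launcher. Class and method names are hypothetical:

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeyToolFallbackSketch {
    // Stand-in for the reflective in-process call, which JDK 16's
    // --illegal-access=deny default turns into an IllegalAccessException.
    static void runViaReflection(String... args) throws IllegalAccessException {
        throw new IllegalAccessException("keytool class is not accessible");
    }

    // Build the fallback command; java.home points at the running JDK,
    // whose bin/ directory contains the keytool launcher.
    static List<String> subprocessCommand(String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(Paths.get(System.getProperty("java.home"), "bin", "keytool").toString());
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    /** Returns null if the in-process call worked, otherwise the
     *  subprocess command that would be launched instead, e.g. via
     *  new ProcessBuilder(cmd).inheritIO().start(). */
    static List<String> run(String... args) {
        try {
            runViaReflection(args);
            return null;
        } catch (IllegalAccessException e) {
            return subprocessCommand(args);
        }
    }
}
```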

ExtractorChrome: have frontier consider browser-fetched uris included

Since we now run extractors on subresources, there's no reason to schedule and fetch them again.

Note that duplicate fetches can still occur if the URI was already

scheduled or if the browser itself refetches the resource.

ExtractorChrome: run extractors on subresources captured by the browser

This ensures we discover links in subresources even if the browser

doesn't happen to load them. For example a CSS file might link to images

that the browser won't load as they're gated by media queries.

ExtractorChrome: replay the recorded CrawlURI response to the browser

By intercepting the browser's request and fulfilling it using the

response previously recorded by FetchHTTP we avoid sending duplicate

requests for the CrawlURI to the web server.

A size limit (maxReplayLength) is applied as a safety measure since the

browser's Fetch.fulfillRequest API requires us to load the entire

response body into memory.

Note: This only applies to the main CrawlURI. The browser can still

make duplicate requests when loading sub-resources. Solving this for

sub-resources will require implementing the ability to read back

previously written WARC records.
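The size-limit decision can be sketched as a small guard: replay from the recorded response only when the body fits under maxReplayLength, otherwise let the browser fetch from the network as before. RecordedBody and ReplayDecider are illustrative stand-ins; the real code fulfills the request through the browser's Fetch.fulfillRequest API:

```java
class RecordedBody {
    final byte[] bytes;
    RecordedBody(byte[] bytes) { this.bytes = bytes; }
}

class ReplayDecider {
    private final long maxReplayLength;

    ReplayDecider(long maxReplayLength) { this.maxReplayLength = maxReplayLength; }

    // true  -> fulfill the browser's request from the recorded response
    // false -> continue the request, letting the browser fetch normally
    boolean shouldReplay(RecordedBody recorded) {
        return recorded != null && recorded.bytes.length <= maxReplayLength;
    }
}
```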

Merge pull request #414 from internetarchive/maven-assembly-plugin-3.3.0

Upgrade maven-assembly-plugin to 3.3.0 to fix file permissions

Upgrade maven-assembly-plugin to 3.3.0 to fix file permissions

The old default version of maven-assembly-plugin generates packages containing dangerous world-writable files.

Fixes #413

docs: Add decide rules to bean reference

Everywhere: Fix dangling JavaDoc comments

JavaDoc comments need to be directly above a class, method or field declaration to be used by IDEs and generated documentation. We had some doc comments that were attached to instance initialization blocks and thus were being ignored. This invalid doc comment positioning was presumably an accidental consequence of the conversion of fields to KeyedProperties.

This change moves most of the dangling doc comments to setter methods where they can be seen by tools. There were a couple of dangling doc comments above package statements; these are moved above the class declaration or removed entirely when empty.

docs: Fallback to field javadoc comments when setter javadoc unavailable

This enables us to generate documentation for more bean properties, although a number of beans have javadoc on initializer code blocks, which makes it hard to access. This affects javadoc and IDE contextual documentation too, so it should probably be fixed in the source code itself.

Merge pull request #411 from internetarchive/chrome-request-capturing

ExtractorChrome: Capture requests made by the browser