heritrix3

Updated manually to new SNAPSHOT version

Merge branch 'inline-image-filter' into h3.4-merge

Merge branch 'crawltrap-regex-timeout' into h3.4-merge

change trough dedup `date` type to varchar.

By parsing/unparsing to/from java.util.Date, we ended up with a

different date format in trough (sqlite) than warcprox, which is no

good; see https://github.com/internetarchive/warcprox/pull/144
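
The format drift described above can be illustrated with a minimal sketch. The two patterns below are hypothetical stand-ins, not the actual formats used by trough or warcprox; the point is only that a string round-tripped through java.util.Date comes back in whatever format the writer chooses, while a string stored as varchar comes back unchanged.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateRoundTrip {
    // Hypothetical input and output patterns, chosen only for illustration.
    private static final String INPUT_PATTERN = "yyyy-MM-dd'T'HH:mm:ss'Z'";
    private static final String OUTPUT_PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";

    /** Parses the string into a java.util.Date, then formats it again. */
    static String roundTrip(String value) {
        try {
            SimpleDateFormat in = new SimpleDateFormat(INPUT_PATTERN, Locale.ROOT);
            in.setTimeZone(TimeZone.getTimeZone("UTC"));
            Date parsed = in.parse(value);

            SimpleDateFormat out = new SimpleDateFormat(OUTPUT_PATTERN, Locale.ROOT);
            out.setTimeZone(TimeZone.getTimeZone("UTC"));
            return out.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable date: " + value, e);
        }
    }

    /** Treats the value as an opaque string, as a varchar column would. */
    static String passThrough(String value) {
        return value;
    }
}
```

With these assumed patterns, roundTrip("2019-07-15T12:34:56Z") returns "2019-07-15 12:34:56.000", whereas passThrough returns the input unchanged, so two writers that share the string agree byte for byte.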

use JSONObject.isNull()

because opt() returns org.json.JSONObject.Null

use org.json like everybody else

extract watch page links from youtube playlists

and equivalent for other sites. Usually we find these links through

normal link extraction, but we have the info here, so we may as well use

it to make sure.

fix non-playlist case (oops!)

Removed mistaken additions

    • -13 +0 /.idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml
    • -13 +0 /.idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml
    • -13 +0 /.idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml
    • -13 +0 /.idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml
    • -13 +0 /.idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml
    • -13 +0 /.idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml
    • -13 +0 /.idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml
    • -13 +0 /.idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml
  1. … 35 more files in changeset.
removed .iml file

    • -0 +13 /.idea/libraries/Maven__com_101tec_zkclient_0_7.xml
    • -0 +13 /.idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml
    • -0 +13 /.idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml
    • -0 +13 /.idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml
    • -0 +13 /.idea/libraries/Maven__com_sleepycat_je_4_1_6.xml
    • -0 +13 /.idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml
    • -0 +13 /.idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml
    • -0 +13 /.idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml
    • -0 +13 /.idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml
    • -0 +13 /.idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml
    • -0 +13 /.idea/libraries/Maven__junit_junit_4_10.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml
  1. … 36 more files in changeset.
Merge remote-tracking branch 'origin/crawltrap-regex-timeout' into crawltrap-regex-timeout

Updated manually to new SNAPSHOT version

Added a timeout to crawlertrap regex matching
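
One common way to bound regex matching time in Java is to wrap the input in a CharSequence that checks a deadline on every character read. The sketch below shows that general technique only; it is not taken from the Heritrix change, and the class and method names (TimedRegex, matchesWithin) are hypothetical. Treating a timeout as "no match" is also an assumption; a caller could equally let the exception propagate.

```java
import java.util.regex.Pattern;

final class TimedRegex {
    /** Thrown from charAt() once the deadline has passed. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence view that aborts reads after a deadline. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads input through charAt(), so a runaway
            // backtracking match keeps passing through this check.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns true only if the whole input matches before the timeout. */
    static boolean matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
        } catch (RegexTimeoutException e) {
            return false; // assumption: a timed-out match counts as no match
        }
    }
}
```

A caller would use it as TimedRegex.matchesWithin(Pattern.compile("a+b"), uri, 1000). The check runs on the matching thread itself, so no second thread or interrupt is needed.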

Merge branch 'master' into upgrade-bdb-je

be consistent and null-safe with concurrentTo

Merge pull request #286 from internetarchive/adds-subqueue-support-forced-queue-assignment

Add support for forced queue assignment and parallel queues

fix line ending and indentation issues

AssignmentLevelSurtQueueAssignmentPolicy.java - Add support for forced queue assignment and parallel queues

URIAuthorityBasedQueueAssignmentPolicy.java - Add interoperability between forced queue assignment and parallel queues

QuotaEnforcer.java - Fix javadoc to match default behavior

H3 version merged with IIPC master

Merge pull request #283 from internetarchive/jobdir-put-fix

Fix jobdir PUT

Override PUT so it doesn't change the file extension

Fixes #282 and HER-1907

Use super.getVariants() rather than super.getVariants(GET)

This was a regression introduced in the upgrade to Restlet 2. I

encountered a NullPointerException here when upgrading and misunderstood

the cause of it. Since PUT and DELETE return no content they are

actually supposed to return null.

Merge pull request #280 from internetarchive/fix-cookie-test-failures

Mitigate random CookieStore.testConcurrentLoad test failures

Remove testConcurrentLoad

Noah wrote in #280:

> Maybe we should just drop the test. The assumption when we wrote the

> test was that a race condition would not be so frequent in practice.

> We've seen that under the contrived conditions created by the test

> case, it is frequent. But that's ok

Merge branch 'master' into upgrade-bdb-je

Mitigate random CookieStore.testConcurrentLoad test failures

The arbitrary value `25` was used, but in practice it's quite possible

for more than 25 writing threads to have checked the cookie count

limit before adding their cookie. In practice we see Travis failing

on this test quite often, every few builds in fact.

I think using `threads.length` (i.e. 200) should cover the worst

case possibility where every thread reads a stale count and tries

to add their cookie.

Fixes #274
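
The worst case described above (every writer reads a stale count, then adds anyway) can be reproduced deterministically with a small sketch. This is not the CookieStore code or its test; the class CheckThenActRace and its counter are hypothetical stand-ins for an unsynchronized check-then-add. A barrier forces all threads to finish the check before any of them acts.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

final class CheckThenActRace {
    /**
     * Starts with one free slot below the limit and lets every thread
     * check the limit before any thread adds. Returns the final count.
     */
    static int worstCaseSize(int limit, int threads) {
        final AtomicInteger count = new AtomicInteger(limit - 1);
        final CyclicBarrier allChecked = new CyclicBarrier(threads);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                boolean hasRoom = count.get() < limit; // check
                try {
                    allChecked.await(); // wait until every thread has checked
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
                if (hasRoom) {
                    count.incrementAndGet(); // act on the stale check
                }
            });
            workers[i].start();
        }

        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return count.get();
    }
}
```

With a limit of 10 and 200 threads, worstCaseSize returns 209: the count overshoots the limit by 199, far more than 25 but still within a tolerance of `threads.length`.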

Add missing UUID import (interactive commit fail)

Fix digest authentication

In Restlet 2 it appears we need to use DigestAuthenticator.

(Previously both digest and basic auth were handled by the same

Guard class.)

Link to javadoc.io for more recent api docs

Avoid using Thread.interrupt as this freaks out BDB-JE.
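
The commit message does not say what replaced the interrupt. One general alternative is a cooperative stop flag that the worker polls, so that no thread is ever interrupted in the middle of I/O (in Java, interrupting a thread blocked on an interruptible NIO channel closes that channel). The sketch below illustrates that pattern only; StoppableWorker and its methods are hypothetical names, not Heritrix classes.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

final class StoppableWorker implements Runnable {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final AtomicLong iterations = new AtomicLong();

    /** Asks the worker to finish after its current unit of work. */
    void requestStop() {
        stopRequested.set(true);
    }

    long iterations() {
        return iterations.get();
    }

    @Override
    public void run() {
        // The loop exits on its own; Thread.interrupt() is never called.
        while (!stopRequested.get()) {
            iterations.incrementAndGet(); // stand-in for one unit of work
            Thread.yield();
        }
    }

    /** Starts a worker, requests a stop, and reports whether it exited in time. */
    static boolean runAndStop(long joinMillis) {
        StoppableWorker worker = new StoppableWorker();
        Thread thread = new Thread(worker);
        thread.start();
        worker.requestStop();
        try {
            thread.join(joinMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !thread.isAlive();
    }
}
```

The trade-off is latency: the worker only notices the request between units of work, so each unit needs to be short or to check the flag itself.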