Dashboard

Use https for chaos-urls

I think this is a single commit aa5ff1dbdaefd04652a9c66506d20f1a6ae01dc3 which we could offer as a pull request.

This is already in ia/master

This is something we added because heritrix was treating inline image data as links. I think we should make a pull request for it.
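
One way to picture this kind of guard is a small pre-filter that drops `data:` URIs before link candidates are queued. This is only an illustrative sketch, not the code from the commit under review; the class and method names (`InlineDataFilter`, `isInlineData`, `dropInlineData`) are hypothetical and do not come from heritrix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of heritrix: filters inline data out of link candidates.
final class InlineDataFilter {
    private InlineDataFilter() {}

    /** True if the candidate is a data: URI (scheme compared case-insensitively). */
    static boolean isInlineData(String candidate) {
        if (candidate == null) {
            return false;
        }
        String s = candidate.trim();
        // regionMatches returns false if s is shorter than the 5-character prefix.
        return s.regionMatches(true, 0, "data:", 0, 5);
    }

    /** Returns the candidates that are not inline data, preserving their order. */
    static List<String> dropInlineData(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            if (!isInlineData(c)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

A caller would pass every extracted attribute value through `dropInlineData` before treating the remainder as outlinks.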

This is already what is in ia/master.

This is already in ia/master

Removed mistaken additions

    -13 +0  /.idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml
    -13 +0  /.idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml
    -13 +0  /.idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml
    -13 +0  /.idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml
    -13 +0  /.idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml
    -13 +0  /.idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml
    -13 +0  /.idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml
    -13 +0  /.idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml
    -13 +0  /.idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml
    -13 +0  /.idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml
    -13 +0  /.idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml
    -13 +0  /.idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml
    … 35 more files in changeset.
removed .iml file

    -0 +13  /.idea/libraries/Maven__com_101tec_zkclient_0_7.xml
    -0 +13  /.idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml
    -0 +13  /.idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml
    -0 +13  /.idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml
    -0 +13  /.idea/libraries/Maven__com_sleepycat_je_4_1_6.xml
    -0 +13  /.idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml
    -0 +13  /.idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml
    -0 +13  /.idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml
    -0 +13  /.idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml
    -0 +13  /.idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml
    -0 +13  /.idea/libraries/Maven__junit_junit_4_10.xml
    -0 +13  /.idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml
    -0 +13  /.idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml
    -0 +13  /.idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml
    -0 +13  /.idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml
    … 36 more files in changeset.
Merge remote-tracking branch 'origin/crawltrap-regex-timeout' into crawltrap-regex-timeout

Updated manually to new SNAPSHOT version

Added a timeout to crawlertrap regex matching

Is this test still green?

So let's remove it!

Bump tomcat-embed-core.version from 8.0.32 to 9.0.27

Bumps `tomcat-embed-core.version` from 8.0.32 to 9.0.27.

Updates `tomcat-embed-core` from 8.0.32 to 9.0.27

Updates `tomcat-servlet-api` from 8.0.32 to 9.0.27

Updates `tomcat-embed-jasper` from 8.0.32 to 9.0.27

Updates `tomcat-jsp-api` from 8.0.32 to 9.0.27

Signed-off-by: dependabot[bot] <support@github.com>

Bump lucene-core.version from 4.4.0 to 8.3.0

Bumps `lucene-core.version` from 4.4.0 to 8.3.0.

Updates `lucene-core` from 4.4.0 to 8.3.0

Updates `lucene-analyzers-common` from 4.4.0 to 8.3.0

Signed-off-by: dependabot[bot] <support@github.com>

Bump c3p0 from 0.9.2.1 to 0.9.5.4

Bumps [c3p0](https://github.com/swaldman/c3p0) from 0.9.2.1 to 0.9.5.4.

- [Release notes](https://github.com/swaldman/c3p0/releases)

- [Commits](https://github.com/swaldman/c3p0/compare/c3p0-0.9.2.1...c3p0-0.9.5.4)

Signed-off-by: dependabot[bot] <support@github.com>

Bump commons-fileupload from 1.2.1 to 1.3.3

Bumps commons-fileupload from 1.2.1 to 1.3.3.

Signed-off-by: dependabot[bot] <support@github.com>

Actually want to write requests and metadata by default in tests!

Merge issues NAS-heritrix/IIPC-heritrix
Obviously something weird here as "contrib" is there twice,

?? Where does this come from? What does it do?

?

The fallback is "false", meaning no match, meaning "accept this url". Is this the best choice? Does it matter? Should the behaviour be configurable?

This commit is mission-critical for us because we have had serious problems with pathological regexes. To get it accepted, we should probably make the default behaviour backwards compatible, i.e. an infinite timeout, even though that's probably a terrible idea. I'd like to persuade Andy to allow a sensible default like 20s.

(There's also a possibly better solution which is to use a 3rd party regex engine with guaranteed runtime complexity e.g. https://www.brics.dk/automaton/faq.html)
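
For reference while discussing the default and the fallback, here is a minimal sketch of one common way to bound `java.util.regex` matching time: wrap the input in a `CharSequence` that checks a deadline on every `charAt` call. It is not the code in the commit; the names (`TimedRegex`, `DeadlineCharSequence`, `matchesWithTimeout`) are hypothetical, and treating a timeout of zero or less as "no timeout" is an assumption made here to illustrate the backwards-compatible default.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the heritrix implementation.
final class TimedRegex {
    private TimedRegex() {}

    /** Thrown from charAt() once the deadline has passed. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence view that aborts regex matching when the deadline passes. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Overflow-safe comparison of System.nanoTime() values.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /**
     * Matches the whole input against the pattern.
     * timeoutMillis <= 0 means "no timeout" (the backwards-compatible behaviour);
     * on timeout the caller-supplied fallback is returned instead of a match result.
     */
    static boolean matchesWithTimeout(Pattern pattern, CharSequence input,
                                      long timeoutMillis, boolean fallback) {
        if (timeoutMillis <= 0) {
            return pattern.matcher(input).matches();
        }
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
        } catch (RegexTimeoutException e) {
            return fallback;
        }
    }
}
```

Passing the fallback as a parameter keeps the "accept or reject on timeout" decision with the caller, so it could be wired to a configurable setting. A call with a 20s limit would look like `TimedRegex.matchesWithTimeout(pattern, uri, 20_000, false)`.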

This shouldn't be hardcoded. Why is this not just a bean-value that can be set in crawler beans?

Maybe move/copy the javadoc to the super-class?

I think this is the main change added to enable access to the frontier queue, so it ought really to have some javadoc.

I think the following three methods just expose some internals so that they can be accessed from scripts, i.e. there should be no good reason to object to them.

?

There are/were some issues with the fact that the contrib package was disabled in the LBS releases. Can we check whether it is also disabled by default when the iipc release is built? If not, then maybe this assembly plugin isn't necessary.