heritrix3

Fixed regex timeout handling following suggestion https://github.com/internetarchive/heritrix3/pull/290#discussion_r366711640

Commented this test back in to make Travis happy

I think this is a single commit aa5ff1dbdaefd04652a9c66506d20f1a6ae01dc3 which we could offer as a pull request.

This is already in ia/master

This is something we added because heritrix was treating inline image data as links. I think we should make a pull request for it.
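
The comment does not say how the change avoids the problem. As a minimal sketch, assuming the fix amounts to skipping extracted candidates that use the `data:` URI scheme, a filter could look like the following. The class and method names are hypothetical and are not taken from the Heritrix code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, for illustration only (not part of Heritrix). */
final class InlineDataFilter {

    /** Returns true if the candidate looks like inline data rather than a link. */
    static boolean isInlineData(String candidate) {
        // URI schemes are case-insensitive, so normalise before comparing.
        return candidate.trim().toLowerCase(Locale.ROOT).startsWith("data:");
    }

    /** Returns the candidates that are not inline data, in their original order. */
    static List<String> dropInlineData(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (!isInlineData(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With such a check in front of link scheduling, a value like `data:image/png;base64,...` would be dropped, while ordinary relative and absolute URLs pass through unchanged.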

This is already what is in ia/master.

This is already in ia/master

Removed mistaken additions

    • -13 +0 /.idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml
    • -13 +0 /.idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml
    • -13 +0 /.idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml
    • -13 +0 /.idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml
    • -13 +0 /.idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml
    • -13 +0 /.idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml
    • -13 +0 /.idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml
    • -13 +0 /.idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml
    • -13 +0 /.idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml
    • … 35 more files in changeset.

removed .iml file

    • -0 +13 /.idea/libraries/Maven__com_101tec_zkclient_0_7.xml
    • -0 +13 /.idea/libraries/Maven__com_google_code_gson_gson_2_2_4.xml
    • -0 +13 /.idea/libraries/Maven__com_googlecode_json_simple_json_simple_1_1_1.xml
    • -0 +13 /.idea/libraries/Maven__com_rethinkdb_rethinkdb_driver_2_3_3.xml
    • -0 +13 /.idea/libraries/Maven__com_sleepycat_je_4_1_6.xml
    • -0 +13 /.idea/libraries/Maven__com_sun_istack_istack_commons_runtime_3_0_7.xml
    • -0 +13 /.idea/libraries/Maven__com_sun_xml_fastinfoset_FastInfoset_1_2_15.xml
    • -0 +13 /.idea/libraries/Maven__javax_activation_javax_activation_api_1_2_0.xml
    • -0 +13 /.idea/libraries/Maven__javax_servlet_javax_servlet_api_3_1_0.xml
    • -0 +13 /.idea/libraries/Maven__javax_xml_bind_jaxb_api_2_3_1.xml
    • -0 +13 /.idea/libraries/Maven__junit_junit_4_10.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_avro_avro_1_7_6_cdh5_3_5.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_curator_curator_client_2_6_0.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_curator_curator_framework_2_6_0.xml
    • -0 +13 /.idea/libraries/Maven__org_apache_curator_curator_recipes_2_6_0.xml
    • … 36 more files in changeset.

Merge remote-tracking branch 'origin/crawltrap-regex-timeout' into crawltrap-regex-timeout

Updated manually to new SNAPSHOT version

Added a timeout to crawlertrap regex matching

Is this test still green?

So let's remove it!

Merge issues NAS-heritrix/IIPC-heritrix
Obviously something weird here as "contrib" is there twice,

?? Where does this come from? What does it do?

?

The fallback is "false", meaning no match, meaning "accept this url". Is this the best choice? Does it matter? Should the behaviour be configurable?

This commit is mission-critical for us because we have had serious problems with pathological regexes. To get it accepted, we should probably make the default behaviour backwards compatible, i.e. an infinite timeout, even though that's probably a terrible idea. I'd like to persuade Andy to allow a sensible default like 20s.

(There's also a possibly better solution which is to use a 3rd party regex engine with guaranteed runtime complexity e.g. https://www.brics.dk/automaton/faq.html)
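
For reference, the timeout and fallback questions raised above can be illustrated with a small sketch. It is not the code under review: all class names here are hypothetical, and it assumes the timeout is enforced by wrapping the input in a `CharSequence` that checks a deadline whenever the regex engine reads a character. It also assumes, purely for illustration, that a timeout of zero or less means "no timeout" (the backwards-compatible behaviour) and that the result on timeout is passed in by the caller rather than fixed to `false`.

```java
import java.util.regex.Pattern;

/** Thrown when matching runs past its deadline (hypothetical type). */
final class RegexTimeoutException extends RuntimeException {}

/** CharSequence wrapper that aborts matching once a deadline has passed. */
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    @Override
    public char charAt(int index) {
        // The regex engine reads characters repeatedly while backtracking,
        // so this is a cheap place to check the clock.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new RegexTimeoutException();
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

final class TimedRegex {
    /**
     * Matches input against pattern, giving up after timeoutMillis.
     * Assumption: timeoutMillis <= 0 means no timeout at all.
     * onTimeout is the value returned if the deadline is exceeded.
     */
    static boolean matches(Pattern pattern, CharSequence input,
                           long timeoutMillis, boolean onTimeout) {
        if (timeoutMillis <= 0) {
            return pattern.matcher(input).matches();
        }
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
        } catch (RegexTimeoutException e) {
            return onTimeout;
        }
    }
}
```

A call such as `TimedRegex.matches(pattern, url, 20_000, false)` would correspond to the 20s default suggested above, with a timeout treated as "no match". Making `onTimeout` a parameter leaves open whether a timed-out URL should be accepted or rejected.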

This shouldn't be hardcoded. Why is this not just a bean-value that can be set in crawler beans?
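
The comment does not say which value is hardcoded. As a rough sketch, assuming it is a numeric setting such as a timeout, exposing it as a bean value could be as simple as a field with a default plus a getter and setter. The class and property names below are hypothetical.

```java
/** Hypothetical settings holder; the names are illustrative only. */
final class TimeoutSettings {

    // Assumed default: 0, read here as "no timeout".
    private long timeoutSeconds = 0;

    public long getTimeoutSeconds() {
        return timeoutSeconds;
    }

    public void setTimeoutSeconds(long timeoutSeconds) {
        // Reject negative values early so a bad configuration fails fast.
        if (timeoutSeconds < 0) {
            throw new IllegalArgumentException("timeoutSeconds must be >= 0");
        }
        this.timeoutSeconds = timeoutSeconds;
    }
}
```

With a setter like this, a Spring-style bean definition could override the default with a line such as `<property name="timeoutSeconds" value="20"/>`, so no code change is needed to tune the value per crawl.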

Maybe move/copy the javadoc to the super-class?

I think this is the main change added to enable access to the frontier queue, so it ought really to have some javadoc.

I think the following three methods just expose some internals so that they can be accessed from scripts, i.e. there should be no good reason to object to them.

?

There are/were some issues with the fact that the contrib package was disabled in the LBS releases. Can we check whether it is enabled by default when the IIPC release is built? If so, maybe this assembly plugin isn't necessary.

Presumably both these exclusions are correct.

This class is not used and not necessary - the functionality is standard in all Processors.

Merge pull request #286 from internetarchive/adds-subqueue-support-forced-queue-assignment

Add support for forced queue assignment and parallel queues