NetarchiveSuite / NAS-2803

Heritrix can hang on a pathological regex

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 5.5
    • None
    • Heritrix 3
    • None

      Run a harvest with a single seed
      http://www.netarkivet.dk/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      and a single crawler-trap
      http://www\.netarkivet\.dk/((x+x+)+)y

      Look for the message in the logs reporting that the regex match timed out, something like

      2018-11-08 13:07:10.586 INFO thread-25 org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate() Timeout matching regex 'http://www.netarkivet.dk/((x+x+)+)y' to url 'http://www.netarkivet.dk/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
      

      in the Heritrix error log.


    Description

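      MatchesListRegexDecideRule needs to be patched to apply a timeout on regex matches. Without one, a crawler-trap regex with nested quantifiers, such as ((x+x+)+)y, can backtrack for an effectively unbounded time on a non-matching URL and hang the crawl.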
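
One way to bound a regex match in Java is sketched below. This is not the actual Heritrix patch, only an illustration of the idea: the input is wrapped in a CharSequence that checks a deadline on every character access, so a runaway match aborts itself. The names RegexTimeoutSketch, DeadlineCharSequence, RegexTimeoutException and matchesWithTimeout are hypothetical, and treating a timeout as "no match" is an assumption; a real rule would probably also log the timeout, as in the message quoted above.

```java
import java.util.regex.Pattern;

public class RegexTimeoutSketch {

    /** Hypothetical exception thrown when a match runs past its deadline. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence view that checks a deadline on every character access. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt, so a
            // backtracking match reaches this check again and again.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /**
     * Returns true only if the whole input matches within timeoutMillis.
     * Assumption: a timeout is reported as "no match".
     */
    public static boolean matchesWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
        } catch (RegexTimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Seed and crawler-trap taken from the reproduction steps.
        Pattern trap = Pattern.compile("http://www\\.netarkivet\\.dk/((x+x+)+)y");
        String url = "http://www.netarkivet.dk/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
        System.out.println(matchesWithTimeout(trap, url, 1000));
    }
}
```

The deadline check adds a small cost to every character read, which seems an acceptable trade for not having to run each match on a separate thread that could not be stopped reliably anyway.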

          People

            csr Colin Rosenthal