  1. NetarchiveSuite
  2. NAS-2463

Find which jobs are harvesting a given domain


Details

    • Task
    • Resolution: Fixed
    • Major
    • 5.4
    • None
    • GUI
    • None

      1. In the NAS web GUI, create a new selective harvest definition and have it harvest the domain "netarkivet.dk".
      2. Wait for its status to change to "Started", then go to the menu "Harvest Status"->"All Running Jobs".
      3. "www.dr.dk" is not in the seedlist for this job but will eventually be reached by the crawl. To test the filtering, search for "www.dr.dk" as soon as the crawler has started running and verify that the job is not listed yet. Search again once the job has progressed far enough and check that it is now listed. This confirms that a job is shown even when the domain is not in its seedlist.


    Description

      The task is to find which jobs are harvesting a given domain, even if the domain is not in the seedlist. This boils down to running a regexp search over the crawl jobs of all Running Jobs instances. This should be possible from within NAS: we have links to all relevant Heritrix instances, so we just need to make a REST call to each of them using the "Execute Shell Script in Job" method described at https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.x+API+Guide#Heritrix3.xAPIGuide-ExecuteShellScriptinJob. (I think we need to do a scan of the job directory of each Heritrix first to identify the job id.)
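
A minimal sketch of the approach described above, in Python for brevity (NAS itself is Java, so this is not the actual implementation). It assumes the script endpoint has the form `<base>/engine/job/<job name>/script` and accepts the form fields `engine` and `script`, as in the API guide linked above. The names `build_script_request`, `domain_pattern`, `jobs_harvesting`, `PLACEHOLDER_SCRIPT`, and the `run_script` transport are hypothetical, as is the idea that the script prints one crawled URI per line and that the filtering happens on the client side.

```python
import re
import urllib.parse
import urllib.request

# Hypothetical placeholder: the real script would have to print the URIs
# (one per line) that the job has reached so far.
PLACEHOLDER_SCRIPT = 'rawOut.println("placeholder: print one crawled URI per line")'


def build_script_request(base_url, job_name, script, engine="groovy"):
    """Build (but do not send) a POST to <base_url>/engine/job/<job_name>/script."""
    url = "{}/engine/job/{}/script".format(
        base_url.rstrip("/"), urllib.parse.quote(job_name, safe="")
    )
    body = urllib.parse.urlencode({"engine": engine, "script": script}).encode("ascii")
    return urllib.request.Request(
        url, data=body, headers={"Accept": "application/xml"}, method="POST"
    )


def domain_pattern(domain):
    """Regex matching URIs whose host is `domain` or one of its subdomains."""
    return re.compile(
        r"^[a-z][a-z0-9+.-]*://([^/@]*@)?([^/:]*\.)?"
        + re.escape(domain)
        + r"(:\d+)?(/|$)",
        re.IGNORECASE,
    )


def jobs_harvesting(domain, instances, run_script):
    """Return the names of the jobs whose script output mentions `domain`.

    instances:  iterable of (base_url, job_name) pairs, one per running job.
    run_script: caller-supplied callable (hypothetical) that sends the request,
                handles authentication, and returns the script's raw output
                as plain text.
    """
    pattern = domain_pattern(domain)
    hits = []
    for base_url, job_name in instances:
        request = build_script_request(base_url, job_name, PLACEHOLDER_SCRIPT)
        output = run_script(request)
        if any(pattern.match(line.strip()) for line in output.splitlines()):
            hits.append(job_name)
    return hits
```

Keeping the transport (`run_script`) separate from request construction leaves authentication and TLS handling to the caller and makes the matching logic testable without a running Heritrix instance.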

            People

              jrg Jeppe Ravn-Grove (Inactive)
              csr Colin Rosenthal
              Watchers: 2
