  1. NetarchiveSuite
  2. NAS-1722

Statistics (DB access scripts batch jobs ....)

Details

    • New Feature
    • Resolution: Won't Fix
    • Minor
    • None
    • None
    • GUI
    • None
    • ONB

    Description

      Statistics can be requested through the GUI. The user selects which statistics (mimetypes, durations ...) are wanted and on which basis (statistics per job, per harvest definition ...). The request is stored in the database. A statisticProcessor runs in the system and polls for new requests; as soon as new requests arrive, the processor starts working on them. The result can be a report (a PDF, possibly with graphics generated using http://www.jfree.org/jfreechart/), which is also stored in the database (or a reference to it is). In the GUI the user can see the current status of the job and the generated documents. (User management may be necessary for this kind of request.)

      To access all the Heritrix statistics stored in the metadata.arc file, the statistical data must be extracted from the large metadata file located in the bitarchive. These extracts should be cached on the local disk for fast subsequent reads. Such a file, one per job, could use an XML format. For example:

      <?xml version="1.0" encoding="UTF-8"?>
      <StatisticalData><crawlreport><![CDATA[Crawl Name: default_orderxml
      Crawl Status: Finished
      Duration Time: 2m2s979ms
      Total Seeds Crawled: 1
      Total Seeds not Crawled: 0
      Total Hosts Crawled: 1
      Total Documents Crawled: 345
      Processed docs/sec: 2.83
      Bandwidth in Kbytes/sec: 15
      Total Raw Data Size in Bytes: 1976458 (1.9 MB) 
      Novel Bytes: 1976458 (1.9 MB) 
      ]]></crawlreport><mimetypesreport><![CDATA[[#urls] [#bytes] [mime-types]
      167 1191906 image/jpeg
      102 133895 image/gif
      72 643967 text/html
      3 6638 text/css
      1 52 text/dns
      ]]></mimetypesreport></StatisticalData>
      

      For each report within the StatisticalData tag there must be a Reportanalyzer that can read the data stream for further statistical analysis.
      Some Heritrix code could perhaps be reused for this analysis.
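One possible shape for such per-report analyzers, sketched in Python rather than with Heritrix code: each report element gets its own parsing function, looked up by element name. The function names and the `ANALYZERS` registry are hypothetical, and the line formats are inferred only from the example extract above.

```python
import xml.etree.ElementTree as ET


def analyze_crawl_report(text):
    """Parse 'Key: value' lines into a dict; values are kept as strings."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result


def analyze_mimetypes_report(text):
    """Parse '<urls> <bytes> <mime-type>' rows into (mime, urls, bytes) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the '[#urls] [#bytes] [mime-types]' header and malformed lines.
        if len(fields) != 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        rows.append((fields[2], int(fields[0]), int(fields[1])))
    return rows


# One analyzer per report element inside StatisticalData.
ANALYZERS = {
    "crawlreport": analyze_crawl_report,
    "mimetypesreport": analyze_mimetypes_report,
}


def analyze(xml_text):
    """Run the matching analyzer for every known report in the document."""
    root = ET.fromstring(xml_text)
    return {
        child.tag: ANALYZERS[child.tag](child.text or "")
        for child in root
        if child.tag in ANALYZERS
    }
```

Reports without a registered analyzer are ignored here, so supporting a further Heritrix report would only mean adding one function and one registry entry.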

          People

            aponb Andreas P