  NetarchiveSuite / NAS-2050

The indexes generated by the IndexServer demand too much memory


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: I51, 3.20.0
    • Affects Version/s: 3.18.0
    • Component/s: Archive, IndexServer
    • Labels: None
      • Give the IndexServer local access to a large number of PROD metadata-arc files on the index server (using LocalArcRepository as the arcrepositoryClient), and make the IndexServer always return the same large index, built from 8000 or 9000 jobs, regardless of the request (using the TestIndexRequestServer class; the jobs that make up the index are listed in the file given as argument to TestIndexRequestServer).
        See kb-test-acs-001.kb.dk:/home/test/LUCENETEST/conf/settings_IndexServerApplication.xml
        for how this is done.
      • Ingest 100,000 domains.
      • Start a snapshot crawl with the maximum number of objects per domain set to 100 (to avoid getting complaints).
      • Let one or two jobs complete, and check that they do not go into paused mode (a symptom of an OOM error in Heritrix).

    Description

      It turns out that the Lucene indices generated in 3.18 require too much memory in Heritrix, even though the actual index size is smaller.

      Currently, we have rolled back the code that builds the index in parallel and then merges the partial indices into one.
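
      For readers unfamiliar with the pattern mentioned above, the following is a minimal sketch of "build partial indices in parallel, then merge them into one". It is not the NetarchiveSuite or Lucene code: plain in-memory maps stand in for Lucene indices, the class name ParallelIndexSketch, its methods, and the "digest origin" record format are all hypothetical, and it does not reproduce the memory problem reported here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: hypothetical names, in-memory maps instead of Lucene indices. */
public class ParallelIndexSketch {

    /** Builds a partial index (digest -> origin) from one batch of "digest origin" records. */
    static Map<String, String> buildPartialIndex(List<String> records) {
        Map<String, String> partial = new TreeMap<>();
        for (String record : records) {
            String[] fields = record.split(" ", 2);
            // Keep the first origin seen for each digest.
            partial.putIfAbsent(fields[0], fields[1]);
        }
        return partial;
    }

    /** Builds one partial index per batch in parallel, then merges them in batch order. */
    static Map<String, String> buildMergedIndex(List<List<String>> batches, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, String>>> futures = new ArrayList<>();
            for (List<String> batch : batches) {
                Callable<Map<String, String>> task = () -> buildPartialIndex(batch);
                futures.add(pool.submit(task));
            }
            Map<String, String> merged = new TreeMap<>();
            for (Future<Map<String, String>> future : futures) {
                // Futures are read in submission order, so earlier batches win on duplicates.
                for (Map.Entry<String, String> entry : future.get().entrySet()) {
                    merged.putIfAbsent(entry.getKey(), entry.getValue());
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("sha1:aaa job1.arc", "sha1:bbb job1.arc"),
                Arrays.asList("sha1:bbb job2.arc", "sha1:ccc job2.arc"));
        System.out.println(buildMergedIndex(batches, 2));
    }
}
```

      In a Lucene-based implementation the merge step would presumably operate on index directories rather than maps; how the merged index is structured is what determines how much memory a consumer such as Heritrix needs to open it.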

      Attachments

        1. DeDuplicator.java (46 kB)
        2. DigestIndexer.java (17 kB)
        3. SparseBitSet.java (11 kB)
        4. SparseRangeFilter.java (5 kB)


          People

            Søren Vejrup Carlsen (Inactive)
            Søren Vejrup Carlsen (Inactive)
            Nicholas Clarke (Inactive)
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: