  1. NetarchiveSuite
  2. NAS-2201

Upgrade from Lucene 3.X to Lucene 4.0


Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • 4.2
    • None
    • None
    • None

      Give the IndexServer local access to a large set of PROD metadata ARC files on the index server (using LocalArcRepositoryClient as the arcrepositoryClient), and let the IndexServer always return the same big index, based on 8000 or 9000 jobs, regardless of the request. This is done with the TestIndexRequestServer class; the jobs that make up the index are listed in the file given as argument to TestIndexRequestServer.
      See kb-test-acs-001.kb.dk:/home/test/LUCENETEST/conf/settings_IndexServerApplication.xml for how this is done.
      Ingest 50,000 domains (use the list kb-test-adm-001.kb.dk:/home/test/domains-30-05-2012.txt).
      Start a snapshot crawl; set the maximum number of objects per domain to 100 (to avoid getting complaints).
      Let one job run to completion, and check that it doesn't go into paused mode (a symptom of an OOM error in Heritrix).
      indexserver settings override:

      <common>
        <environmentName>LUCENETEST</environmentName>
        <arcrepositoryClient>
          <class>dk.netarkivet.common.distribute.arcrepository.LocalArcRepositoryClient</class>
          <fileDir>/data/rawdata/prod-metadata</fileDir>
        </arcrepositoryClient>
      ...
      <harvester>
        ..
        <indexserver>
          <indexrequestserver>
            <class>dk.netarkivet.archive.indexserver.distribute.TestIndexRequestServer</class>
            <fileContainingJobsForTestindex>/home/test/prod-metadata-ids.txt</fileContainingJobsForTestindex>
          </indexrequestserver>
        </indexserver>
      </harvester>
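
A quick sanity check of the override can be sketched in Python. This is only an illustration: the `<settings>` root element, the dropped elided parts, and the helper name `read_override` are assumptions, and a real settings file may declare an XML namespace that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the values are copied from the override above, but
# the <settings> root and the omission of the elided parts are assumptions.
FRAGMENT = """
<settings>
  <common>
    <environmentName>LUCENETEST</environmentName>
    <arcrepositoryClient>
      <class>dk.netarkivet.common.distribute.arcrepository.LocalArcRepositoryClient</class>
      <fileDir>/data/rawdata/prod-metadata</fileDir>
    </arcrepositoryClient>
  </common>
  <harvester>
    <indexserver>
      <indexrequestserver>
        <class>dk.netarkivet.archive.indexserver.distribute.TestIndexRequestServer
        </class>
        <fileContainingJobsForTestindex>/home/test/prod-metadata-ids.txt</fileContainingJobsForTestindex>
      </indexrequestserver>
    </indexserver>
  </harvester>
</settings>
"""


def read_override(xml_text):
    """Return the override values of interest, with surrounding whitespace stripped."""
    root = ET.fromstring(xml_text.strip())

    def text(path):
        node = root.find(path)
        if node is None or node.text is None:
            return None
        return node.text.strip()

    return {
        "environment": text("common/environmentName"),
        "arc_client": text("common/arcrepositoryClient/class"),
        "arc_file_dir": text("common/arcrepositoryClient/fileDir"),
        "index_request_server": text("harvester/indexserver/indexrequestserver/class"),
        "jobs_file": text(
            "harvester/indexserver/indexrequestserver/fileContainingJobsForTestindex"
        ),
    }


if __name__ == "__main__":
    for key, value in read_override(FRAGMENT).items():
        print(f"{key}: {value}")
```

Stripping the element text matters here because a line break inside `<class>` would otherwise become part of the class name.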

      Remember to set the Heritrix heap size to about 2 GB (the example uses 1936M):

      <heritrix>
        ..
        <heapSize>1936M</heapSize>
        ..
      </heritrix>
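
To see how the configured value relates to "2 GB", the arithmetic can be sketched as follows. This assumes the `heapSize` value is interpreted like a JVM `-Xmx` size, where `K`, `M` and `G` are powers of 1024; the helper `heap_size_bytes` is hypothetical and not part of NetarchiveSuite.

```python
import re

# Binary multipliers, as used for JVM heap sizes such as -Xmx1936M.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def heap_size_bytes(value):
    """Convert a heap size string such as '1936M' or '2G' to a byte count."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised heap size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


if __name__ == "__main__":
    configured = heap_size_bytes("1936M")
    two_gib = heap_size_bytes("2G")
    # How far the configured heap is below a full 2 GiB, in MiB.
    print((two_gib - configured) // _UNITS["M"])
```

Under that assumption, 1936M is 112 MiB short of a full 2 GiB (2048M), so "2 GB" in the description is an approximation.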


          People

            svc Søren Vejrup Carlsen (Inactive)