Verification:

Give the IndexServer local access to a large number of PROD metadata ARC files on the index server (using LocalArcRepository as arcrepositoryClient), and make the IndexServer always return the same large index, based on 8000 or 9000 jobs, regardless of the request (using the TestIndexRequestServer class; the jobs that make up the index are listed in the file given as argument to TestIndexRequestServer).
See kb-test-acs-001.kb.dk:/home/test/LUCENETEST/conf/settings_IndexServerApplication.xml
for how this is done.

Ingest 100,000 domains

Start a snapshot crawl; set the maximum number of objects per domain to 100 (to avoid getting complaints)

Let one or two jobs complete, and verify that they do not go into paused mode (a symptom of an OOM error in Heritrix)
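
The fixed-index behaviour used in the first step can be illustrated with a minimal sketch. This is not the TestIndexRequestServer code: the class name `FixedIndexServer`, the method `handle_request`, and the one-job-ID-per-line file format are all assumptions made for illustration. It only shows the idea of ignoring the request and always answering with the job set read from a file.

```python
from pathlib import Path
import tempfile


def read_job_ids(path):
    """Read job IDs from a file, one per line (assumed format), skipping blanks."""
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            ids.append(int(line))
    return ids


class FixedIndexServer:
    """Hypothetical stand-in that answers every request with the same job set."""

    def __init__(self, job_file):
        # The job set is fixed once, at start-up, from the file argument.
        self.fixed_jobs = frozenset(read_job_ids(job_file))

    def handle_request(self, requested_jobs):
        # The requested job IDs are deliberately ignored.
        return self.fixed_jobs


def demo():
    """Return True if two different requests get the same answer."""
    with tempfile.TemporaryDirectory() as tmp:
        job_file = Path(tmp) / "jobs.txt"
        job_file.write_text("1\n2\n3\n")
        server = FixedIndexServer(job_file)
        return server.handle_request({42}) == server.handle_request({7, 8})
```

Because every harvest job then receives the same large index, memory use in Heritrix can be compared between runs without the index size varying with the request.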

It turns out that the Lucene indices generated in 3.18 require too much memory in Heritrix, even though the actual index size is smaller.

Currently, we have rolled back the code that builds the index in parallel and then merges the indices into one.