Give the IndexServer local access to a large set of PROD metadata ARC files on the index server (using LocalArcRepositoryClient as the arcrepositoryClient), and make the IndexServer always return the same big index, based on 8000 or 9000 jobs, regardless of the request (using the TestIndexRequestServer class; the jobs that make up the index are listed in the file given as argument to TestIndexRequestServer).
See kb-test-acs-001.kb.dk:/home/test/LUCENETEST/conf/settings_IndexServerApplication.xml for how this is done.
Ingest 50,000 domains (use the list kb-test-adm-001.kb.dk:/home/test/domains-30-05-2012.txt)
Start a snapshot crawl; set the number of max objects per domain to 100 (to avoid getting complaints)
Let one job run to completion and check that it doesn't go into paused mode (a symptom of an OOM error in Heritrix)
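Before ingesting, it may be worth confirming that the domain list really holds the expected number of entries. A minimal Python sketch, assuming the list is plain text with one domain per line (the file format is not stated in these notes, and summarize_domains is a hypothetical helper, not part of NetarchiveSuite):

```python
def summarize_domains(lines):
    """Count unique entries and duplicates in a domain list.

    Assumption: one domain per line. Blank lines are ignored and
    the comparison is case-insensitive.
    """
    seen = set()
    duplicates = 0
    for line in lines:
        domain = line.strip().lower()
        if not domain:
            continue
        if domain in seen:
            duplicates += 1
        else:
            seen.add(domain)
    return len(seen), duplicates


if __name__ == "__main__":
    # Tiny inline sample; for the real list, pass an open file instead,
    # e.g. open("domains-30-05-2012.txt", encoding="utf-8").
    sample = ["netarkivet.dk\n", "kb.dk\n", "\n", "KB.dk\n"]
    unique, duplicates = summarize_domains(sample)
    print(f"{unique} unique domains, {duplicates} duplicates")
```

If the unique count is well below 50,000, the snapshot crawl will generate fewer jobs than this test expects.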
indexserver settings override:
<common>
<environmentName>LUCENETEST</environmentName>
<arcrepositoryClient>
<class>dk.netarkivet.common.distribute.arcrepository.LocalArcRepositoryClient</class>
<fileDir>/data/rawdata/prod-metadata</fileDir>
</arcrepositoryClient>
...
<harvester>
..
<indexserver>
<indexrequestserver>
<class>dk.netarkivet.archive.indexserver.distribute.TestIndexRequestServer</class>
<fileContainingJobsForTestindex>/home/test/prod-metadata-ids.txt</fileContainingJobsForTestindex>
</indexrequestserver>
</indexserver>
</harvester>
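A quick way to confirm that the overrides took effect is to parse the settings file and compare the relevant values. The following is only a sketch: SAMPLE_SETTINGS is a hypothetical, well-formed stand-in for the fragment above (the real file has more elements, elided here as "..."), the settings root element and the placement of harvester next to common are assumptions, and check_overrides is an illustrative helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real settings file (assumed nesting).
SAMPLE_SETTINGS = """
<settings>
  <common>
    <environmentName>LUCENETEST</environmentName>
    <arcrepositoryClient>
      <class>dk.netarkivet.common.distribute.arcrepository.LocalArcRepositoryClient</class>
      <fileDir>/data/rawdata/prod-metadata</fileDir>
    </arcrepositoryClient>
  </common>
  <harvester>
    <indexserver>
      <indexrequestserver>
        <class>dk.netarkivet.archive.indexserver.distribute.TestIndexRequestServer</class>
        <fileContainingJobsForTestindex>/home/test/prod-metadata-ids.txt</fileContainingJobsForTestindex>
      </indexrequestserver>
    </indexserver>
  </harvester>
</settings>
"""

# Paths are relative to the root element; values are taken from the notes.
EXPECTED = {
    "common/environmentName": "LUCENETEST",
    "common/arcrepositoryClient/class":
        "dk.netarkivet.common.distribute.arcrepository.LocalArcRepositoryClient",
    "common/arcrepositoryClient/fileDir": "/data/rawdata/prod-metadata",
    "harvester/indexserver/indexrequestserver/class":
        "dk.netarkivet.archive.indexserver.distribute.TestIndexRequestServer",
    "harvester/indexserver/indexrequestserver/fileContainingJobsForTestindex":
        "/home/test/prod-metadata-ids.txt",
}


def check_overrides(xml_text, expected):
    """Return (path, found, wanted) for every value that does not match."""
    root = ET.fromstring(xml_text)
    # Drop any XML namespace so the plain paths above still match.
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    problems = []
    for path, wanted in expected.items():
        found = root.findtext(path)
        # Surrounding whitespace is stripped before comparing.
        found = found.strip() if found is not None else None
        if found != wanted:
            problems.append((path, found, wanted))
    return problems


if __name__ == "__main__":
    for path, found, wanted in check_overrides(SAMPLE_SETTINGS, EXPECTED):
        print(f"{path}: found {found!r}, expected {wanted!r}")
```

For a real check, read the text of settings_IndexServerApplication.xml and pass it in place of SAMPLE_SETTINGS; an empty result means every listed override is present.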
Remember to set the Heritrix heap size to roughly 2 GB (1936M in this setup):
<heritrix>
..
<heapSize>1936M</heapSize>
..
</heritrix>
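The configured value is slightly below a full 2 GB. The sketch below works out the difference, assuming the heapSize value is handed to the JVM as a maximum heap size and that the K/M/G suffixes are binary multiples, as they are for the JVM's -Xmx option; heap_bytes is a hypothetical helper, not part of NetarchiveSuite.

```python
import re

# Binary multiples, as assumed for JVM-style heap size suffixes.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def heap_bytes(value):
    """Convert a heap size string such as '1936M' or '2G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised heap size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


if __name__ == "__main__":
    configured = heap_bytes("1936M")
    two_gib = heap_bytes("2G")
    # Shortfall relative to a full 2 GiB, in MiB.
    print((two_gib - configured) // 1024 ** 2)  # 112
```

So 1936M leaves the heap 112 MiB short of 2 GiB.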