  • March 20th 13-14.

Any other business

We are on track again: indexing for the broad crawl is now parallelized, and the total start-up time
for the broad crawl, including creation of an 80 GB deduplication index, was only about 24 hours,
without any manual intervention.

Be aware that the new index creation method places a heavy load on the folder tmpdircommon during sorting.

We had 24 broad crawl harvesters and 32 selective harvesters active during start-up (no single low prio
harvester).

What were the main problems during start-up:

1) Every low prio harvester died with a Java out-of-heap-space error after it got the index.
   It seems that the new parallelized broad crawl index demands more memory for the Heritrix processes.
   Fix: increased memory to 3 GB per Heritrix instance in the local settings.xml file on each 64-bit server and
        closed all 32-bit harvesters (4).

2) Harvesters continuously started, ran and failed, with log spam about trying to generate a new
   index even though the index was in place and ready, until no more jobs were in the queue.
   Fix: the newly requested index name was created as a link to the already created index, and all jobs were resubmitted.

3) Selective harvests wait for their index until the broad crawl index is finished.
   Fix: no fix currently.

4) Running jobs GUI out of sync with the jobs actually running.
   Fix: used SVC's ad hoc Java tool to delete zombie "Running jobs".
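
The link workaround in item 2 could be sketched in Java roughly as below. This is only an illustration under assumptions: the minutes do not say how the link was made, and the class name, method name and directory names are hypothetical, not NetarchiveSuite code. It assumes a file system that supports symbolic links.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

// Hypothetical helper, not part of NetarchiveSuite.
class IndexLinkSketch {

    /**
     * Makes requestedName resolve to existingIndex through a symbolic link.
     * Returns true if a link was created, false if requestedName already exists.
     */
    static boolean linkRequestedIndex(Path existingIndex, Path requestedName) throws IOException {
        if (!Files.exists(existingIndex)) {
            throw new IOException("Existing index not found: " + existingIndex);
        }
        // NOFOLLOW_LINKS so that an existing (even dangling) link counts as present.
        if (Files.exists(requestedName, LinkOption.NOFOLLOW_LINKS)) {
            return false;
        }
        Files.createSymbolicLink(requestedName, existingIndex.toAbsolutePath());
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Demo in a temporary directory; the names are placeholders.
        Path cache = Files.createTempDirectory("indexcache");
        Path existing = Files.createDirectory(cache.resolve("existing-index"));
        Path requested = cache.resolve("requested-index");
        System.out.println("link created: " + linkRequestedIndex(existing, requested));
        System.out.println("same index: " + Files.isSameFile(existing, requested));
    }
}
```

With such a link in place, a lookup of the requested name finds the existing index instead of triggering a new index generation, which is the effect the fix relied on.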

What were our main problems during the upgrade to 3.18 in production:

1) Corrupt indexes in the Derby admin database.
   Fix: recreated the indexes.

2) Very slow new lookup table in the admin database.
   Fix: reconfigured the lookup table to one with only 1 record.