[NAS-1993] BatchJob freezes KB Bitarchive Created: 08/Nov/11  Updated: 29/Jul/14  Resolved: 08/Dec/11

Status: Resolved
Project: NetarchiveSuite
Component/s: None
Affects Version/s: 3.18.0
Fix Version/s: 3.18.0

Type: Bug Priority: Critical
Reporter: Colin Rosenthal Assignee: Colin Rosenthal
Resolution: Fixed  
Labels: None
Remaining Estimate: 0h
Time Spent: 7h
Original Estimate: Not Specified

Inspector: Søren Vejrup Carlsen Søren Vejrup Carlsen (Inactive)
External reference:

It36Legalbatchjob

Verification:

Prepare any multi-machine deploy. Check that the log properties in release_software_dist/TESTX
show that the ConsoleHandler has been removed from the list of handlers in the log.prop file for
any windows machines.


 Description   

There is a step in TEST11A:

java -cp lib/dk.netarkivet.archive.jar -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -Eerror.exception -Ooutput.exception

which fails as follows

  1. The batch job hangs and never ends
  2. The batch job causes the whole arcrepository to freeze. Any subsequent batch jobs also hang, even jobs which ran ok previously
  3. After restarting, adding -BSBN to the command line to send the job to SB causes the job to run without problem.
    This means that the error closely resembles the problems we have had with automatic indexing for wayback.


 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 08/Dec/11 ]

Verified as part of TEST11A using revision 2191

Comment by Mikis Seth Sørensen (Inactive) [ 11/Nov/11 ]

Soooo, this issue is properly at least 'In progress' and perhaps even ready for review?

Comment by Colin Rosenthal [ 09/Nov/11 ]

Fix (to deploy system) applied in svn r2173.

Comment by Colin Rosenthal [ 08/Nov/11 ]

The fault is in the way we handle stdout. The error can be switched on and off by enabling/disabling the ConsoleHandler.

Comment by Colin Rosenthal [ 08/Nov/11 ]

The fault is definitely in the logging. By commenting out all handlers in the failing bitapps I can get the job to complete normally. Alternatively by increasing the amount of logging in a bitapp that was running ok I can trigger the failure. Further experiments now to see which handler is the culprit.

Comment by Colin Rosenthal [ 08/Nov/11 ]

And here's another curious thing. Bitapps 1,2,3 & 6 all say that they have finished and are sending reply messages but the KB BAMON has only received reply message from 1,3 & 6.

Comment by Colin Rosenthal [ 08/Nov/11 ]

Here is a stack trace showing the last thing the bitapp does before dying. Note that it now dies shortly after the first exception.

INFO: Starting batch job on bitarchive application with id '172.17.0.53_BitApp_4': 'dk.netarkivet.common.utils.batch.LoadableFileBatchJob', on filename-pattern: '.*'
08-11-2011 13:29:09 dk.netarkivet.archive.bitarchive.Bitarchive batch
FINE: Batch: Job dk.netarkivet.common.utils.batch.LoadableFileBatchJob processing ExceptionBatch.class started at Tue Nov 08 13:29:09 CET 2011
08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles run
INFO: The batchjob 'class dk.netarkivet.common.utils.batch.LoadableFileBatchJob' has run for 0 seconds and has reached file '10-2-20100903130259-00001-kb-test-har-002.kb.dk.arc', which is number 1 out of 706
08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles processFile
FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-2-20100903130259-00001-kb-test-har-002.kb.dk.arc'.
08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.LoadableFileBatchJob processFile
FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-2-20100903130259-00001-kb-test-har-002.kb.dk.arc'.
08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles processFile
FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-metadata-1.arc'.
08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.LoadableFileBatchJob processFile
FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-metadata-1.arc'.
08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles processFile
WARNING: Exception while processing file G:\bitarkiv_4\TEST11A\filedir\10-metadata-1.arc with job dk.netarkivet.common.utils.batch.LoadableFileBatchJob processing ExceptionBatch.class
java.lang.SecurityException: Security exception
	at dk.netarkivet.archive.bitarchive.ExceptionBatch.processFile(ExceptionBatch.java:17)
	at dk.netarkivet.common.utils.batch.LoadableFileBatchJob.processFile(LoadableFileBatchJob.java:179)
	at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:166)
	at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:113)
	at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:244)
	at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:421)
08-11-2011 13:29:09 dk.netarkivet.common.distribute.JMSConnection send
FINEST: Sending message (Heartbeat for 172.17.0.53_BitApp_4 at 1320755349954) to [Queue 'TEST11A_KB_THE_BAMON']
08-11-2011 13:29:10 dk.netarkivet.common.distribute.JMSConnection send
FINEST: Sending message (NO ID: To TEST11A_COMMON_MONITOR ReplyTo TEST11A_COMMON_ERROR OK) to [Queue 'TEST11A_COMMON_MONITOR']
Comment by Colin Rosenthal [ 08/Nov/11 ]

... but adding a fourth metadata arcfile to bitapp_6 causes it to register four exceptions and then terminate normally.

Comment by Colin Rosenthal [ 08/Nov/11 ]

Two of the six KB bitapps hang. In each case the bitapp hangs on the fourth file exception thrown. None of the other bitapps throw as many as four exceptions.

Generated at Fri Mar 29 15:30:31 CET 2024 using Jira 9.4.15#940015-sha1:bdaa9cbecfb6791ea579749728cab771f0dfe90b.