[NAS-1993] BatchJob freezes KB Bitarchive Created: 08/Nov/11 Updated: 29/Jul/14 Resolved: 08/Dec/11 |
|
Status: | Resolved |
Project: | NetarchiveSuite |
Component/s: | None |
Affects Version/s: | 3.18.0 |
Fix Version/s: | 3.18.0 |
Type: | Bug | Priority: | Critical |
Reporter: | Colin Rosenthal | Assignee: | Colin Rosenthal |
Resolution: | Fixed | ||
Labels: | None | ||
Remaining Estimate: | 0h | ||
Time Spent: | 7h | ||
Original Estimate: | Not Specified |
Inspector: | Søren Vejrup Carlsen (Inactive) |
External reference: | |
Verification: | Prepare any multi-machine deploy. Check that the log properties in release_software_dist/TESTX |
Description |
There is a step in TEST11A: java -cp lib/dk.netarkivet.archive.jar -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -Eerror.exception -Ooutput.exception which fails as follows
|
Comments |
Comment by Søren Vejrup Carlsen (Inactive) [ 08/Dec/11 ] |
Verified as part of TEST11A using revision 2191 |
Comment by Mikis Seth Sørensen (Inactive) [ 11/Nov/11 ] |
Soooo, this issue is properly at least 'In progress' and perhaps even ready for review? |
Comment by Colin Rosenthal [ 09/Nov/11 ] |
Fix (to deploy system) applied in svn r2173. |
Comment by Colin Rosenthal [ 08/Nov/11 ] |
The fault is in the way we handle stdout. The error can be switched on and off by enabling/disabling the ConsoleHandler. |
Comment by Colin Rosenthal [ 08/Nov/11 ] |
The fault is definitely in the logging. By commenting out all handlers in the failing bitapps I can get the job to complete normally. Alternatively by increasing the amount of logging in a bitapp that was running ok I can trigger the failure. Further experiments now to see which handler is the culprit. |
Comment by Colin Rosenthal [ 08/Nov/11 ] |
And here's another curious thing. Bitapps 1,2,3 & 6 all say that they have finished and are sending reply messages but the KB BAMON has only received reply message from 1,3 & 6. |
Comment by Colin Rosenthal [ 08/Nov/11 ] |
Here is a stack trace showing the last thing the bitapp does before dying. Note that it now dies shortly after the first exception. INFO: Starting batch job on bitarchive application with id '172.17.0.53_BitApp_4': 'dk.netarkivet.common.utils.batch.LoadableFileBatchJob', on filename-pattern: '.*' 08-11-2011 13:29:09 dk.netarkivet.archive.bitarchive.Bitarchive batch FINE: Batch: Job dk.netarkivet.common.utils.batch.LoadableFileBatchJob processing ExceptionBatch.class started at Tue Nov 08 13:29:09 CET 2011 08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles run INFO: The batchjob 'class dk.netarkivet.common.utils.batch.LoadableFileBatchJob' has run for 0 seconds and has reached file '10-2-20100903130259-00001-kb-test-har-002.kb.dk.arc', which is number 1 out of 706 08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles processFile FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-2-20100903130259-00001-kb-test-har-002.kb.dk.arc'. 08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.LoadableFileBatchJob processFile FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-2-20100903130259-00001-kb-test-har-002.kb.dk.arc'. 08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles processFile FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-metadata-1.arc'. 08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.LoadableFileBatchJob processFile FINEST: Started processing of file 'G:\bitarkiv_4\TEST11A\filedir\10-metadata-1.arc'. 08-11-2011 13:29:09 dk.netarkivet.common.utils.batch.BatchLocalFiles processFile WARNING: Exception while processing file G:\bitarkiv_4\TEST11A\filedir\10-metadata-1.arc with job dk.netarkivet.common.utils.batch.LoadableFileBatchJob processing ExceptionBatch.class java.lang.SecurityException: Security exception at dk.netarkivet.archive.bitarchive.ExceptionBatch.processFile(ExceptionBatch.java:17) at dk.netarkivet.common.utils.batch.LoadableFileBatchJob.processFile(LoadableFileBatchJob.java:179) at dk.netarkivet.common.utils.batch.BatchLocalFiles.processFile(BatchLocalFiles.java:166) at dk.netarkivet.common.utils.batch.BatchLocalFiles.run(BatchLocalFiles.java:113) at dk.netarkivet.archive.bitarchive.Bitarchive.batch(Bitarchive.java:244) at dk.netarkivet.archive.bitarchive.distribute.BitarchiveServer$1.run(BitarchiveServer.java:421) 08-11-2011 13:29:09 dk.netarkivet.common.distribute.JMSConnection send FINEST: Sending message (Heartbeat for 172.17.0.53_BitApp_4 at 1320755349954) to [Queue 'TEST11A_KB_THE_BAMON'] 08-11-2011 13:29:10 dk.netarkivet.common.distribute.JMSConnection send FINEST: Sending message (NO ID: To TEST11A_COMMON_MONITOR ReplyTo TEST11A_COMMON_ERROR OK) to [Queue 'TEST11A_COMMON_MONITOR'] |
Comment by Colin Rosenthal [ 08/Nov/11 ] |
... but adding a fourth metadata arcfile to bitapp_6 causes it to register four exceptions and then terminate normally. |
Comment by Colin Rosenthal [ 08/Nov/11 ] |
Two of the six KB bitapps hang. In each case the bitapp hangs on the fourth file exception thrown. None of the other bitapps throw as many as four exceptions. |