[NAS-2234] ChecksumFileServer Dying with OOM Error Created: 09/Aug/13  Updated: 17/Sep/15  Resolved: 08/May/14

Status: Resolved
Project: NetarchiveSuite
Component/s: Archive
Affects Version/s: 4.0, 4.2
Fix Version/s: 4.4

Type: Bug Priority: Minor
Reporter: Colin Rosenthal Assignee: Søren Vejrup Carlsen (Inactive)
Resolution: Fixed  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Verification:

Tested by changing the archive setting "archive.checksum.archive.class" to dk.netarkivet.archive.checksum.DatabaseChecksumArchive.

Restart the ChecksumFileApplication.

All Bitpreservation actions should be possible and give no error.


 Description   

In TEST7, the "Update checksum and filestatus for CS" step fails because of an OOM error:

Host: kb-test-acs-001.kb.dk
Date: Fri Aug 09 15:38:50 CEST 2013
dk.netarkivet.common.utils.ApplicationUtils.logExceptionAndPrint(ApplicationUtils.java:90)
Could not start class dk.netarkivet.archive.checksum.distribute.ChecksumFileServer
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at dk.netarkivet.common.utils.ApplicationUtils.startApp(ApplicationUtils.java:178)
        at dk.netarkivet.archive.checksum.ChecksumFileApplication.main(ChecksumFileApplication.java:47)
Caused by: java.lang.OutOfMemoryError: Java heap space

Will try again with more heapspace!



 Comments   
Comment by Søren Vejrup Carlsen (Inactive) [ 08/May/14 ]

The implementation is now completed.
To use this checksum archive instead of the file-based one, set the archive setting "archive.checksum.archive.class" to dk.netarkivet.archive.checksum.DatabaseChecksumArchive.

The default of this setting is dk.netarkivet.archive.checksum.FileChecksumArchive.

The database will then be located in the "DB" subdirectory of the checksum basedir (by default set to "CS").

To migrate an existing file checksum archive to a DatabaseChecksumArchive, use the dk.netarkivet.archive.tools.LoadDatabaseChecksumArchive tool.

Comment by Søren Vejrup Carlsen (Inactive) [ 11/Nov/13 ]

I have now begun implementing a Berkeley DB-backed DatabaseChecksumArchive.

Comment by Søren Vejrup Carlsen (Inactive) [ 12/Aug/13 ]

Downgrading the criticality to Minor to reflect my opinion of its status.

Comment by Søren Vejrup Carlsen (Inactive) [ 12/Aug/13 ]

We shouldn't IMHO close it, but we don't need to do more now.

Comment by Mikis Seth Sørensen (Inactive) [ 12/Aug/13 ]

Can we close this, with the heap increase as the solution?

Comment by Mikis Seth Sørensen (Inactive) [ 12/Aug/13 ]

The Bit repository should fix this.

Comment by Søren Vejrup Carlsen (Inactive) [ 09/Aug/13 ]

This is an old problem, so moving back to a pre-4 release would not help. The ChecksumFileServer was introduced in NetarchiveSuite 3.12.

Comment by Søren Vejrup Carlsen (Inactive) [ 09/Aug/13 ]

This happens because every file name and its corresponding checksum are stored in a synchronized map and persisted to a file.

During startup, the synchronized map is populated from the checksum file on local disk. This is where it fails here, because the JVM runs out of memory.
The problem lies in the loadFile method of dk.netarkivet.archive.checksum.FileChecksumArchive.

The short-term fix is to increase the maximum heap size (-Xmx). The long-term fix is to use a Berkeley DB to persist the information.
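
The startup behaviour described in this comment can be illustrated with a minimal sketch. It is not the actual FileChecksumArchive.loadFile code: the class name ChecksumMapSketch, the method loadAll, and the "filename##checksum" line format are all assumptions made for illustration. It only shows the pattern of reading every entry into a synchronized map, which is why heap use grows with the number of archived files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ChecksumMapSketch {
    // Assumed line format "<filename>##<checksum>"; the separator is a guess for illustration.
    static final String SEPARATOR = "##";

    /** Reads every entry into a synchronized in-memory map; one map entry is kept per line. */
    static Map<String, String> loadAll(Reader source) {
        Map<String, String> checksums = Collections.synchronizedMap(new HashMap<>());
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int idx = line.lastIndexOf(SEPARATOR);
                if (idx <= 0) {
                    continue; // skip blank lines and lines without a filename or separator
                }
                String filename = line.substring(0, idx);
                String checksum = line.substring(idx + SEPARATOR.length());
                checksums.put(filename, checksum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return checksums;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the checksum file on local disk.
        String data = "a.arc##0123abcd\nb.arc##4567ef01\n";
        Map<String, String> map = loadAll(new StringReader(data));
        System.out.println(map.size() + " entries loaded");
    }
}
```

Because the whole map lives on the heap, a larger heap only postpones the failure; a disk-backed store such as the proposed Berkeley DB avoids holding every entry in memory at once.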

Comment by Søren Vejrup Carlsen (Inactive) [ 09/Aug/13 ]

As the number of files in your archive grows, the ChecksumFileServer will require more memory.

Comment by Søren Vejrup Carlsen (Inactive) [ 09/Aug/13 ]

TLR recently saw this in production.
https://sbprojects.statsbiblioteket.dk/jira/browse/NARK-376

Changing the JVM startup option from "-Xmx1536m" to "-Xmx1936m" made the problem disappear.
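
To confirm that a changed -Xmx value actually reaches the JVM, a small check along these lines may help. The class name HeapCheck and the helper toMiB are hypothetical and not part of NetarchiveSuite; Runtime.maxMemory() is the standard Java call, and the value it reports can be somewhat lower than the -Xmx figure, depending on the JVM and garbage collector.

```java
public class HeapCheck {
    /** Converts a byte count to whole mebibytes (rounding down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will attempt to use, which -Xmx governs.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(maxBytes) + " MiB");
    }
}
```

Running it as `java -Xmx1936m HeapCheck` after compiling should print a figure close to 1936 MiB.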

Comment by Colin Rosenthal [ 09/Aug/13 ]

Increasing the heap size to 2536m has helped. The job is now running and I am waiting to see if it completes normally.
