JPPF Issue Tracker
CLOSED  Bug report JPPF-160  -  OOME in the driver upon receiving large results from multiple nodes concurrently
Posted Jun 23, 2013 - updated Jun 23, 2013
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
    Closed
  • Assigned to
     lolo4j
  • Progress
       
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     lolo4j
  • Owned by
    Not owned by anyone
  • Category
    Server
  • Resolution
    RESOLVED
  • Priority
    High
  • Reproducibility
    Always
  • Severity
    Normal
  • Targeted for
    JPPF 3.3.4
Issue description
When multiple nodes concurrently return large tasks to the driver, an OutOfMemoryError can be raised. I believe this happens because in this scenario the call to the method that determines whether a serialized object read from the network connection will fit in memory is separate from the actual reading of the object, i.e. they are not performed atomically within a "single transaction".

What I think is happening is that while one thread is reading a serialized task from a node, another thread determines that another task will fit in memory, a determination that becomes false once the task from the first thread has been entirely stored in memory. Hence the second thread reads its task and stores it in memory instead of using the disk overflow, which triggers the OOME.
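To illustrate the check-then-act race described above, here is a minimal sketch of one way to make the "will it fit" check and the reservation a single atomic step. It is not the JPPF code: the class `MemoryReservation` and its methods are hypothetical names, and the capacity is an arbitrary byte budget chosen by the caller.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not JPPF code: tracks the bytes already promised to
 * in-flight reads, so that checking and reserving happen as one atomic step.
 */
final class MemoryReservation {
  // Assumed budget, in bytes, for objects held in memory
  private final long capacity;
  // Bytes reserved by reads that are in progress or already stored
  private final AtomicLong reserved = new AtomicLong();

  MemoryReservation(long capacity) {
    this.capacity = capacity;
  }

  /**
   * Reserves the bytes and returns true if they fit in the remaining budget.
   * Returns false otherwise, in which case the caller would overflow to disk.
   */
  boolean tryReserve(long size) {
    while (true) {
      long current = reserved.get();
      if (size > capacity - current) {
        return false;
      }
      // The CAS fails if another thread reserved in the meantime; retry then
      if (reserved.compareAndSet(current, current + size)) {
        return true;
      }
    }
  }

  /** Gives the bytes back once the object is no longer held in memory. */
  void release(long size) {
    reserved.addAndGet(-size);
  }
}
```

With an 8MB budget, a first 5MB reservation succeeds and a concurrent second one is refused, so the second reader would take the disk overflow path instead of also allocating on the heap.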
Steps to reproduce this issue
  • use a driver with -Xmx128m -XX:+HeapDumpOnOutOfMemoryError and configured with "nodethreads" load-balancer (multiplicator=1)
  • use 4 nodes with sufficient memory
  • submit a job with at least as many tasks as there are processing threads in all the nodes. Each task will create a 5MB byte array during its execution and store it as its result
  • ==> this scenario will raise an OOME and produce a heapdump
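The task used in the steps above could look roughly like the following sketch. It uses a plain `java.util.concurrent.Callable` as a stand-in for the actual JPPF task class, and the name `LargeResultTask` is hypothetical; only the 5MB result size comes from the report.

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical stand-in for the test task: a plain Callable rather than
 * a JPPF task, whose result is a 5MB byte array.
 */
final class LargeResultTask implements Callable<byte[]> {
  // 5MB, the result size given in the steps to reproduce
  static final int RESULT_SIZE = 5 * 1024 * 1024;

  @Override
  public byte[] call() {
    // The array itself is the result that gets sent back to the driver
    return new byte[RESULT_SIZE];
  }
}
```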

#3
Comment posted by
 lolo4j
Jun 23, 09:34
Additionally, no exception or error message is logged or printed to the driver's console output. We need to investigate that as well.
#5
Comment posted by
 lolo4j
Jun 23, 11:56
Even when doing the check and the memory allocation in a single atomic transaction, I still get an OOME due to heap fragmentation. The JVM reports an amount of free heap larger than the object size, but that memory is not contiguous, so the allocation fails with an OOME.
#6
Comment posted by
 lolo4j
Jun 23, 12:46
A fix is to add another criterion when checking whether an object will fit in memory: by checking that the total free memory is larger than a predetermined threshold, we ensure that the heap is very likely to contain enough contiguous memory for the predicted memory allocation. We can set this threshold as a configuration property, for instance "jppf.low.memory.threshold = some_value", where some_value is expressed in megabytes. By setting it to 32MB, I could run the sample code used to reproduce the issue without any OOME. Of course, it's a lot slower than if the driver had enough heap to hold everything in memory, but it works.
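A minimal sketch of such a two-part check follows. It is one reading of the comment above, not the committed fix: the class `FitCheck` and its method names are hypothetical, and it assumes the threshold is compared against the free heap measured before the allocation. The free heap estimate uses the standard `Runtime` methods.

```java
/**
 * Hypothetical sketch of the two-part check, not the actual JPPF fix.
 */
final class FitCheck {
  static final long MB = 1024L * 1024L;

  /** Estimates the free heap as max heap minus the heap currently in use. */
  static long estimatedFreeHeap() {
    Runtime rt = Runtime.getRuntime();
    return rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
  }

  /**
   * Pure decision function, so the rule can be exercised with fixed numbers.
   * The object is kept in memory only if it is smaller than the free heap
   * and the free heap is above the low memory threshold (all in bytes).
   */
  static boolean fitsInMemory(long objectSize, long freeHeap, long lowMemoryThreshold) {
    return objectSize < freeHeap && freeHeap > lowMemoryThreshold;
  }

  /** Same decision against the live heap, with the threshold in megabytes. */
  static boolean fitsInMemory(long objectSize, long lowMemoryThresholdMb) {
    return fitsInMemory(objectSize, estimatedFreeHeap(), lowMemoryThresholdMb * MB);
  }
}
```

With a 32MB threshold, a 5MB object is accepted when 64MB of heap is free but rejected when only 20MB is free, even though 20MB is nominally enough to hold it. That extra margin is what makes it likely that a contiguous block is available.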
#7
Comment posted by
 lolo4j
Jun 23, 14:02
Fixed. Changes committed to SVN:

The issue was updated with the following change(s):
  • This issue has been closed
  • The status has been updated, from Confirmed to Closed.
  • This issue's progression has been updated to 100 percent completed.
  • The resolution has been updated, from Not determined to RESOLVED.
  • Information about the user working on this issue has been changed, from lolo4j to Not being worked on.