JPPF Issue Tracker
CLOSED  Bug report JPPF-297  -  Synchronization Deadlock in Queue/Job Processing
Posted Jul 28, 2014 - updated Jul 29, 2014
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
    Closed
  • Assigned to
    lolo4j
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     Daniel Widdis
  • Owned by
    Not owned by anyone
  • Category
    Server
  • Resolution
    RESOLVED
  • Priority
    Critical
  • Reproducibility
    Always
  • Severity
    Critical
  • Targeted for
    JPPF 4.2.1
Issue description
When expanding my network from fewer than 25 nodes to over 35, the driver has consistently been hanging. A thread dump indicates the driver is deadlocked.

The following stack trace fragments identify the conflicting lock order causing the deadlock:

(lock.lock())
at org.jppf.server.queue.JPPFPriorityQueue.getJob(JPPFPriorityQueue.java:305) 
  at org.jppf.server.queue.JPPFPriorityQueue.getBundleForJob(JPPFPriorityQueue.java:345)
  at org.jppf.server.job.JPPFJobManager.getBundleForJob(JPPFJobManager.java:110)
(synchronized JPPFJobManager)

vs.

(synchronized JPPFJobManager)
at org.jppf.server.job.JPPFJobManager.jobUpdated(JPPFJobManager.java:188)
  at org.jppf.server.protocol.AbstractServerJob.fireJobUpdated(AbstractServerJob.java:457)
  at org.jppf.server.protocol.ServerJob.copy(ServerJob.java:90)
  at org.jppf.server.queue.JPPFPriorityQueue.nextBundle(JPPFPriorityQueue.java:198)
(lock.lock())
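
The two fragments show a lock-order inversion: one thread enters the synchronized JPPFJobManager and then waits on the queue's lock (getBundleForJob -> getJob), while another thread holds the queue's lock and then tries to enter the JPPFJobManager monitor (nextBundle -> fireJobUpdated -> jobUpdated). The minimal Java sketch below reproduces that pattern; the class and method bodies are invented for illustration and are not the JPPF source:

  // Illustration only: two threads acquiring the same pair of locks in opposite order.
  import java.util.concurrent.locks.ReentrantLock;

  public class LockOrderInversion {
    private final ReentrantLock queueLock = new ReentrantLock(); // stands in for JPPFPriorityQueue's lock
    private final Object jobManagerMonitor = new Object();       // stands in for synchronized(JPPFJobManager)

    // Thread A: monitor first, then queue lock (like JPPFJobManager.getBundleForJob -> getJob)
    void getBundleForJob() {
      synchronized (jobManagerMonitor) {
        queueLock.lock();
        try {
          // look up the job in the queue
        } finally {
          queueLock.unlock();
        }
      }
    }

    // Thread B: queue lock first, then monitor (like nextBundle -> fireJobUpdated -> jobUpdated)
    void nextBundle() {
      queueLock.lock();
      try {
        synchronized (jobManagerMonitor) {
          // notify job listeners
        }
      } finally {
        queueLock.unlock();
      }
    }
  }

If thread A stops between the two acquisitions while thread B holds the queue lock, each waits forever on the lock the other holds, which matches the hang described above.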
Steps to reproduce this issue
Regularly reproducible on my network running on the Rackspace cloud when I have more than about 30 nodes and tasks complete in about 5 minutes (resulting in an average of 10 seconds between task completions across the nodes, with many arriving nearly simultaneously).

#2
Comment posted by
 Daniel Widdis
Jul 28, 05:59
A file was uploaded: full thread dump of the deadlock. This comment was attached:

Complete driver thread dump showing deadlock.
#5
Comment posted by
 lolo4j
Jul 28, 12:41
I'm not able to reproduce a deadlock, even with 35 nodes and the ServerMonitorClient. I tried small jobs and big jobs, with the largest test having 1000 jobs with 1000 tasks each, submitting up to 30 jobs concurrently. Reviewing the JPPFJobManager code, I found that the synchronization I implemented was pretty bad. I fixed it and built a jppf-server.jar that can be downloaded from here, along with the source jar. Could you give it a try and let us know if this resolves the deadlock issue?
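
For context, the usual remedies for the inversion shown in the description are to acquire the two locks in a consistent order everywhere, or to avoid holding one lock while calling code that takes the other, for example by firing notifications after the queue lock is released. The sketch below illustrates that second pattern with invented names; it is not the actual change shipped in the jar referenced above:

  // Illustration only: snapshot state under the lock, notify listeners outside it.
  import java.util.concurrent.locks.ReentrantLock;

  public class NotifyOutsideLock {
    private final ReentrantLock queueLock = new ReentrantLock();

    void nextBundle() {
      JobSnapshot snapshot;
      queueLock.lock();
      try {
        snapshot = takeSnapshot();   // copy whatever the listeners need
      } finally {
        queueLock.unlock();
      }
      fireJobUpdated(snapshot);      // listener code can now synchronize on the job manager freely
    }

    // placeholders for the illustration
    static final class JobSnapshot {}
    private JobSnapshot takeSnapshot() { return new JobSnapshot(); }
    private void fireJobUpdated(JobSnapshot s) { /* notify listeners */ }
  }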
#6
Comment posted by
 Daniel Widdis
Jul 28, 18:33, in reply to comment #5
lolo4j wrote:
I'm not able to reproduce a deadlock, even with 35 nodes and the
ServerMonitorClient. I tried small jobs, big jobs, with the largest test
having 1000 jobs with 1000 tasks each, and submitting up to 30 jobs
concurrently.


The number of threads on the driver may be the limiting factor. My testing suggests that the number of connected nodes (and their relative task completion rate), combined with heavy CPU load on the driver machine, are the main contributors, rather than queue length itself.

Your fix appears to resolve the deadlock, but there is now an NPE in either getAllJobIds() or getJobName(jobUUID), which I emailed you about separately.
#7
Comment posted by
 lolo4j
Jul 28, 22:26, in reply to comment #6
I was not getting the NPE because there was another unchecked exception just before it, which was silently propagated to the thread handling the job notifications. That was due to my modifications of the original ServerMonitorClient, made to handle external method calls that were previously missing. I then displayed the uuids returned by DriverJobManagementMBean.getAllJobIds() and noticed they looked like this: { uuid1, uuid2, null, null, ..., null }. The null values caused the NPEs.

I found the problem, which was due to my fix in JPPFJobManager. I have now "fixed the fix" and uploaded the jars to the same location as the previous version.
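
The comment does not detail the mechanism, but an array shaped like { uuid1, uuid2, null, ..., null } is what Collection.toArray(T[]) can produce when the destination array was sized before a concurrent collection shrank. The hypothetical sketch below shows that effect and the usual way to avoid it; it is not the actual JPPF code:

  // Illustration only: how a racy snapshot can yield trailing nulls.
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class JobIdSnapshot {
    private final Map<String, Object> jobs = new ConcurrentHashMap<>();

    // Risky: if entries are removed between size() and toArray(), the
    // oversized, freshly allocated destination array keeps null slots at the end.
    String[] allJobIdsRacy() {
      return jobs.keySet().toArray(new String[jobs.size()]);
    }

    // Safer: let toArray() allocate the array, so the result holds exactly
    // the elements present at copy time and no padding nulls.
    String[] allJobIdsSafe() {
      return jobs.keySet().toArray(new String[0]);
    }
  }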
#8
Comment posted by
 Daniel Widdis
Jul 28, 23:58, in reply to comment #7
Looks like this one's fixed!
#9
Comment posted by
 lolo4j
Jul 29, 09:37
fixed in: