JPPF Issue Tracker
star_faded.png
Please log in to bookmark issues
bug_report_small.png
CLOSED  Bug report JPPF-130  -  Job hang when loading a large number of classes with multiple nodes
Posted Feb 25, 2013 - updated Mar 12, 2013
icon_info.png This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
     
    Closed
  • Assigned to
     lolo4j
  • Progress
       
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     lolo4j
  • Owned by
    Not owned by anyone
  • Category
    Server
  • Resolution
    RESOLVED
  • Priority
    Critical
  • Reproducability
    Often
  • Severity
    Critical
  • Targetted for
    icon_milestones.png JPPF 3.2.3
Issue description
When running a job that causes the nodes to load a large number of classes, some of the nodes may throw a ClassNotFOundException/NoClassDefFoundError. Sometimes, I can also see that the job hangs after a few iterations, without any exception being thrown.

This happens when multiple jobs are running, but not when there's only one node.
Steps to reproduce this issue
from description

#5
Comment posted by
 lolo4j
Feb 25, 22:23
Found yet another deadlock:

Found one Java-level deadlock:
=============================
"NodeJobServer-7":
  waiting to lock monitor 0x000000000b698de8 (object 0x00000000ff878d98, a org.jppf.server.protocol.ServerJob),
  which is held by "NodeJobServer-4"
 
"NodeJobServer-4":
  waiting to lock monitor 0x0000000011542800 (object 0x00000000e0078b60, a org.jppf.server.nio.SelectionKeyWrapper),
  which is held by "ClientJobServer-5"
 
"ClientJobServer-5":
  waiting to lock monitor 0x000000000b9545b0 (object 0x00000000ff892ff0, a org.jppf.server.protocol.ServerTaskBundleClient),
  which is held by "NodeJobServer-4"
 
Java stack information for the threads listed above:
===================================================
 
"NodeJobServer-7":
  at org.jppf.server.protocol.ServerJob.resultsReceived(ServerJob.java:206)
  - waiting to lock <0x00000000ff878d98> (a org.jppf.server.protocol.ServerJob)
  at org.jppf.server.protocol.ServerTaskBundleNode.resultsReceived(ServerTaskBundleNode.java:180)
  at org.jppf.server.nio.nodeserver.WaitingResultsState.performTransition(WaitingResultsState.java:82)
  at org.jppf.server.nio.nodeserver.WaitingResultsState.performTransition(WaitingResultsState.java:1)
  at org.jppf.server.nio.StateTransitionTask.run(StateTransitionTask.java:82)
  - locked <0x00000000e00fa590> (a org.jppf.server.nio.SelectionKeyWrapper)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
 
"NodeJobServer-4":
  at org.jppf.server.nio.client.CompletionListener.taskCompleted(CompletionListener.java:86)
  - waiting to lock <0x00000000e0078b60> (a org.jppf.server.nio.SelectionKeyWrapper)
  at org.jppf.server.protocol.ServerTaskBundleClient.fireTasksCompleted(ServerTaskBundleClient.java:329)
  at org.jppf.server.protocol.ServerTaskBundleClient.resultReceived(ServerTaskBundleClient.java:187)
  - locked <0x00000000ff892ff0> (a org.jppf.server.protocol.ServerTaskBundleClient)
  at org.jppf.server.protocol.ServerJob.resultsReceived(ServerJob.java:225)
  - locked <0x00000000ff878d98> (a org.jppf.server.protocol.ServerJob)
  at org.jppf.server.protocol.ServerTaskBundleNode.resultsReceived(ServerTaskBundleNode.java:180)
  at org.jppf.server.nio.nodeserver.WaitingResultsState.performTransition(WaitingResultsState.java:82)
  at org.jppf.server.nio.nodeserver.WaitingResultsState.performTransition(WaitingResultsState.java:1)
  at org.jppf.server.nio.StateTransitionTask.run(StateTransitionTask.java:82)
  - locked <0x00000000e00fa230> (a org.jppf.server.nio.SelectionKeyWrapper)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
 
"ClientJobServer-5":
  at org.jppf.server.protocol.ServerTaskBundleClient.isDone(ServerTaskBundleClient.java:269)
  - waiting to lock <0x00000000ff892ff0> (a org.jppf.server.protocol.ServerTaskBundleClient)
  at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:601)
  at org.jppf.utils.ReflectionUtils.dumpObject(ReflectionUtils.java:81)
  at org.jppf.utils.ReflectionUtils.dumpObject(ReflectionUtils.java:43)
  at org.jppf.server.protocol.ServerTaskBundleClient.toString(ServerTaskBundleClient.java:375)
  at java.lang.String.valueOf(String.java:2902)
  at java.lang.StringBuilder.append(StringBuilder.java:128)
  at org.jppf.server.nio.client.SendingResultsState.performTransition(SendingResultsState.java:92)
  at org.jppf.server.nio.client.SendingResultsState.performTransition(SendingResultsState.java:1)
  at org.jppf.server.nio.StateTransitionTask.run(StateTransitionTask.java:82)
  - locked <0x00000000e0078b60> (a org.jppf.server.nio.SelectionKeyWrapper)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
#7
Comment posted by
 lolo4j
Feb 25, 22:24
A file was uploaded. Thread dump for deadlock found in the drivericon_open_new.png
#8
Comment posted by
 lolo4j
Feb 25, 22:31
In the dump for thread "ClientJobServer-5", the thread is blocked during a call to ServerTaskBundleClient.toString() ==> very bad. We need to avoid synchronization or locking in our toString() methods.
#9
Comment posted by
 lolo4j
Feb 28, 20:23
After many tries, it looks like I found a meaningful exception during the class loading in the server:
2013-02-28 19:18:14,978 [DEBUG][ClientClassServer-4 ][org.jppf.server.nio.StateTransitionTask.run(92)]: error on channel SelectionKeyWrapper[id=10, lolo-i2600:65224, readyOps=4, keyOps=0, context=channel=SelectionKeyWrapper[id=10], state=SENDING_PROVIDER_REQUEST, resource=JPPFResourceWrapper[dynamic=true, asResource=false, state=PROVIDER_REQUEST, data={name=some_package/SomeClass.class}], pendingRequests=2, pendingResponses=0, currentRequest=ResourceRequest[channel=SelectionKeyWrapper[id=6], resource=JPPFResourceWrapper[dynamic=true, asResource=false, state=PROVIDER_REQUEST, data={name=name=some_package/SomeClass.class}]], provider=true, peer=false, uuid=c1, connectionUuid=c1_1] : java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:604)
  at java.util.ArrayList.get(ArrayList.java:382)
  at org.jppf.server.nio.AbstractNioMessage.writeNextObject(AbstractNioMessage.java:188)
  at org.jppf.server.nio.AbstractNioMessage.write(AbstractNioMessage.java:128)
  at org.jppf.server.nio.SimpleNioContext.writeMessage(SimpleNioContext.java:61)
  at org.jppf.server.nio.classloader.client.SendingProviderRequestState.performTransition(SendingProviderRequestState.java:89)
  at org.jppf.server.nio.classloader.client.SendingProviderRequestState.performTransition(SendingProviderRequestState.java:1)
  at org.jppf.server.nio.StateTransitionTask.run(StateTransitionTask.java:82)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
#10
Comment posted by
 lolo4j
Mar 08, 10:21
After many tries, the only exception I get now is typically like this:
2013-03-08 09:10:24,570 [DEBUG][ClientClassServer-1 ][org.jppf.server.nio.StateTransitionTask.run(92)]: error on channel SelectionKeyWrapper[id=10, lolo-i2600:54277, readyOps=4, keyOps=0, context=channel=SelectionKeyWrapper[id=10], state=SENDING_PROVIDER_REQUEST, resource=null, pendingRequests=0, currentRequest=null, connectionUuid=c1_1, type=client, peer=false, uuid=c1] : java.net.ConnectException: provider SelectionKeyWrapper[id=10, lolo-i2600:54277, readyOps=4, keyOps=5, context=channel=SelectionKeyWrapper[id=10], state=SENDING_PROVIDER_REQUEST, resource=null, pendingRequests=0, currentRequest=null, connectionUuid=c1_1, type=client, peer=false, uuid=c1] has been disconnected
java.net.ConnectException: provider SelectionKeyWrapper[id=10, lolo-i2600:54277, readyOps=4, keyOps=5, context=channel=SelectionKeyWrapper[id=10], state=SENDING_PROVIDER_REQUEST, resource=null, pendingRequests=0, currentRequest=null, connectionUuid=c1_1, type=client, peer=false, uuid=c1] has been disconnected
  at org.jppf.server.nio.classloader.client.SendingProviderRequestState.performTransition(SendingProviderRequestState.java:67)
  at org.jppf.server.nio.classloader.client.SendingProviderRequestState.performTransition(SendingProviderRequestState.java:1)
  at org.jppf.server.nio.StateTransitionTask.run(StateTransitionTask.java:82)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
I almso checked in the client log that the corresponding log message for the class loader channel disconnection happens a few milliseconds after that. So this seems to imply that the channel on the driver side is transitioned to SENDING_PROVIDER_REQUEST before the client has finished sending a response to a class loading request.

Additionally, I can now reproduce very easily with the following task code:
public class JPPF_130_Task extends JPPFTask {
  @Override
  public void run() {
    try {
      ClassLoader cl = getClass().getClassLoader();
      for (int i=0; i<100; i++) {
        URL url = cl.getResource("test.SomeResource" + i);
      }
      setResult("success");
    } catch (Exception e) {
      setException(e);
    }
  }
}
What it's doing is to stress the class loader by requesting many resources in a very short time. Whether the resources actually exist does not matter. I can reproduce the exception above when running jobs with 100 of these tasks, using 4 nodes and a client with a pool of 4 connections, each node has 8 processing threads.
#11
Comment posted by
 lolo4j
Mar 12, 22:34
Fixed. Changes committed to SVN:

The issue was updated with the following change(s):
  • This issue has been closed
  • The status has been updated, from New to Closed.
  • This issue's progression has been updated to 100 percent completed.
  • The resolution has been updated, from Not determined to RESOLVED.
  • Information about the user working on this issue has been changed, from lolo4j to Not being worked on.