I recently changed my node's PermGen settings, so it's possible that setting this on the slave nodes is limiting the number that can launch:
But I am only running 1 master + 2 slaves on an 8GB machine so that shouldn't be a problem... or maybe it is.
The symptom remains, so the automatic provisioning is not at fault; something is amiss with whatever is causing these nodes to fail.
When I said "suppressed" the node provisioning, I did so by specifying "0" for the value of slaveNodes in the above code.
However, to rule out a problem with the node configuration, I ran just the nodes and the standard JPPF task code with no problems. (Other than a display of 0% CPU on active slave nodes when the master was idle... but they were doing work...)
So the problem is in my monitoring client, and is likely in the provisioning code above, or the associated JMX. The only other change to the monitoring code since it worked well is some server listing code, quoted below.
This happens once the slave node has been selected for task dispatching, but before its server-side channel is transitioned to its new state. What's missing is some exception handling code in or around the call to TaskQueueChecker.dispatchJobToChannel().
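The shape of that missing guard might look like the sketch below. Note that Channel, dispatch() and handleException() here are hypothetical stand-ins for the actual server-side classes, not JPPF's real API; only the try/catch structure mirrors the fix being described:

```java
// Hypothetical stand-ins for the server-side dispatch path; the point is
// that an exception during dispatch must route into the channel's own
// error handling, or the channel is never closed and discarded.
class DispatchGuard {
    interface Channel {
        void dispatch() throws Exception;   // may fail if the node just died
        void handleException(Exception e);  // closes and discards the channel
    }

    /** Returns true if the dispatch succeeded, false if the channel was discarded. */
    static boolean dispatchJobToChannel(Channel channel) {
        try {
            channel.dispatch();
            return true;
        } catch (Exception e) {
            // Without this handler the exception escapes, the channel is never
            // cleaned up, and the server is left with a phantom idle node.
            channel.handleException(e);
            return false;
        }
    }
}
```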
I'll upload and do another test.
lolo4j wrote:
lolo4j wrote:
Unfortunately, it looks like this just recurred, so it's not fixed....
I currently have 3200 idle nodes... and counting....
lolo4j wrote:
Some of the nodes show classloading errors. Unfortunately I didn't copy the stack trace before shutting it down.
Daniel Widdis wrote:
- The problem began when I upgraded from 3.4.1 to 3.4.3. I would suspect some of the code from fixes in the last two maintenance releases is involved.
- The initial problem only occurs when there are slave nodes provisioned. If I run a grid entirely with master nodes, I get no problems.
- The problem occurs even without my automatic node provisioning code above: it also occurred when I manually provisioned nodes using the GUI. I did provision while nodes were still starting up and connecting, so there may have been a timing issue involved.
- I have uploaded the patch file above (Tentative fix) to my driver's classpath. Everything else is stock 3.4.3.
- This appears to occur with larger grids of about 20 servers x (master + 2 nodes) = 60 nodes. When I tested your fix with 12 servers x 3 nodes per server = 36 total, I had no problems.
- This may be related to one type of task finishing and the node beginning to process another type of task (hypothesis, not yet tested)
- The initial symptom I see is in the GUI: one or two of the nodes turn "red". Logging into that server and checking the logs showed some exceptions relating to classloading. I did not copy them down at the time but will try to capture them later. I recall it was ClassLoaderImpl in the common package, and there may have been a related IOException on the channel it was trying to use for classloading.
- The node attempts to restart a few times.
- At some point during all this, the idle node count increments and the queue decrements so when other nodes finish their tasks, there is nothing on the queue to give them.
I'll do more troubleshooting tomorrow or Monday.
It may also not be the change from 3.4.1 to 3.4.3, but the change in how quickly I provision slave nodes. Before, I tended to do it manually, server by server, but lately I just do a Ctrl-A to select the whole list and provision all at once. (And I've repeated this process multiple times as more of my cloud nodes connect.)
Any guidance on what you might need from me for troubleshooting? My plan was to try the following:
lolo4j wrote:
To reproduce with a single master, I have a separate thread in my client which provisions 20 slaves, waits 10s, un-provisions the slaves, and waits 1s. Wash, rinse, repeat. The greater the number of slaves, the sooner the issue is triggered. I'm using a job streaming pattern in the client with tasks that simply wait for 1ms and then log a string via log4j2, so they don't have to do much. The driver uses "nodethread" load-balancing with multiplicator = 1.
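The provision/un-provision cycle above can be sketched as follows. This is a simplified model, not the actual test code: provisionSlaves() is a hypothetical stand-in for the JMX call that sets the slave count, and the delays are shortened from the real 10s/1s:

```java
// Simplified model of the reproduction loop: repeatedly provision a batch
// of slaves, let them run briefly, then un-provision them all.
class ProvisioningStressTest {
    private int provisionedSlaves = 0;

    // Stand-in for the JMX operation that sets the number of slave nodes.
    void provisionSlaves(int count) { provisionedSlaves = count; }

    int runCycles(int cycles, int slaves) {
        for (int i = 0; i < cycles; i++) {
            provisionSlaves(slaves); // provision the slaves (20 in the real test)
            sleep(10);               // let them run (10s in the real test)
            provisionSlaves(0);      // un-provision all of them
            sleep(1);                // pause before the next cycle (1s in the real test)
        }
        return provisionedSlaves;    // 0 after the final un-provision
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

The rapid churn of slaves appearing and disappearing is what triggers the race in the driver.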
Additional clues that I found by looking at the state of the driver with VisualVM, after I tried and killed all nodes from the console:
This exception prevents the handleException() method of the corresponding channel from being invoked, thus it is not properly closed and discarded from the server.
However, I'm still seeing a job hang at some point, and also a number of jmx@xxx threads trying to connect to the nodes via JMX, even though those nodes are dead...
The good news is that now I've completely automated the test and the failure detection (a job hang is detected when the server stops incrementing its executed tasks count), so I'm wasting less time on this. My test sends 1000 jobs with 1000 tasks each, one at a time. So far I've never made it through all 1000 jobs, so the test is a valid one. Once it passes, I'll run a much larger number of jobs (by at least 1 or 2 orders of magnitude) to make sure the server will hold up over time. But we're not there yet.
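The hang detection described here amounts to comparing successive polls of the executed-task count and flagging a stall when it stays flat. A minimal sketch of that check, written as a pure function over the polled values (a simplified model, not the actual test code):

```java
// Flags a job hang: the executed-task count should keep growing while the
// test is running, so a run of identical consecutive polls signals a stall.
class HangDetector {
    /** Returns true if the count was flat for at least stallPolls consecutive polls. */
    static boolean isHung(long[] executedTaskCounts, int stallPolls) {
        int flat = 0;
        for (int i = 1; i < executedTaskCounts.length; i++) {
            flat = (executedTaskCounts[i] == executedTaskCounts[i - 1]) ? flat + 1 : 0;
            if (flat >= stallPolls) return true;
        }
        return false;
    }
}
```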
I guess I should find a way to identify the node in ServerTaskBundleNode.toString(), if possible.

The node channels show only the master node:
The admin console's job data view still shows the job is dispatched to 2 dead nodes:

And I see there are 2 live "jmx@xxx" threads that should have died:
- lcohen-CSL:11198 - streaming job 421
- 12.0.0.1:12019
- 12.0.0.1:12029

This is turning out to be a hard-to-find issue, but it is also an excellent test of how the grid behaves under that kind of stress, with frequent massive dynamic changes in the topology. I apologize for the time it takes; this is one of those bugs that require time, patience, hard work and imagination. I've already had bugs where I literally spent weeks finding the cause and a solution, and this looks like one of them. I just ain't giving up, ever.
lolo4j wrote:
No problem at all! I'm just happy when you can reproduce the same symptoms I have and trust you're taking the time to fix it the right way. I have workarounds for the issues and can be patient.
The only issue now remaining, or at least the only one I can detect, is a job hang due to a single batch of dispatched tasks that is never returned to the client. An additional clue is these warnings I sometimes see in the client log:
This indicates that the server is sending back results that were already sent to the client, or that there may be some confusion in the task positions. I'm still trying to figure out when and where this happens, which is proving difficult.
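One way to make such duplicate results visible on the client side would be a small tracker keyed on task position, warning whenever a position comes back twice. This is a hypothetical sketch, not JPPF code; the class and method names are made up for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical client-side check: remembers the positions of tasks whose
// results were already received for a job, and flags any repeat.
class ResultTracker {
    private final Set<Integer> returned = new HashSet<>();

    /** Returns true if this task position was already received (a duplicate result). */
    boolean receive(int taskPosition) {
        return !returned.add(taskPosition);
    }
}
```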
Fixes in