JPPF Issue Tracker
CLOSED  Bug report JPPF-345  -  Shutdown vs. Provisioning race condition causes duplicate nodes with incomplete tasks
Posted Oct 30, 2014 - updated Nov 05, 2014
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
    Closed
  • Assigned to
     lolo4j
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     Daniel Widdis
  • Owned by
    Not owned by anyone
  • Category
    Management / Monitoring
  • Resolution
    RESOLVED
  • Priority
    Normal
  • Reproducibility
    Always
  • Severity
    Normal
  • Targeted for
    JPPF 4.2.4
Issue description
In an admittedly poorly synchronized attempt to implement a home-grown version of my request for JPPF-303 (because I don't want to wait for JPPF 5!), I've stumbled on a sequence of events that results in duplicate host:port combinations in nodesInformation, with undesirable side effects.

If I shutdown() a slave node, but then employ node provisioning on its corresponding master node before the slave is removed from the driver's node list, I end up with duplicates that won't go away without restarting the driver. Additionally, if the "dead" node was executing a task at the time it was shut down, this task is never reported as incomplete.

[Screenshot of bug]

Although the duplicate node is shown in red, indicating that it is shut down, the presence of another node (the newly provisioned one) with the same host and port appears to prevent it from ever disappearing, and it remains in the Collection<JPPFManagementInfo> returned from the driver's nodesInformation() method. Some GUI features, such as the "JVM Health" view, get information for the node twice (the CPU% differs slightly but most of the other elements are identical between the dupes), and the number of Active Nodes listed is, of course, inaccurate.

Further, even when the master node is eventually shut down, the "stray" shutdown nodes remain in the driver's list (possibly with unfinished tasks), and can only be removed by restarting the driver. Restarting the driver sends the incomplete tasks back to the client, where they are resubmitted to the driver once it starts up again.

While I can obviously improve my sequencing to avoid causing this problem in the first place, and work around some of the duplicate issues when iterating, I think these side effects bear looking into from a behind-the-scenes consistency standpoint. Clearly the GUI is able to detect that the node is shut down, but it never seems to do anything about it. It seems reasonable to assume that if there's another active node in the collection with the same host:port combination, then the "dead" node is unreachable, and it should be removed and any of its unfinished tasks resubmitted.
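
As a rough illustration of the cleanup rule suggested above (and of the client-side workaround when iterating), here is a minimal sketch in plain Java. It does not use the JPPF API: NodeEntry is a hypothetical stand-in for the host, port and active/shutdown state one would read from each entry of nodesInformation(), and StaleNodeFilter and withoutStaleDuplicates are made-up names. It is not how the driver actually tracks nodes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StaleNodeFilter {
  /** Hypothetical stand-in for one node entry: host, port and whether it is still alive. */
  public static final class NodeEntry {
    public final String host;
    public final int port;
    public final boolean active;

    public NodeEntry(String host, int port, boolean active) {
      this.host = host;
      this.port = port;
      this.active = active;
    }

    String key() {
      return host + ":" + port;
    }
  }

  /**
   * Returns the entries in their original order, dropping every inactive entry
   * whose host:port is also held by an active entry. Inactive entries with a
   * unique host:port are kept, since nothing proves they are unreachable.
   */
  public static List<NodeEntry> withoutStaleDuplicates(Collection<NodeEntry> nodes) {
    // First pass: collect the host:port keys that belong to an active node.
    Set<String> activeKeys = new HashSet<>();
    for (NodeEntry node : nodes) {
      if (node.active) {
        activeKeys.add(node.key());
      }
    }
    // Second pass: keep active nodes, and inactive ones that have no active twin.
    List<NodeEntry> kept = new ArrayList<>();
    for (NodeEntry node : nodes) {
      if (node.active || !activeKeys.contains(node.key())) {
        kept.add(node);
      }
    }
    return kept;
  }
}
```

A driver-side fix would additionally have to resubmit the unfinished tasks of each dropped entry, which this sketch does not attempt.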

Of note, I do have jppf.recovery.enabled = true in my driver's configuration file, but I don't think the recovery pings are set up to go to slave nodes.
Steps to reproduce this issue
  1. Set up a grid and provision one slave node.
  2. Use a JMXNodeConnectionWrapper to issue a shutdown() for the slave node.
  3. Within a few seconds, provision one slave node on the same master.
The shutdown() immediately decrements the nbSlaves() result for the master so that it thinks (correctly) that it has 0 slave nodes; so the subsequent provisionSlaveNodes() call succeeds in creating a new node with the same host/port combination that the previous one had.

However, since the driver has not yet detected the shutdown, it keeps the deleted node in its list, even when the new node connects on the same host and port, and fails to subsequently detect that the original node is missing.
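
Until the driver handles this case itself, one way to avoid the race from the calling side is to wait until the driver no longer lists the old slave before provisioning a new one. The sketch below is plain Java with made-up names (ProvisioningGuard, awaitRemoval); the stillListed check is an assumption standing in for whatever test the caller runs against the collection returned by nodesInformation().

```java
import java.util.function.BooleanSupplier;

public class ProvisioningGuard {
  /**
   * Polls stillListed until it returns false or the timeout expires.
   *
   * @param stillListed   hypothetical check: true while the driver still lists the old node
   * @param timeoutMillis maximum time to wait
   * @param pollMillis    pause between two checks
   * @return true if the old node disappeared in time, false on timeout or interruption
   */
  public static boolean awaitRemoval(BooleanSupplier stillListed, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (stillListed.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // the driver still lists the node: do not provision yet
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

The caller would issue shutdown() on the slave, call awaitRemoval(...), and only request provisioning on the master if it returns true. This narrows the window but does not fix the underlying driver behavior described above.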

#3
Comment posted by
 lolo4j
Oct 30, 06:22
I have already seen this behavior while working on Bug report JPPF-342 - Uncontrolled incrementing of idle node count. It is due to a general lack of exception handling in some areas of the server code. This is now fixed locally (i.e. not committed) and I will close this bug when I'm done with the first one.
#5
Comment posted by
 lolo4j
Nov 05, 19:33
Fixed along with other fixes in Bug report JPPF-342 - Uncontrolled incrementing of idle node count

The issue was updated with the following change(s):
  • This issue has been closed
  • The status has been updated, from New to Closed.
  • The resolution has been updated, from Not determined to RESOLVED.
  • Information about the user working on this issue has been changed, from lolo4j to Not being worked on.