JPPF Issue Tracker
star_faded.png
Please log in to bookmark issues
bug_report_small.png
CLOSED  Bug report JPPF-267  -  Driver queue incorrectly increments/decrements when failed nodes present
Posted May 16, 2014 - updated May 20, 2014
icon_info.png This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
     
    Closed
  • Assigned to
     lolo4j
  • Progress
       
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     Daniel Widdis
  • Owned by
    Not owned by anyone
  • Estimated time
    Not estimated
  • Category
    Server
  • Resolution
    RESOLVED
  • Priority
    Normal
  • Reproducability
    Always
  • Severity
    Normal
  • Targetted for
    icon_milestones.png JPPF 4.1.2
Issue description
In some situations, a node becomes disconnected from the driver but still appears as available/idle on the driver. When new jobs are sent to the driver, it attempts to dispatch these to the failed nodes, decrementing the queue in the process. Since the node never actually receives these tasks, nothing happens. When another node becomes available, the tasks are dispatched to the working node, decrementing the queue an additional time. This results in negative queue lengths.

For example, I was processing tasks on six active nodes on Rackspace cloud servers. In order to change some coding, I stopped the client and "deleted" the cloud servers. Since they were not shut down from within JPPF, the driver did not recognize that they went away, and they still displayed on the Monitoring and Administration tool as Connected / Executing.

I restarted my client, submitting two jobs with eight total tasks to the driver.
13:28:55,153 INFO  [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
13:28:55,153 INFO  [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
Of note, the long-task-assessors are higher priority than short-task-assessors, but restricted by SLA to only process on servers with 1 thread. I'm using the nodethreads load balancing profile with multiplicator 1.

The JobNotificationListener on the server reported the following.
13:28:55,542 INFO  [ServerMonitor] - Job short-task-assessors queued. Queue now has 1 tasks.
13:28:55,637 INFO  [ServerMonitor] - Job short-task-assessors started
13:28:55,728 INFO  [ServerMonitor] - Job short-task-assessors dispatched 1 tasks to 10.176.67.98. Queue now has 1 tasks.
13:28:55,827 INFO  [ServerMonitor] - Job short-task-assessors dispatched 1 tasks to 10.176.68.115. Queue now has 1 tasks.
The first entry is the JOB_QUEUED event. Queue should have incremented to 2.

The second and third entry is the first JOB_DISPATCHED event. The fourth entry is the second JOB_DISPATCHED event. Note that neither of these node servers actually existed. The tasks were never actually dispatched anywhere. The queue is really actually 2 at this point.

It's odd that the queue ended up at 1 here. It should either be 2 (with no dispatches) or 0 (if both were dispatched).

Now continuing:
13:28:56,323 INFO  [ServerMonitor] - Job long-task-assessors queued. Queue now has 3 tasks.
13:28:56,418 INFO  [ServerMonitor] - Job long-task-assessors started
13:28:56,510 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.219. Queue now has 3 tasks.
13:28:56,606 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.69.44. Queue now has 3 tasks.
13:28:56,702 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.214. Queue now has 3 tasks.
13:28:56,796 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.145. Queue now has 3 tasks.
The first line is JOB_QUEUED, and last four lines are JOB_DISPATCHED. Note again, these servers don't exist. Tasks were never successfully dispatched. The queue is really actually 8 at this point. Since we started at 1, the final value of 3 (incrementing by 2) makes sense for queueing 6 and dispatching 4. But the overall total is given immediately on queueing, not waiting for the dispatching.

Now, two of my new cloud server nodes come online:
13:31:43,811 INFO  [ServerCreator] - [10.176.66.237] (512 MB, 1 vCPU) configured for long-task-assessors.
13:31:45,660 INFO  [ServerCreator] - [10.176.68.11] (512 MB, 1 vCPU) configured for long-task-assessors.
13:31:47,975 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.237. Queue now has 2 tasks.
13:31:48,749 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.11. Queue now has 1 tasks.
At this point the queue is actually 6, but is showing as 1.

Two more nodes come online:
13:32:04,611 INFO  [ServerCreator] - [10.176.66.168] (512 MB, 1 vCPU) configured for short-task-assessors.
13:32:05,324 INFO  [ServerCreator] - [10.176.66.9] (512 MB, 1 vCPU) configured for long-task-assessors.
13:32:07,602 INFO  [ServerMonitor] - Job short-task-assessors dispatched 2 tasks to 10.176.66.168. Queue now has -1 tasks.
13:32:07,787 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.9. Queue now has -2 tasks.
The first server has two threads, and since the long-tasks can't process on it, the driver sends two tasks from the short-task assessor. Note that these two tasks were the first ones allegedly dispatched to the nonexistent nodes earlier. The fact they were still sitting around showed they were still on the queue despite the queue having been decremented.

More servers come online:
13:33:32,147 INFO  [ServerCreator] - [10.176.67.31] (512 MB, 1 vCPU) configured for long-task-assessors.
13:33:36,039 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.67.31. Queue now has -3 tasks.
13:33:39,275 INFO  [ServerCreator] - [10.176.66.230] (512 MB, 1 vCPU) configured for long-task-assessors.
13:33:39,299 INFO  [ServerCreator] - [10.176.66.52] (512 MB, 1 vCPU) configured for long-task-assessors.
13:33:43,169 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.52. Queue now has -4 tasks.
13:33:44,154 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.230. Queue now has -5 tasks.
Here the queue should actually be 0, but it remains at -5 for the duration of the process.

EDIT: I've changed the Reproducability to "Always" by finding out that the key factor is whether any of the remaining servers are "idle" when they are disconnected. Since initially reporting the error I had updated my shutdown/server deletion code which made it harder to reproduce in normal operations; but now that I know what's going on I can reproduce at will.

Additional symptoms:
  • I started a set of tasks, and "killed" the node servers while they were still executing and there were still 2 tasks on the driver queue.
  • I stopped the client and started it again, and the two extra tasks showed up on the queue size. (I submitted 8 tasks):
16:04:09,450 INFO  [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
16:04:09,451 INFO  [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
16:04:09,824 INFO  [ServerMonitor] - Job short-task-assessors queued. Queue now has 4 tasks.
16:04:10,582 INFO  [ServerMonitor] - Job long-task-assessors queued. Queue now has 10 tasks.
  • The new tasks executed, but the 2 "phantom" tasks from the previous execution remained on the queue, even while servers were idle, and the queue remained at 2.
  • I "killed" the node servers again, and interrupted/restarted the client. At the time there were 3 "phantom" idle servers, one with two threads and two with one thread each (capacity for 4 tasks).
  • In the new process, the initial submission (before any real servers connected) looked like this:
16:24:50,833 INFO  [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
16:24:50,834 INFO  [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
16:24:51,216 INFO  [ServerMonitor] - Job short-task-assessors queued. Queue now has 2 tasks.
16:24:51,324 INFO  [ServerMonitor] - Job short-task-assessors started
16:24:51,438 INFO  [ServerMonitor] - Job short-task-assessors dispatched 1 task to 10.176.69.11. Queue now has 2 tasks.
16:24:51,556 INFO  [ServerMonitor] - Job short-task-assessors dispatched 1 task to 10.176.68.107. Queue now has 2 tasks.
16:24:51,940 INFO  [ServerMonitor] - Job long-task-assessors queued. Queue now has 8 tasks.
16:25:10,118 INFO  [ServerMonitor] - Job short-task-assessors dispatched 2 tasks to 10.176.69.13. Queue now has 6 tasks.
Note there were still 2 "phantom" tasks on the queue from 2 runs ago. So the initial short-task job added 2 more (queue 4) and then deducted 2 for the dispatches to each of the "phantom" 1-thread nodes. Then the additional 8 tasks in the long-task job arrived, and then 2 short-task jobs (perhaps the same ones that just failed on the other two nodes) were dispatched to the "phantom" 2-thread node. The queue ends at 6 (when it is in actuality 8): the 8 I submitted PLUS the initial 2 leftover jobs MINUS the 4 threads available on "non-existent" Predictably, this run ended with a queue length of -2.

EDIT AGAIN: Decided to try it yet another time with no "phantom" servers but starting with the -2 queue length. Throughout the run, queue lengths were offset by 2 from where they were expected, e.g.,
17:00:23,512 INFO  [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
17:00:23,513 INFO  [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
17:00:23,846 INFO  [ServerMonitor] - Job short-task-assessors queued. Queue now has 0 tasks.
17:00:24,778 INFO  [ServerMonitor] - Job long-task-assessors queued. Queue now has 6 tasks.
So there appear to be multiple issues involved here: leftover tasks from the previous run stay on the queue's total count (even when they will never be processed), negative queue length carries over to the next run, and nonexistent idle servers deduct from the queue even when tasks never process on them. Both of these situations survive a stats reset(). Obviously restarting the driver fixes it all...
Steps to reproduce this issue
  1. Start a client, sending several tasks to remote node(s)
  2. While the tasks are still processing, simulate hardware failure on the node(s). (I shutdown cloud servers, could probably simulate by unplugging device from network and rebooting.) It is key in this step that at least one node indicates "idle", as this will affect the queue on the next attempt.
  3. Shutdown the client.
  4. Restart the client, before conncting any nodes, but while the driver still thinks there are nodes connected.
  5. Observe driver queue length fail to properly increment to match pending tasks.
See attached ServerMonitor class.

EDIT: Just saw this again in a slightly different context. Apparently the idleness of the node is not as important: what is important is that work begins immediately when tasks are queued. Following sequence:
  1. Start a driver and single node. (In example below node is at 192.168.237.8)
  2. Start a client with a single task. (Task immediately sent to the node with queue size 0)
  3. While the task is processing on the node, start another node. (In example below, 10.176.68.163)
  4. Interrupt (Ctrl-C) the original node that the task was processing on. (Task re-sent to the new node and queue size becomes -1).
10:33:29,666 INFO  [AssessorsClient] - Dispatching 1 task for job long-task-assessors.
10:33:29,964 INFO  [ServerMonitor] - Job long-task-assessors queued. Queue now has 0 tasks.
10:33:30,061 INFO  [ServerMonitor] - Job long-task-assessors started
10:33:30,153 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 task to 192.168.237.8. Queue now has 0 tasks.
10:40:28,784 INFO  [ServerCreator] - [10.176.68.163] (512 MB, 1 vCPU) configured for long-task-assessors.
10:46:34,088 INFO  [ServerMonitor] - Job long-task-assessors dispatched 1 task to 10.176.68.163. Queue now has -1 tasks.
It may be that the requeue() method should increment the queue counter similarly to the addBundle() method.


#7
Comment posted by
 lolo4j
May 17, 08:51
A file was uploaded. ServerMonitor.javaicon_open_new.png
#14
Comment posted by
 lolo4j
May 18, 07:02
I reproduced the behavior by simply restarting a node while the job was executing, from the admin console. It seems that indeed the queue counter should be incremented in the job's onRequeue() callback.
#15
Comment posted by
 lolo4j
May 18, 10:10
Fixed in:
#18
Comment posted by
 lolo4j
May 19, 03:38
Reopening this issue, as I missed the following scenario:
  • start a driver without any node
  • start a client that submits a single job with a single task in it
  • when seeing the job and task queued in the console's server stats view, kill the client
  • ==> the stats show the latest number of tasks in the queue is 1 (should be 0) and the latest number of jobs in the queue is -1 (should be zero)
I'm attaching a screenshot of the admin console's server stats for clarity.
#19
Comment posted by
 lolo4j
May 19, 03:40
A file was uploaded.
server tats view
server tats viewicon_open_new.png
#20
Comment posted by
 lolo4j
May 20, 09:12
Fixed again in: