Author Topic: Task Hangs After Completion on Node Behind Firewall  (Read 757 times)

djroze

  • JPPF Knight
  • **
  • Posts: 20
Task Hangs After Completion on Node Behind Firewall
« on: January 27, 2015, 09:24:51 PM »

Hi Laurent,

   We are running JPPF version 4.2.5 with the architecture and patch 02 from http://www.jppf.org/forums/index.php/topic,7753.0.html. On a clean start-up we are at first able to run jobs successfully, but then, after a period of time (and it appears to happen more readily when the jobs are larger), processing gets hung up, with a job stuck in an incomplete state. In the case I reproduced, the stuck job was running on a node behind a firewall (i.e. its JMX ports are not accessible to the server), and from the behavior I saw before I put together a careful repro scenario, I would guess that processing always gets stuck on a node behind a firewall. The situation from the perspective of the node and of the server is as follows.

Node:

The node appears to have completed the job execution. The last lines in its debug-level log are:

Code:
2015-01-27 11:23:07,982 [DEBUG][org.jppf.scheduling.JPPFScheduleHandler.cancelAction(127)]: Task Timeout Timer : cancelling action for key=java.util.concurrent.FutureTask@4cbc9d1a, future=null
2015-01-27 11:23:07,983 [DEBUG][org.jppf.management.JPPFNodeAdmin.setTaskCounter(176)]: node tasks counter reset to 28 requested
2015-01-27 11:23:07,983 [DEBUG][org.jppf.server.node.JPPFNode.processResults(239)]: processing 1 task results for job '<JOB_NAME>'
2015-01-27 11:23:07,983 [DEBUG][org.jppf.server.node.remote.RemoteNodeIO.sendResults(114)]: writing results for JPPFTaskBundle[name=<JOB_NAME>, uuid=b0564f70-ca51-471a-a041-9d4c9f69025f,
 initialTaskCount=1, taskCount=1, bundleUuid=null, uuidPath=TraversalList[position=0, list=[2E72054C-1373-63B4-34B8-D650322BD351, 3EAF3D98-5057-0118-5043-543B4DE48156]]]

The node's stack trace is attached to this post as "node-jstack-1.txt", and its config is as follows:

Code:
jppf.server.host = <SERVER_IP>
jppf.server.port = <SERVER_SSL_PORT>
jppf.discovery.enabled = false

jppf.management.enabled = false
jppf.management.port = 12000
jppf.management.ssl.enabled = false
jppf.management.ssl.port = 22000

jppf.ssl.enabled = true
jppf.ssl.configuration.file = <SSL_CONFIG_FILE>

jppf.socket.keepalive = true
jppf.socket.max-idle = 60

jppf.reconnect.max.time = -1

The relevant jar MD5s for the node are:

Code:
58389a60b231e160ded7aa8963924ae1  lib/jppf-client-4.2.5.jar
07fdfc557a88133730af921d69ec93b4  lib/jppf-common-4.2.5.jar
516a8988de363958a2036d5466a0aeaf  lib/jppf-common-node-4.2.5-patch-02.jar
3e8c0634342a069e1cb444d1602b26b6  lib/jppf-jmxremote_optional-1.1.jar
ced3f6897090cbeeee2400260024b4f2  lib/jppf-server-4.2.5-patch-02.jar

Server:

As viewed from the administration console, the server thinks the job is still in progress: State is "Executing", Initial task count = 1, and Current task count = 0. The node doesn't appear in the admin console, I believe because its JMX is explicitly disabled. The server process is continuously allocating a lot of new memory (as seen in the GC log), and its CPU usage is pegged at or near 100%. In the server log, the following message is spammed heavily (a few times per millisecond)...

Code:
2015-01-27 20:09:17,651 [DEBUG][org.jppf.nio.StateTransitionManager.submitTransition(92)]: submitting transition for channel id=30

... and if I strip out these messages, the remainder looks like this about once per second:

Code:
2015-01-27 20:10:07,517 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardInvoke(87)]: invoking state() on mbean=org.jppf:name=admin,type=node for selector=org.jppf.management.NodeSelector$UuidSelector@70ed20b4 (1 channels)
2015-01-27 20:10:07,518 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardInvoke(95)]: invoking state() on mbean=org.jppf:name=admin,type=node for node=DF06EB25-2F34-C9DB-BB6B-46265C40ADAD with jmx=
2015-01-27 20:10:07,566 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardGetAttribute(127)]: getting attribute 'NbSlaves' from node DF06EB25-2F34-C9DB-BB6B-46265C40ADAD, result = 0
2015-01-27 20:10:08,084 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardInvoke(87)]: invoking healthSnapshot() on mbean=org.jppf:name=diagnostics,type=node for selector=org.jppf.management.NodeSelector$UuidSelector@7cb01ba3 (1 channels)
2015-01-27 20:10:08,085 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardInvoke(95)]: invoking healthSnapshot() on mbean=org.jppf:name=diagnostics,type=node for node=DF06EB25-2F34-C9DB-BB6B-46265C40ADAD with jmx=
2015-01-27 20:10:08,517 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardInvoke(87)]: invoking state() on mbean=org.jppf:name=admin,type=node for selector=org.jppf.management.NodeSelector$UuidSelector@55a1b824 (1 channels)
2015-01-27 20:10:08,518 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardInvoke(95)]: invoking state() on mbean=org.jppf:name=admin,type=node for node=DF06EB25-2F34-C9DB-BB6B-46265C40ADAD with jmx=
2015-01-27 20:10:08,567 [DEBUG][org.jppf.management.forwarding.JPPFNodeForwarding.forwardGetAttribute(127)]: getting attribute 'NbSlaves' from node DF06EB25-2F34-C9DB-BB6B-46265C40ADAD, result = 0

The server's stack trace is attached to this post as "server-jstack-1.txt", and its config is as follows:

Code:
jppf.server.host = <SERVER_IP>
jppf.server.port = <SERVER_PORT>
jppf.ssl.server.port = <SERVER_SSL_PORT>
jppf.ssl.configuration.file = <SSL_CONFIG_FILE>

jppf.management.enabled = true
jppf.management.port = <SERVER_JMX_PORT>
jppf.management.ssl.enabled = true
jppf.management.ssl.port = <SERVER_JMX_SSL_PORT>

jppf.management.connection.timeout = 20000

jppf.local.node.enabled = true

jppf.peer.ssl.enabled = true

jppf.load.balancing.algorithm = manual
jppf.load.balancing.profile = manual_profile
jppf.load.balancing.profile.manual_profile.size = 1

jppf.recovery.enabled = false

jppf.socket.keepalive = true
jppf.socket.max-idle = 300

transition.thread.pool.size = 2

The relevant jar MD5s for the server are:

Code:
58389a60b231e160ded7aa8963924ae1  lib/jppf-client-4.2.5.jar
07fdfc557a88133730af921d69ec93b4  lib/jppf-common-4.2.5.jar
516a8988de363958a2036d5466a0aeaf  lib/jppf-common-node-4.2.5-patch-02.jar
3e8c0634342a069e1cb444d1602b26b6  lib/jppf-jmxremote_optional-1.1.jar
ced3f6897090cbeeee2400260024b4f2  lib/jppf-server-4.2.5-patch-02.jar

   Please let me know if you have any ideas about what might be going on here, and/or if I can provide any further data to help troubleshoot this. Thanks in advance!

- Daniel

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2211
    • JPPF Web site
Re: Task Hangs After Completion on Node Behind Firewall
« Reply #1 on: January 27, 2015, 10:24:22 PM »

Hi Daniel,

The first thing that strikes me is that, according to the node's log tail and the jstack thread dump, the node is stuck while sending the job result to the server. In the case of full job completion, the last message in the log should look like this: [org.jppf.server.node.remote.RemoteNodeIO.deserializeObjects(63)]: waiting for next request...

I'm ready to bet that the "spam" message you see in the driver's log relates to the connection with this node. What I suspect is that the driver is continuously trying to read data from the node's channel, but the reads always return 0 bytes (the server uses non-blocking network I/O).

What does this indicate? It appears that, for some reason, the connection between the node and the server was severed, but neither side is aware of it: the node is still trying to write data and the server is still trying to read data. My suspicion is that the firewall is configured to drop network connections that have been idle for too long, leaving these connections in an undefined state. This would happen whenever the node takes longer than the firewall's connection timeout to process a job.

I can see 3 approaches to resolve this problem:

1) The easiest, if policy allows it: disable the firewall connection timeout or set it to a larger, more appropriate value.

2) Configure the recovery mechanism on the node(s) and the server (see the configuration sketch after this list). This sets up a heartbeat-based polling mechanism that lets each side detect whether its peer is still responsive. The main drawback is that the tasks dispatched to the node will be cancelled and resubmitted; however, this is mitigated by the fact that the connection failure is detected sooner than if you do nothing (in which case problems only start to occur after the tasks have been executed by the node).

3) Configure the node to run in offline mode. Offline nodes only connect to the server to fetch new tasks or to send execution results, and disconnect in the meantime, while processing tasks. In this case, the drawback is that the distributed class loader is completely disabled, so the classes you need must either be in the node's local classpath or be transported along with the job.
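
To make this more concrete, here is a minimal sketch of the configuration for options 2 and 3. Treat it as illustrative only: apart from jppf.recovery.enabled, the property names and values should be double-checked against the configuration guide for your version.

Code:
# option 2: heartbeat-based recovery, enabled on BOTH the driver and the node
# driver side:
jppf.recovery.enabled = true
# node side:
jppf.recovery.enabled = true

# option 3: offline node (node side only; property name as I recall it, please verify)
jppf.node.offline = true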

The other server log messages, emitted by JPPFNodeForwarding, are triggered by the admin console, which polls the topology at regular intervals (1 second by default). These intervals can be changed in the admin console's configuration if needed.

Sincerely,
-Laurent

djroze

  • JPPF Knight
  • **
  • Posts: 20
Re: Task Hangs After Completion on Node Behind Firewall
« Reply #2 on: January 28, 2015, 12:42:05 AM »

Hi Laurent,

   Thanks very much for the quick response. I did notice in our configs that we had recovery explicitly disabled, but this may have been something we needed to set a while back for some other reason and is no longer necessary. Since changing the firewall settings is not an option, I'll experiment with the recovery settings and, failing that, will see how offline mode works for us. Will post an update here when I have anything noteworthy to report.

Many thanks,
Daniel

djroze

  • JPPF Knight
  • **
  • Posts: 20
Re: Task Hangs After Completion on Node Behind Firewall
« Reply #3 on: January 29, 2015, 11:39:15 PM »

Hi Laurent,

   I have been experimenting with the job size and the recovery settings mentioned above, and we seem to be seeing better stability, but I'm not sure whether the recovery settings are actually being loaded or which factor is responsible. I did have one question about configuring this, however... I noticed that the documentation says the following:

"As a general rule of thumb, the settings should always respect the following constraint: serverReaperInterval < nodeMaxRetries * nodeTimeout"

   In the example given, the reaper interval is 60000 ms, max retries is 3, and the read timeout is 6000 ms, which would make the relationship 60000 < 18000, thus violating the constraint if I'm understanding everything right. Is the constraint supposed to be "serverReaperInterval > nodeMaxRetries * nodeTimeout" instead?

   Also, am I correct in understanding that only the server config needs to be changed to enable this recovery system, and nothing is required on the node side?

Thanks very much,
Daniel

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2211
    • JPPF Web site
Re: Task Hangs After Completion on Node Behind Firewall
« Reply #4 on: January 31, 2015, 09:42:35 AM »

Hi Daniel,

Quote
Is the constraint supposed to be "serverReaperInterval > nodeMaxRetries * nodeTimeout" instead?
No. Note that here we're talking about the reaper interval of the server, and the max retries and timeout defined in the node. Each node has its own heartbeat check: if it doesn't receive a message from the server maxRetries times in a row, each time waiting for nodeTimeout, it will close the connection on its side. If serverReaperInterval were greater than nodeMaxRetries * nodeTimeout, the node would use up all its attempts at getting a message from the server before the next reaper run; since the server sends nothing during that time, the node would receive nothing and conclude that the connection is broken, even though it isn't.
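
To put some numbers on it (the values below are purely illustrative):

Code:
# constraint: serverReaperInterval < nodeMaxRetries * nodeTimeout
# e.g. with serverReaperInterval = 60000 ms, you need nodeMaxRetries * nodeTimeout > 60000 ms:
#   3 retries * 30000 ms timeout = 90000 ms > 60000 ms
# so the node keeps waiting long enough to receive the next message from the reaper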

The recovery mechanism is implemented this way so that the heartbeat uses a single additional network connection per node. To get rid of this constraint we would have had to use two connections instead, one initiated by each side. We can always revisit this design in a future version if there is a convincing use case.

Quote
Also, am I correct in understanding that only the server config needs to be changed to enable this recovery system, and nothing is required on the node side?
Unfortunately no, the recovery must at least also be enabled in the node configuration for the mechanism to apply. This is because recovery is disabled by default in both the nodes and the drivers. This allows for more flexibility (some of the nodes may not need recovery to be enabled) at the cost of a little additional configuration overhead.

Thus, it looks like the recovery mechanism may not be enabled on your grid, and the improvements you have seen are most likely due to other factors.

I hope this clarifies,
-Laurent

djroze

  • JPPF Knight
  • **
  • Posts: 20
Re: Task Hangs After Completion on Node Behind Firewall
« Reply #5 on: February 17, 2015, 10:45:40 PM »

Hi Laurent,

   Thanks for the additional info on this. It's about 75% clear for me right now, but I will spend some more time re-reading this to try to understand it fully. :-P Regarding the constraint, I think the missing piece of the puzzle for me was the separate node configuration doc (http://www.jppf.org/doc/v4/index.php?title=Node_configuration#Recovery_from_hardware_failures), noting it here for anyone else who may be in the same boat or read this later. I had thought the constraint in the server config doc was only referring to parameters defined on the server side.

   The one-connection approach is actually preferable for our architecture, so no problem there.

   I set up recovery configs similar to the examples given in the docs, but we still seem to have some stability issues (occasional warnings about "null context" / "job may be stuck"); however, it's not something I have condensed into a reproducible case yet, so I will probably put details in a new thread if/when I can isolate that behavior.

Thanks,
Daniel

subes

  • JPPF Padawan
  • *
  • Posts: 3
Re: Task Hangs After Completion on Node Behind Firewall
« Reply #6 on: January 22, 2018, 01:53:32 AM »

I encountered the same problem; increasing jppf.recovery.read.timeout to 30-60 seconds solved it for me. This makes sure the constraint "serverReaperInterval < nodeMaxRetries * nodeTimeout" is satisfied. The default of 6 seconds just did not work for me. Maybe the default and/or the documentation should be changed?
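
For reference, a sketch of settings along these lines; the values are illustrative, and apart from jppf.recovery.enabled and jppf.recovery.read.timeout the property names are from memory, so double-check them against the configuration guide:

Code:
# driver side
jppf.recovery.enabled = true
# reaper interval in ms (property name from memory, please verify)
jppf.recovery.reaper.run.interval = 60000

# node side
jppf.recovery.enabled = true
jppf.recovery.max.retries = 3
# raised from the 6000 ms default so that 3 * 30000 = 90000 ms > 60000 ms reaper interval
jppf.recovery.read.timeout = 30000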