JPPF, java, parallel computing, distributed computing, grid computing, parallel, distributed, cluster, grid, cloud, open source, android, .net
JPPF

The open source
grid computing
solution

 Home   About   Features   Download   Documentation   On Github   Forums 
September 24, 2021, 02:15:36 PM *
Welcome,
Please login or register.

Login with username, password and session length
Advanced search  
News: New users, please read this message. Thank you!
  Home Help Search Login Register  
Pages: [1]   Go Down

Author Topic: ClientWithFailover doesn't reconnect to the failed driver  (Read 4436 times)

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
ClientWithFailover doesn't reconnect to the failed driver
« on: March 03, 2014, 11:54:33 AM »

Hi JPPF Team,

I decided to use 2 drivers to make our application as stable as it possible - one active, one passive (to not have single point of failure). We have N nodes that are connected to active driver and they reconnect to passive driver as soon as active driver goes down (it was implemented via InitializationHook). I have client application that uses ClientWithFailover class to submit jobs to active driver (driver1 with priority 100) and passive driver (driver2 with priority 50).

Now I faced next problems:
Scenario 1:
  - Driver1 is up, Driver2 is down.
  - I submit non-blocking job
  - On the middle of job execution stop driver1
  - Start driver2. Nodes switch successfully to driver2 as expected
  - BUT job stops processing and in client logs I see

Code: [Select]
2014-03-03 09:44:55,923 WARN  - [1393839895932ms]: [driver1-8 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-8 - ClassServer] {}
2014-03-03 09:44:55,923 WARN  - [1393839895932ms]: [driver1-7 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-7 - ClassServer] {}
2014-03-03 09:44:55,923 WARN  - [1393839895932ms]: [driver1-2 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-2 - ClassServer] {}
2014-03-03 09:44:55,923 WARN  - [1393839895933ms]: [driver1-10 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-10 - ClassServer] {}
2014-03-03 09:44:55,923 ERROR - [1393839895933ms]: java.io.EOFException: null [org.jppf.client.BaseJPPFClientConnection] [RemoteChannelWrapperdriver1-4--1] {}
2014-03-03 09:44:55,939 WARN  - [1393839895940ms]: [driver1-3 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-3 - ClassServer] {}
2014-03-03 09:44:55,939 WARN  - [1393839895941ms]: [driver1-4 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-4 - ClassServer] {}
2014-03-03 09:44:55,939 WARN  - [1393839895941ms]: [driver1-1 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-1 - ClassServer] {}
2014-03-03 09:44:55,939 WARN  - [1393839895941ms]: [driver1-6 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-6 - ClassServer] {}
2014-03-03 09:44:55,939 WARN  - [1393839895941ms]: [driver1-5 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-5 - ClassServer] {}
2014-03-03 09:44:55,939 WARN  - [1393839895941ms]: [driver1-9 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-9 - ClassServer] {}
2014-03-03 09:44:55,939 WARN  - [1393839895942ms]: java.io.EOFException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver1-4--1] {}
2014-03-03 09:45:05,362 WARN  - [1393839905358ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-4--1] {}
2014-03-03 09:45:05,362 WARN  - [1393839905363ms]: [driver2-4 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-4 - ClassServer] {}
2014-03-03 09:45:05,425 WARN  - [1393839905424ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-5--1] {}
2014-03-03 09:45:05,425 WARN  - [1393839905428ms]: [driver2-5 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-5 - ClassServer] {}
2014-03-03 09:45:05,472 WARN  - [1393839905475ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-3--1] {}
2014-03-03 09:45:05,472 WARN  - [1393839905478ms]: [driver2-3 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-3 - ClassServer] {}
2014-03-03 09:45:05,534 WARN  - [1393839905541ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-2--1] {}
2014-03-03 09:45:05,550 WARN  - [1393839905545ms]: [driver2-2 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-2 - ClassServer] {}
2014-03-03 09:45:05,612 WARN  - [1393839905608ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-7--1] {}
2014-03-03 09:45:05,612 WARN  - [1393839905613ms]: [driver2-7 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-7 - ClassServer] {}
2014-03-03 09:45:05,659 WARN  - [1393839905668ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-6--1] {}
2014-03-03 09:45:05,674 WARN  - [1393839905674ms]: [driver2-6 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-6 - ClassServer] {}
2014-03-03 09:45:05,721 WARN  - [1393839905730ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-9--1] {}
2014-03-03 09:45:05,737 WARN  - [1393839905734ms]: [driver2-9 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-9 - ClassServer] {}
2014-03-03 09:45:05,784 WARN  - [1393839905790ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-8--1] {}
2014-03-03 09:45:05,799 WARN  - [1393839905798ms]: [driver2-8 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-8 - ClassServer] {}
2014-03-03 09:45:05,846 WARN  - [1393839905845ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-10--1] {}
2014-03-03 09:45:05,846 WARN  - [1393839905851ms]: [driver2-10 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-10 - ClassServer] {}
2014-03-03 09:45:05,924 WARN  - [1393839905921ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-1--1] {}
2014-03-03 09:45:05,924 WARN  - [1393839905926ms]: [driver2-1 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-1 - ClassServer] {}
2014-03-03 09:45:06,376 ERROR - [1393839906375ms]: java.io.EOFException: null [org.jppf.client.BaseJPPFClientConnection] [connecting driver2-4[localhost:21200] : EXECUTING] {}
2014-03-03 09:45:06,376 WARN  - [1393839906375ms]: [driver2-4 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-4 - ClassServer] {}
2014-03-03 09:45:06,376 WARN  - [1393839906378ms]: error initializing connection to job server: java.io.EOFException: null [org.jppf.client.TaskServerConnectionHandler] [connecting driver2-4[localhost:21200] : EXECUTING] {}
2014-03-03 09:45:06,439 ERROR - [1393839906438ms]: java.io.EOFException: null [org.jppf.client.BaseJPPFClientConnection] [connecting driver2-5[localhost:21200] : EXECUTING] {}
2014-03-03 09:45:06,439 WARN  - [1393839906438ms]: error initializing connection to job server: java.io.EOFException: null [org.jppf.client.TaskServerConnectionHandler] [connecting driver2-5[localhost:21200] : EXECUTING] {}
2014-03-03 09:45:06,439 WARN  - [1393839906439ms]: [driver2-5 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-5 - ClassServer] {}
2014-03-03 09:45:06,532 WARN  - [1393839906539ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-3--1] {}
2014-03-03 09:45:06,532 WARN  - [1393839906541ms]: java.net.ConnectException: Client connection not opened [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-3--1] {}
2014-03-03 09:45:06,548 WARN  - [1393839906548ms]: [driver2-3 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-3 - ClassServer] {}
2014-03-03 09:45:06,548 ERROR - [1393839906556ms]: java.io.EOFException: null [org.jppf.client.BaseJPPFClientConnection] [connecting driver2-2[localhost:21200] : EXECUTING] {}
2014-03-03 09:45:06,548 WARN  - [1393839906556ms]: error initializing connection to job server: java.io.EOFException: null [org.jppf.client.TaskServerConnectionHandler] [connecting driver2-2[localhost:21200] : EXECUTING] {}
2014-03-03 09:45:06,673 WARN  - [1393839906675ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-7--1] {}
2014-03-03 09:45:06,673 WARN  - [1393839906678ms]: [driver2-7 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-7 - ClassServer] {}
2014-03-03 09:45:06,751 WARN  - [1393839906745ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-6--1] {}
2014-03-03 09:45:06,751 WARN  - [1393839906747ms]: java.net.ConnectException: Client connection not opened [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-6--1] {}

in nodes logs I see

Code: [Select]
java.lang.NullPointerException
at org.jppf.classloader.AbstractClassLoaderConnection.loadResource(AbstractClassLoaderConnection.java:133)
at org.jppf.classloader.AbstractJPPFClassLoader.computeRemoteData(AbstractJPPFClassLoader.java:207)
at org.jppf.classloader.AbstractJPPFClassLoader.computeCallable(AbstractJPPFClassLoader.java:188)

JPPF Grid stops work at all. I have to restart Drivers, Nodes and Clients to submit new jobs.

Scenario 2:
  - Driver1 is up, Driver2 is down.
  - Execute job via Driver1, receive all results
  - Stop driver1 and start driver2. Nodes switch to driver2 successfully.
  - Resubmit the same job. Job failed. In client logs I see

Code: [Select]
2014-03-03 10:46:20,561 WARN  - [1393843580587ms]: [driver2-10 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-10 - ClassServer] {}
2014-03-03 10:46:20,623 WARN  - [1393843580652ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-9--1] {}
2014-03-03 10:46:20,639 WARN  - [1393843580656ms]: [driver2-9 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-9 - ClassServer] {}
2014-03-03 10:46:20,686 WARN  - [1393843580713ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-5--1] {}
2014-03-03 10:46:20,701 WARN  - [1393843580718ms]: [driver2-5 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-5 - ClassServer] {}
2014-03-03 10:46:20,967 WARN  - [1393843580995ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-6--1] {}
2014-03-03 10:46:20,982 WARN  - [1393843580999ms]: [driver2-6 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-6 - ClassServer] {}
2014-03-03 10:46:21,060 WARN  - [1393843581089ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-2--1] {}
2014-03-03 10:46:21,076 WARN  - [1393843581097ms]: [driver2-2 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-2 - ClassServer] {}
2014-03-03 10:46:21,279 WARN  - [1393843581296ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-1--1] {}
2014-03-03 10:46:21,279 WARN  - [1393843581301ms]: [driver2-1 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-1 - ClassServer] {}
2014-03-03 10:46:21,481 WARN  - [1393843581506ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-3--1] {}
2014-03-03 10:46:21,481 WARN  - [1393843581508ms]: java.net.ConnectException: Client connection not opened [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-3--1] {}
2014-03-03 10:46:21,481 WARN  - [1393843581511ms]: [driver2-3 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-3 - ClassServer] {}
2014-03-03 10:46:21,559 WARN  - [1393843581575ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-7--1] {}
2014-03-03 10:46:21,559 WARN  - [1393843581577ms]: [driver2-7 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-7 - ClassServer] {}
2014-03-03 10:46:21,622 WARN  - [1393843581648ms]: java.lang.NullPointerException: null [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-10--1] {}
2014-03-03 10:46:21,622 WARN  - [1393843581649ms]: java.net.ConnectException: Client connection not opened [org.jppf.client.balancer.ChannelWrapperRemote] [RemoteChannelWrapperdriver2-10--1] {}
2014-03-03 10:46:21,622 WARN  - [1393843581651ms]: [driver2-10 - ClassServer] caught java.io.EOFException: null, will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver2-10 - ClassServer] {}

in nodes logs I see

Code: [Select]
java.lang.NullPointerException
at org.jppf.classloader.AbstractClassLoaderConnection.loadResource(AbstractClassLoaderConnection.java:133)
at org.jppf.classloader.AbstractJPPFClassLoader.computeRemoteData(AbstractJPPFClassLoader.java:207)
at org.jppf.classloader.AbstractJPPFClassLoader.computeCallable(AbstractJPPFClassLoader.java:188)

JPPF: 4.0.1
OS: Win7
Java: 1.7

Could you please help me resolve this issue? Or may be propose another solution to get stable topology. Our goal to have Grid that will provide Full Grid Capacity (all nodes) independent from Driver's or Client's Failures

Thanks in advance,
Emin
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #1 on: March 03, 2014, 07:44:55 PM »

Hi Emin,

The solution you adopted is the correct one. Unfortunately, I'm unable to reproduce the behavior you observe in both scenarios. What would be helpful, would be to get the full stack trace of the NullPointerException in the client log. For this, could you please add the following line in your client's log4j config file:
Code: [Select]
log4j.logger.org.jppf.client=DEBUG
With this, instead of WARNING statements with the exception message, you will get DEBUG statements with the full stack trace.

Thanks,
-Laurent
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #2 on: March 03, 2014, 07:55:13 PM »

Correction: I now reproduce both scenarios, I had "jppf.discovery.enabled = true" in my client configuration and the problem did not occur with discovery enabled.
However, I still don't get the NPE, so it would still be very useful if you could get a full stack trace on your side.

-Laurent
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #3 on: March 03, 2014, 08:20:57 PM »

Ok, I think I found what is causing the problem. This is due to the value of the client configuration property "jppf.reconnect.max.time". This property specifies for how long (in seconds) the client or node will attempt to connect to a server. In my case, this value was 3 seconds, so after 3 s, the client stopped trying to connect to the second driver. So, when killing the first driver and starting the second, the client never connected to the second driver, thus the remaining tasks in the job were never submitted.

By setting "jppf.reconnect.max.time" to a large value, or to a negative value to indicate to never stop trying to connect, I could see that the job was indeed executed on the second driver after I killed the first one.

Could you try to apply this solution and let us know if this works for you?

Thanks,
-Laurent
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #4 on: March 03, 2014, 10:35:24 PM »

Hi Laurent,

Thanks for the quick reply. Will provide DEBUG logs later because do not have an access to grid at the moment.

Sorry that did not mention it before but
- jppf.reconnect.max.time=-1 in my configs for client
- jppf.discovery.enabled=false because we can't use broadcast in our network. I use direct connections, i.e.
Code: [Select]
  jppf.drivers = driver1 driver2
  driver1.jppf.server.host=localhost
  driver1.jppf.server.port=20100
  driver1.jppf.priority=100
  driver2.jppf.server.host=localhost
  driver2.jppf.server.port=20200
  driver2.jppf.priority=50
- I use compute method to execute some code on client during task execution on node. May be it can be the reason because I remember that this NPE on nodes I got during compute method call.

Will try to provide more information tomorrow

Thanks,
Emin
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #5 on: March 04, 2014, 04:08:43 PM »

Hi Laurent,

Thanks for the notice to use DEBUG mode on client side. Found that NPE occurred in TaskResultListener.resultsReceived method  :-[.

So looks like:

- I'm starting Job execution
- On the middle of job execution I stop driver1 and start driver2
- Nodes switch to driver2 successfully
- ClientWithFailover resubmits the Job (only tasks without results)
- Node receives task but during compute method call gets exception

Code: [Select]
2014-03-04 14:49:01,957 DEBUG - [1393944541975ms]: waiting for next request. Serializer = org.jppf.utils.ObjectSerializerImpl@5061e1 (class loader = JPPFClassLoader[id=7, type=server, uuidPath=[], offline=false, classpath=]) [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542015ms]: got bundle JPPFTaskBundle[name=JOB_1, uuid=720CB575-E4E9-4286-B56D-D0AB87E71D96, initialTaskCount=183, taskCount=1, bundleUuid=null, uuidPath=TraversalList[position=0, list=[TestApp_2014-03-04 17:40:44, 90F21B1D-841E-EE6C-FCD0-3BEC4DF80A2B]]] [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542015ms]: bundle task count = 1, handshake = false [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542017ms]: got all data [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542018ms]: received a bundle with 1 tasks [org.jppf.server.node.JPPFNode] [main] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542018ms]: executing 1 tasks [org.jppf.server.node.NodeExecutionManagerImpl] [main] {}
2014-03-04 14:49:02,004 INFO  - [1393944542018ms]: Run calculation for taskId=JOB_1_182 [com.emamedov.dprice.testapp.server.RxTask] [Processing-1] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542018ms]: looking up resource [org.jppf.utils.ObjectSerializerImpl] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542018ms]: definition for resource [org.jppf.utils.ObjectSerializerImpl] : class org.jppf.utils.ObjectSerializerImpl [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542019ms]: class 'org.jppf.utils.ObjectSerializerImpl' loaded by JPPFClassLoader[id=4, type=client, uuidPath=[TestApp_2014-03-04 17:40:44, 3DDFAA04-A07B-71EC-1B1C-3EA6472D04C0], offline=false, classpath=] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-04 14:49:02,004 DEBUG - [1393944542019ms]: JPPFClassLoader[id=4, type=client, uuidPath=[TestApp_2014-03-04 17:40:44, 3DDFAA04-A07B-71EC-1B1C-3EA6472D04C0], offline=false, classpath=] requesting remote computation, requestUuid = 720CB575-E4E9-4286-B56D-D0AB87E71D96 [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-04 14:49:02,004 ERROR - [1393944542019ms]: Unhandled exception during calculation [com.emamedov.dprice.testapp.server.RxTask] [Processing-1] {}
java.lang.NullPointerException
at org.jppf.classloader.AbstractClassLoaderConnection.loadResource(AbstractClassLoaderConnection.java:133)
at org.jppf.classloader.AbstractJPPFClassLoader.computeRemoteData(AbstractJPPFClassLoader.java:207)
at org.jppf.classloader.AbstractJPPFClassLoader.computeCallable(AbstractJPPFClassLoader.java:188)
at org.jppf.node.protocol.AbstractTask.compute(AbstractTask.java:231)
at com.emamedov.dprice.testapp.server.RxTask.run(RxTask.java:44)
at org.jppf.server.node.NodeTaskWrapper.run(NodeTaskWrapper.java:145)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

- I catch this exception on node side and put it in task via setThrowable method
- JPPF Client receives events via TaskResultListener.resultsReceived with Throwables

If I try to resubmit the job manually (rerun the same test) without Grid and Client restarting then I get the same NPEs on nodes.

NPE on client side was in my code because I didn't expect null in task result. I corrected it and now client resubmits the job automatically but receives throwables instead of results.

Could you please help me to resolve this issue?

Thanks in advance,
Emin
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #6 on: March 14, 2014, 06:08:27 PM »

Hi Laurent,

Looks like these exceptions on node sides are expected - they occur because connection with driver has been closed. Could you please help me in next problems?

1. Is it possible to requeue task if compute method failed on node? Can node or Client app somehow requeue the task? As I can see in code it is possible to throw JPPFNodeReconnectionNotification exception from Task.run method and the task bundle will be requeued but may be exists another way.

2. I have JPPF Grid with 1 Client, 1 driver and 10 nodes. I run tests that passed successfully. After that I update Client application (change task implementation) and change Client UUID (for me it is timestamp of the build). But nodes still use old implementation that cause EOFExceptions.

Code: [Select]
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.jppf.comm.socket.AbstractSocketWrapper.readInt(AbstractSocketWrapper.java:258)
at org.jppf.io.SocketWrapperInputSource.readInt(SocketWrapperInputSource.java:89)
at org.jppf.io.IOHelper.readData(IOHelper.java:115)
at org.jppf.io.IOHelper.unwrappedData(IOHelper.java:174)
at org.jppf.classloader.RemoteResourceRequest.run(RemoteResourceRequest.java:77)
at org.jppf.classloader.ClassLoaderRequestHandler$PeriodicTask.run(ClassLoaderRequestHandler.java:160)
at java.lang.Thread.run(Thread.java:744)

Sometimes I see NPE in node's logs in that case. For example, below logs when I submitted job from client "Client2" but in logs you can see that uuidPath=[Client1, 05B0C3F5-1185-85B4-52E5-CDC2A8C621D8]. It was previous client id value.
Code: [Select]
2014-03-14 16:46:43,591 DEBUG - [1394815603614ms]: looking up resource [org.jppf.utils.ObjectSerializerImpl] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-14 16:46:43,591 DEBUG - [1394815603614ms]: definition for resource [org.jppf.utils.ObjectSerializerImpl] : class org.jppf.utils.ObjectSerializerImpl [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-14 16:46:43,591 DEBUG - [1394815603615ms]: class 'org.jppf.utils.ObjectSerializerImpl' loaded by JPPFClassLoader[id=2, type=client, uuidPath=[Client1, 05B0C3F5-1185-85B4-52E5-CDC2A8C621D8], offline=false, classpath=] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-14 16:46:43,591 DEBUG - [1394815603615ms]: JPPFClassLoader[id=2, type=client, uuidPath=[Client1, 05B0C3F5-1185-85B4-52E5-CDC2A8C621D8], offline=false, classpath=] requesting remote computation, requestUuid = DE66F1E1-3A1B-E066-119C-78D2E758031E [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-14 16:46:43,591 ERROR - [1394815603615ms]: Error in calculation for id=TASK_3, produce error response [com.emamedov.server.RxTask] [Processing-1] {}
java.lang.NullPointerException
at org.jppf.classloader.AbstractClassLoaderConnection.loadResource(AbstractClassLoaderConnection.java:133)
at org.jppf.classloader.AbstractJPPFClassLoader.computeRemoteData(AbstractJPPFClassLoader.java:207)
at org.jppf.classloader.AbstractJPPFClassLoader.computeCallable(AbstractJPPFClassLoader.java:188)
at org.jppf.node.protocol.AbstractTask.compute(AbstractTask.java:231)
at com.emamedov.server.RxTask.run(RxTask.java:44)
at org.jppf.server.node.NodeTaskWrapper.run(NodeTaskWrapper.java:145)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Note: serialVersionUID of Task class is always the same

3. When I have 2 clients with different client ids (for example, Client1 and Client2) and Client1 submits the job with tasks that call compute method then compute method sometimes tries to be executed on Client2. In that case in Client2 logs I see

Code: [Select]
2014-03-14 16:54:15,808 WARN  - [1394816055877ms]: [driver1-1 - ClassServer] caught java.lang.IllegalStateException: , will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-1 - ClassServer] {}
2014-03-14 16:54:15,855 WARN  - [1394816055929ms]: [driver1-6 - ClassServer] caught java.lang.IllegalStateException: , will re-initialise ... [org.jppf.client.ClassServerDelegateImpl] [driver1-6 - ClassServer] {}

This issue is not 100% reproducable. You can try to submit job via Client1 and after that the same job via Client2. So I get exceptions in 90%.

Many thanks in advance for any help. It will be really appreciated because I really stuck with these problems.

Emin
« Last Edit: March 14, 2014, 07:39:54 PM by Emin Mamedov »
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #7 on: March 17, 2014, 10:16:42 AM »

Hi,

Regarding problems 2 and 3: figured out that task on node is started under wrong classloader - the first JPPFClassLoader that loaded this Task. Therefore everything that I'm creating via "new" operator is created using this classloader. What do you think about it? Is it a bug?.

At the moment I'm using next workaround for compute method call:

Code: [Select]
  public void run() {
      ...
      ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
      String myCallableClassName = MyCallable.class.getName();
      Class<JPPFCallable<Data>> myCallableClass = (Class<JPPFCallable<Data>>) classLoader.loadClass(myCallableClassName);
      if (myCallableClass == null) {
        logger.error("No class definition found for class [" + myCallableClassName + " in " + classLoader);
        throw new IllegalArgumentException("No class definition found for class [" + myCallableClassName + " in " + classLoader);
      }
      Constructor<JPPFCallable<Data>> constructor = myCallableClass.getConstructor(String.class, String.class);
      final Data data = compute(constructor.newInstance(param1, param2));
      ...
  }

Thanks,
Emin
« Last Edit: March 17, 2014, 10:27:20 AM by Emin Mamedov »
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #8 on: March 17, 2014, 01:48:54 PM »

Hi Emin,

Quote
... task on node is started under wrong classloader - the first JPPFClassLoader that loaded this Task
It would be the expected behavior if the new task is submitted with a JPPFClient with the same uuid. Is this what you are doing? If yes, then what happens is, since you're using the same client uuid, the node will be using the same class loader it already has in cache for this uuid, and the new class will not be reloaded.

Also, your workaround seems to imply that the thread context class loader and the task class loader are different, could you confirm this, for instance by adding the following test at the start of the run() method:

Code: [Select]
ClassLoader ctxCL = Thread.currentThread().getContextClassLoader();
ClassLoader taskCL = getClass().getClassLoader();
boolean b = (ctxCl == taskCL);
System.out.println("class loaders equal = " + b + ", ctxCL=" + ctxCL + ", taskCL=" + taskCL);

What is printed when you run this code?

As a reminder, there are 2 factors which determine when a node will create a new class loader for the tasks:
- the client UUID
- the node class loader cache size, which defaults to 50: when the cache is full and it doesn't contain a class loader for the uuid of the client that submitted the job, then the node will evict the least recent class loader from the cache and create a new one.

Sincerely,
-Laurent
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #9 on: March 17, 2014, 02:53:00 PM »

Hi Laurent,

Attached TRACE logs from JPPF node.

- At 3rd line you can see that expected JPPFClassLoader is used. I submitted job from client with UUID=Marina-0301_2014-03-14 17:53:06 so it looks fine
- "Run calculation for taskId=TASK_1" - here my task starts. Next line in my Task goes compute method call.
- 3 lines below in log you can see that classloader has changed to "Marina-0201_2014-03-14 17:53:06"'s JPPFClassLoader

Code: [Select]
2014-03-17 05:39:12,701 TRACE - [1395034752820ms]: unwrapping from network connection, serialized size=3173 : object=CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=org/jppf/utils/ObjectSerializerImpl.class, state=NODE_RESPONSE, uuidPath=TraversalList[position=0, list=[[b]Marina-0301_2014-03-14 17:53:06[/b], 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=-1]]] [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752821ms]: got response CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=org/jppf/utils/ObjectSerializerImpl.class, state=NODE_RESPONSE, uuidPath=TraversalList[position=0, list=[Marina-0301_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=-1]]] [org.jppf.classloader.ClassLoaderRequestHandler] [PeriodicTask] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752821ms]: JPPFClassLoader[id=3, type=client, uuidPath=[Marina-0301_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=] remote definition for resource [null] found [org.jppf.classloader.AbstractJPPFClassLoaderLifeCycle] [Processing-1] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752821ms]: found definition for resource [org.jppf.utils.ObjectSerializerImpl, definitionLength=2781] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752822ms]: definition for resource [org.jppf.utils.ObjectSerializerImpl] : class org.jppf.utils.ObjectSerializerImpl [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752822ms]: class 'org.jppf.utils.ObjectSerializerImpl' loaded by JPPFClassLoader[id=3, type=client, uuidPath=[Marina-0301_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,701 TRACE - [1395034752823ms]: unwrapping MultipleBuffersLocation[size=1, count=1, currentBuffer=org.jppf.utils.JPPFBuffer@a3f2da, currentBufferIndex=0, transferring=false, list=[org.jppf.utils.JPPFBuffer@a3f2da]] [org.jppf.io.IOHelper] [Processing-1] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752823ms]: deserializing object index = 1 [org.jppf.server.node.JPPFContainer] [Processing-1] {}
2014-03-17 05:39:12,701 TRACE - [1395034752823ms]: unwrapping MultipleBuffersLocation[size=52, count=52, currentBuffer=org.jppf.utils.JPPFBuffer@19cec7e, currentBufferIndex=0, transferring=false, list=[org.jppf.utils.JPPFBuffer@19cec7e]] [org.jppf.io.IOHelper] [Processing-1] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752824ms]: got all data [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752824ms]: received a bundle with 1 tasks [org.jppf.server.node.JPPFNode] [main] {}
2014-03-17 05:39:12,701 DEBUG - [1395034752824ms]: executing 1 tasks [org.jppf.server.node.NodeExecutionManagerImpl] [main] {}
2014-03-17 05:39:12,716 TRACE - [1395034752825ms]: NodeTaskWrapper[task=com.emamedov.marina.server.RxTask@16a6e0e, cancelled=false, callOnCancel=false, timeout=false, started=false] [org.jppf.server.node.NodeTaskWrapper] [Processing-1] {}
2014-03-17 05:39:12,716 INFO  - [1395034752825ms]: Run calculation for taskId=TASK_1 [com.emamedov.marina.server.Task] [Processing-1] {}
2014-03-17 05:39:12,716 DEBUG - [1395034752825ms]: looking up resource [org.jppf.utils.ObjectSerializerImpl] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,716 DEBUG - [1395034752825ms]: definition for resource [org.jppf.utils.ObjectSerializerImpl] : class org.jppf.utils.ObjectSerializerImpl [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,716 DEBUG - [1395034752825ms]: class 'org.jppf.utils.ObjectSerializerImpl' loaded by JPPFClassLoader[id=2, type=client, uuidPath=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=] [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,716 DEBUG - [1395034752826ms]: JPPFClassLoader[id=2, type=client, uuidPath=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=] requesting remote computation, requestUuid = 5FA62AC3-BC9D-56A0-83F9-85C1DE81141E [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,716 DEBUG - [1395034752826ms]: sending batch of 1 class loading requests: CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=callable, state=NODE_REQUEST, uuidPath=TraversalList[position=1, list=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=7]]] [org.jppf.classloader.ClassLoaderRequestHandler] [PeriodicTask] {}
2014-03-17 05:39:12,716 TRACE - [1395034752826ms]: serializing object CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=callable, state=NODE_REQUEST, uuidPath=TraversalList[position=1, list=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=7]]] [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,716 TRACE - [1395034752826ms]: serializing object to memory CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=callable, state=NODE_REQUEST, uuidPath=TraversalList[position=1, list=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=7]]] [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,716 TRACE - [1395034752828ms]: creating new buffer with size=32768 for MultipleBuffersOutputStream[defaultLength=32768, totalSize=0, nbBuffers=0, currentBuffer=null] [org.jppf.utils.streams.MultipleBuffersOutputStream] [PeriodicTask] {}
2014-03-17 05:39:12,716 TRACE - [1395034752828ms]: sending object with serialized size=420 : CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=callable, state=NODE_REQUEST, uuidPath=TraversalList[position=1, list=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=7]]] [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,732 TRACE - [1395034752854ms]: read data size = 423 [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,732 TRACE - [1395034752855ms]: free mem / requested size / footprint : 186,553,656 / 423 / 0 [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,732 TRACE - [1395034752855ms]: unwrapping MultipleBuffersLocation[size=423, count=423, currentBuffer=org.jppf.utils.JPPFBuffer@d2e13a, currentBufferIndex=0, transferring=false, list=[org.jppf.utils.JPPFBuffer@d2e13a]] [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,747 TRACE - [1395034752858ms]: unwrapping from network connection, serialized size=423 : object=CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=callable, state=NODE_RESPONSE_ERROR, uuidPath=TraversalList[position=0, list=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=7]]] [org.jppf.io.IOHelper] [PeriodicTask] {}
2014-03-17 05:39:12,747 DEBUG - [1395034752858ms]: got response CompositeResourceWrapper[resources=[JPPFResourceWrapper[dynamic=true, name=callable, state=NODE_RESPONSE_ERROR, uuidPath=TraversalList[position=0, list=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]], callableID=7]]] [org.jppf.classloader.ClassLoaderRequestHandler] [PeriodicTask] {}
2014-03-17 05:39:12,747 DEBUG - [1395034752858ms]: JPPFClassLoader[id=2, type=client, uuidPath=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=] remote definition for callable resource not found [org.jppf.classloader.AbstractJPPFClassLoader] [Processing-1] {}
2014-03-17 05:39:12,747 INFO  - [1395034752859ms]: No data received for taskId=TASK_1 [com.emamedov.marina.server.RxTask] [Processing-1] {}
2014-03-17 05:39:12,747 ERROR - [1395034752860ms]: Error in calculation for id=TASK_1, produce error response [com.emamedov.marina.server.RxTask] [Processing-1] {}
com.emamedov.ServiceException: No data received for taskId=TASK_1
at com.emamedov.marina.server.RxTask.run(RxTask.java:47)
at org.jppf.server.node.NodeTaskWrapper.run(NodeTaskWrapper.java:145)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-03-17 05:39:12,747 DEBUG - [1395034752861ms]: Task Timeout Timer : cancelling action for key=java.util.concurrent.FutureTask@2088cc, future=null [org.jppf.scheduling.JPPFScheduleHandler] [Processing-1] {}
2014-03-17 05:39:12,747 DEBUG - [1395034752861ms]: node tasks counter reset to 7 requested [org.jppf.management.JPPFNodeAdmin] [main] {}
2014-03-17 05:39:12,747 DEBUG - [1395034752861ms]: processing 1 task results for job 'TASK' [org.jppf.server.node.JPPFNode] [main] {}
2014-03-17 05:39:12,747 DEBUG - [1395034752861ms]: writing results for JPPFTaskBundle[name=TASK, uuid=E778ADF5-EF0E-5E5E-9DB4-59CFF9082FAA, initialTaskCount=4, taskCount=1, bundleUuid=null, uuidPath=TraversalList[position=0, list=[Marina-0301_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]]] [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-17 05:39:12,747 TRACE - [1395034752862ms]: before serialization of object at position -1 [org.jppf.server.node.AbstractNodeIO] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752862ms]: serializing object JPPFTaskBundle[name=TASK, uuid=E778ADF5-EF0E-5E5E-9DB4-59CFF9082FAA, initialTaskCount=4, taskCount=1, bundleUuid=null, uuidPath=TraversalList[position=0, list=[Marina-0301_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]]] [org.jppf.io.IOHelper] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752862ms]: serializing object to memory JPPFTaskBundle[name=TASK, uuid=E778ADF5-EF0E-5E5E-9DB4-59CFF9082FAA, initialTaskCount=4, taskCount=1, bundleUuid=null, uuidPath=TraversalList[position=0, list=[Marina-0301_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB]]] [org.jppf.io.IOHelper] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752863ms]: creating new buffer with size=32768 for MultipleBuffersOutputStream[defaultLength=32768, totalSize=0, nbBuffers=0, currentBuffer=null] [org.jppf.utils.streams.MultipleBuffersOutputStream] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752863ms]: serialized object at position -1, size = 319 [org.jppf.server.node.AbstractNodeIO] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752864ms]: before serialization of object at position 1 [org.jppf.server.node.AbstractNodeIO] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752864ms]: writing header with size = 319 [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-17 05:39:12,747 TRACE - [1395034752864ms]: serializing object com.emamedov.marina.server.RxTask@16a6e0e [org.jppf.io.IOHelper] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752864ms]: serializing object to memory com.emamedov.marina.server.RxTask@16a6e0e [org.jppf.io.IOHelper] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752866ms]: creating new buffer with size=32768 for MultipleBuffersOutputStream[defaultLength=32768, totalSize=0, nbBuffers=0, currentBuffer=null] [org.jppf.utils.streams.MultipleBuffersOutputStream] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752866ms]: serialized object at position 1, size = 116 [org.jppf.server.node.AbstractNodeIO] [Processing-1] {}
2014-03-17 05:39:12,747 TRACE - [1395034752866ms]: writing task[1] with size = 116 [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-17 05:39:12,747 DEBUG - [1395034752866ms]: wrote full results [org.jppf.server.node.remote.RemoteNodeIO] [main] {}
2014-03-17 05:39:12,747 DEBUG - [1395034752867ms]: waiting for next request. Serializer = org.jppf.utils.ObjectSerializerImpl@3ec4df (class loader = JPPFClassLoader[id=1, type=server, uuidPath=[], offline=false, classpath=]) [org.jppf.server.node.remote.RemoteNodeIO] [main] {}

Classloaders:

Code: [Select]
Thread.currentThread().getContextClassLoader() = JPPFClassLoader[id=3, type=client, uuidPath=[Marina-0301_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=]
getClass().getClassLoader() = JPPFClassLoader[id=2, type=client, uuidPath=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=]
myCallable.getClass().getClassLoader() = JPPFClassLoader[id=2, type=client, uuidPath=[Marina-0201_2014-03-14 17:53:06, 1D7CE615-86D6-CF82-B7F2-816159C362DB], offline=false, classpath=]

UPD: Forgot to mention. I didn't specify classloader's cache size so it is 50 in my case. And I have 2 or 3 clients only.

Thanks,
Emin
« Last Edit: March 17, 2014, 03:06:03 PM by Emin Mamedov »
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #10 on: March 18, 2014, 08:42:24 PM »

Hi Emin,

I see from your traces that soemthing is not working properly, but unfortunately I have not been able to rerpduce the behavior you describe.
Is there any way you can provide a code sample which reproduces, along along with the steps to reproduce the problem?
If it's more convenient for you, you may use the attached code as a starting point. That's the code I'm using to try and reproduce the problem.

Thanks,
-Laurent
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #11 on: March 18, 2014, 09:09:26 PM »

I apologize, I forgot to attach the code  :(

You should find it now.

-Laurent
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #12 on: March 22, 2014, 02:43:00 PM »

Hi Laurent,

Sorry for late reply. Didn't have a chance to test your code.

So, it is a magic :), but your test works fine. For jobs from Client-1 classloader Client-1 is used, for jobs from Client-2 classloader Client-2 is used.

BUT when I changed client id format to "Marina-" + (n + 1) + "_2010-11-11 12:13:12.000" test failed. It is because of space character in id. As soon as I removed space character from id test has been passed successfully.

Thanks,
Emin
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #13 on: March 23, 2014, 10:40:56 AM »

Hi Emin,

Thank you very much for your feedback.
I was unable to reproduce the issue, even with spaces in the client uuid, even using the "Marina" format. I believe that the fact that it is working for you without space is purely circumstantial.

Inspired by the results of your investigation, I have, however, uncovered a bug in the way the server fowards class loading requests to the clients: JPPF-237 Client raises IllegalStateException when using multiple JPPFClient with the same uuid. I invite you to read the description for all the details. I also believe that it is the problem you are facing. I am working on a fix and will keep you updated on my progress. As soon as a fix is implemented, I will release it as part of JPPF 4.0.2.

Currently, the only workaround that is guaranteed to work is to ensure that each JPPFClient instance has a unique uuid.

Sincerely,
-Laurent
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #14 on: March 23, 2014, 12:28:49 PM »

Hi Laurent,

Sorry that annoying you. You were right - it is not whitespace issue. Looks like I reproduced this issue using next scenario (you can find everything in attached logs):

- Start 1 driver and 1 node via direct connection (jppf.discovery.enabled=false)
- Run your MyRunnable with ClientIds "Client-" + (n+1)
- Do NOT restart grid. Change ClientIds to "Marina-" + (n + 1) + "_2010-11-11 12:13:12.000"
- Rerun your MyRunnable again.

Hope that it helps to reproduce issue.

Many thanks for your time,
Emin
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #15 on: March 23, 2014, 06:26:19 PM »

Hi Emin,

From the log you sent, I can see that your problem is not related to JPPF-237  :(
Still, I cannot reproduce it on my side, even following exactly the steps you provided.

I can also see that your node does a lot more than just running a standard JPPF node. For instance, there is a ton of Spring stuff going on before the node actually starts connecting to the server.
So my first question is: have you customized the node in any way, for instance with an initialization hook, or by any other means? Are there any details you can provide on this, so we can understand what effect this may have on JPPF class loading?

I also looked at the classpath of the main node class loader (the one that loads the main application class), and I found out that:
- there's a lot of jars in there, and the list of jars is actually duplicated: there are thus twice more jars than needed
- the classpath contains "jppf-common-4.0.1.jar" and "jppf-server-4.0.1.jar", which are normally deployed on the the server side. Do you absolutely need these 2 jars, and could you try without them being in the node's classpath?
- I do not know what is in some of the other jars, do they contain classes from your application that are also in the client's classpath? I'm suspecting this could be the main cause of the tasks being loaded by the wrong class loader.

The log also shows that some code is trying to connect to the node's JMX remote connector, before it is started. I'm guessing that somewhere in your code you have something like:
Code: [Select]
JMXNodeConnectionWrapper jmx = new JMXNodeConnectionWrapper("localhost", 22000, false);
jmx.connect();
We can see a number of connection attemps failing with ConnectException, until the JMX remote connector is started, at which point the connection is established successfully. I would like to suggest a more efficient way to connect to the local (same JVM) MBean server, using the no-args constructor:
Code: [Select]
JMXNodeConnectionWrapper jmx = new JMXNodeConnectionWrapper();
jmx.connect();
since it's in the same JVM, there's no need for a network connection, and this is much more resource-efficient.

Sincerely,
-Laurent
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #16 on: March 23, 2014, 09:41:14 PM »

Hi Laurent,

I can also see that your node does a lot more than just running a standard JPPF node. For instance, there is a ton of Spring stuff going on before the node actually starts connecting to the server.
So my first question is: have you customized the node in any way, for instance with an initialization hook, or by any other means? Are there any details you can provide on this, so we can understand what effect this may have on JPPF class loading?

Yes, we have InitializationHook to start Spring context. Sorry that didn't mention it before - didn't know that it can somehow affect classloading. In our tasks we use native code (written on C++) that requires initialization. To initialize it in all our other applications (not grid style) we use Spring. We have common JMX beans also. They are created via Spring context also. So we decided to use Spring on nodes to use the same style as in any other our applications. Below you can find source code of this hook.

Code: [Select]
package com.emamedov.dprice.pricinggrid.node;

import org.apache.log4j.Logger;
import org.jppf.node.initialization.InitializationHook;
import org.jppf.utils.UnmodifiableTypedProperties;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class OnNodeStartUp implements InitializationHook {

  private static final Logger logger = Logger.getLogger(OnNodeStartUp.class);

  public static ApplicationContext applicationContext;

  @Override
  public void initializing(UnmodifiableTypedProperties initialConfiguration) {
    try {
      if (applicationContext == null) {
        applicationContext = new ClassPathXmlApplicationContext(
                                                                "classpath:com/emamedov/dprice/pricinggrid/node/node-context.xml");
      }
    } catch (RuntimeException e) {
      logger.error(e, e);
      throw e;
    }
    logger.debug("Node has been successfully initialized");
  }

}

- there's a lot of jars in there, and the list of jars is actually duplicated: there are thus twice more jars than needed

Looks like it is because I added jars in NodeLauncher's classpath and specified it in jppf.jvm.options attribute. Thanks for the notice.

- the classpath contains "jppf-common-4.0.1.jar" and "jppf-server-4.0.1.jar", which are normally deployed on the the server side. Do you absolutely need these 2 jars, and could you try without them being in the node's classpath?

We created one distribution package that can be used both for server and for node. It contains all required jars. After that on box we start as many nodes as required and as many servers as required (from the same folder). Do you think that existance of jppf-common and jppf-server can be cause of the issue?

- I do not know what is in some of the other jars, do they contain classes from your application that are also in the client's classpath? I'm suspecting this could be the main cause of the tasks being loaded by the wrong class loader.

Yes, we have some jars in node classpath that exist in client classpath. But for example in your MyCallable and MyTask implementations you do not use them. And I didn't put MyCallable and MyTask into node's local classpath.

The log also shows that some code is trying to connect to the node's JMX remote connector, before it is started. I'm guessing that somewhere in your code you have something like:
Code: [Select]
JMXNodeConnectionWrapper jmx = new JMXNodeConnectionWrapper("localhost", 22000, false);
jmx.connect();
We can see a number of connection attemps failing with ConnectException, until the JMX remote connector is started, at which point the connection is established successfully. I would like to suggest a more efficient way to connect to the local (same JVM) MBean server, using the no-args constructor:
Code: [Select]
JMXNodeConnectionWrapper jmx = new JMXNodeConnectionWrapper();
jmx.connect();
since it's in the same JVM, there's no need for a network connection, and this is much more resource-efficient.

Yes, we have a tool for starting and stopping nodes and servers. So I implemented a wrapper over ProcessLauncher that starts node (like NodeLaucher does it) and connects via JMX to this node. So we can send start/stop signals via our tool and node will be started/stopped. Is it ok?
I didn't know that we can use no-args constructor in this case. Many thanks for the notice.

Thanks,
Emin
« Last Edit: March 23, 2014, 09:46:58 PM by Emin Mamedov »
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #17 on: March 27, 2014, 04:23:58 PM »

Hi Emin,

Quote
Do you think that existance of jppf-common and jppf-server can be cause of the issue?
Definitely not, even my tests prove it.

Unfortunately, I'm still unable to reproduce this clas loader problem, despite all my efforts at imagining and implementing scenarios.
What I would like to propose is that we add, in the JPPF code, as many trace-level logging statements as possible that will help identify where and how the problem occurs. We would then deliver this instrumented code as part of 4.0.2 (over the next week-end) and continue the tests and investigation from there. Is this something you would be willing to pursue?

Thanks,
-Laurent
Logged

Emin Mamedov

  • JPPF Grand Master
  • ****
  • Posts: 69
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #18 on: March 27, 2014, 04:43:35 PM »

Hi Laurent,

Sure. As I told you previously I do not use compute method anymore and implemented DataProvider implementation to retrieve data direct from client. So this classloader issue is not relevant for me anymore. But I will be happy to help you to figure out the reason.

Is it possible to add JPPF-233 into patch release also? I would like to test this feature also

Thanks,
Emin
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: ClientWithFailover doesn't reconnect to the failed driver
« Reply #19 on: March 27, 2014, 05:07:58 PM »

Hi Emin,

JPPF-233 will not be part of v4.0.2 but it will be in the upcoming v4.1 beta that I will also release this week-end. I hope that's ok with you. As a reminder, the release date for v4.1 is April 15th, so not too far away.

Sincerely,
-Laurent
Logged
Pages: [1]   Go Up
 
JPPF Powered by SMF 2.0 RC5 | SMF © 2006–2011, Simple Machines LLC Get JPPF at SourceForge.net. Fast, secure and Free Open Source software downloads