JPPF, java, parallel computing, distributed computing, grid computing, parallel, distributed, cluster, grid, cloud, open source, android, .net
JPPF

The open source
grid computing
solution

 Home   About   Features   Download   Documentation   On Github   Forums 
February 05, 2023, 08:04:35 AM *
Welcome,
Please login or register.

Login with username, password and session length
Advanced search  
News: New users, please read this message. Thank you!
  Home Help Search Login Register  
Pages: [1]   Go Down

Author Topic: unequal task distribution  (Read 3057 times)

broiyan

  • JPPF Grand Master
  • ****
  • Posts: 54
unequal task distribution
« on: February 15, 2016, 01:06:29 AM »

The following configuration is presented as differences between version 5.1.2 and my deployment. (There is 1 other difference not shown: I automatically edit the node config to use the actual driver IP instead of localhost.)

Quote
application differences next press the Enter key
39c39
< driver1.jppf.server.port = 11111
---
> driver1.jppf.server.port = 11443
42c42
< driver1.jppf.ssl.enabled = false
---
> driver1.jppf.ssl.enabled = true
87c87
< #jppf.ssl.enabled = true
---
> jppf.ssl.enabled = true
press Enter
57c57
< jppf.ssl.keystore.file = config/ssl/keystore.ks
---
> jppf.ssl.keystore.file = config/ssl/server0KeyStore
85c85
< #jppf.ssl.truststore.file = config/ssl/truststore.ks
---
> jppf.ssl.truststore.file = config/ssl/server0TrustStore
89c89
< #jppf.ssl.truststore.password = password
---
> jppf.ssl.truststore.password = password
.....................................................
driver differences next press the Enter key
25c25,26
< jppf.server.port = 11111
---
> #jppf.server.port = 11111
> jppf.server.port = -1
41c42
< #jppf.ssl.configuration.file = config/ssl/ssl-server.properties
---
> jppf.ssl.configuration.file = config/ssl/ssl-server.properties
127c128
< jppf.load.balancing.algorithm = proportional
---
> jppf.load.balancing.algorithm = nodethreads
130c131
< jppf.load.balancing.profile = proportional_profile
---
> jppf.load.balancing.profile = nodethreads_profile
press Enter
55c55
< jppf.ssl.keystore.file = config/ssl/keystore.ks
---
> jppf.ssl.keystore.file = config/ssl/server0KeyStore
83c83
< jppf.ssl.truststore.file = config/ssl/truststore.ks
---
> jppf.ssl.truststore.file = config/ssl/server0TrustStore
.....................................................
node differences next press the Enter key
25c25
< #jppf.server.host = localhost
---
> jppf.server.host = localhost
28c28
< #jppf.server.port = 11111
---
> jppf.server.port = 11443
51c51
< #jppf.ssl.enabled = true
---
> jppf.ssl.enabled = true
54c54
< #jppf.ssl.configuration.file = config/ssl/ssl.properties
---
> jppf.ssl.configuration.file = config/ssl/ssl.properties
press Enter
57c57
< jppf.ssl.keystore.file = config/ssl/keystore.ks
---
> jppf.ssl.keystore.file = config/ssl/nodeXKeyStore
85c85
< #jppf.ssl.truststore.file = config/ssl/truststore.ks
---
> jppf.ssl.truststore.file = config/ssl/nodeXTrustStore
89c89
< #jppf.ssl.truststore.password = password
---
> jppf.ssl.truststore.password = password
.....................................................

I have 5 VMs. I started 4 nodes on 4 VMs. I started 1 driver and 1 application which are on 1 VM (the same VM). The application starts 24 tasks and the nodes have received:

node0, 1 task
node1, 1 task
node2, 4 tasks
node3, 0 tasks

The node VMs have these capabilities:

node0, 1 core
node1, 1 core
node2, 2 cores
node3, 4 cores

This is a waste of the 4 core machine. Please how can I get a more equal distribution? Node3 is new so I rebooted all 5 VMs (1 driver/app and 4 nodes) to see if that would improve the situation. Rebooting the 5 VMs resulted in no change to the distribution.

Edit: All tasks take several hours. Eventually all 4 nodes report: Write failed: Broken pipe
« Last Edit: February 15, 2016, 08:28:46 PM by broiyan »
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: unequal task distribution
« Reply #1 on: February 15, 2016, 08:29:03 AM »

Hello,

This is definitely a bug and I registered JPPF-440 Nodethreads load-balancing algorithm not working as expected.
As soon as I have a fix, I will deliver it as a patch for JPPF 5.1.2. I will keep you apprised of my progress in this discussion thread.

Thank you very much for reporting this issue.

-Laurent
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: unequal task distribution
« Reply #2 on: February 15, 2016, 08:21:26 PM »

Hello,

In fact, the bug I registered only occurs in specific conditions (see the comment I added), which I believe do not apply to your environment and context. Due to this, I will need more information to track down the cause of your problem.

First, how do you determine the number of tasks executed by each node? It could be by reading their console output, using the JPPF admin console, any other means? If you haven't done so and if applicable, I suggest you monitor the server and nodes with the administration console (its SSL configuration is exactly the same as for a client). You probably need to run it remotely and use a public IP address for your server
In the numbers you posted I see only 1 + 1 + 4 + 0 = 6 tasks executed, what happened to the other 18 tasks ?
Also the fact that node3 executed 0 task could indicate that there is a problem with this node. Could you check its log file and console output? Could you also do the same for the driver?

Thanks for your time,
-Laurent
Logged

broiyan

  • JPPF Grand Master
  • ****
  • Posts: 54
Re: unequal task distribution
« Reply #3 on: February 15, 2016, 08:33:27 PM »

The nodes report "hello" to the console when a task starts. The tasks potentially take many hours because sometimes cloud servers are slow, however on a dedicated machine it should only be a few minutes for each task or perhaps a few minutes for all the tasks. There are probably still 18 tasks waiting because the total of "hello" messages is 6.

Node3 is a new instance of ubuntu so it is unlikely that there is anything wrong with it. This log file from node0 shows that after several hours the connection was lost.

Quote
2016-02-14 18:52:48,149 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-14 18:52:48,185 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-14 18:52:48,186 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-14 18:52:48,192 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting node with PID=3779, UUID=D68F26F6-F9A8-245E-E36E-16FE6254C0AD
2016-02-14 18:52:48,192 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-14 18:52:54,797 [INFO ][org.jppf.classloader.ClassLoaderRequestHandler.run(155)]: maxBatchSize = 1
2016-02-14 18:52:54,914 [INFO ][org.jppf.execute.AbstractExecutionManager.<init>(114)]: running 1 processing thread
2016-02-14 18:52:54,915 [INFO ][org.jppf.execute.AbstractExecutionManager.createThreadManager(140)]: Using default thread manager
2016-02-15 05:04:11,570 [ERROR][org.jppf.server.node.JPPFNode.run(142)]: Connection reset
java.net.SocketException: Connection reset
   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
   at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
   at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
   at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:876)
   at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:847)
   at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
   at java.io.DataOutputStream.write(DataOutputStream.java:107)
   at org.jppf.comm.socket.AbstractSocketWrapper.write(AbstractSocketWrapper.java:149)
   at org.jppf.io.SocketWrapperOutputDestination.write(SocketWrapperOutputDestination.java:54)
   at org.jppf.io.MultipleBuffersLocation.blockingTransferTo(MultipleBuffersLocation.java:214)
   at org.jppf.io.MultipleBuffersLocation.transferTo(MultipleBuffersLocation.java:192)
   at org.jppf.io.IOHelper.writeData(IOHelper.java:126)
   at org.jppf.server.node.remote.RemoteNodeIO.sendResults(RemoteNodeIO.java:130)
   at org.jppf.server.node.AbstractNodeIO.writeResults(AbstractNodeIO.java:155)
   at org.jppf.server.node.JPPFNode.processResults(JPPFNode.java:261)
   at org.jppf.server.node.JPPFNode.processNextJob(JPPFNode.java:208)
   at org.jppf.server.node.JPPFNode.perform(JPPFNode.java:175)
   at org.jppf.server.node.JPPFNode.run(JPPFNode.java:136)
   at org.jppf.node.NodeRunner.main(NodeRunner.java:140)
2016-02-15 05:04:11,582 [WARN ][org.jppf.classloader.AbstractClassLoaderConnection.sendCloseChannelCommand(105)]: error sending close channel command : java.io.EOFException: null

The log file from node3, the VM that did not receive a task:

Quote
2016-02-14 18:54:33,055 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-14 18:54:33,070 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-14 18:54:33,071 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-14 18:54:33,071 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting node with PID=3853, UUID=90BA30A7-1899-0C80-1122-8FFC99354844
2016-02-14 18:54:33,071 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-14 18:54:38,694 [INFO ][org.jppf.classloader.ClassLoaderRequestHandler.run(155)]: maxBatchSize = 1
2016-02-14 18:54:38,747 [INFO ][org.jppf.execute.AbstractExecutionManager.<init>(114)]: running 4 processing threads
2016-02-14 18:54:38,748 [INFO ][org.jppf.execute.AbstractExecutionManager.createThreadManager(140)]: Using default thread manager
2016-02-15 05:04:11,572 [ERROR][org.jppf.server.node.remote.RemoteNodeIO.deserializeObjects(96)]: Exception occurred while deserializing the tasks
java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:392)
   at org.jppf.comm.socket.AbstractSocketWrapper.readInt(AbstractSocketWrapper.java:266)
   at org.jppf.io.SocketWrapperInputSource.readInt(SocketWrapperInputSource.java:85)
   at org.jppf.io.IOHelper.readData(IOHelper.java:110)
   at org.jppf.server.node.remote.JPPFRemoteContainer.deserializeObjects(JPPFRemoteContainer.java:78)
   at org.jppf.server.node.remote.RemoteNodeIO.deserializeObjects(RemoteNodeIO.java:89)
   at org.jppf.server.node.remote.RemoteNodeIO.deserializeObjects(RemoteNodeIO.java:69)
   at org.jppf.server.node.AbstractNodeIO.readObjects(AbstractNodeIO.java:107)
   at org.jppf.server.node.AbstractNodeIO.readTask(AbstractNodeIO.java:81)
   at org.jppf.server.node.JPPFNode.processNextJob(JPPFNode.java:190)
   at org.jppf.server.node.JPPFNode.perform(JPPFNode.java:175)
   at org.jppf.server.node.JPPFNode.run(JPPFNode.java:136)
   at org.jppf.node.NodeRunner.main(NodeRunner.java:140)
2016-02-15 05:04:11,580 [ERROR][org.jppf.server.node.JPPFNode.run(142)]: Connection closed by remote host
java.net.SocketException: Connection closed by remote host
   at sun.security.ssl.SSLSocketImpl.checkWrite(SSLSocketImpl.java:1555)
   at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:71)
   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
   at java.io.DataOutputStream.flush(DataOutputStream.java:123)
   at org.jppf.comm.socket.AbstractSocketWrapper.flush(AbstractSocketWrapper.java:176)
   at org.jppf.comm.socket.AbstractSocketWrapper.writeInt(AbstractSocketWrapper.java:164)
   at org.jppf.io.SocketWrapperOutputDestination.writeInt(SocketWrapperOutputDestination.java:85)
   at org.jppf.io.IOHelper.writeData(IOHelper.java:125)
   at org.jppf.server.node.remote.RemoteNodeIO.sendResults(RemoteNodeIO.java:130)
   at org.jppf.server.node.AbstractNodeIO.writeResults(AbstractNodeIO.java:155)
   at org.jppf.server.node.JPPFNode.processResults(JPPFNode.java:261)
   at org.jppf.server.node.JPPFNode.processNextJob(JPPFNode.java:208)
   at org.jppf.server.node.JPPFNode.perform(JPPFNode.java:175)
   at org.jppf.server.node.JPPFNode.run(JPPFNode.java:136)
   at org.jppf.node.NodeRunner.main(NodeRunner.java:140)
2016-02-15 05:04:11,584 [WARN ][org.jppf.classloader.AbstractClassLoaderConnection.sendCloseChannelCommand(105)]: error sending close channel command : java.io.EOFException: null

Driver:

Quote
2016-02-14 18:52:15,652 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-14 18:52:15,661 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-14 18:52:15,661 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-14 18:52:15,661 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting driver with PID=3753, UUID=942D45E9-4B3D-CAEE-4EA5-41A7434936C6
2016-02-14 18:52:15,661 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-14 18:52:15,854 [INFO ][org.jppf.nio.NioConstants.getCheckConnection(80)]: NIO checks are enabled
2016-02-14 18:52:15,858 [INFO ][org.jppf.nio.StateTransitionManager.initExecutor(288)]: globalExecutor=java.util.concurrent.ThreadPoolExecutor@ea4a92b[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], maxSize=2

Application:

Quote
2016-02-14 18:55:54,039 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-14 18:55:54,051 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-14 18:55:54,051 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-14 18:55:54,051 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting client with PID=3874, UUID=4B6EB3CB-0FD4-A1E2-7561-D0114BCC3E27
2016-02-14 18:55:54,052 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-14 18:55:54,320 [INFO ][org.jppf.client.AbstractGenericClient.newConnection(281)]: connection [jppf_discovery-1-1] created
2016-02-14 18:55:54,340 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: jppf_discovery-1-1 - ClassServer] Attempting connection to the class server at li1025-39.members.linode.com:11443
2016-02-14 18:55:54,363 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(84)]: [client: jppf_discovery-1-1 - ClassServer] Reconnected to the class server
2016-02-14 18:55:54,367 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(78)]: [client: jppf_discovery-1-1 - TasksServer] Attempting connection to the task server at li1025-39.members.linode.com:11443
2016-02-14 18:55:54,910 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(91)]: [client: jppf_discovery-1-1 - TasksServer] Reconnected to the JPPF task server
2016-02-14 18:55:54,914 [INFO ][org.jppf.client.balancer.queue.TaskQueueChecker.dispatchJobToChannel(299)]: dispatching 1000000 tasks to remote channel
2016-02-15 05:04:11,567 [ERROR][org.jppf.client.BaseJPPFClientConnection.receiveBundleAndResults(277)]: java.io.EOFException: null
2016-02-15 05:04:11,568 [WARN ][org.jppf.client.ClassServerDelegateImpl.run(108)]: [jppf_discovery-1-1 - ClassServer] caught java.io.EOFException: null, will re-initialise ...
2016-02-15 05:04:11,570 [WARN ][org.jppf.client.balancer.ChannelWrapperRemote.run(239)]: java.io.EOFException: null
2016-02-15 05:04:11,571 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: jppf_discovery-1-1 - ClassServer] Attempting connection to the class server at li1025-39.members.linode.com:11443
2016-02-15 05:04:11,575 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(78)]: [client: jppf_discovery-1-1 - TasksServer] Attempting connection to the task server at li1025-39.members.linode.com:11443
« Last Edit: February 16, 2016, 01:54:44 AM by broiyan »
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: unequal task distribution
« Reply #4 on: February 15, 2016, 08:57:10 PM »

Thanks for the information.
For the sake of figuring out this issue, would it be possible for you to test with simple tasks that just sleep for a few seconds? For load-balancer testing purposes, a few seconds are more than enough. This way, you don't have to wait for hours to get the outcome of a test.

For example:

Code: [Select]
import java.util.concurrent.atomic.AtomicInteger;
import org.jppf.node.protocol.AbstractTask;

public class MyTask extends AbstractTask<String> {
  private static final AtomicInteger instanceCount = new AtomicInteger(0);
  private final int taskNumber = instanceCount.incrementAndGet();

  @Override public void run() {
    System.out.println("starting task #" + taskNumber);
    try {
      Thread.sleep(5000L); // sleep for 5s
      setResult("success");
    } catch (Exception e) {
      setThrowable(e);
      e.printStackTrace();
    }
  }
}

Could you try with instances of this class instead of your own tasks and let us of the outcome?

Thanks,
-Laurent
Logged

broiyan

  • JPPF Grand Master
  • ****
  • Posts: 54
Re: unequal task distribution
« Reply #5 on: February 16, 2016, 12:18:20 AM »

Ok. These are the results from the test:
Quote
node0, 1 core:
starting task #5
starting task #16
starting task #24

node1, 1 core:
starting task #8
starting task #11
starting task #19

node2, 2 core:
starting task #6
starting task #7
starting task #9
starting task #10
starting task #17
starting task #18

node3, 4 core:
starting task #1
starting task #2
starting task #3
starting task #4
starting task #12
starting task #14
starting task #13
starting task #15
starting task #20
starting task #21
starting task #22
starting task #23

Returning to my application, I get different results now:
Quote
node0, 1 core:
2 tasks (that is, the node said "hello" twice)
node1, 1 core:
1 task
node2, 2 cores:
5 tasks
node3, 4 cores:
8 tasks

This is a total of 2+1+5+8 = 16 tasks of the 24 tasks submitted. I will wait for the results and hopefully the remaining 8 tasks will get started later. I am not sure why it failed yesterday and since I have made no changes, it may still fail. There is something strange happening: on a 2.5 GHz computer single threaded it takes 40 minutes. Some shared cloud VM servers are slow, for example at one vendor I estimated it to be 0.9 GHz. I have not estimated it at the vendor I am using recently. However, it should not take 4 nodes (1 + 1 + 2 + 4 = 8 cores) many hours to solve this problem.
« Last Edit: February 16, 2016, 09:39:02 AM by broiyan »
Logged

broiyan

  • JPPF Grand Master
  • ****
  • Posts: 54
Re: unequal task distribution
« Reply #6 on: February 16, 2016, 06:00:09 AM »

After starting the job (24 tasks), the following is the driver log file:

Quote
2016-02-15 20:36:07,455 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-15 20:36:07,469 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-15 20:36:07,469 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-15 20:36:07,470 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting driver with PID=3748, UUID=8739C394-7CC1-AC50-4BBB-F0439D506BDF
2016-02-15 20:36:07,470 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-15 20:36:08,208 [INFO ][org.jppf.nio.NioConstants.getCheckConnection(80)]: NIO checks are enabled
2016-02-15 20:36:08,212 [INFO ][org.jppf.nio.StateTransitionManager.initExecutor(288)]: globalExecutor=java.util.concurrent.ThreadPoolExecutor@ea4a92b[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], maxSize=2
2016-02-15 20:37:55,496 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=20, readyOps=1, interestOps=0, context=ClientContext[channel=SelectionKeyWrapper[id=20], state=WAITING_JOB, uuid=8317CD9B-C9D2-E3C8-8348-CA42B24C932A, connectionUuid=8317CD9B-C9D2-E3C8-8348-CA42B24C932A_1, peer=false, ssl=true], nbTasksToSend=0, completedBundles={[]}] : java.io.EOFException: EOF reading inbound stream

Might this error explain why the remaining tasks, of the original 24 tasks, fail to be sent to the nodes? Note: I log a used memory high water mark every second on the application/driver VM and it seems to be well under the maximum memory (0.5GB when 2.0 GB is available).  On a node the high water mark was at 0.1 GB and 1.0 GB was available. Lack of memory is probably not the problem here.
« Last Edit: February 16, 2016, 06:33:43 AM by broiyan »
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: unequal task distribution
« Reply #7 on: February 16, 2016, 07:44:43 AM »

The error you posted indicates that the connection between your application and the server was broken. When this happens, the server's default behavior is to cancel the jobs submitted by the application. This behavior can be overriden for an individual job by setting a flag in the job's SLA:

Code: [Select]
JPPFJob job = new JPPFJob();
job.getSLA().setCancelUponDisconnect(false);

Maybe an error occurred within the application or the JPPF client, or maybe it just terminated without waiting for the job's results, this I cannot say without more information. Does the application's log show anything relevant?

Thanks,
-Laurent
Logged

broiyan

  • JPPF Grand Master
  • ****
  • Posts: 54
Re: unequal task distribution
« Reply #8 on: February 16, 2016, 08:42:43 AM »

Perhaps the method is:

Quote
job.getSLA().setCancelUponClientDisconnect(false); // there is no such method as setCancelUponDisconnect

The application and the server are on the same VM. The application log:
Quote
2016-02-15 20:45:09,629 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-15 20:45:09,641 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-15 20:45:09,642 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-15 20:45:09,642 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting client with PID=3939, UUID=26F91872-B423-1DAE-C87E-8939CC0354CC
2016-02-15 20:45:09,642 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-15 20:45:09,872 [INFO ][org.jppf.client.AbstractGenericClient.newConnection(281)]: connection [jppf_discovery-1-1] created
2016-02-15 20:45:09,898 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: jppf_discovery-1-1 - ClassServer] Attempting connection to the class server at li1025-39.members.linode.com:11443
2016-02-15 20:45:09,918 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(84)]: [client: jppf_discovery-1-1 - ClassServer] Reconnected to the class server
2016-02-15 20:45:09,923 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(78)]: [client: jppf_discovery-1-1 - TasksServer] Attempting connection to the task server at li1025-39.members.linode.com:11443
2016-02-15 20:45:10,447 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(91)]: [client: jppf_discovery-1-1 - TasksServer] Reconnected to the JPPF task server
2016-02-15 20:45:10,449 [INFO ][org.jppf.client.balancer.queue.TaskQueueChecker.dispatchJobToChannel(299)]: dispatching 1000000 tasks to remote channel
Note that the application log is a few minutes after the error in the driver log. I usually start the driver first. Then I start the nodes. Lastly I start the application, so that explains why the application timestamps are after the driver timestamps.

Retrying Everything:
I rebooted all VMs. I put in the call to setCancelUponClientDisconnect(false). I re-ran everything. This is the driver log.
Quote
2016-02-16 02:54:46,836 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-16 02:54:46,847 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-16 02:54:46,847 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-16 02:54:46,848 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting driver with PID=3758, UUID=750317BF-AEF1-FFAB-6A02-7A5B5DCC12EA
2016-02-16 02:54:46,848 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-16 02:54:47,111 [INFO ][org.jppf.nio.NioConstants.getCheckConnection(80)]: NIO checks are enabled
2016-02-16 02:54:47,117 [INFO ][org.jppf.nio.StateTransitionManager.initExecutor(288)]: globalExecutor=java.util.concurrent.ThreadPoolExecutor@ea4a92b[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], maxSize=2
2016-02-16 02:57:45,847 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=14, readyOps=1, interestOps=0, context=ClientClassContext[channel=SelectionKeyWrapper[id=14], state=WAITING_INITIAL_PROVIDER_REQUEST, resource=null, pendingRequests=0, currentRequest=null, connectionUuid=null, type=client, peer=false, uuid=null, secure=true, ssl=true]] : java.io.EOFException: EOF reading inbound stream
It appears the driver was started at 2:54:46, then a few minutes later at 2:57:44 I started the application and at almost the same time the [WARN] appears in the driver log. In fact just 1 second later, at 2:57:45.

The application log:
Quote
2016-02-16 02:57:44,831 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-16 02:57:44,840 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-16 02:57:44,841 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-16 02:57:44,841 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting client with PID=3913, UUID=ADFB15F0-7CCF-ECBC-53C8-DEFE7D696598
2016-02-16 02:57:44,841 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-16 02:57:45,723 [INFO ][org.jppf.client.AbstractGenericClient.newConnection(281)]: connection [jppf_discovery-1-1] created
2016-02-16 02:57:45,744 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: jppf_discovery-1-1 - ClassServer] Attempting connection to the class server at li1025-39.members.linode.com:11443
2016-02-16 02:57:45,761 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(84)]: [client: jppf_discovery-1-1 - ClassServer] Reconnected to the class server
2016-02-16 02:57:45,762 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(78)]: [client: jppf_discovery-1-1 - TasksServer] Attempting connection to the task server at li1025-39.members.linode.com:11443
2016-02-16 02:57:45,844 [WARN ][org.jppf.client.ClassServerDelegateImpl.run(108)]: [jppf_discovery-1-1 - ClassServer] caught org.jppf.ssl.SSLConfigurationException: java.security.UnrecoverableKeyException: Password must not be null, will re-initialise ...
2016-02-16 02:57:45,844 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: jppf_discovery-1-1 - ClassServer] Attempting connection to the class server at li1025-39.members.linode.com:11443
2016-02-16 02:57:45,852 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(84)]: [client: jppf_discovery-1-1 - ClassServer] Reconnected to the class server
2016-02-16 02:57:46,226 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(91)]: [client: jppf_discovery-1-1 - TasksServer] Reconnected to the JPPF task server
2016-02-16 02:57:46,229 [INFO ][org.jppf.client.balancer.queue.TaskQueueChecker.dispatchJobToChannel(299)]: dispatching 1000000 tasks to remote channel

This time, with only 3 nodes the distribution is:
Quote
node0, 1 core:
1 task
node1, 1 core:
1 task
node2, 2 cores:
10 tasks.
This experiment does not have node3 (4 cores) because I don't think it's relevant to the problem.

There are 24 tasks in total. It should take about 40 minutes to run 24 tasks, single threaded at 2.5 GHz. Since I have 3 nodes (1+1+2 = 4 cores in total), it can be reasonable to think it should take 10 minutes  If I assume time shared public VM servers give 50% throughput it should be finished by 20 minutes. It has been more than 25 minutes and it is not finished and the remainder of the tasks have not been started at any nodes.
« Last Edit: February 16, 2016, 09:36:21 AM by broiyan »
Logged

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: unequal task distribution
« Reply #9 on: February 16, 2016, 12:07:55 PM »

The latest application log you posted shows the following: "caught org.jppf.ssl.SSLConfigurationException: java.security.UnrecoverableKeyException: Password must not be null, will re-initialise ..."
It seems there is a SSL configuration issue here.

-Laurent
Logged

broiyan

  • JPPF Grand Master
  • ****
  • Posts: 54
Re: unequal task distribution
« Reply #10 on: February 16, 2016, 10:48:35 PM »

The SSL configuration is the same between my run of your suggested test program and my actual applcation.

The suggested test program produced these logs. 

application log:

Quote
2016-02-16 16:41:20,616 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-16 16:41:20,630 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-16 16:41:20,630 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-16 16:41:20,631 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting client with PID=4185, UUID=6E943B8C-DA6D-ADE0-555D-B0B249715048
2016-02-16 16:41:20,631 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-16 16:41:21,505 [INFO ][org.jppf.client.AbstractGenericClient.newConnection(281)]: connection [jppf_discovery-1-1] created
2016-02-16 16:41:21,521 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: jppf_discovery-1-1 - ClassServer] Attempting connection to the class server at li1025-39.members.linode.com:11443
2016-02-16 16:41:21,539 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(84)]: [client: jppf_discovery-1-1 - ClassServer] Reconnected to the class server
2016-02-16 16:41:21,541 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(78)]: [client: jppf_discovery-1-1 - TasksServer] Attempting connection to the task server at li1025-39.members.linode.com:11443
2016-02-16 16:41:22,112 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(91)]: [client: jppf_discovery-1-1 - TasksServer] Reconnected to the JPPF task server
2016-02-16 16:41:22,115 [INFO ][org.jppf.client.balancer.queue.TaskQueueChecker.dispatchJobToChannel(299)]: dispatching 1000000 tasks to remote channel

driver log:

Quote
2016-02-16 16:40:54,971 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-16 16:40:54,992 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-16 16:40:54,992 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-16 16:40:54,993 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting driver with PID=4131, UUID=4AA5778B-CDB5-D7AE-0526-CAA6A29D2CB5
2016-02-16 16:40:54,993 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-16 16:40:55,654 [INFO ][org.jppf.nio.NioConstants.getCheckConnection(80)]: NIO checks are enabled
2016-02-16 16:40:55,659 [INFO ][org.jppf.nio.StateTransitionManager.initExecutor(288)]: globalExecutor=java.util.concurrent.ThreadPoolExecutor@ea4a92b[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], maxSize=2
2016-02-16 16:41:52,355 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=16, readyOps=1, interestOps=0, context=ClientContext[channel=SelectionKeyWrapper[id=16], state=WAITING_JOB, uuid=6E943B8C-DA6D-ADE0-555D-B0B249715048, connectionUuid=6E943B8C-DA6D-ADE0-555D-B0B249715048_1, peer=false, ssl=true], nbTasksToSend=0, completedBundles={[]}] : java.io.EOFException: EOF reading inbound stream

The [WARN] message does not seem to be fatal. The simple test program completes. I think the completion is approximately simultaneous to the [WARN] message.

Sometimes the [WARN] message from the driver as the simple test program completes is:

Quote
2016-02-16 22:42:22,007 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=16, readyOps=1, interestOps=0, context=ClientContext[channel=SelectionKeyWrapper[id=16], state=WAITING_JOB, uuid=54E338C2-F1EC-B683-1459-FD19505BEAEE, connectionUuid=54E338C2-F1EC-B683-1459-FD19505BEAEE_1, peer=false, ssl=true], nbTasksToSend=0, completedBundles={[]}] : java.lang.NullPointerException: null
« Last Edit: February 17, 2016, 04:48:56 AM by broiyan »
Logged

broiyan

  • JPPF Grand Master
  • ****
  • Posts: 54
Re: unequal task distribution
« Reply #11 on: February 17, 2016, 02:12:03 AM »

Edit: In order to get rid of the following error,

Quote
2016-02-16 02:57:45,844 [WARN ][org.jppf.client.ClassServerDelegateImpl.run(108)]: [jppf_discovery-1-1 - ClassServer] caught org.jppf.ssl.SSLConfigurationException: java.security.UnrecoverableKeyException: Password must not be null, will re-initialise ...

I decided to turn off discovery in the application config.
Quote
jppf.discovery.enabled = false

I also turned the following in the driver config into a comment. (In JPPF 5.1.12 it is originally a comment.)

Quote
# jppf.ssl.configuration.file = config/ssl/ssl-server.properties

Unfortunately, the job starts but never finishes. Some tasks are sent to nodes, but not all of them. 14 of 16 are sent and 2 are never sent. They are poorly distributed:
Quote
node0, 1 core: 8 tasks
node1, 1 core: 4 tasks
node2, 2 core: 2 tasks
never sent: 2 tasks

The application log:

Quote
2016-02-20 00:22:44,600 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-20 00:22:44,610 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-20 00:22:44,610 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-20 00:22:44,610 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting client with PID=4727, UUID=08B70C74-D553-D215-06B6-1303657A5802
2016-02-20 00:22:44,610 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-20 00:22:44,668 [INFO ][org.jppf.client.AbstractGenericClient.newConnection(281)]: connection [driver1-1] created
2016-02-20 00:22:44,690 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: driver1-1 - ClassServer] Attempting connection to the class server at localhost:11443
2016-02-20 00:22:44,702 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(84)]: [client: driver1-1 - ClassServer] Reconnected to the class server
2016-02-20 00:22:44,711 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(78)]: [client: driver1-1 - TasksServer] Attempting connection to the task server at localhost:11443
2016-02-20 00:22:45,163 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(91)]: [client: driver1-1 - TasksServer] Reconnected to the JPPF task server
2016-02-20 00:22:45,167 [INFO ][org.jppf.client.balancer.queue.TaskQueueChecker.dispatchJobToChannel(299)]: dispatching 1000000 tasks to remote channel

The driver log:

Quote
2016-02-20 00:21:31,297 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2016-02-20 00:21:31,303 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2016-02-20 00:21:31,303 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2016-02-20 00:21:31,303 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting driver with PID=4595, UUID=8B6DFCCF-8BF9-E902-B373-EE6B76858816
2016-02-20 00:21:31,304 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2016-02-20 00:21:31,401 [INFO ][org.jppf.nio.NioConstants.getCheckConnection(80)]: NIO checks are enabled
2016-02-20 00:21:31,404 [INFO ][org.jppf.nio.StateTransitionManager.initExecutor(288)]: globalExecutor=java.util.concurrent.ThreadPoolExecutor@ea4a92b[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], maxSize=2
2016-02-20 00:22:02,071 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=16, readyOps=1, interestOps=0, context=ClientContext[channel=SelectionKeyWrapper[id=16], state=WAITING_JOB, uuid=0BBBE262-03AB-8A29-78C6-43D7FCBEA324, connectionUuid=0BBBE262-03AB-8A29-78C6-43D7FCBEA324_1, peer=false, ssl=true], nbTasksToSend=0, completedBundles={[]}] : java.io.EOFException: EOF reading inbound stream
2016-02-20 00:22:20,146 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=20, readyOps=1, interestOps=0, context=ClientContext[channel=SelectionKeyWrapper[id=20], state=WAITING_JOB, uuid=2C8C8127-8DEC-F74E-EBB4-03DF4E969ECF, connectionUuid=2C8C8127-8DEC-F74E-EBB4-03DF4E969ECF_1, peer=false, ssl=true], nbTasksToSend=0, completedBundles={[]}] : java.io.EOFException: EOF reading inbound stream
2016-02-20 00:22:35,852 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=24, readyOps=1, interestOps=0, context=ClientContext[channel=SelectionKeyWrapper[id=24], state=WAITING_JOB, uuid=1C02AC80-A205-807C-6D92-3B91F800811C, connectionUuid=1C02AC80-A205-807C-6D92-3B91F800811C_1, peer=false, ssl=true], nbTasksToSend=0, completedBundles={[]}] : java.lang.NullPointerException: null
« Last Edit: February 20, 2016, 06:53:57 AM by broiyan »
Logged
Pages: [1]   Go Up
 
JPPF Powered by SMF 2.0 RC5 | SMF © 2006–2011, Simple Machines LLC Get JPPF at SourceForge.net. Fast, secure and Free Open Source software downloads