Author Topic: application does not scale

broiyan

  • JPPF Grand Master
  • Posts: 53
application does not scale
« on: May 09, 2017, 08:30:22 AM »

My job has 55 tasks sent to 7 servers (8 processors in each server).
Six of the servers get 8 tasks each and one gets only 7, so the total is the expected 55.
server0 runs the application, the driver, and a node.
server1 through server6 each run a node.
With 20 units of data, execution completes in about 19 hours.
This works as expected.

With 200 units of data I expect completion in about 190 hours, but the nodes actually finish their work around 210 hours, plus or minus 10 hours, because of some apparent non-linearity and because "identical" machines do not perform identically for unknown reasons. When a node completes the 8 tasks allocated to it, I wait several minutes, then terminate the node via Ctrl-C and shut down Linux; sometimes I even check with vmstat that the machine is idling before issuing Ctrl-C. I terminate the nodes to save electricity. The application itself, however, never completes. Why?
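
(Incidentally, instead of Ctrl-C, I understand a node can also be stopped through JPPF's management API, which should give the driver a clean disconnect. A minimal sketch, assuming management is enabled on the node; the host and port here are placeholders for the node's actual JMX settings:)

Code: [Select]
import org.jppf.management.JMXNodeConnectionWrapper;

public class NodeShutdown {
  public static void main(final String[] args) throws Exception {
    // placeholders: substitute the node's management host and port
    String nodeHost = "server1";
    int jmxPort = 11198;
    JMXNodeConnectionWrapper jmx = new JMXNodeConnectionWrapper(nodeHost, jmxPort, false);
    jmx.connectAndWait(5_000L); // wait up to 5 seconds for the JMX connection
    jmx.shutdown();             // ask the node's JVM to exit cleanly
    jmx.close();
  }
}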

On server0, which runs the application, the driver and a node, vmstat shows the machine is 96% idle, performing very little block I/O and no swapping, although for some reason swpd memory is still occupied. Note that the second line of vmstat output is the important one: the first line reports averages since boot, while the second covers the most recent sampling period.
$ vmstat -Sm 8 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  17372    257     13    271    0    0    20    22    5    0 96  0  2  2  0
 0  0  17372    258     13    271    0    0     0     2 1450 3351  3  0 96  0  0


The application's heap high-water mark is 20 GB on a 32 GB machine according to my log file. Ultimately, I don't think the heap high-water mark means much, because it tends to be higher on machines that have more memory and a higher -Xmx even when given the same data. In other words, I believe the JVM simply postpones garbage collection on "bigger" machines. Given that, I think vmstat is a better indicator of a constraint: the si, so, bi and bo columns show that not much is happening, and the 96% idle reiterates that point.
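
(For what it's worth, a peak-heap figure like this can be read from the standard java.lang.management API. A minimal sketch; note that summing per-pool peaks can slightly overestimate, since the pools peak at different times:)

Code: [Select]
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapHighWaterMark {
  // sums the peak usage of all heap memory pools (eden, survivor, old gen)
  public static long peakHeapBytes() {
    long peak = 0L;
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      if (pool.getType() == MemoryType.HEAP) peak += pool.getPeakUsage().getUsed();
    }
    return peak;
  }

  public static void main(final String[] args) {
    System.out.printf("peak heap usage: %,d MB%n", peakHeapBytes() / (1024L * 1024L));
  }
}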

I believe vmstat shows swpd memory is always 0 on server1 through server6.

Since the application never terminates, its console is stuck looking like this:
OptimizingGridWorker constructor
OptimizingGridWorker constructor
OptimizingGridWorker constructor
OptimizingGridWorker constructor
... OMITTING SOME MORE OF THE SAME...
OptimizingGridWorker constructor
Number of tasks to start 55.
client process id: 2524, uuid: 01EC117A-A58C-9ECE-402A-488AC1A36B09
[client: driver1-1 - ClassServer] Attempting connection to the class server at localhost:11111
[client: driver1-1 - ClassServer] Reconnected to the class server
[client: driver1-1 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-1 - TasksServer] Reconnected to the JPPF task server


The application log:
$ cat jppf.log
2017-04-29 15:58:32,713 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2017-04-29 15:58:32,722 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2017-04-29 15:58:32,722 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2017-04-29 15:58:32,722 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting client with PID=2524, UUID=01EC117A-A58C-9ECE-402A-488AC1A36B09
2017-04-29 15:58:32,723 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2017-04-29 15:58:32,817 [INFO ][org.jppf.client.AbstractGenericClient.newConnection(281)]: connection [driver1-1] created
2017-04-29 15:58:32,829 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(76)]: [client: driver1-1 - ClassServer] Attempting connection to the class server at localhost:11111
2017-04-29 15:58:32,846 [INFO ][org.jppf.client.ClassServerDelegateImpl.init(84)]: [client: driver1-1 - ClassServer] Reconnected to the class server
2017-04-29 15:58:32,853 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(78)]: [client: driver1-1 - TasksServer] Attempting connection to the task server at localhost:11111
2017-04-29 15:58:32,899 [INFO ][org.jppf.client.TaskServerConnectionHandler.init(91)]: [client: driver1-1 - TasksServer] Reconnected to the JPPF task server
2017-04-29 15:58:32,917 [INFO ][org.jppf.client.balancer.queue.TaskQueueChecker.dispatchJobToChannel(299)]: dispatching 1000000 tasks to remote channel


The driver log shows activity on April 29 and also on May 8. That represents about 200 hours:
$ cat jppf-driver.log
2017-04-29 15:50:12,368 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2017-04-29 15:50:12,398 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2017-04-29 15:50:12,398 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2017-04-29 15:50:12,399 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting driver with PID=2185, UUID=A89E415F-408D-1191-F6FF-B3434E2E60DC
2017-04-29 15:50:12,399 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2017-04-29 15:50:12,660 [INFO ][org.jppf.nio.NioConstants.getCheckConnection(80)]: NIO checks are enabled
2017-04-29 15:50:12,664 [INFO ][org.jppf.nio.StateTransitionManager.initExecutor(288)]: globalExecutor=java.util.concurrent.ThreadPoolExecutor@47f37ef1[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], maxSize=8
2017-05-08 01:56:43,138 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=16, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=16], state=IDLE, uuid=D88CC823-0CC4-7F74-3379-4822710ADA66, connectionUuid=null, peer=false, ssl=false]] : java.net.ConnectException: node SelectionKeyWrapper[id=16, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=16], state=IDLE, uuid=D88CC823-0CC4-7F74-3379-4822710ADA66, connectionUuid=null, peer=false, ssl=false]] has been disconnected
2017-05-08 01:56:43,138 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=14, readyOps=1, interestOps=0, context=NodeClassContext[channel=SelectionKeyWrapper[id=14], state=WAITING_NODE_REQUEST, resource=null, pendingResponses=0, type=node, peer=false, uuid=D88CC823-0CC4-7F74-3379-4822710ADA66, secure=false, ssl=false]] : java.io.EOFException: null
2017-05-08 16:35:09,349 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=12, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=12], state=IDLE, uuid=4AC689F0-DE3B-3A61-E354-D8925CB080B8, connectionUuid=null, peer=false, ssl=false]] : java.net.ConnectException: node SelectionKeyWrapper[id=12, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=12], state=IDLE, uuid=4AC689F0-DE3B-3A61-E354-D8925CB080B8, connectionUuid=null, peer=false, ssl=false]] has been disconnected
2017-05-08 16:35:09,349 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=20, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=20], state=IDLE, uuid=7D744992-2E33-31EE-CBF5-DA2CD83CF2EA, connectionUuid=null, peer=false, ssl=false]] : java.net.ConnectException: node SelectionKeyWrapper[id=20, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=20], state=IDLE, uuid=7D744992-2E33-31EE-CBF5-DA2CD83CF2EA, connectionUuid=null, peer=false, ssl=false]] has been disconnected
2017-05-08 16:35:09,349 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=26, readyOps=1, interestOps=0, context=NodeClassContext[channel=SelectionKeyWrapper[id=26], state=WAITING_NODE_REQUEST, resource=null, pendingResponses=0, type=node, peer=false, uuid=282105AB-E026-DCA1-0A0A-74EE10F85586, secure=false, ssl=false]] : java.io.EOFException: null
2017-05-08 16:35:09,349 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=6, readyOps=1, interestOps=0, context=NodeClassContext[channel=SelectionKeyWrapper[id=6], state=WAITING_NODE_REQUEST, resource=null, pendingResponses=0, type=node, peer=false, uuid=9CD38BB6-A195-F74B-B0F2-A668C8E66939, secure=false, ssl=false]] : java.io.EOFException: null
2017-05-08 16:35:09,349 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=10, readyOps=1, interestOps=0, context=NodeClassContext[channel=SelectionKeyWrapper[id=10], state=WAITING_NODE_REQUEST, resource=null, pendingResponses=0, type=node, peer=false, uuid=4AC689F0-DE3B-3A61-E354-D8925CB080B8, secure=false, ssl=false]] : java.io.EOFException: null
2017-05-08 16:35:09,349 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=18, readyOps=1, interestOps=0, context=NodeClassContext[channel=SelectionKeyWrapper[id=18], state=WAITING_NODE_REQUEST, resource=null, pendingResponses=0, type=node, peer=false, uuid=7D744992-2E33-31EE-CBF5-DA2CD83CF2EA, secure=false, ssl=false]] : java.io.EOFException: null
2017-05-08 16:35:09,349 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=8, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=8], state=IDLE, uuid=9CD38BB6-A195-F74B-B0F2-A668C8E66939, connectionUuid=null, peer=false, ssl=false]] : java.net.ConnectException: node SelectionKeyWrapper[id=8, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=8], state=IDLE, uuid=9CD38BB6-A195-F74B-B0F2-A668C8E66939, connectionUuid=null, peer=false, ssl=false]] has been disconnected
2017-05-08 22:00:50,674 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=22, readyOps=1, interestOps=0, context=NodeClassContext[channel=SelectionKeyWrapper[id=22], state=WAITING_NODE_REQUEST, resource=null, pendingResponses=0, type=node, peer=false, uuid=1C643953-664F-5392-8902-3B73593EB596, secure=false, ssl=false]] : java.io.EOFException: null
2017-05-08 22:00:51,885 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=4, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=4], state=WAITING_RESULTS, uuid=A9FE2297-7D2B-A04F-329D-A8B7AB61B376, connectionUuid=null, peer=false, ssl=false]] : java.io.IOException: Connection reset by peer
2017-05-08 22:00:50,674 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=2, readyOps=1, interestOps=0, context=NodeClassContext[channel=SelectionKeyWrapper[id=2], state=WAITING_NODE_REQUEST, resource=null, pendingResponses=0, type=node, peer=false, uuid=A9FE2297-7D2B-A04F-329D-A8B7AB61B376, secure=false, ssl=false]] : java.io.EOFException: null


An example of a node log:
$ cat jppf-node.log
2017-04-29 15:51:34,603 [INFO ][org.jppf.utils.ManagementUtils.<clinit>(153)]: management successfully initialized
2017-04-29 15:51:34,612 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: --------------------------------------------------------------------------------
2017-04-29 15:51:34,612 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: JPPF Version: 5.1.2, Build number: 1758, Build date: 2016-02-05 06:24 CET
2017-04-29 15:51:34,612 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: starting node with PID=2307, UUID=A9FE2297-7D2B-A04F-329D-A8B7AB61B376
2017-04-29 15:51:34,613 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(83)]: --------------------------------------------------------------------------------
2017-04-29 15:51:35,185 [INFO ][org.jppf.classloader.ClassLoaderRequestHandler.run(155)]: maxBatchSize = 1
2017-04-29 15:51:35,240 [INFO ][org.jppf.execute.AbstractExecutionManager.<init>(114)]: running 8 processing threads
2017-04-29 15:51:35,241 [INFO ][org.jppf.execute.AbstractExecutionManager.createThreadManager(140)]: Using default thread manager

« Last Edit: May 09, 2017, 08:53:50 AM by broiyan »

lolo

  • Administrator
  • JPPF Council Member
  • Posts: 2172
Re: application does not scale
« Reply #1 on: May 11, 2017, 08:11:42 AM »

Hello,

Unfortunately, I do not have any conclusive information as to the scaling issue. I ran some experiments, on a much smaller time scale than yours (seconds whereas yours is hours), and there is one thing I found out. I tried running CPU-bound computations, scaling them to 10x and then 100x longer, and I could see they did not scale linearly. By running a CPU monitoring tool side by side with the nodes, I could see that the CPU (an Intel i7-4390) was constantly adjusting its frequency depending on the temperature of each core.

I ran the following code in the tasks:

Code: [Select]
import org.jppf.node.protocol.AbstractTask;
import org.jppf.utils.JPPFConfiguration;

public class MyTask extends AbstractTask<String> {
  private final int arraySize, nbIterations;
  private long elapsed;

  public MyTask(final int arraySize, final int nbIterations) {
    this.arraySize = arraySize;
    this.nbIterations = nbIterations;
  }

  @Override
  public void run() {
    long start = System.nanoTime();
    // pure CPU-bound work: fill an array with values derived from exp(log(j+1))
    for (int i=0; i<nbIterations; i++) {
      byte[] array = new byte[arraySize];
      for (int j=0; j<arraySize; j++) {
        int n = (int) Math.exp(Math.log(j + 1));
        n = (n % 256) - 128;
        array[j] = (byte) n;
      }
    }
    elapsed = (System.nanoTime() - start) / 1_000_000L;
    setResult(String.format("completed in %,d ms on node '%s'", elapsed, JPPFConfiguration.getProperties().getString("id")));
  }

  public long getElapsed() { return elapsed; }
}
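
For completeness, here is roughly how such a task can be submitted with the 5.x client API (a sketch; the job name and task parameters are placeholders):

Code: [Select]
import java.util.List;
import org.jppf.client.JPPFClient;
import org.jppf.client.JPPFJob;
import org.jppf.node.protocol.Task;

public class Runner {
  public static void main(final String[] args) throws Exception {
    // try-with-resources closes the client and its driver connections
    try (JPPFClient client = new JPPFClient()) {
      JPPFJob job = new JPPFJob();
      job.setName("MyTask benchmark");
      // e.g. 55 tasks of 170 iterations over a 512 KB array
      for (int i = 0; i < 55; i++) job.add(new MyTask(512 * 1024, 170));
      List<Task<?>> results = client.submitJob(job); // blocking submit
      for (Task<?> task : results) System.out.println(task.getResult());
    }
  }
}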

I used 2 nodes, one with a 128 MB heap and the other with 1 GB, but could not see any significant difference in performance.

Using an array size of 512 KB and varying the number of iterations, I got these results (times in milliseconds):
Code: [Select]
   170 iterations: job time =   6,375; avg task =   6,151; min task =   5,967; max task =   6,319
 1,700 iterations: job time =  68,742; avg task =  67,424; min task =  66,138; max task =  68,672
17,000 iterations: job time = 983,948; avg task = 964,960; min task = 950,358; max task = 983,877

Here I'm quite sure the discrepancy is due to the CPU's temperature control mechanism.

Quote
The application never completes. Why?

There isn't currently enough information to answer this. What I would suggest is to add logging configuration so the job's life cycle can be traced. Can you add the following:

In the driver's Log4j configuration:
Code: [Select]
log4j.logger.org.jppf.server.job.management.DriverJobManagement=DEBUG

This will allow us to see if and when a job is received from the application (JOB_QUEUED notification), dispatched to a node (JOB_DISPATCHED), results are received from a node (JOB_RETURNED), and the job completes in the driver (JOB_ENDED).

In the application's Log4j configuration:
Code: [Select]
log4j.logger.org.jppf.client.JPPFJob=DEBUG

Similarly, this will show when a job is submitted (JOB_START), tasks are sent to the driver (JOB_DISPATCH), results are received from the driver (JOB_RETURN), and the job completes in the client (JOB_END).
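
Alternatively, the same client-side life cycle events can be captured programmatically by attaching a JobListener to the job before submitting it; a minimal sketch against the 5.x API:

Code: [Select]
import org.jppf.client.event.JobEvent;
import org.jppf.client.event.JobListenerAdapter;

// usage: job.addJobListener(new LifeCycleLogger());
public class LifeCycleLogger extends JobListenerAdapter {
  @Override public void jobStarted(final JobEvent e)    { log("JOB_START", e); }
  @Override public void jobDispatched(final JobEvent e) { log("JOB_DISPATCH", e); }
  @Override public void jobReturned(final JobEvent e)   { log("JOB_RETURN", e); }
  @Override public void jobEnded(final JobEvent e)      { log("JOB_END", e); }

  private void log(final String phase, final JobEvent e) {
    System.out.printf("%s: job '%s'%n", phase, e.getJob().getName());
  }
}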

Can you perform a test with these log levels and let us know the outcome?

Thanks,
-Laurent