JPPF: The open source grid computing solution


Author Topic: Implicit time for job to stay in server queue?  (Read 3863 times)

Binh Nguyen

  • JPPF Knight
  • **
  • Posts: 15
Implicit time for job to stay in server queue?
« on: September 07, 2013, 02:04:11 AM »

Hi,

I have a grid cluster with 50 nodes. I submitted 300 long-running jobs (each takes about 40 minutes), but only about 100 ran. Interestingly, I had no problem running thousands of short jobs. Is there an implicit maximum time for a job to stay in the server queue?

I am using the node_threads balancing algorithm with multiplicator = 1 in the server.

Thanks

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: Implicit time for job to stay in server queue?
« Reply #1 on: September 07, 2013, 06:03:26 AM »

Hello,

I confirm that there is no implicit timeout for jobs. You can set a job to expire (and optionally be cancelled if it is executing) via its SLA, but this has to be done explicitly.

What I suspect may be happening is that the jobs are held in the client's queue. Keep in mind that each client connection to the server can only handle one job at a time. For instance, if you have a single connection to the server (the default) and submit your jobs asynchronously (i.e. as non-blocking jobs), then the first job is sent to the server while all the others remain in the client queue. When that job finishes executing, the next job in the client queue is sent to the server, and so on.

To mitigate this, you can define a pool of connections (see here and here) to enable sending multiple jobs concurrently to the server.
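For illustration, in a JPPF 3.x client configuration file a connection pool to a single driver can be declared roughly as follows. The property names are as I recall them from the JPPF configuration guide and the host/port values are placeholders, so please verify them against the documentation for your version:

```properties
# jppf-client.properties (illustrative values)
jppf.drivers = driver1
driver1.jppf.server.host = my.server.host
driver1.jppf.server.port = 11111
# number of connections in the pool to driver1, i.e. how many
# jobs can be in transit to this server concurrently
driver1.jppf.pool.size = 10
```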

Is this what is happening in your scenario?

-Laurent
Binh Nguyen
Re: Implicit time for job to stay in server queue?
« Reply #2 on: September 09, 2013, 10:44:52 PM »

Hi Laurent,

I already use a connection pool of size 50 (the same as the number of nodes) in the client. So it seems that jobs in the client queue somehow get lost for long-running jobs. Is there a way for me to debug/inspect the client job queue?

Thanks
Binh Nguyen
Re: Implicit time for job to stay in server queue?
« Reply #3 on: September 10, 2013, 02:24:45 AM »

Hi Laurent,

After more investigation, I think the problem is something else. It looks like the completion notification for my long-running job is never sent back to the client (and possibly never reaches the server). So the client hangs waiting for the job even though the node is idle after finishing it. I have also observed that any job of 30+ minutes is enough to reproduce the bug.

Let me know if you want any more details. I am wondering if this has something to do with the TTL of the TCP connection?

Thanks
lolo
Re: Implicit time for job to stay in server queue?
« Reply #4 on: September 10, 2013, 08:29:43 AM »

Hello,

Thank you for sharing the details of your investigation.
Can you let us know how you found that the job finished executing in the node? This may help us figure out whether it was actually sent back to the server or got stuck in the node because of an error.

Have you also checked the node and server logs for error messages or exceptions?

A possible suspect is that an OutOfMemoryError occurs either while executing the job in the node, or while the node is serializing the results before sending them back to the server. The easiest way to find that out is to add the JVM option -XX:+HeapDumpOnOutOfMemoryError to the node's configuration property "jppf.jvm.options", for instance:
Code: [Select]
jppf.jvm.options = -server -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError
You might also want to do that for the server.

Unless your nodes are very far away from the server, I don't think the TTL (as defined here) set on the connection could be the cause of this problem. Maybe you were thinking of the keepalive setting?

-Laurent
Binh Nguyen
Re: Implicit time for job to stay in server queue?
« Reply #5 on: September 10, 2013, 08:22:34 PM »

Hi Laurent,

I can confirm that the job did not get stuck at the node. It is pretty easy to reproduce the bug.

Client:

Code: [Select]

import java.util.List;

import org.jppf.client.JPPFClient;
import org.jppf.client.JPPFJob;
import org.jppf.server.protocol.JPPFTask;

public final class LongTaskRunner {

    // Default sleep time: 40 minutes, overridable with -DlongTask.ms=<millis>
    public static final long SLEEP_TIME = Long.getLong("longTask.ms", 40 * 60 * 1000);

    public static void main(String[] args) throws Exception {
        // client
        final JPPFClient client = new JPPFClient();

        // job
        final JPPFJob job = new JPPFJob();
        job.setName("LongRunningJob");

        // task
        final LongRunningTask longTask = new LongRunningTask(SLEEP_TIME);
        job.addTask(longTask);
        job.setBlocking(true);

        long start = System.currentTimeMillis();

        System.err.println("submitting job...");
        List<JPPFTask> submitted = client.submit(job);

        if (submitted.size() != 1) {
            throw new IllegalStateException("Expecting one task");
        }

        System.err.println("job returned after " + (System.currentTimeMillis() - start) + " ms.");

        JPPFTask task = submitted.get(0);
        System.err.println("Result = " + task.getResult());
        // guard against an NPE when the task completed without an exception
        if (task.getException() != null) {
            task.getException().printStackTrace();
        }
    }
}


Task:

Code: [Select]
import org.jppf.server.protocol.JPPFTask;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class LongRunningTask extends JPPFTask {
    private static final long serialVersionUID = 1L;

    private static final Logger log = LoggerFactory.getLogger(LongRunningTask.class);

    private final long sleepTime;

    public LongRunningTask(long sleepTime) {
        this.sleepTime = sleepTime;
    }

    @Override
    public void run() {
        try {
            long start = System.currentTimeMillis();
            log.error("Start long running task.");
            Thread.sleep(sleepTime);
            final long totalTime = System.currentTimeMillis() - start;
            log.error("Finish long running task. time = " + totalTime + " ms.");
            setResult(totalTime);
        } catch (InterruptedException e) {
            log.error("interrupted ", e);
            setException(e);
        }
    }
}


After about 40 minutes this is what I see at the node log:

Code: [Select]
2013-09-10 10:24:13,140 [ERROR][com.palantir.grid.LongRunningTask.run(22)]: Start long running task.
2013-09-10 11:04:13,140 [ERROR][com.palantir.grid.LongRunningTask.run(25)]: Finish long running task. time = 2400000 ms.

while the client is still hanging, waiting for the job to return, with this log:

Code: [Select]
client process id: 1542
10:24:04,506  INFO VersionUtils:88 - --------------------------------------------------------------------------------
10:24:04,508  INFO VersionUtils:89 - JPPF version information: Version: 3.3.5, Build number: 1162, Build date: 2013-08-11 07:51 CEST
10:24:04,509  INFO VersionUtils:90 - starting client with PID=1542, UUID=FCFEB3F8-AAB0-0260-F0AB-E59AD16BB9D6
10:24:04,511  INFO VersionUtils:91 - --------------------------------------------------------------------------------
10:24:04,511  INFO AbstractGenericClient:105 - JPPF client starting with sslEnabled = false
submitting job...
10:24:04,616  INFO AbstractGenericClient:248 - connection [driver1-1] created
[client: driver1-1 - ClassServer] Attempting connection to the class server at hh-sse-be-03.ptcloud.int:11111
10:24:04,669  INFO ClassServerDelegateImpl:80 - [client: driver1-1 - ClassServer] Attempting connection to the class server at hh-sse-be-03.ptcloud.int:11111
[client: driver1-1 - ClassServer] Reconnected to the class server
10:24:04,708  INFO ClassServerDelegateImpl:91 - [client: driver1-1 - ClassServer] Reconnected to the class server
[client: driver1-1 - TasksServer] Attempting connection to the JPPF task server at hh-sse-be-03.ptcloud.int:11111
10:24:04,715  INFO AbstractGenericClient:248 - connection [driver1-2] created
[client: driver1-1 - TasksServer] Reconnected to the JPPF task server
[client: driver1-2 - ClassServer] Attempting connection to the class server at hh-sse-be-03.ptcloud.int:11111
10:24:04,768  INFO ClassServerDelegateImpl:80 - [client: driver1-2 - ClassServer] Attempting connection to the class server at hh-sse-be-03.ptcloud.int:11111
[client: driver1-2 - ClassServer] Reconnected to the class server
10:24:04,781  INFO ClassServerDelegateImpl:91 - [client: driver1-2 - ClassServer] Reconnected to the class server
[client: driver1-2 - TasksServer] Attempting connection to the JPPF task server at hh-sse-be-03.ptcloud.int:11111
[client: driver1-2 - TasksServer] Reconnected to the JPPF task server

So somehow the result set via setResult() in the task never made it back to the client.

Binh,
« Last Edit: September 10, 2013, 08:42:40 PM by Binh Nguyen »
Binh Nguyen
Re: Implicit time for job to stay in server queue?
« Reply #6 on: September 10, 2013, 08:43:56 PM »

I think I found the cause in server log:

Code: [Select]
2013-09-10 10:02:35,490 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(88)]: --------------------------------------------------------------------------------
2013-09-10 10:02:35,493 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(89)]: JPPF version information: Version: 3.3.5, Build number: 1162, Build date: 2013-08-11 07:51 CEST
2013-09-10 10:02:35,493 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(90)]: starting driver with PID=2217, UUID=67EF0EDB-8130-165C-7C6B-521C78E589AD
2013-09-10 10:02:35,493 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(91)]: --------------------------------------------------------------------------------
2013-09-10 10:02:35,644 [INFO ][org.jppf.server.nio.NioConstants.getCheckConnection(81)]: NIO checks are enabled
2013-09-10 11:19:45,088 [WARN ][org.jppf.server.nio.StateTransitionTask.run(93)]: error on channel SelectionKeyWrapper[id=12, 10.100.25.45:47793, readyOps=1, keyOps=0, context=ClientContext[channel=SelectionKeyWrapper[id=12], state=WAITING_JOB, uuid=FCFEB3F8-AAB0-0260-F0AB-E59AD16BB9D6, connectionUuid=FCFEB3F8-AAB0-0260-F0AB-E59AD16BB9D6_1, peer=false], nbTasksToSend=0] : java.io.IOException: Connection timed out



lolo
Re: Implicit time for job to stay in server queue?
« Reply #7 on: September 12, 2013, 06:51:36 AM »

Hi,

Thanks for sharing your investigation. This trace tells us that it is the connection between the client and the server that timed out. To me, it looks like a firewall configuration issue. It is common to have a firewall set up to automatically close any connection that has been idle for more than a specified time (I guess 30 minutes in your case). This is very common when you run within a cloud environment, for instance.

So to fix this, you will need to change the firewall settings or network security configuration to ensure that connections to the server do not time out.

Sincerely,
-Laurent
Binh Nguyen
Re: Implicit time for job to stay in server queue?
« Reply #8 on: September 12, 2013, 07:23:26 PM »

It is almost impossible for me to alter the firewall rules. Also, there will always be a limit, and in one of our use cases a job may need to run for ~12 hours.
Is there a way to add an option for job/ connection that will set keep alive to true? http://sourceforge.net/p/jin/bugs/199/

Thanks a lot,
lolo
Re: Implicit time for job to stay in server queue?
« Reply #9 on: September 13, 2013, 06:58:55 AM »

Hello,

I have created an enhancement request for this: JPPF-187 Ability to enable TCP keepalive from the configuration.
Basically, this will allow you to enable keepalive with a configuration property such as "jppf.socket.keepalive = true".
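For context, a "jppf.socket.keepalive = true" property would map to the standard SO_KEEPALIVE socket option. Outside of JPPF, enabling it on a plain java.net socket looks like this. This is a standalone sketch, not JPPF code; the class and helper names are made up for illustration:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {

    // Enables SO_KEEPALIVE on a socket and reports back whether it took effect.
    public static boolean enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true); // ask the OS to send TCP keepalive probes on idle connections
        return socket.getKeepAlive();
    }

    public static void main(String[] args) throws IOException {
        // Local loopback listener just so there is a live connection to configure.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            System.out.println("keepalive enabled: " + enableKeepAlive(client));
        }
    }
}
```

Note that the probe interval itself is not set here; how often the OS probes an idle connection is controlled by the OS-level settings discussed below.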

Please note that this alone will not be sufficient to avoid the firewall-related timeout. You will also need to configure the keepalive time (2 hours by default) at the OS level, to a value that is less than the firewall timeout setting. For instance, here are some articles on how to configure keepalive on Windows and on Linux.
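On Linux, the OS-level keepalive timers are kernel sysctls. The values below are illustrative only and assume the ~30-minute firewall idle timeout suspected earlier; tune them to your environment:

```properties
# /etc/sysctl.conf -- illustrative values; apply with `sysctl -p`
# start sending keepalive probes after 25 minutes of idle time,
# i.e. below the suspected 30-minute firewall timeout
net.ipv4.tcp_keepalive_time = 1500
# probe every 60 seconds, give up after 5 unanswered probes
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5
```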

This enhancement will be delivered as part of the next maintenance release JPPF 3.3.6. I do not have an exact date, but it will be before the end of September 2013.

Sincerely,
-Laurent
Binh Nguyen
Re: Implicit time for job to stay in server queue?
« Reply #10 on: September 13, 2013, 08:53:16 PM »

Great, thanks a lot! Please let me know when the fix is in; I can pull and compile from source when it is available.

lolo
Re: Implicit time for job to stay in server queue?
« Reply #11 on: September 14, 2013, 08:06:21 AM »

The fix is now committed. Please check the enhancement request for the proper revision numbers.

-Laurent
Binh Nguyen
Re: Implicit time for job to stay in server queue?
« Reply #12 on: September 16, 2013, 07:33:36 PM »

Awesome! Thanks for the quick turnaround.