Topic: Reduce priority of NodeError job

shiva.verma
Reduce priority of NodeError job
« on: March 06, 2018, 01:02:07 AM »

Can you please suggest the best way to reduce the priority of a job that failed because of a node error?

I want a job to be retried once in case it fails because of a node error (setApplyMaxResubmitsUponNodeError(true);), BUT I want it retried with the lowest priority, since its probability of failing on any other node is very high.
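For context, a minimal sketch of how these resubmit settings are configured on the job SLA, assuming the JobSLA setters setMaxTaskResubmits and setApplyMaxResubmitsUponNodeError from the JPPF API:

Code:
JPPFJob job = new JPPFJob();
// allow each task to be resubmitted at most once
job.getSLA().setMaxTaskResubmits(1);
// count resubmissions triggered by node errors against that limit
job.getSLA().setApplyMaxResubmitsUponNodeError(true);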

Thanks in advance
Shiva

lolo
Re: Reduce priority of NodeError job
« Reply #1 on: March 06, 2018, 08:12:43 AM »

Hello Shiva,

This can be accomplished with a JobTasksListener, a JPPF driver extension which receives notifications when tasks are dispatched to or returned from a node. In particular, each "return" event carries a return reason that lets you determine whether the tasks were returned because of a node error. Based on this, you can use a local connection to the driver's MBean server and the JMX-based APIs to lower the job's priority.

The following example implementation sets the priority of a job to -10 when its tasks fail due to a node error:

Code:
public class NodeErrorJobTasksListener implements JobTasksListener {
  // A JMX connection to the driver's MBean server.
  private final JMXDriverConnectionWrapper jmx;

  public NodeErrorJobTasksListener() {
    // no-args constructor connecting to the driver's local MBean server
    jmx = new JMXDriverConnectionWrapper();
    jmx.connect();
  }

  @Override
  public void tasksReturned(final JobTasksEvent event) {
    switch (event.getReturnReason()) {
      case NODE_CHANNEL_ERROR:
      case NODE_PROCESSING_ERROR:
        if (event.getJobSLA().getPriority() >= 0) {
          try {
            // update the priority via JMX to ensure the driver requeues the job properly, according to the new priority
            jmx.getJobManager().updatePriority(event.getJobUuid(), -10);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
        break;
    }
  }

  @Override
  public void tasksDispatched(final JobTasksEvent event) { }

  @Override
  public void resultsReceived(JobTasksEvent event) { }
}
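As a side note, a driver extension like this one is typically deployed by adding its class to the driver's classpath and declaring it in a service definition file; a sketch, assuming the standard JPPF service-provider mechanism and the org.jppf.job.JobTasksListener interface name:

Code:
# file META-INF/services/org.jppf.job.JobTasksListener in the driver's classpath,
# containing the fully qualified name of the implementation class:
NodeErrorJobTasksListener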

I hope this helps,
-Laurent

shiva.verma
Re: Reduce priority of NodeError job
« Reply #2 on: March 07, 2018, 04:42:10 AM »

Thanks Laurent, that will help.

Meanwhile I was trying to use the admin-ui, but I was unable to fetch job data. Sometimes I want finer control over jobs, such as killing a job that hangs on a node.

I can see data fetched in all of the tabs except for "Job Data".
driver: 5.1.8
node: 5.1.8

My node is running at 12001 and the driver at 11111.

Any clue would be highly appreciated. Attaching the admin console logs (hostnames are changed in the logs).

lolo
Re: Reduce priority of NodeError job
« Reply #3 on: March 07, 2018, 07:09:49 AM »

Hi Shiva,

In your admin-ui log file, I see the following warnings:

Code:
2018-03-06 19:36:04,687 [WARN ][org.jppf.ui.plugin.PluggableViewHandler.logErrors(157)]: Errors reported while creating the pluggable view 'ServerChooser':
no class name defined for pluggable view 'ServerChooser'
no 'addto' property defined for pluggable view 'ServerChooser'
container 'null' for pluggable view 'ServerChooser' could not be found
2018-03-06 19:36:04,687 [WARN ][org.jppf.ui.plugin.PluggableViewHandler.logErrors(157)]: Errors reported while creating the pluggable view 'JobData':
no class name defined for pluggable view 'JobData'
no 'addto' property defined for pluggable view 'JobData'
container 'null' for pluggable view 'JobData' could not be found

I could reproduce these warnings by setting the following in the admin-ui configuration file:

Code:
jppf.admin.console.view.ServerChooser.enabled = true
jppf.admin.console.view.JobData.enabled = true

Could you try to find these 2 lines in your jppf-gui.properties and simply comment them out or remove them, and let us know if this resolves the problem?
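For reference, the warnings hint at what a complete pluggable view declaration would need; a sketch, where the view name MyView, its class and the container name Main are illustrative:

Code:
jppf.admin.console.view.MyView.enabled = true
# fully qualified name of the class implementing the pluggable view
jppf.admin.console.view.MyView.class = test.MyView
# name of the built-in view to which this view is attached
jppf.admin.console.view.MyView.addto = Main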

Thanks for your time,
-Laurent

shiva.verma
Re: Reduce priority of NodeError job
« Reply #4 on: March 07, 2018, 07:25:42 AM »

Thanks for the quick reply Laurent. I had enabled them in the hope of getting it working.

I retried after commenting them out, but no luck. Attaching the UI as well as the driver logs.

Thanks again
Shiva

lolo
Re: Reduce priority of NodeError job
« Reply #5 on: March 08, 2018, 07:43:21 AM »

Hi Shiva,

Based on the log you provided, it seems this problem could be due to your admin-ui configuration defining a connection pool size of 10: you probably have either "jppf.pool.size = 10" or "driver1.jppf.pool.size = 10", or both. It appears the admin tool does not handle this very well. Could you please try setting it to 1 instead of 10 and let us know the outcome?

In any case, the admin tool only uses a single connection to each driver, even if more than one is configured, so setting the connection pool size to 1 will have no adverse impact on performance, quite the contrary in fact.
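A sketch of the corresponding change in the admin-ui configuration file, using the property names quoted above:

Code:
# use a single connection from the admin-ui to the driver
driver1.jppf.pool.size = 1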

Sincerely,
-Laurent

shiva.verma
Re: Reduce priority of NodeError job
« Reply #6 on: March 08, 2018, 01:56:43 PM »

Hi Laurent,

I even tried a pool size of 1, but no luck. Here is the output:
client process id: 3161, uuid: B07A5255-595E-4FBE-5AE5-72FF9AC27313
[client: driver1-1 - ClassServer] Attempting connection to the class server at DRIVER_IP2:11111
[client: driver1-1 - ClassServer] Reconnected to the class server
[client: driver1-1 - TasksServer] Attempting connection to the task server at DRIVER_IP2:11111
[client: driver1-1 - TasksServer] Reconnected to the JPPF task server

P.S. All other data is refreshed fine, except for the job data. Could it have anything to do with JMX on the driver/node side?

I am using JPPF-5.1.2-admin-ui.

Please help.

Thanks in Advance
Shiva

shiva.verma
Re: Reduce priority of NodeError job
« Reply #7 on: March 13, 2018, 08:06:20 AM »

Hi Laurent,

Can you please provide any pointers? I am really stuck on this and need it resolved before I can complete my project.

Many thanks in advance

lolo
Re: Reduce priority of NodeError job
« Reply #8 on: March 14, 2018, 07:59:30 AM »

Hi Shiva,

I have identified this problem as an issue in the client code and registered a bug for it: JPPF-527 Admin console's job data view does not display anything when client pool size is greater than 1.

I also uploaded a fix in patch 01 for JPPF 5.1.6. Even though it is built on top of 5.1.6, it should also work if you apply it to your 5.1.2 admin-ui. Could you please give it a try and let us know if it fixes the issue?

Please also note that I tested using the latest configuration file you provided (just replacing DRIVER_IP2 with "localhost") and it worked for me; I could not reproduce the problem on 5.1.2. In any case, the code fix should take care of it.

Sincerely,
-Laurent

shiva.verma
Re: Reduce priority of NodeError job
« Reply #9 on: March 16, 2018, 07:54:06 AM »

Hi Laurent,

Thanks for providing the patch. I appreciate your timely help.

Looks like I am still out of luck. These are the combinations I have tried:
1. 5.1.2 (server, node and application) + 5.1.6 admin-ui (with your patch)
2. 5.1.6 (server, node and application) + 5.1.6 admin-ui (with your patch)

For combination #1, I am also attaching the logs and properties.

Maybe I am missing something basic here? But since it is able to fetch all other data except for the tasks, I am all the more confused.

For combination #2, I was able to get task information a few times (though it will be hard for me to move my application and drivers/nodes to 5.1.6 for now). I could see task information maybe 3 out of 10 times.

Is it something to do with the pools?

My application was earlier running with 200 pool connections, which I reduced to 50, but that did not help with #1.

Please help

lolo
Re: Reduce priority of NodeError job
« Reply #10 on: March 17, 2018, 12:54:24 AM »

Hello Shiva,

Sorry to learn that the patch didn't resolve the problem.
It is very difficult to understand what the issue is from the logs alone. Is there any way you could provide a screenshot of the admin-ui that illustrates the problem?

Or maybe you could tell us whether it is similar to the screenshot attached to bug JPPF-527, or to the one attached to bug JPPF-518?
Thanks,
-Laurent

shiva.verma
Re: Reduce priority of NodeError job
« Reply #11 on: March 17, 2018, 01:58:18 AM »

Hi Laurent,

Please find attached the screenshot.

Please let me know if you need anything else to understand this issue better.

Thanks
Shiva

lolo
Re: Reduce priority of NodeError job
« Reply #12 on: March 17, 2018, 07:51:14 AM »

Hi Shiva,

Thanks a lot for the screenshots, they provide very useful information.
In your admin-ui configuration file I see the following: "jppf.gui.publish.mode = immediate_notifications". Could you try changing it to "jppf.gui.publish.mode = polling" instead, and let us know if this changes anything?

In "immediate_notifications" mode, the job data view is updated only via notifications received from the driver (as the name indicates, of course). However, I could see in the server statistics view that your tasks last around 2 minutes, so that would also be the frequency of the notifications the admin-ui receives. If the admin-ui is started after the jobs are submitted, it will miss the notifications the driver emitted when the jobs were dispatched to the nodes, and therefore the job data view will not reflect the current execution status of the jobs and nodes.

For such a scenario, the "polling" mode is indeed more appropriate, as it updates and/or rebuilds the entire job execution tree at each polling interval (set with the "jppf.gui.publish.period" property).
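A sketch of the suggested settings (the 1000 ms period is an illustrative value):

Code:
# update the job data view by periodically polling the driver
jppf.gui.publish.mode = polling
# polling interval in milliseconds
jppf.gui.publish.period = 1000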

Sincerely,
-Laurent

shiva.verma
Re: Reduce priority of NodeError job
« Reply #13 on: March 17, 2018, 09:52:06 PM »

Thanks a ton Laurent, this worked like a charm. I have not tested it thoroughly yet; I will test it some more and let you know.

Sorry for asking a dumb question, but I have a couple more related questions:

1. Can you also please suggest the best way to suppress console output from the client? The only output I am unable to suppress is the following, and I want it to go to the client logfile only:
....
....
....
[client: driver1-48 - ClassServer] Attempting connection to the class server at localhost:11111
[client: driver1-45 - ClassServer] Reconnected to the class server
[client: driver1-47 - ClassServer] Reconnected to the class server
[client: driver1-49 - ClassServer] Attempting connection to the class server at localhost:11111
[client: driver1-46 - ClassServer] Reconnected to the class server
[client: driver1-50 - ClassServer] Attempting connection to the class server at localhost:11111
[client: driver1-48 - ClassServer] Reconnected to the class server
[client: driver1-43 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-42 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-49 - ClassServer] Reconnected to the class server
[client: driver1-50 - ClassServer] Reconnected to the class server
[client: driver1-48 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-46 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-47 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-45 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-44 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-50 - TasksServer] Attempting connection to the task server at localhost:11111
[client: driver1-49 - TasksServer] Attempting connection to the task server at localhost:11111
....
....
....

2. Should the number of these connections be double the number of connecting nodes, with no relation to the number of jobs? I know that in my case I need two connections (one for sending and another for collecting). My understanding of this is not very clear yet. Any pointer would be great.


Thanks again
Shiva

lolo
Re: Reduce priority of NodeError job
« Reply #14 on: March 18, 2018, 07:58:11 AM »

Hello Shiva,

I'm glad the display issue is resolved :)

To answer your questions:

1. Unfortunately, JPPF doesn't have an option to suppress the output generated with System.out.println(). What you can do is redirect System.out to a file, using System.setOut() (see the sketch after these answers). You may also use a more elegant solution found here, by simply copy/pasting the class SystemOutToSlf4j and calling SystemOutToSlf4j.enable("org.jppf.client") before creating the JPPFClient.

2. The number of connections available to a client is the maximum number of jobs the JPPF client can handle concurrently. For example, if your client has 50 connections and you submit 60 jobs, then up to 50 jobs at a time will be sent to the server. The remaining 10 jobs are kept in the client's queue, waiting for a connection to become available (which happens whenever a job completes). To make an analogy, this is exactly like the number of threads in a fixed thread pool executor.
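A minimal sketch of the plain System.setOut() redirection mentioned in answer 1 (the class name ClientLauncher and the file name client-console.log are illustrative):

Code:
import java.io.FileOutputStream;
import java.io.PrintStream;

public class ClientLauncher {
  public static void main(final String[] args) throws Exception {
    // redirect System.out to an append-mode, auto-flushing file stream
    // before the JPPF client is created, so the "[client: ...]" connection
    // messages go to the file instead of the console
    System.setOut(new PrintStream(new FileOutputStream("client-console.log", true), true));
    // ... create the JPPFClient and submit jobs as usual ...
  }
}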

Sincerely,
-Laurent

shiva.verma
Re: Reduce priority of NodeError job
« Reply #15 on: March 22, 2018, 10:03:39 AM »

Thanks Laurent for the response. I will try suppressing the client's console output as you suggested.

I am right now stuck with a timeout issue.

Scenario:
 - Each job has a single task.
 - I want the job dispatch to expire after running for X amount of time.
 - Code:
Code:
JPPFSchedule jobExpirationSchedule = new JPPFSchedule(4000L);
job.getSLA().setMaxDispatchExpirations(0);
job.getSLA().setDispatchExpirationSchedule(jobExpirationSchedule);

But this doesn't seem to cancel the job once the timeout is reached; the job/task still continues to run.

This is what I observed in the server log:
Code:
2018-03-22 01:37:38,232 [DEBUG][org.jppf.scheduling.JPPFScheduleHandler.scheduleAction(102)]: DispatchExpiration : scheduling action[key=1E73FD32-AF6A-4FEF-4A13-B04BD34FAB34|4, schedule[delay=4000], action=org.jppf.server.nio.nodeserver.NodeDispatchTimeoutAction@2bb55e7f, start=2018-03-22 01:37:38.231
2018-03-22 01:37:38,247 [DEBUG][org.jppf.scheduling.JPPFScheduleHandler.scheduleAction(110)]: DispatchExpiration : date=2018-03-22 01:37:42.231, key=1E73FD32-AF6A-4FEF-4A13-B04BD34FAB34|4, future=java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5bf8c575
2018-03-22 01:37:38,247 [DEBUG][org.jppf.nio.StateTransitionManager.transitionChannel(157)]: transition from SENDING_BUNDLE to WAITING_RESULTS with ops=1 (readyOps=4) for channel id=4, submit=false
2018-03-22 01:37:38,247 [DEBUG][org.jppf.nio.PlainNioObject.read(98)]: read 4 bytes for PlainNioObject[channel id=477, size=4, count=4, source=ChannelInputSource[channel=java.nio.channels.SocketChannel[connected local=/127.0.0.1:11111 remote=/127.0.0.1:62421]], dest=null, location=MultipleBuffersLocation[size=4, count=4, currentBuffer=org.jppf.utils.JPPFBuffer@1ce56350, currentBufferIndex=0, transferring=false, list=[org.jppf.utils.JPPFBuffer@1ce56350]]]

Node log:
Code:
2018-03-22 01:37:42,248 [DEBUG][org.jppf.management.JPPFNodeAdmin.cancelJob(253)]: Request to cancel jobId = '1E73FD32-AF6A-4FEF-4A13-B04BD34FAB34', requeue = false
2018-03-22 01:37:42,249 [DEBUG][org.jppf.execute.AbstractExecutionManager.cancelAllTasks(193)]: cancelling all tasks with: callOnCancel=true, requeue=false
2018-03-22 01:37:42,249 [DEBUG][org.jppf.execute.AbstractExecutionManager.cancelTask(211)]: cancelling task = NodeTaskWrapper[task=test.utilities.RunSuiteTask@3ceef7d, cancelled=false, callOnCancel=false, timeout=false, started=true]
2018-03-22 01:37:42,249 [DEBUG][org.jppf.execute.AbstractExecutionManager.cancelTask(214)]: calling future.cancel(true) for task = NodeTaskWrapper[task=test.utilities.RunSuiteTask@3ceef7d, cancelled=false, callOnCancel=false, timeout=false, started=true]
2018-03-22 01:37:42,250 [DEBUG][org.jppf.scheduling.JPPFScheduleHandler.cancelAction(131)]: Task Timeout Timer : cancelling action for key=java.util.concurrent.FutureTask@26e39e2f, future=null
2018-03-22 01:37:42,251 [DEBUG][org.jppf.classloader.AbstractJPPFClassLoader.getResourceAsStream(253)]: JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=] lookup for 'META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration' = null for JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=]
2018-03-22 01:37:42,253 [DEBUG][org.jppf.classloader.AbstractJPPFClassLoader.getResourceAsStream(253)]: JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=] lookup for 'META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration' = null for JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=]
2018-03-22 01:37:42,275 [DEBUG][org.jppf.classloader.AbstractJPPFClassLoader.getResourceAsStream(253)]: JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=] lookup for 'META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration' = null for JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=]
2018-03-22 01:37:59,308 [DEBUG][org.jppf.classloader.AbstractJPPFClassLoader.getResourceAsStream(253)]: JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=] lookup for 'META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration' = null for JPPFClassLoader[id=3, type=client, uuidPath=[707FABBC-51DD-5EEE-F16F-EDEC867EE744, 40705A97-1FED-BDCD-B8D5-1485A2E2337C], offline=false, classpath=]


I would also like to know if there is a way to run an additional step at the end of each timeout, to make sure the node state is good enough for the next job/task.

Please help

lolo
Re: Reduce priority of NodeError job
« Reply #16 on: March 23, 2018, 08:06:22 AM »

Hi Shiva,

The node log extract you provided clearly shows that the job dispatch timeout was triggered and that the node actually attempted to cancel the executing task.
What I suspect here is that the task was not performing an interruptible operation at that time, causing the cancel request to be ignored.

A bit of explanation: cancelling a task in a JPPF node results in calling Future.cancel(true), which in turn results in calling Thread.interrupt() on the thread that is executing the task. If the task is not performing an interruptible operation, as specified in the Javadoc for Thread.interrupt(), then all that happens is that the thread's interrupted flag is set to true. If the task is performing an interruptible operation, then the thread will also receive an InterruptedException, allowing it to effectively stop processing.

In your case, I believe the task is not doing an interruptible operation and therefore does not receive an InterruptedException. To resolve this, you will need to add regular checks in the code of the task, as in this example:

Code:
public class MyTask extends AbstractTask<Object> {
  @Override
  public void run() {
    try {
      // ... part of the task's work ...
      if (Thread.currentThread().isInterrupted()) {
        throw new InterruptedException("task cancelled");
      }
      // ... rest of the task's work ...
    } catch (InterruptedException e) {
      // process the cancellation/interruption
    }
  }
}

Regarding your question:
Quote
I would also like to know if there is a way to run an additional step at the end of each timeout, to make sure the node state is good enough for the next job/task.

There isn't a way to distinguish whether a job was cancelled because of a timeout, but you can determine whether a job was cancelled by using a NodeLifeCycleListener and implementing its jobEnding() method.

For instance, let's first add a taskCancelled attribute to the above task implementation:

Code:
public class MyTask extends AbstractTask<Object> {
  private boolean taskCancelled;

  @Override
  public void run() {
    try {
      // ... part of the task's work ...
      if (Thread.currentThread().isInterrupted()) {
        this.taskCancelled = true;
        throw new InterruptedException("task cancelled");
      }
      // ... rest of the task's work ...
    } catch (InterruptedException e) { /* process the cancellation/interruption */ }
  }

  public boolean isTaskCancelled() { return taskCancelled; }
}

Then we can write a NodeLifeCycleListener that uses this attribute, as follows:

Code:
public class MyNodeLifeCycleListener extends NodeLifeCycleListenerAdapter {
  @Override
  public void jobEnding(NodeLifeCycleEvent event) {
    boolean jobCancelled = false;
    for (Task<?> t: event.getTasks()) {
      MyTask task = (MyTask) t;
      if (task.isTaskCancelled()) {
        jobCancelled = true;
        break;
      }
    }
    if (jobCancelled) {
      // check node state, etc...
    }
  }
}
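As with the driver extension earlier in this thread, a NodeLifeCycleListener is typically deployed via a service definition file, this time in the node's classpath; a sketch, assuming the standard JPPF service-provider mechanism and the org.jppf.node.event.NodeLifeCycleListener interface name:

Code:
# file META-INF/services/org.jppf.node.event.NodeLifeCycleListener in the node's classpath,
# containing the fully qualified name of the listener class:
MyNodeLifeCycleListener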

Sincerely,
-Laurent