Author Topic: Evaluating JPPF for client project  (Read 3591 times)

andi

Evaluating JPPF for client project
« on: February 27, 2012, 11:03:38 PM »

Hello,

I'm evaluating different clustering solutions in addition to developing our own custom one and I came across JPPF. I have to say I really like what I've been reading. Great work and I'm sure we'll be able to use this framework in other situations.

Here's the problem we need to tackle:
We have a chat server (XMPP) where we need external components (separate processes) to connect to it (via TCP connections) in order to process messages and perform different operations. The chat server also provides a 'component presence' that lets components know the status of all other components (connected or not). We need to find a solution that 1. uses the chat server's component presence to see what it considers the status of each component to be, and 2. has some way to detect a stuck or hung component that is still connected to the server but no longer processes any data, or processes it too slowly, and needs to be killed and restarted.

Right now we've designed a solution where we'd have n identical processes that can be started as needed, all of which post heartbeat messages to a JMS topic every few seconds. All components also listen on this topic, so every process knows the health of all other processes. When they start up, they're all in the 'standby' state. After an initial waiting period, if there doesn't seem to be a 'controller' component (the process that connects to the chat server to listen for the component presence), one of the standby processes will attempt to become the controller. The chat server guarantees that only one process can become the 'controller'. Once a controller is available, it builds the overall status of all processes, taking the component presence into account. If the standby processes are distributed across multiple machines, the controller will tell one standby on each machine to become the 'task killer', which is responsible for killing a process that no longer responds, i.e. is hung/stuck (on the machine where the controller is running, it performs those tasks directly). From the chat server's component presence, the controller knows which 'primary' components to assign, and it then tells available standbys to become 'primaries', which connect to the chat server and process messages.

This is a clustering problem in that every process can perform any job and we want to be able to dynamically add and remove processes without having to reconfigure anything or restart any servers. Our solution gives us that as we use JMS for the communication backbone.

We'd love to use JPPF if possible, as it provides similar scalability and dynamic configuration, and also because of the great admin UI tools which we'd otherwise have to build ourselves. However, there are a few core questions for which I wasn't able to find an answer.

So technically, every JPPF node would be one of these 'standby' components. The controller would be the JPPF client that would submit jobs to the driver/server. The jobs/tasks would potentially be running for very long periods.

Questions:
1. We'd prefer not to have to start up a process as a dedicated 'controller'. Is it possible to configure the driver/server to make sure that the special 'controller' job is always running? This way, as the driver starts, it would ensure that this special job is always running, and as soon as it completes (i.e. the controller died/crashed) it would reschedule it so that another 'standby' node could take over the job.

2. Is there a good way for a node to detect if a job/task is stuck and if so, terminate/cancel it? When a job is cancelled, does this force the termination of the thread(s)/process(es) at the node where it's running (i.e. forcibly close the TCP connection to the chat server)? [If this is possible, then we'd only need 'controller', 'primary' and 'standby'. No need for 'task killer' anymore.]

3. Is it possible to send a message/command to a node where a job is running?

4. What happens if the driver/server goes down? Do all nodes continue to run and reconnect when the driver comes back? Or will nodes be forced to restart?

5. Can a node (i.e. controller) also be a client that submits jobs?

6. Is there a way to have jobs assigned so that the driver will send those jobs to nodes on different machines? E.g. if there are 3 'primary' components to process 3 different types of chat server messages and there are 3 physical machines where nodes are running, then it would be great to have the jobs assigned so that each of those 3 primary jobs ends up on a different machine. (It would probably be easy and work by default if there were only 3 nodes, one on each machine, but we'll definitely have multiple nodes (i.e. standbys) running on each physical machine.)

Hope this all makes sense, if not please ask. I'd be happy to provide more details.

Andi

lolo

Re: Evaluating JPPF for client project
« Reply #1 on: February 29, 2012, 08:27:05 AM »

Hello Andi,

Thanks for the detailed description of your requirements.

1) 
Quote
Is it possible to configure the driver/server to make sure that the special 'controller' job is always running?
That's what a driver does by default: if a node dies, the driver resubmits the job it was executing to another node automatically. There's nothing else to do, as it's the built-in behavior.

2)
Quote
Is there a good way for a node to detect if a job/task is stuck and if so, terminate/cancel it?
There are several approaches to this:
- you could use the built-in task timeout facility, so that a task can be cancelled if it hasn't completed by a specified date or after a specified time has elapsed. (Please note that the documentation link I provided contains obsolete information: cancelling an individual task is no longer available; this doesn't affect the timeout functionality though.) When a task times out, the callback method onTimeout() is called on that task, from which you can perform any needed cleanup operation (a short sketch follows this list).
- you could also use a NodeLifeCycleListener and, upon "job starting" events, set up a timer (or any other mechanism) that will allow you to know if a job is stuck according to your criteria. The timer or custom mechanism can be cancelled/disabled upon receiving a "job ended" event. You can then cancel the job using the node management APIs.
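
To illustrate the first approach, here's a minimal sketch (not a definitive implementation) of a task that sets its own timeout and cleans up in onTimeout(). The class name and the simulated work are made up for illustration, and I'm assuming the JPPFSchedule-based timeout API available in current JPPF releases:

Code:
import org.jppf.scheduling.JPPFSchedule;
import org.jppf.server.protocol.JPPFTask;

// hypothetical long-running task that cancels itself if it gets stuck
public class WatchdogTask extends JPPFTask {
  public WatchdogTask() {
    // cancel this task automatically if it hasn't completed after 30 seconds
    setTimeoutSchedule(new JPPFSchedule(30000L));
  }

  @Override
  public void run() {
    try {
      // placeholder for the real work, e.g. processing chat server messages
      Thread.sleep(60000L);
      setResult("completed normally");
    } catch (InterruptedException e) {
      setException(e);
    }
  }

  @Override
  public void onTimeout() {
    // called in the node when the timeout expires: release resources here,
    // e.g. close the TCP connection to the chat server
    setResult("timed out and cleaned up");
  }
}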

3)
Quote
Is it possible to send a message/command to a node where a job is running?
You can do that using the node management APIs, either from another remote JVM or from within the node itself. Please note that you may extend the management and monitoring capabilities by implementing your own pluggable management MBeans.
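
As an illustration, here's a rough sketch of how a remote JVM (e.g. your controller) could connect to the driver's JMX server, discover the attached nodes and query one of them. The host and port are placeholders and the exact management API may vary slightly between JPPF versions, so treat this as an assumption-laden outline rather than production code:

Code:
import java.util.Collection;
import org.jppf.management.*;

public class NodeCommandExample {
  public static void main(String[] args) throws Exception {
    // connect to the driver's JMX server (placeholder host/port)
    JMXDriverConnectionWrapper driver = new JMXDriverConnectionWrapper("driver-host", 11198);
    driver.connectAndWait(5000L);
    // discover the nodes currently attached to this driver
    Collection<JPPFManagementInfo> nodes = driver.nodesInformation();
    for (JPPFManagementInfo info : nodes) {
      // connect to each node's own JMX server
      JMXNodeConnectionWrapper node = new JMXNodeConnectionWrapper(info.getHost(), info.getPort());
      node.connectAndWait(5000L);
      // query the node's state; a custom pluggable MBean could be invoked the same way
      JPPFNodeState state = node.state();
      System.out.println("node " + info.getHost() + ":" + info.getPort()
        + " has executed " + state.getNbTasksExecuted() + " tasks");
      node.close();
    }
    driver.close();
  }
}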

4)
Quote
What happens if the driver/server goes down? Do all nodes continue to run and reconnect when the driver comes back? Or will nodes be forced to restart?
When a driver goes down, the nodes cancel the job they were currently executing, and then attempt to reconnect to a driver, possibly the same driver. So indeed, we can say that the nodes actually restart, and their current work is lost.

5)
Quote
Can a node (i.e. controller) also be a client that submits jobs?
Yes. You just need to add the required client libraries to the node's classpath, along with the client configuration properties in the node's configuration (if needed).
There is one limitation you must be aware of: if all other nodes are busy executing long running jobs, then the submitted job will not start immediately and may wait for a long time. Also during that time, the node that submits the job may be unavailable, especially if the job is submitted in blocking mode.

6)
Quote
Is there a way to have jobs assigned so that the driver will send those jobs to nodes on different machines?
I'm not sure I fully understand your requirement here, however it seems it can be accomplished by setting an execution policy on the job, which allows you to specify which node(s) a job can run on, at a very fine level of granularity.
Please also note that, from the JPPF point of view, a node is seen as a JVM, whether on the same machine or a remote one. You can have any number of nodes on a single physical or virtual machine - in the limits of available system resources.
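
For instance, here's a small sketch of restricting a 'primary' job to nodes running on a given machine, using a policy on the node's IP addresses. The property name, address value and helper method are illustrative assumptions on my part:

Code:
import org.jppf.client.JPPFJob;
import org.jppf.node.policy.*;

public class PolicyExample {
  // build a job that can only be dispatched to nodes on the specified machine
  public static JPPFJob createPrimaryJob(String targetAddress) throws Exception {
    JPPFJob job = new JPPFJob();
    job.setName("primary job for " + targetAddress);
    // match only nodes whose list of IPv4 addresses contains the target address
    ExecutionPolicy policy = new Contains("ipv4.addresses", true, targetAddress);
    job.getSLA().setExecutionPolicy(policy);
    return job;
  }
}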

I hope this answers your questions, and feel free to get back to us if you need any clarification or additional information.

Sincerely,
-Laurent


andi

Re: Evaluating JPPF for client project
« Reply #2 on: March 01, 2012, 08:14:07 PM »

Hi Laurent,

Thanks very much for the fast and detailed response. This is very helpful. I do have a few follow up questions.

1)
Quote
That's what a driver does by default: if a node dies, the driver resubmits the job it was executing to another node automatically. There's nothing else to do, as it's the built-in behavior.
Great, and that's what I gathered from the documentation as well. However, in order for this to work, it requires that someone submitted that job at some point, right? Is there a way (and maybe we'll have to extend the driver?) to have the driver start up (for the very first time) and by itself pick up and submit that initial job (i.e. bootstrap)? Or is there a way to do that via the admin UI (I haven't looked at that much yet)?

2)
Quote
Task timeout / NodeLifeCycleListener
The problem with either of these ideas is that it requires a task to complete (if everything is running well). At this point, I was thinking that there would be 1 job with 1 task that would basically run forever if everything is going well, or until it's cancelled by the controller. So in this case, there is no feedback and no timer that could be set up, as we won't know how long the job will run.
However, I was just thinking: could we do something like have 2 tasks per job initially, where each task would only run for a few seconds (essentially as a heartbeat), and whenever a task is completed, have it (or the job itself) resubmit a new task so that there's always one task running and one more in the queue? This way we could set up a task timer, and if a task doesn't complete, it could be auto-cancelled.

3)
Quote
Node Management API
I think that could work. Would a client, or in our case the controller node, be able to 'discover' all the other nodes in the network and find out which node is running which job via JPPFDriverAdmin.nodesInformation, JPPFDriver.getJobManager().getAllJobIds() and .getNodesForJob()?

4)
That's good to know and may not be an issue. I'm assuming that the driver itself doesn't do a lot of things other than take the jobs clients send and distribute them and send back statuses.

6)
We may want to make sure that the different 'primary' jobs/nodes (which will do the important, critical work) are distributed so that no 2 'primary' jobs are running on the same physical (or virtual) machine, so that if a physical machine goes down, only 1 of those primary jobs is affected.
I've looked at execution policies and I think we may be able to do that with them.

We won't get around to implementing this for quite a while but if we decide to use JPPF, I'm sure we'll have more questions in a few months.

Thanks again for the help. It's an awesome project and it's great that it's this active.

Andi

lolo

Re: Evaluating JPPF for client project
« Reply #3 on: March 08, 2012, 06:56:45 AM »

Hello Andi,

Sorry for this late answer.

1)
Quote
Is there a way ... to have the driver start up ... and by itself pickup and submit that initial job (i.e. bootstrap)?
The best way is to do this via a JPPF client. You could embed the client in the driver and have it submit the job from a driver startup class. In this case you need to make sure that the job submission is done in a separate thread, to avoid blocking the server's initialization. Here's a simple code example:

Code:
import java.util.List;
import org.jppf.client.*;
import org.jppf.server.protocol.JPPFTask;
import org.jppf.startup.JPPFDriverStartupSPI;

public class MyStartup implements JPPFDriverStartupSPI {
  @Override
  public void run() {
    System.out.println("running MyStartup");
    try {
      // embedded client used to submit the bootstrap job
      final JPPFClient client = new JPPFClient();
      final JPPFJob job = new JPPFJob();
      job.setName("bootstrap job");
      job.addTask(new MyTask());
      // submit from a separate thread so the driver's initialization is not blocked
      Runnable jobSubmission = new Runnable() {
        @Override
        public void run() {
          try {
            List<JPPFTask> results = client.submit(job);
            JPPFTask t = results.get(0);
            if (t.getException() != null) throw t.getException();
            else System.out.println("result: " + t.getResult());
            client.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      };
      new Thread(jobSubmission).start();
    } catch (Throwable e) {
      e.printStackTrace();
    }
  }

  // the bootstrap task executed on a node
  private static class MyTask extends JPPFTask {
    @Override
    public void run() {
      System.out.println("hello bootstrap world");
      setResult("execution successful");
    }
  }
}

2)
Quote
At this point, I was thinking that there would be 1 job with 1 task that would basically run forever if everything is going well or until it's cancelled by the controller.
I guess I misunderstood your question. I don't see any problem having a job with a task that runs forever. You can always cancel the job using the management facilities (either the admin console or the APIs). Cancelling the job will effectively interrupt the task. What you can also do is "suspend" the job, which cancels its execution in the node(s) but leaves it in the server's queue, then "resume" it, which will cause the driver to dispatch it to the same or another node.
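
For reference, here's a rough sketch of cancelling, suspending or resuming a job through the driver's JMX server from another JVM. I'm assuming the DriverJobManagementMBean proxy API as documented for your JPPF version, and the host, port and job uuid are placeholders:

Code:
import org.jppf.management.JMXDriverConnectionWrapper;
import org.jppf.server.job.management.DriverJobManagementMBean;

public class JobControlExample {
  public static void main(String[] args) throws Exception {
    // connect to the driver's JMX server (placeholder host/port)
    JMXDriverConnectionWrapper driver = new JMXDriverConnectionWrapper("driver-host", 11198);
    driver.connectAndWait(5000L);
    // obtain a proxy to the driver's job management MBean
    DriverJobManagementMBean jobManager =
      driver.getProxy(DriverJobManagementMBean.MBEAN_NAME, DriverJobManagementMBean.class);
    String jobUuid = "some-job-uuid"; // placeholder: uuid of the 'controller' job
    // suspend the job: cancels it in the node(s) but keeps it in the server's queue
    jobManager.suspendJob(jobUuid, Boolean.TRUE);
    // ... later, resume it so the driver dispatches it to an available node again
    jobManager.resumeJob(jobUuid);
    // or cancel it entirely:
    // jobManager.cancelJob(jobUuid);
    driver.close();
  }
}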

3)
Quote
Would a client, or in our case the controller node, be able to 'discover' all the other nodes in the network and find out which node is running which job via JPPFDriverAdmin.nodesInformation, JPPFDriver.getJobManager().getAllJobIds() and .getNodesForJob()?
Yes, this should work.

4)
Quote
I'm assuming that the driver itself doesn't do a lot of things other than take the jobs clients send and distribute them and send back statuses.
Yes, you are correct. This is why a single driver can scale up to hundreds or thousands of nodes.

I hope this helps.

Sincerely,
-Laurent