JPPF: The open source grid computing solution


Author Topic: Distributing input files from Client to Nodes  (Read 3651 times)

goshlive

  • JPPF Padawan
  • *
  • Posts: 4
Distributing input files from Client to Nodes
« on: March 16, 2017, 11:16:05 PM »

Hello,

Please allow me to introduce myself: I am Goshlive, a new member of this forum.

First of all, thanks for the JPPF framework, which is made available as open source for the world to use. This is really a great framework!
I am currently evaluating the framework to perform some file processing on a grid network, but I have some difficulties, which I describe below.

I have one input file loaded by the JPPF client, which runs on the same machine as the JPPF driver. The file is supposed to be processed (using an external Windows command, via Command Line tasks) by several nodes sitting on the same LAN. The problem is that I am unable to transfer the file read and loaded by the client to the nodes (is a transfer actually needed?) and have them process it effectively and efficiently, without degrading the system's performance.

So far, I have tried to transfer the input file through an FTP server started during the driver startup (as in http://www.jppf.org/doc/5.2/index.php?title=JPPF_startup_classes). With a simple FTP file transfer (as in http://www.jppf.org/samples-pack/FTPServer/) where driver, client and nodes are all on the same machine, the input file is transferred successfully. But in the actual deployment, where the nodes are on different machines and the input file is processed by external Command Line tasks, the file is transferred incorrectly (its size differs from the original), so the nodes cannot process it.

Given the above issue, I tried the callable class approach (as in the "Executing code in the client from a task" section at http://www.jppf.org/doc/5.2/index.php?title=Task_objects#Executing_code_in_the_client_from_a_task). I was expecting a node to delegate the file retrieval to the client and then continue processing the file accordingly. But the file, located at a specific path on the client machine, was not readable when control returned to the node, since the callable simply returned a "File" object. (Would it work if I returned a serializable object instead, such as a FileInputStream/FileOutputStream? I tried these streams with the same failed result.)

Is either of my approaches above actually workable? Am I missing anything, or is there a better option?

Lastly, sorry for my English. Thanks so much and have a great day!


Goshlive  ;)

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: Distributing input files from Client to Nodes
« Reply #1 on: March 17, 2017, 05:45:23 AM »

Hello Goshlive,

Your approach with Task.compute(JPPFCallable) would work if the callable returned the file's content instead of a File object, which is only an abstract representation of a file path, not of its content. JPPF provides utility methods in the FileUtils class that allow you to read the content of a file as binary data, e.g. FileUtils.getFileAsBytes(). Using this, your callable would look like this:

Code: [Select]
public static class MyCallable implements JPPFCallable<byte[]> {
  private String filePath; // path of the file in the client file system

  public MyCallable(String path) {
    this.filePath = path;
  }

  @Override
  public byte[] call() throws Exception {
    return FileUtils.getFileAsBytes(filePath);
  }
}

and you could use it in your task as in this example:

Code: [Select]
public class MyTask extends AbstractTask<String> {
  @Override
  public void run() {
    byte[] fileContent = compute(new MyCallable("/tmp/myFile.dat"));
    try (ByteArrayInputStream bais = new ByteArrayInputStream(fileContent)) {
      // ... process the file data ...
    } catch(IOException e) {
      setThrowable(e);
    }
  }
}

However, this is probably not the best way performance-wise. If the file is known before the tasks execute, it is more efficient to send its content along with the job, which saves you additional round trips from the node to the client. This can be done with a data provider, attached to the job like this (client side):

Code: [Select]
DataProvider dataProvider = new MemoryMapDataProvider();
byte[] fileContent = FileUtils.getFileAsBytes("/tmp/myFile.dat");
dataProvider.setValue("myFile", fileContent);
JPPFJob job = new JPPFJob();
job.setDataProvider(dataProvider);
job.add(new MyTask());

then your task could use it as in this example:

Code: [Select]
public class MyTask extends AbstractTask<String> {
  @Override
  public void run() {
    byte[] fileContent = (byte[]) getDataProvider().getValue("myFile");
    try (ByteArrayInputStream bais = new ByteArrayInputStream(fileContent)) {
      // ... process the file data ...
    } catch(IOException e) {
      setThrowable(e);
    }
  }
}

I hope this helps,
-Laurent

goshlive

  • JPPF Padawan
  • *
  • Posts: 4
Re: Distributing input files from Client to Nodes
« Reply #2 on: March 18, 2017, 10:31:20 PM »

Dear Laurent

Thanks so much for the reply.

I have tried the given solution with the Data Provider and it works like a charm!

By the way, I now need to transfer the processing results of more than one file produced by each node.

I see some options:

1) use the "Executing code in the client from a task" mechanism and call it in the JPPF task before the setResult() call

I chose this approach, but I am still having some concurrency issues while zipping the result files before they are processed on the client side. And it seems I may encounter another issue when more than one node is running on a single computer and helping with the processing. Should I create a different result folder for each task (or for each node) when the task generates its result files?

2) use a NodeLifeCycleListenerAdapter and its jobEnding() method

3) I read that the data provider is read-only; what does that actually mean? Is it possible to create a data provider in the task as well, to be read by another part of the JPPF life cycle?

Or is there a better approach?

Thanks again for your attention and have a great day!

Goshlive  ;)

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: Distributing input files from Client to Nodes
« Reply #3 on: March 19, 2017, 06:35:02 AM »

Hi Goshlive,

Quote
I read that Data Provider is read only, what does it actually mean?
It means that, even though a task can modify the content of the data provider, the modified data is discarded before the task results are sent back to the server. In the same way, the data provider is not sent back to the client. The data provider facility is basically a performance and memory-consumption optimization for when the tasks share common input data.
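The read-only behavior follows from serialization: the node works on a deserialized copy of the data provider, and that copy is never sent back. Here is a minimal JDK-only sketch of that effect (no JPPF dependency; the Provider class below is a made-up stand-in for MemoryMapDataProvider, used only to show that changes to a deserialized copy never reach the original):

```java
import java.io.*;
import java.util.*;

public class ReadOnlyDemo {
  // Hypothetical stand-in for a data provider: a serializable map wrapper.
  static class Provider implements Serializable {
    Map<String, Object> values = new HashMap<>();
  }

  static byte[] serialize(Object o) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(o);
    }
    return bos.toByteArray();
  }

  static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return ois.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    Provider clientSide = new Provider();
    clientSide.values.put("myFile", "original content");
    // The node works on a deserialized copy, as when a job travels over the network.
    Provider nodeSide = (Provider) deserialize(serialize(clientSide));
    nodeSide.values.put("myFile", "modified by task");
    // The client's copy is untouched: the node's modification is never sent back.
    System.out.println(clientSide.values.get("myFile")); // prints "original content"
  }
}
```

Any value a task puts into its copy lives only for the duration of that task's execution on the node.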

With regard to tasks processing multiple files, I would go for the simplest solution that covers all cases: create a uniquely named directory for each task. There are many ways to do this, including:
- using one of the java.nio.file.Files.createTempDirectory(...) methods
- generating a UUID or GUID as the folder name, using the JDK's UUID class or JPPFUuid
- etc.

After a task has processed all of its files, it can zip them and set the content of the zip as its result. This way, you're sure to avoid collisions, even when multiple tasks use the same file names and/or multiple nodes run on the same machine.
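As an illustration of the unique-directory-plus-zip approach, here is a JDK-only sketch (class and file names are made up; in a real JPPF task you would pass the resulting byte[] to setResult()):

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

public class ZipResultDemo {
  // Zip every file in a directory into an in-memory byte[],
  // suitable for returning as a task result.
  static byte[] zipDirectory(Path dir) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(bos)) {
      try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
        for (Path file : files) {
          zos.putNextEntry(new ZipEntry(file.getFileName().toString()));
          Files.copy(file, zos); // stream the file's bytes into the zip entry
          zos.closeEntry();
        }
      }
    }
    return bos.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    // Uniquely named per-task directory, so concurrent tasks/nodes never collide.
    Path workDir = Files.createTempDirectory("task-");
    Files.write(workDir.resolve("tmp_1.txt"), "result data".getBytes());
    byte[] zip = zipDirectory(workDir);
    System.out.println("zipped " + zip.length + " bytes");
    // in a JPPF task, you would then call setResult(zip)
  }
}
```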

Sincerely,
-Laurent

goshlive

  • JPPF Padawan
  • *
  • Posts: 4
Re: Distributing input files from Client to Nodes
« Reply #4 on: March 29, 2017, 05:06:55 AM »

Hi Laurent,

Apologies for my late reply.

Basically, I tried your solution with various input files, and the approach of using a UUID for the output folder seems to work nicely for smaller input files, so the directory issue should no longer be a problem. However, when the input file is a bit bigger, the results that were set are sometimes not all returned, and I get the exception java.lang.OutOfMemoryError: Java heap space (the result file is generated successfully on the node).

I do not think modifying the JVM configuration is a good solution, as computers on a grid may have various RAM sizes and JVM settings.

This is how I do it on each task:
Code: [Select]
try {
    setCommand();
    setCaptureOutput(true);
    launchProcess();

    setResult(getOutput()); // getOutput() returns byte[]
} catch (Exception e) {
    e.printStackTrace();
    setThrowable(e);
    System.out.println(e.getMessage());
}

And then on the Client side:
Code: [Select]
List<Task<?>> tasks = job.awaitResults();

for (Task<?> ts : tasks) {
    Throwable t = ts.getThrowable();
    if (t != null) {
        System.err.println("Task " + ts.getId() + " failed. No result returned: " + t.getMessage());
    } else {
        process(ts.getResult());
    }
}

The following line was executed:
Code: [Select]
System.err.println("Task " + ts.getId() + " failed. No result returned: " + t.getMessage());

I got the following result on the node computer for one task, which is actually not that big:
tmp_1.zip            (21,413 KB)    -most of the time, does not get returned
tmp_2.zip            (21,594 KB)    -most of the time, returned
tmp_3.zip            (17,547 KB)    -most of the time, returned

And another task (executed at the same time), output:
tmp_1.zip            (1 KB)         -most of the time, returned

Is there any minimum node specification required? For processing the input file, my computer should be no problem, as the output files are actually generated properly.

I am currently running one Node, one Driver and one Client on the same Computer with the following specification:
OS: Windows 7 SP 1
Pro: Intel Core i5-3320 CPU @ 2.6 GHz
RAM: 8 GB
Arch: 64-bit

So I think setting the output directly via calls to the setResult() method is not an option. (Note: I zipped/compressed the output folder before the setResult() calls.)

Am I doing something wrong?
Shall I revert to using FTP? Or do you see any better approach?


Thanks again for your attention and have a great day!

Goshlive  ;)

lolo

  • Administrator
  • JPPF Council Member
  • *****
  • Posts: 2272
    • JPPF Web site
Re: Distributing input files from Client to Nodes
« Reply #5 on: March 31, 2017, 08:25:31 AM »

Hi Goshlive,

My understanding is that the compressed files you put in your tasks are too large for your Java heap. What I would recommend trying first is to increase the heap size for your nodes, that is, set an -Xmx value that will accommodate the file sizes. This can be done in the node's configuration file, in the value of the "jppf.jvm.options" property. For instance, you can try "jppf.jvm.options = -Xmx1g <... other options ...>". I believe that with 8 GB of RAM you have plenty enough to allocate 1 GB of heap to each node; you can tune it down later if needed. For information, the node's default value is -Xmx128m (128 MB), so that is probably not sufficient.
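For example, the relevant line in the node's configuration file would look like this (only the -Xmx setting shown; any other JVM options would go on the same line):

```
# increase the node's heap from the default 128 MB to 1 GB
jppf.jvm.options = -Xmx1g
```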

If the file sizes are really very large (in the hundreds of MBs or in the GBs), then you might indeed have to use another technique such as FTP.

Sincerely,
-Laurent

goshlive

  • JPPF Padawan
  • *
  • Posts: 4
Re: Distributing input files from Client to Nodes
« Reply #6 on: April 04, 2017, 11:46:17 PM »

Hi Laurent,

Thanks so much for the reply and the suggestion. I tried the above solution and it works nicely, but it feels a bit slow, perhaps because of the actual Windows processes run within the node execution.

Next, I will test it in a pre-setup grid environment and will see how it goes.


Thanks again for your attention and have a great day!

Goshlive  ;)