Hello all
I am investigating JPPF for a machine learning application. I am specifically interested in distributing the work required to do EM training of Gaussian Mixture Models.
http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
For example, for the K-Means algorithm:
http://en.wikipedia.org/wiki/K-means_clustering
the E-step in the EM algorithm involves computing the distance between each data point and each cluster centroid. Since I have a lot of data, I would like to send the current centroids to all the nodes, have them calculate distances and report back, at which point I can do the M-step.
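To make that split concrete, here is a rough sketch of one K-Means iteration in plain Java. It uses no JPPF API; the class and method names (KMeansSketch, Partial, eStep, mStep) are hypothetical. One assumption beyond what I described above: instead of reporting raw distances, each node reports per-cluster coordinate sums and point counts, which is all the M-step needs and keeps the reply small.

```java
/** Hypothetical helper, not part of JPPF: one K-Means iteration split into node and server halves. */
final class KMeansSketch {

    /** Partial result a node reports back: per-cluster coordinate sums and point counts. */
    static final class Partial {
        final double[][] sums;
        final int[] counts;

        Partial(int k, int dim) {
            sums = new double[k][dim];
            counts = new int[k];
        }
    }

    /** E-step, run on a node: assign each locally held point to its nearest centroid. */
    static Partial eStep(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int dim = centroids[0].length;
        Partial partial = new Partial(k, dim);
        for (double[] x : points) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = 0.0;
                for (int j = 0; j < dim; j++) {
                    double diff = x[j] - centroids[c][j];
                    d += diff * diff; // squared Euclidean distance
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            partial.counts[best]++;
            for (int j = 0; j < dim; j++) {
                partial.sums[best][j] += x[j];
            }
        }
        return partial;
    }

    /** M-step, run on the server: merge all partials and recompute each centroid as a mean. */
    static double[][] mStep(Partial[] partials, double[][] oldCentroids) {
        int k = oldCentroids.length;
        int dim = oldCentroids[0].length;
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++) {
            int n = 0;
            double[] sum = new double[dim];
            for (Partial p : partials) {
                n += p.counts[c];
                for (int j = 0; j < dim; j++) {
                    sum[j] += p.sums[c][j];
                }
            }
            for (int j = 0; j < dim; j++) {
                // An empty cluster keeps its previous centroid.
                next[c][j] = (n == 0) ? oldCentroids[c][j] : sum[j] / n;
            }
        }
        return next;
    }
}
```

Only the centroids travel to the nodes and only the small Partial objects travel back; the points themselves never move after the first iteration.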
For large datasets, it becomes useful to have the same node process the same data on each iteration. Since the data has to come from somewhere initially (like a central server), having the same node work on the same data all the time can save a lot of bandwidth.
BOINC calls this concept locality scheduling:
http://boinc.berkeley.edu/sched_locality.php
Before I discovered JPPF, I was looking at using JMS to distribute the tasks to my nodes. The ActiveMQ JMS message broker has a very cool feature called message groups:
http://activemq.apache.org/message-groups.html
This allows you to ensure that the same consumer receives all the messages from a logical group, which is determined by setting the same ID on all the messages in the group. This provides a basic way of doing locality scheduling.
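The group-ID idea can be sketched without any broker. The class below is a hypothetical illustration in plain Java, not JPPF, BOINC or ActiveMQ code: the first time a group ID (say, a data chunk name) is seen it is pinned to a node round-robin, and every later task with that ID goes to the same node. The node and chunk names are made up, and the sketch assumes the set of nodes is fixed; handling a node that disappears is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sticky scheduler: pins each group ID to one node for its lifetime. */
final class StickyAssigner {
    private final List<String> nodes = new ArrayList<>();
    private final Map<String, String> owner = new HashMap<>(); // group ID -> node ID
    private int next = 0; // round-robin cursor for group IDs not seen before

    StickyAssigner(List<String> nodeIds) {
        if (nodeIds.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        nodes.addAll(nodeIds);
    }

    /** Returns the node that owns groupId, choosing one round-robin on first sight. */
    synchronized String nodeFor(String groupId) {
        String node = owner.get(groupId);
        if (node == null) {
            node = nodes.get(next);
            next = (next + 1) % nodes.size();
            owner.put(groupId, node);
        }
        return node;
    }
}
```

With two nodes, chunk-0 lands on the first node and stays there on every iteration, so that node only has to fetch chunk-0 once.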
I have briefly looked at JPPF's DataProvider mechanism, but it doesn't quite look like it's going to be able to help with locality scheduling.
Any thoughts on how one could achieve locality scheduling with JPPF?
Cheers,
Albert
P.S. More details about the setup I envision:
I'm looking at storing the data on a Hadoop distributed file system:
http://lucene.apache.org/hadoop/
http://lucene.apache.org/hadoop/hdfs_design.html
Each node will have an Ehcache instance in which it will keep the data it fetches from the distributed file system:
http://ehcache.sourceforge.net/
This cache can be configured to have a maximum memory size and could possibly spool to disk, or discard the least recently used entries.
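To show the eviction behaviour I have in mind, here is a minimal sketch using only the JDK's LinkedHashMap. It is not the Ehcache API; the class name LruChunkCache is hypothetical, and it bounds the cache by entry count rather than by memory size, with no disk spooling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-node cache: discards the least recently used chunk once it is full. */
final class LruChunkCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruChunkCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, least recently used first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the least recently used entry.
        return size() > maxEntries;
    }
}
```

A node that is always handed the same chunks would keep hitting this cache, whereas a node that receives arbitrary chunks would keep evicting and re-fetching them.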