I recently changed my node's PermGen settings, so it's possible that setting this on the slave nodes is limiting the number that can launch:
But I am only running 1 master + 2 slaves on an 8GB machine so that shouldn't be a problem... or maybe it is.
The symptom remains, so the automatic provisioning is not at fault; something is amiss with whatever is causing these nodes to fail.
When I said "suppressed" the node provisioning, I did so by specifying "0" for the value of slaveNodes in the above code.
However, to rule out a problem with the node configuration, I ran just the nodes and the standard JPPF task code with no problems. (Other than a display of 0% CPU on active slave nodes when the master was idle... but they were doing work...)
So the problem is in my monitoring client, and is likely in the provisioning code above, or the associated JMX. The only other change to the monitoring code since it worked well is some server listing code, quoted below.
This happens once the slave node has been selected for task dispatching, but before its server-side channel is transitioned to its new state. What's missing is some exception handling code in or around the call to TaskQueueChecker.dispatchJobToChannel().
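The shape of that missing guard might look like the sketch below. Note that Channel, dispatch() and handleException() here are hypothetical stand-ins for the actual server-side classes, not JPPF's real API; only the try/catch structure mirrors the fix being described:

```java
// Hypothetical stand-ins for the server-side dispatch path; the point is
// that an exception during dispatch must route into the channel's own
// error handling, or the channel is never closed and discarded.
class DispatchGuard {
    interface Channel {
        void dispatch() throws Exception;   // may fail if the node just died
        void handleException(Exception e);  // closes and discards the channel
    }

    /** Returns true if the dispatch succeeded, false if the channel was discarded. */
    static boolean dispatchJobToChannel(Channel channel) {
        try {
            channel.dispatch();
            return true;
        } catch (Exception e) {
            // Without this handler the exception escapes, the channel is never
            // cleaned up, and the server is left with a phantom idle node.
            channel.handleException(e);
            return false;
        }
    }
}
```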
I'll upload and do another test.
lolo4j wrote:
lolo4j wrote:
Unfortunately, it looks like this just recurred, so it's not fixed....
I currently have 3200 idle nodes... and counting....
lolo4j wrote:
Some of the nodes show classloading errors. Unfortunately I didn't copy the stack trace before shutting it down.
Daniel Widdis wrote:
- The problem began when I upgraded from 3.4.1 to 3.4.3. I would suspect some of the code from fixes in the last two maintenance releases is involved.
- The initial problem only occurs when there are slave nodes provisioned. If I run a grid entirely with master nodes, I get no problems.
- The problem occurs even without my automatic node provisioning code above: it also occurred when I manually provisioned nodes using the GUI. I did provision while nodes were still starting up and connecting, so there may have been a timing issue involved.
- I have uploaded the patch file above (Tentative fix) to my driver's classpath. Everything else is stock 3.4.3.
- This appears to occur with larger grids of about 20 servers x (master + 2 nodes) = 60 nodes. When I tested your fix with 12 servers x 3 nodes per server = 36 total, I had no problems.
- This may be related to one type of task finishing and the node beginning to process another type of task (hypothesis, not yet tested)
- The initial symptom I see is in the GUI: one or two of the nodes turn "red". Logging into that server and checking the logs showed some exceptions relating to classloading. I did not copy them down at the time but will try to capture them later. I recall it was ClassLoaderImpl in the common package, and there may have been a related IOException on the channel it was trying to use for classloading.
- The node attempts to restart a few times.
- At some point during all this, the idle node count increments and the queue decrements so when other nodes finish their tasks, there is nothing on the queue to give them.
I'll do more troubleshooting tomorrow or Monday.
It may also not be the change from 3.4.1 to 3.4.3, but the change in how quickly I provision slave nodes. Before, I tended to do it manually, server by server, but lately I just do a Ctrl-A to select the whole list and provision all at once. (And I've repeated this process multiple times as more of my cloud nodes connect.)
Any guidance on what you might need from me for troubleshooting? My plan was to try the following:
lolo4j wrote:
To reproduce with a single master, I have a separate thread in my client which provisions 20 slaves, waits 10s, un-provisions the slaves, and waits 1s. Wash, rinse, repeat. The greater the number of slaves, the sooner the issue is triggered. I'm using a job streaming pattern in the client with tasks that simply wait for 1ms and then log a string via log4j2, so they don't have to do much. The driver uses "nodethread" load-balancing with multiplicator = 1.
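The provision/un-provision cycle above can be sketched as follows. This is a simplified model, not the actual test code: provisionSlaves() is a hypothetical stand-in for the JMX call that sets the slave count, and the delays are shortened from the real 10s/1s:

```java
// Simplified model of the reproduction loop: repeatedly provision a batch
// of slaves, let them run briefly, then un-provision them all.
class ProvisioningStressTest {
    private int provisionedSlaves = 0;

    // Stand-in for the JMX operation that sets the number of slave nodes.
    void provisionSlaves(int count) { provisionedSlaves = count; }

    int runCycles(int cycles, int slaves) {
        for (int i = 0; i < cycles; i++) {
            provisionSlaves(slaves); // provision the slaves (20 in the real test)
            sleep(10);               // let them run (10s in the real test)
            provisionSlaves(0);      // un-provision all of them
            sleep(1);                // pause before the next cycle (1s in the real test)
        }
        return provisionedSlaves;    // 0 after the final un-provision
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

The rapid churn of slaves appearing and disappearing is what triggers the race in the driver.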
Additional clues that I found by looking at the state of the driver with VisualVM, after I tried and killed all nodes from the console:
This exception prevents the handleException() method of the corresponding channel from being invoked, thus it is not properly closed and discarded from the server.
However, I'm still seeing a job hang at some point, and also a number of jmx@xxx threads trying to connect to the nodes via JMX, even though those nodes are dead...
The good news is that now I've completely automated the test and the failure detection (a job hang is detected when the server stops incrementing its executed tasks count), so I'm wasting less time on this. My test sends 1000 jobs with 1000 tasks each, one at a time. So far I've never made it through all 1000 jobs, so the test is a valid one. Once it passes, I'll run a much larger number of jobs (by at least 1 or 2 orders of magnitude) to make sure the server will hold up over time. But we're not there yet.
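The hang detection described here amounts to comparing successive polls of the executed-task count and flagging a stall when it stays flat. A minimal sketch of that check, written as a pure function over the polled values (a simplified model, not the actual test code):

```java
// Flags a job hang: the executed-task count should keep growing while the
// test is running, so a run of identical consecutive polls signals a stall.
class HangDetector {
    /** Returns true if the count was flat for at least stallPolls consecutive polls. */
    static boolean isHung(long[] executedTaskCounts, int stallPolls) {
        int flat = 0;
        for (int i = 1; i < executedTaskCounts.length; i++) {
            flat = (executedTaskCounts[i] == executedTaskCounts[i - 1]) ? flat + 1 : 0;
            if (flat >= stallPolls) return true;
        }
        return false;
    }
}
```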
I guess I should find a way to identify the node in ServerTaskBundleNode.toString(), if possible.

The node channels show only the master node:
The admin console's job data view still shows the job is dispatched to 2 dead nodes:

And I see there are 2 live "jmx@xxx" threads that should have died:
- lcohen-CSL:11198 - streaming job 421
- 12.0.0.1:12019
- 12.0.0.1:12029

This is turning out to be a hard-to-find issue, but it is also an excellent test of how the grid behaves under that kind of stress, with frequent massive dynamic changes in the topology. I apologize for the time it takes; this is one of those bugs that require time, patience, hard work and imagination. I've already had bugs where I literally spent weeks finding the cause and a solution, and this looks like one of them. I just ain't giving up, ever.
lolo4j wrote:
No problem at all! I'm just happy when you can reproduce the same symptoms I have and trust you're taking the time to fix it the right way. I have workarounds for the issues and can be patient.
The only issue now remaining, or at least the only one I can detect, is a job hang due to a single batch of dispatched tasks that is never returned to the client. An additional clue is these warnings I sometimes see in the client log:
This indicates that the server is sending back results that were already sent to the client, or that there may be some confusion in the task positions. I'm still trying to figure out when and where this happens, which is proving difficult.
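One way to make such duplicate results visible on the client side would be a small tracker keyed on task position, warning whenever a position comes back twice. This is a hypothetical sketch, not JPPF code; the class and method names are made up for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical client-side check: remembers the positions of tasks whose
// results were already received for a job, and flags any repeat.
class ResultTracker {
    private final Set<Integer> returned = new HashSet<>();

    /** Returns true if this task position was already received (a duplicate result). */
    boolean receive(int taskPosition) {
        return !returned.add(taskPosition);
    }
}
```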
Fixes in