My team and I have been struggling to keep a clustered ColdFusion application stable for the better part of the last six months, with little result. We are turning to SF in the hope of finding some JRun experts or fresh ideas, because we can't seem to figure it out.
The setup:
Two ColdFusion 7.0.2 instances clustered with JRun 4 (w/ the latest update) on IIS 6 under Windows Server 2003. Two quad core CPUs, 8GB RAM.
The issue:
Every now and again, usually once a week, one of the instances stops handling requests completely. There is no activity on it whatsoever and we have to restart it.
What we know:
Every time this happens, JRun's error log is full of java.lang.OutOfMemoryError: unable to create new native thread.
After reading the JRun documentation from Macromedia/Adobe and many confusing blog posts, we've more or less narrowed it down to incorrect or unoptimized JRun thread pool settings in the instance's jrun.xml.
Relevant part of our jrun.xml:
<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
    <attribute name="activeHandlerThreads">500</attribute>
    <attribute name="backlog">500</attribute>
    <attribute name="deactivated">false</attribute>
    <attribute name="interface">*</attribute>
    <attribute name="maxHandlerThreads">1000</attribute>
    <attribute name="minHandlerThreads">1</attribute>
    <attribute name="port">51003</attribute>
    <attribute name="threadWaitTimeout">300</attribute>
    <attribute name="timeout">300</attribute>
    {snip}
</service>
I enabled JRun's metrics logging last week to collect thread-related data. This is a summary of the data after letting it log for a week.
Average values:
{jrpp.listenTh} 1
{jrpp.idleTh} 9
{jrpp.delayTh} 0
{jrpp.busyTh} 0
{jrpp.totalTh} 10
{jrpp.delayRq} 0
{jrpp.droppedRq} 0
{jrpp.handledRq} 4
{jrpp.handledMs} 6036
{jrpp.delayMs} 0
{freeMemory} 48667
{totalMemory} 403598
{sessions} 737
{sessionsInMem} 737
Maximum values:
{jrpp.listenTh} 10
{jrpp.idleTh} 94
{jrpp.delayTh} 1
{jrpp.busyTh} 39
{jrpp.totalTh} 100
{jrpp.delayRq} 0
{jrpp.droppedRq} 0
{jrpp.handledRq} 87
{jrpp.handledMs} 508845
{jrpp.delayMs} 0
{freeMemory} 169313
{totalMemory} 578432
{sessions} 2297
{sessionsInMem} 2297
Any ideas as to what we could try now?
Cheers!
EDIT #1 -> Things I forgot to mention: Windows Server 2003 Enterprise w/ JVM 1.4.2 (for JRun)
The max heap size is around 1.4GB, yes. We used to have leaks but we fixed them; now the application uses around 400MB, rarely more. The max heap size is set to 1200MB, so we aren't reaching it. When we did have leaks, the JVM would just blow up and the instance would restart itself. This isn't happening now; it simply stops handling incoming requests.
We were thinking it has to do with threads, following this blog post: http://www.talkingtree.com/blog/index.cfm/2005/3/11/NewNativeThread
The Java exception being thrown is of type OutOfMemory but it's not actually saying that we ran out of heap space, just that it couldn't create new threads. The exception type is a bit misleading.
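To illustrate why thread creation can fail while the heap is mostly free, here is a back-of-the-envelope sketch in Java. It assumes a 32-bit JVM with roughly 2GB of user address space, which we have not confirmed; the 300MB of "other" native usage and the per-thread stack sizes are guesses, not measurements, and the class name ThreadBudget is made up for the example. Only the 1200MB heap figure comes from our actual settings.

```java
// Rough estimate of how many native thread stacks fit next to the Java heap.
// Every figure except the 1200 MB heap is an assumption for illustration.
final class ThreadBudget {

    /**
     * @param addressSpaceMb user address space available to the process, in MB
     * @param heapMb         maximum Java heap (-Xmx), in MB
     * @param otherNativeMb  perm gen, JVM code, DLLs and so on (a guess), in MB
     * @param stackKb        stack space reserved per thread, in KB
     * @return approximate number of threads that fit, or 0 if nothing is left
     */
    static long maxThreads(long addressSpaceMb, long heapMb,
                           long otherNativeMb, long stackKb) {
        long leftoverKb = (addressSpaceMb - heapMb - otherNativeMb) * 1024;
        return leftoverKb <= 0 ? 0 : leftoverKb / stackKb;
    }

    public static void main(String[] args) {
        // Assumed 2048 MB address space, 1200 MB heap, 300 MB other native use.
        System.out.println("1024 KB stacks: " + maxThreads(2048, 1200, 300, 1024));
        System.out.println(" 256 KB stacks: " + maxThreads(2048, 1200, 300, 256));
    }
}
```

Under these assumptions the larger the configured heap, the fewer threads the process can create, regardless of how much heap is actually in use.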
Basically, the blog is saying that 500 for activeHandlerThreads might be too high, but my metrics seem to show that we get nowhere near that, which is confusing us.
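The jrpp.* metrics presumably only cover the proxy service's handler threads, so one thing that might be worth checking is the total number of live threads in the whole JVM. A minimal sketch of how that could be counted, using only ThreadGroup calls that exist on JVM 1.4.2, is below. The class name LiveThreadCount is hypothetical, and the code would have to run inside the JRun JVM (for example from a test page) to say anything about our instance.

```java
// Counts every live thread in the current JVM by walking the thread group tree.
// Avoids Java 5+ APIs so that it stays compatible with JVM 1.4.2.
final class LiveThreadCount {

    // Climb to the root thread group so that all threads are visible.
    static ThreadGroup rootGroup() {
        ThreadGroup group = Thread.currentThread().getThreadGroup();
        while (group.getParent() != null) {
            group = group.getParent();
        }
        return group;
    }

    // enumerate() silently truncates when the array is too small,
    // so grow the buffer until the result fits with room to spare.
    static int liveThreads() {
        ThreadGroup root = rootGroup();
        Thread[] buffer = new Thread[Math.max(16, root.activeCount() * 2)];
        int count = root.enumerate(buffer, true);
        while (count >= buffer.length) {
            buffer = new Thread[buffer.length * 2];
            count = root.enumerate(buffer, true);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("Live threads in this JVM: " + liveThreads());
    }
}
```

If this number turns out to be far higher than jrpp.totalTh, the threads exhausting native memory would be coming from somewhere other than the handler pool.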