Tracking down a memory leak / garbage-collection issue in Java

小开

Can you run the production box with JMX enabled?

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=<port>
...

Monitoring and Management Using JMX

And then attach with JConsole or VisualVM?
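For a rough idea of what JConsole or VisualVM read over that JMX connection, the same platform MXBeans can also be queried from inside the JVM. This is only a minimal sketch and not part of the original suggestion; the class name `MemoryPoolReport` is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists each JVM memory pool with its current usage.
final class MemoryPoolReport {

    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the pool has no defined maximum.
            lines.add(String.format("%-30s %-16s used=%d bytes max=%d bytes",
                    pool.getName(), pool.getType(), usage.getUsed(), usage.getMax()));
        }
        return lines;
    }

    public static void main(String[] args) {
        describePools().forEach(System.out::println);
    }
}
```

The pool names depend on the JVM version and the collector in use, so the output will differ from box to box.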

Is it ok to do a heap dump with jmap?

If so, you could then analyze the heap dump for leaks with JProfiler (which you already have), jhat, VisualVM, or Eclipse MAT. Comparing heap dumps taken at different times might also help to find leaks or patterns.

And since you mentioned jakarta-commons: there is a known problem with jakarta-commons-logging holding onto the classloader. For a good read on that, check

A day in the life of a memory leak hunter (release(Classloader))

小开

Any JAXB? I find that JAXB is a perm space stuffer.

Also, I find that visualgc, now shipped with JDK 6, is a great way to see what's going on in memory. It shows the eden, generational, and perm spaces and the transient behavior of the GC beautifully. All you need is the PID of the process. Maybe that will help while you work on JProfiler.

And what about the Spring tracing/logging aspects? Maybe you can write a simple aspect, apply it declaratively, and do a poor man's profiler that way.
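As a rough illustration of the "poor man's profiler" idea, the sketch below accumulates the time spent in every call made through an interface. Note that it uses a plain JDK dynamic proxy rather than a Spring aspect, purely so that it runs without dependencies; a Spring around advice would wrap calls in a similar way. `TimingProxy` and `Greeter` are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class TimingProxy {

    // Total nanoseconds spent per method name, across all wrapped objects.
    static final Map<String, LongAdder> NANOS = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the target's own exception
            } finally {
                NANOS.computeIfAbsent(method.getName(), k -> new LongAdder())
                     .add(System.nanoTime() - start);
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }

    // Hypothetical service interface, used only for the demo.
    interface Greeter {
        String greet(String name);
    }

    public static void main(String[] args) {
        Greeter real = name -> "hello " + name;
        Greeter timed = wrap(real, Greeter.class);
        System.out.println(timed.greet("world"));
        NANOS.forEach((method, nanos) -> System.out.println(method + ": " + nanos.sum() + " ns"));
    }
}
```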

小开

I would look for directly allocated ByteBuffer.

From the javadoc.

A direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.

Perhaps the Tomcat code uses this to do I/O; if so, configure Tomcat to use a different connector.
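To check whether direct buffers are actually growing, the JVM's buffer pool MXBean can be polled. The sketch below is an illustration under one assumption worth stating loudly: `BufferPoolMXBean` only exists on Java 7 and later, so it would not work on a JDK 6 box. The class name `DirectBufferUsage` is made up.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectBufferUsage {

    // Bytes currently held by the JVM's "direct" buffer pool, or -1 if the pool is not exposed.
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB outside the heap
        long after = directBytesUsed();
        System.out.println("direct bytes before=" + before + " after=" + after
                + " capacity=" + buffer.capacity());
    }
}
```

None of this memory shows up in the heap figures, which is why a leak of this kind can coexist with a heap that looks stable.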

Failing that you could have a thread that periodically executes System.gc(). "-XX:+ExplicitGCInvokesConcurrent" might be an interesting option to try.
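A minimal sketch of such a thread, assuming a scheduled executor is acceptable in the application; the class name `PeriodicGc` and the interval are made up for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class PeriodicGc {

    // Starts a daemon thread that requests a GC at a fixed interval.
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "periodic-gc");
            thread.setDaemon(true); // must not keep the JVM alive on shutdown
            return thread;
        });
        // System.gc() is only a request; the JVM ignores it with -XX:+DisableExplicitGC.
        scheduler.scheduleAtFixedRate(System::gc, period, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = start(1, TimeUnit.SECONDS);
        Thread.sleep(3_500); // let a few collections run
        scheduler.shutdownNow();
    }
}
```

This is a workaround rather than a fix: it only helps if the native memory is held by objects that are already unreachable but have not yet been collected.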

小开

"Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up."

It sounds like this is bound to a use case that is executed up to 40 times a day and then not at all for days. I hope you are not tracking only the symptoms. This must be something that you can narrow down by tracing the actions of the application's actors (users, jobs, services).

If this is triggered by XML imports, you should compare the XML data from a 40-crash day with the data imported on a zero-crash day. Maybe it is some sort of logical problem that you cannot find by looking at your code alone.

小开

It seems like memory other than the heap is leaking; you mention that the heap remains stable. A classic candidate is permgen (the permanent generation), which consists of two things: loaded class objects and interned strings. Since you report having connected with VisualVM, you should be able to see the number of loaded classes and check whether it increases continuously. (Important: VisualVM also shows the total number of classes ever loaded. It is okay if that goes up, but the number of currently loaded classes should stabilize after a certain time.)
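The same counters that VisualVM displays are exposed by the class-loading MXBean, so they can also be logged from inside the application. A minimal sketch; the class name `LoadedClassWatcher` and the polling interval are made up for illustration.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

final class LoadedClassWatcher {

    // "loaded" should level off over time; "totalLoaded" is allowed to keep growing.
    static String snapshot() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return "loaded=" + classes.getLoadedClassCount()
                + " totalLoaded=" + classes.getTotalLoadedClassCount()
                + " unloaded=" + classes.getUnloadedClassCount();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            System.out.println(snapshot());
            Thread.sleep(1_000);
        }
    }
}
```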

If it does turn out to be a permgen leak then debugging gets trickier since tooling for permgen analysis is rather lacking in comparison to the heap. Your best bet is to start a small script on the server that repeatedly (every hour?) invokes:

jmap -permstat <pid> > somefile<timestamp>.txt

jmap with that parameter will generate an overview of loaded classes together with an estimate of their size in bytes. This report can help you identify whether certain classes are not being unloaded. (Note: by <pid> I mean the process id, and <timestamp> should be some generated timestamp to distinguish the files.)

Once you have identified certain classes as being loaded and not unloaded, you can try to work out where they might be generated; otherwise you can use jhat to analyze dumps generated with jmap -dump. I'll keep that for a future update should you need the info.

小开

Best answer

Well, I finally found the issue that was causing this, and I'm posting a detailed answer in case someone else has these issues.

I tried jmap while the process was acting up, but this usually caused the jvm to hang further, and I would have to run it with --force. This resulted in heap dumps that seemed to be missing a lot of data, or at least missing the references between them. For analysis, I tried jhat, which presents a lot of data but not much in the way of how to interpret it. Secondly, I tried the eclipse-based memory analysis tool ( http://www.eclipse.org/mat/ ), which showed that the heap was mostly classes related to tomcat.

The issue was that jmap was not reporting the actual state of the application, and was only catching the classes on shutdown, which was mostly tomcat classes.

I tried a few more times, and noticed that there were some very high counts of model objects (actually 2-3x more than were marked public in the database).

Using this I analyzed the slow query logs, and a few unrelated performance problems. I tried extra-lazy loading ( http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html ), as well as replacing a few hibernate operations with direct jdbc queries (mostly where it was dealing with loading and operating on large collections -- the jdbc replacements just worked directly on the join tables), and replaced some other inefficient queries that mysql was logging.

These steps improved pieces of the frontend performance, but still did not address the leak; the app was still unstable and acting unpredictably.

Finally, I found the option: -XX:+HeapDumpOnOutOfMemoryError . This finally produced a very large (~6.5GB) hprof file that accurately showed the state of the application. Ironically, the file was so large that jhat could not analyze it, even on a box with 16GB of RAM. Fortunately, MAT was able to produce some nice looking graphs and showed some better data.
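On a HotSpot JVM this flag can also be inspected, and switched on, at runtime through the diagnostic MXBean, which avoids a restart. The sketch below is an illustration only and assumes a HotSpot-based JVM on Java 7 or later; the class name `HeapDumpFlag` is made up.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class HeapDumpFlag {

    static final String FLAG = "HeapDumpOnOutOfMemoryError";

    private static HotSpotDiagnosticMXBean hotspot() {
        // HotSpot-specific bean; other JVMs may not provide it.
        return ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    }

    static boolean isEnabled() {
        return Boolean.parseBoolean(hotspot().getVMOption(FLAG).getValue());
    }

    // Works because HotSpot marks this flag as "manageable" (writeable at runtime).
    static void enable() {
        hotspot().setVMOption(FLAG, "true");
    }

    public static void main(String[] args) {
        System.out.println(FLAG + " before: " + isEnabled());
        enable();
        System.out.println(FLAG + " after:  " + isEnabled());
    }
}
```

With a dump of several gigabytes it is worth making sure the target directory has enough free space; HotSpot's -XX:HeapDumpPath option controls where the file is written.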

This time what stuck out was that a single quartz thread was taking up 4.5GB of the 6GB of heap, and the majority of that was a hibernate StatefulPersistenceContext ( https://www.hibernate.org/hib_docs/v3/api/org/hibernate/engine/StatefulPersistenceContext.html ). This class is used by hibernate internally as its primary cache (I had disabled the second-level and query caches backed by EHCache).

This class is used to enable most of the features of hibernate, so it can't be directly disabled (you can work around it directly, but spring doesn't support stateless session), and I would be very surprised if this had such a major memory leak in a mature product. So why was it leaking now?

Well, it was a combination of things: the quartz thread pool instantiates with certain things being threadLocal, and spring was injecting a session factory that created a session at the start of each quartz thread's lifecycle. That session was then reused to run the various quartz jobs that used the hibernate session. Hibernate was then caching in the session, which is its expected behavior.

The problem then is that the thread pool was never releasing the session, so hibernate was staying resident and maintaining the cache for the lifecycle of the session. Since this was using spring's hibernate template support, there was no explicit use of the sessions (we are using a dao -> manager -> driver -> quartz-job hierarchy, the dao is injected with hibernate configs through spring, so the operations are done directly on the templates).

So the session was never being closed, and hibernate kept references to the cached objects, so they were never garbage collected. Each time a new job ran it would just keep filling up the cache local to its thread, so there was not even any sharing between the different jobs. Also, since this is a write-intensive job (very little reading), the cache was mostly wasted, and new objects kept getting created.

The solution: create a dao method that explicitly calls session.flush() and session.clear(), and invoke that method at the beginning of each job.
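The mechanism can be reproduced without Hibernate. The sketch below is a toy model, not Hibernate code: `ToySession` is a hypothetical stand-in whose map plays the role of the first-level cache, and a single-thread executor plays the role of a long-lived quartz worker that keeps one session for its whole life. All names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SessionCacheLeakDemo {

    // Toy stand-in for a session: every saved entity stays referenced until clear().
    static final class ToySession {
        private final Map<Long, byte[]> firstLevelCache = new HashMap<>();
        private long nextId;

        void save(byte[] entity) { firstLevelCache.put(nextId++, entity); }
        void flush() { /* a real session would write pending changes to the database here */ }
        void clear() { firstLevelCache.clear(); }
        int cachedEntities() { return firstLevelCache.size(); }
    }

    // One session per worker thread, created on first use and never closed.
    static final ThreadLocal<ToySession> SESSION = ThreadLocal.withInitial(ToySession::new);

    // Runs write-heavy jobs on one worker thread; returns the cache size after the last job.
    static int runJobs(int jobs, int entitiesPerJob, boolean clearBeforeEachJob) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        AtomicInteger cached = new AtomicInteger();
        for (int job = 0; job < jobs; job++) {
            worker.execute(() -> {
                ToySession session = SESSION.get();
                if (clearBeforeEachJob) { // the fix: flush and clear at the start of each job
                    session.flush();
                    session.clear();
                }
                for (int i = 0; i < entitiesPerJob; i++) {
                    session.save(new byte[64]);
                }
                cached.set(session.cachedEntities());
            });
        }
        worker.shutdown();
        try {
            worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cached.get();
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + runJobs(5, 1000, false)); // prints 5000
        System.out.println("with clear:    " + runJobs(5, 1000, true));  // prints 1000
    }
}
```

Without the clear, the cache grows with every job for as long as the thread lives; with it, the cache never holds more than one job's worth of objects.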

The app has been running for a few days now with no monitoring issues, memory errors or restarts.

Thanks for everyone's help on this, it was a pretty tricky bug to track down, as everything was doing exactly what it was supposed to, but in the end a 3 line method managed to fix all the problems.

小开

I had the same problem, with a couple of differences.

My technology is the following:

grails 2.2.4

tomcat7

quartz-plugin 1.0

I use two datasources in my application. That particularity is a determining factor in the cause of the bug.

Another thing to consider is that the quartz-plugin injects a hibernate session into the quartz threads, just as @liam says, and the quartz threads stay alive until I shut down the application.

My problem was a bug in grails ORM combined with the way the plugin handles the session and my two datasources.

The quartz plugin has a listener to init and destroy hibernate sessions:

public class SessionBinderJobListener extends JobListenerSupport {

    public static final String NAME = "sessionBinderListener";

    private PersistenceContextInterceptor persistenceInterceptor;

    public String getName() {
        return NAME;
    }

    public PersistenceContextInterceptor getPersistenceInterceptor() {
        return persistenceInterceptor;
    }

    public void setPersistenceInterceptor(PersistenceContextInterceptor persistenceInterceptor) {
        this.persistenceInterceptor = persistenceInterceptor;
    }

    public void jobToBeExecuted(JobExecutionContext context) {
        if (persistenceInterceptor != null) {
            persistenceInterceptor.init();
        }
    }

    public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
        if (persistenceInterceptor != null) {
            persistenceInterceptor.flush();
            persistenceInterceptor.destroy();
        }
    }
}

In my case, persistenceInterceptor is an instance of AggregatePersistenceContextInterceptor, which holds a List of HibernatePersistenceContextInterceptor, one for each datasource.

Every operation performed on the AggregatePersistenceContextInterceptor is passed on to the HibernatePersistenceContextInterceptor instances, without any modification or special handling.

When init() is called on a HibernatePersistenceContextInterceptor, it increments the static variable below:

private static ThreadLocal<Integer> nestingCount = new ThreadLocal<Integer>();

I don't know the purpose of that static count. I just know that it is incremented twice, once per datasource, because of the AggregatePersistenceContextInterceptor implementation.

Up to here I have just explained the scenario.

The problem comes now...

When my quartz job finishes, the plugin calls the listener to flush and destroy the hibernate sessions, as you can see in the source code of SessionBinderJobListener.

The flush works perfectly, but the destroy does not, because HibernatePersistenceContextInterceptor performs a check before closing the hibernate session: it examines nestingCount to see whether the value is greater than 1. If it is, it does not close the session.

Simplifying what Hibernate does:

if (--nestingCount.getValue() > 0)
    do nothing;
else
    close the session;

That's the root of my memory leak. The quartz threads stay alive with all the objects used in the session, because grails ORM does not close the session, because of a bug triggered by my having two datasources.

To solve that, I customized the listener to call clear before destroy, and to call destroy twice (once for each datasource). This ensures that my session is cleared and destroyed, and that if the destroy fails, the session has at least been cleared.
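The counting problem can be shown with a toy model. This is not the grails source: it assumes, for the sake of illustration, that each per-datasource interceptor owns one session and that all of them share a single per-thread counter, as the static nestingCount field suggests. `ToyInterceptor` and `ToyAggregate` are hypothetical names, and the real classes do more than this.

```java
import java.util.Arrays;
import java.util.List;

final class NestingCountDemo {

    static final class ToyInterceptor {
        // Shared by every interceptor on the same thread, like the static field quoted above.
        private static final ThreadLocal<Integer> NESTING_COUNT = ThreadLocal.withInitial(() -> 0);
        private boolean sessionOpen;

        void init() {
            NESTING_COUNT.set(NESTING_COUNT.get() + 1);
            sessionOpen = true;
        }

        void destroy() {
            int remaining = NESTING_COUNT.get() - 1;
            NESTING_COUNT.set(remaining);
            if (remaining > 0) {
                return; // treated as a nested call: the session stays open
            }
            sessionOpen = false;
        }

        boolean isSessionOpen() { return sessionOpen; }
    }

    // Forwards every call to one interceptor per datasource, unchanged.
    static final class ToyAggregate {
        final List<ToyInterceptor> delegates = Arrays.asList(new ToyInterceptor(), new ToyInterceptor());

        void init() { delegates.forEach(ToyInterceptor::init); }
        void destroy() { delegates.forEach(ToyInterceptor::destroy); }
        long openSessions() { return delegates.stream().filter(ToyInterceptor::isSessionOpen).count(); }
    }

    // Runs one "job" and reports how many sessions are still open afterwards.
    static long openSessionsAfterJob(int destroyCalls) {
        ToyInterceptor.NESTING_COUNT.remove(); // start from a clean counter
        ToyAggregate aggregate = new ToyAggregate();
        aggregate.init();
        for (int i = 0; i < destroyCalls; i++) {
            aggregate.destroy();
        }
        return aggregate.openSessions();
    }

    public static void main(String[] args) {
        System.out.println("destroy once:  " + openSessionsAfterJob(1) + " session(s) left open"); // 1
        System.out.println("destroy twice: " + openSessionsAfterJob(2) + " session(s) left open"); // 0
    }
}
```

In this model the first interceptor sees a count of 1 after decrementing and skips the close, so one session leaks per job unless destroy is called a second time. Calling clear first, as in the fix above, limits the damage even when a session does stay open.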