String Deduplication feature of Java 8

Since String in Java (like other languages) consumes a lot of memory because each character consumes two bytes, Java 8 has introduced a new feature called String Deduplication which takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.

I have read this example so far but since I am not a pro java coder, I am having a hard time grasping the concept.

Here is what it says,

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

My First question,

There is still a lack of resources on this topic since it is recently added in Java 8 update 20, could anyone here share some practical examples on how it help in reducing the memory consumed by String in Java ?

Edit:

The above link says,

As soon as it finds another String which has the same hash code it compares them char by char

My 2nd question,

If hash code of two String are same then the Strings are already the same, then why compare them char by char once it is found that the two String have same hash code ?

33041 次浏览

The strategy they describe is to simply reuse the internal character array of one String in possibly many equal Strings. There's no need for each String to have its own copy if they are equal.

In order to more quickly determine if 2 strings are equal, the hash code is used as a first step, as it is a fast way to determine if Strings may be equal. Hence their statement:

As soon as it finds another String which has the same hash code it compares them char by char

This is to make a certain (but slower) comparison for equality once possible equality has been determined using the hash code.

In the end, equal Strings will share a single underlying char array.

Java has had String.intern() for a long time, to do more or less the same (i.e. save memory by deduplicating equal Strings). What's novel about this is that it happens during garbage collection time and can be externally controlled.

Imagine you have a phone book, which contains people, which have a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".

Because you get the data from a database or a file those strings are not interned so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100 thousand times, one per John string. Each of these arrays takes, say, 20 bytes of memory so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory usage from 2MB to 20 bytes.

You can find a more detailed explanation in the JEP. In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.

Since your first question has already been answered, I'll answer your second question.

The String objects must be compared character by character, because though equal Objects implies equal hashes, the inverse is not necessarily true.

As Holger said in his comment, this represents a hash collision.

The applicable specifications for the hashcode() method are as follows:

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. ...

This means that in order for them to guarantee equality, the comparison of each character is necessary in order for them to confirm the equality of the two objects. They start by comparing hashCodes rather than using equals since they are using a hash table for the references, and this improves performance.

@assylias answer basiclly tells you how it work and is very good answer. I have tested a production application with String Deduplication and have some results. The web app heavily uses Strings so i think the advantage is pretty clear.

To enable String Deduplication you have to add these JVM params (you need at least Java 8u20):

-XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics

The last one is optional but like the name says it shows you the String Deduplication statistics. Here are mine:

[GC concurrent-string-deduplication, 2893.3K->2672.0B(2890.7K), avg 97.3%, 0.0175148 secs]
[Last Exec: 0.0175148 secs, Idle: 3.2029081 secs, Blocked: 0/0.0000000 secs]
[Inspected:           96613]
[Skipped:              0(  0.0%)]
[Hashed:           96598(100.0%)]
[Known:                2(  0.0%)]
[New:              96611(100.0%)   2893.3K]
[Deduplicated:        96536( 99.9%)   2890.7K( 99.9%)]
[Young:                0(  0.0%)      0.0B(  0.0%)]
[Old:              96536(100.0%)   2890.7K(100.0%)]
[Total Exec: 452/7.6109490 secs, Idle: 452/776.3032184 secs, Blocked: 11/0.0258406 secs]
[Inspected:        27108398]
[Skipped:              0(  0.0%)]
[Hashed:        26828486( 99.0%)]
[Known:            19025(  0.1%)]
[New:           27089373( 99.9%)    823.9M]
[Deduplicated:     26853964( 99.1%)    801.6M( 97.3%)]
[Young:             4732(  0.0%)    171.3K(  0.0%)]
[Old:           26849232(100.0%)    801.4M(100.0%)]
[Table]
[Memory Usage: 2834.7K]
[Size: 65536, Min: 1024, Max: 16777216]
[Entries: 98687, Load: 150.6%, Cached: 415, Added: 252375, Removed: 153688]
[Resize Count: 6, Shrink Threshold: 43690(66.7%), Grow Threshold: 131072(200.0%)]
[Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
[Age Threshold: 3]
[Queue]
[Dropped: 0]

These are the results after running the app for 10 minutes. As you can see String Deduplication was executed 452 times and "deduplicated" 801.6 MB Strings. String Deduplication inspected 27 000 000 Strings. When i compared my memory consumption from Java 7 with the standard Parallel GC to Java 8u20 with the G1 GC and enabled String Deduplication the heap dropped approximatley 50%:

Java 7 Parallel GC

Java 7 Parallel GC

Java 8 G1 GC with String Deduplication

Java 8 G1 GC with String Deduplication