Java HashMap performance optimization / alternatives

I want to create a large HashMap, but the put() performance is not good enough. Any ideas?

Suggestions for other data structures are welcome, but I need the lookup feature of a Java Map:

map.get(key)

In my case I want to create a map with 26 million entries. Using the standard Java HashMap, the put rate becomes unbearably slow after 2-3 million insertions.

Also, does anyone know if using a different hash code distribution for the keys could help?

My hashCode method:

byte[] a = new byte[2];
byte[] b = new byte[3];
...

public int hashCode() {
    int hash = 503;
    hash = hash * 5381 + (a[0] + a[1]);
    hash = hash * 5381 + (b[0] + b[1] + b[2]);
    return hash;
}

I am relying on the fact that addition is order-insensitive to ensure that equal objects have the same hashCode. The arrays contain byte values in the range 0-51. A value is only used once within either array. The objects are equal if the a arrays contain the same values (in either order) and the b arrays do as well. So a = {0,1} b = {45,12,33} and a = {1,0} b = {33,45,12} are equal.

Edit, some notes:

  • Some people have criticized using a hash map or another data structure to store 26 million entries. I don't see why that is strange. It looks like a classic data structures and algorithms problem to me. I have 26 million entries and I want to be able to quickly insert them into and look them up from a data structure: give me the data structure and algorithms.

  • Setting the initial capacity of the default Java HashMap to 26 million decreased the performance.

  • Some people have suggested using a database; in some other situations that is definitely the smart option. But I really am asking a data structures and algorithms question: a full database would be overkill and much slower than a good data structure solution (after all, a database is just software, but would add communication and possibly disk overhead).


HashMap has an initial capacity, and HashMap's performance depends very heavily on the hashCode produced by the underlying objects.

Try to tweak both.

My first idea is to make sure you're initializing your HashMap appropriately. From the JavaDocs for HashMap:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

So if you're starting off with a too-small HashMap, then every time it needs to resize, all the hashes are recomputed... which might be what you're feeling when you get to the 2-3 million insertion point.
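For concreteness, here is a minimal sizing sketch (my own illustration, not from this answer) of picking an initial capacity large enough that 26 million entries never trigger a resize; the presizedMap helper name and the 0.75f default load factor are my assumptions:

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: choose a capacity so that expectedEntries never exceeds
// capacity * loadFactor, i.e. no rehash should ever happen while filling the map.
static <K, V> Map<K, V> presizedMap(int expectedEntries) {
    float loadFactor = 0.75f;                                // HashMap's default
    int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
    return new HashMap<>(initialCapacity, loadFactor);
}

// usage: Map<Key, Value> map = presizedMap(26_000_000);     // Key/Value are placeholders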

Allocate a large map in the beginning. If you know it will have 26 million entries and you have the memory for it, do a new HashMap(30000000).

Are you sure you have enough memory for 26 million entries with 26 million keys and values? That sounds like a lot of memory to me. Are you sure the garbage collector is still doing fine at your 2 to 3 million mark? I could imagine that as a bottleneck.

You could try two things:

  • Make your hashCode method return something simpler and more efficient, such as a consecutive int

  • Initialize your map as:

    Map map = new HashMap( 30000000, .95f );
    

Those two actions will tremendously reduce the amount of rehashing the structure has to do, and they are pretty easy to test, I think.

If that doesn't work, consider using different storage such as an RDBMS.

EDIT

It is strange that setting the initial capacity reduces the performance in your case.

See from the javadocs:

If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

I made a microbenchmark (which is not by any means definitive, but at least demonstrates this point):

$cat Huge*java
import java.util.*;
public class Huge {
    public static void main( String [] args ) {
        Map map = new HashMap( 30000000 , 0.95f );
        for( int i = 0 ; i < 26000000 ; i ++ ) {
            map.put( i, i );
        }
    }
}
import java.util.*;
public class Huge2 {
    public static void main( String [] args ) {
        Map map = new HashMap();
        for( int i = 0 ; i < 26000000 ; i ++ ) {
            map.put( i, i );
        }
    }
}
$time java -Xms2g -Xmx2g Huge


real    0m16.207s
user    0m14.761s
sys 0m1.377s
$time java -Xms2g -Xmx2g Huge2


real    0m21.781s
user    0m20.045s
sys 0m1.656s
$

So, using the initial capacity drops the time from 21s to 16s because of the rehashing. That leaves us with your hashCode method as an "area of opportunity" ;)

EDIT

It is not the HashMap

As per your last edit.

I think you should really profile your application and see where the memory/CPU is being consumed.

I have created a class implementing the same hashCode as yours.

That hash code gives millions of collisions, so the number of distinct entries in the HashMap drops dramatically.

I go from 21s and 16s in my previous tests to 10s and 8s. The reason is that the hashCode provokes a high number of collisions and you are not storing the 26M objects you think, but a significantly lower number (about 20k I would say). So:

The problem IS NOT THE HASHMAP; it is somewhere else in your code.

It is about time to get a profiler and find out where. I would think it is on the creation of the item or probably you're writing to disk or receiving data from the network.

Here's my implementation of your class.

Note that I didn't use a 0-51 range as you did, but -126 to 127 for my values, with repeats allowed; that's because I did this test before you updated your question.

The only difference is that your class will have more collisions thus less items stored in the map.

import java.util.*;
public class Item {

    private static byte w = Byte.MIN_VALUE;
    private static byte x = Byte.MIN_VALUE;
    private static byte y = Byte.MIN_VALUE;
    private static byte z = Byte.MIN_VALUE;

    // Just to avoid typing :)
    private static final byte M = Byte.MAX_VALUE;
    private static final byte m = Byte.MIN_VALUE;

    private byte [] a = new byte[2];
    private byte [] b = new byte[3];

    public Item () {
        // make a different value for the bytes
        increment();
        a[0] = z;        a[1] = y;
        b[0] = x;        b[1] = w;   b[2] = z;
    }

    private static void increment() {
        z++;
        if( z == M ) {
            z = m;
            y++;
        }
        if( y == M ) {
            y = m;
            x++;
        }
        if( x == M ) {
            x = m;
            w++;
        }
    }

    public String toString() {
        return "" + this.hashCode();
    }

    public int hashCode() {
        int hash = 503;
        hash = hash * 5381 + (a[0] + a[1]);
        hash = hash * 5381 + (b[0] + b[1] + b[2]);
        return hash;
    }

    // I don't realy care about this right now.
    public boolean equals( Object other ) {
        return this.hashCode() == other.hashCode();
    }

    // print how many collisions do we have in 26M items.
    public static void main( String [] args ) {
        Set set = new HashSet();
        int collisions = 0;
        for ( int i = 0 ; i < 26000000 ; i++ ) {
            if( ! set.add( new Item() ) ) {
                collisions++;
            }
        }
        System.out.println( collisions );
    }
}

Using this class as the key for the previous program

 map.put( new Item() , i );

gives me:

real     0m11.188s
user     0m10.784s
sys 0m0.261s




real     0m9.348s
user     0m9.071s
sys  0m0.161s

To elaborate on Pascal: Do you understand how a HashMap works? You have some number of slots in your hash table. The hash value for each key is found, and then mapped to an entry in the table. If two hash values map to the same entry -- a "hash collision" -- HashMap builds a linked list.

Hash collisions can kill the performance of a hash map. In the extreme case, if all your keys have the same hash code, or if they have different hash codes but they all map to the same slot, then your hash map turns into a linked list.

So if you're seeing performance problems, the first thing I'd check is: Am I getting a random-looking distribution of hash codes? If not, you need a better hash function. Well, "better" in this case may mean "better for my particular set of data". Like, suppose you were working with strings, and you took the length of the string for the hash value. (Not how Java's String.hashCode works, but I'm just making up a simple example.) If your strings have widely varying lengths, from 1 to 10,000, and are fairly evenly distributed across that range, then this could be a very good hash function. But if your strings are all 1 or 2 characters, this would be a very bad hash function.

Edit: I should add: Every time you add a new entry, HashMap checks if this is a duplicate. When there's a hash collision, it has to compare the incoming key against every key that mapped to that slot. So in the worst case where everything hashes to a single slot, the second key is compared to the first key, the third key is compared to #1 and #2, the fourth key is compared to #1, #2, and #3, etc. By the time you get to key #1 million, you've done over a trillion compares.
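If you want to check that distribution empirically, here is a small diagnostic sketch (mine, not part of this answer): it counts how many distinct hash codes a batch of keys produces and how large the worst cluster is, for whatever key objects you pass in.

import java.util.HashMap;
import java.util.Map;

// Count distinct hashCode() values and the size of the biggest cluster.
// A healthy distribution for 26M keys should show millions of distinct codes.
static void hashSpread(Iterable<?> keys) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (Object k : keys) {
        counts.merge(k.hashCode(), 1, Integer::sum);   // tally keys per hash code
    }
    int worst = 0;
    for (int c : counts.values()) {
        worst = Math.max(worst, c);
    }
    System.out.println("distinct hash codes: " + counts.size()
            + ", largest cluster: " + worst);
}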

@Oscar: Umm, I don't see how that's a "not really". It's more like a "let me clarify". But yes, it's true that if you make a new entry with the same key as an existing entry, that this overwrites the first entry. That's what I meant when I talked about looking for duplicates in the last paragraph: Whenever a key hashes to the same slot, HashMap must check if it's a duplicate of an existing key, or if they are just in the same slot by coincidence of the hash function. I don't know that that's the "whole point" of a HashMap: I would say that the "whole point" is that you can retrieve elements by key quickly.

But anyway, that doesn't affect the "whole point" that I was trying to make: When you have two keys -- yes, different keys, not the same key showing up again -- that map to the same slot in the table, HashMap builds a linked list. Then, because it has to check each new key to see if it is in fact a duplicate of an existing key, each attempt to add a new entry that maps to this same slot must chase the linked list examining each existing entry to see if this is a duplicate of a previously-seen key, or if it is a new key.

Update long after the original post

I just got an up-vote on this answer 6 years after posting which led me to re-read the question.

The hash function given in the question is not a good hash for 26 million entries.

It adds together a[0]+a[1] and b[0]+b[1]+b[2]. He says values of each byte range from 0 to 51, so that gives only (51*2+1)*(51*3+1)=15,862 possible hash values. With 26 million entries, this means an average of about 1639 entries per hash value. That is lots and lots of collisions, requiring lots and lots of sequential searches through linked lists.

The OP says that different orders within array a and array b should be considered equal, i.e. [[1,2],[3,4,5]].equals([[2,1],[5,3,4]]), and so to fulfill the contract they must have equal hash codes. Okay. Still, there are a lot more than 15,000 possible values. His second proposed hash function is much better, giving a broader range.

Though as someone else commented, it seems inappropriate for a hash function to change other data. It would make more sense to "normalize" the object when it is created, or to have the hash function work from copies of the arrays. Also, using a loop to calculate constants every time through the function is inefficient. As there are only four values here, I would have either written

return a[0]+a[1]*52+b[0]*52*52+b[1]*52*52*52+b[2]*52*52*52*52;

which would cause the compiler to perform the calculation once at compile time; or have 4 static constants defined in the class.
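For illustration, the "4 static constants" variant could look something like the sketch below (my phrasing of the suggestion, not the answerer's code); like the OP's second hash, it assumes a and b are kept sorted so that equal objects still hash equally.

// Powers of 52 precomputed once instead of recomputed in a loop on every call.
private static final int P1 = 52;
private static final int P2 = 52 * 52;
private static final int P3 = 52 * 52 * 52;
private static final int P4 = 52 * 52 * 52 * 52;

public int hashCode() {
    // assumes a and b are sorted ("normalized") at construction time
    return a[0] + a[1] * P1 + b[0] * P2 + b[1] * P3 + b[2] * P4;
}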

Also, the first draft of the hash function has several calculations that do nothing to add to the range of outputs. Note he first sets hash = 503, then multiplies by 5381 before even considering values from the class. So... in effect he adds 503*5381 to every value. What does this accomplish? Adding a constant to every hash value just burns cpu cycles without accomplishing anything useful. Lesson here: adding complexity to a hash function is not the goal. The goal is to get a broad range of different values, not just to add complexity for the sake of complexity.

You can try to use an in-memory database like HSQLDB.

Have you considered using an embedded database to do this? Look at Berkeley DB. It is open-source, and owned by Oracle now.

It stores everything as key->value pairs; it is NOT an RDBMS, and it aims to be fast.

SQLite lets you use it in memory.

First you should check that you are using Map correctly, good hashCode() method for keys, initial capacity for Map, right Map implementation etc. like many other answers describe.

Then I would suggest using a profiler to see what is actually happening and where the execution time is spent. Is, for example, the hashCode() method executed billions of times?

If that doesn't help, how about using something like EHCache or memcached? Yes, they are products for caching but you could configure them so that they will have enough capacity and will never evict any values from cache storage.

Another option would be some database engine that is lighter weight than full SQL RDBMS. Something like Berkeley DB, maybe.

Note that I personally have no experience with these products' performance, but they could be worth a try.

If the keys have any pattern to them, then you can split the map into smaller maps and have an index map.

Example: Keys: 1, 2, 3, ..., n; 28 maps of 1 million each. Index map: 1-1,000,000 -> Map1; 1,000,001-2,000,000 -> Map2; and so on.

So you'll be doing two lookups, but the key set would be 1,000,000 vs 28,000,000. You can easily do this with string patterns as well.

If the keys are completely random then this will not work.
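A rough sketch of this idea, assuming the keys are (or can be reduced to) numbers in a known range; the RangeShardedMap name and the 1,000,000 shard size are mine, purely for illustration:

import java.util.HashMap;
import java.util.Map;

// One "index map" from shard id to sub-map; each lookup is two hops.
class RangeShardedMap<V> {
    private static final int SHARD_SIZE = 1_000_000;
    private final Map<Integer, Map<Long, V>> shards = new HashMap<>();

    public void put(long key, V value) {
        int shardId = (int) (key / SHARD_SIZE);                  // index-map lookup
        shards.computeIfAbsent(shardId, id -> new HashMap<>()).put(key, value);
    }

    public V get(long key) {
        Map<Long, V> shard = shards.get((int) (key / SHARD_SIZE));
        return shard == null ? null : shard.get(key);
    }
}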

You could try to cache computed hash code to the key object.

Something like this:

private Integer hashCode; // cached value, null until computed the first time

public int hashCode() {
    if (this.hashCode == null) {
        this.hashCode = computeHashCode();
    }
    return this.hashCode;
}

private int computeHashCode() {
    int hash = 503;
    hash = hash * 5381 + (a[0] + a[1]);
    hash = hash * 5381 + (b[0] + b[1] + b[2]);
    return hash;
}

Of course you have to be careful not to change contents of the key after the hashCode has been calculated for the first time.

Edit: It seems that caching hash code values is not worthwhile when you are adding each key only once to a map. In some other situations this could be useful.

I'd suggest a three-pronged approach:

  1. Run Java with more memory: java -Xmx256M for example to run with 256 Megabytes. Use more if needed and you have lots of RAM.

  2. Cache your calculated hash values as suggested by another poster, so each object only calculates its hash value once.

  3. Use a better hashing algorithm. The one you posted would return the same hash where a = {0, 1} as it would where a ={1, 0}, all else being equal.

Utilise what Java gives you for free.

public int hashCode() {
    return 31 * Arrays.hashCode(a) + Arrays.hashCode(b);
}

I'm pretty sure this has much less chance of clashing than your existing hashCode method, although it depends on the exact nature of your data.

If the arrays in your posted hashCode are bytes, then you will likely end up with lots of duplicates.

a[0] + a[1] will always be between 0 and 512. Adding the b's will always result in a number between 0 and 768. Multiply those and you get an upper limit of 400,000 unique combinations, assuming your data is perfectly distributed among every possible value of each byte. If your data is at all regular, you likely have far fewer unique outputs of this method.

One thing I notice in your hashCode() method is that the order of the elements in the arrays a[] and b[] don't matter. Thus (a[]={1,2,3}, b[]={99,100}) will hash to the same value as (a[]={3,1,2}, b[]={100,99}). Actually all keys k1 and k2 where sum(k1.a)==sum(k2.a) and sum(k1.b)=sum(k2.b) will result in collisions. I suggest assigning a weight to each position of the array:

hash = hash * 5381 + (c0*a[0] + c1*a[1]);
hash = hash * 5381 + (c0*b[0] + c1*b[1] + c3*b[2]);

where c0, c1 and c3 are distinct constants (you can use different constants for b if necessary). That should even things out a bit more.

Another poster already pointed out that your hashcode implementation will result in a lot of collisions due to the way that you're adding values together. I'm willing to bet that, if you look at the HashMap object in a debugger, you'll find that you have maybe 200 distinct hash values, with extremely long bucket chains.

If you always have values in the range 0..51, each of those values will take 6 bits to represent. If you always have 5 values, you can create a 30-bit hashcode with left-shifts and additions:

int code = a[0];
code = (code << 6) + a[1];
code = (code << 6) + b[0];
code = (code << 6) + b[1];
code = (code << 6) + b[2];
return code;

The left-shift is fast, but will leave you with hashcodes that aren't evenly distributed (because 6 bits implies a range 0..63). An alternative is to multiply the hash by 51 and add each value. This still won't be perfectly distributed (eg, {2,0} and {1,52} will collide), and will be slower than the shift.

int code = a[0];
code = code * 51 + a[1];
code = code * 51 + b[0];
code = code * 51 + b[1];
code = code * 51 + b[2];
return code;

If the two byte arrays you mention are your entire key, the values are in the range 0-51 and unique, and the order within the a and b arrays is insignificant, my math tells me that there are only about 26 million possible combinations, and that you are likely trying to fill the map with values for all possible keys.

In this case, both filling and retrieving values from your data store would of course be much faster if you use an array instead of a HashMap and index it from 0 to 25989599.
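If you want the dense 0 to 25,989,599 index this hints at, one way to get it (a sketch I'm adding, not part of this answer) is a standard colexicographic ranking of the sorted a pair and of the sorted b triple after relabelling; it assumes both arrays are sorted and, as the question states, that no value appears in both arrays.

// Maps each valid key to a unique index in 0 .. C(52,2)*C(50,3) - 1 = 25,989,599.
static int denseIndex(byte[] a, byte[] b) {
    // colexicographic rank of the sorted pair a among C(52,2) = 1326 pairs
    int rankA = a[0] + a[1] * (a[1] - 1) / 2;

    // relabel b's values into 0..49 by skipping the two slots taken by a
    int[] z = new int[3];
    for (int i = 0; i < 3; i++) {
        int shift = (b[i] > a[0] ? 1 : 0) + (b[i] > a[1] ? 1 : 0);
        z[i] = b[i] - shift;
    }

    // colexicographic rank of the relabelled triple among C(50,3) = 19600 triples
    int rankB = z[0] + z[1] * (z[1] - 1) / 2 + z[2] * (z[2] - 1) * (z[2] - 2) / 6;

    return rankA * 19600 + rankB;
}

Filling and retrieval then become plain reads and writes into an array indexed by denseIndex(a, b).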

As many people pointed out the hashCode() method was to blame. It was only generating around 20,000 codes for 26 million distinct objects. That is an average of 1,300 objects per hash bucket = very very bad. However if I turn the two arrays into a number in base 52 I am guaranteed to get a unique hash code for every object:

public int hashCode() {
    // assume that both a and b are sorted
    return a[0] + powerOf52(a[1], 1) + powerOf52(b[0], 2) + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}

The arrays are sorted to ensure this method fulfils the hashCode() contract that equal objects have the same hash code. Using the old method, the average number of puts per second over blocks of 100,000 puts, from 100,000 to 2,000,000, was:

168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083

Using the new method gives:

337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25

Much much better. The old method tailed off very quickly while the new one keeps up a good throughput.

Getting into the gray area of "on/off topic", but necessary to eliminate confusion regarding Oscar Reyes' suggestion that more hash collisions are a good thing because they reduce the number of elements in the HashMap. I may misunderstand what Oscar is saying, but I don't seem to be the only one: kdgregory, delfuego, Nash0, and I all seem to share the same (mis)understanding.

If I understand what Oscar is saying about the same class with the same hashcode, he's proposing that only one instance of a class with a given hashcode will be inserted into the HashMap. Eg, if I have an instance of SomeClass with a hashcode of 1 and a second instance of SomeClass with a hashcode of 1, only one instance of SomeClass is inserted.

The Java pastebin example at http://pastebin.com/f20af40b9 seems to indicate the above correctly summarizes what Oscar is proposing.

Regardless of any understanding or misunderstanding, what happens is different instances of the same class do not get inserted only once into the HashMap if they have the same hashcode - not until it's determined whether the keys are equal or not. The hashcode contract requires that equal objects have the same hashcode; however, it doesn't require that unequal objects have different hashcodes (although this may be desirable for other reasons)[1].

The pastebin.com/f20af40b9 example (which Oscar refers to at least twice) follows, but modified slightly to use JUnit assertions rather than printlines. This example is used to support the proposal that the same hashcodes cause collisions and when the classes are the same only one entry is created (eg, only one String in this specific case):

@Test
public void shouldOverwriteWhenEqualAndHashcodeSame() {
    String s = new String("ese");
    String ese = new String("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // AND equal
    assertTrue(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(2, map.size());

    assertEquals(2, map.get("ese"));
    assertEquals(3, map.get(some));

    assertTrue(s.equals(ese) && s.equals("ese"));
}

class SomeClass {
    public int hashCode() {
        return 100727;
    }
}

However, the hashcode isn't the complete story. What the pastebin example neglects is the fact that both s and ese are equal: they are both the string "ese". Thus, inserting or getting the contents of the map using s or ese or "ese" as the key are all equivalent because s.equals(ese) && s.equals("ese").

A second test demonstrates it is erroneous to conclude that identical hashcodes on the same class are the reason the key -> value s -> 1 is overwritten by ese -> 2 when map.put(ese, 2) is called in test one. In test two, s and ese still have the same hashcode (as verified by assertEquals(s.hashCode(), ese.hashCode());) AND they are the same class. However, s and ese are MyString instances in this test, not Java String instances - with the only difference relevant for this test being the equals: s.equals(ese) is true in test one above, whereas s.equals(ese) is false in test two:

@Test
public void shouldInsertWhenNotEqualAndHashcodeSame() {
    MyString s = new MyString("ese");
    MyString ese = new MyString("ese");
    // same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    // same class
    assertEquals(s.getClass(), ese.getClass());
    // BUT not equal
    assertFalse(s.equals(ese));

    Map map = new HashMap();
    map.put(s, 1);
    map.put(ese, 2);
    SomeClass some = new SomeClass();
    // still same hash right?
    assertEquals(s.hashCode(), ese.hashCode());
    assertEquals(s.hashCode(), some.hashCode());

    map.put(some, 3);
    // what would we get?
    assertEquals(3, map.size());

    assertEquals(1, map.get(s));
    assertEquals(2, map.get(ese));
    assertEquals(3, map.get(some));
}

/**
 * NOTE: equals is not overridden so the default implementation is used
 * which means objects are only equal if they're the same instance, whereas
 * the actual Java String class compares the value of its contents.
 */
class MyString {
    String i;

    MyString(String i) {
        this.i = i;
    }

    @Override
    public int hashCode() {
        return 100727;
    }
}

Based on a later comment, Oscar seems to reverse what he's said earlier and acknowledges the importance of equals. However, it still seems the notion that equals is what matters, not the "same class", is unclear (emphasis mine):

"Not really. The list is created only if the hash is the same, but the key is different. For instance if a String give hashcode 2345 and and Integer gives the same hashcode 2345, then the integer is inserted into the list because String.equals( Integer ) is false. But if you have the same class ( or at least .equals returns true ) then the same entry is used. For instance new String("one") and `new String("one") used as keys, will use the same entry. Actually this is the WHOLE point of HashMap in first place! See for yourself: pastebin.com/f20af40b9 – Oscar Reyes"

versus earlier comments that explicitly address the importance of identical class and same hashcode, with no mention of equals:

"@delfuego: See for yourself: pastebin.com/f20af40b9 So, in this question the same class is being used ( wait a minute, the same class is being used right? ) Which implies that when the same hash is used the same entry is used and there is not "list" of entries. – Oscar Reyes"

or

"Actually this would increase the performance. The more collisions eq less entries in the hashtable eq. less work to do. Is not the hash ( which looks fine ) nor the hashtable ( which works great ) I'd bet it is on the object creation where the performance is degrading. – Oscar Reyes"

or

"@kdgregory: Yes, but only if the collision happens with different classes, for the same class ( which is the case ) the same entry is used. – Oscar Reyes"

Again, I may misunderstand what Oscar was actually trying to say. However, his original comments have caused enough confusion that it seems prudent to clear everything up with some explicit tests so there are no lingering doubts.


[1] - From Effective Java, Second Edition by Joshua Bloch:

  • Whenever it is invoked on the same object more than once during an execution of an application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

I'm late here, but a couple comments about big maps:

  1. As discussed at length in other posts, with a good hashCode(), 26M entries in a Map is no big deal.
  2. However, a potentially hidden issue here is GC impact of giant maps.

I'm making an assumption that these maps are long lived. i.e. you populate them and they stick around for the duration of the app. I'm also assuming the app itself is long lived -- like a server of some sort.

Each entry in a Java HashMap requires three objects: the key, the value, and the Entry that ties them together. So 26M entries in the map means 26M * 3 == 78M objects. This is fine until you hit a full GC. Then you've got a pause-the-world problem. The GC will look at each of the 78M objects and determine they're all alive. 78M+ objects is just a lot of objects to look at. If your app can tolerate occasional long (perhaps many seconds) pauses, there's no issue. If you're trying to achieve any latency guarantees you could have a major issue (of course if you want latency guarantees, Java's not the platform to choose :)) If the values in your maps churn quickly you can end up with frequent full collects which compounds the problem greatly.

I don't know of a great solution to this issue. Ideas:

  • It's sometimes possible to tune GC and heap sizes to "mostly" prevent full GCs.
  • If your map contents churn a lot you could try Javolution's FastMap -- it can pool Entry objects, which could lower the frequency of full collects
  • You could create your own map impl and do explicit memory management on byte[] (i.e. trade cpu for more predictable latency by serializing millions of objects into a single byte[] -- ugh!); a rough sketch of this idea follows below
  • Don't use Java for this part -- talk to some sort of predictable in-memory DB over a socket
  • Hope that the new G1 collector will help (mainly applies to the high-churn case)

Just some thoughts from someone who has spent a lot of time with giant maps in Java.
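To make the byte[] / explicit-memory-management bullet above a bit more concrete, here is a self-contained sketch (entirely my own illustration, not something from the thread): pack the five 0-51 bytes into one long and keep all entries in two parallel primitive arrays with linear probing, so a full GC only sees a couple of big arrays instead of tens of millions of small objects.

import java.util.Arrays;

// All entries live in two primitive arrays: no Entry objects and no boxed keys.
final class PackedStore {
    private static final long EMPTY = -1L;           // packed keys are always >= 0
    private final long[] keys;
    private final int[] values;
    private final int mask;

    PackedStore(int expectedEntries) {
        // power-of-two capacity kept roughly half full so linear probing stays cheap
        int capacity = Integer.highestOneBit(expectedEntries * 2 - 1) * 2;
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
        mask = capacity - 1;
    }

    // Pack the five 0-51 values (a and b assumed sorted) into one base-52 number.
    static long pack(byte[] a, byte[] b) {
        long k = a[0];
        k = k * 52 + a[1];
        k = k * 52 + b[0];
        k = k * 52 + b[1];
        k = k * 52 + b[2];
        return k;
    }

    void put(long key, int value) {
        int i = index(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;                      // linear probing
        }
        keys[i] = key;
        values[i] = value;
    }

    // Returns -1 when the key is absent (fine if the stored values are ids >= 0).
    int get(long key) {
        int i = index(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return -1;
    }

    private int index(long key) {
        long h = key * 0x9E3779B97F4A7C15L;          // spread the packed key
        return (int) (h ^ (h >>> 32)) & mask;
    }
}

This is obviously only worth it if the values themselves can be reduced to primitives or to offsets into another array; if they are full objects you are back to millions of value objects anyway.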


As pointed out, your hashcode implementation has too many collisions, and fixing it should result in decent performance. Moreover, caching hashCodes and implementing equals efficiently will help.

If you need to optimize even further:

By your description, there are only (52 * 51 / 2) * (52 * 51 * 50 / 6) = 29304600 different keys (of which 26000000, i.e. about 90%, will be present). Therefore, you can design a hash function without any collisions, and use a simple array rather than a hashmap to hold your data, reducing memory consumption and increasing lookup speed:

T[] array = new T[Key.maxHashCode];

void put(Key k, T value) {
    array[k.hashCode()] = value;
}

T get(Key k) {
    return array[k.hashCode()];
}

(Generally, it is impossible to design an efficient, collision-free hash function that clusters well, which is why a HashMap will tolerate collisions, which incurs some overhead)

Assuming a and b are sorted, you might use the following hash function:

public int hashCode() {
    assert a[0] < a[1];
    int ahash = a[1] * a[1] / 2
              + a[0];

    assert b[0] < b[1] && b[1] < b[2];

    int bhash = b[2] * b[2] * b[2] / 6
              + b[1] * b[1] / 2
              + b[0];
    return bhash * 52 * 52 / 2 + ahash;
}

static final int maxHashCode = 52 * 52 / 2 * 52 * 52 * 52 / 6;

I think this is collision-free. Proving this is left as an exercise for the mathematically inclined reader.
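For the less mathematically inclined, a brute-force check is cheap enough to run once; this is my own sketch, not part of the answer: it enumerates every sorted a pair and sorted b triple with values 0-51 and reports whether any two keys share a code.

// Exhaustively checks the hash above for collisions over all 29,304,600 keys.
public class CollisionCheck {
    static int hash(int a0, int a1, int b0, int b1, int b2) {
        int ahash = a1 * a1 / 2 + a0;
        int bhash = b2 * b2 * b2 / 6 + b1 * b1 / 2 + b0;
        return bhash * 52 * 52 / 2 + ahash;
    }

    public static void main(String[] args) {
        boolean[] seen = new boolean[52 * 52 / 2 * 52 * 52 * 52 / 6];   // maxHashCode
        long keys = 0, collisions = 0;
        for (int a0 = 0; a0 < 52; a0++)
            for (int a1 = a0 + 1; a1 < 52; a1++)
                for (int b0 = 0; b0 < 52; b0++)
                    for (int b1 = b0 + 1; b1 < 52; b1++)
                        for (int b2 = b1 + 1; b2 < 52; b2++) {
                            int h = hash(a0, a1, b0, b1, b2);
                            if (seen[h]) collisions++; else seen[h] = true;
                            keys++;
                        }
        System.out.println(keys + " keys, " + collisions + " collisions");
    }
}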

I did a small test a while back with a list vs. a hashmap; the funny thing was that iterating through the list and finding the object took the same amount of time in milliseconds as using the hashmap's get function... just an FYI. Oh yeah, memory is a big issue when working with hashmaps that size.

In Effective Java: Programming Language Guide (Java Series)

Chapter 3 you can find good rules to follow when computing hashCode().

Especially:

If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.

The popular hashing methods used are not really very good for large sets and, as pointed out above, the hash used is particularly bad. Better is to use a hash algorithm with high mixing and coverage such as BuzHash (sample implementation at http://www.java2s.com/Code/Java/Development-Class/AveryefficientjavahashalgorithmbasedontheBuzHashalgoritm.htm)

In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

From my experiment (student project in 2009):

  • I built a Red-Black Tree for 100,000 nodes, with keys from 1 to 100,000. It took 785.68 seconds (13 minutes). And I failed to build an RBTree for 1 million nodes (like your results with HashMap).
  • Using "Prime Tree", my own algorithmic data structure, I could build a tree/map for 10 million nodes within 21.29 seconds (RAM: 1.97 GB). The key-value search cost is O(1).

Note: "Prime Tree" works best on "continuous keys" from 1 to 10 million. To work with keys like HashMap's we need some minor adjustments.


So, what is #PrimeTree? In short, it is a tree data structure like a binary tree, except that the branching factors are prime numbers (instead of "2" as in a binary tree).