HashMap Source Code Reading Notes (Implementation Principles)
These notes combine the Java 8 API documentation with the source code to work through the core implementation of HashMap.
HashMap is an implementation of the Map interface built on a hash table. It is one of the most important collections in Java and stores data as key-value pairs. It is not thread-safe, which means thread safety must be provided externally; it permits null values (and one null key), and every key is unique. Below, the main implementation is examined through the documentation and the HashMap source.
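As a quick illustration of these properties, here is a minimal sketch (not from the original text) showing unique keys, the single null key, and null values:

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapBasics {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("a", 2);          // same key: the value is replaced, keys stay unique
        map.put(null, 0);         // exactly one null key is allowed
        map.put("b", null);       // null values are allowed
        System.out.println(map.get("a"));   // 2
        System.out.println(map.get(null));  // 0
        System.out.println(map.size());     // 3
    }
}
```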
1. Introduction from the API Documentation
Hash table based implementation of the Map interface. This implementation provides all of the optional map operations, and permits null values and the null key. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.
HashMap is an implementation of the Map interface based on a hash table. It provides all of the optional Map operations and permits null values and the null key. (Apart from being unsynchronized and permitting nulls, HashMap is roughly equivalent to Hashtable.) This class makes no guarantee about the order of the map; in particular, it does not guarantee that the order will stay constant over time.
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
This implementation class provides constant-time performance for the basic operations (get and put), provided the hash function disperses the elements properly among the buckets. Iterating over the collection views takes time proportional to the "capacity" of the HashMap instance (the number of buckets, i.e., entries or entry chains) plus its size (the number of key-value mappings). So if iteration performance matters, do not set the initial capacity too high (or the load factor too low).
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
Two parameters affect HashMap's performance: the initial capacity and the load factor. The capacity is the number of buckets in the hash table (i.e., the length of the table array), and the initial capacity is simply the capacity at the time the table is created. The load factor is a measure of how full the table is allowed to get before its capacity is automatically increased. When the number of entries in the table exceeds the product of the load factor and the current capacity, the table is rehashed (that is, the internal data structures are rebuilt) so that it has roughly twice as many buckets.
Note: several reasons are commonly given for choosing 0.75 as the load factor. Roughly, looking at the Poisson and binomial distributions, a value above 0.5 makes collisions unlikely; besides 0.75, values such as 0.625 and 0.875 were reportedly candidates, but above roughly 0.8 the CPU cache hit rate when probing the table reportedly degrades steeply.
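The resize trigger described above can be checked with a little arithmetic. This is a sketch of the documented behavior (threshold = capacity × load factor, capacity roughly doubling on resize), not code from the JDK:

```java
public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;            // DEFAULT_INITIAL_CAPACITY
        float loadFactor = 0.75f;     // DEFAULT_LOAD_FACTOR
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold);  // 12: inserting a 13th entry triggers a resize
        capacity <<= 1;                 // a resize roughly doubles the bucket count
        System.out.println((int) (capacity * loadFactor));  // 24
    }
}
```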
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
As a general rule, the default load factor (0.75) strikes a good balance between time and space costs. Higher values reduce the space overhead but increase the lookup cost (reflected in most HashMap operations, including get and put). So when setting the initial capacity, take the expected number of entries and the load factor into account, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash will ever occur.
//the default load factor in the source
static final float DEFAULT_LOAD_FACTOR = 0.75f;
//the default initial capacity in the source
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
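Following the sizing advice above, a map can be pre-sized so that a known number of entries never triggers a rehash. The helper name `withExpectedSize` below is a hypothetical illustration (not a JDK method), computing expected / 0.75 rounded up:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // Hypothetical helper: pick an initial capacity large enough that
    // `expected` entries never exceed capacity * loadFactor.
    static <K, V> Map<K, V> withExpectedSize(int expected) {
        int initialCapacity = (int) Math.ceil(expected / 0.75);
        return new HashMap<>(initialCapacity);
    }

    public static void main(String[] args) {
        Map<Integer, String> m = withExpectedSize(1000); // requests capacity 1334
        for (int i = 0; i < 1000; i++) m.put(i, "v" + i);
        System.out.println(m.size()); // 1000, stored with no intermediate rehash
    }
}
```

HashMap rounds the requested capacity up to a power of two internally, so the request only needs to clear the entries-divided-by-load-factor bound.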
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. Note that using many keys with the same hashCode() is a sure way to slow down performance of any hash table. To ameliorate impact, when keys are Comparable, this class may use comparison order among keys to help break ties.
If many key-value mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity stores them more efficiently than letting the table grow through repeated automatic rehashing. Note that using many keys with the same hashCode() is a sure way to slow down any hash table. To mitigate that impact, when the keys are Comparable (comparable via compareTo), this class may use the comparison order among keys to break ties.
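The colliding-keys case can be sketched with a hypothetical key type whose hashCode() is a constant; because it also implements Comparable, tree bins can still order the keys. `BadKey` and its constant hash are illustrative assumptions, not from the source:

```java
import java.util.HashMap;
import java.util.Map;

public class CollidingKeys {
    // Hypothetical worst-case key: every instance has the same hash code,
    // so all entries land in one bin; Comparable lets tree bins break ties.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) {
            return Integer.compare(id, other.id);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> m = new HashMap<>();
        for (int i = 0; i < 100; i++) m.put(new BadKey(i), i);
        System.out.println(m.get(new BadKey(7)));  // 7: still correct, just slower
    }
}
```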
Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map:
Map m = Collections.synchronizedMap(new HashMap(...));
The iterators returned by all of this class's "collection view methods" are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove method, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
Note that this implementation class (HashMap) is not thread-safe. If multiple threads access the same HashMap concurrently and at least one of them modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key the instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedMap method, preferably at creation time, to prevent accidental unsynchronized access to the map:
Map m = Collections.synchronizedMap(new HashMap(...));
The iterators returned by all of this class's "collection view methods" are fail-fast: after an iterator is created, any structural modification of the map by any means other than the iterator's own remove method causes the iterator to throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly rather than risking arbitrary, non-deterministic behavior at some undetermined point in the future.
Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.
Note that an iterator's fail-fast behavior cannot be guaranteed; generally speaking, no hard guarantee is possible in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. It is therefore wrong to write a program whose correctness depends on this exception: the fail-fast behavior should be used only to detect bugs.
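The fail-fast behavior is easy to trigger even from a single thread. The following minimal sketch structurally modifies the map while iterating over a collection view, bypassing the iterator's own remove:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        try {
            for (String key : map.keySet()) {
                // Structural modification during iteration (not via the
                // iterator's remove): the next iterator step fails fast.
                map.remove("b");
            }
        } catch (ConcurrentModificationException e) {
            System.out.println("fail-fast: " + e.getClass().getSimpleName());
        }
    }
}
```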
To sum up the documentation: HashMap permits nulls and is unordered, and two factors affect its performance, the initial capacity and the load factor. HashMap is not thread-safe, so it can be made thread-safe by wrapping it with Collections.synchronizedMap, or by using ConcurrentHashMap from java.util.concurrent instead.
2. Implementation Details
Next, let's look at HashMap's underlying implementation through the source code. HashMap has four constructors: a no-arg constructor, one that takes an initial capacity, one that takes both an initial capacity and a load factor, and one that builds a HashMap from an existing Map.
* This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node). Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.
* Because TreeNodes are about twice the size of regular nodes, we
* use them only when bins contain enough nodes to warrant use
* (see TREEIFY_THRESHOLD). And when they become too small (due to
* removal or resizing) they are converted back to plain bins. In
* usages with well-distributed user hashCodes, tree bins are
* rarely used. Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
*
* 0: 0.60653066
* 1: 0.30326533
* 2: 0.07581633
* 3: 0.01263606
* 4: 0.00157952
* 5: 0.00015795
* 6: 0.00001316
* 7: 0.00000094
* 8: 0.00000006
* more: less than 1 in ten million
From this comment we can see that HashMap normally acts as a bucketed hash table, but when a bin grows too large it is transformed into a bin of TreeNodes, structured much like java.util.TreeMap; in normal use this rarely happens. Treeification is only worthwhile once a bin holds enough nodes, and when a tree bin becomes too small (through removal or resizing) it is converted back into a plain bin. With well-distributed hash codes, tree bins are rarely needed. Ideally, under random hash codes, the number of nodes per bin follows the Poisson distribution shown above: at 8 nodes the probability is already below one in ten million, which is why 8 was chosen as the conversion threshold. In other words, a bin with fewer than 8 nodes stays a plain linked list, and once it reaches 8 nodes it becomes eligible for conversion to a tree.
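The probabilities listed in the comment can be reproduced directly from the stated formula exp(-0.5) * pow(0.5, k) / factorial(k). A small verification sketch:

```java
public class PoissonCheck {
    public static void main(String[] args) {
        double lambda = 0.5;  // average bin load at the 0.75 resize threshold
        for (int k = 0; k <= 8; k++) {
            double factorial = 1;
            for (int i = 2; i <= k; i++) factorial *= i;
            double p = Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);  // matches the table above
        }
    }
}
```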
So, from the comment above, HashMap has two kinds of bins: plain bins and tree bins. A tree bin is structured much like TreeMap, which is implemented as a red-black tree. How is a plain bin implemented, then? Let's look at the source for adding an element.
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}
Look at the third line of the method: it declares a Node array (the table) and a Node reference, where Node is a singly linked list node. In other words, a plain bin is implemented as an array plus linked lists: the bucket index is derived from the hash value, and the element is added to that bucket's list, as shown in the figure.
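The index expression `(n - 1) & hash` from the second `if` deserves a note: since the table length n is always a power of two, masking with n - 1 keeps exactly the low bits of the hash, which for a non-negative hash equals hash % n but avoids the slower modulo. A small sketch with arbitrary example values:

```java
public class BucketIndex {
    public static void main(String[] args) {
        int n = 16;      // table length, always a power of two in HashMap
        int hash = 185;  // an arbitrary (already spread) non-negative hash
        // For power-of-two n, (n - 1) & hash == hash % n.
        System.out.println((n - 1) & hash);  // 9
        System.out.println(hash % n);        // 9
    }
}
```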
if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
treeifyBin(tab, hash);
Here we can see that once the number of nodes in a bin reaches the threshold (binCount >= TREEIFY_THRESHOLD - 1, i.e., the list already holds 8 nodes), the hash value and the table are passed to treeifyBin for conversion. Let's open treeifyBin and take a look.
/**
* Replaces all linked nodes in bin at index for given hash unless
* table is too small, in which case resizes instead.
*/
final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        TreeNode<K,V> hd = null, tl = null;
        do {
            TreeNode<K,V> p = replacementTreeNode(e, null);
            if (tl == null)
                hd = p;
            else {
                p.prev = tl;
                tl.next = p;
            }
            tl = p;
        } while ((e = e.next) != null);
        if ((tab[index] = hd) != null)
            hd.treeify(tab);
    }
}
The comment says this method replaces all linked nodes in the bin at the index for the given hash, unless the table is too small (below MIN_TREEIFY_CAPACITY), in which case it resizes instead. How is the replacement done? Look at this line:
TreeNode<K,V> p = replacementTreeNode(e, null);
The replacement is performed by replacementTreeNode, so let's step into it.
TreeNode<K,V> replacementTreeNode(Node<K,V> p, Node<K,V> next) {
    return new TreeNode<>(p.hash, p.key, p.value, next);
}
It returns a TreeNode. So how is TreeNode defined? Let's keep going.
/**
* Entry for Tree bins. Extends LinkedHashMap.Entry (which in turn
* extends Node) so can be used as extension of either regular or
* linked node.
*/
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
    TreeNode(int hash, K key, V val, Node<K,V> next) {
        super(hash, key, val, next);
    }
The full TreeNode class is not reproduced here. Notice the first declared field: its comment says parent is a red-black tree link. Coming back to the plain bin discussed above, this means that when a bin's linked list reaches 8 nodes, the list is converted into a red-black tree.
Now look at the definitions of two constants:
/**
* The bin count threshold for using a tree rather than list for a
* bin. Bins are converted to trees when adding an element to a
* bin with at least this many nodes. The value must be greater
* than 2 and should be at least 8 to mesh with assumptions in
* tree removal about conversion back to plain bins upon
* shrinkage.
*/
static final int TREEIFY_THRESHOLD = 8;
/**
* The bin count threshold for untreeifying a (split) bin during a
* resize operation. Should be less than TREEIFY_THRESHOLD, and at
* most 6 to mesh with shrinkage detection under removal.
*/
static final int UNTREEIFY_THRESHOLD = 6;
From the first constant's comment, the value must be greater than 2 and should be at least 8 (why 8 was explained above); that is, a bin is treeified only once it holds at least 8 nodes. From the second constant's comment, the untreeify threshold should be less than TREEIFY_THRESHOLD and at most 6: when a (split) tree bin holds no more than 6 nodes during a resize, it is converted back to a plain bin.
3. Summary
HashMap permits null. Two factors affect its performance, the initial capacity and the load factor, whose defaults are 16 and 0.75 respectively. HashMap is not thread-safe, so thread safety must be provided externally. A HashMap bin comes in two forms, a plain linked-list bin and a tree bin; overall, HashMap is implemented as an array plus linked lists/red-black trees. A bin's list is converted to a red-black tree when it reaches 8 nodes (provided the table is large enough), and converted back to a linked list when it shrinks to 6 nodes or fewer.