HashMap Notes (Part 1)


HashMap

Class declaration

public class HashMap<K,V> extends AbstractMap<K,V>
    implements Map<K,V>, Cloneable, Serializable {

Structure

An array of buckets, each bucket holding a linked list or a tree

  • Bucket index: hash & (table.length - 1)

Unfiled notes & quick reference

0.75f : stackoverflow.com/questions/1…

(table.length - 1) & hash : bucket index

hash & oldCap : resize split, decides whether a node stays at index j or moves to j + oldCap
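The two bit tricks above can be sketched as follows (the hash value here is illustrative; the identities hold for any hash as long as the table length is a power of two):

```java
public class HashBitTricks {
    public static void main(String[] args) {
        int oldCap = 16;                  // table length, always a power of two
        int hash = 0x7ABC1235;

        // Index lookup: (n - 1) & hash == hash mod n when n is a power of two,
        // because n - 1 is a mask of the low log2(n) bits.
        int index = (oldCap - 1) & hash;
        System.out.println(index == Integer.remainderUnsigned(hash, oldCap)); // true

        // Resize split: (hash & oldCap) == 0 keeps the node at the same
        // index (the "lo" list); otherwise it moves to index + oldCap ("hi" list).
        int newCap = oldCap << 1;
        int newIndex = (newCap - 1) & hash;
        int expected = ((hash & oldCap) == 0) ? index : index + oldCap;
        System.out.println(newIndex == expected); // true
    }
}
```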

Summary

The red-black tree support was added in 1.8
  1. Tree nodes still carry prev/next pointers, so a tree bin can easily be turned back into a linked list
  2. Red-black tree operations essentially split into two steps:
    1. Perform the operation as on a plain BST
    2. After the BST step, rebalance the tree
2^30: the maximum capacity
Resize conditions:
  1. Under normal conditions, a resize simply doubles the capacity
  2. When treeification is triggered but the table is still small (length < 64), a resize happens instead of treeifying
  3. A resize happens once size exceeds threshold
    1. threshold: capacity * loadFactor
    2. loadFactor: the load factor
    3. size: the number of mappings in the map, not the number of non-null slots in the table array
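The threshold arithmetic above, as a quick check (constants taken from the defaults; the class name is just for illustration):

```java
public class ThresholdSketch {
    public static void main(String[] args) {
        int capacity = 16;            // table.length, DEFAULT_INITIAL_CAPACITY
        float loadFactor = 0.75f;     // DEFAULT_LOAD_FACTOR
        int threshold = (int) (capacity * loadFactor);
        // The 13th mapping pushes size past threshold and triggers a resize.
        System.out.println(threshold); // 12

        // A normal resize doubles both the capacity and the threshold.
        System.out.println((capacity << 1) + " / " + (threshold << 1)); // 32 / 24
    }
}
```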
fail-fast:
  1. Methods that bump modCount:
    1. putVal, removeNode, clear, and so on
  2. Methods that check it:
    1. forEach on keySet, values, and entrySet; the map's own forEach and replaceAll
  3. Which yields a conclusion:
    1. HashMap's fail-fast implementation is limited to checks inside the forEach-style traversals. Used as an opaque black box, HashMap performs no fail-fast checks at all; the checking logic lives in the view classes and in the traversal support added in 1.8.
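That conclusion is easy to verify: structural modification outside a traversal goes unchecked, while the same modification inside forEach trips the modCount check. A minimal sketch:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // Structural modification outside any traversal: no check, no exception.
        map.put("c", 3);
        map.remove("a");

        // Structural modification during forEach: modCount changes mid-traversal,
        // and the check throws ConcurrentModificationException.
        boolean threw = false;
        try {
            map.forEach((k, v) -> map.put("d", 4));
        } catch (ConcurrentModificationException e) {
            threw = true;
        }
        System.out.println(threw); // true
    }
}
```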
Thread safety
  1. During a resize, a circular linked list can form (1.7)

    1. In 1.7, head insertion during transfer can produce a circular linked list (blog.csdn.net/sinat_40482…

      void transfer(Entry[] newTable, boolean rehash) {
          int newCapacity = newTable.length;
          for (Entry<K,V> e : table) {
              while (null != e) {
                  Entry<K,V> next = e.next;
                  if (rehash) {
                      e.hash = null == e.key ? 0 : hash(e.key);
                  }
                  int i = indexFor(e.hash, newCapacity);
                  // this is where the problem arises
                  e.next = newTable[i];
                  // 1. thread A is suspended here (before newTable[i] = e runs)
                  newTable[i] = e;
                  e = next;
                  // 2. thread B runs to completion here, then A resumes
              }
          }
      }
      
      • Illustration:

      1. Simplify the hash function to a plain % modulo

      2. Suppose the bucket array length is 2, and table[1] holds 3, 5, 7

      graph LR
      A[1] --> C[?]
      B[3] --> D[5] --> F[7]
      
      1. Now the rehash begins, and the bucket array becomes:

        graph LR
        A[?] 
        B[?] 
        A1[?]
        A2[?]
        
      2. The 4th bucket should end up as:

        graph LR
        B[3] --> D[7]
        

        At this point, thread A has executed up to:

        e.next = newTable[i];
        
        graph LR
        A[3]-->A1[newTable.i]
        

        A is suspended before attaching the new node to the head; thread B gets the time slice and runs the entire loop, after which things look like:

        graph LR
        A[?] --> C[?]
        B[3] --> D[?]
        A1[?]--> A11[?]
        A2[7]-->A21[3]
        

        Then A resumes, producing 3->7, and attaches it onto the table array:

        graph LR
        A2[3]-->A21[7]
        
        graph LR
        A[?] --> C[?]
        B[?] --> D[?]
        A1[?]--> A11[?]
        A2[3 - where thread A resumed]-->A21[7]-->A22[3 - placed by thread B]
        

        In the 4th bucket, the leading 3 and the trailing 3 are the same node object,

        so that node's next pointer now also points to 7:

        graph LR
        
        A2[3]-->A21[7]-->A2
        

        This creates a circular linked list, and traversal of this bucket loops forever.

    2. In 1.8, because of:

    if (loTail != null) {
        loTail.next = null;
        newTab[j] = loHead;
    }
    // if the hi list is non-empty, it goes to index j + oldCap
    if (hiTail != null) {
        hiTail.next = null;
        newTab[j + oldCap] = hiHead;
    }
    

    In resize, each complete list is attached to the table in one step, so the circular-list problem from 1.7 cannot occur.

  2. Both 1.7 and 1.8, however, can lose data:

    1. Again during resize: if another thread inserts concurrently, 1.8's one-step list attachment can overwrite that thread's changes

    2. In putVal:

      tab[i] = newNode(hash, key, value, null); // line 631 (1.8)


      If thread A is suspended right here and thread B, hitting the same hash bin, also reaches this line, then B's node is overwritten when A resumes.


note - explaining TreeNode

The HashMap source carries a long block of implementation notes right before the private fields. Going through it section by section:

Part 1 - the underlying structure can change
This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node).  Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.

Part 1 says:

  • HashMap starts out as a bucketed hash table.
  • When a bin grows too large, it is transformed into a bin of TreeNodes, structured much like a TreeMap.
  • Most methods use normal bins, but relay to TreeNode methods once a bin has been treeified.
    • The distinction is made with an instanceof check.
  • TreeNode bins can be traversed and used like any others, but additionally support faster lookup when overpopulated.
    • Since most bins are never overpopulated in normal use, checking for tree bins can be deferred in the table methods.

In short: bins transform based on size, and the transformation is a win.

Part 2 - what do they become?
 Tree bins (i.e., bins whose elements are all TreeNodes) are
ordered primarily by hashCode, but in the case of ties, if two
elements are of the same "class C implements Comparable<C>",
type then their compareTo method is used for ordering. (We
conservatively check generic types via reflection to validate
this -- see method comparableClassFor).  The added complexity
of tree bins is worthwhile in providing worst-case O(log n)
operations when keys either have distinct hashes or are
orderable, Thus, performance degrades gracefully under
accidental or malicious usages in which hashCode() methods
return values that are poorly distributed, as well as those in
which many keys share a hashCode, so long as they are also
Comparable. (If neither of these apply, we may waste about a
factor of two in time and space compared to taking no
precautions. But the only known cases stem from poor user
programming practices that are already so slow that this makes
little difference.)

Part 2 covers hashCode use and ordering once a bin becomes a tree bin.

  • Ordering is primarily by hashCode. On ties, if the stored keys are of some "class C implements Comparable<C>", their compareTo method is used for ordering.
    • (The generic type is conservatively verified via reflection; see comparableClassFor.)
  • This is worthwhile: when keys have distinct hashes or are orderable, operations get a worst-case O(log n) bound.
  • So the key type's hash function matters a great deal for this container's performance:
    • Even under poorly distributed hashCodes, or many keys sharing one hashCode, performance degrades gracefully as long as the keys are also Comparable.
    • The note then pokes at keys that are neither Comparable nor well hashed:
      • Such cases may waste about a factor of two in time and space compared to taking no precautions.
      • But in the authors' view, code like that stems from programming practices already so slow that this makes little difference.
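The graceful degradation is observable from outside: keys whose hashCode always collides still behave correctly, and being Comparable lets the tree bin keep lookups cheap. A sketch (the key class here is hypothetical, built purely to force collisions):

```java
import java.util.HashMap;
import java.util.Map;

public class CollidingKeyDemo {
    // Every instance hashes to the same bucket, so the bin treeifies once it
    // passes TREEIFY_THRESHOLD; compareTo keeps the tree bin ordered.
    static final class BadHashKey implements Comparable<BadHashKey> {
        final int id;
        BadHashKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadHashKey && ((BadHashKey) o).id == id;
        }
        @Override public int compareTo(BadHashKey other) {
            return Integer.compare(id, other.id);
        }
    }

    public static void main(String[] args) {
        Map<BadHashKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            map.put(new BadHashKey(i), i);
        }
        // All 100 entries live in one (treeified) bin yet remain retrievable.
        System.out.println(map.size());                  // 100
        System.out.println(map.get(new BadHashKey(77))); // 77
    }
}
```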
Part 3 - when do they change?
Because TreeNodes are about twice the size of regular nodes, we
use them only when bins contain enough nodes to warrant use
(see TREEIFY_THRESHOLD). And when they become too small (due to
removal or resizing) they are converted back to plain bins.  In
usages with well-distributed user hashCodes, tree bins are
rarely used.  Ideally, under random hashCodes, the frequency of
nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a
parameter of about 0.5 on average for the default resizing
threshold of 0.75, although with a large variance because of
resizing granularity. Ignoring variance, the expected
occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
factorial(k)). The first values are:
0:    0.60653066
1:    0.30326533
2:    0.07581633
3:    0.01263606
4:    0.00157952
5:    0.00015795
6:    0.00001316
7:    0.00000094
8:    0.00000006
more: less than 1 in ten million

This part covers conversion between TreeNodes and plain bins.

  • A TreeNode is about twice the size of a regular node, so nodes are only converted when a bin contains enough of them.

    • Controlled by TREEIFY_THRESHOLD

    • If the bin shrinks again (through removal or resizing), it converts back to a plain bin

  • With a well-distributed hashCode, tree bins are rarely used

  • Then the ideal case:

    • With the default resizing threshold of 0.75, the number of nodes per bin follows the Poisson distribution below (this is not saying 0.75 itself follows a Poisson distribution, but that at threshold 0.75 the bin sizes follow a Poisson distribution with parameter about 0.5; with another loadFactor, λ changes too. Reference: blog.csdn.net/reliveIT/ar…
    • With the table filled to its resizing threshold (the note leaves this premise implicit; the blog above states it as putting 0.75 * length entries into a table of length HashMap.table[].length), the probability that a bin's list has length k is:

(e^{-0.5} × 0.5^k) / k!

For k greater than 8, the probability is below 1 in ten million.

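The distribution is easy to reproduce; a small sketch computing the same table (class and method names are illustrative):

```java
public class PoissonBinSizes {
    // P(a bin holds k nodes) under random hashes at the default load factor:
    // (e^-0.5 * 0.5^k) / k!, computed incrementally to avoid large factorials.
    static double poisson(int k) {
        double p = Math.exp(-0.5);
        for (int i = 1; i <= k; i++) {
            p *= 0.5 / i;
        }
        return p;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, poisson(k));
        }
        // k = 8 already lands below 1e-7, which is why TREEIFY_THRESHOLD = 8
        // is essentially never reached with well-distributed hash codes.
    }
}
```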
As for why the default is 0.75f, the class-level comment says:

As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs.  Higher values decrease the
space overhead but increase the lookup cost (reflected in most of
the operations of the HashMap class, including
get and put).  The expected number of entries in
the map and its load factor should be taken into account when
setting its initial capacity, so as to minimize the number of
rehash operations.  If the initial capacity is greater than the
maximum number of entries divided by the load factor, no rehash
operations will ever occur.

In short: it is a time/space tradeoff. A higher value improves space utilization but slows lookups; and if the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operation will ever occur.
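Applying that last rule: to insert n entries with no rehash, request an initial capacity of at least n / loadFactor. A sketch (the helper name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // Smallest initial capacity that keeps `expected` entries at or under the
    // threshold, so the constructor's rounded-up table never needs a rehash.
    static int capacityFor(int expected, float loadFactor) {
        return (int) Math.ceil(expected / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expected = 1000;
        int cap = capacityFor(expected, 0.75f);
        System.out.println(cap); // 1334; HashMap rounds this up to 2048

        Map<Integer, Integer> map = new HashMap<>(cap);
        for (int i = 0; i < expected; i++) {
            map.put(i, i);
        }
        System.out.println(map.size()); // 1000, inserted with no resize
    }
}
```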

Part 4 - the root of a tree bin
The root of a tree bin is normally its first node.  However,
sometimes (currently only upon Iterator.remove), the root might
be elsewhere, but can be recovered following parent links
(method TreeNode.root()).

This says the root of a tree bin is normally its first node, though sometimes (currently only after Iterator.remove) it may be elsewhere

  • but it can be recovered by following parent links (TreeNode.root()).
Part 5 - internal parameters
 All applicable internal methods accept a hash code as an
 argument (as normally supplied from a public method), allowing
 them to call each other without recomputing user hashCodes.
 Most internal methods also accept a "tab" argument, that is
 normally the current table, but may be a new or old one when
 resizing or converting.

This says internal methods take the already-computed hash code as an argument, and usually also a "tab" argument (normally the current table, but possibly a new or old one during resizing or conversion), so they can call each other without recomputing hashes.

Part 6 - ordering
When bin lists are treeified, split, or untreeified, we keep
them in the same relative access/traversal order (i.e., field
Node.next) to better preserve locality, and to slightly
simplify handling of splits and traversals that invoke
iterator.remove. When using comparators on insertion, to keep a
total ordering (or as close as is required here) across
rebalancings, we compare classes and identityHashCodes as
tie-breakers.
  • When bin lists are treeified, split, or untreeified, their relative access/traversal order (the Node.next chain) is preserved, both to keep locality and to slightly simplify handling of splits and of traversals that invoke iterator.remove.

  • When comparators are used on insertion, classes and identityHashCodes serve as tie-breakers, keeping a total ordering (or as close as is required) across rebalancings.

Parts 7 and 8 - remaining details
The use and transitions among plain vs tree modes is
complicated by the existence of subclass LinkedHashMap. See
below for hook methods defined to be invoked upon insertion,
removal and access that allow LinkedHashMap internals to
otherwise remain independent of these mechanics. (This also
requires that a map instance be passed to some utility methods
that may create new nodes.)
The concurrent-programming-like SSA-based coding style helps
avoid aliasing errors amid all of the twisty pointer operations.

This covers two things:

  • Because of the subclass LinkedHashMap, the use of and transitions between plain and tree modes are complicated.

    • Hook methods defined below, invoked on insertion, removal, and access, let LinkedHashMap's internals remain otherwise independent of these mechanics.
    • This also requires passing a map instance into some utility methods that may create new nodes.
  • The SSA-based, concurrent-programming-like coding style helps avoid aliasing errors amid all the twisty pointer operations.

    • In other words: with so many pointers being rewired, assigning each intermediate value to its own local variable (single static assignment style) avoids mixing up references mid-swap.

Internal fields

  • Default initial capacity: 16
    • The capacity must be a power of two
//The default initial capacity - MUST be a power of two.
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
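The constructors enforce the power-of-two rule by rounding any requested capacity up; a sketch equivalent to the bit smearing in JDK 8's tableSizeFor (reimplemented here for illustration, without the MAXIMUM_CAPACITY clamp):

```java
public class TableSizeFor {
    // Round cap up to the next power of two. Smearing the highest set bit
    // of cap - 1 rightward turns it into a mask of all ones below that bit;
    // adding 1 then yields the power of two.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(16)); // 16
        System.out.println(tableSizeFor(17)); // 32
        System.out.println(tableSizeFor(1));  // 1
    }
}
```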

  • Maximum capacity: 1 << 30, i.e. 2^30
/**
 * The maximum capacity, used if a higher value is implicitly specified
 * by either of the constructors with arguments.
 * MUST be a power of two <= 1<<30.
 */
static final int MAXIMUM_CAPACITY = 1 << 30;
  • Default load factor: 0.75f
/**
 * The load factor used when none specified in constructor.
 */
static final float DEFAULT_LOAD_FACTOR = 0.75f;
  • Treeify threshold: 8; the note above explains why it is 8
/**
 * The bin count threshold for using a tree rather than list for a
 * bin.  Bins are converted to trees when adding an element to a
 * bin with at least this many nodes. The value must be greater
 * than 2 and should be at least 8 to mesh with assumptions in
 * tree removal about conversion back to plain bins upon
 * shrinkage.
 */
static final int TREEIFY_THRESHOLD = 8;
  • Untreeify threshold: 6; deliberately lower than the treeify threshold
/**
 * The bin count threshold for untreeifying a (split) bin during a
 * resize operation. Should be less than TREEIFY_THRESHOLD, and at
 * most 6 to mesh with shrinkage detection under removal.
 */
static final int UNTREEIFY_THRESHOLD = 6;
  • Minimum treeify capacity: 64; below this, bins resize instead of treeifying
/**
 * The smallest table capacity for which bins may be treeified.
 * (Otherwise the table is resized if too many nodes in a bin.)
 * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
 * between resizing and treeification thresholds.
 */
static final int MIN_TREEIFY_CAPACITY = 64;
  • The backing table array
/**
 * The table, initialized on first use, and resized as
 * necessary. When allocated, length is always a power of two.
 * (We also tolerate length zero in some operations to allow
 * bootstrapping mechanics that are currently not needed.)
 */
transient Node<K,V>[] table;
  • Cached entrySet() view (the keySet() and values() caches live in AbstractMap)
/**
 * Holds cached entrySet(). Note that AbstractMap fields are used
 * for keySet() and values().
 */
transient Set<Map.Entry<K,V>> entrySet;
  • size is cached, a common trick that keeps size() an O(1) operation
  • This size means the number of stored key-value pairs, not table.length
/**
 * The number of key-value mappings contained in this map.
 */
transient int size;
  • modCount: present in all non-synchronized collections, used to implement the fail-fast mechanism
/**
 * The number of times this HashMap has been structurally modified
 * Structural modifications are those that change the number of mappings in
 * the HashMap or otherwise modify its internal structure (e.g.,
 * rehash).  This field is used to make iterators on Collection-views of
 * the HashMap fail-fast.  (See ConcurrentModificationException).
 */
transient int modCount;
  • The size at which the next resize happens (capacity * load factor)
/**
 * The next size value at which to resize (capacity * load factor).
 *
 * @serial
 */
// (The javadoc description is true upon serialization.
// Additionally, if the table array has not been allocated, this
// field holds the initial array capacity, or zero signifying
// DEFAULT_INITIAL_CAPACITY.)
int threshold;
  • The load factor; falls back to the default when not specified
/**
 * The load factor for the hash table.
 *
 * @serial
 */
final float loadFactor;