你懂HashCode?你不懂!

·  阅读 1704

在讨论hashCode之前,我先提出几个问题,后面我会一一解答

  1. 什么是hashCode?
  2. hashCode有什么用?
  3. 为什么会发生哈希冲突,可以避免吗?
  4. 为什么重写equals()方法一定要重写hashCode()方法?
  5. hashCode在Java中是如何生成的?

在讨论上述问题之前,我们先看看java里面是如何对hashCode定义的,下面我摘抄一段java8 Object对hashCode()的一段注释

    /**
     * Returns a hash code value for the object. This method is
     * supported for the benefit of hash tables such as those provided by
     * {@link java.util.HashMap}.
     * <p>
     * The general contract of {@code hashCode} is:
     * <ul>
     * <li>Whenever it is invoked on the same object more than once during
     *     an execution of a Java application, the {@code hashCode} method
     *     must consistently return the same integer, provided no information
     *     used in {@code equals} comparisons on the object is modified.
     *     This integer need not remain consistent from one execution of an
     *     application to another execution of the same application.
     * <li>If two objects are equal according to the {@code equals(Object)}
     *     method, then calling the {@code hashCode} method on each of
     *     the two objects must produce the same integer result.
     * <li>It is <em>not</em> required that if two objects are unequal
     *     according to the {@link java.lang.Object#equals(java.lang.Object)}
     *     method, then calling the {@code hashCode} method on each of the
     *     two objects must produce distinct integer results.  However, the
     *     programmer should be aware that producing distinct integer results
     *     for unequal objects may improve the performance of hash tables.
     * </ul>
     */
复制代码

我把他的意思大致归纳于以下几点

  1. 每个对象都能返回一个hashCode,hashCode是用来加速哈希表的
  2. 同一java应用程序执行期间,对同一个对象调用超过一次时,如果equals()比较的内容没有发生变化,hashCode()的返回应该也不应该发生变化。(特别需要注意的是同一java程序,不同java程序可以允许hashCode发生变化)
  3. 根据equals()方法,两个对象相等,那么hashCode也必须要相等
  4. 根据equals()方法,两个对象不相等,那么hashCode可能相等,但是要尽可能的让他们不相等,这样可以提升哈希表的性能

什么是hashCode?

hashCode是用来加速哈希表的,在同一个java应用程序中,hashCode是对java对象的一个签名,相等的对象hashCode一定相等。

hashCode有什么用?

上面也已经提到了,是用来加速哈希表的,采取了空间换时间的方式。

这一篇博客不是说哈希表的,因此这里不做展开

大致思路是先通过hashcode去哈希表里面找到对应的哈希桶,然后调用equals()方法去判断哈希桶里面挂载的元素有没有相等的。

为什么相等的元素会出现在同一个哈希桶里面?

前文已经说过了,相等的对象hashcode一定相等,hashcode相等的,对象却不一定相等,因此需要再调用equals()方法做判断

为什么会发生哈希冲突,可以避免吗?

先来看一下java里面hashCode的方法签名

public native int hashCode();
复制代码

可以看到hashCode本质上面来说是一个int,int的表示的数据范围是有限的,而对象是无限的,把一个无限的集合压缩进一个有限的集合中,发生哈希冲突就成了一个必然事件,哈希冲突是无法避免的。

当然,在我们的应用中,对象不可能是无穷无尽的,因此选择好的哈希算法是可以降低哈希冲突的概率的

为什么重写equals()方法一定要重写hashCode()方法?

这个问题实际上前面已经多次提到了,根据java规范,equals()相等的对象,hashCode()必须相等

重写equals()必然导致对象是否相等的结果发生变化,因此也就需要修改hashCode()方法

如果只修改equals()不修改hashCode()方法,就会导致使用哈希表的时候不能准确定位到哈希桶,导致哈希表工作异常

hashCode在Java中是如何生成的?

这个问题其实很复杂,我再摘抄一段java8 Object里面的注释

    /**
     * As much as is reasonably practical, the hashCode method defined by
     * class {@code Object} does return distinct integers for distinct
     * objects. (This is typically implemented by converting the internal
     * address of the object into an integer, but this implementation
     * technique is not required by the
     * Java&trade; programming language.)
     */
复制代码

这句话的意思是说,很多时候hashcode是对象内存地址的一个映射,但是java里面不是的。

那么java里面的hashCode是如何生成的呢?

实际上java默认的hashcode是由一套随机算法生成的,只与对象生成的顺序和线程有关。

下面我用代码证明下

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            Object o = new Object();
            System.out.println("hashCode:" + o.hashCode());
            System.out.println("address:" + VM.current().addressOf(o));
        }
    }
复制代码

可以直接用Unsafe类拿到内存地址,不过操作比较繁琐,因此我用了别人封装的工具包

        <dependency>
            <groupId>org.openjdk.jol</groupId>
            <artifactId>jol-core</artifactId>
            <version>0.10</version>
        </dependency>
复制代码

反复执行上面代码,结果如下

第一次第二次第三次
hashCode:531885035 address:31856529176 hashCode:705265961 address:31874395832 hashCode:428746855 address:31874396520 hashCode:317983781 address:31874397208 hashCode:987405879 address:31874397896hashCode:531885035 address:31856529344 hashCode:705265961 address:31874395856 hashCode:428746855 address:31874396544 hashCode:317983781 address:31874397232 hashCode:987405879 address:31874397920hashCode:531885035 address:31856529920 hashCode:705265961 address:31874407200 hashCode:428746855 address:31874407888 hashCode:317983781 address:31874408576 hashCode:987405879 address:31874409264

我们可以清楚的看到,相同的顺序hashCode是一样的,内存地址是不一样的,因此可以肯定java的hashCode和内存是无关的,看起来与创建顺序有关。

接下来我们用代码去查看发生哈希冲突时的情况

    public static void main(String[] args) {
        HashSet<Integer> set = new HashSet<Integer>();
        int count = 0, num = 0;
        while (true) {
            count++;
            Object o = new Object();
            if (set.contains(o.hashCode())) {
                num++;
                System.out.println("count:" + count);
                System.out.println("hashCode:" + o.hashCode());
                if (num == 5) break;
            }
            set.add(o.hashCode());
        }
    }
复制代码

重复运行上面代码

第一次第二次第三次
count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992count:105730 hashCode:2134400190 count:111177 hashCode:651156501 count:121838 hashCode:1867750575 count:145361 hashCode:2038112324 count:146294 hashCode:1164664992

我们可以看到,发生哈希冲突的位置是一样的,发生哈希冲突时的hashCode也是一样的,因此我们可以断定hashcode的生成与对象生成的顺序有关。

接下来我们用不同的线程去重复上述操作

    public static void main(String[] args) throws InterruptedException {
        new Thread(() ->{
            HashSet<Integer> set = new HashSet<Integer>();
            int count = 0, num = 0;
            while (true) {
                count++;
                Object o = new Object();
                if (set.contains(o.hashCode())) {
                    num++;
                    System.out.println("count:" + count);
                    System.out.println("hashCode:" + o.hashCode());
                    if (num == 5) break;
                }
                set.add(o.hashCode());
            }
        }).start();
        Thread.sleep(1000L);
    }
复制代码

结果如下

第一次第二次第三次
count:53869 hashCode:820543451 count:87947 hashCode:951498229 count:99859 hashCode:175788428 count:100940 hashCode:1979813773 count:123438 hashCode:916565907count:51579 hashCode:1573800482 count:124111 hashCode:1834341042 count:139761 hashCode:2021276643 count:142276 hashCode:1807321828 count:143515 hashCode:1522017140count:69186 hashCode:742597489 count:102967 hashCode:2084692455 count:144290 hashCode:424089733 count:176112 hashCode:53258565 count:178639 hashCode:1046986547

看起来没有任何规律了,这就需要去扒java的源码才能解答这个现象了

Object源码地址

在Object中我们可以看到hashCode实际上是调用了IHashCode

static JNINativeMethod methods[] = {
    {"hashCode",    "()I",                    (void *)&JVM_IHashCode},
    {"wait",        "(J)V",                   (void *)&JVM_MonitorWait},
    {"notify",      "()V",                    (void *)&JVM_MonitorNotify},
    {"notifyAll",   "()V",                    (void *)&JVM_MonitorNotifyAll},
    {"clone",       "()Ljava/lang/Object;",   (void *)&JVM_Clone},
};
复制代码

然后在jvm.cpp找到了IHashCode实际调用的是FastHashCode

JVM_ENTRY(jint, JVM_IHashCode(JNIEnv* env, jobject handle))
  JVMWrapper("JVM_IHashCode");
  // as implemented in the classic virtual machine; return 0 if object is NULL
  return handle == NULL ? 0 : ObjectSynchronizer::FastHashCode (THREAD, JNIHandles::resolve_non_null(handle)) ;
JVM_END
复制代码

接着扒FastHashCode,FastHashCode在synchronizer.cpp

intptr_t ObjectSynchronizer::FastHashCode (Thread * Self, oop obj) {
  if (UseBiasedLocking) {
    // NOTE: many places throughout the JVM do not expect a safepoint
    // to be taken here, in particular most operations on perm gen
    // objects. However, we only ever bias Java instances and all of
    // the call sites of identity_hash that might revoke biases have
    // been checked to make sure they can handle a safepoint. The
    // added check of the bias pattern is to avoid useless calls to
    // thread-local storage.
    if (obj->mark()->has_bias_pattern()) {
      // Box and unbox the raw reference just in case we cause a STW safepoint.
      Handle hobj (Self, obj) ;
      // Relaxing assertion for bug 6320749.
      assert (Universe::verify_in_progress() ||
              !SafepointSynchronize::is_at_safepoint(),
             "biases should not be seen by VM thread here");
      BiasedLocking::revoke_and_rebias(hobj, false, JavaThread::current());
      obj = hobj() ;
      assert(!obj->mark()->has_bias_pattern(), "biases should be revoked by now");
    }
  }

  // hashCode() is a heap mutator ...
  // Relaxing assertion for bug 6320749.
  assert (Universe::verify_in_progress() ||
          !SafepointSynchronize::is_at_safepoint(), "invariant") ;
  assert (Universe::verify_in_progress() ||
          Self->is_Java_thread() , "invariant") ;
  assert (Universe::verify_in_progress() ||
         ((JavaThread *)Self)->thread_state() != _thread_blocked, "invariant") ;

  ObjectMonitor* monitor = NULL;
  markOop temp, test;
  intptr_t hash;
  markOop mark = ReadStableMark (obj);

  // object should remain ineligible for biased locking
  assert (!mark->has_bias_pattern(), "invariant") ;
	//mark是对象头
  if (mark->is_neutral()) {
    hash = mark->hash();              // 取出hash值(实际上可以理解为缓存,有就直接返回,没有就生成一个新的)
    if (hash) {                       // if it has hash, just return it
      return hash;
    }
    hash = get_next_hash(Self, obj);  // 这是生成hashcode的核心方法
    temp = mark->copy_set_hash(hash); // merge the hash code into header
    // use (machine word version) atomic operation to install the hash
    test = (markOop) Atomic::cmpxchg_ptr(temp, obj->mark_addr(), mark);
    if (test == mark) {
      return hash;
    }
    // If atomic operation failed, we must inflate the header
    // into heavy weight monitor. We could add more code here
    // for fast path, but it does not worth the complexity.
  } else if (mark->has_monitor()) {
    monitor = mark->monitor();
    temp = monitor->header();
    assert (temp->is_neutral(), "invariant") ;
    hash = temp->hash();
    if (hash) {
      return hash;
    }
    // Skip to the following code to reduce code size
  } else if (Self->is_lock_owned((address)mark->locker())) {
    temp = mark->displaced_mark_helper(); // this is a lightweight monitor owned
    assert (temp->is_neutral(), "invariant") ;
    hash = temp->hash();              // by current thread, check if the displaced
    if (hash) {                       // header contains hash code
      return hash;
    }
    // WARNING:
    //   The displaced header is strictly immutable.
    // It can NOT be changed in ANY cases. So we have
    // to inflate the header into heavyweight monitor
    // even the current thread owns the lock. The reason
    // is the BasicLock (stack slot) will be asynchronously
    // read by other threads during the inflate() function.
    // Any change to stack may not propagate to other threads
    // correctly.
  }

  // Inflate the monitor to set hash code
  monitor = ObjectSynchronizer::inflate(Self, obj);
  // Load displaced header and check it has hash code
  mark = monitor->header();
  assert (mark->is_neutral(), "invariant") ;
  hash = mark->hash();
  if (hash == 0) {
    hash = get_next_hash(Self, obj);
    temp = mark->copy_set_hash(hash); // merge hash code into header
    assert (temp->is_neutral(), "invariant") ;
    test = (markOop) Atomic::cmpxchg_ptr(temp, monitor, mark);
    if (test != mark) {
      // The only update to the header in the monitor (outside GC)
      // is install the hash code. If someone add new usage of
      // displaced header, please update this code
      hash = test->hash();
      assert (test->is_neutral(), "invariant") ;
      assert (hash != 0, "Trivial unexpected object/monitor header usage.");
    }
  }
  // We finally get the hash
  return hash;
}
复制代码

get_next_hash

static inline intptr_t get_next_hash(Thread * Self, oop obj) {
  intptr_t value = 0 ;
  //随机获取一个
  if (hashCode == 0) {
     // This form uses an unguarded global Park-Miller RNG,
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.
     value = os::random() ;
  } else
  //根据内存地址计算一个
  if (hashCode == 1) {
     // This variation has the property of being stable (idempotent)
     // between STW operations.  This can be useful in some of the 1-0
     // synchronization schemes.
     intptr_t addrBits = cast_from_oop<intptr_t>(obj) >> 3 ;
     value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
  } else
  //返回1
  if (hashCode == 2) {
     value = 1 ;            // for sensitivity testing
  } else
  //不太清楚什么意思 求大佬解释
  if (hashCode == 3) {
     value = ++GVars.hcSequence ;
  } else
  //直接返回内存地址
  if (hashCode == 4) {
     value = cast_from_oop<intptr_t>(obj) ;
  }
  //这是一种生成随机数的一种算法
   else {
     // Marsaglia's xor-shift scheme with thread-specific state
     // This is probably the best overall implementation -- we'll
     // likely make this the default in future releases.
     unsigned t = Self->_hashStateX ;
     t ^= (t << 11) ;
     Self->_hashStateX = Self->_hashStateY ;
     Self->_hashStateY = Self->_hashStateZ ;
     Self->_hashStateZ = Self->_hashStateW ;
     unsigned v = Self->_hashStateW ;
     v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
     Self->_hashStateW = v ;
     value = v ;
  }

  value &= markOopDesc::hash_mask;
  if (value == 0) value = 0xBAD ;
  assert (value != markOopDesc::no_hash, "invariant") ;
  TEVENT (hashCode: GENERATE) ;
  return value;
}
复制代码

到了这里我们基本上就知道了hashcode是怎么生成的了,实际上在jdk1.8在是用的第五种生成方式,我们可以在Linux系统下输入:java -XX:+PrintFlagsFinal -version|grep hashCode命令查看

ok,接下来我们来分析一下第5种方式,看到这个代码我的第一反应是懵逼的,那个什么_hashStateX是个什么鬼? 我们来看一下thead.cpp里面是怎样定义的:

// thread-specific hashCode stream generator state - Marsaglia shift-xor form
  _hashStateX = os::random() ;
  _hashStateY = 842502087 ;
  _hashStateZ = 0x8767 ;    // (int)(3579807591LL & 0xffff) ;
  _hashStateW = 273326509 ;
复制代码

在thread里面定义了一个随机数,三个常数,通过这四个数根据上述的算法来生成hashcode。

具体原理请参考论文:Xorshift RNGs

因为在上述算法在,需要得到线程里面的一个随机数作为一个初始值,上述算法在前后具有因果关系,后面的结果是根据前面的结果推算而来的,因此对于相同的线程来说,在某种意义上,对象的hashcode的生成是和顺序有关的。

为什么主函数里面的hashcode的生成一直是有序的呢?因为主线程里面的是固定值。

可见hashCode在1.8中和内存地址是无关的,与所在线程(或者说生成的一个随机数)以及生成顺序有关

分类:
后端
标签:
分类:
后端
标签:
收藏成功!
已添加到「」, 点击更改