MurmurHash2算法简介

321 阅读3分钟

介绍 MurmurHash2 算法的特点,优势,缺点,应用场景,基本实现,MurmurHash2 是一种哈希算法,高效,非加密。

特点

  1. 非加密
    • 不适用于密码学。
    • 设计目标不为 高碰撞性与安全性。
  2. 高性能:
    • 使用位运算,混合函数实现,计算速度快。
  3. 分布均匀:
    • 体现输入中各个位的影响,hash 值分布均匀。

算法思想

  1. 分块处理:
    • 把输入按照每 32/64 位分为一组处理。
  2. 混合操作:
    • 对每个分块进行混合计算,包括与常数相乘,移位异或,与当前结果混合。
  3. 最终混合:
    • 将剩下的不足 4/8 字节的数组混合进结果,然后在对结果做混合,确保所有输入位都对结果有影响。

伪代码:

以 32 位版本的 MurmurHash2 为例,伪代码如下:

uint m(const void *key,int len){
  uint seed = gen();
  const uint m;
  const int r;
  uint h = len ^ seed;
  
  const uchar *data = (const uchar *)key;
  while(len>=4){
    uint k = *(uint *)data;
    
    // mix m r h
    /*
    ....
    */
    
    data += 4;
    len -= 4;
  }
  // rest bytes
  /*
  swich(len)
  */
  
  // final mix m h
  /*
  ...
  */
  return h;
}

redis 中的实现:

/* MurmurHash2, by Austin Appleby
 * Note - This code makes a few assumptions about how your machine behaves -
 * 1. We can read a 4-byte value from any address without crashing
 * 2. sizeof(int) == 4
 *
 * And it has a few limitations -
 *
 * 1. It will not work incrementally. liyazzi: 也就是key需要一次性完整知道,不适合动态数据流,因为每一次的混合计算都依赖于前面的数据,然后最后的混合依赖于整个数据
 * 2. It will not produce the same results on little-endian and big-endian 影响跨平台性
 *    machines.
 */
unsigned int dictGenHashFunction(const void *key, int len) {
    /* 'm' and 'r' are mixing constants generated offline.
     They're not really 'magic', they just happen to work well.  */
    uint32_t seed = dict_hash_function_seed;
    const uint32_t m = 0x5bd1e995;
    const int r = 24;

    /* Initialize the hash to a 'random' value */
    uint32_t h = seed ^ len;

    /* Mix 4 bytes at a time into the hash */
    const unsigned char *data = (const unsigned char *)key;

    while(len >= 4) {
        uint32_t k = *(uint32_t*)data;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    /* Handle the last few bytes of the input array  */
    switch(len) {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0]; h *= m;
    };

    /* Do a few final mixes of the hash to ensure the last few
     * bytes are well-incorporated. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return (unsigned int)h;
}

哈希函数中大多混淆操作选用 ^ 的原因:

  • 异或操作能混合数据的各个位的特征,异或操作拥有自反性
  • 消除偏置:避免简单的加法与乘法带来的累计偏置
  • 性能高效:CPU级别运算,位运算通常速度较快

Redis中的使用

在 redis 中,MurmurHash2 被用作默认哈希函数实现,主要用于:

  1. 计算哈希表的 key 索引
  2. 计算一致性哈希中虚拟节点的位置

选择 MurmurHash2 的原因:

  • 快速:适合 redis 高性能的需求
  • 均匀:减少哈希冲突
  • 简单:实现简单

缺点

MurmurHash2 在某些情况下可能会出现哈希碰撞率略高的问题。

Hash functions can be vulnerable to collision attacks, where a user can choose input data in such a way so as to intentionally cause hash collisions. Jean-Philippe Aumasson and Daniel J. Bernstein were able to show that even implementations of MurmurHash using a randomized seed are vulnerable to so-called HashDoS attacks.[51] With the use of differential cryptanalysis, they were able to generate inputs that would lead to a hash collision. The authors of the attack recommend using their own SipHash instead.