HBase原理解析—Flush与Compaction

64 阅读7分钟

随着数据的不断写入,MemStore中存储的数据会越来越多,系统为了将使用的内存保持在一个合理的水平,会将MemStore中的数据写入文件形成HFile。flush阶段是Hbase的非常核心阶段,需要重点关注几个问题:

  • MemStore Flush的触发时机。即在哪些情况下Hbase会触发flush操作
  • MemStore Flush的整体流程。
  • HFile的构建流程。HFile构建是MemStore Flush整体流程中最重要的一部分,这部分会涉及HFile文件格式的构建、布隆过滤器的构建、HFile索引的构建等
  • Flush HFile文件过多之后影响读延时的解决方法。

MemStore Flush

触发时机

HBase会在以下几种情况下触发flush操作:

  • MemStore级别限制:当Region中任意一个MemStore的大小到达了上限(hbase.hregion.memstore.flush.size,默认为128M),会触发MemStore刷新。
  • Region级别限制:当Region中所有MemStore的大小总和达到了上限(hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size),会触发MemStore刷新。
  • RegionServer级别限制:当RegionServer中MemStore的大小总和超过低水平阀值hbase.regionserver.global.memstore.sze.lower.limit*hbase.regionserver.global.memstore.size,RegionServer开始强制执行flush,先flush MemStore最大的Region,再flush次大的,依次执行。如果此时写入吞吐量依然很高,导致总MemStore大小超过高水位阀值hbase.regionserver.global.memstore.size,RegionServer会阻塞更新并强制执行flush,直到总MemStore大小下降到低水平阀值。
  • 当一个RegionServer中HLog数量达到上限时,系统会选择最早的HLog对应的一个或者多个Region进行Flush。
  • Hbase定期刷新MemStore:默认周期是1个小时,确保MemStore不会长时间没有持久化。为避免所有MemStore在同一时间都进行flush而导致的问题,定期的flush操作有一定时间的随机延时。
  • 手动执行flush:用户可以通过shell命令flush 'tablebname' 或者flush 'regionname'分别对一个表活着一个Region进行flush。

当满足以上触发条件时,HRegion会调用requestFlush()触发flush行为,flush发生在每一处Region可能发生变化的地方,包括Region有新数据写入,客户端调用put/batchMutate等接口。触发入口的核心代码在如下HRegion#batchMutate中:

 /**
   * Perform a batch of mutations.
   * <p/>
   * Operations in a batch are stored with highest durability specified of for all operations in a
   * batch, except for {@link Durability#SKIP_WAL}.
   * <p/>
   * This function is called from {@link #batchReplay(WALSplitUtil.MutationReplay[], long)} with
   * {@link ReplayBatchOperation} instance and {@link #batchMutate(Mutation[])} with
   * {@link MutationBatchOperation} instance as an argument. As the processing of replay batch and
   * mutation batch is very similar, lot of code is shared by providing generic methods in base
   * class {@link BatchOperation}. The logic for this method and
   * {@link #doMiniBatchMutate(BatchOperation)} is implemented using methods in base class which are
   * overridden by derived classes to implement special behavior.
   * @param batchOp contains the list of mutations
   * @return an array of OperationStatus which internally contains the OperationStatusCode and the
   *         exceptionMessage if any.
   * @throws IOException if an IO problem is encountered
   */
  private OperationStatus[] batchMutate(BatchOperation<?> batchOp) throws IOException {
    boolean initialized = false;
    batchOp.startRegionOperation();
    try {
      while (!batchOp.isDone()) {
        if (!batchOp.isInReplay()) {
          checkReadOnly();
        }
        // check 资源
        checkResources();

        if (!initialized) {
          this.writeRequestsCount.add(batchOp.size());
          // validate and prepare batch for write, for MutationBatchOperation it also calls CP
          // prePut()/preDelete()/preIncrement()/preAppend() hooks
          batchOp.checkAndPrepare();
          initialized = true;
        }
        // 写入数据
        doMiniBatchMutate(batchOp);
        // 写完之后再判断是否满足触发条件,如果满足则往阻塞队列塞一个Flush请求
        requestFlushIfNeeded();
      }
    } finally {
      if (rsServices != null && rsServices.getMetrics() != null) {
        rsServices.getMetrics().updateWriteQueryMeter(this.htableDescriptor.getTableName(),
          batchOp.size());
      }
      batchOp.closeRegionOperation();
    }
    return batchOp.retCodeDetails;
  }

处理流程

为了减少flush过程对读写的影响,HBase采用了类似两阶段提交的方式,将整个flush过程分为三个阶段:

  1. prepare阶段:遍历当前Region中的所有MemStore,将MemStore中当前数据集CellSkipListSet(内部采用ConcurrentSkipListMap)做一个快照snapshot,然后再新建一个CellSkipListSet接收新的数据写入。prepare阶段需要添加updateLock对写请求阻塞,结束之后会释放该锁。因为此阶段没有任何费时操作,因此持锁时间很短。
  2. flush阶段:遍历所有MemStore,将prepare阶段生成的snapshot持久化为临时文件,临时文件会统一放到目录.tmp下。这个过程因为涉及磁盘IO操作,因此相对比较耗时。
  3. commit阶段:遍历所有的MemStore,将flush阶段生成的临时文件移到指定的ColumnFamily目录下,针对HFile生成对应的Storefile和Reader,把storefile添加到Store的storefile列表中,最后再清空prepare阶段生成的snapshot。

HBase中flush请求的处理流程简化如下图中所示:

HRegion中requestFlush()的源码如下所示:

 private void requestFlush0(FlushLifeCycleTracker tracker) {
    boolean shouldFlush = false;
    // 检查下状态,避免重复请求
    synchronized (writestate) {
      if (!this.writestate.isFlushRequested()) {
        shouldFlush = true;
        writestate.flushRequested = true;
      }
    }
    if (shouldFlush) {
      // Make request outside of synchronize block; HBASE-818.
      this.rsServices.getFlushRequester().requestFlush(this, tracker);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Flush requested on " + this.getRegionInfo().getEncodedName());
      }
    } else {
      tracker.notExecuted("Flush already requested on " + this);
    }
  }

主要是这句代码:

this.rsServices.getFlushRequester().requestFlush(this, tracker);

其中rsServices向RegionServer发起一个RPC请求,getFlushRequest()是RegionServer中的成员变量coreFusher定义的方法,该变量是MemStoreFlusher类型,用于管理该RegionServer上的各种flush请求,它里面定义的几个关键变量如下:

// These two data members go together. Any entry in the one must have
  // a corresponding entry in the other.
  private final BlockingQueue<FlushQueueEntry> flushQueue = new DelayQueue<>();
  protected final Map<Region, FlushRegionEntry> regionsInQueue = new HashMap<>();
  // 原子bool
  private AtomicBoolean wakeupPending = new AtomicBoolean();

  private final long threadWakeFrequency;
  // HRegionServer实例
  private final HRegionServer server;
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Object blockSignal = new Object();
  // HRegion 一个阻塞更新的等待时间
  private long blockingWaitTime;
  private final LongAdder updatesBlockedMsHighWater = new LongAdder();

  private final FlushHandler[] flushHandlers;
  private List<FlushRequestListener> flushRequestListeners = new ArrayList<>(1);

下面随着讲解HBase的flush流程来讲解上述几个变量的作用,我们先看下requestFlush(),它将待flush的region放入待处理队列,这里包含了两个队列,flushQueue是一个无界阻塞队列,属于flush的工作队列,而regionsInQueue则用于保存位于flush队列的region的信息。

@Override
  public boolean requestFlush(HRegion r, List<byte[]> families, FlushLifeCycleTracker tracker) {
    synchronized (regionsInQueue) {
      FlushRegionEntry existFqe = regionsInQueue.get(r);
      if (existFqe != null) {
        // if a delayed one exists and not reach the time to execute, just remove it
        if (existFqe.isDelay() && existFqe.whenToExpire > EnvironmentEdgeManager.currentTime()) {
          LOG.info("Remove the existing delayed flush entry for {}, "
            + "because we need to flush it immediately", r);
          this.regionsInQueue.remove(r);
          this.flushQueue.remove(existFqe);
          r.decrementFlushesQueuedCount();
        } else {
          tracker.notExecuted("Flush already requested on " + r);
          return false;
        }
      }

      // This entry has no delay so it will be added at the top of the flush
      // queue. It'll come out near immediately.
      FlushRegionEntry fqe = new FlushRegionEntry(r, families, tracker);
      this.regionsInQueue.put(r, fqe);
      this.flushQueue.add(fqe);
      r.incrementFlushesQueuedCount();
      return true;
    }
  }

至此flush任务已经放入了工作队列,等到flush线程的处理。MemStoreFlusher的flush工作线程定义在了flushHander中,初始化代码如下:

public MemStoreFlusher(final Configuration conf, final HRegionServer server) {
    super();
    this.conf = conf;
    this.server = server;
    this.threadWakeFrequency = conf.getLong(HConstants.THREAD_WAKE_FREQUENCY, 10 * 1000);
    this.blockingWaitTime = conf.getInt("hbase.hstore.blockingWaitTime", 90000);
    // 用于flush的线程数
    int handlerCount = conf.getInt("hbase.hstore.flusher.count", 2);
    if (server != null) {
      if (handlerCount < 1) {
        LOG.warn(
          "hbase.hstore.flusher.count was configed to {} which is less than 1, " + "corrected to 1",
          handlerCount);
        handlerCount = 1;
      }
      LOG.info("globalMemStoreLimit="
        + TraditionalBinaryPrefix
          .long2String(this.server.getRegionServerAccounting().getGlobalMemStoreLimit(), "", 1)
        + ", globalMemStoreLimitLowMark="
        + TraditionalBinaryPrefix.long2String(
          this.server.getRegionServerAccounting().getGlobalMemStoreLimitLowMark(), "", 1)
        + ", Offheap=" + (this.server.getRegionServerAccounting().isOffheap()));
    }
    this.flushHandlers = new FlushHandler[handlerCount];
  }

后面我们看下这些flushHandlers都做了什么,看看它的run方法,里面的主要逻辑都写在下面:

 @Override
    public void run() {
      while (!server.isStopped()) {
        FlushQueueEntry fqe = null;
        try {
          wakeupPending.set(false); 
          // 从flushQueue队列中获取一个Region  
          fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
          if (fqe == null || fqe == WAKEUPFLUSH_INSTANCE) {
            FlushType type = isAboveLowWaterMark();
            if (type != FlushType.NORMAL) {
              LOG.debug("Flush thread woke up because memory above low water="
                + TraditionalBinaryPrefix.long2String(
                  server.getRegionServerAccounting().getGlobalMemStoreLimitLowMark(), "", 1));
              // For offheap memstore, even if the lower water mark was breached due to heap
              // overhead
              // we still select the regions based on the region's memstore data size.
              // TODO : If we want to decide based on heap over head it can be done without tracking
              // it per region.
              if (!flushOneForGlobalPressure(type)) {
                // Wasn't able to flush any region, but we're above low water mark
                // This is unlikely to happen, but might happen when closing the
                // entire server - another thread is flushing regions. We'll just
                // sleep a little bit to avoid spinning, and then pretend that
                // we flushed one, so anyone blocked will check again
                Thread.sleep(1000);
                wakeUpIfBlocking();
              }
              // Enqueue another one of these tokens so we'll wake up again
              wakeupFlushThread();
            }
            continue;
          }
          FlushRegionEntry fre = (FlushRegionEntry) fqe;
          // 发起flush操作
          if (!flushRegion(fre)) {
            break;
          }
          ...
    }
  }

所有检查通过后,开始真正的flush实现,一层层进入调用的方法,最终实现在下面的代码:

protected FlushResultImpl internalFlushcache(WAL wal, long myseqid,
    Collection<HStore> storesToFlush, MonitoredTask status, boolean writeFlushWalMarker,
    FlushLifeCycleTracker tracker) throws IOException {
    // 准备Region下所有MemStore的快照
    PrepareFlushResult result =
      internalPrepareFlushCache(wal, myseqid, storesToFlush, status, writeFlushWalMarker, tracker);
    if (result.result == null) {
      // Flush 到临时目录且提交移动到CF目录下
      return internalFlushCacheAndCommit(wal, status, result, storesToFlush);
    } else {
      return result.result; // early exit due to failure from prepare stage
    }
  }

其中

  • internalPrepareFlushCache进行flush前的准备工作,包括生成一次MVCC的事务ID,准备flush时所需要的缓存和中间数据结构,以及生成当前MemStore的一个快照。
  • internalFlushCacheAndCommit则执行了具体的flush行为,包括首先将数据写入临时的tmp文件,提交一次更新事务(commit),最后将文件移入HDFS中的正确目录下。

这里有几个关键点,一,该方法是被updatesLock().writeLock保护起来的,updatesLock与上文中提到的lock一样,都是ReentrantReadWriteLock,这里为啥还要加锁呢,前面讲过所加的锁是对region整体行为而言的,如split、move、merge等宏观行为,而这里的updatesLock数数据的更新请求,快照生成期间加入updatesLock是为了保证数据一致性,快照生成后立刻释放了锁,保证了用户请求与快照flush到磁盘同时进行,提高了系统并发的吞吐量。

HFile构建流程

HFile生成流程参考另一篇

HBase Compaction

但随着Flush次数的不断增多,HFile的文件数量也会不断增多,那么这样会带来什么影响呢?在HBaseCon 2013大会上,Hontonworks的名为《Compaction Improvements in Apache HBase》的主题演讲中,提到了他们测试过的随着HFile的数量的不断增多对读取时延性带来的影响:

这里结合HBase的读流程如下图,可以帮助我们简单的理解这其中的原因:

图中说明了从一个文件中指定RowKey读取记录和从两个文件中读取记录的区别,明显,从两个文件中读取,将导致更多的磁盘IO。这就是HBase执行Compaction得一个初衷,Compaction可以将一些HFile文件合并为一个较大的HFile文件,也可以把所有的HFile文件合并为一个大的HFile文件,这个过程可以理解为:将多个HFile的“交错无序态” 变成了单个HFile的"有序状态",降低读取时延性,小范围的HFile文件合并,称之为Minor Compaction,一个列簇中所有的HFile文件合并,称之为Major Compaction。除了文件合并范围的不同之处,Major Compaction 还会清理一些TTL过期/版本过旧以及被标记删除的数据。下图直观描述了Flush与Compaction流程:

Compaction的本质

我们再想下,在集群中执行Compaction,本质是为了什么?大概的原因如下:

  • 减少HFile的文件数量,减少文件句柄数量,降低读取时延
  • Major Compaction 可以帮助清理集群中不再需要的数据(过期数据,被标记删除数据、版本溢出数据)
  • 提高数据本地化率

很多Hbase用户在集群中关闭了自动Major Compaction,为了降低Compaction 对ID资源的抢占,但出于清理数据的需要,又不得不在一些非繁忙时段手动触发Major Compaction,这样既可以有效降低存储空间,也可以有效降低读取时效。