HBase Internals: The Read Path


Overview of the Read Path

  1. The client first accesses ZooKeeper to find out which RegionServer hosts the hbase:meta table, and caches that location in its MetaCache
  2. It then accesses that RegionServer to read hbase:meta and, using the table/rowkey of the read request, determines which Region on which RegionServer holds the target data; that Region information is cached in the client-side MetaCache to speed up later accesses
  3. The client communicates with the target RegionServer
  4. The RegionServer looks up the target data in the BlockCache (the read cache), the MemStore, and the StoreFiles (HFiles), and merges everything it finds
  5. Data blocks read from files (a Block is the HFile storage unit, 64 KB by default) are cached in the BlockCache
  6. The merged final result is returned to the client

Two Read Modes

Get

A Get retrieves a single row by its exact RowKey, commonly called a random point lookup, which is precisely the access pattern HBase excels at. A Get operation consists of two main steps:

1. Build the Get

The simplest way to build a Get object from a RowKey:

final byte[] key = Bytes.toBytes("class***");
Get get = new Get(key);

You can restrict the Get to return only a specific column family:

final byte[] family = Bytes.toBytes("f1");
// return all columns of column family f1
get.addFamily(family);

Or return only specific columns within a family:

final byte[] family = Bytes.toBytes("f1");
final byte[] qualifierMobile = Bytes.toBytes("age");
// return only the column age of column family f1
get.addColumn(family, qualifierMobile);

2. Send the Get request and retrieve the row

As with writes, Get requests are sent through the Table interface, and the retrieved row is wrapped in a Result object. A Result object can be understood as follows:

  • It maps to exactly one row; it can never contain data spanning multiple rows
  • It contains one or more of the requested columns: possibly all columns of the row, possibly only a subset

The above is a single random row lookup, but fetching multiple rows in one call is also a common need. Table also defines a batch Get interface (backed by the multi RPC of RSRpcServices), which retrieves multiple rows in a single network round trip.

Scan

HBase shards table data by splitting tables into Regions. Each Region covers a RowKey range, and within a Region the data is organized in RowKey lexicographic order.

This design makes it easy for HBase to serve queries of the form "given a RowKey range, return all records in that range". In HBase such a query is called a Scan.

A Scan operation involves the following key steps:

1. Build the Scan

The simplest and most common way to build a Scan object is to specify only its StartRow and StopRow.

For example:

final byte[] startKey = Bytes.toBytes("600430");
final byte[] stopKey  = Bytes.toBytes("600439");
Scan scan = new Scan();
scan.withStartRow(startKey).withStopRow(stopKey);

If StartRow is not specified, the Scan starts from the first row of the table.

If StopRow is not specified and the Scan is not stopped explicitly, it reads through to the last row of the table.

If neither StartRow nor StopRow is specified, the Scan is a full table scan.

Like Get, a Scan can also restrict the returned column families or columns.

2. Get a ResultScanner

ResultScanner scanner = table.getScanner(scan);

3. Iterate over the results

Result result = null;
// scanner.next returns one row at a time
while((result = scanner.next()) != null) {
    // process result
}

4. Close the ResultScanner

Close a ResultScanner as follows:

scanner.close();

Other Important Scan Parameters

1. Caching: the number of Results fetched per RPC

The following sets the number of Results fetched per round trip to 100:

scan.setCaching(100);

Every time the client sends a scan request to a RegionServer, it fetches a batch of rows (the Caching setting determines how many Results come back per batch) and places them in the local Result Cache:

Each application-level read is served from the local Result Cache; once the cache is drained, the client sends another scan request to the RegionServer to fetch more data.
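
This fetch-a-batch-per-RPC behavior can be illustrated with a small, self-contained simulation (plain Java, no HBase dependency; the class and field names here are made up for illustration):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the client-side Result Cache: each "RPC" pulls up to
// `caching` rows from the server side, and next() is served from the local
// cache until it drains, triggering the next round trip.
public class CachingSketch {
    private final List<String> serverRows;   // stands in for the RegionServer's data
    private final int caching;               // rows fetched per scan RPC
    private final Queue<String> resultCache = new ArrayDeque<>();
    private int cursor = 0;                  // scan progress on the "server"
    public int rpcCount = 0;                 // round trips made so far

    public CachingSketch(List<String> serverRows, int caching) {
        this.serverRows = serverRows;
        this.caching = caching;
    }

    public String next() {
        if (resultCache.isEmpty() && cursor < serverRows.size()) {
            rpcCount++;  // one more scan RPC to the RegionServer
            int end = Math.min(cursor + caching, serverRows.size());
            resultCache.addAll(serverRows.subList(cursor, end));
            cursor = end;
        }
        return resultCache.poll();  // null when the scan is exhausted
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) rows.add("row" + i);
        CachingSketch scanner = new CachingSketch(rows, 4);
        int n = 0;
        while (scanner.next() != null) n++;
        // 10 rows with caching=4 -> 3 RPCs (4 + 4 + 2)
        System.out.println(n + " rows in " + scanner.rpcCount + " RPCs");
    }
}
```

A larger Caching value means fewer round trips but more client memory per batch, which is the usual trade-off when tuning it.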

2. Batch: the number of columns per Result

The following limits each Result to at most 3 columns:

scan.setBatch(3);

This parameter is useful when a single row is very wide: the requested columns of one row are split across multiple Results returned to the client.
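
The splitting effect of setBatch on one wide row can be sketched as follows (plain Java, no HBase dependency; the helper name is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of what setBatch does to a single wide row: the
// requested columns are chopped into several Results of at most `batch`
// columns each, instead of one huge Result.
public class BatchSketch {
    public static List<List<String>> split(List<String> columnsOfOneRow, int batch) {
        List<List<String>> results = new ArrayList<>();
        for (int i = 0; i < columnsOfOneRow.size(); i += batch) {
            int end = Math.min(i + batch, columnsOfOneRow.size());
            results.add(new ArrayList<>(columnsOfOneRow.subList(i, end)));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> cols = List.of("c1", "c2", "c3", "c4", "c5", "c6", "c7");
        // batch=3 -> three Results of sizes 3, 3 and 1, all for the same row
        System.out.println(split(cols, 3));
    }
}
```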

3. Limit: cap the number of rows a Scan returns

Like the LIMIT clause in SQL, this caps the total number of rows returned by one Scan:

scan.setLimit(10000);

4. MaxResultSize: cap the memory footprint of one RPC's result set

The following sets the maximum size of the returned result set to 5 MB:

scan.setMaxResultSize(5*1024*1024);

5. Reversed Scan: scan backwards

A normal Scan reads in ascending lexicographic order; a Reversed Scan reads in the opposite direction:

scan.setReversed(true);

How the Client Sends a Read Request to a RegionServer

Whether it is a Get or a Scan, before sending the request to a RegionServer the client first needs routing information.

1. Locate the Region the request targets

A Get is tied to a single RowKey, so the client simply locates the Region containing that RowKey.

For a Scan, the client first locates the Region containing the Scan's StartKey.

2. Send the read request to the RegionServer hosting that Region

This works the same way as the data routing described in the earlier write-path article, so we will not repeat it here.

If a Scan crosses Regions, then after finishing one Region the client must continue reading from the next one, which requires the client side to keep recording and refreshing the scan's progress. When a Region has no more data, the scan response carries a hint so that the client can switch to the next Region.
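
The routing and region-switching logic above can be sketched with a floor lookup over cached region start keys (plain Java; the class, method names, and region names are all made up for illustration, the real client derives this information from its cached hbase:meta entries):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of client-side region routing: region locations are
// keyed by region start key; a rowkey belongs to the region with the greatest
// start key <= the rowkey, and a cross-region scan walks regions in order.
public class RegionRouting {
    // region holding rowKey: a floor lookup on the sorted start keys
    public static String locate(TreeMap<String, String> regions, String rowKey) {
        return regions.floorEntry(rowKey).getValue();
    }

    // regions a scan over [startKey, stopKey) must visit, in order
    public static List<String> regionsForScan(TreeMap<String, String> regions,
                                              String startKey, String stopKey) {
        List<String> visited = new ArrayList<>();
        for (Map.Entry<String, String> e
                : regions.tailMap(regions.floorKey(startKey), true).entrySet()) {
            if (e.getKey().compareTo(stopKey) >= 0) break; // region starts at/after stopKey
            visited.add(e.getValue());
        }
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<String, String> regions = new TreeMap<>();
        regions.put("", "region-1");       // the first region has an empty start key
        regions.put("row400", "region-2");
        regions.put("row800", "region-3");
        System.out.println(locate(regions, "row123"));                   // region-1
        System.out.println(regionsForScan(regions, "row300", "row900")); // all three, in order
    }
}
```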

How a RegionServer Handles a Read Request

Internal Structure

1. A table may contain one or more Regions

A large HBase table with hundreds of millions of rows is split horizontally into "sub-tables"; each such sub-table is a Region.

2. Each Region contains one or more column families

If a Region is a horizontal slice of a table, then the vertical partitioning of a Region's columns is called a Column Family. Every column must belong to a Column Family; the concrete columns and their family membership are specified when data is written, rather than predefined at table creation (only the families themselves are declared in the table schema).

3. Each column family has one MemStore, plus one or more HFiles

The earlier "Region and multiple column families" figure glossed over the internal structure of a Column Family; the figure below shows a Column Family together with its MemStore and HFiles:

In the HBase source code, a Column Family is abstracted as a Store object. The conceptual difference is simple: Column Family is the user-facing logical concept, while Store is the source-level abstraction of a Column Family.

4. Each MemStore may contain one Active Segment and one or more Immutable Segments

Extending this to a Region with two Column Families:

5. An HFile is made up of Blocks; by default the sorted data is organized into 64 KB Blocks

  • Data Block (the Data blocks on the left of the figure above): holds the actual KeyValue data.
  • Data Index: index information over the Data Blocks.

Given a RowKey, the block index inside an HFile makes it possible to locate the corresponding Data Block quickly.
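
The lookup the block index enables can be sketched as a floor search over the first key of each Block (plain Java, no HBase dependency; the class and array names are made up for illustration):

```java
import java.util.Arrays;

// Hypothetical sketch of how a block index locates the Data Block for a
// RowKey: the index records the first key of every Block, and the Block that
// may contain the key is the last one whose first key is <= the search key.
public class BlockIndexSketch {
    public static int findBlock(String[] firstKeys, String rowKey) {
        int pos = Arrays.binarySearch(firstKeys, rowKey);
        if (pos >= 0) return pos;          // exact match on a block's first key
        int insertion = -pos - 1;          // where rowKey would be inserted
        return insertion - 1;              // -1 means the key sorts before block 0
    }

    public static void main(String[] args) {
        String[] firstKeys = {"row000", "row100", "row200"};  // one entry per 64KB Block
        System.out.println(findBlock(firstKeys, "row150"));   // block 1
        System.out.println(findBlock(firstKeys, "row200"));   // block 2
        System.out.println(findBlock(firstKeys, "aaa"));      // -1: before the file's first key
    }
}
```

The real index is a multi-level tree rather than a flat array, but each level performs essentially this floor search.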

From the above we can see that an HBase read essentially boils down to this: how to read the data the user wants out of a Region that contains one or more column families, each with one MemStore (made up of Segments) plus one or more HFiles.

Building a Scanner Hierarchy for Each Query

Inside a Store/Column Family, KeyValues may live in a MemStore Segment or in an HFile; we refer to Segments and HFiles collectively as KeyValue data sources.

When first reading the RegionServer/Region read-path source code, the many Scanner classes are easy to confuse. HBase uses a dedicated Scanner to abstract the scan operation over each layer/kind of KeyValue data source:

  • Reading a Region is encapsulated in a RegionScanner object.
  • Reading each Store/Column Family is encapsulated in a StoreScanner object.
  • SegmentScanner and StoreFileScanner describe reads of MemStore Segments and of HFiles, respectively.
  • The actual HFile reads inside a StoreFileScanner are performed by an HFileScanner.

The structure of a RegionScanner is shown in the figure below:

Inside a StoreScanner, the SegmentScanners and StoreFileScanners are organized in an object called a KeyValueHeap.

Each Scanner holds a pointer to the KeyValue it will read next. The core of the KeyValueHeap is a priority queue, in which the Scanners are ordered by the KeyValue their current pointer points to.

Likewise, the multiple StoreScanners within a RegionScanner are organized in a KeyValueHeap object:

Initializing the Scanner Hierarchy

The heart of the hierarchy is three scanner layers: RegionScanner, StoreScanner, and StoreFileScanner. They are hierarchical: a RegionScanner is composed of multiple StoreScanners, one per column family of the table being scanned, and a StoreScanner is in turn composed of multiple StoreFileScanners. Since each Store's data consists of the in-memory MemStore plus the StoreFiles on disk, a StoreScanner holds N SegmentScanners and N StoreFileScanners to do the actual reading, one StoreFileScanner per StoreFile. Note that StoreFileScanner and SegmentScanner are the ultimate executors of the whole scan. Initializing the Scanner hierarchy involves the following core steps:

1) Build the RegionScanner

Select the Stores the query needs and initialize a StoreScanner for each.

 private void initializeScanners(Scan scan, List<KeyValueScanner> additionalScanners)
    throws IOException {
    // Here we separate all scanners into two lists - scanner that provide data required
    // by the filter to operate (scanners list) and all others (joinedScanners list).
    List<KeyValueScanner> scanners = new ArrayList<>(scan.getFamilyMap().size());
    List<KeyValueScanner> joinedScanners = new ArrayList<>(scan.getFamilyMap().size());
    // Store all already instantiated scanners for exception handling
    List<KeyValueScanner> instantiatedScanners = new ArrayList<>();
    // handle additionalScanners
    if (additionalScanners != null && !additionalScanners.isEmpty()) {
      scanners.addAll(additionalScanners);
      instantiatedScanners.addAll(additionalScanners);
    }

    try {
      // only build StoreScanners for the column families the client requested
      for (Map.Entry<byte[], NavigableSet<byte[]>> entry : scan.getFamilyMap().entrySet()) {
        HStore store = region.getStore(entry.getKey());
        // build the StoreScanner
        KeyValueScanner scanner = store.getScanner(scan, entry.getValue(), this.readPt);
        instantiatedScanners.add(scanner);
        if (
          this.filter == null || !scan.doLoadColumnFamiliesOnDemand()
            || this.filter.isFamilyEssential(entry.getKey())
        ) {
          scanners.add(scanner);
        } else {
          joinedScanners.add(scanner);
        }
      }
      // add the StoreScanner list into the RegionScanner, combining them into one KVHeap
      initializeKVHeap(scanners, joinedScanners, region);
    } catch (Throwable t) {
      throw handleException(instantiatedScanners, t);
    }
  }

2) Build the StoreScanner

Each StoreScanner builds one StoreFileScanner for every HFile in its Store, to perform the actual lookups in that file, and also builds SegmentScanners for the corresponding MemStore, to search the Store's in-memory data.

/**
   * Opens a scanner across memstore, snapshot, and all StoreFiles. Assumes we are not in a
   * compaction.
   * @param store   who we scan
   * @param scan    the spec
   * @param columns which columns we are scanning
   */
  public StoreScanner(HStore store, ScanInfo scanInfo, Scan scan, NavigableSet<byte[]> columns,
    long readPt) throws IOException {
    this(store, scan, scanInfo, columns != null ? columns.size() : 0, readPt, scan.getCacheBlocks(),
      ScanType.USER_SCAN);
    if (columns != null && scan.isRaw()) {
      throw new DoNotRetryIOException("Cannot specify any column for a raw scan");
    }
    matcher = UserScanQueryMatcher.create(scan, scanInfo, columns, oldestUnexpiredTS, now,
      store.getCoprocessorHost());

    store.addChangedReaderObserver(this);

    List<KeyValueScanner> scanners = null;
    try {
      // Pass columns to try to filter out unnecessary StoreFiles.
      // 1. first collect all SegmentScanners (for the MemStore) and StoreFileScanners (for the StoreFiles) of this Store
      // 2. then filter out HFiles that cannot match the query
      scanners = selectScannersFrom(store,
        store.getScanners(cacheBlocks, scanUsePread, false, matcher, scan.getStartRow(),
          scan.includeStartRow(), scan.getStopRow(), scan.includeStopRow(), this.readPt));

      // Seek all scanners to the start of the Row (or if the exact matching row
      // key does not exist, then to the start of the next matching Row).
      // Always check bloom filter to optimize the top row seek for delete
      // family marker.
      // 3. seek every scanner of this Store: use the query's (get/scan) startKey and the block index to find
      // the offset of the Block containing startKey, load that Block, and position the current cell at
      // startKey's KV, or at the next KV
      seekScanners(scanners, matcher.getStartKey(), explicitColumnQuery && lazySeekEnabledGlobally,
        parallelSeekEnabled);

      // set storeLimit
      this.storeLimit = scan.getMaxResultsPerColumnFamily();

      // set rowOffset
      this.storeOffset = scan.getRowOffsetPerColumnFamily();
      addCurrentScanners(scanners);
      // Combine all seeked scanners with a heap
      resetKVHeap(scanners, comparator);
    } catch (IOException e) {
      clearAndClose(scanners);
      // remove us from the HStore#changedReaderObservers here or we'll have no chance to
      // and might cause memory leak
      store.deleteChangedReaderObserver(this);
      throw e;
    }
  }

3) Filter out Scanners that cannot satisfy the query

StoreScanner builds one StoreFileScanner per HFile, but note that not every HFile contains the KeyValues the user is looking for; on the contrary, the query conditions can rule out many HFiles that definitely do not contain them. The main filtering strategies are TimeRange filtering, RowKey-range filtering, and Bloom filters.

protected List<KeyValueScanner> selectScannersFrom(HStore store,
    List<? extends KeyValueScanner> allScanners) {
    boolean memOnly;
    boolean filesOnly;
    if (scan instanceof InternalScan) {
      InternalScan iscan = (InternalScan) scan;
      memOnly = iscan.isCheckOnlyMemStore();
      filesOnly = iscan.isCheckOnlyStoreFiles();
    } else {
      memOnly = false;
      filesOnly = false;
    }

    List<KeyValueScanner> scanners = new ArrayList<>(allScanners.size());

    // We can only exclude store files based on TTL if minVersions is set to 0.
    // Otherwise, we might have to return KVs that have technically expired.
    long expiredTimestampCutoff = minVersions == 0 ? oldestUnexpiredTS : Long.MIN_VALUE;

    // include only those scan files which pass all filters
    for (KeyValueScanner kvs : allScanners) {
      boolean isFile = kvs.isFileScanner();
      if ((!isFile && filesOnly) || (isFile && memOnly)) {
        kvs.close();
        continue;
      }
      // filter out HFiles that do not match the query conditions
      if (kvs.shouldUseScanner(scan, store, expiredTimestampCutoff)) {
        scanners.add(kvs);
      } else {
        kvs.close();
      }
    }
    return scanners;
}

 public boolean shouldUseScanner(Scan scan, HStore store, long oldestUnexpiredTS) {
    // if the file has no entries, no need to validate or create a scanner.
    byte[] cf = store.getColumnFamilyDescriptor().getName();
    TimeRange timeRange = scan.getColumnFamilyTimeRange().get(cf);
    if (timeRange == null) {
      timeRange = scan.getTimeRange();
    }
    // TimeRange filter & key-range filter & Bloom filter
    return reader.passesTimerangeFilter(timeRange, oldestUnexpiredTS)
      && reader.passesKeyRangeFilter(scan)
      && reader.passesBloomFilter(scan, scan.getFamilyMap().get(cf));
  }

4) Seek every Scanner to startKey

This step seeks each HFile (and MemStore) scanner to the scan's starting point, startKey; if startKey is not found in the HFile, the scanner seeks to the next KV instead. Seeking is a core step with three parts:

  • Locate the Block offset: read the HFile's index structure from the BlockCache and walk the index tree to find the Block Offset and Block Size for the given RowKey

  • Load the Block: look up the Data Block in the BlockCache by its offset; if it is not cached, load it from the HFile

  • Seek the key: locate the specific RowKey within the loaded Data Block

    In HFileReaderImpl#seekTo:

    public int seekTo(Cell key, boolean rewind) throws IOException {
      // read the HFile block index
      HFileBlockIndex.BlockIndexReader indexReader = reader.getDataBlockIndexReader();
      BlockWithScanInfo blockWithScanInfo = indexReader.loadDataBlockWithScanInfo(key, curBlock,
        cacheBlocks, pread, isCompaction, getEffectiveDataBlockEncoding(), reader);
      if (blockWithScanInfo == null || blockWithScanInfo.getHFileBlock() == null) {
        // This happens if the key e.g. falls before the beginning of the file.
        return -1;
      }
      // with the Block loaded, seek to the position of startKey
      return loadBlockAndSeekToKey(blockWithScanInfo.getHFileBlock(),
        blockWithScanInfo.getNextIndexedKey(), rewind, key, false);
    }


5) Merge the KeyValueScanners into a min-heap

All StoreFileScanners and MemStore scanners of the Store are combined into a heap (a min-heap), which is really a priority queue. In the queue, the Scanners are ordered, smallest first, by the KeyValue each Scanner's seek produced. Managing the Scanners with a min-heap guarantees that every KV taken out is the smallest remaining one, so popping repeatedly yields the target KeyValues in ascending order, preserving ordering. The heap's core operations are peek and next:

@InterfaceAudience.Private
public class KeyValueHeap extends NonReversedNonLazyKeyValueScanner
  implements KeyValueScanner, InternalScanner {
  private static final Logger LOG = LoggerFactory.getLogger(KeyValueHeap.class);
  protected PriorityQueue<KeyValueScanner> heap = null;
  // Holds the scanners when a ever a eager close() happens. All such eagerly closed
  // scans are collected and when the final scanner.close() happens will perform the
  // actual close.
  protected List<KeyValueScanner> scannersForDelayedClose = null;
  
  @Override
  public Cell peek() {
    if (this.current == null) {
      return null;
    }
    return this.current.peek();
  }

  boolean isLatestCellFromMemstore() {
    return !this.current.isFileScanner();
  }

  @Override
  public Cell next() throws IOException {
    if (this.current == null) {
      return null;
    }
    Cell kvReturn = this.current.next();
    Cell kvNext = this.current.peek();
    if (kvNext == null) {
      this.scannersForDelayedClose.add(this.current);
      this.current = null;
      this.current = pollRealKV();
    } else {
      KeyValueScanner topScanner = this.heap.peek();
      // no need to add current back to the heap if it is the only scanner left
      if (topScanner != null && this.comparator.compare(kvNext, topScanner.peek()) >= 0) {
        this.heap.add(this.current);
        this.current = null;
        this.current = pollRealKV();
      }
    }
    return kvReturn;
  }  
}

Reading Rows One by One via next

Once the Scanner hierarchy is built and initialized, KeyValues can be obtained in ascending order through RegionScanner#next.

Think of the RegionScanner as a machine with intricate internals, driven by the client's successive scan requests; each scan request pulls rows out of it by calling the RegionScanner's next method.

  • The server-side entry point for Get is RSRpcServices#get()

    private Result get(Get get, HRegion region, RegionScannersCloseCallBack closeCallBack,
      RpcCallContext context) throws IOException {
    region.prepareGet(get);
    boolean stale = region.getRegionInfo().getReplicaId() != 0;

    // This method is almost the same as HRegion#get.
    List<Cell> results = new ArrayList<>();
    long before = EnvironmentEdgeManager.currentTime();
    // pre-get CP hook
    if (region.getCoprocessorHost() != null) {
      if (region.getCoprocessorHost().preGet(get, results)) {
        region.metricsUpdateForGet(results, before);
        return Result.create(results, get.isCheckExistenceOnly() ? !results.isEmpty() : null,
          stale);
      }
    }
    Scan scan = new Scan(get);
    if (scan.getLoadColumnFamiliesOnDemandValue() == null) {
      scan.setLoadColumnFamiliesOnDemand(region.isLoadingCfsOnDemandDefault());
    }
    RegionScannerImpl scanner = null;
    try {
      // first build a RegionScanner via HRegion
      scanner = region.getScanner(scan);
      // then fetch one row of data via RegionScanner#next
      scanner.next(results);
    } finally {
      if (scanner != null) {
        if (closeCallBack == null) {
          // If there is a context then the scanner can be added to the current
          // RpcCallContext. The rpc callback will take care of closing the
          // scanner, for eg in case
          // of get()
          context.setCallBack(scanner);
        } else {
          // The call is from multi() where the results from the get() are
          // aggregated and then send out to the
          // rpc. The rpccall back will close all such scanners created as part
          // of multi().
          closeCallBack.addScanner(scanner);
        }
      }
    }
    

  • The server-side entry point for Scan is RSRpcServices#scan()

  @Override
  public ScanResponse scan(final RpcController controller, final ScanRequest request)
    throws ServiceException {
    if (controller != null && !(controller instanceof HBaseRpcController)) {
      throw new UnsupportedOperationException(
        "We only do " + "HBaseRpcControllers! FIX IF A PROBLEM: " + controller);
    }
    // check whether this request already carries a scannerId
    if (!request.hasScannerId() && !request.hasScan()) {
      throw new ServiceException(
        new DoNotRetryIOException("Missing required input: scannerId or scan"));
    }
    try {
      checkOpen();
    } catch (IOException e) {
      if (request.hasScannerId()) {
        String scannerName = toScannerName(request.getScannerId());
        if (LOG.isDebugEnabled()) {
          LOG.debug(
            "Server shutting down and client tried to access missing scanner " + scannerName);
        }
        final LeaseManager leaseManager = server.getLeaseManager();
        if (leaseManager != null) {
          try {
            leaseManager.cancelLease(scannerName);
          } catch (LeaseException le) {
            // No problem, ignore
            if (LOG.isTraceEnabled()) {
              LOG.trace("Un-able to cancel lease of scanner. It could already be closed.");
            }
          }
        }
      }
      throw new ServiceException(e);
    }
    requestCount.increment();
    rpcScanRequestCount.increment();
    RegionScannerHolder rsh;
    ScanResponse.Builder builder = ScanResponse.newBuilder();
    String scannerName;
    try {
      if (request.hasScannerId()) {
        // The downstream projects such as AsyncHBase in OpenTSDB need this value. See HBASE-18000
        // for more details.
        long scannerId = request.getScannerId();
        builder.setScannerId(scannerId);
        scannerName = toScannerName(scannerId);
        // fetch the RegionScanner by scannerId
        rsh = getRegionScanner(request);
      } else {
        Pair<String, RegionScannerHolder> scannerNameAndRSH = newRegionScanner(request, builder);
        scannerName = scannerNameAndRSH.getFirst();
        // if there is no scannerId yet, build a new RegionScanner
        rsh = scannerNameAndRSH.getSecond();
      }
    } catch (IOException e) {
      if (e == SCANNER_ALREADY_CLOSED) {
        // Now we will close scanner automatically if there are no more results for this region but
        // the old client will still send a close request to us. Just ignore it and return.
        return builder.build();
      }
      throw new ServiceException(e);
    }
    if (rsh.fullRegionScan) {
      rpcFullScanRequestCount.increment();
    }
    HRegion region = rsh.r;
    LeaseManager.Lease lease;
    try {
      // Remove lease while its being processed in server; protects against case
      // where processing of request takes > lease expiration time. or null if none found.
      lease = server.getLeaseManager().removeLease(scannerName);
    } catch (LeaseException e) {
      throw new ServiceException(e);
    }
    if (request.hasRenew() && request.getRenew()) {
      // add back and return
      addScannerLeaseBack(lease);
      try {
        checkScanNextCallSeq(request, rsh);
      } catch (OutOfOrderScannerNextException e) {
        throw new ServiceException(e);
      }
      return builder.build();
    }
    OperationQuota quota;
    try {
      quota = getRpcQuotaManager().checkQuota(region, OperationQuota.OperationType.SCAN);
    } catch (IOException e) {
      addScannerLeaseBack(lease);
      throw new ServiceException(e);
    }
    try {
      checkScanNextCallSeq(request, rsh);
    } catch (OutOfOrderScannerNextException e) {
      addScannerLeaseBack(lease);
      throw new ServiceException(e);
    }
    // Now we have increased the next call sequence. If we give client an error, the retry will
    // never success. So we'd better close the scanner and return a DoNotRetryIOException to client
    // and then client will try to open a new scanner.
    boolean closeScanner = request.hasCloseScanner() ? request.getCloseScanner() : false;
    int rows; // this is scan.getCaching
    if (request.hasNumberOfRows()) {
      rows = request.getNumberOfRows();
    } else {
      rows = closeScanner ? 0 : 1;
    }
    RpcCall rpcCall = RpcServer.getCurrentCall().orElse(null);
    // now let's do the real scan.
    long maxQuotaResultSize = Math.min(maxScannerResultSize, quota.getReadAvailable());
    RegionScanner scanner = rsh.s;
    // this is the limit of rows for this scan, if we the number of rows reach this value, we will
    // close the scanner.
    int limitOfRows;
    if (request.hasLimitOfRows()) {
      limitOfRows = request.getLimitOfRows();
    } else {
      limitOfRows = -1;
    }
    MutableObject lastBlock = new MutableObject<>();
    boolean scannerClosed = false;
    try {
      List results = new ArrayList<>(Math.min(rows, 512));
      if (rows > 0) {
        boolean done = false;
        // Call coprocessor. Get region info from scanner.
        if (region.getCoprocessorHost() != null) {
          Boolean bypass = region.getCoprocessorHost().preScannerNext(scanner, results, rows);
          if (!results.isEmpty()) {
            for (Result r : results) {
              lastBlock.setValue(addSize(rpcCall, r, lastBlock.getValue()));
            }
          }
          if (bypass != null && bypass.booleanValue()) {
            done = true;
          }
        }
        if (!done) {
          scan((HBaseRpcController) controller, request, rsh, maxQuotaResultSize, rows,
            limitOfRows, results, builder, lastBlock, rpcCall);
        } else {
          builder.setMoreResultsInRegion(!results.isEmpty());
        }
      } else {
        // This is a open scanner call with numberOfRow = 0, so set more results in region to true.
        builder.setMoreResultsInRegion(true);
      }

      quota.addScanResult(results);
      addResults(builder, results, (HBaseRpcController) controller,
        RegionReplicaUtil.isDefaultReplica(region.getRegionInfo()),
        isClientCellBlockSupport(rpcCall));
      if (scanner.isFilterDone() && results.isEmpty()) {
        // If the scanner's filter - if any - is done with the scan
        // only set moreResults to false if the results is empty. This is used to keep compatible
        // with the old scan implementation where we just ignore the returned results if moreResults
        // is false. Can remove the isEmpty check after we get rid of the old implementation.
        builder.setMoreResults(false);
      }
      // Later we may close the scanner depending on this flag so here we need to make sure that we
      // have already set this flag.
      assert builder.hasMoreResultsInRegion();
      // we only set moreResults to false in the above code, so set it to true if we haven't set it
      // yet.
      if (!builder.hasMoreResults()) {
        builder.setMoreResults(true);
      }
      if (builder.getMoreResults() && builder.getMoreResultsInRegion() && !results.isEmpty()) {
        // Record the last cell of the last result if it is a partial result
        // We need this to calculate the complete rows we have returned to client as the
        // mayHaveMoreCellsInRow is true does not mean that there will be extra cells for the
        // current row. We may filter out all the remaining cells for the current row and just
        // return the cells of the nextRow when calling RegionScanner.nextRaw. So here we need to
        // check for row change.
        Result lastResult = results.get(results.size() - 1);
        if (lastResult.mayHaveMoreCellsInRow()) {
          rsh.rowOfLastPartialResult = lastResult.getRow();
        } else {
          rsh.rowOfLastPartialResult = null;
        }
      }
      if (!builder.getMoreResults() || !builder.getMoreResultsInRegion() || closeScanner) {
        scannerClosed = true;
        closeScanner(region, scanner, scannerName, rpcCall);
      }
    
      // There's no point returning to a timed out client. Throwing ensures scanner is closed
      if (rpcCall != null && EnvironmentEdgeManager.currentTime() > rpcCall.getDeadline()) {
        throw new TimeoutIOException("Client deadline exceeded, cannot return results");
      }
    
      return builder.build();
    } catch (IOException e) {
      try {
        // scanner is closed here
        scannerClosed = true;
        // The scanner state might be left in a dirty state, so we will tell the Client to
        // fail this RPC and close the scanner while opening up another one from the start of
        // row that the client has last seen.
        closeScanner(region, scanner, scannerName, rpcCall);
    
        // If it is a DoNotRetryIOException already, throw as it is. Unfortunately, DNRIOE is
        // used in two different semantics.
        // (1) The first is to close the client scanner and bubble up the exception all the way
        // to the application. This is preferred when the exception is really un-recoverable
        // (like CorruptHFileException, etc). Plain DoNotRetryIOException also falls into this
        // bucket usually.
        // (2) Second semantics is to close the current region scanner only, but continue the
        // client scanner by overriding the exception. This is usually UnknownScannerException,
        // OutOfOrderScannerNextException, etc where the region scanner has to be closed, but the
        // application-level ClientScanner has to continue without bubbling up the exception to
        // the client. See ClientScanner code to see how it deals with these special exceptions.
        if (e instanceof DoNotRetryIOException) {
          throw e;
        }
    
        // If it is a FileNotFoundException, wrap as a
        // DoNotRetryIOException. This can avoid the retry in ClientScanner.
        if (e instanceof FileNotFoundException) {
          throw new DoNotRetryIOException(e);
        }
    
        // We closed the scanner already. Instead of throwing the IOException, and client
        // retrying with the same scannerId only to get USE on the next RPC, we directly throw
        // a special exception to save an RPC.
        if (VersionInfoUtil.hasMinimumVersion(rpcCall.getClientVersionInfo(), 1, 4)) {
          // 1.4.0+ clients know how to handle
          throw new ScannerResetException("Scanner is closed on the server-side", e);
        } else {
          // older clients do not know about SRE. Just throw USE, which they will handle
          throw new UnknownScannerException("Throwing UnknownScannerException to reset the client"
            + " scanner state for clients older than 1.3.", e);
        }
      } catch (IOException ioe) {
        throw new ServiceException(ioe);
      }
    } finally {
      if (!scannerClosed) {
        // Adding resets expiration time on lease.
        // the closeCallBack will be set in closeScanner so here we only care about shippedCallback
        if (rpcCall != null) {
          rpcCall.setCallBack(rsh.shippedCallback);
        } else {
          // If context is null,here we call rsh.shippedCallback directly to reuse the logic in
          // rsh.shippedCallback to release the internal resources in rsh,and lease is also added
          // back to regionserver's LeaseManager in rsh.shippedCallback.
          runShippedCallback(rsh);
        }
      }
      quota.close();
    }
    

    }

    Assume a RegionScanner contains just one StoreScanner; then the core reading done by that RegionScanner is carried out by the StoreScanner. Further assume the StoreScanner is made up of four Scanners, as shown below:

    Each Scanner has a current pointer to the next KV it will read, and the PriorityQueue inside the KVHeap orders the Scanners precisely by the KV their current pointer points to.

    The first next request returns Row01:FamA:Col1 from ScannerA, after which ScannerA's pointer moves to the next KV, Row01:FamA:Col2; the ordering of the Scanners in the PriorityQueue is unchanged:

    The second next request again returns from ScannerA, this time Row01:FamA:Col2, and ScannerA's pointer moves on to the KV Row02:FamA:Col1; now the ordering of the Scanners in the PriorityQueue changes:

    The next request after that will return the KV from ScannerB, and so on, until some Scanner runs out of data; that Scanner is then closed and no longer appears in the PriorityQueue.
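
    The merge behavior just walked through can be sketched as a small k-way merge over scanners, using a plain java.util.PriorityQueue (no HBase dependency; the class names and the string encoding of KVs are made up for illustration):

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch of the KeyValueHeap merge described above: each scanner is a
// sorted stream with a current KV; a PriorityQueue orders the scanners by that
// KV, so repeated next() calls return all KVs in ascending order across
// scanners, and drained scanners simply drop out of the queue.
public class KVHeapSketch {
    public static class Scanner {
        final Deque<String> kvs;
        public Scanner(List<String> sorted) { this.kvs = new ArrayDeque<>(sorted); }
        String peek() { return kvs.peek(); }   // the KV the current pointer is on
        String next() { return kvs.poll(); }   // return it and advance the pointer
    }

    final PriorityQueue<Scanner> heap =
        new PriorityQueue<>(Comparator.comparing(Scanner::peek));

    public void add(Scanner s) { if (s.peek() != null) heap.add(s); }

    public String next() {
        Scanner top = heap.poll();
        if (top == null) return null;          // all scanners exhausted
        String kv = top.next();
        if (top.peek() != null) heap.add(top); // re-insert at its new position;
        return kv;                             // otherwise the scanner is "closed"
    }

    public static void main(String[] args) {
        KVHeapSketch h = new KVHeapSketch();
        // the two scanners from the walkthrough above
        h.add(new Scanner(List.of("Row01:FamA:Col1", "Row01:FamA:Col2", "Row02:FamA:Col1")));
        h.add(new Scanner(List.of("Row01:FamB:Col1", "Row03:FamB:Col1")));
        for (String kv; (kv = h.next()) != null; ) System.out.println(kv);
    }
}
```

    As in the walkthrough, the first two next() calls both come from the first scanner; re-inserting a scanner after each pop is what makes the queue re-evaluate its position by its new current KV.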