Overview of the read path
- The client first contacts ZooKeeper to learn which RegionServer hosts the hbase:meta table, and caches that location in its MetaCache.
- It then contacts that RegionServer to read hbase:meta and, based on the request's table/rowkey, determines which Region on which RegionServer holds the target data; that Region information is cached in the client's MetaCache for subsequent requests.
- The client communicates with the target RegionServer.
- The RegionServer looks up the target data in the BlockCache (read cache), the MemStore, and the StoreFiles (HFiles), and merges everything it finds.
- Data blocks read from files (a Block is the HFile storage unit, 64 KB by default) are cached in the BlockCache.
- The merged final result is returned to the client.
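As a rough illustration of the merge step, here is a self-contained sketch (all names are hypothetical, and each data source is reduced to a timestamp→value map): the read collects candidate versions of a cell from every source and keeps the newest one.

```java
import java.util.*;

// Hypothetical simplification: a cell may have versions in the MemStore and
// in several HFiles; the read merges them and returns the newest
// (highest-timestamp) value.
public class MergeSketch {
    // Each source maps a timestamp to a value for one cell.
    static String mergeNewest(List<NavigableMap<Long, String>> sources) {
        long bestTs = Long.MIN_VALUE;
        String best = null;
        for (NavigableMap<Long, String> src : sources) {
            if (src.isEmpty()) continue;
            Map.Entry<Long, String> e = src.lastEntry(); // newest in this source
            if (e.getKey() > bestTs) {
                bestTs = e.getKey();
                best = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> memstore = new TreeMap<>(Map.of(300L, "v3"));
        NavigableMap<Long, String> hfile1 = new TreeMap<>(Map.of(100L, "v1"));
        NavigableMap<Long, String> hfile2 = new TreeMap<>(Map.of(200L, "v2"));
        System.out.println(mergeNewest(List.of(memstore, hfile1, hfile2))); // v3
    }
}
```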
Two read modes
Get
A Get retrieves a single row by an exact RowKey, usually called a random point lookup, which is exactly the read pattern HBase excels at. A Get operation has two main steps:
1. Build the Get
The simplest way to build a Get object from a RowKey:
final byte[] key = Bytes.toBytes("class***");
Get get = new Get(key);
You can specify which column families the Get should return:
final byte[] family = Bytes.toBytes("f1");
// return all columns of column family f1
get.addFamily(family);
You can also return only specific columns within a column family:
final byte[] family = Bytes.toBytes("f1");
final byte[] qualifierMobile = Bytes.toBytes("age");
// return only column age of column family f1
get.addColumn(family, qualifierMobile);
2. Send the Get request and read the returned record
As with writes, the interface for sending a Get request is provided by Table, and the returned row is wrapped in a Result object. A Result can be understood as:
- bound to exactly one row; it can never contain data that crosses rows
- containing one or more of the requested columns: possibly all columns of the row, possibly only a subset
The example above fetches a single row, but fetching multiple rows at once is also a common need. Table also defines a batch Get interface (backed by the multi RPC in RSRpcServices), which retrieves several rows in a single network round trip.
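The routing behind such a batch get can be sketched as follows: a hypothetical, simplified client that groups row keys by the region start key responsible for them (a TreeSet floor lookup stands in for the real meta lookup), so each region receives one request.

```java
import java.util.*;

// Hypothetical sketch of client-side routing for a batch get: each row key
// belongs to the region whose start key is the greatest one <= the row key,
// so keys are grouped per region before sending the multi request.
public class BatchRouting {
    static Map<String, List<String>> groupByRegion(NavigableSet<String> regionStartKeys,
                                                   List<String> rowKeys) {
        Map<String, List<String>> byRegion = new TreeMap<>();
        for (String row : rowKeys) {
            String start = regionStartKeys.floor(row); // region containing this row
            byRegion.computeIfAbsent(start, k -> new ArrayList<>()).add(row);
        }
        return byRegion;
    }

    public static void main(String[] args) {
        // three regions with start keys "", "m", "t"
        NavigableSet<String> starts = new TreeSet<>(List.of("", "m", "t"));
        System.out.println(groupByRegion(starts, List.of("apple", "mango", "tiger", "zebra")));
    }
}
```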
Scan
An HBase table is sharded into Regions; each Region covers a RowKey range, and the data within a Region is ordered by RowKey in lexicographic order.
This design is exactly what lets HBase handle queries of the form "given a RowKey range, return every record in that range". In HBase this kind of query is called a Scan.
A Scan operation involves the following key steps:
1. Build the Scan
The simplest and most common way to build a Scan object is to specify only its StartRow and StopRow.
For example:
final byte[] startKey = Bytes.toBytes("600430");
final byte[] stopKey = Bytes.toBytes("600439");
Scan scan = new Scan();
scan.withStartRow(startKey).withStopRow(stopKey);
If StartRow is not specified, the Scan starts from the first row of the table.
If StopRow is not specified, and the Scan is not stopped explicitly, it keeps reading through the last row of the table.
If neither StartRow nor StopRow is specified, the Scan is a full table scan.
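The three cases above can be illustrated with a sorted in-memory set standing in for a table (a hypothetical sketch; the real semantics are StartRow inclusive, StopRow exclusive, and a null bound means "from the first row" / "through the last row"):

```java
import java.util.*;

// Sketch of Scan range semantics over a sorted "table":
// [startRow, stopRow) — startRow inclusive, stopRow exclusive.
public class ScanRange {
    static List<String> scan(NavigableSet<String> rows, String start, String stop) {
        NavigableSet<String> view = rows;
        if (start != null) view = view.tailSet(start, true); // inclusive start
        if (stop != null) view = view.headSet(stop, false);  // exclusive stop
        return new ArrayList<>(view);
    }

    public static void main(String[] args) {
        NavigableSet<String> rows =
            new TreeSet<>(List.of("600430", "600433", "600439", "600440"));
        System.out.println(scan(rows, "600430", "600439")); // [600430, 600433]
        System.out.println(scan(rows, null, null));         // full table scan
    }
}
```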

As with Get, a Scan can also restrict the column families or columns it returns.
2. Get a ResultScanner
ResultScanner scanner = table.getScanner(scan);
3. Iterate over the results
Result result = null;
// scanner.next() returns the rows one at a time
while ((result = scanner.next()) != null) {
    // process result
}
4. Close the ResultScanner
Close a ResultScanner like this:
scanner.close();
Other important Scan parameters
1. Caching: the number of Results fetched per RPC
The following sets the number of Results brought back per round trip to 100:
scan.setCaching(100);
Each scan request the client sends to the RegionServer brings back a batch of rows (the batch size is determined by Caching), which is placed into the client-side Result Cache:

Each time the application reads data, it reads from the local Result Cache; once the cache is drained, the client sends another scan request to the RegionServer for more data.
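The interaction can be sketched as follows (a hypothetical simplification; an iterator stands in for the RegionServer, and one "RPC" pulls up to `caching` rows into the local cache):

```java
import java.util.*;

// Hypothetical sketch of the client-side result cache: each "RPC" pulls up
// to `caching` rows from the server; the application drains the local cache
// and a new RPC is issued only when the cache is empty.
public class CachingSketch {
    static int rpcCount = 0;

    static List<List<String>> fetchAll(Iterator<String> server, int caching) {
        List<List<String>> batches = new ArrayList<>();
        while (server.hasNext()) {
            rpcCount++; // one scan RPC per batch
            List<String> batch = new ArrayList<>();
            while (batch.size() < caching && server.hasNext()) {
                batch.add(server.next());
            }
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r1", "r2", "r3", "r4", "r5");
        List<List<String>> batches = fetchAll(rows.iterator(), 2);
        System.out.println(batches + " in " + rpcCount + " RPCs"); // 3 RPCs for 5 rows
    }
}
```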
2. Batch: the number of columns per Result
The following limits each Result to at most 3 columns:
scan.setBatch(3);
This parameter targets very wide rows: the requested columns of one row are split across multiple Results returned to the client.
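A sketch of that splitting (a hypothetical simplification; a list of column names stands in for one wide row):

```java
import java.util.*;

// Sketch of scan.setBatch(n): one wide row's requested columns are chopped
// into chunks of at most n columns, each chunk returned as its own Result.
public class BatchSketch {
    static List<List<String>> splitRow(List<String> columns, int batch) {
        List<List<String>> results = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += batch) {
            results.add(columns.subList(i, Math.min(i + batch, columns.size())));
        }
        return results;
    }

    public static void main(String[] args) {
        // 7 columns with batch=3 -> Results of 3, 3 and 1 columns
        System.out.println(splitRow(List.of("c1", "c2", "c3", "c4", "c5", "c6", "c7"), 3));
    }
}
```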
3. Limit: cap the number of rows one Scan returns
Like the LIMIT clause in SQL, this caps the total number of rows a Scan fetches:
scan.setLimit(10000);
4. MaxResultSize: cap one RPC's result set by memory footprint
The following sets the maximum size of the returned result set to 5 MB:
scan.setMaxResultSize(5 * 1024 * 1024);
5. Reversed Scan: scan backwards
A normal Scan reads in ascending lexicographic order; a Reversed Scan reads in the opposite direction:
scan.setReversed(true);
How the client sends read requests to the RegionServer
Whether for a Get or a Scan, the client must obtain routing information before sending the request to a RegionServer:
1. Locate the Region the request maps to
A Get carries a single RowKey, so the Region containing that RowKey is located directly.
For a Scan, the Region containing the Scan's StartKey is located first.
2. Send the read request to the RegionServer hosting that Region
This works the same way as the data routing described earlier in "How writes work", so it is not repeated here.
If a Scan crosses Regions, then after one Region is exhausted the client must continue reading from the next Region, which requires the client side to keep recording and refreshing the scan's progress. When a Region has no more data, the scan response carries a hint so the client can switch to the next Region.
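That region-switching loop can be sketched as follows (hypothetical; a TreeMap of region start keys stands in for the meta information, and exhausting a region's list stands in for the server's "no more data" hint):

```java
import java.util.*;

// Hypothetical sketch of a cross-region scan: the client reads one region
// at a time and, when a region is exhausted, resumes from the next region's
// start key until the whole range is covered.
public class CrossRegionScan {
    static List<String> scanAll(NavigableMap<String, List<String>> regions /* startKey -> rows */,
                                String startRow) {
        List<String> out = new ArrayList<>();
        String regionKey = regions.floorKey(startRow); // region containing startRow
        while (regionKey != null) {
            for (String row : regions.get(regionKey)) {
                if (row.compareTo(startRow) >= 0) out.add(row);
            }
            regionKey = regions.higherKey(regionKey); // switch to the next region
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, List<String>> regions = new TreeMap<>();
        regions.put("", List.of("a1", "b1"));
        regions.put("m", List.of("m1", "n1"));
        System.out.println(scanAll(regions, "b1")); // [b1, m1, n1]
    }
}
```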
How the RegionServer handles read requests
Internal structure
1. A table consists of one or more Regions
A large HBase table with hundreds of millions of rows is sliced horizontally into "sub-tables"; each of these sub-tables is a Region.

2. Each Region contains one or more column families
If a Region is a horizontal slice of a table, then a vertical slice of the columns within a Region is a Column Family. Every column must belong to a Column Family; this membership is specified when the data is written, not predefined at table-creation time.

3. Each column family has one MemStore plus one or more HFile files
The "Region and column families" figure above glossed over the internals of a Column Family; the figure below shows its composition, including the MemStore and HFiles:

In the HBase source code, a Column Family is abstracted as a Store object. The conceptual difference is simple: Column Family is the user-facing logical concept, while Store is the implementation-level abstraction of a Column Family.
4. Each MemStore may contain one Active Segment plus one or more Immutable Segments

Extended to a Region with two Column Families:

5. An HFile consists of Blocks; by default the data is organized, in sorted order, into 64 KB Blocks

- Data Block (the Data blocks on the left of the figure above): holds the actual KeyValue data.
- Data Index: index information over the Data Blocks.
Given a RowKey, the block index in the HFile locates the corresponding Data Block quickly.
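The index lookup can be sketched with a sorted map standing in for the index tree (a hypothetical simplification: each entry maps the first row key of a block to that block's offset and size):

```java
import java.util.*;

// Sketch of the Data Index: a map from the first row key of each 64KB block
// to its (offset, size); a floor lookup finds the block that may contain a
// given row key.
public class BlockIndexSketch {
    record BlockRef(long offset, int size) {}

    static BlockRef locate(NavigableMap<String, BlockRef> index, String rowKey) {
        Map.Entry<String, BlockRef> e = index.floorEntry(rowKey);
        return e == null ? null : e.getValue(); // null: key falls before the first block
    }

    public static void main(String[] args) {
        NavigableMap<String, BlockRef> index = new TreeMap<>();
        index.put("row000", new BlockRef(0, 65536));
        index.put("row500", new BlockRef(65536, 65536));
        System.out.println(locate(index, "row123")); // block starting at offset 0
    }
}
```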
From the above, we can see that the essence of an HBase read is extracting the data the user wants from a Region that contains one or more column families, each with one MemStore (and its Segments) plus one or more HFiles.
Building a Scanner hierarchy for each query
Inside a Store/Column Family, a KeyValue may live in a MemStore Segment or in an HFile; we refer to both, Segment or HFile, as KeyValue data sources.
On a first reading of the RegionServer/Region read-path source code, the many Scanner classes are easy to confuse. HBase uses a Scanner to abstract the scan of each layer/kind of KeyValue data source:
- Reading a Region is encapsulated in a RegionScanner object.
- Reading each Store/Column Family is encapsulated in a StoreScanner object.
- SegmentScanner and StoreFileScanner describe reads of MemStore Segments and of HFiles, respectively.
- The actual HFile reads inside a StoreFileScanner are performed by an HFileScanner.
The composition of a RegionScanner is shown below:

Inside a StoreScanner, the SegmentScanners and StoreFileScanners are organized in an object called a KeyValueHeap.

Each Scanner keeps a pointer to the KeyValue it will read next. The core of the KeyValueHeap is a priority queue, ordered by the KeyValue each Scanner's current pointer points at.
Likewise, the StoreScanners inside a RegionScanner are organized in a KeyValueHeap object:

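The heap behavior can be sketched independently of HBase (a hypothetical simplification; strings stand in for KeyValues and their comparator, and a small Scanner class stands in for SegmentScanner/StoreFileScanner):

```java
import java.util.*;

// Sketch of the KeyValueHeap idea: scanners are ordered in a priority queue
// by the KeyValue their current pointer points at, so polling always yields
// the globally smallest next KeyValue across all scanners.
public class HeapSketch {
    static class Scanner {
        final Deque<String> kvs;
        Scanner(String... kvs) { this.kvs = new ArrayDeque<>(List.of(kvs)); }
        String peek() { return kvs.peekFirst(); }
        String next() { return kvs.pollFirst(); }
    }

    static List<String> merge(List<Scanner> scanners) {
        PriorityQueue<Scanner> heap =
            new PriorityQueue<>(Comparator.comparing(Scanner::peek));
        for (Scanner s : scanners) if (s.peek() != null) heap.add(s);
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Scanner top = heap.poll();             // scanner with the smallest current KV
            out.add(top.next());
            if (top.peek() != null) heap.add(top); // re-insert at its new position
        }
        return out;
    }

    public static void main(String[] args) {
        Scanner a = new Scanner("r1/f/c1", "r1/f/c2");
        Scanner b = new Scanner("r1/f/c3", "r2/f/c1");
        System.out.println(merge(List.of(a, b)));
    }
}
```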
Initializing the Scanner hierarchy
The core of the Scanner hierarchy is its three layers: RegionScanner, StoreScanner, and StoreFileScanner. They nest: a RegionScanner is made up of multiple StoreScanners, one per column family being scanned, each responsible for that family's data. A StoreScanner is in turn made up of multiple StoreFileScanners. Since a Store's data lives in the in-memory MemStore and in StoreFile files on disk, the StoreScanner object holds N SegmentScanners and N StoreFileScanners to do the actual reading, one StoreFileScanner per StoreFile. Note: StoreFileScanner and SegmentScanner are the ultimate executors of the whole scan. Initializing the Scanner hierarchy involves these core steps:

1) Build the RegionScanner
Select only the Stores the query needs and initialize the corresponding StoreScanners.
private void initializeScanners(Scan scan, List<KeyValueScanner> additionalScanners)
    throws IOException {
  // Here we separate all scanners into two lists - scanner that provide data required
  // by the filter to operate (scanners list) and all others (joinedScanners list).
  List<KeyValueScanner> scanners = new ArrayList<>(scan.getFamilyMap().size());
  List<KeyValueScanner> joinedScanners = new ArrayList<>(scan.getFamilyMap().size());
  // Store all already instantiated scanners for exception handling
  List<KeyValueScanner> instantiatedScanners = new ArrayList<>();
  // handle additionalScanners
  if (additionalScanners != null && !additionalScanners.isEmpty()) {
    scanners.addAll(additionalScanners);
    instantiatedScanners.addAll(additionalScanners);
  }
  try {
    // only initialize StoreScanners for the families the client requested
    for (Map.Entry<byte[], NavigableSet<byte[]>> entry : scan.getFamilyMap().entrySet()) {
      HStore store = region.getStore(entry.getKey());
      // build the StoreScanner
      KeyValueScanner scanner = store.getScanner(scan, entry.getValue(), this.readPt);
      instantiatedScanners.add(scanner);
      if (
        this.filter == null || !scan.doLoadColumnFamiliesOnDemand()
          || this.filter.isFamilyEssential(entry.getKey())
      ) {
        scanners.add(scanner);
      } else {
        joinedScanners.add(scanner);
      }
    }
    // add the StoreScanner list to the RegionScanner, combining them into one KVHeap
    initializeKVHeap(scanners, joinedScanners, region);
  } catch (Throwable t) {
    throw handleException(instantiatedScanners, t);
  }
}
2) Build the StoreScanner
Each StoreScanner creates one StoreFileScanner per HFile in the Store, which performs the actual lookups in that file, plus a SegmentScanner for the Store's MemStore, which performs the lookups in memory.
/**
 * Opens a scanner across memstore, snapshot, and all StoreFiles. Assumes we are not in a
 * compaction.
 * @param store   who we scan
 * @param scan    the spec
 * @param columns which columns we are scanning
 */
public StoreScanner(HStore store, ScanInfo scanInfo, Scan scan, NavigableSet<byte[]> columns,
    long readPt) throws IOException {
  this(store, scan, scanInfo, columns != null ? columns.size() : 0, readPt, scan.getCacheBlocks(),
    ScanType.USER_SCAN);
  if (columns != null && scan.isRaw()) {
    throw new DoNotRetryIOException("Cannot specify any column for a raw scan");
  }
  matcher = UserScanQueryMatcher.create(scan, scanInfo, columns, oldestUnexpiredTS, now,
    store.getCoprocessorHost());
  store.addChangedReaderObserver(this);
  List<KeyValueScanner> scanners = null;
  try {
    // Pass columns to try to filter out unnecessary StoreFiles.
    // 1. first obtain all SegmentScanners (for the MemStore) and all
    //    StoreFileScanners (for the StoreFiles) of this Store
    // 2. then filter out the HFiles that cannot match the query
    scanners = selectScannersFrom(store,
      store.getScanners(cacheBlocks, scanUsePread, false, matcher, scan.getStartRow(),
        scan.includeStartRow(), scan.getStopRow(), scan.includeStopRow(), this.readPt));
    // Seek all scanners to the start of the Row (or if the exact matching row
    // key does not exist, then to the start of the next matching Row).
    // Always check bloom filter to optimize the top row seek for delete
    // family marker.
    // 3. seek every scanner under the Store: use the query's (get/scan) startKey and the
    //    block index to find the offset of the Block containing startKey, then read that
    //    Block and make the KV at (or right after) startKey the current cell
    seekScanners(scanners, matcher.getStartKey(), explicitColumnQuery && lazySeekEnabledGlobally,
      parallelSeekEnabled);
    // set storeLimit
    this.storeLimit = scan.getMaxResultsPerColumnFamily();
    // set rowOffset
    this.storeOffset = scan.getRowOffsetPerColumnFamily();
    addCurrentScanners(scanners);
    // Combine all seeked scanners with a heap
    resetKVHeap(scanners, comparator);
  } catch (IOException e) {
    clearAndClose(scanners);
    // remove us from the HStore#changedReaderObservers here or we'll have no chance to
    // and might cause memory leak
    store.deleteChangedReaderObserver(this);
    throw e;
  }
}
3) Filter out Scanners that cannot match the query
A StoreScanner builds one StoreFileScanner per HFile, but note that not every HFile contains the KeyValues being looked up; on the contrary, many HFiles that definitely cannot contain them can be eliminated up front based on the query conditions. The main pruning strategies are the Time Range filter, the RowKey Range filter, and the Bloom filter.
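The three pruning checks can be sketched as follows (a hypothetical simplification: a plain Set stands in for a real Bloom filter, which may return false positives but never false negatives, and each file is reduced to its metadata):

```java
import java.util.*;

// Hypothetical sketch of selectScannersFrom-style pruning: an HFile is kept
// only if its time range overlaps the scan's, its key range covers the row,
// and its (simulated) bloom filter may contain the row.
public class ScannerPruning {
    record FileMeta(long minTs, long maxTs, String firstKey, String lastKey,
                    Set<String> bloomRows /* stand-in for a real bloom filter */) {}

    static boolean shouldUse(FileMeta f, long tsFrom, long tsTo, String row) {
        boolean timeOk = f.maxTs() >= tsFrom && f.minTs() <= tsTo;       // Time Range filter
        boolean keyOk = f.firstKey().compareTo(row) <= 0
            && f.lastKey().compareTo(row) >= 0;                          // Key Range filter
        boolean bloomOk = f.bloomRows().contains(row);                   // Bloom filter
        return timeOk && keyOk && bloomOk;
    }

    public static void main(String[] args) {
        FileMeta f = new FileMeta(100, 200, "a", "m", Set.of("abc"));
        System.out.println(shouldUse(f, 150, 250, "abc")); // true
        System.out.println(shouldUse(f, 300, 400, "abc")); // pruned by time range
    }
}
```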
protected List<KeyValueScanner> selectScannersFrom(HStore store,
    List<? extends KeyValueScanner> allScanners) {
  boolean memOnly;
  boolean filesOnly;
  if (scan instanceof InternalScan) {
    InternalScan iscan = (InternalScan) scan;
    memOnly = iscan.isCheckOnlyMemStore();
    filesOnly = iscan.isCheckOnlyStoreFiles();
  } else {
    memOnly = false;
    filesOnly = false;
  }
  List<KeyValueScanner> scanners = new ArrayList<>(allScanners.size());
  // We can only exclude store files based on TTL if minVersions is set to 0.
  // Otherwise, we might have to return KVs that have technically expired.
  long expiredTimestampCutoff = minVersions == 0 ? oldestUnexpiredTS : Long.MIN_VALUE;
  // include only those scan files which pass all filters
  for (KeyValueScanner kvs : allScanners) {
    boolean isFile = kvs.isFileScanner();
    if ((!isFile && filesOnly) || (isFile && memOnly)) {
      kvs.close();
      continue;
    }
    // filter out HFiles that cannot match the query
    if (kvs.shouldUseScanner(scan, store, expiredTimestampCutoff)) {
      scanners.add(kvs);
    } else {
      kvs.close();
    }
  }
  return scanners;
}
public boolean shouldUseScanner(Scan scan, HStore store, long oldestUnexpiredTS) {
  // if the file has no entries, no need to validate or create a scanner.
  byte[] cf = store.getColumnFamilyDescriptor().getName();
  TimeRange timeRange = scan.getColumnFamilyTimeRange().get(cf);
  if (timeRange == null) {
    timeRange = scan.getTimeRange();
  }
  // time-range filter & key-range filter & Bloom filter
  return reader.passesTimerangeFilter(timeRange, oldestUnexpiredTS)
    && reader.passesKeyRangeFilter(scan)
    && reader.passesBloomFilter(scan, scan.getFamilyMap().get(cf));
}
4) Seek each Scanner to the startKey
This step seeks, within each HFile (and the MemStore), to the scan's starting point, startKey. If startKey is not found in the HFile, the scanner seeks to the next KV instead. Seek is a core step made up of three parts:
- Locate the Block offset: read the HFile's index structure from the BlockCache and walk the index tree to find the Block Offset and Block Size for the given RowKey.
- Load the Block: using that offset, look for the Data Block in the BlockCache first; if it is not cached, load it from the HFile.
- Seek the key: locate the exact RowKey inside the loaded Data Block.
In HFileReaderImpl#seekTo:
public int seekTo(Cell key, boolean rewind) throws IOException {
  // read the HFile block index
  HFileBlockIndex.BlockIndexReader indexReader = reader.getDataBlockIndexReader();
  BlockWithScanInfo blockWithScanInfo = indexReader.loadDataBlockWithScanInfo(key, curBlock,
    cacheBlocks, pread, isCompaction, getEffectiveDataBlockEncoding(), reader);
  if (blockWithScanInfo == null || blockWithScanInfo.getHFileBlock() == null) {
    // This happens if the key e.g. falls before the beginning of the file.
    return -1;
  }
  // seek to the startKey position within the loaded Block
  return loadBlockAndSeekToKey(blockWithScanInfo.getHFileBlock(),
    blockWithScanInfo.getNextIndexedKey(), rewind, key, false);
}
5) Combine the KeyValueScanners into a min-heap
All StoreFileScanners and MemStore scanners of the Store are combined into a heap, a min-heap, which in practice is a priority queue. In the queue, the KeyValues each Scanner has seeked to are ordered smallest to largest by the scanner comparison rule. Managing the Scanners with a min-heap guarantees that every KV taken out is the smallest remaining one, so repeatedly popping yields the target KeyValues from smallest to largest, preserving order. The core min-heap operations are peek and next:
@InterfaceAudience.Private
public class KeyValueHeap extends NonReversedNonLazyKeyValueScanner
  implements KeyValueScanner, InternalScanner {

  private static final Logger LOG = LoggerFactory.getLogger(KeyValueHeap.class);

  protected PriorityQueue<KeyValueScanner> heap = null;

  // Holds the scanners when a ever a eager close() happens. All such eagerly closed
  // scans are collected and when the final scanner.close() happens will perform the
  // actual close.
  protected List<KeyValueScanner> scannersForDelayedClose = null;

  @Override
  public Cell peek() {
    if (this.current == null) {
      return null;
    }
    return this.current.peek();
  }

  boolean isLatestCellFromMemstore() {
    return !this.current.isFileScanner();
  }

  @Override
  public Cell next() throws IOException {
    if (this.current == null) {
      return null;
    }
    Cell kvReturn = this.current.next();
    Cell kvNext = this.current.peek();
    if (kvNext == null) {
      this.scannersForDelayedClose.add(this.current);
      this.current = null;
      this.current = pollRealKV();
    } else {
      KeyValueScanner topScanner = this.heap.peek();
      // no need to add current back to the heap if it is the only scanner left
      if (topScanner != null && this.comparator.compare(kvNext, topScanner.peek()) >= 0) {
        this.heap.add(this.current);
        this.current = null;
        this.current = pollRealKV();
      }
    }
    return kvReturn;
  }
}
Reading rows through next requests
Once the Scanner hierarchy is built and initialized, KeyValues can be obtained, from smallest to largest, through successive RegionScanner#next calls.
Think of the RegionScanner as a machine with intricate internals, driven by the client's successive scan requests: each scan request pulls rows out by calling the RegionScanner's next method.
- The server-side entry point for Get is RSRpcServices#get():
private Result get(Get get, HRegion region, RegionScannersCloseCallBack closeCallBack,
    RpcCallContext context) throws IOException {
  region.prepareGet(get);
  boolean stale = region.getRegionInfo().getReplicaId() != 0;
  // This method is almost the same as HRegion#get.
  List<Cell> results = new ArrayList<>();
  long before = EnvironmentEdgeManager.currentTime();
  // pre-get CP hook
  if (region.getCoprocessorHost() != null) {
    if (region.getCoprocessorHost().preGet(get, results)) {
      region.metricsUpdateForGet(results, before);
      return Result.create(results, get.isCheckExistenceOnly() ? !results.isEmpty() : null, stale);
    }
  }
  Scan scan = new Scan(get);
  if (scan.getLoadColumnFamiliesOnDemandValue() == null) {
    scan.setLoadColumnFamiliesOnDemand(region.isLoadingCfsOnDemandDefault());
  }
  RegionScannerImpl scanner = null;
  try {
    // first build a RegionScanner from the HRegion
    scanner = region.getScanner(scan);
    // then fetch one row through RegionScanner#next
    scanner.next(results);
  } finally {
    if (scanner != null) {
      if (closeCallBack == null) {
        // If there is a context then the scanner can be added to the current
        // RpcCallContext. The rpc callback will take care of closing the
        // scanner, for eg in case of get()
        context.setCallBack(scanner);
      } else {
        // The call is from multi() where the results from the get() are
        // aggregated and then send out to the rpc. The rpccall back will
        // close all such scanners created as part of multi().
        closeCallBack.addScanner(scanner);
      }
    }
  }
  // ...

- The server-side entry point for Scan is RSRpcServices#scan():
@Override
public ScanResponse scan(final RpcController controller, final ScanRequest request)
    throws ServiceException {
  if (controller != null && !(controller instanceof HBaseRpcController)) {
    throw new UnsupportedOperationException(
      "We only do " + "HBaseRpcControllers! FIX IF A PROBLEM: " + controller);
  }
  // the request must carry either a scannerId or a scan
  if (!request.hasScannerId() && !request.hasScan()) {
    throw new ServiceException(
      new DoNotRetryIOException("Missing required input: scannerId or scan"));
  }
  try {
    checkOpen();
  } catch (IOException e) {
    if (request.hasScannerId()) {
      String scannerName = toScannerName(request.getScannerId());
      if (LOG.isDebugEnabled()) {
        LOG.debug(
          "Server shutting down and client tried to access missing scanner " + scannerName);
      }
      final LeaseManager leaseManager = server.getLeaseManager();
      if (leaseManager != null) {
        try {
          leaseManager.cancelLease(scannerName);
        } catch (LeaseException le) {
          // No problem, ignore
          if (LOG.isTraceEnabled()) {
            LOG.trace("Un-able to cancel lease of scanner. It could already be closed.");
          }
        }
      }
    }
    throw new ServiceException(e);
  }
  requestCount.increment();
  rpcScanRequestCount.increment();
  RegionScannerHolder rsh;
  ScanResponse.Builder builder = ScanResponse.newBuilder();
  String scannerName;
  try {
    if (request.hasScannerId()) {
      // The downstream projects such as AsyncHBase in OpenTSDB need this value. See HBASE-18000
      // for more details.
      long scannerId = request.getScannerId();
      builder.setScannerId(scannerId);
      scannerName = toScannerName(scannerId);
      // look up the RegionScanner by scannerId
      rsh = getRegionScanner(request);
    } else {
      Pair<String, RegionScannerHolder> scannerNameAndRSH = newRegionScanner(request, builder);
      scannerName = scannerNameAndRSH.getFirst();
      // no scannerId yet: build a new RegionScanner
      rsh = scannerNameAndRSH.getSecond();
    }
  } catch (IOException e) {
    if (e == SCANNER_ALREADY_CLOSED) {
      // Now we will close scanner automatically if there are no more results for this region but
      // the old client will still send a close request to us. Just ignore it and return.
      return builder.build();
    }
    throw new ServiceException(e);
  }
  if (rsh.fullRegionScan) {
    rpcFullScanRequestCount.increment();
  }
  HRegion region = rsh.r;
  LeaseManager.Lease lease;
  try {
    // Remove lease while its being processed in server; protects against case
    // where processing of request takes > lease expiration time. or null if none found.
    lease = server.getLeaseManager().removeLease(scannerName);
  } catch (LeaseException e) {
    throw new ServiceException(e);
  }
  if (request.hasRenew() && request.getRenew()) {
    // add back and return
    addScannerLeaseBack(lease);
    try {
      checkScanNextCallSeq(request, rsh);
    } catch (OutOfOrderScannerNextException e) {
      throw new ServiceException(e);
    }
    return builder.build();
  }
  OperationQuota quota;
  try {
    quota = getRpcQuotaManager().checkQuota(region, OperationQuota.OperationType.SCAN);
  } catch (IOException e) {
    addScannerLeaseBack(lease);
    throw new ServiceException(e);
  }
  try {
    checkScanNextCallSeq(request, rsh);
  } catch (OutOfOrderScannerNextException e) {
    addScannerLeaseBack(lease);
    throw new ServiceException(e);
  }
  // Now we have increased the next call sequence. If we give client an error, the retry will
  // never success. So we'd better close the scanner and return a DoNotRetryIOException to client
  // and then client will try to open a new scanner.
  boolean closeScanner = request.hasCloseScanner() ? request.getCloseScanner() : false;
  int rows; // this is scan.getCaching
  if (request.hasNumberOfRows()) {
    rows = request.getNumberOfRows();
  } else {
    rows = closeScanner ? 0 : 1;
  }
  RpcCall rpcCall = RpcServer.getCurrentCall().orElse(null);
  // now let's do the real scan.
  long maxQuotaResultSize = Math.min(maxScannerResultSize, quota.getReadAvailable());
  RegionScanner scanner = rsh.s;
  // this is the limit of rows for this scan, if we the number of rows reach this value, we will
  // close the scanner.
  int limitOfRows;
  if (request.hasLimitOfRows()) {
    limitOfRows = request.getLimitOfRows();
  } else {
    limitOfRows = -1;
  }
  MutableObject lastBlock = new MutableObject<>();
  boolean scannerClosed = false;
  try {
    List results = new ArrayList<>(Math.min(rows, 512));
    if (rows > 0) {
      boolean done = false;
      // Call coprocessor. Get region info from scanner.
      if (region.getCoprocessorHost() != null) {
        Boolean bypass = region.getCoprocessorHost().preScannerNext(scanner, results, rows);
        if (!results.isEmpty()) {
          for (Result r : results) {
            lastBlock.setValue(addSize(rpcCall, r, lastBlock.getValue()));
          }
        }
        if (bypass != null && bypass.booleanValue()) {
          done = true;
        }
      }
      if (!done) {
        scan((HBaseRpcController) controller, request, rsh, maxQuotaResultSize, rows, limitOfRows,
          results, builder, lastBlock, rpcCall);
      } else {
        builder.setMoreResultsInRegion(!results.isEmpty());
      }
    } else {
      // This is a open scanner call with numberOfRow = 0, so set more results in region to true.
      builder.setMoreResultsInRegion(true);
    }
    quota.addScanResult(results);
    addResults(builder, results, (HBaseRpcController) controller,
      RegionReplicaUtil.isDefaultReplica(region.getRegionInfo()),
      isClientCellBlockSupport(rpcCall));
    if (scanner.isFilterDone() && results.isEmpty()) {
      // If the scanner's filter - if any - is done with the scan
      // only set moreResults to false if the results is empty. This is used to keep compatible
      // with the old scan implementation where we just ignore the returned results if moreResults
      // is false. Can remove the isEmpty check after we get rid of the old implementation.
      builder.setMoreResults(false);
    }
    // Later we may close the scanner depending on this flag so here we need to make sure that we
    // have already set this flag.
    assert builder.hasMoreResultsInRegion();
    // we only set moreResults to false in the above code, so set it to true if we haven't set it
    // yet.
    if (!builder.hasMoreResults()) {
      builder.setMoreResults(true);
    }
    if (builder.getMoreResults() && builder.getMoreResultsInRegion() && !results.isEmpty()) {
      // Record the last cell of the last result if it is a partial result
      // We need this to calculate the complete rows we have returned to client as the
      // mayHaveMoreCellsInRow is true does not mean that there will be extra cells for the
      // current row. We may filter out all the remaining cells for the current row and just
      // return the cells of the nextRow when calling RegionScanner.nextRaw. So here we need to
      // check for row change.
      Result lastResult = results.get(results.size() - 1);
      if (lastResult.mayHaveMoreCellsInRow()) {
        rsh.rowOfLastPartialResult = lastResult.getRow();
      } else {
        rsh.rowOfLastPartialResult = null;
      }
    }
    if (!builder.getMoreResults() || !builder.getMoreResultsInRegion() || closeScanner) {
      scannerClosed = true;
      closeScanner(region, scanner, scannerName, rpcCall);
    }
    // There's no point returning to a timed out client. Throwing ensures scanner is closed
    if (rpcCall != null && EnvironmentEdgeManager.currentTime() > rpcCall.getDeadline()) {
      throw new TimeoutIOException("Client deadline exceeded, cannot return results");
    }
    return builder.build();
  } catch (IOException e) {
    try {
      // scanner is closed here
      scannerClosed = true;
      // The scanner state might be left in a dirty state, so we will tell the Client to
      // fail this RPC and close the scanner while opening up another one from the start of
      // row that the client has last seen.
      closeScanner(region, scanner, scannerName, rpcCall);
      // If it is a DoNotRetryIOException already, throw as it is. Unfortunately, DNRIOE is
      // used in two different semantics.
      // (1) The first is to close the client scanner and bubble up the exception all the way
      // to the application. This is preferred when the exception is really un-recoverable
      // (like CorruptHFileException, etc). Plain DoNotRetryIOException also falls into this
      // bucket usually.
      // (2) Second semantics is to close the current region scanner only, but continue the
      // client scanner by overriding the exception. This is usually UnknownScannerException,
      // OutOfOrderScannerNextException, etc where the region scanner has to be closed, but the
      // application-level ClientScanner has to continue without bubbling up the exception to
      // the client. See ClientScanner code to see how it deals with these special exceptions.
      if (e instanceof DoNotRetryIOException) {
        throw e;
      }
      // If it is a FileNotFoundException, wrap as a
      // DoNotRetryIOException. This can avoid the retry in ClientScanner.
      if (e instanceof FileNotFoundException) {
        throw new DoNotRetryIOException(e);
      }
      // We closed the scanner already. Instead of throwing the IOException, and client
      // retrying with the same scannerId only to get USE on the next RPC, we directly throw
      // a special exception to save an RPC.
      if (VersionInfoUtil.hasMinimumVersion(rpcCall.getClientVersionInfo(), 1, 4)) {
        // 1.4.0+ clients know how to handle
        throw new ScannerResetException("Scanner is closed on the server-side", e);
      } else {
        // older clients do not know about SRE. Just throw USE, which they will handle
        throw new UnknownScannerException("Throwing UnknownScannerException to reset the client"
          + " scanner state for clients older than 1.3.", e);
      }
    } catch (IOException ioe) {
      throw new ServiceException(ioe);
    }
  } finally {
    if (!scannerClosed) {
      // Adding resets expiration time on lease.
      // the closeCallBack will be set in closeScanner so here we only care about shippedCallback
      if (rpcCall != null) {
        rpcCall.setCallBack(rsh.shippedCallback);
      } else {
        // If context is null,here we call rsh.shippedCallback directly to reuse the logic in
        // rsh.shippedCallback to release the internal resources in rsh,and lease is also added
        // back to regionserver's LeaseManager in rsh.shippedCallback.
        runShippedCallback(rsh);
      }
    }
    quota.close();
  }
}
Assume a RegionScanner contains a single StoreScanner; then the core reads of this RegionScanner are performed by that StoreScanner. Assume further that the StoreScanner is made up of four Scanners, as shown below:

Each Scanner has a current pointer at the next KV it will read, and the PriorityQueue in the KVHeap orders the Scanners by exactly those current KVs.
The first next request returns Row01:FamA:Col1 from ScannerA, after which ScannerA's pointer moves to the next KV, Row01:FamA:Col2; the Scanner ordering in the PriorityQueue is unchanged:

The second next request again reads from ScannerA, returning Row01:FamA:Col2, and ScannerA's pointer moves on to the next KV, Row02:FamA:Col1; now the Scanner ordering in the PriorityQueue changes:

The next next request will return the KV from ScannerB... and so on, until some Scanner runs out of data; that Scanner is then closed and no longer appears in the PriorityQueue.
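The walkthrough above can be reproduced as a runnable sketch (a hypothetical simplification: a Deque stands in for each Scanner and its current pointer, and plain string comparison stands in for the KeyValue comparator):

```java
import java.util.*;

// The walkthrough as code: two scanners whose current KVs drive a priority
// queue; successive next() calls return ScannerA's Row01 cells first, then
// fall through to ScannerB once the heap reorders.
public class NextWalkthrough {
    static List<String> drain(List<Deque<String>> scanners) {
        PriorityQueue<Deque<String>> heap =
            new PriorityQueue<>(Comparator.comparing(Deque::peekFirst));
        for (Deque<String> s : scanners) if (!s.isEmpty()) heap.add(s);
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Deque<String> top = heap.poll();   // scanner whose current KV is smallest
            out.add(top.pollFirst());          // one next() call
            if (!top.isEmpty()) heap.add(top); // re-sort by the new current KV
        }
        return out;
    }

    public static void main(String[] args) {
        Deque<String> scannerA = new ArrayDeque<>(
            List.of("Row01/FamA/Col1", "Row01/FamA/Col2", "Row02/FamA/Col1"));
        Deque<String> scannerB = new ArrayDeque<>(List.of("Row01/FamB/Col1"));
        System.out.println(drain(List.of(scannerA, scannerB)));
    }
}
```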