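Background

We have a set of logs in JSON format, and every log entry carries a sid. The relationship between sids and log entries is one-to-many: each sid corresponds to 1 to 4 log entries. The goal is to retrieve all log entries for a given sid quickly.
This article implements that with Hadoop MapFile.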

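Architecture

The blue modules in the figure are the parts we need to implement.

Index building: convert the JSON logs into MapFile format;
Query service: retrieve the logs for a given sid from the MapFile.

These correspond to "writing" and "reading" a MapFile, respectively.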

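Writing the MapFile

This article uses Spark, and the write is implemented with MapFileOutputFormat.
The core code is as follows (to keep the main logic easy to follow, some details are omitted or simplified):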

// Let rdd be a dataset already processed into (sid, log) key-value pairs
rdd.reduceByKey((log1, log2) => logMerge(log1, log2), 3) // merge the logs of each sid (logMerge's implementation is omitted); the partition count is set to 3 by hand
  .map(x => {
    val sid = x._1
    val logs = x._2
    (sid, logs.toString) // simplified conversion of logs to a string
  })
  .mapPartitions(x => x.toList.sortWith((a, b) => a._1.compareTo(b._1) < 0).iterator) // sort each partition by key in ascending order
  .map(x => (new Text(x._1), new Text(x._2))) // convert to types supported by MapFileOutputFormat
  .saveAsNewAPIHadoopFile(outputDir, classOf[Text], classOf[Text], classOf[MapFileOutputFormat])

Because reduceByKey was given a partition count of 3, the local test produced 3 directories (part-r-00000, part-r-00001, part-r-00002). Each directory contains an index file and a data file.
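To make the "merge, then sort by key" step concrete without a Spark cluster, here is a minimal plain-Java sketch of what happens to the records of a single partition. The class name PartitionSortSketch and the newline-joining logMerge are hypothetical stand-ins (the real logMerge is not shown in this article); a TreeMap keeps keys in String.compareTo order, the same ordering the sortWith call uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionSortSketch {

    // Hypothetical stand-in for logMerge: joins two log strings with a newline.
    static String logMerge(String log1, String log2) {
        return log1 + "\n" + log2;
    }

    // Merges the logs that share a sid and returns the entries in ascending sid order.
    // Each input record is a two-element array: {sid, log}.
    static List<Map.Entry<String, String>> mergeAndSort(List<String[]> records) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (String[] record : records) {
            merged.merge(record[0], record[1], PartitionSortSketch::logMerge);
        }
        return new ArrayList<>(merged.entrySet());
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"sid-b", "{\"step\":1}"});
        records.add(new String[] {"sid-a", "{\"step\":1}"});
        records.add(new String[] {"sid-b", "{\"step\":2}"});
        for (Map.Entry<String, String> entry : mergeAndSort(records)) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().replace("\n", " | "));
        }
    }
}
```

The ordering matters because a MapFile writer expects keys to be appended in ascending order; an out-of-order key is rejected with an IOException.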

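Compression

The data files produced by the implementation above are not compressed. To enable compression, pass a Configuration to MapFileOutputFormat. The core code is as follows: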

val conf: Configuration = new Configuration()
val codec = "org.apache.hadoop.io.compress.SnappyCodec"
conf.set("mapreduce.output.fileoutputformat.compress", "true")
conf.set("mapreduce.output.fileoutputformat.compress.type", CompressionType.BLOCK.toString) // "BLOCK" as string
conf.set("mapreduce.output.fileoutputformat.compress.codec", codec)

...saveAsNewAPIHadoopFile(outputDir, classOf[Text], classOf[Text], classOf[MapFileOutputFormat], conf)

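The compression codec specified here is Snappy, which depends on the Hadoop native library to run.
The local tests in this article were run on macOS, while the query service described later (which needs to "read" the MapFile) is deployed on a separate Linux machine. Neither is a cluster environment, so the Hadoop native library has to be set up manually on both.
The Hadoop native library can be built from the source of the matching Hadoop version on the machine where the code will run; prebuilt copies can also be found online.
"mac hadoop native lib" provides native libraries for several Hadoop versions on macOS. Of these, the hadoop-2.7.3 native library also works with hadoop-2.6.0 code as far as Snappy compression is concerned.
"linux hadoop native lib" is the official release download location. Taking hadoop-2.6.0 as an example, download and extract the hadoop-2.6.0.tar.gz shown in the figure below, and the Hadoop native library can be found inside.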

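We need to add the directory containing the Hadoop native library to java.library.path, as in the example below. Configured this way, the setting completely overrides the default value of java.library.path.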

-Djava.library.path=xxx/hadoop_home/lib/native
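Since this flag replaces the default search path entirely, it can help to confirm what the JVM actually sees. The following is a small sketch using only the JDK; NativeLibCheck and findLibraryDir are hypothetical names, and it assumes the native library's base name is "hadoop" (libhadoop.so on Linux, libhadoop.dylib on macOS). It only checks that the file is present, not that it loads or that it was built with Snappy support.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

final class NativeLibCheck {

    // Returns the first directory in libraryPath that contains the platform-specific
    // file for libName, or Optional.empty() if no directory contains it.
    static Optional<Path> findLibraryDir(String libraryPath, String libName) {
        // Maps "hadoop" to the platform's file name, e.g. libhadoop.so on Linux.
        String fileName = System.mapLibraryName(libName);
        for (String dir : libraryPath.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            if (Files.isRegularFile(Paths.get(dir, fileName))) {
                return Optional.of(Paths.get(dir));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path", "");
        System.out.println("java.library.path = " + libraryPath);
        System.out.println("hadoop native lib directory: " + findLibraryDir(libraryPath, "hadoop"));
    }
}
```

Running it with the same -Djava.library.path value as the real job shows whether the directory was picked up.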

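Reading the MapFile

Our ultimate goal is to retrieve the logs for a given sid. As shown above, the output consists of 3 directories, so which directory holds a given sid?

Partitioning of sids

In the test above, the test data contained only 3 sids. The following code prints the sids in each directory.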

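Configuration conf = new Configuration();
List<String> dirs = Lists.newArrayList("part-r-00000", "part-r-00001", "part-r-00002"); // Lists is from Guava

for (String dir : dirs) {
  // MapFile.Reader is Closeable, so try-with-resources releases the index and data files
  try (MapFile.Reader reader = new MapFile.Reader(new Path("file:///xxx/" + dir), conf)) {
    Text k = new Text();
    Text v = new Text();
    // next() walks the entries in key order and returns false at the end of the file
    while (reader.next(k, v)) {
      System.out.println(dir + ": " + k);
    }
  }
}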

The output is as follows:

part-r-00000: 1bc30a2b-deb5-46e1-bcc8-abbb68d432d5-1634017585538
part-r-00000: b4f5d876-8977-43a9-b728-62fb34f9ffdb-1625576229860
part-r-00002: b4f5d876-8977-43a9-b728-62fb34f9ffdb-1625576229856 // differs from the previous sid only in the last two digits

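As we can see, part-r-00000 contains 2 sids, part-r-00002 contains 1 sid, and part-r-00001 contains none.

Each directory in fact corresponds to one partition after the reduceByKey shuffle, and reduceByKey partitions by sid (the sids within each partition are then sorted and written out as a MapFile). The Partitioner used by reduceByKey here is HashPartitioner, which computes the partition number as follows: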

def getPartition(key: Any): Int = key match {
  case null => 0
  case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

Our partition count is 3. Computing nonNegativeMod(sid.hashCode(), 3) in Java for the three sids gives exactly two 0s and one 2.

1bc30a2b-deb5-46e1-bcc8-abbb68d432d5-1634017585538: 0
b4f5d876-8977-43a9-b728-62fb34f9ffdb-1625576229860: 0
b4f5d876-8977-43a9-b728-62fb34f9ffdb-1625576229856: 2
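The calculation can be reproduced with a few lines of Java, sketched below. It relies on a Scala String being a java.lang.String, so hashCode is identical on both sides; the class name SidPartition and the method partitionOf are hypothetical, while nonNegativeMod mirrors the Spark helper quoted above.

```java
final class SidPartition {

    // Java port of Spark's Utils.nonNegativeMod: a modulo that never returns a negative value.
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    // Mirrors HashPartitioner.getPartition for String keys: null keys go to partition 0.
    static int partitionOf(String sid, int numPartitions) {
        return sid == null ? 0 : nonNegativeMod(sid.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String[] sids = {
            "1bc30a2b-deb5-46e1-bcc8-abbb68d432d5-1634017585538",
            "b4f5d876-8977-43a9-b728-62fb34f9ffdb-1625576229860",
            "b4f5d876-8977-43a9-b728-62fb34f9ffdb-1625576229856"
        };
        for (String sid : sids) {
            System.out.println(sid + ": " + partitionOf(sid, 3));
        }
    }
}
```

The non-negative adjustment is needed because String.hashCode can be negative, and Java's % operator keeps the sign of the dividend.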

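Retrieving by sid

Retrieving logs by sid takes two steps.

First, compute the partition number from the sid to find the directory that holds it.
Then, use MapFile.Reader to look up the log in that directory.

The core code is as follows: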

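FileStatus[] dir = fs.globStatus(new Path("xxx/part-*")); // globStatus takes a Path, not a String
Arrays.sort(dir); // order by path so that index i is partition i
int part = nonNegativeMod(sid.hashCode(), dir.length); // same formula as HashPartitioner
Path hashPath = dir[part].getPath();

try (MapFile.Reader reader = new MapFile.Reader(hashPath, conf)) {
  Text k = new Text(sid);
  Text v = new Text();
  // get() returns null when the key is absent from this MapFile
  if (reader.get(k, v) != null) {
    System.out.println(v);
  }
}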
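The directory-selection step can be exercised on its own, without Hadoop, as in the sketch below. PartDirSelector and selectDir are hypothetical names, and the directory names are passed in as plain strings rather than taken from a FileSystem listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PartDirSelector {

    // Picks the part directory for a sid: sort the names, then index by the hash partition.
    static String selectDir(String sid, List<String> partDirNames) {
        if (partDirNames.isEmpty()) {
            throw new IllegalArgumentException("no part directories");
        }
        List<String> sorted = new ArrayList<>(partDirNames);
        Collections.sort(sorted);
        int rawMod = sid.hashCode() % sorted.size();
        int part = rawMod + (rawMod < 0 ? sorted.size() : 0);
        return sorted.get(part);
    }

    public static void main(String[] args) {
        // A directory listing may come back in any order; sorting restores partition order.
        List<String> listing = Arrays.asList("part-r-00002", "part-r-00000", "part-r-00001");
        String sid = "b4f5d876-8977-43a9-b728-62fb34f9ffdb-1625576229856";
        System.out.println(sid + " -> " + selectDir(sid, listing));
    }
}
```

Two assumptions are worth keeping in mind. Lexicographic order matches partition order only because the names are zero-padded to the same width. The number of directories must also equal the partition count used at write time, so a missing or extra part directory silently shifts the mapping.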

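More on sorting

Misuse of sortByKey

sortByKey() sorts globally by key. Its Partitioner is RangePartitioner rather than HashPartitioner, and that partitioner also has to sample the data. With range partitioning, the directory that holds a sid can no longer be derived from the sid's hash, so the lookup described above would not work.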
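The difference between the two partitioning schemes can be illustrated with a simplified sketch. This is not Spark's RangePartitioner implementation: the boundaries "g" and "p" are made-up values standing in for bounds that Spark would derive by sampling, and PartitionerContrast is a hypothetical class name.

```java
import java.util.Arrays;

final class PartitionerContrast {

    // Hash partitioning: the partition depends only on the key and the partition count.
    static int hashPartition(String key, int numPartitions) {
        int rawMod = key.hashCode() % numPartitions;
        return rawMod + (rawMod < 0 ? numPartitions : 0);
    }

    // Simplified range partitioning: keys up to and including bound i go to partition i,
    // and keys above the last bound go to the final partition.
    static int rangePartition(String key, String[] sortedUpperBounds) {
        int pos = Arrays.binarySearch(sortedUpperBounds, key);
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] bounds = {"g", "p"}; // hypothetical sampled bounds, giving 3 partitions
        for (String key : new String[] {"a", "h", "z"}) {
            System.out.println(key + ": hash=" + hashPartition(key, 3)
                + " range=" + rangePartition(key, bounds));
        }
    }
}
```

A reader can recompute the hash partition from the sid alone, whereas the range partition also requires the bounds, which are not stored alongside the MapFile output.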

repartitionAndSortWithinPartitions

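Above, mapPartitions sorts the data within each partition at the application level, relying entirely on on-heap memory (x.toList materializes the whole partition). We could instead consider repartitionAndSortWithinPartitions, which relies on Spark's own machinery and brings some optimization of memory usage (I am not clear on the detailed mechanism).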
