Why we care about this paper
LSM compaction is rarely examined or modified by developers; it is usually treated as a black box. We badly need both a mental model for opening that box and a benchmarking methodology and tooling to go with it. In our real-time online LSM storage services, we have already seen that compaction has a huge impact on performance.
The figure below shows the CPU profile of a write-heavy workload. The comb-like spikes are caused by L0 and bottommost-level compactions: each time a memtable is flushed to L0 in bulk, the default compaction parameters trigger one or more cascading compactions, which drives up the CPU level. We want a near-optimal combination of compaction parameters for this kind of write-heavy workload.
Fortunately, with our raft-learner experiment platform we can quickly clone the same workload and run tuning experiments. Candidate knobs:
- Universal compaction?
- rate limiter?
- write buffer? # no effect
Building the modified RocksDB
git clone https://github.com/dongdongwcpp/LSM-Compaction-Analysis.git
cd LSM-Compaction-Analysis
mkdir build && cd build
cmake ..
make
The build fails at link time:
[ 60%] Linking CXX executable c_test
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::PrintRRIndices(EmuEnv*)'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::AddNewLevel(int, EmuEnv*)'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `Stats::getInstance()'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::GetLevelDeletePersistenceLatency(int, EmuEnv*)'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::CheckingVector(unsigned long)'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::GetDumpedDeleteFileTimestamp()'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::DumpDeleteFileTimestamp(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, unsigned long, EmuEnv*)'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::getInstance()'
/root/.bin/ld: librocksdb.so.6.12.0: undefined reference to `EmuEnv::PopulatingVector(unsigned long)'
collect2: error: ld returned 1 exit status
Fixing the build
This happens because the CMake build (and the librocksdb.so shared-library target) in this fork is unmaintained. So skip cmake and just build the static library with make static_lib:
root@ubuntu:~/tmp/LSM-Compaction-Analysis# bear make static_lib -j10
A workload generator is needed
Next we need https://github.com/BU-DiSC/K-V-Workload-Generator.git to generate a workload; the generator writes the workload into a text file.
git clone https://github.com/BU-DiSC/K-V-Workload-Generator.git
make
./load_gen -I500000 -Q30000 # generates 500000 inserts and 30000 queries, writing the ops to workload.txt; the file looks like this:
I 3128772376 WGgx
I 3941535506 oikM
I 1332567874 XVPI
I 3324067961 XAQj
I 675010326 HExC
I 2446664524 tgSZ
I 2207874088 itZS
I 1043530926 IFvX
I 821974831 RkCi
I 288380693 tCcS
I 3497810118 eySx
I 3134029994 QiQp
I 3677169663 QkDL
I 186104787 NBcr
I 3272445626 oECs
I 1528462652 eFFY
Copying the workload to the benchmark tool
cp workload.txt ../examples/__working_branch/ # run from the K-V-Workload-Generator checkout
cd ../examples/__working_branch/
make
./working_version --compaction_style=2 --compaction_pri=4
working_version lets us specify multiple compaction styles and strategies, so at this point we can apparently reproduce the paper's experiments. For the analysis and experiments, nine representative compaction policies were selected, policies that are prevalent in LSM-based production and academic systems and that reflect the broad diversity of possible compaction designs. Table 2 of the paper encodes and introduces these candidate policies.
Full represents the classic compaction policy of a leveled LSM-tree: when invoked, it compacts two consecutive levels.
LO+1 and LO+2 denote two partial compaction routines that pick the file with the least overlap with the parent level (i+1) and the grandparent level (i+2), respectively.
RR picks files from each level for compaction in a round-robin fashion.
TSD and TSA are delete-driven compaction policies whose triggers and data-movement policies are determined by the density of tombstones and by the age of the oldest tombstone contained in a file, respectively.
Tier represents a variant of tiered compaction with a space-amplification trigger.
Finally, 1-Lvl represents a hybrid data layout in which the first level is implemented as tiering and the rest as leveling.
At the code level: everything from 0x4 upward was added by the paper's authors; stock RocksDB already ships a tombstone-biased compaction priority, kByCompensatedSize, which we should try, especially on index column families. After staring at the code for a long time I still could not work out what FADE abbreviates (it appears to come from the authors' earlier delete-aware Lethe work), which is quite frustrating.
// In Level-based compaction, it Determines which file from a level to be
// picked to merge to the next level. We suggest people try
// kMinOverlappingRatio first when you tune your database.
enum CompactionPri : char {
// Slightly prioritize larger files by size compensated by #deletes
kByCompensatedSize = 0x0,
// First compact files whose data's latest update time is oldest.
// Try this if you only update some hot keys in small ranges.
kOldestLargestSeqFirst = 0x1,
// First compact files whose range hasn't been compacted to the next level
// for the longest. If your updates are random across the key space,
// write amplification is slightly better with this option.
kOldestSmallestSeqFirst = 0x2,
// First compact files whose ratio between overlapping size in next level
// and its size is the smallest. It in many cases can optimize write
// amplification.
kMinOverlappingRatio = 0x3,
// Introducing FADE
kFADE = 0x4, // !YBS-aug17-XX!
// Introducing kRoundRobin
kRoundRobin = 0x5, // !YBS-sep06-XX!
// Introducing kMinOverlappingGrandparent
kMinOverlappingGrandparent = 0x6, // !YBS-sep07-XX!
// Introducing kFullLevel
kFullLevel = 0x7, // !YBS-sep08-XX!
};
Running and interpreting the results
The tool is run as ./working_version --verbosity=1 --compaction_style=2 --compaction_pri=5. Note that the --help output omits many of the data-movement options, but the underlying code supports them fully. The supported priorities are:
Compaction priority:
- 1 for kMinOverlappingRatio,
- 2 for kByCompensatedSize,
- 3 for kOldestLargestSeqFirst,
- 4 for kOldestSmallestSeqFirst,
- 5 for kFADE,
- 6 for kRoundRobin,
- 7 for kMinOverlappingGrandparent,
- 8 for kFullLevel;
Now run it. The output contains counters such as #I (number of inserts), #U (updates), #PD (point deletes), plus range deletes, point queries, and range queries. The workload is generated once into workload.txt and then reused across different compaction_style + compaction_pri combinations, which guarantees that the input workload stays fixed while we explore different points in the LSM design space.
root$ ./working_version --verbosity=1 --compaction_style=2 --compaction_pri=5
printing: max_bytes_for_level_base = 167772160 buffer_size = 16777216 size_ratio = 10
Destroying database ...
cmpt_sty cmpt_pri T P B E M file_size L1_size blk_cch BPK
2 5 10 4096 4 1024 16777216 16777216 167772160 8 10
#I #U #PD #RD #PQ #RQ
500001 100000 0 0 0
----------------------------------------
#I #U #PD #RD #PQ #RQ
500001 100000 0 0 0 0
#p_ts_in_tree #kv_in_tree #I_done L_in_tree #U_done #PD_done #cmpt #cmpt_easy fls_rd_cmpt fls_wr_cmpt bts_rd_cmpt bts_wr_cmpt
0 520936 500000 7 99999 -1 8 -1 37 34 502858614 502482819
files in tree = 20
%space_amp %write_amp exp_runtime (ms)
0 -1 356078.81
How did the authors modify the code to implement 8 compaction priorities?
In void VersionStorageInfo::UpdateFilesByCompactionPri, the files to be compacted are ordered differently according to the chosen strategy; the authors pass their parameters in through their own singleton environment, EmuEnv* _env = EmuEnv::getInstance();, which keeps the changes to RocksDB minimally invasive. Even from the code, I still could not tell what strategy kFADE actually implements.