Our initial plan was motivated by the popularity of embedded storage engines such as DuckDB, SQLite, and RocksDB, and by the growing number of Arrow-ecosystem data-processing components such as DataFusion. We therefore decided to build, with Arrow as the focus, an embedded LSM storage engine that supports user-defined structured data and uses the primary key in a KV-like fashion: Tonbo (とんぼ, dragonfly) - lightweight and flexible. We hope you will keep following the project and give it a star!
Website: tonbo.io/
GitHub - tonbo-io/tonbo: An embedded persistent database in Rust. The example below shows our current basic usage; the rest of this post walks through the main design points one by one.
Example
use std::ops::Bound;

use futures_util::stream::StreamExt;
use tonbo::{executor::tokio::TokioExecutor, tonbo_record, Projection, DB};

// use macro to define schema of column family just like ORM
// it provides type safety read & write API
#[tonbo_record]
pub struct User {
    #[primary_key]
    name: String,
    email: Option<String>,
    age: u8,
}

#[tokio::main]
async fn main() {
    // pluggable async runtime and I/O
    let db = DB::new("./db_path/users".into(), TokioExecutor::default())
        .await
        .unwrap();

    // insert with owned value
    db.insert(User {
        name: "Alice".into(),
        email: Some("alice@gmail.com".into()),
        age: 22,
    })
    .await
    .unwrap();

    {
        // tonbo supports transaction
        let txn = db.transaction().await;
        {
            let lower = "Alice".into();
            let upper = "Blob".into();
            // range scan of users
            let mut scan = txn
                .scan((Bound::Included(&lower), Bound::Excluded(&upper)))
                .await
                // tonbo supports pushing down projection
                .projection(vec![1])
                .take()
                .await
                .unwrap();
            while let Some(entry) = scan.next().await.transpose().unwrap() {
                assert_eq!(
                    entry.value(),
                    Some(UserRef {
                        name: "Alice",
                        email: Some("alice@gmail.com"),
                        age: Some(22),
                    })
                );
            }
        }
        // commit transaction
        txn.commit().await.unwrap();
    }
}
Filter Pushdown
Take TiDB's official documentation on key-value mapping as an example:
Each row of data is encoded into a (Key, Value) pair according to the following rule:
Key: tablePrefix{TableID}_recordPrefixSep{RowID}
Value: [col1, col2, col3, col4]
The primary key is encoded into the high-order part of the key, after the table prefix. Thanks to the ordinary ordering of a KV store, the encoded keys are directly memcomparable, so within a known primary-key range, part of the filter can be pushed down directly as a KV range scan. For example:
Suppose there is a table t1 with table id 0, and the SQL statement is Select * from t1 where id > 50 and id < 100;
It can be translated directly into a single KV call: range("tablePrefix_0_recordPrefixSep_50", "tablePrefix_0_recordPrefixSep_100")
Tonbo builds on this conventional design: it uses the primary key as the Key, so that when it serves as a storage engine for something like DataFusion, primary-key filters can be pushed down directly by detaching a range from the filter exprs. This is also why we chose this layout as Tonbo's storage structure.
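The mapping above can be sketched in a few lines of plain Rust. This is an illustrative toy, not Tonbo's actual API: `encode_key` and the `BTreeMap` standing in for the ordered KV layer are assumptions of the sketch. Big-endian encoding keeps byte order aligned with numeric order, so the keys are memcomparable and the SQL predicate collapses into a single ordered range call:

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Hypothetical TiDB-style key encoding:
// Key = tablePrefix{TableID}_recordPrefixSep{RowID}
// Big-endian integers make the encoded keys memcomparable.
fn encode_key(table_id: u64, row_id: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(19);
    key.extend_from_slice(b"t"); // tablePrefix
    key.extend_from_slice(&table_id.to_be_bytes());
    key.extend_from_slice(b"_r"); // recordPrefixSep
    key.extend_from_slice(&row_id.to_be_bytes());
    key
}

fn main() {
    // a sorted map standing in for the ordered KV layer of an LSM tree
    let mut kv: BTreeMap<Vec<u8>, String> = BTreeMap::new();
    for row_id in [10u64, 50, 75, 99, 100, 150] {
        kv.insert(encode_key(0, row_id), format!("row-{row_id}"));
    }

    // SELECT * FROM t1 WHERE id > 50 AND id < 100  ->  one range call
    let rows: Vec<String> = kv
        .range((
            Bound::Excluded(encode_key(0, 50)),
            Bound::Excluded(encode_key(0, 100)),
        ))
        .map(|(_, v)| v.clone())
        .collect();
    assert_eq!(rows, ["row-75", "row-99"]);
}
```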
Projection Pushdown
Traditional KV databases such as RocksDB usually serialize each whole row as the Value, an approach widely used in OLTP databases. In OLAP scenarios, however, databases typically operate on wide tables and may query only a few columns, so column-pruning optimizations go much deeper than in OLTP. This is why storage file formats like Parquet emerged and became widely adopted across the Arrow ecosystem.
RocksDB cannot adapt well to this: it still has to read out the Value and fully deserialize it before extracting the needed columns. Tonbo differs from a conventional KV store in that it is built entirely on Arrow and Parquet as its data-processing structures. It generates the corresponding schema from the user-defined TonboRecord struct, uses the primary key as the leading column, and implements value lookups on top of Parquet's RowFilter and PageIndex features. At query time it can also exploit Parquet's projection pushdown to reduce I/O, and the columnar encoding Parquet brings yields better compression ratios as well.
It is worth noting that Tonbo currently uses no external index files to accelerate queries and tries to minimize dependencies beyond Parquet. The hope is that once remote storage on S3 is implemented later, users can treat Tonbo purely as a data source, while other data-analysis tools read the Parquet files directly, avoiding data silos.
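A toy sketch of the contrast (the types and field layout here are illustrative only, not Tonbo's API): with row-serialized values, reading one field still pays for decoding the whole row, while a columnar layout lets a projection touch only the requested column.

```rust
// Row layout: every read of one field must decode the entire value first.
fn row_read_email(serialized_rows: &[String]) -> Vec<String> {
    serialized_rows
        .iter()
        .map(|row| {
            // parse the whole "name,email,age" value just to get one field
            let fields: Vec<&str> = row.split(',').collect();
            fields[1].to_string()
        })
        .collect()
}

// Columnar layout: each column is stored contiguously and independently.
struct UserColumns {
    names: Vec<String>,
    emails: Vec<String>,
    ages: Vec<u8>,
}

fn main() {
    let rows = vec![
        "Alice,alice@gmail.com,22".to_string(),
        "Bob,bob@gmail.com,31".to_string(),
    ];
    let cols = UserColumns {
        names: vec!["Alice".into(), "Bob".into()],
        emails: vec!["alice@gmail.com".into(), "bob@gmail.com".into()],
        ages: vec![22, 31],
    };
    // same result either way...
    assert_eq!(row_read_email(&rows), cols.emails);
    // ...but the columnar read only touched `emails`; `names` and `ages`
    // could stay on disk entirely, as separate Parquet column chunks.
    let _ = (cols.names, cols.ages);
}
```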
Data Extension
DataFusion has been a primary integration target since the very start of our design, so much of the design revolves around supporting it. We also provide a Tonbo x DataFusion example:
#[tonbo_record]
pub struct Music {
    #[primary_key]
    id: u64,
    name: String,
    like: i64,
}

#[tokio::main]
async fn main() -> Result<()> {
    let db = DB::new("./db_path/music".into(), TokioExecutor::default())
        .await
        .unwrap();
    for (id, name, like) in vec![
        (0, "welcome".to_string(), 0),
        (1, "tonbo".to_string(), 999),
        (2, "star".to_string(), 233),
        (3, "plz".to_string(), 2),
    ] {
        db.insert(Music { id, name, like }).await.unwrap();
    }

    let ctx = SessionContext::new();
    let provider = MusicProvider { db: Arc::new(db) };
    ctx.register_table("music", Arc::new(provider))?;

    let df = ctx.table("music").await?;
    let df = df.select(vec![col("name")])?;
    let batches = df.collect().await?;
    pretty::print_batches(&batches).unwrap();
    Ok(())
}
struct MusicProvider {
    db: Arc<DB<Music, TokioExecutor>>,
}

struct MusicExec {
    cache: PlanProperties,
    db: Arc<DB<Music, TokioExecutor>>,
    projection: Option<Vec<usize>>,
    limit: Option<usize>,
    range: (Bound<<Music as Record>::Key>, Bound<<Music as Record>::Key>),
}

struct MusicStream {
    stream: Pin<Box<dyn Stream<Item = Result<RecordBatch, DataFusionError>> + Send>>,
}

#[async_trait]
impl TableProvider for MusicProvider {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        Music::arrow_schema().clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        _: &SessionState,
        projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        let mut exec = MusicExec::new(self.db.clone());
        // TODO: filters to range detach
        // exec.range =
        exec.projection = projection.cloned();
        if let Some(projection) = exec.projection.as_mut() {
            // shift indices past tonbo's internal leading columns
            for index in projection {
                *index = index.checked_sub(2).unwrap_or(0);
            }
        }
        exec.limit = limit;
        Ok(Arc::new(exec))
    }
}
impl MusicExec {
    fn new(db: Arc<DB<Music, TokioExecutor>>) -> Self {
        MusicExec {
            cache: PlanProperties::new(
                EquivalenceProperties::new_with_orderings(Music::arrow_schema().clone(), &[]),
                datafusion::physical_expr::Partitioning::UnknownPartitioning(1),
                ExecutionMode::Unbounded,
            ),
            db,
            projection: None,
            limit: None,
            range: (Bound::Unbounded, Bound::Unbounded),
        }
    }
}

impl Stream for MusicStream {
    type Item = Result<RecordBatch>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        pin!(&mut self.stream).poll_next(cx)
    }
}

impl RecordBatchStream for MusicStream {
    fn schema(&self) -> SchemaRef {
        Music::arrow_schema().clone()
    }
}

impl DisplayAs for MusicExec {
    fn fmt_as(&self, _: DisplayFormatType, f: &mut Formatter) -> std::fmt::Result {
        let (lower, upper) = self.range;
        write!(
            f,
            "MusicExec: range:({:?}, {:?}), projection: [{:?}], limit: {:?}",
            lower, upper, self.projection, self.limit
        )
    }
}

impl Debug for MusicExec {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("MusicExec")
            .field("cache", &self.cache)
            .field("limit", &self.limit)
            .field("projection", &self.projection)
            .field("range", &self.range)
            .finish()
    }
}
impl ExecutionPlan for MusicExec {
    fn name(&self) -> &str {
        "MusicExec"
    }

    fn as_any(&self) -> &dyn Any {
        self
    }

    fn properties(&self) -> &PlanProperties {
        &self.cache
    }

    fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {
        vec![]
    }

    fn with_new_children(
        self: Arc<Self>,
        children: Vec<Arc<dyn ExecutionPlan>>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        if children.is_empty() {
            Ok(self)
        } else {
            internal_err!("Children cannot be replaced in {self:?}")
        }
    }

    fn execute(&self, _: usize, _: Arc<TaskContext>) -> Result<SendableRecordBatchStream> {
        let db = self.db.clone();
        let (lower, upper) = self.range.clone();
        let limit = self.limit.clone();
        let projection = self.projection.clone();

        Ok(Box::pin(MusicStream {
            stream: Box::pin(stream! {
                let txn = db.transaction().await;
                let mut scan = txn
                    .scan((lower.as_ref(), upper.as_ref()))
                    .await;
                if let Some(limit) = limit {
                    scan = scan.limit(limit);
                }
                if let Some(projection) = projection {
                    scan = scan.projection(projection.clone());
                }
                let mut scan = scan.package(8192).await.map_err(|err| DataFusionError::Internal(err.to_string()))?;
                while let Some(record) = scan.next().await {
                    yield Ok(record?.as_record_batch().clone())
                }
            }),
        }))
    }
}
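The `// TODO: filters to range detach` in `scan` above can be sketched without DataFusion. This is a simplified stand-alone sketch: the `Predicate` enum stands in for DataFusion's `Expr`, and a real implementation would pattern-match `Expr::BinaryExpr` on the primary-key column instead. The idea is to fold the comparison predicates into a single `(Bound, Bound)` pair, keeping the tightest bound on each side.

```rust
use std::ops::Bound;

// Stand-in for filter exprs on the primary key (illustrative only).
#[derive(Clone, Copy)]
enum Predicate {
    Gt(u64),   // id > v
    GtEq(u64), // id >= v
    Lt(u64),   // id < v
    LtEq(u64), // id <= v
}

// Fold a conjunction of predicates into one scan range.
fn detach_range(filters: &[Predicate]) -> (Bound<u64>, Bound<u64>) {
    let mut lower = Bound::Unbounded;
    let mut upper = Bound::Unbounded;
    for f in filters {
        match *f {
            Predicate::Gt(v) => lower = tighten_lower(lower, Bound::Excluded(v)),
            Predicate::GtEq(v) => lower = tighten_lower(lower, Bound::Included(v)),
            Predicate::Lt(v) => upper = tighten_upper(upper, Bound::Excluded(v)),
            Predicate::LtEq(v) => upper = tighten_upper(upper, Bound::Included(v)),
        }
    }
    (lower, upper)
}

fn bound_value(b: &Bound<u64>) -> Option<u64> {
    match b {
        Bound::Included(v) | Bound::Excluded(v) => Some(*v),
        Bound::Unbounded => None,
    }
}

// Keep whichever lower bound cuts off more of the key space.
fn tighten_lower(cur: Bound<u64>, new: Bound<u64>) -> Bound<u64> {
    match (bound_value(&cur), bound_value(&new)) {
        (None, _) => new,
        (Some(c), Some(n)) if n > c || (n == c && matches!(new, Bound::Excluded(_))) => new,
        _ => cur,
    }
}

fn tighten_upper(cur: Bound<u64>, new: Bound<u64>) -> Bound<u64> {
    match (bound_value(&cur), bound_value(&new)) {
        (None, _) => new,
        (Some(c), Some(n)) if n < c || (n == c && matches!(new, Bound::Excluded(_))) => new,
        _ => cur,
    }
}

fn main() {
    // WHERE id > 50 AND id < 100
    let (lo, hi) = detach_range(&[Predicate::Gt(50), Predicate::Lt(100)]);
    assert_eq!(lo, Bound::Excluded(50));
    assert_eq!(hi, Bound::Excluded(100));
}
```

The resulting pair could then be assigned to `exec.range` and passed straight to `txn.scan` in `execute`.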
Finally
Tonbo is still in an early exploratory stage. We are committed to making it a general-purpose intermediate storage engine for AP workloads, and we would greatly appreciate more suggestions and contributions!