Dev Diary - Tonbo, a Structured Storage Engine


Our initial plan was motivated by the popularity of embedded storage engines such as DuckDB, SQLite, and RocksDB, and by the growing number of Arrow-ecosystem data-processing components such as DataFusion. We therefore decided to build, with Arrow at its core, an embedded LSM storage engine that supports structured schema definitions and uses the primary key much like a KV key: Tonbo (とんぼ, "dragonfly") - lightweight and flexible, with clustering in mind. We hope you will follow the project and give it a star!

Website: tonbo.io/

Repository: github.com/tonbo-io/to…

The example below shows our current basic usage; the rest of this post walks through the main design points one by one.

Example

use std::ops::Bound;

use futures_util::stream::StreamExt;
use tonbo::{executor::tokio::TokioExecutor, tonbo_record, Projection, DB};

// use macro to define schema of column family just like ORM
// it provides type safety read & write API
#[tonbo_record]
pub struct User {
    #[primary_key]
    name: String,
    email: Option<String>,
    age: u8,
}

#[tokio::main]
async fn main() {
    // pluggable async runtime and I/O
    let db = DB::new("./db_path/users".into(), TokioExecutor::default())
        .await
        .unwrap();

    // insert with owned value
    db.insert(User {
        name: "Alice".into(),
        email: Some("alice@gmail.com".into()),
        age: 22,
    })
    .await
    .unwrap();

    {
        // tonbo supports transaction
        let txn = db.transaction().await;
        {
            let lower = "Alice".into();
            let upper = "Blob".into();
            // range scan of users
            let mut scan = txn
                .scan((Bound::Included(&lower), Bound::Excluded(&upper)))
                .await
                // tonbo supports pushing down projection
                .projection(vec![1])
                .take()
                .await
                .unwrap();
            while let Some(entry) = scan.next().await.transpose().unwrap() {
                assert_eq!(
                    entry.value(),
                    Some(UserRef {
                        name: "Alice",
                        email: Some("alice@gmail.com"),
                        age: Some(22),
                    })
                );
            }
        }

        // commit transaction
        txn.commit().await.unwrap();
    }
}

Filter Pushdown

Take the key-value mapping described in TiDB's official documentation as an example:

每行数据按照如下规则编码成 (Key, Value) 键值对:
Key: tablePrefix{TableID}_recordPrefixSep{RowID}
Value: [col1, col2, col3, col4]

The primary key is encoded as the main part of the key, in the high-order position right after the table prefix. Since a KV store keeps keys in sorted order, the encoded keys are memcomparable, so over a known primary-key range part of the filter can be pushed down directly as a single KV range scan. For example:

Suppose there is a table t1 with table id 0, and the SQL statement is SELECT * FROM t1 WHERE id > 50 AND id < 100;
This can be translated directly into a single KV call: range("tablePrefix_0_recordPrefixSep_50", "tablePrefix_0_recordPrefixSep_100")
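The memcomparable property can be sketched in a few lines of Rust. The prefix and separator bytes below are illustrative placeholders, not TiDB's or Tonbo's actual encoding; the key point is that big-endian integers keep byte order aligned with numeric order:

```rust
// Hypothetical encoding following the quoted TiDB layout:
// tablePrefix{TableID}_recordPrefixSep{RowID}.
// Big-endian ids make byte-wise comparison match numeric comparison.
fn encode_key(table_id: u64, row_id: u64) -> Vec<u8> {
    let mut key = b"t".to_vec();                  // tablePrefix (illustrative)
    key.extend_from_slice(&table_id.to_be_bytes());
    key.extend_from_slice(b"_r");                 // recordPrefixSep (illustrative)
    key.extend_from_slice(&row_id.to_be_bytes());
    key
}

fn main() {
    // byte-wise order of the encoded keys matches row-id order,
    // so a KV range scan covers exactly the primary-key range
    assert!(encode_key(0, 50) < encode_key(0, 100));
    assert!(encode_key(0, 100) < encode_key(1, 0));
    println!("memcomparable: ok");
}
```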

Tonbo builds on this conventional design. Using the primary key as the key means that when Tonbo serves as the storage engine beneath a query engine like DataFusion, primary-key filters can be pushed down by detaching a scan range from the filter expressions. This is also why we chose this layout as Tonbo's storage structure.
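As a rough illustration of "range detach", the sketch below folds primary-key comparisons into a single scan range. The `Predicate` enum is a hypothetical stand-in for DataFusion's `Expr`, and this simplified version just lets the last bound on each side win:

```rust
use std::ops::Bound;

// Hypothetical, simplified predicate type standing in for DataFusion's `Expr`.
#[derive(Clone)]
enum Predicate {
    Gt(u64),
    GtEq(u64),
    Lt(u64),
    LtEq(u64),
}

// "Range detach": collapse primary-key comparisons into one scan range.
// In this sketch the last bound seen on each side wins.
fn detach_range(preds: &[Predicate]) -> (Bound<u64>, Bound<u64>) {
    let mut lower = Bound::Unbounded;
    let mut upper = Bound::Unbounded;
    for p in preds {
        match *p {
            Predicate::Gt(v) => lower = Bound::Excluded(v),
            Predicate::GtEq(v) => lower = Bound::Included(v),
            Predicate::Lt(v) => upper = Bound::Excluded(v),
            Predicate::LtEq(v) => upper = Bound::Included(v),
        }
    }
    (lower, upper)
}

fn main() {
    // `id > 50 AND id < 100` becomes one KV range scan
    let range = detach_range(&[Predicate::Gt(50), Predicate::Lt(100)]);
    assert_eq!(range, (Bound::Excluded(50), Bound::Excluded(100)));
    println!("{:?}", range);
}
```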

Projection Pushdown

Traditional KV stores such as RocksDB usually serialize each whole row as the value. This works well in many OLTP databases, but in OLAP scenarios queries often run over wide tables while touching only a handful of columns, so column pruning matters far more than in OLTP. This is why columnar file formats such as Parquet emerged and are now widely used across the Arrow ecosystem.

RocksDB cannot adapt well to this: it still has to read the value and fully deserialize it before extracting the needed columns. Tonbo differs from a conventional KV store in that it is built entirely on Arrow and Parquet as its data-processing structures. It generates an Arrow schema from the user-defined Tonbo record struct, uses the primary key as the leading column, and implements value lookups on top of Parquet's RowFilter and PageIndex features. At query time it can exploit Parquet's projection pushdown to reduce I/O, and the columnar encoding Parquet brings also yields better compression ratios.
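The contrast can be sketched with a toy example: under a row layout every row must be deserialized just to extract one column, while a columnar layout lets a projection touch only the requested column. All names and the CSV-ish row encoding here are illustrative, not Tonbo's actual formats:

```rust
// Row layout (RocksDB-style): each entry is one fully serialized row,
// so projecting a single column still parses every row end to end.
fn project_age_from_rows(rows: &[String]) -> Vec<u8> {
    rows.iter()
        .map(|row| {
            // must walk the whole "name,email,age" row to reach `age`
            row.split(',').nth(2).unwrap().parse().unwrap()
        })
        .collect()
}

// Columnar layout (Parquet/Arrow-style): each column is contiguous,
// so a projection on `age` never touches `name` or `email` at all.
struct ColumnStore {
    name: Vec<String>,
    email: Vec<Option<String>>,
    age: Vec<u8>,
}

impl ColumnStore {
    fn project_age(&self) -> &[u8] {
        &self.age
    }
}

fn main() {
    let rows = vec!["Alice,alice@gmail.com,22".to_string()];
    let cols = ColumnStore {
        name: vec!["Alice".into()],
        email: vec![Some("alice@gmail.com".into())],
        age: vec![22],
    };
    assert_eq!(project_age_from_rows(&rows), vec![22]);
    assert_eq!(cols.project_age(), &[22]);
}
```

In a real Parquet file this difference compounds: unneeded column chunks are skipped at the I/O level, not merely after decoding.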

It is worth noting that Tonbo currently does not use external index files to accelerate queries, and tries to minimize dependencies beyond Parquet. The hope is that, once remote storage on S3 is implemented, users can treat Tonbo purely as a data source, letting other analysis tools consume the Parquet files directly and avoiding data silos.

Data Extension

DataFusion has been a primary integration target since the very beginning of Tonbo's design, and many decisions were made around supporting it. We also provide a Tonbo x DataFusion example:

#[tonbo_record]
pub struct Music {
    #[primary_key]
    id: u64,
    name: String,
    like: i64,
}

#[tokio::main]
async fn main() -> Result<()> {
    let db = DB::new("./db_path/music".into(), TokioExecutor::default())
        .await
        .unwrap();
    for (id, name, like) in vec![
        (0, "welcome".to_string(), 0),
        (1, "tonbo".to_string(), 999),
        (2, "star".to_string(), 233),
        (3, "plz".to_string(), 2),
    ] {
        db.insert(Music { id, name, like }).await.unwrap();
    }
    let ctx = SessionContext::new();

    let provider = MusicProvider { db: Arc::new(db) };
    ctx.register_table("music", Arc::new(provider))?;

    let df = ctx.table("music").await?;
    let df = df.select(vec![col("name")])?;
    let batches = df.collect().await?;
    pretty::print_batches(&batches).unwrap();
    Ok(())
}

struct MusicProvider {
    db: Arc<DB<Music, TokioExecutor>>,
}

struct MusicExec {
    cache: PlanProperties,
    db: Arc<DB<Music, TokioExecutor>>,
    projection: Option<Vec<usize>>,
    limit: Option<usize>,
    range: (Bound<<Music as Record>::Key>, Bound<<Music as Record>::Key>),
}

struct MusicStream {
    stream: Pin<Box<dyn Stream<Item = Result<RecordBatch, DataFusionError>> + Send>>,
}

#[async_trait]
impl TableProvider for MusicProvider {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        Music::arrow_schema().clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        _: &SessionState,
        projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        let mut exec = MusicExec::new(self.db.clone());

        // TODO: filters to range detach
        // exec.range =
        exec.projection = projection.cloned();
        if let Some(projection) = exec.projection.as_mut() {
            for index in projection {
                // shift user-facing column indices past the internal
                // leading columns of Tonbo's generated schema
                *index = index.checked_sub(2).unwrap_or(0);
            }
        }

        exec.limit = limit;

        Ok(Arc::new(exec))
    }
}

impl MusicExec {
    fn new(db: Arc<DB<Music, TokioExecutor>>) -> Self {
        MusicExec {
            cache: PlanProperties::new(
                EquivalenceProperties::new_with_orderings(Music::arrow_schema().clone(), &[]),
                datafusion::physical_expr::Partitioning::UnknownPartitioning(1),
                ExecutionMode::Unbounded,
            ),
            db,
            projection: None,
            limit: None,
            range: (Bound::Unbounded, Bound::Unbounded),
        }
    }
}

impl Stream for MusicStream {
    type Item = Result<RecordBatch>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        pin!(&mut self.stream).poll_next(cx)
    }
}

impl RecordBatchStream for MusicStream {
    fn schema(&self) -> SchemaRef {
        Music::arrow_schema().clone()
    }
}

impl DisplayAs for MusicExec {
    fn fmt_as(&self, _: DisplayFormatType, f: &mut Formatter) -> std::fmt::Result {
        let (lower, upper) = self.range;

        write!(
            f,
            "MusicExec: range:({:?}, {:?}), projection: [{:?}], limit: {:?}",
            lower, upper, self.projection, self.limit
        )
    }
}

impl Debug for MusicExec {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("MusicExec")
            .field("cache", &self.cache)
            .field("limit", &self.limit)
            .field("projection", &self.projection)
            .field("range", &self.range)
            .finish()
    }
}

impl ExecutionPlan for MusicExec {
    fn name(&self) -> &str {
        "MusicExec"
    }

    fn as_any(&self) -> &dyn Any {
        self
    }

    fn properties(&self) -> &PlanProperties {
        &self.cache
    }

    fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {
        vec![]
    }

    fn with_new_children(
        self: Arc<Self>,
        children: Vec<Arc<dyn ExecutionPlan>>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        if children.is_empty() {
            Ok(self)
        } else {
            internal_err!("Children cannot be replaced in {self:?}")
        }
    }

    fn execute(&self, _: usize, _: Arc<TaskContext>) -> Result<SendableRecordBatchStream> {
        let db = self.db.clone();
        let (lower, upper) = self.range.clone();
        let limit = self.limit;
        let projection = self.projection.clone();

        Ok(Box::pin(MusicStream {
            stream: Box::pin(stream! {
                let txn = db.transaction().await;

                let mut scan = txn
                    .scan((lower.as_ref(), upper.as_ref()))
                    .await;
                if let Some(limit) = limit {
                    scan = scan.limit(limit);
                }
                if let Some(projection) = projection {
                    scan = scan.projection(projection.clone());
                }
                let mut scan = scan.package(8192).await.map_err(|err| DataFusionError::Internal(err.to_string()))?;

                while let Some(record) = scan.next().await {
                    yield Ok(record?.as_record_batch().clone())
                }
            }),
        }))
    }
}

Finally

Tonbo is still at an early, exploratory stage. We aim to build it into a general-purpose intermediate storage engine for analytical (AP) workloads, and we would love to hear more suggestions and receive more contributions!