This article was first published on the WeChat ("绿泡泡") channel "狗哥琐话" on 2025.3.14 <- follow so you don't miss future posts.
1. Introduction
Picking up from the previous article, we continue our walk through the RisingWave codebase. This time we look at how a streaming job goes from submission to running. Along the way you will not only see production-grade Rust code, but also pick up a few concepts from real-time stream processing.
2. Reading the Code
To help you keep your bearings, here is a simple code map up front.
-- utils/pgwire/pg_server.rs
-- pg_serve
-- handle_connection
-- process
-- do_process
-- do_process_inner
-- process_query_msg
-- inner_process_query_msg #where SQL parsing happens
-- inner_process_query_msg_one_stmt
-- frontend/src/session.rs
-- run_one_query
-- frontend/src/handler/mod.rs
-- handle #take the CreateView branch
-- frontend/handler/create_mv.rs
-- handle_create_mv_bound
-- gen_create_mv_plan_bound #where the plan is generated
-- frontend/catalog/catalog_service.rs
-- create_materialized_view
-- rpc_client/meta_client.rs
-- create_materialized_view
-- prost/ddl_service.rs
-- ddl_service_server/call #take the "/ddl_service.DdlService/CreateMaterializedView" branch
-- meta/service/ddl_service.rs
-- create_materialized_view
-- meta/service/ddl_controller.rs
-- run_command #take the CreateStreamingJob branch
-- create_streaming_job
-- create_streaming_job_inner #builds the StreamFragmentGraph
-- build_stream_job #builds the ActorGraphBuildResult, StreamJobFragments, and CompleteStreamFragmentGraph
-- meta/stream/stream_manager.rs
-- create_streaming_job
-- create_streaming_job_impl
-- meta/barrier/schedule.rs
-- run_command
-- run_multiple_commands #execution details will be covered in the next installment.
So today we'll walk through the chain of transformations that happens after SQL parsing. As you may know, Flink likewise transforms a query layer by layer. Why so many layers? I believe you'll have your own answer by the end of today's video.
At this point someone might ask: how can you cover this without covering SQL parsing and optimization? Bear with me; that topic is big enough that I plan to dedicate a separate video to it.
Reading through the code end to end, we can identify several layers:
- PlanRef
- StreamFragmentGraph
- StreamJobFragments
2.1 PlanRef
// Using a new type wrapper allows direct function implementation on `PlanRef`,
// and we currently need a manual implementation of `PartialEq` for `PlanRef`.
#[allow(clippy::derived_hash_with_manual_eq)]
#[derive(Clone, Debug, Eq, Hash)]
pub struct PlanRef(Rc<dyn PlanNode>);
At its core, PlanRef is just a newtype wrapping an Rc<dyn PlanNode>.
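Why a newtype rather than a plain type alias for Rc<dyn PlanNode>? A minimal standalone sketch (my own illustration, not RisingWave code): the newtype lets you add inherent methods and hand-written trait impls such as PartialEq, while still reaching the inner trait object through Deref.

use std::fmt::Debug;
use std::ops::Deref;
use std::rc::Rc;

// A minimal stand-in for the real `PlanNode` trait.
trait Node: Debug {
    fn name(&self) -> &'static str;
}

// Newtype over `Rc<dyn Node>`: unlike a type alias, we can write inherent
// impls and hand-rolled trait impls (e.g. `PartialEq`) for this type.
#[derive(Clone, Debug)]
struct NodeRef(Rc<dyn Node>);

impl Deref for NodeRef {
    type Target = dyn Node;
    fn deref(&self) -> &Self::Target {
        self.0.as_ref()
    }
}

impl PartialEq for NodeRef {
    // Illustrative only: compare by pointer identity. The real `PlanRef` has
    // its own manually written `PartialEq` with its own semantics.
    fn eq(&self, other: &Self) -> bool {
        Rc::ptr_eq(&self.0, &other.0)
    }
}

#[derive(Debug)]
struct Scan;
impl Node for Scan {
    fn name(&self) -> &'static str {
        "Scan"
    }
}

fn main() {
    let a = NodeRef(Rc::new(Scan));
    let b = a.clone();
    assert!(a == b); // the two refs point at the same underlying node
    println!("{}", a.name()); // method calls reach `dyn Node` through `Deref`
}

With that aside out of the way, let's look at the PlanNode trait itself.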
/// The common trait over all plan nodes. Used by optimizer framework which will treat all node as
/// `dyn PlanNode`
///
/// We split the trait into lots of sub-trait so that we can easily use macro to impl them.
pub trait PlanNode:
    PlanTreeNode
    + DynClone
    + DynEq
    + DynHash
    + Distill
    + Debug
    + Downcast
    + ColPrunable
    + ExprRewritable
    + ExprVisitable
    + ToBatch
    + ToStream
    + ToDistributedBatch
    + ToPb
    + ToLocalBatch
    + PredicatePushdown
    + AnyPlanNodeMeta
{
}
The comment says it clearly: this trait is the common interface over all plan nodes. It composes many sub-traits to provide capabilities such as tree manipulation, cloning, comparison, hashing, and debugging, which lets PlanNode be used and extended flexibly inside the optimizer framework.
To make this easier to digest, let me go through each of these traits.
- PlanTreeNode: defines the methods every plan node must implement, such as getting its input nodes and cloning the node.
- The Dyn* traits (DynClone, DynEq, DynHash) are exactly what their names suggest: they exist so that cloning, equality, and hashing work on trait objects.
- Distill: has two methods, distill and distill_to_string. distill converts the implementing object into an XmlNode, and distill_to_string renders that XmlNode into a formatted string.
- Debug: the standard trait for formatted debug output. It can be derived or implemented by hand; the ? and #? format specifiers give plain and pretty-printed output respectively. An implementation must provide fmt, which receives a Formatter and returns the formatting result.
- Downcast: handles dynamic type conversion, i.e. casting via Any.
- ColPrunable: column pruning for logical plan nodes. Its main job is to restrict the node's output to the required columns, adjusting the outputs of its children along the way.
- ExprRewritable: for rewriting expressions. has_rewritable_expr defaults to false and indicates whether the node contains rewritable expressions; rewrite_exprs takes an ExprRewriter and returns a new PlanRef (it is left unimplemented by default).
- ExprVisitable: for visiting expressions. It has a single method, visit_exprs, whose default implementation does nothing.
- ToBatch: converts a logical plan node into a batch physical node. There are two ways to implement it: normally implementing to_batch is enough; if you can do better when an ordering is required, also override to_batch_with_order_required.
- ToStream: converts a logical plan node into a stream physical node, optionally satisfying a required distribution. It consists of: logical_rewrite_for_stream, which rewrites the logical node so that the output contains the primary-key columns (or adds a row count); to_stream, whose default implementation calls to_stream_with_dist_required(RequiredDist::Any); and to_stream_with_dist_required, which performs the conversion under the given distribution requirement. (See the sketch after this list for the general shape of this delegation pattern.)
- The RequiredDist parameter of to_stream_with_dist_required is interesting: its variants represent 4 different distribution requirements.
- ToDistributedBatch: converts a batch physical plan into a distributed batch plan. Again there are two options: implement to_distributed and rely on the default to_distributed_with_required, or implement to_distributed_with_required and have to_distributed call to_distributed_with_required(&Order::any(), &RequiredDist::Any).
- ToPb: composed of TryToBatchPb and TryToStreamPb, with no methods of its own. TryToBatchPb has a default method, try_to_batch_prost_body, that returns an error saying the node cannot be converted into a batch node; ToBatchPb defines to_batch_prost_body, which performs the actual conversion to the protobuf batch node.
- ToLocalBatch: converts a batch physical plan into a local execution plan. to_local converts the current plan into a local plan; to_local_with_order_required does the same while satisfying a required ordering.
- PredicatePushdown: predicate pushdown for logical plan nodes. It does the following:
- for predicates that cannot be pushed down, create a LogicalFilter on top of the current node;
- for predicates that can be merged, merge them into the current node's condition;
- for predicates that can be pushed down, pass them on to the current node's inputs.
- AnyPlanNodeMeta: the object-safe version of PlanNodeMeta, and a supertrait of PlanNode. It has three methods: node_type returns the node type, plan_base returns a reference to the plan base, and convention returns the convention.
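ToBatch, ToStream, and ToDistributedBatch all share the same shape: a convenient conversion method plus a requirement-aware variant, with one of them given a default body that delegates to the other (which one carries the default differs per trait). Here is a minimal sketch of that delegation pattern with hypothetical names; it is not the actual RisingWave API.

// Hypothetical, simplified stand-in for an order/distribution requirement.
#[derive(Clone, Copy, Debug)]
enum Requirement {
    Any,
    HashShard, // e.g. "must be hash-distributed on some key"
}

#[derive(Debug)]
struct Plan(String);

// The pattern: `convert()` has a default body that calls the requirement-aware
// method with the loosest requirement, so most nodes only need to implement
// one of the two methods.
trait Convert {
    fn convert(&self) -> Plan {
        self.convert_with_required(Requirement::Any)
    }

    fn convert_with_required(&self, required: Requirement) -> Plan;
}

struct Filter;

impl Convert for Filter {
    // A node without special handling implements only the requirement-aware
    // method; `convert()` comes for free from the default body.
    fn convert_with_required(&self, required: Requirement) -> Plan {
        Plan(format!("Filter (required = {:?})", required))
    }
}

fn main() {
    let f = Filter;
    println!("{:?}", f.convert());
    println!("{:?}", f.convert_with_required(Requirement::HashShard));
}

The payoff is that most plan nodes implement only the method they care about, while callers that do have an ordering or distribution requirement can still express it.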
Now let's look at how PlanNode gets implemented.
Jump to any implementation and you land here, probably a bit puzzled:
for_all_plan_nodes! { impl_plan_node }
for_all_plan_nodes! is a macro that iterates over every kind of plan node and applies the impl_plan_node callback to each of them.
If that still sounds vague, let's first look at how for_all_plan_nodes is defined.
/// `for_all_plan_nodes` includes all plan nodes. If you added a new plan node
/// inside the project, be sure to add here and in its conventions like `for_logical_plan_nodes`
///
/// Every tuple has two elements, where `{ convention, name }`
/// You can use it as follows
/// ```rust
/// macro_rules! use_plan {
/// ($({ $convention:ident, $name:ident }),*) => {};
/// }
/// risingwave_frontend::for_all_plan_nodes! { use_plan }
/// ```
/// See the following implementations for example.
#[macro_export]
macro_rules! for_all_plan_nodes {
($macro:ident) => {
$macro! {
{ Logical, Agg }
, { Logical, Apply }
, { Logical, Filter }
, { Logical, Project }
, { Logical, Scan }
, { Logical, CdcScan }
, { Logical, SysScan }
, { Logical, Source }
, { Logical, Insert }
, { Logical, Delete }
, { Logical, Update }
, { Logical, Join }
, { Logical, Values }
, { Logical, Limit }
, { Logical, TopN }
, { Logical, HopWindow }
, { Logical, TableFunction }
, { Logical, MultiJoin }
, { Logical, Expand }
, { Logical, ProjectSet }
, { Logical, Union }
, { Logical, OverWindow }
, { Logical, Share }
, { Logical, Now }
, { Logical, Dedup }
, { Logical, Intersect }
, { Logical, Except }
, { Logical, MaxOneRow }
, { Logical, KafkaScan }
, { Logical, IcebergScan }
, { Logical, RecursiveUnion }
, { Logical, CteRef }
, { Logical, ChangeLog }
, { Logical, FileScan }
, { Logical, PostgresQuery }
, { Logical, MySqlQuery }
, { Batch, SimpleAgg }
, { Batch, HashAgg }
, { Batch, SortAgg }
, { Batch, Project }
, { Batch, Filter }
, { Batch, Insert }
, { Batch, Delete }
, { Batch, Update }
, { Batch, SeqScan }
, { Batch, SysSeqScan }
, { Batch, LogSeqScan }
, { Batch, HashJoin }
, { Batch, NestedLoopJoin }
, { Batch, Values }
, { Batch, Sort }
, { Batch, Exchange }
, { Batch, Limit }
, { Batch, TopN }
, { Batch, HopWindow }
, { Batch, TableFunction }
, { Batch, Expand }
, { Batch, LookupJoin }
, { Batch, ProjectSet }
, { Batch, Union }
, { Batch, GroupTopN }
, { Batch, Source }
, { Batch, OverWindow }
, { Batch, MaxOneRow }
, { Batch, KafkaScan }
, { Batch, IcebergScan }
, { Batch, FileScan }
, { Batch, PostgresQuery }
, { Batch, MySqlQuery }
, { Stream, Project }
, { Stream, Filter }
, { Stream, TableScan }
, { Stream, CdcTableScan }
, { Stream, Sink }
, { Stream, Source }
, { Stream, SourceScan }
, { Stream, HashJoin }
, { Stream, Exchange }
, { Stream, HashAgg }
, { Stream, SimpleAgg }
, { Stream, StatelessSimpleAgg }
, { Stream, Materialize }
, { Stream, TopN }
, { Stream, HopWindow }
, { Stream, DeltaJoin }
, { Stream, Expand }
, { Stream, DynamicFilter }
, { Stream, ProjectSet }
, { Stream, GroupTopN }
, { Stream, Union }
, { Stream, RowIdGen }
, { Stream, Dml }
, { Stream, Now }
, { Stream, Share }
, { Stream, WatermarkFilter }
, { Stream, TemporalJoin }
, { Stream, Values }
, { Stream, Dedup }
, { Stream, EowcOverWindow }
, { Stream, EowcSort }
, { Stream, OverWindow }
, { Stream, FsFetch }
, { Stream, ChangeLog }
, { Stream, GlobalApproxPercentile }
, { Stream, LocalApproxPercentile }
, { Stream, RowMerge }
, { Stream, AsOfJoin }
}
};
}
Now look at impl_plan_node.
macro_rules! impl_plan_node {
    ($({ $convention:ident, $name:ident }),*) => {
        paste!{
            $(impl PlanNode for [<$convention $name>] { })*
        }
    }
}
Now it clicks, right? This macro implements the PlanNode trait in bulk: it takes a list of tuples, each made of two identifiers, convention and name, and for every pair it generates the corresponding empty impl block (for example, impl PlanNode for LogicalFilter {}).
That said, this style really pays off if the node list inside for_all_plan_nodes is maintained or generated by tooling; otherwise it risks being a lot of ceremony for little gain.
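If the two-macro trick still feels abstract, here is a minimal, self-contained reproduction of the same pattern with made-up node names. It depends only on the paste crate (which the real impl_plan_node also uses to glue Logical + Filter into LogicalFilter); everything else here is hypothetical.

// Cargo.toml: paste = "1"
use paste::paste;

trait PlanNode {
    fn explain(&self) -> String;
}

// Toy node types, analogous to `LogicalFilter`, `StreamProject`, ...
struct LogicalFilter;
struct StreamProject;

// The "for-all" macro: it implements nothing itself; it only feeds the full
// list of `{ convention, name }` tuples to whatever callback macro it is
// given, exactly like `for_all_plan_nodes!`.
macro_rules! for_all_toy_nodes {
    ($macro:ident) => {
        $macro! {
            { Logical, Filter }
            , { Stream, Project }
        }
    };
}

// The callback: for each tuple, paste `convention` and `name` together into
// a type name and emit an impl for it.
macro_rules! impl_toy_plan_node {
    ($({ $convention:ident, $name:ident }),*) => {
        paste! {
            $(
                impl PlanNode for [<$convention $name>] {
                    fn explain(&self) -> String {
                        format!("{}{}", stringify!($convention), stringify!($name))
                    }
                }
            )*
        }
    };
}

for_all_toy_nodes! { impl_toy_plan_node }

fn main() {
    println!("{}", LogicalFilter.explain()); // "LogicalFilter"
    println!("{}", StreamProject.explain()); // "StreamProject"
}

The "for-all" macro is effectively a registry of node names: anything you want to stamp out for every node (trait impls, enum variants, match arms) becomes one more small callback macro fed into it.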
2.2 StreamFragmentGraph
/// [`StreamFragmentGraph`] stores a fragment graph (DAG).
#[derive(Default)]
pub struct StreamFragmentGraph {
    /// stores all the fragments in the graph.
    fragments: HashMap<LocalFragmentId, Rc<StreamFragment>>,
    /// stores edges between fragments: (upstream, downstream) => edge.
    edges: HashMap<(LocalFragmentId, LocalFragmentId), StreamFragmentEdgeProto>,
}
To help you grasp this quickly, let me describe StreamFragment directly: a Fragment can be understood as the smallest independent unit of computation, carrying its upstreams, its local tables, its own operator node, and so on. A StreamFragmentGraph, then, is simply the graph that wires these units together.
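To make "a graph of fragments" concrete, here is a simplified, self-contained sketch. The field names mimic the real struct, but the node and edge types are stand-ins I made up for illustration:

use std::collections::HashMap;
use std::rc::Rc;

type LocalFragmentId = u32;

// Simplified stand-in for `StreamFragment`: the smallest schedulable unit,
// which in the real code owns a tree of stream operators.
#[derive(Debug)]
struct ToyFragment {
    fragment_id: LocalFragmentId,
    root_operator: &'static str, // the real code stores an operator tree here
}

// Simplified stand-in for `StreamFragmentGraph`: fragments keyed by local id,
// plus (upstream, downstream) edges, together forming a DAG.
#[derive(Default, Debug)]
struct ToyFragmentGraph {
    fragments: HashMap<LocalFragmentId, Rc<ToyFragment>>,
    edges: HashMap<(LocalFragmentId, LocalFragmentId), &'static str>,
}

fn main() {
    let mut graph = ToyFragmentGraph::default();
    for (id, op) in [(0, "Source"), (1, "Project"), (2, "Materialize")] {
        graph.fragments.insert(
            id,
            Rc::new(ToyFragment { fragment_id: id, root_operator: op }),
        );
    }
    // upstream => downstream edges: Source -> Project -> Materialize
    graph.edges.insert((0, 1), "exchange");
    graph.edges.insert((1, 2), "no-shuffle");

    println!("{graph:#?}");
}

In the real type the edge value is a StreamFragmentEdgeProto, which (among other things) describes how data is dispatched from the upstream fragment to the downstream one.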
2.3 CompleteStreamFragmentGraph
This one can be roughly understood as a StreamFragmentGraph augmented with information that already exists in the cluster, for example whether an existing MV can be reused.
2.4 ActorGraphBuildResult
Actor here refers to the actor model, i.e. a message-passing model of concurrent computation.
/// The result of a built actor graph. Will be further embedded into the `Context` for building
/// actors on the compute nodes.
pub struct ActorGraphBuildResult {
    /// The graph of sealed fragments, including all actors.
    pub graph: BTreeMap<FragmentId, Fragment>,
    /// The scheduled locations of the actors to be built.
    pub building_locations: Locations,
    /// The actual locations of the external actors.
    pub existing_locations: Locations,
    /// The new dispatchers to be added to the upstream mview actors. Used for MV on MV.
    pub dispatchers: HashMap<ActorId, Vec<Dispatcher>>,
    /// The updates to be applied to the downstream chain actors. Used for schema change (replace
    /// table plan).
    pub merge_updates: Vec<MergeUpdate>,
}
This layer is more concerned with the relationships between actors, and between actors and the fragments they belong to.
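As a rough mental model (my own simplification, not RisingWave types): each fragment is expanded into as many actors as its parallelism, and the dispatchers then describe how an upstream actor fans its output out to the actors of the downstream fragment. A toy version of just the fragment-to-actor expansion:

use std::collections::{BTreeMap, HashMap};

type FragmentId = u32;
type ActorId = u32;

// Toy model: a fragment with parallelism N is expanded into N actors.
#[derive(Debug)]
struct ToyFragment {
    fragment_id: FragmentId,
    actors: Vec<ActorId>,
}

fn expand_to_actors(
    fragment_parallelism: &[(FragmentId, usize)],
) -> (BTreeMap<FragmentId, ToyFragment>, HashMap<ActorId, FragmentId>) {
    let mut graph = BTreeMap::new();
    let mut actor_to_fragment = HashMap::new();
    let mut next_actor_id: ActorId = 0;
    for &(fragment_id, parallelism) in fragment_parallelism {
        let actors: Vec<ActorId> = (0..parallelism)
            .map(|_| {
                let id = next_actor_id;
                next_actor_id += 1;
                actor_to_fragment.insert(id, fragment_id);
                id
            })
            .collect();
        graph.insert(fragment_id, ToyFragment { fragment_id, actors });
    }
    (graph, actor_to_fragment)
}

fn main() {
    // e.g. a Source fragment with parallelism 2 and a Materialize fragment with parallelism 3
    let (graph, actor_to_fragment) = expand_to_actors(&[(0, 2), (1, 3)]);
    println!("{graph:#?}");
    println!("{actor_to_fragment:?}");
}

On top of this expansion, the real ActorGraphBuildResult also records where the actors are scheduled (building_locations / existing_locations) and which new dispatchers must be installed on upstream mview actors for MV on MV, as its doc comments above state.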
2.5 StreamJobFragments
/// Fragments of a streaming job. Corresponds to [`PbTableFragments`].
/// (It was previously called `TableFragments` due to historical reasons.)
///
/// We store whole fragments in a single column family as follow:
/// `stream_job_id` => `StreamJobFragments`.
#[derive(Debug, Clone)]
pub struct StreamJobFragments {
    /// The table id.
    stream_job_id: TableId,
    /// The state of the table fragments.
    state: State,
    /// The table fragments.
    pub fragments: BTreeMap<FragmentId, Fragment>,
    /// The status of actors
    pub actor_status: BTreeMap<ActorId, ActorStatus>,
    /// The splits of actors,
    /// incl. both `Source` and `SourceBackfill` actors.
    pub actor_splits: HashMap<ActorId, Vec<SplitImpl>>,
    /// The streaming context associated with this stream plan and its fragments
    pub ctx: StreamContext,
    /// The parallelism assigned to this table fragments
    pub assigned_parallelism: TableParallelism,
    /// The max parallelism specified when the streaming job was created, i.e., expected vnode count.
    ///
    /// The reason for persisting this value is mainly to check if a parallelism change (via `ALTER
    /// .. SET PARALLELISM`) is valid, so that the behavior can be consistent with the creation of
    /// the streaming job.
    ///
    /// Note that the actual vnode count, denoted by `vnode_count` in `fragments`, may be different
    /// from this value (see `StreamFragmentGraph.max_parallelism` for more details.). As a result,
    /// checking the parallelism change with this value can be inaccurate in some cases. However,
    /// when generating resizing plans, we still take the `vnode_count` of each fragment into account.
    pub max_parallelism: usize,
}
By the time we reach StreamJobFragments, we are dealing with a very concrete data structure, close to the runtime state.
3. Summary
- PlanRef: the representation of the query from SQL parsing through the planning stage.
- StreamFragmentGraph: more of a logical graph representation; CompleteStreamFragmentGraph belongs to the same category.
- ActorGraphBuildResult: graph information built on top of the actor model.
- StreamJobFragments: a near-runtime struct, concerned with many execution details.