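Analysis of the Paimon Action Jar Implementation
Contents
- 1. Overview
- 2. Overall Architecture
- 3. SPI Service Discovery
- 4. Action Execution Flow
- 5. ExpireSnapshotsAction in Detail
- 6. Implementing a Custom Action
- 7. Best Practices and Caveats
- 8. Summary
1. Overview
The Paimon Action Jar is a command-line tool framework that Apache Paimon provides for table maintenance. Through the flink run command, users can run maintenance operations such as snapshot expiration, partition deletion, and table compaction.
1.1 Usage Example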
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
expire_snapshots \
--warehouse <warehouse-path> \
--database <database-name> \
--table <table-name> \
--retain_max 10 \
--retain_min 5 \
--older_than '2024-01-01 12:00:00' \
--max_deletes 10
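1.2 Core Features
- Pluggable architecture: an extensible Action system built on Java SPI
- Module isolation: a dedicated entry module avoids class-loading conflicts
- Execution modes: local execution (LocalAction) and Flink job execution
- Unified interface: every maintenance operation follows the same Action interface
2. Overall Architecture
2.1 Module Layout
The Paimon Action Jar is split into two modules:
2.1.1 The paimon-flink-action module
Location: paimon-flink/paimon-flink-action/
Responsibilities:
- Provides the single entry class, FlinkActions
- Ships as a standalone executable JAR with a Main-Class declaration
- Avoids class-loading conflicts with the paimon-flink.jar in the Flink lib directory
pom.xml configuration: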
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>org.apache.paimon.flink.action.FlinkActions</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
FlinkActions.java:
public class FlinkActions {
public static void main(String[] args) throws Exception {
if (args.length < 1) {
printDefaultHelp();
System.exit(1);
}
Optional<Action> action = ActionFactory.createAction(args);
if (action.isPresent()) {
action.get().run();
} else {
System.exit(1);
}
}
}
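2.1.2 The paimon-flink-common module
Location: paimon-flink/paimon-flink-common/
Responsibilities:
- Contains all Action implementation classes
- Contains all ActionFactory implementation classes
- Provides the Action infrastructure (ActionBase, LocalAction, and so on)
- Contains the Procedure implementations (used by Flink SQL CALL statements)
2.2 Core Class Diagram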
classDiagram
class Factory {
<<interface>>
+identifier() String
}
class ActionFactory {
<<interface>>
+create(params) Optional~Action~
+printHelp() void
+createAction(args) Optional~Action~
+catalogConfigMap(params) Map
}
class Action {
<<interface>>
+run() void
+build() void
}
class ActionBase {
<<abstract>>
#catalogOptions Options
#catalog Catalog
#flinkCatalog FlinkCatalog
#env StreamExecutionEnvironment
#batchTEnv StreamTableEnvironment
#forceStartFlinkJob boolean
+ActionBase(catalogConfig)
+run() void
#initCatalog() void
#initFlinkEnv(env) void
#execute(name) void
}
class LocalAction {
<<interface>>
+executeLocally() void
}
class ExpireSnapshotsActionFactory {
+IDENTIFIER "expire_snapshots"
+identifier() String
+create(params) Optional~Action~
+printHelp() void
}
class ExpireSnapshotsAction {
-database String
-table String
-retainMax Integer
-retainMin Integer
-olderThan String
-maxDeletes Integer
-options String
+ExpireSnapshotsAction(...)
+executeLocally() void
}
class CompactAction {
-partitions List
-whereSql String
-fullCompaction Boolean
+CompactAction(...)
+build() void
}
Factory <|-- ActionFactory
Action <|-- ActionBase
Action <|-- LocalAction
ActionFactory <|.. ExpireSnapshotsActionFactory
ActionFactory <|.. CompactActionFactory
ActionBase <|-- ExpireSnapshotsAction
ActionBase <|-- CompactAction
LocalAction <|.. ExpireSnapshotsAction
ExpireSnapshotsActionFactory ..> ExpireSnapshotsAction : creates
CompactActionFactory ..> CompactAction : creates
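2.3 Design Rationale
2.3.1 Why is a separate paimon-flink-action module needed?
Background:
- Flink's flink run command requires a JAR that declares a Main-Class
- Using paimon-flink.jar directly would cause class-loading conflicts
- The same classes would be present both in the Flink lib directory and in the user ClassLoader
Solution:
- Create a separate paimon-flink-action.jar
- It contains only the FlinkActions entry class
- It depends on paimon-flink-common (scope=provided)
- At runtime, the actual Action implementations are loaded from the Flink lib directory
2.3.2 LocalAction vs. regular Action
LocalAction:
- Suited to lightweight maintenance operations
- Runs locally in the client by default, without starting a Flink job
- Examples: expire_snapshots, rollback_to, create_tag
- Advantage: fast to run, low resource usage
Regular Action:
- Suited to operations that need distributed computation
- Must build a complete Flink job graph
- Examples: compact, merge_into
- Advantage: can use Flink's parallel processing
3. SPI Service Discovery
3.1 What is SPI
SPI (Service Provider Interface) is Java's service discovery mechanism. It lets implementations of an interface be loaded dynamically at runtime.
3.2 SPI Configuration File
Location: paimon-flink/paimon-flink-common/src/main/resources/META-INF/services/org.apache.paimon.factories.Factory
Excerpt: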
### action factories
org.apache.paimon.flink.action.CopyFilesActionFactory
org.apache.paimon.flink.action.CompactActionFactory
org.apache.paimon.flink.action.CompactDatabaseActionFactory
org.apache.paimon.flink.action.DropPartitionActionFactory
org.apache.paimon.flink.action.DeleteActionFactory
org.apache.paimon.flink.action.MergeIntoActionFactory
org.apache.paimon.flink.action.RollbackToActionFactory
...
org.apache.paimon.flink.action.ExpireSnapshotsActionFactory # line 44
org.apache.paimon.flink.action.ExpireChangelogsActionFactory
...
### procedure factories
org.apache.paimon.flink.procedure.CompactDatabaseProcedure
org.apache.paimon.flink.procedure.CompactProcedure
...
org.apache.paimon.flink.procedure.ExpireSnapshotsProcedure # line 74
...
3.3 SPI Loading Flow
3.3.1 The FactoryUtil.discoverFactory() method
Source: paimon-api/src/main/java/org/apache/paimon/factories/FactoryUtil.java
public static <T extends Factory> T discoverFactory(
        ClassLoader classLoader, Class<T> factoryClass, String identifier) {
    // 1. Load all Factory implementations
    final List<Factory> factories = getFactories(classLoader);
    // 2. Keep only the factories of the requested type
    final List<Factory> foundFactories =
            factories.stream()
                    .filter(f -> factoryClass.isAssignableFrom(f.getClass()))
                    .collect(Collectors.toList());
    // 3. Match on the identifier
    final List<Factory> matchingFactories =
            foundFactories.stream()
                    .filter(f -> f.identifier().equals(identifier))
                    .collect(Collectors.toList());
    // 4. Return the factory when exactly one matches
    if (matchingFactories.size() == 1) {
        return (T) matchingFactories.get(0);
    }
    // Handle the error cases...
}
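The filter-then-match steps of discoverFactory() can be tried out without any Paimon dependency. The following is a minimal sketch, not the library's code: the Factory, ActionFactory and ProcedureFactory interfaces are local stand-ins declared inside the example, the factories come from a plain list rather than from ServiceLoader, and the IllegalStateException thrown for zero or several matches is an assumption, since the excerpt above elides the error handling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DiscoverFactorySketch {

    // Local stand-ins for the real interfaces (hypothetical, for illustration only).
    interface Factory {
        String identifier();
    }

    interface ActionFactory extends Factory {}

    interface ProcedureFactory extends Factory {}

    // Filters by type first, then by identifier, and requires exactly one match.
    static <T extends Factory> T discover(
            List<Factory> all, Class<T> factoryClass, String identifier) {
        List<T> matches =
                all.stream()
                        .filter(factoryClass::isInstance)
                        .map(factoryClass::cast)
                        .filter(f -> f.identifier().equals(identifier))
                        .collect(Collectors.toList());
        if (matches.size() != 1) {
            // Assumed error handling: the excerpt above does not show this part.
            throw new IllegalStateException(
                    "Expected exactly one factory for '" + identifier
                            + "' but found " + matches.size());
        }
        return matches.get(0);
    }

    // Two factories share the identifier "expire_snapshots" but differ in type,
    // mirroring the action factory and the procedure listed in the same SPI file.
    static List<Factory> sampleFactories() {
        ActionFactory expireAction = () -> "expire_snapshots";
        ActionFactory compactAction = () -> "compact";
        ProcedureFactory expireProcedure = () -> "expire_snapshots";
        return Arrays.asList(expireAction, compactAction, expireProcedure);
    }

    public static void main(String[] args) {
        ActionFactory found =
                discover(sampleFactories(), ActionFactory.class, "expire_snapshots");
        System.out.println(found.identifier());
    }
}
```

Filtering by type before matching the identifier is what allows an action factory and a procedure to share the name expire_snapshots without being ambiguous.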
3.3.2 Loading with ServiceLoader
public static <T> List<T> discoverFactories(ClassLoader classLoader, Class<T> klass) {
    final Iterator<T> serviceLoaderIterator = ServiceLoader.load(klass, classLoader).iterator();
    final List<T> loadResults = new ArrayList<>();
    while (serviceLoaderIterator.hasNext()) {
        try {
            loadResults.add(serviceLoaderIterator.next());
        } catch (NoClassDefFoundError e) {
            // An optional dependency is missing: skip this provider
            LOG.debug("NoClassDefFoundError when loading factory", e);
        }
    }
    return loadResults;
}
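The reason for catching NoClassDefFoundError inside the loop is that one broken provider should not prevent the others from loading. The sketch below illustrates only that pattern with JDK classes; faultyIterator() is a hypothetical stand-in for a ServiceLoader iterator that meets a provider with a missing optional dependency.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class TolerantLoadSketch {

    // Drains an iterator, skipping entries whose creation fails with
    // NoClassDefFoundError, in the same spirit as the loop above.
    static <T> List<T> loadAll(Iterator<T> iterator) {
        List<T> results = new ArrayList<>();
        while (iterator.hasNext()) {
            try {
                results.add(iterator.next());
            } catch (NoClassDefFoundError e) {
                // An optional dependency is missing: skip this provider and continue.
            }
        }
        return results;
    }

    // Hypothetical iterator that fails on its second element.
    static Iterator<String> faultyIterator() {
        return new Iterator<String>() {
            private final String[] names = {"compact", "broken", "expire_snapshots"};
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < names.length;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Advance before failing so the caller can move on to the next entry.
                String name = names[index++];
                if (name.equals("broken")) {
                    throw new NoClassDefFoundError("simulated missing optional dependency");
                }
                return name;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(loadAll(faultyIterator())); // prints [compact, expire_snapshots]
    }
}
```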
3.4 The ActionFactory.createAction() flow
Source: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/action/ActionFactory.java
static Optional<Action> createAction(String[] args) {
    // 1. Extract the action name (the first argument)
    String action = args[0].toLowerCase().replaceAll("-", "_");
    String[] actionArgs = Arrays.copyOfRange(args, 1, args.length);
    // 2. Load the matching ActionFactory through SPI
    ActionFactory actionFactory;
    try {
        actionFactory = FactoryUtil.discoverFactory(
                ActionFactory.class.getClassLoader(),
                ActionFactory.class,
                action);
    } catch (FactoryException e) {
        printDefaultHelp();
        throw new UnsupportedOperationException("Unknown action \"" + action + "\".");
    }
    LOG.info("{} job args: {}", actionFactory.identifier(), String.join(" ", actionArgs));
    // 3. Parse the command-line arguments
    MultipleParameterToolAdapter params = new MultipleParameterToolAdapter(actionArgs);
    // 4. Handle the --help argument
    if (params.has(HELP)) {
        actionFactory.printHelp();
        return Optional.empty();
    }
    // 5. Ask the factory to create the Action instance
    Optional<Action> optionalAction = actionFactory.create(params);
    // 6. Handle the --force_start_flink_job argument
    if (params.has(FORCE_START_FLINK_JOB)) {
        optionalAction = optionalAction.map(a -> {
            return ((ActionBase) a).forceStartFlinkJob(
                    Boolean.parseBoolean(params.get(FORCE_START_FLINK_JOB)));
        });
    }
    return optionalAction;
}
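Step 1 of createAction() normalizes the action name before the SPI lookup. The sketch below isolates just that step with JDK classes; the class and method names (ActionArgsSketch, actionName, actionArgs) are made up for the illustration, while the two expressions are copied from the excerpt above.

```java
import java.util.Arrays;

public class ActionArgsSketch {

    // Same normalization as in createAction(): lower-case, dashes to underscores.
    static String actionName(String[] args) {
        return args[0].toLowerCase().replaceAll("-", "_");
    }

    // Everything after the action name is handed to the factory.
    static String[] actionArgs(String[] args) {
        return Arrays.copyOfRange(args, 1, args.length);
    }

    public static void main(String[] args) {
        String[] cli = {"Expire-Snapshots", "--database", "mydb", "--table", "t"};
        System.out.println(actionName(cli));
        System.out.println(String.join(" ", actionArgs(cli)));
    }
}
```

Because of this normalization, expire-snapshots and EXPIRE_SNAPSHOTS both resolve to the factory whose identifier is expire_snapshots.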
4. Action Execution Flow
4.1 End-to-End Sequence Diagram
sequenceDiagram
participant User as User command line
participant FlinkActions as FlinkActions.main
participant ActionFactory as ActionFactory
participant FactoryUtil as FactoryUtil
participant ExpireSnapshotsActionFactory as ExpireSnapshotsActionFactory
participant ExpireSnapshotsAction as ExpireSnapshotsAction
participant ActionBase as ActionBase.run
participant ExpireSnapshotsProcedure as ExpireSnapshotsProcedure
participant ExpireSnapshotsImpl as ExpireSnapshotsImpl
User->>FlinkActions: flink run paimon-flink-action.jar<br/>expire_snapshots --warehouse ... --database ... --table ...
FlinkActions->>ActionFactory: createAction(args)
Note over ActionFactory: Parses action = "expire_snapshots"
ActionFactory->>FactoryUtil: discoverFactory(ActionFactory.class, "expire_snapshots")
Note over FactoryUtil: Loads all Factory implementations with ServiceLoader<br/>as listed in the META-INF/services file
FactoryUtil->>FactoryUtil: Filter and match the identifier
FactoryUtil-->>ActionFactory: ExpireSnapshotsActionFactory instance
ActionFactory->>ExpireSnapshotsActionFactory: create(params)
Note over ExpireSnapshotsActionFactory: Parses arguments<br/>- database<br/>- table<br/>- retain_max<br/>- retain_min<br/>- older_than<br/>- max_deletes
ExpireSnapshotsActionFactory->>ExpireSnapshotsAction: new ExpireSnapshotsAction(...)
ExpireSnapshotsAction-->>ExpireSnapshotsActionFactory: action instance
ExpireSnapshotsActionFactory-->>ActionFactory: Optional.of(action)
ActionFactory-->>FlinkActions: Optional~Action~
FlinkActions->>ExpireSnapshotsAction: action.run()
ExpireSnapshotsAction->>ActionBase: super.run()
alt LocalAction && !forceStartFlinkJob
Note over ActionBase: LocalAction detected<br/>and no Flink job is forced
ActionBase->>ExpireSnapshotsAction: executeLocally()
ExpireSnapshotsAction->>ExpireSnapshotsProcedure: new ExpireSnapshotsProcedure()
ExpireSnapshotsAction->>ExpireSnapshotsProcedure: withCatalog(catalog)
ExpireSnapshotsAction->>ExpireSnapshotsProcedure: call(null, "db.table", retainMax, ...)
ExpireSnapshotsProcedure->>ExpireSnapshotsProcedure: table.newExpireSnapshots()
ExpireSnapshotsProcedure->>ExpireSnapshotsImpl: new ExpireSnapshotsImpl(...)
ExpireSnapshotsProcedure->>ExpireSnapshotsImpl: config(expireConfig).expire()
Note over ExpireSnapshotsImpl: Computes the range of expired snapshots<br/>Deletes snapshot files<br/>Deletes data files
ExpireSnapshotsImpl-->>ExpireSnapshotsProcedure: Number of expired snapshots
ExpireSnapshotsProcedure-->>ExpireSnapshotsAction: String[] result
ExpireSnapshotsAction-->>ActionBase: Done
else LocalAction && forceStartFlinkJob
Note over ActionBase: Forced Flink job mode
ActionBase->>ActionBase: env.fromSequence(0, 0)<br/>.flatMap(LocalActionExecutor)
ActionBase->>ActionBase: execute("ExpireSnapshotsAction")
Note over ActionBase: executeLocally() runs inside a Flink operator
else Regular Action (such as CompactAction)
Note over ActionBase: A Flink job graph must be built
ActionBase->>ActionBase: build()
Note over ActionBase: The concrete action (CompactAction here)<br/>builds Source, Transform, Sink
ActionBase->>ActionBase: execute("CompactAction")
end
ActionBase-->>FlinkActions: Done
FlinkActions-->>User: Execution succeeded
4.2 Core logic of ActionBase.run()
Source: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/action/ActionBase.java
@Override
public void run() throws Exception {
    // Check whether this action is a LocalAction
    if (LocalAction.class.isAssignableFrom(this.getClass())) {
        if (forceStartFlinkJob) {
            // Forced Flink job mode:
            // wrap the LocalAction in a Flink operator and run it there
            env.fromSequence(0, 0)
                    .flatMap(new LocalActionExecutor<>(this))
                    .setParallelism(1)
                    .sinkTo(new DiscardingSink<>());
            execute(this.getClass().getSimpleName());
        } else {
            // Default: run locally in the client
            ((LocalAction) this).executeLocally();
        }
    } else {
        // Regular Action: build the Flink job graph and execute it
        build();
        execute(this.getClass().getSimpleName());
    }
}
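The three-way dispatch in run() can be summarized in a dependency-free sketch. Everything here is an illustration under stated assumptions: Action and LocalAction are local stand-ins for the real interfaces, and chooseMode() only returns a label for the branch that would be taken instead of starting a job.

```java
public class RunDispatchSketch {

    // Local stand-ins for the real interfaces (hypothetical, for illustration only).
    interface Action {
        default void build() {}
    }

    interface LocalAction extends Action {
        void executeLocally();
    }

    // Returns a label for the branch that run() would take.
    static String chooseMode(Action action, boolean forceStartFlinkJob) {
        if (action instanceof LocalAction) {
            // A LocalAction runs in the client unless a Flink job is forced.
            return forceStartFlinkJob ? "local-logic-inside-flink-job" : "local";
        }
        // Any other action always builds and submits a Flink job.
        return "flink-job";
    }

    public static void main(String[] args) {
        LocalAction expire = () -> {};
        Action compact = new Action() {};
        System.out.println(chooseMode(expire, false));
        System.out.println(chooseMode(expire, true));
        System.out.println(chooseMode(compact, false));
    }
}
```

Note that the force flag only matters for a LocalAction: a regular action takes the Flink job branch regardless of its value.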
4.3 The LocalActionExecutor wrapper
private static class LocalActionExecutor<T extends ActionBase & LocalAction>
        extends RichFlatMapFunction<Long, Object> {
    private final T action;
    // The constructor that assigns the action field is omitted in this excerpt
    public void open(Configuration parameters) {
        // Initialize the Catalog inside the Flink operator
        action.initCatalog();
    }
    @Override
    public void flatMap(Long aLong, Collector<Object> collector) throws Exception {
        // Run the LocalAction inside the Flink operator
        action.executeLocally();
    }
}
5. ExpireSnapshotsAction in Detail
5.1 Command-Line Arguments
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
expire_snapshots \
--warehouse <warehouse-path> # Catalog configuration: warehouse path
--database <database-name> # Target database name
--table <table-name> # Target table name
--retain_max <num> # Maximum number of snapshots to keep
--retain_min <num> # Minimum number of snapshots to keep
--older_than <timestamp> # Expire snapshots older than this time
--max_deletes <num> # Maximum number of snapshots deleted per run
--catalog_conf key=value # Additional catalog configuration
5.2 Parameter Mapping
| Command-line argument | Java field | Notes |
|---|---|---|
| --warehouse | catalogOptions.warehouse | Mapped through catalogConfigMap() |
| --database | ExpireSnapshotsAction.database | Mapped directly |
| --table | ExpireSnapshotsAction.table | Mapped directly |
| --retain_max | ExpireSnapshotsAction.retainMax | Converted to Integer |
| --retain_min | ExpireSnapshotsAction.retainMin | Converted to Integer |
| --older_than | ExpireSnapshotsAction.olderThan | Timestamp string |
| --max_deletes | ExpireSnapshotsAction.maxDeletes | Converted to Integer |
| --catalog_conf | catalogOptions | Parsed into Map<String, String> |
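To make the mapping concrete, here is a minimal sketch of turning "--key value" pairs into typed values. It is a simplified stand-in and not MultipleParameterToolAdapter itself: the class ParamParsingSketch and its methods are hypothetical, and details such as ignoring a trailing key without a value are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParamParsingSketch {

    // Collects "--key value" pairs; a key may repeat (for example --catalog_conf).
    static Map<String, List<String>> parse(String[] args) {
        Map<String, List<String>> params = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected --key but got: " + args[i]);
            }
            params.computeIfAbsent(args[i].substring(2), k -> new ArrayList<>())
                    .add(args[i + 1]);
        }
        return params;
    }

    // Optional integer argument: null when the argument is absent.
    static Integer optionalInt(Map<String, List<String>> params, String key) {
        List<String> values = params.get(key);
        return values == null ? null : Integer.valueOf(values.get(0));
    }

    // Required argument: fails fast when the argument is missing.
    static String required(Map<String, List<String>> params, String key) {
        List<String> values = params.get(key);
        if (values == null) {
            throw new IllegalArgumentException("Missing required argument --" + key);
        }
        return values.get(0);
    }

    public static void main(String[] args) {
        Map<String, List<String>> params =
                parse(new String[] {"--database", "mydb", "--table", "t", "--retain_max", "10"});
        System.out.println(required(params, "database"));
        System.out.println(optionalInt(params, "retain_max"));
        System.out.println(optionalInt(params, "retain_min"));
    }
}
```

Leaving absent numeric arguments as null, rather than substituting a default here, lets a later layer fall back to the table's own options.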
5.3 ExpireSnapshotsActionFactory Implementation
Source: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/action/ExpireSnapshotsActionFactory.java
public class ExpireSnapshotsActionFactory implements ActionFactory {
public static final String IDENTIFIER = "expire_snapshots";
private static final String RETAIN_MAX = "retain_max";
private static final String RETAIN_MIN = "retain_min";
private static final String OLDER_THAN = "older_than";
private static final String MAX_DELETES = "max_deletes";
private static final String OPTIONS = "options";
@Override
public String identifier() {
return IDENTIFIER;
}
@Override
public Optional<Action> create(MultipleParameterToolAdapter params) {
// Parse the optional arguments
Integer retainMax =
params.has(RETAIN_MAX) ? Integer.parseInt(params.get(RETAIN_MAX)) : null;
Integer retainMin =
params.has(RETAIN_MIN) ? Integer.parseInt(params.get(RETAIN_MIN)) : null;
String olderThan = params.has(OLDER_THAN) ? params.get(OLDER_THAN) : null;
Integer maxDeletes =
params.has(MAX_DELETES) ? Integer.parseInt(params.get(MAX_DELETES)) : null;
String options = params.has(OPTIONS) ? params.get(OPTIONS) : null;
// Create the Action instance
ExpireSnapshotsAction action =
new ExpireSnapshotsAction(
params.getRequired(DATABASE), // required
params.getRequired(TABLE), // required
catalogConfigMap(params), // catalog configuration
retainMax,
retainMin,
olderThan,
maxDeletes,
options);
return Optional.of(action);
}
@Override
public void printHelp() {
System.out.println("Action \"expire_snapshots\" expire the target snapshots.");
System.out.println();
System.out.println("Syntax:");
System.out.println(
" expire_snapshots \\\n"
+ "--warehouse <warehouse_path> \\\n"
+ "--database <database> \\\n"
+ "--table <table> \\\n"
+ "--retain_max <max> \\\n"
+ "--retain_min <min> \\\n"
+ "--older_than <older_than> \\\n"
+ "--max_delete <max_delete>");
}
}
5.4 ExpireSnapshotsAction Implementation
Source: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/action/ExpireSnapshotsAction.java
public class ExpireSnapshotsAction extends ActionBase implements LocalAction {
private final String database;
private final String table;
private final Integer retainMax;
private final Integer retainMin;
private final String olderThan;
private final Integer maxDeletes;
private final String options;
public ExpireSnapshotsAction(
String database,
String table,
Map<String, String> catalogConfig,
Integer retainMax,
Integer retainMin,
String olderThan,
Integer maxDeletes,
String options) {
super(catalogConfig); // initializes the Catalog
this.database = database;
this.table = table;
this.retainMax = retainMax;
this.retainMin = retainMin;
this.olderThan = olderThan;
this.maxDeletes = maxDeletes;
this.options = options;
}
@Override
public void executeLocally() throws Exception {
// Create the Procedure instance
ExpireSnapshotsProcedure expireSnapshotsProcedure = new ExpireSnapshotsProcedure();
// Set the Catalog
expireSnapshotsProcedure.withCatalog(catalog);
// Call the Procedure (reuses the logic behind the Flink SQL CALL statement)
expireSnapshotsProcedure.call(
null, // ProcedureContext (null when called from an Action)
database + "." + table, // table identifier
retainMax,
retainMin,
olderThan,
maxDeletes,
options);
}
}
5.5 ExpireSnapshotsProcedure Implementation
Source: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/procedure/ExpireSnapshotsProcedure.java
public class ExpireSnapshotsProcedure extends ProcedureBase {
@Override
public String identifier() {
return "expire_snapshots";
}
public String[] call(
ProcedureContext procedureContext,
String tableId,
Integer retainMax,
Integer retainMin,
String olderThanStr,
Integer maxDeletes,
String options)
throws Catalog.TableNotExistException {
// 1. Look up the table
Table table = table(tableId);
// 2. Parse the dynamic options
HashMap<String, String> dynamicOptions = new HashMap<>();
ProcedureUtils.putAllOptions(dynamicOptions, options);
// 3. Apply the dynamic options (this creates a copy of the table)
table = table.copy(dynamicOptions);
// 4. Create the ExpireSnapshots instance
ExpireSnapshots expireSnapshots = table.newExpireSnapshots();
// 5. Build the expiration configuration
CoreOptions tableOptions = ((FileStoreTable) table).store().options();
ExpireConfig.Builder builder =
ProcedureUtils.fillInSnapshotOptions(
tableOptions, retainMax, retainMin, olderThanStr, maxDeletes);
// 6. Expire the snapshots
int expiredCount = expireSnapshots.config(builder.build()).expire();
return new String[] {expiredCount + ""};
}
}
5.6 Core of ExpireSnapshotsImpl
Source: paimon-core/src/main/java/org/apache/paimon/table/ExpireSnapshotsImpl.java
public class ExpireSnapshotsImpl implements ExpireSnapshots {
private final SnapshotManager snapshotManager;
private final ChangelogManager changelogManager;
private final ConsumerManager consumerManager;
private final SnapshotDeletion snapshotDeletion;
private final TagManager tagManager;
private ExpireConfig expireConfig;
@Override
public int expire() {
// 1. Read the configuration
snapshotDeletion.setChangelogDecoupled(expireConfig.isChangelogDecoupled());
int retainMax = expireConfig.getSnapshotRetainMax();
int retainMin = expireConfig.getSnapshotRetainMin();
int maxDeletes = expireConfig.getSnapshotMaxDeletes();
long olderThanMills =
System.currentTimeMillis() - expireConfig.getSnapshotTimeRetain().toMillis();
// 2. Get the latest and earliest snapshot ids
Long latestSnapshotId = snapshotManager.latestSnapshotId();
if (latestSnapshotId == null) {
return 0; // no snapshot, nothing to expire
}
Long earliest = snapshotManager.earliestSnapshotId();
if (earliest == null) {
return 0;
}
// 3. Compute the expiration range
// retainMax: the maximum number of snapshots to keep, counted back from the latest
long min = Math.max(latestSnapshotId - retainMax + 1, earliest);
// retainMin: the minimum number of snapshots to keep (protective threshold)
long maxExclusive = latestSnapshotId - retainMin + 1;
// Protect snapshots that consumers are still reading
maxExclusive =
Math.min(maxExclusive, consumerManager.minNextSnapshot().orElse(Long.MAX_VALUE));
// Cap the number of snapshots deleted in one run
maxExclusive = Math.min(maxExclusive, earliest + maxDeletes);
// 4. Check the time condition: stop at the first snapshot that is not old enough
for (long id = min; id < maxExclusive; id++) {
if (snapshotManager.snapshotExists(id)
&& olderThanMills <= snapshotManager.snapshot(id).timeMillis()) {
return expireUntil(earliest, id);
}
}
// 5. Expire everything up to maxExclusive
return expireUntil(earliest, maxExclusive);
}
public int expireUntil(long earliestId, long endExclusiveId) {
// 1. Find the first snapshot to expire
long beginInclusiveId = earliestId;
for (long id = endExclusiveId - 1; id >= earliestId; id--) {
if (!snapshotManager.snapshotExists(id)) {
beginInclusiveId = id + 1;
break;
}
}
// 2. Collect the snapshots protected by tags
List<Snapshot> taggedSnapshots = tagManager.taggedSnapshots();
// 3. Delete data files (merge tree files)
// Range: (beginInclusiveId, endExclusiveId]
for (long id = beginInclusiveId + 1; id <= endExclusiveId; id++) {
if (snapshotManager.snapshotExists(id)) {
Snapshot snapshot = snapshotManager.snapshot(id);
// Skip snapshots protected by a tag
if (isTaggedSnapshot(snapshot, taggedSnapshots)) {
continue;
}
// Delete the data files of this snapshot
snapshotDeletion.deleteAddedDataFiles(snapshot);
}
}
// 4. Delete manifest files
for (long id = beginInclusiveId; id < endExclusiveId; id++) {
if (snapshotManager.snapshotExists(id)) {
Snapshot snapshot = snapshotManager.snapshot(id);
if (!isTaggedSnapshot(snapshot, taggedSnapshots)) {
snapshotDeletion.deleteAddedManifests(snapshot);
}
}
}
// 5. Delete the snapshot files themselves
for (long id = beginInclusiveId; id < endExclusiveId; id++) {
Snapshot snapshot;
try {
snapshot = snapshotManager.tryGetSnapshot(id);
} catch (FileNotFoundException e) {
beginInclusiveId = id + 1;
continue;
}
// Commit the changelog when changelog decoupling is enabled
if (expireConfig.isChangelogDecoupled()) {
commitChangelog(new Changelog(snapshot));
}
// Delete the snapshot file
snapshotManager.deleteSnapshot(id);
}
// 6. Write the hint file for the earliest snapshot
writeEarliestHint(endExclusiveId);
LOG.info("Finished expire snapshots, range is [{}, {})",
beginInclusiveId, endExclusiveId);
return (int) (endExclusiveId - beginInclusiveId);
}
}
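5.7 Snapshot Expiration Flowchart
graph TB
A[Start expire] --> B[Get latest/earliest snapshot id]
B --> C{Snapshots exist?}
C -->|No| D[Return 0]
C -->|Yes| E[Compute the expiration range]
E --> E1[min = max latestId - retainMax + 1, earliest]
E1 --> E2[maxExclusive = latestId - retainMin + 1]
E2 --> E3[Apply consumer protection]
E3 --> E4[Apply the maxDeletes limit]
E4 --> F[Iterate and check the time condition]
F --> G{Snapshot not older than the older_than threshold?}
G -->|Yes| H[Call expireUntil]
G -->|No| I[Check the next snapshot]
I -->|More snapshots in range| F
I -->|Range exhausted| H
H --> H1[Find the first snapshot to expire]
H1 --> H2[Collect the snapshots protected by tags]
H2 --> H3[Delete data files]
H3 --> H4[Delete manifest files]
H4 --> H5[Delete snapshot files]
H5 --> H6[Write the earliest hint]
H6 --> J[Return the number of expired snapshots]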
style E1 fill:#e1f5ff
style E2 fill:#e1f5ff
style E3 fill:#ffe1e1
style E4 fill:#ffe1e1
style H3 fill:#fff4e1
style H4 fill:#fff4e1
style H5 fill:#fff4e1
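5.8 Parameter Constraints and Safeguards
| Parameter | Purpose | Constraint |
|---|---|---|
| retain_max | Maximum number of snapshots to keep | Must be >= retain_min |
| retain_min | Minimum number of snapshots to keep | Protective threshold that prevents deleting too many |
| older_than | Time threshold | Only snapshots older than this time are deleted |
| max_deletes | Per-run deletion limit | Avoids deleting too many snapshots at once |
| Consumer protection | Automatic | Snapshots still being read by a consumer are not deleted |
| Tag protection | Automatic | Tagged snapshots are not deleted |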
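How these limits interact is easiest to see with numbers. The sketch below re-implements only the bound arithmetic of expire() from section 5.6 as a pure function. It is an illustration under stated assumptions: ExpireRangeSketch and endExclusive() are hypothetical names, every snapshot id in the range is assumed to exist, commit times are supplied by a function instead of SnapshotManager, and tag protection is left out.

```java
import java.util.function.LongUnaryOperator;

public class ExpireRangeSketch {

    // Returns the exclusive upper bound of the snapshot ids that would be expired;
    // the expired range is [earliest, result).
    static long endExclusive(
            long earliest,
            long latest,
            int retainMax,
            int retainMin,
            long maxDeletes,
            long consumerNext,
            long olderThanMillis,
            LongUnaryOperator commitTimeOf) {
        // Snapshots below min exceed retainMax and are expired regardless of age.
        long min = Math.max(latest - retainMax + 1, earliest);
        // Never cut into the newest retainMin snapshots.
        long maxExclusive = latest - retainMin + 1;
        // Never expire what a consumer still has to read.
        maxExclusive = Math.min(maxExclusive, consumerNext);
        // Never expire more than maxDeletes snapshots in one run.
        maxExclusive = Math.min(maxExclusive, earliest + maxDeletes);
        // Between min and maxExclusive, stop at the first snapshot that is too recent.
        for (long id = min; id < maxExclusive; id++) {
            if (olderThanMillis <= commitTimeOf.applyAsLong(id)) {
                return id;
            }
        }
        return maxExclusive;
    }

    public static void main(String[] args) {
        // Snapshots 1..100; snapshot id was committed at id * 1000 ms (made-up data).
        LongUnaryOperator time = id -> id * 1000;
        long none = Long.MAX_VALUE;
        System.out.println(endExclusive(1, 100, 10, 5, 20, none, none, time));
        System.out.println(endExclusive(1, 100, 10, 5, 1000, none, none, time));
        System.out.println(endExclusive(1, 100, 10, 5, 1000, none, 93_000, time));
        System.out.println(endExclusive(1, 100, 10, 5, 1000, 50, none, time));
    }
}
```

With this made-up data the four calls return 21, 96, 93 and 50: the max_deletes cap wins in the first case, retain_min in the second, the time threshold in the third, and consumer protection in the fourth.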
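6. Implementing a Custom Action
Suppose we need a vacuum_table Action that cleans up all expired data of a table (snapshots, partitions, and orphan files).
6.1 Step 1: Create the Action Class
Create the file: VacuumTableAction.java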
package com.example.paimon.action;
import org.apache.paimon.catalog.Identifier;
import org.apache.paimon.flink.action.ActionBase;
import org.apache.paimon.flink.action.LocalAction;
import org.apache.paimon.table.Table;
import java.util.Map;
/**
* Vacuum table action: cleans up all expired data of a table,
* including expired snapshots, expired partitions, and orphan files.
*/
public class VacuumTableAction extends ActionBase implements LocalAction {
private final String database;
private final String table;
private final boolean expireSnapshots;
private final boolean expirePartitions;
private final boolean removeOrphanFiles;
private final Integer retainDays;
public VacuumTableAction(
String database,
String table,
Map<String, String> catalogConfig,
boolean expireSnapshots,
boolean expirePartitions,
boolean removeOrphanFiles,
Integer retainDays) {
super(catalogConfig);
this.database = database;
this.table = table;
this.expireSnapshots = expireSnapshots;
this.expirePartitions = expirePartitions;
this.removeOrphanFiles = removeOrphanFiles;
this.retainDays = retainDays;
}
@Override
public void executeLocally() throws Exception {
// 1. Look up the table
Identifier identifier = Identifier.create(database, table);
Table tableObj = catalog.getTable(identifier);
System.out.println("Starting vacuum for table: " + database + "." + table);
// 2. Expire snapshots
if (expireSnapshots) {
System.out.println("Expiring snapshots...");
int expiredCount = tableObj.newExpireSnapshots()
.config(org.apache.paimon.options.ExpireConfig.builder()
.snapshotTimeRetain(java.time.Duration.ofDays(retainDays))
.build())
.expire();
System.out.println("Expired " + expiredCount + " snapshots");
}
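// 3. Expire partitions (only meaningful for partitioned tables)
if (expirePartitions && !tableObj.partitionKeys().isEmpty()) {
System.out.println("Expiring partitions...");
// Call the partition expiration logic here
// tableObj.newExpirePartitions()... (placeholder name, not a confirmed Paimon API)
}
// 4. Remove orphan files
if (removeOrphanFiles) {
System.out.println("Removing orphan files...");
// Call the orphan file cleanup logic here
// tableObj.newRemoveOrphanFiles()... (placeholder name, not a confirmed Paimon API)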
}
System.out.println("Vacuum completed successfully");
}
}
6.2 Step 2: Create the ActionFactory Class
Create the file: VacuumTableActionFactory.java
package com.example.paimon.action;
import org.apache.paimon.flink.action.Action;
import org.apache.paimon.flink.action.ActionFactory;
import org.apache.paimon.flink.action.MultipleParameterToolAdapter;
import java.util.Optional;
/**
* Factory to create {@link VacuumTableAction}.
*/
public class VacuumTableActionFactory implements ActionFactory {
public static final String IDENTIFIER = "vacuum_table";
// Parameter keys
private static final String EXPIRE_SNAPSHOTS = "expire_snapshots";
private static final String EXPIRE_PARTITIONS = "expire_partitions";
private static final String REMOVE_ORPHAN_FILES = "remove_orphan_files";
private static final String RETAIN_DAYS = "retain_days";
@Override
public String identifier() {
return IDENTIFIER;
}
@Override
public Optional<Action> create(MultipleParameterToolAdapter params) {
// Parse the arguments, falling back to defaults
boolean expireSnapshots = params.has(EXPIRE_SNAPSHOTS)
? Boolean.parseBoolean(params.get(EXPIRE_SNAPSHOTS))
: true; // enabled by default
boolean expirePartitions = params.has(EXPIRE_PARTITIONS)
? Boolean.parseBoolean(params.get(EXPIRE_PARTITIONS))
: true; // enabled by default
boolean removeOrphanFiles = params.has(REMOVE_ORPHAN_FILES)
? Boolean.parseBoolean(params.get(REMOVE_ORPHAN_FILES))
: true; // enabled by default
Integer retainDays = params.has(RETAIN_DAYS)
? Integer.parseInt(params.get(RETAIN_DAYS))
: 7; // keep 7 days by default
// Create the Action instance
VacuumTableAction action = new VacuumTableAction(
params.getRequired(DATABASE),
params.getRequired(TABLE),
catalogConfigMap(params),
expireSnapshots,
expirePartitions,
removeOrphanFiles,
retainDays);
return Optional.of(action);
}
@Override
public void printHelp() {
System.out.println("Action \"vacuum_table\" cleans up all expired data for a table.");
System.out.println();
System.out.println("Syntax:");
System.out.println(" vacuum_table \\");
System.out.println(" --warehouse <warehouse_path> \\");
System.out.println(" --database <database> \\");
System.out.println(" --table <table> \\");
System.out.println(" [--expire_snapshots <true|false>] \\");
System.out.println(" [--expire_partitions <true|false>] \\");
System.out.println(" [--remove_orphan_files <true|false>] \\");
System.out.println(" [--retain_days <days>]");
System.out.println();
System.out.println("Options:");
System.out.println(" --expire_snapshots : Whether to expire old snapshots (default: true)");
System.out.println(" --expire_partitions : Whether to expire old partitions (default: true)");
System.out.println(" --remove_orphan_files : Whether to remove orphan files (default: true)");
System.out.println(" --retain_days : Days to retain data (default: 7)");
System.out.println();
System.out.println("Examples:");
System.out.println(" # Vacuum with all operations");
System.out.println(" vacuum_table --warehouse hdfs:///warehouse --database mydb --table mytable");
System.out.println();
System.out.println(" # Vacuum only snapshots, retain 30 days");
System.out.println(" vacuum_table --warehouse hdfs:///warehouse --database mydb --table mytable \\");
System.out.println(" --expire_snapshots true --expire_partitions false --remove_orphan_files false \\");
System.out.println(" --retain_days 30");
}
}
6.3 Step 3: Register with SPI
Create the file: src/main/resources/META-INF/services/org.apache.paimon.factories.Factory
# Custom Action factory
com.example.paimon.action.VacuumTableActionFactory
When adding the Action inside the Paimon source tree, append the entry to the existing SPI file instead:
paimon-flink/paimon-flink-common/src/main/resources/META-INF/services/org.apache.paimon.factories.Factory
### action factories
org.apache.paimon.flink.action.CopyFilesActionFactory
...
org.apache.paimon.flink.action.ExpireSnapshotsActionFactory
com.example.paimon.action.VacuumTableActionFactory # add this line
...
6.4 Step 4: Build and Package
6.4.1 Maven pom.xml configuration
<project>
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>paimon-custom-action</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<!-- Paimon Flink Common (provided: supplied by Flink at runtime) -->
<dependency>
<groupId>org.apache.paimon</groupId>
<artifactId>paimon-flink-common</artifactId>
<version>1.4-SNAPSHOT</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.4</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<!-- Merge the SPI configuration files -->
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
6.4.2 Build command
mvn clean package
Output file: target/paimon-custom-action-1.0-SNAPSHOT.jar
6.5 Step 5: Using the Custom Action
6.5.1 Option 1: standalone JAR
If the custom Action is packaged as a standalone JAR, place it in the Flink lib directory:
# 1. Copy the JAR into the Flink lib directory
cp target/paimon-custom-action-1.0-SNAPSHOT.jar <FLINK_HOME>/lib/
# 2. Run the Action
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
vacuum_table \
--warehouse hdfs:///path/to/warehouse \
--database my_database \
--table my_table \
--retain_days 30
6.5.2 Option 2: integrated into paimon-flink-action.jar
If the Action was added inside the Paimon source tree, rebuild paimon-flink-action.jar:
# 1. Go to the Paimon source directory
cd paimon
# 2. Build the project
mvn clean package -DskipTests
# 3. Locate the generated JAR
ls paimon-flink/paimon-flink-action/target/paimon-flink-action-*.jar
# 4. Run the Action
<FLINK_HOME>/bin/flink run \
paimon-flink/paimon-flink-action/target/paimon-flink-action-1.4-SNAPSHOT.jar \
vacuum_table \
--warehouse hdfs:///warehouse \
--database mydb \
--table mytable
6.5.3 Viewing the help text
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
vacuum_table \
--help
Output:
Action "vacuum_table" cleans up all expired data for a table.
Syntax:
vacuum_table \
--warehouse <warehouse_path> \
--database <database> \
--table <table> \
[--expire_snapshots <true|false>] \
[--expire_partitions <true|false>] \
[--remove_orphan_files <true|false>] \
[--retain_days <days>]
Options:
--expire_snapshots : Whether to expire old snapshots (default: true)
--expire_partitions : Whether to expire old partitions (default: true)
--remove_orphan_files : Whether to remove orphan files (default: true)
--retain_days : Days to retain data (default: 7)
Examples:
# Vacuum with all operations
vacuum_table --warehouse hdfs:///warehouse --database mydb --table mytable
# Vacuum only snapshots, retain 30 days
vacuum_table --warehouse hdfs:///warehouse --database mydb --table mytable \
--expire_snapshots true --expire_partitions false --remove_orphan_files false \
--retain_days 30
6.6 Testing the Custom Action
6.6.1 Unit tests
Create the file: VacuumTableActionTest.java
package com.example.paimon.action;
import org.apache.paimon.catalog.Catalog;
import org.apache.paimon.catalog.CatalogFactory;
import org.apache.paimon.catalog.Identifier;
import org.apache.paimon.flink.action.ActionFactory;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import static org.junit.jupiter.api.Assertions.*;
public class VacuumTableActionTest {
@TempDir
Path tempDir;
@Test
public void testVacuumTableActionFactory() {
VacuumTableActionFactory factory = new VacuumTableActionFactory();
assertEquals("vacuum_table", factory.identifier());
}
@Test
public void testCreateAction() {
String[] args = new String[]{
"vacuum_table",
"--warehouse", tempDir.toString(),
"--database", "test_db",
"--table", "test_table",
"--retain_days", "30"
};
Optional<org.apache.paimon.flink.action.Action> action =
ActionFactory.createAction(args);
assertTrue(action.isPresent());
assertInstanceOf(VacuumTableAction.class, action.get());
}
@Test
public void testExecuteVacuumAction() throws Exception {
// 1. Prepare the catalog configuration for the test
Map<String, String> catalogConfig = new HashMap<>();
catalogConfig.put("warehouse", tempDir.toString());
// 2. Create the Action
VacuumTableAction action = new VacuumTableAction(
"test_db",
"test_table",
catalogConfig,
true, // expireSnapshots
false, // expirePartitions
false, // removeOrphanFiles
7); // retainDays
// 3. Execute (the table must be created first)
// action.executeLocally();
}
}
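7. Best Practices and Caveats
7.1 Parameter Design
7.1.1 Required vs. optional parameters
- Required parameters: use params.getRequired(key), which throws when the argument is missing
- Optional parameters: check with params.has(key) and supply a default value
// Required parameter
String database = params.getRequired(DATABASE);
// Optional parameter with a default value
Integer retainMax = params.has(RETAIN_MAX)
? Integer.parseInt(params.get(RETAIN_MAX))
: 10;
7.1.2 Parameter validation
Validate parameters in the factory's create() method: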
@Override
public Optional<Action> create(MultipleParameterToolAdapter params) {
Integer retainMax = params.has(RETAIN_MAX)
? Integer.parseInt(params.get(RETAIN_MAX)) : null;
Integer retainMin = params.has(RETAIN_MIN)
? Integer.parseInt(params.get(RETAIN_MIN)) : null;
// Validate the parameters
if (retainMax != null && retainMin != null && retainMax < retainMin) {
throw new IllegalArgumentException(
"retain_max (" + retainMax + ") must be >= retain_min (" + retainMin + ")");
}
// Create the Action
return Optional.of(new MyAction(...));
}
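7.2 Catalog Configuration
7.2.1 Getting the configuration with catalogConfigMap()
Map<String, String> catalogConfig = catalogConfigMap(params);
This method:
- Parses every --catalog_conf key=value argument
- Adds the --warehouse argument to the configuration automatically
- Returns the complete catalog configuration Map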
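The merge described above can be sketched in a few lines of plain Java. This is an illustration of the idea rather than the actual catalogConfigMap() code: CatalogConfSketch and catalogConfig() are hypothetical names, and both splitting on the first '=' and letting --warehouse win over a conflicting catalog_conf entry are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CatalogConfSketch {

    // Merges repeated "key=value" entries and the warehouse path into one map.
    static Map<String, String> catalogConfig(String warehouse, List<String> catalogConfs) {
        Map<String, String> config = new HashMap<>();
        for (String entry : catalogConfs) {
            // Split on the first '=' only, so values may themselves contain '='.
            int eq = entry.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value but got: " + entry);
            }
            config.put(entry.substring(0, eq).trim(), entry.substring(eq + 1).trim());
        }
        if (warehouse != null) {
            // Assumption: an explicit --warehouse overrides a catalog_conf entry.
            config.put("warehouse", warehouse);
        }
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> config =
                catalogConfig(
                        "hdfs:///warehouse",
                        Arrays.asList("metastore=hive", "uri=thrift://localhost:9083"));
        System.out.println(config.get("warehouse"));
        System.out.println(config.get("metastore"));
        System.out.println(config.get("uri"));
    }
}
```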
7.2.2 Additional catalog configuration
Users can pass extra configuration through --catalog_conf:
vacuum_table \
--warehouse hdfs:///warehouse \
--database mydb \
--table mytable \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://localhost:9083
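To make the mapping from command-line flags to the configuration Map concrete, here is a minimal sketch of the parsing described above. It is not Paimon's catalogConfigMap() implementation; CatalogConfSketch and toConfigMap are hypothetical names, and the repeated --catalog_conf values are assumed to arrive as a list of key=value strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; not Paimon's actual catalogConfigMap() implementation. */
final class CatalogConfSketch {
    private CatalogConfSketch() {}

    static Map<String, String> toConfigMap(String warehouse, List<String> catalogConfs) {
        Map<String, String> config = new LinkedHashMap<>();
        for (String kv : catalogConfs) {
            // Split on the first '=' only, so values such as URIs may contain '=' themselves.
            int idx = kv.indexOf('=');
            if (idx <= 0) {
                throw new IllegalArgumentException(
                        "Invalid --catalog_conf '" + kv + "', expected key=value");
            }
            config.put(kv.substring(0, idx).trim(), kv.substring(idx + 1).trim());
        }
        // Assumption: --warehouse is folded into the same map under the key "warehouse".
        if (warehouse != null) {
            config.put("warehouse", warehouse);
        }
        return config;
    }
}
```

With the command above, the resulting map would hold three entries: metastore, uri, and warehouse.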
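7.3 LocalAction vs. Regular Action
7.3.1 Choosing LocalAction
Suitable when:
- The operation is lightweight and needs no distributed computation
- It works on a single table with a modest amount of data
- It is mostly a metadata operation (such as creating a tag or rolling back)
Example:
public class MyAction extends ActionBase implements LocalAction {
@Override
public void executeLocally() throws Exception {
// Runs directly on the client
}
}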
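7.3.2 Choosing a Regular Action
Suitable when:
- Large amounts of data need distributed processing
- A Flink job graph (Source, Transform, Sink) has to be built
- The operation reads, writes, and computes over data
Example:
public class CompactAction extends ActionBase {
@Override
public void build() throws Exception {
// Build the Flink job graph
DataStream<RowData> source = ...;
source.transform(...).sinkTo(...);
}
}
7.3.3 Forcing Flink Job Mode
Even a LocalAction can be forced to run inside a Flink job: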
vacuum_table \
--warehouse hdfs:///warehouse \
--database mydb \
--table mytable \
--force_start_flink_job true
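The choice between the execution paths can be pictured as a small dispatch. The sketch below is an assumption based on the description in this section, not Paimon's ActionBase: SketchAction, SketchLocalAction, DispatchSketch, and RecordingAction are hypothetical names, and the Flink job is reduced to a Runnable executed in place.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an action that builds a Flink job graph. */
interface SketchAction {
    void build();
}

/** Hypothetical stand-in for the LocalAction contract. */
interface SketchLocalAction {
    void executeLocally();
}

final class DispatchSketch {
    private DispatchSketch() {}

    /** Returns a label for the path taken, so the decision is easy to observe. */
    static String run(SketchAction action, boolean forceStartFlinkJob) {
        if (action instanceof SketchLocalAction) {
            SketchLocalAction local = (SketchLocalAction) action;
            if (!forceStartFlinkJob) {
                local.executeLocally(); // runs in the client process
                return "local";
            }
            // Forced mode: the same local logic would be shipped inside a Flink job.
            // Here the "job" is only a Runnable that is executed in place.
            Runnable job = local::executeLocally;
            job.run();
            return "local-logic-in-flink-job";
        }
        action.build(); // a real implementation would submit the built job afterwards
        return "flink-job";
    }
}

/** Toy action that records which methods were invoked. */
final class RecordingAction implements SketchAction, SketchLocalAction {
    final List<String> calls = new ArrayList<>();

    @Override
    public void build() {
        calls.add("build");
    }

    @Override
    public void executeLocally() {
        calls.add("executeLocally");
    }
}
```

The point of the design is that the action itself does not change; only the flag decides where executeLocally() runs.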
7.4 Error Handling
7.4.1 Handling Errors in the Factory
@Override
public Optional<Action> create(MultipleParameterToolAdapter params) {
try {
// Parse and validate the parameters
String database = params.getRequired(DATABASE);
Integer retainDays = Integer.parseInt(params.get(RETAIN_DAYS));
return Optional.of(new MyAction(...));
} catch (NumberFormatException e) {
System.err.println("Invalid number format for retain_days: " + e.getMessage());
return Optional.empty();
} catch (Exception e) {
System.err.println("Failed to create action: " + e.getMessage());
return Optional.empty();
}
}
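7.4.2 Handling Errors in the Action
@Override
public void executeLocally() throws Exception {
try {
// Perform the operation; the local variable must not reuse the name of the table field
Table paimonTable = catalog.getTable(Identifier.create(database, table));
// ...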
} catch (Catalog.TableNotExistException e) {
System.err.println("Table not found: " + database + "." + table);
throw e;
} catch (Exception e) {
System.err.println("Execution failed: " + e.getMessage());
throw e;
}
}
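7.5 Help Text
Provide detailed help text that includes:
- A description of what the Action does
- A complete syntax example
- An explanation of every parameter
- Real usage examples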
@Override
public void printHelp() {
System.out.println("Action \"my_action\" does something useful.");
System.out.println();
System.out.println("Syntax:");
System.out.println(" my_action \\");
System.out.println(" --warehouse <warehouse_path> \\");
System.out.println(" --database <database> \\");
System.out.println(" --table <table> \\");
System.out.println(" [--param1 <value>] \\");
System.out.println(" [--param2 <value>]");
System.out.println();
System.out.println("Parameters:");
System.out.println(" --warehouse : (Required) Path to the data warehouse");
System.out.println(" --database : (Required) Database name");
System.out.println(" --table : (Required) Table name");
System.out.println(" --param1 : (Optional) Description of param1");
System.out.println(" --param2 : (Optional) Description of param2");
System.out.println();
System.out.println("Examples:");
System.out.println(" # Basic usage");
System.out.println(" my_action --warehouse /path/to/warehouse --database db --table tbl");
System.out.println();
System.out.println(" # With optional parameters");
System.out.println(" my_action --warehouse /path/to/warehouse --database db --table tbl \\");
System.out.println(" --param1 value1 --param2 value2");
}
7.6 Serialization
7.6.1 The Action Must Be Serializable
If a LocalAction runs in forced Flink job mode, the Action object is serialized and sent to the TaskManager:
public class MyAction extends ActionBase implements LocalAction, Serializable {
// All fields must be serializable
private final String database; // OK
private final Integer retainDays; // OK
// Non-serializable fields must be marked transient
private transient Catalog catalog; // OK, managed by ActionBase
}
7.6.2 Using transient Fields
Mark non-serializable fields (such as Catalog or FileIO) as transient and re-initialize them at runtime:
public class MyAction extends ActionBase implements LocalAction {
private transient MyHelper helper;
@Override
public void executeLocally() throws Exception {
// Initialize lazily at execution time
if (helper == null) {
helper = new MyHelper(catalog);
}
helper.doSomething();
}
}
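Why the null check matters becomes visible after a serialization round trip: Java serialization skips transient fields, so the deserialized copy starts with helper == null. The following self-contained sketch demonstrates this with plain java.io; TransientDemoAction and its nested Helper are hypothetical stand-ins for an Action and a non-serializable collaborator.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical action-like class with one serializable and one transient field. */
final class TransientDemoAction implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Deliberately not Serializable, like a Catalog or FileIO handle. */
    static final class Helper {
        final String label;

        Helper(String label) {
            this.label = label;
        }
    }

    private final String database;
    private transient Helper helper; // skipped by Java serialization

    TransientDemoAction(String database) {
        this.database = database;
        this.helper = new Helper("helper for " + database);
    }

    boolean helperInitialized() {
        return helper != null;
    }

    /** Re-creates the helper on demand, mirroring the null check in executeLocally(). */
    String describe() {
        if (helper == null) {
            helper = new Helper("helper for " + database);
        }
        return helper.label;
    }

    /** Serializes the action to bytes and reads it back, as a job submission would. */
    static TransientDemoAction roundTrip(TransientDemoAction action) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(action);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TransientDemoAction) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Removing the transient keyword from helper would make writeObject fail with a NotSerializableException, which is the failure this practice guards against.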
7.7 Logging
Use SLF4J to log key operations:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class MyAction extends ActionBase implements LocalAction {
private static final Logger LOG = LoggerFactory.getLogger(MyAction.class);
@Override
public void executeLocally() throws Exception {
LOG.info("Starting my action for table: {}.{}", database, table);
try {
// Perform the operation
int result = doSomething();
LOG.info("Action completed successfully, result: {}", result);
} catch (Exception e) {
LOG.error("Action failed", e);
throw e;
}
}
}
7.8 Performance Optimization
7.8.1 Avoid Repeated Loading
Cache objects that are used repeatedly:
private transient Table tableCache;
private Table getTable() throws Exception {
if (tableCache == null) {
tableCache = catalog.getTable(Identifier.create(database, table));
}
return tableCache;
}
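7.8.2 Batch Operations
For an Action that processes multiple tables, use a batch API:
// Not ideal: load the tables one at a time
for (String tableName : tables) {
Table table = catalog.getTable(Identifier.create(database, tableName));
processTable(table);
}
// Better: load them in one batch
// Note: a batch getTables(List<Identifier>) is assumed here; verify that your Catalog version offers one
List<Identifier> identifiers = tables.stream()
.map(t -> Identifier.create(database, t))
.collect(Collectors.toList());
List<Table> loadedTables = catalog.getTables(identifiers);
loadedTables.forEach(this::processTable);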
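7.9 Compatibility
7.9.1 Backward Compatibility
When adding a new parameter, give it a default value to stay backward compatible:
// Newly added parameter
Integer newParam = params.has(NEW_PARAM)
? Integer.parseInt(params.get(NEW_PARAM))
: DEFAULT_VALUE; // the default keeps existing invocations working
7.9.2 Deprecating Parameters
To retire a parameter, first mark it as deprecated: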
@Deprecated
private static final String OLD_PARAM = "old_param";
private static final String NEW_PARAM = "new_param";
@Override
public Optional<Action> create(MultipleParameterToolAdapter params) {
String value;
if (params.has(NEW_PARAM)) {
value = params.get(NEW_PARAM);
} else if (params.has(OLD_PARAM)) {
System.err.println("Warning: --old_param is deprecated, use --new_param instead");
value = params.get(OLD_PARAM);
} else {
value = DEFAULT_VALUE;
}
// ...
}
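The fallback chain above leaves one case open: a caller that passes both the old and the new name. A minimal sketch of one way to handle it follows; DeprecatedParamSketch is a hypothetical helper, a Map stands in for the parameter adapter, and preferring the new name when both are present is an assumption, not documented Paimon behavior.

```java
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical helper; a Map stands in for the parsed command-line parameters. */
final class DeprecatedParamSketch {
    private DeprecatedParamSketch() {}

    /**
     * Resolves a value from the new key, then the deprecated key, then the default.
     * Warnings go to the supplied sink so callers decide where they are printed.
     */
    static String resolve(
            Map<String, String> params,
            String newKey,
            String oldKey,
            String defaultValue,
            Consumer<String> warn) {
        boolean hasNew = params.containsKey(newKey);
        boolean hasOld = params.containsKey(oldKey);
        if (hasNew) {
            if (hasOld) {
                warn.accept("Both --" + oldKey + " and --" + newKey
                        + " were given; using --" + newKey);
            }
            return params.get(newKey);
        }
        if (hasOld) {
            warn.accept("--" + oldKey + " is deprecated, use --" + newKey + " instead");
            return params.get(oldKey);
        }
        return defaultValue;
    }
}
```

Passing System.err::println as the last argument reproduces the warning style used in the snippet above.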
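8. Summary
8.1 Core Design of the Paimon Action Jar
The Paimon Action Jar builds a flexible, extensible framework for table maintenance operations on the following mechanisms:
8.1.1 Module Isolation
- Separate entry module: paimon-flink-action contains only the entry class, which avoids class-loading conflicts
- Separate implementation module: all implementations live in paimon-flink-common
8.1.2 SPI Extension Mechanism
- A plugin architecture based on Java SPI
- Factories are registered through META-INF/services files
- FactoryUtil.discoverFactory() loads the implementations dynamically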
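The identifier-based lookup behind this mechanism can be sketched in a few lines. This is an illustration of the idea, not the code of FactoryUtil.discoverFactory(); SketchFactory and FactoryLookupSketch are hypothetical names, and the candidates are passed in explicitly so the example runs without any META-INF/services file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the Factory interface, reduced to its identifier. */
interface SketchFactory {
    String identifier();
}

final class FactoryLookupSketch {
    private FactoryLookupSketch() {}

    /** Picks the single candidate whose identifier matches, or fails with a clear message. */
    static <T extends SketchFactory> T discover(Iterable<T> candidates, String identifier) {
        List<T> matches = new ArrayList<>();
        for (T factory : candidates) {
            if (factory.identifier().equals(identifier)) {
                matches.add(factory);
            }
        }
        if (matches.isEmpty()) {
            throw new IllegalArgumentException(
                    "No factory found for identifier '" + identifier + "'");
        }
        if (matches.size() > 1) {
            // Two registrations with the same identifier usually mean duplicate JARs.
            throw new IllegalStateException(
                    "Multiple factories found for identifier '" + identifier + "'");
        }
        return matches.get(0);
    }
}
```

In a real deployment the candidates would typically come from java.util.ServiceLoader.load(...), which is itself an Iterable over the classes listed in the META-INF/services file.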
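8.1.3 Layered Design
Responsibilities are cleanly layered:
- Factory layer: parameter parsing, validation, Action creation
- Action layer: execution scheduling and mode selection (local vs. Flink job)
- Procedure layer: business logic wrapper (reused by Flink SQL CALL)
- Core layer: core implementation (such as ExpireSnapshotsImpl)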
FlinkActions.main()
↓
ActionFactory.createAction()
↓ (SPI loading)
ExpireSnapshotsActionFactory.create()
↓
ExpireSnapshotsAction.run()
↓ (LocalAction)
ExpireSnapshotsAction.executeLocally()
↓
ExpireSnapshotsProcedure.call()
↓
ExpireSnapshotsImpl.expire()
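8.1.4 Flexible Execution Modes
- LocalAction: lightweight operations run locally, quickly and efficiently
- Regular Action: builds a Flink job for distributed processing
- Forced mode: a LocalAction can also be forced to run as a Flink job
8.1.5 Unified Interface Contract
All Actions follow the same interface:
- Action.run(): execution entry point
- Action.build(): builds the job graph (optional)
- LocalAction.executeLocally(): local execution (optional)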
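8.2 Key Points for Implementing a Custom Action
- Extend the right base class:
  - Simple operations: extends ActionBase implements LocalAction
  - Complex jobs: extends ActionBase
- Implement the Factory:
  - Define a unique identifier
  - Parse and validate the parameters
  - Provide detailed help text
- Register with SPI:
  - Add the Factory to META-INF/services/org.apache.paimon.factories.Factory
- Handle serialization properly:
  - The Action class must implement Serializable
  - Mark non-serializable fields as transient
- Handle errors and logging:
  - Provide clear error messages
  - Log key operations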
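8.3 Implementation Highlights of ExpireSnapshotsAction
8.3.1 Multiple Layers of Protection
- retain_min: guarantees a minimum number of retained snapshots
- retain_max: caps the number of retained snapshots
- older_than: filters by time
- max_deletes: limits the number of deletions in a single run
- Consumer protection: automatically detects consumers
- Tag protection: protects tagged snapshots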
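How the first four limits might combine can be shown with a small calculation. The sketch below is a simplified reading of the rules listed above, not the code of ExpireSnapshotsImpl: ExpirePlanSketch and its parameters are hypothetical, consumer and tag protection are left out, and snapshot ids are assumed to be contiguous.

```java
import java.util.function.LongUnaryOperator;

/** Simplified, hypothetical model of how the retention limits bound one expiration run. */
final class ExpirePlanSketch {
    private ExpirePlanSketch() {}

    /**
     * Returns the exclusive upper bound of the expired range:
     * snapshots with ids in [earliest, result) would be expired.
     */
    static long expireUntil(
            long earliest,
            long latest,
            int retainMax,
            int retainMin,
            int maxDeletes,
            long olderThanMillis,
            LongUnaryOperator commitTimeOf) {
        if (retainMin < 1 || retainMax < retainMin) {
            throw new IllegalArgumentException("need 1 <= retain_min <= retain_max");
        }
        // retain_max: ids below this bound are expired without an age check.
        long mustExpireBefore = Math.max(latest - retainMax + 1, earliest);
        // retain_min: never expire at or beyond this id.
        long maxExclusive = latest - retainMin + 1;
        // max_deletes: cap the number of snapshots removed in one run.
        maxExclusive = Math.min(maxExclusive, earliest + maxDeletes);
        // older_than: between the two bounds, stop at the first snapshot that is too recent.
        for (long id = mustExpireBefore; id < maxExclusive; id++) {
            if (commitTimeOf.applyAsLong(id) >= olderThanMillis) {
                return id;
            }
        }
        // Never report a bound below the earliest snapshot (nothing to expire).
        return Math.max(maxExclusive, earliest);
    }
}
```

For snapshots 1 to 20 with retain_max 5 and retain_min 2, this model expires at least ids 1 to 15 and at most ids 1 to 18, with older_than deciding where it stops in between and max_deletes able to cut the run shorter.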
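8.3.2 Staged Deletion
- Delete data files (merge tree files)
- Delete manifest files
- Delete the snapshot files themselves
- Update the earliest hint
8.3.3 Performance Optimizations
- Early exit: stop as soon as the time condition is met
- Batch operations: avoid deleting files one at a time
- Asynchronous mode: supports asynchronous expiration (avoids backpressure)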
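8.4 When to Use Which
| Action type | When to use | Examples |
|---|---|---|
| LocalAction | Lightweight maintenance operations | expire_snapshots, rollback_to, create_tag |
| Regular Action | Distributed computation | compact, merge_into, clone |
| Hybrid mode | Selectable execution path | Switch with --force_start_flink_job |
8.5 Extension Ideas
The Paimon Action framework can be extended with, for example:
- Data quality check Action: checks the completeness and consistency of table data
- Data backup Action: backs up table snapshots to external storage
- Data migration Action: migrates tables between Catalogs
- Statistics collection Action: collects table statistics for query optimization
- Data sampling Action: samples data from large tables for analysis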
8.6 References
- Source locations:
  - Entry point: paimon-flink/paimon-flink-action/src/main/java/org/apache/paimon/flink/action/FlinkActions.java
  - Action implementations: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/action/
  - SPI configuration: paimon-flink/paimon-flink-common/src/main/resources/META-INF/services/
- Official documentation:
- Related Procedures:
  - Action and Procedure share the same business logic
  - Procedure serves the Flink SQL CALL statement
  - Action serves command-line execution through flink run
Appendix
A. Complete Command-Line Examples
A.1 Expiring Snapshots
# Basic usage: keep the 10 most recent snapshots
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
expire_snapshots \
--warehouse hdfs:///warehouse \
--database my_database \
--table my_table \
--retain_max 10
# Advanced usage: combine several conditions
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
expire_snapshots \
--warehouse hdfs:///warehouse \
--database my_database \
--table my_table \
--retain_max 100 \
--retain_min 10 \
--older_than '2024-01-01 00:00:00' \
--max_deletes 50
A.2 Compacting a Table
# Compact the whole table
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
compact \
--warehouse hdfs:///warehouse \
--database my_database \
--table my_table
# Compact specific partitions
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
compact \
--warehouse hdfs:///warehouse \
--database my_database \
--table my_table \
--partition dt=2024-01-01 \
--partition dt=2024-01-02
A.3 Removing Orphan Files
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar \
remove_orphan_files \
--warehouse hdfs:///warehouse \
--database my_database \
--table my_table \
--older_than '2024-01-01 00:00:00'
A.4 Listing All Available Actions
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action.jar
Output:
Usage: <action> [OPTIONS]
Available actions:
compact
compact_database
copy_files
create_branch
create_tag
create_tag_from_timestamp
create_tag_from_watermark
delete_branch
delete_tag
drop_partition
expire_changelogs
expire_partitions
expire_snapshots
expire_tags
fast_forward
mark_partition_done
merge_into
migrate_database
migrate_table
remove_orphan_files
repair
replace_tag
reset_consumer
rewrite_file_index
rollback_to
rollback_to_timestamp
...
For detailed options of each action, run <action> --help
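B. Frequently Asked Questions
B.1 ClassNotFoundException
Problem: a ClassNotFoundException is thrown when running an Action
Cause: the JAR containing the custom Action was not placed in the Flink lib directory
Solution:
cp my-custom-action.jar $FLINK_HOME/lib/
B.2 SPI Not Taking Effect
Problem: the custom Action is not recognized
Cause: the SPI configuration file has a wrong path or format
Checklist:
- File path: src/main/resources/META-INF/services/org.apache.paimon.factories.Factory
- File content: fully qualified class names, one per line
- Maven configuration: use ServicesResourceTransformer to merge SPI files
B.3 Parameter Parsing Errors
Problem: a parameter has no effect after being passed
Cause: the parameter name is wrong or the value is malformed
Checklist:
- Parameter names use underscores (retain_max), not camelCase (retainMax)
- Parameter values are well-formed (numbers, timestamps, and so on)
- Use --help to see the correct parameter names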
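When a camelCase name slips in, a helper that suggests the underscore form can turn a silent no-op into a useful hint. The following is a minimal sketch; ParamNameHint is a hypothetical utility and not part of Paimon.

```java
/** Hypothetical utility that converts camelCase parameter names to snake_case. */
final class ParamNameHint {
    private ParamNameHint() {}

    static String toSnakeCase(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 4);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c)) {
                // Insert a separator before every upper-case letter except a leading one.
                if (i > 0) {
                    sb.append('_');
                }
                sb.append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

A Factory could compare each unknown key against toSnakeCase(key) and print a message such as "did you mean --retain_max?" when the converted name is a known parameter.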