The Paimon project now officially recommends using a Catalog to create and work with Paimon tables, so let's first go over the relevant Flink Catalog concepts.
1. Catalog
1.1 Default Catalog
Following the call chain from the StreamTableEnvironment.create entry point, we arrive at StreamTableEnvironmentImpl.create, which initializes a CatalogManager:
final CatalogManager catalogManager =
CatalogManager.newBuilder()
.classLoader(classLoader)
.config(tableConfig)
.defaultCatalog(
settings.getBuiltInCatalogName(),
new GenericInMemoryCatalog(
settings.getBuiltInCatalogName(), // default_catalog
settings.getBuiltInDatabaseName())) // default_database
.executionConfig(executionEnvironment.getConfig())
.build();
So when we write Flink SQL such as
create table source_table(
)with(
)
the table actually resolves to default_catalog.default_database.source_table by default.
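This default resolution can be sketched in a few lines. The class below is an illustrative stand-in (not Flink's actual implementation), with the session's current catalog and database hard-coded for the demo:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of how an unqualified table reference is completed
// with the session's current catalog and database.
public class IdentifierQualifier {
    static final String CURRENT_CATALOG = "default_catalog";
    static final String CURRENT_DATABASE = "default_database";

    // Accepts "table", "db.table" or "catalog.db.table" and returns the
    // fully qualified [catalog, database, table] triple.
    public static List<String> qualify(String path) {
        String[] parts = path.split("\\.");
        switch (parts.length) {
            case 1: return Arrays.asList(CURRENT_CATALOG, CURRENT_DATABASE, parts[0]);
            case 2: return Arrays.asList(CURRENT_CATALOG, parts[0], parts[1]);
            case 3: return Arrays.asList(parts[0], parts[1], parts[2]);
            default: throw new IllegalArgumentException("invalid path: " + path);
        }
    }
}
```

Here qualify("source_table") expands to [default_catalog, default_database, source_table], exactly the expansion described above.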
1.2 Create Catalog
Let's continue with the CatalogManager class:
public void createCatalog(String catalogName, CatalogDescriptor catalogDescriptor)
throws CatalogException {
checkArgument(
!StringUtils.isNullOrWhitespaceOnly(catalogName),
"Catalog name cannot be null or empty.");
checkNotNull(catalogDescriptor, "Catalog descriptor cannot be null");
if (catalogStoreHolder.catalogStore().contains(catalogName)) {
throw new CatalogException(
format("Catalog %s already exists in catalog store.", catalogName));
}
if (catalogs.containsKey(catalogName)) {
throw new CatalogException(
format("Catalog %s already exists in initialized catalogs.", catalogName));
}
//create the catalog instance
Catalog catalog = initCatalog(catalogName, catalogDescriptor);
//perform any preparation needed during the initialization phase
catalog.open();
//cache the catalog once it is initialized
catalogs.put(catalogName, catalog);
catalogStoreHolder.catalogStore().storeCatalog(catalogName, catalogDescriptor);
}
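The bookkeeping above (duplicate check, open, cache) can be condensed into a self-contained sketch; Catalog here is a simplified stand-in interface, not the real Flink type:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of CatalogManager.createCatalog's bookkeeping: reject duplicates,
// open the catalog, then cache the instance for later lookups.
public class CatalogRegistry {
    interface Catalog { void open(); }

    private final Map<String, Catalog> catalogs = new HashMap<>();

    public void register(String name, Catalog catalog) {
        if (catalogs.containsKey(name)) {
            throw new IllegalStateException("Catalog " + name + " already exists.");
        }
        catalog.open();              // any preparation the catalog needs
        catalogs.put(name, catalog); // cache for later lookups by name
    }

    public Catalog get(String name) { return catalogs.get(name); }
}
```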
private Catalog initCatalog(String catalogName, CatalogDescriptor catalogDescriptor) {
return FactoryUtil.createCatalog(
catalogName,
catalogDescriptor.getConfiguration().toMap(),
catalogStoreHolder.config(),
catalogStoreHolder.classLoader());
}
Following FactoryUtil.createCatalog, we see Flink's usual pattern: locate the matching CatalogFactory via SPI, then call factory.createCatalog:
public static Catalog createCatalog(
String catalogName,
Map<String, String> options,
ReadableConfig configuration,
ClassLoader classLoader) {
// Use the legacy mechanism first for compatibility
try {
final CatalogFactory legacyFactory =
TableFactoryService.find(CatalogFactory.class, options, classLoader);
return legacyFactory.createCatalog(catalogName, options);
} catch (NoMatchingTableFactoryException e) {
// No matching legacy factory found, try using the new stack
final DefaultCatalogContext discoveryContext =
new DefaultCatalogContext(catalogName, options, configuration, classLoader);
try {
// discover the matching CatalogFactory via SPI, then call factory.createCatalog
final CatalogFactory factory = getCatalogFactory(discoveryContext);
// The type option is only used for discovery, we don't actually want to forward it
// to the catalog factory itself.
final Map<String, String> factoryOptions =
options.entrySet().stream()
.filter(
entry ->
!CommonCatalogOptions.CATALOG_TYPE
.key()
.equals(entry.getKey()))
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
final DefaultCatalogContext context =
new DefaultCatalogContext(
catalogName, factoryOptions, configuration, classLoader);
return factory.createCatalog(context);
} catch (Throwable t) {
throw new ValidationException(
String.format(
"Unable to create catalog '%s'.%n%nCatalog options are:%n%s",
catalogName,
options.entrySet().stream()
.map(
optionEntry ->
stringifyOption(
optionEntry.getKey(),
optionEntry.getValue()))
.sorted()
.collect(Collectors.joining("\n"))),
t);
}
}
}
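The discovery and option-filtering steps can be sketched without the SPI machinery. In the real code the factories are found via Java's ServiceLoader; here they are simply passed in as a list, and Factory is a minimal stand-in interface:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Simplified sketch of FactoryUtil-style discovery: pick the factory whose
// identifier matches the 'type' option, then forward every option except
// 'type' itself, which is only used for discovery.
public class FactoryDiscovery {
    interface Factory {
        String identifier();
    }

    public static Factory discover(List<Factory> factories, Map<String, String> options) {
        String type = options.get("type");
        return factories.stream()
                .filter(f -> f.identifier().equals(type))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no factory for type: " + type));
    }

    // the 'type' key is not forwarded to the factory itself
    public static Map<String, String> forwardedOptions(Map<String, String> options) {
        return options.entrySet().stream()
                .filter(e -> !"type".equals(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```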
1.3 Create Table
After the catalog is created and initialized as shown above, it is put into the cache via catalogs.put(catalogName, catalog).
When a CREATE TABLE statement is executed, the table path prefix (e.g. paimon_catalog.default.tableA) determines which catalog to fetch via CatalogManager.getCatalog, and that catalog's createTable() method is called to create the table.
1.4 Get Table
When executing SQL such as insert into paimon_catalog.default.tableA, CatalogManager.getTable is called. As before, the table path prefix (e.g. paimon_catalog.default.tableA) determines which catalog to fetch via CatalogManager.getCatalog, and that catalog's getTable() method is then called.
2. Paimon Catalog
We have now covered the basics of Flink Catalogs and how they are created, so let's turn to how a Paimon Catalog is created. At the end of the previous chapter we saw that the matching flink.CatalogFactory is discovered via the SPI mechanism; let's look at which flink.CatalogFactory implementations Paimon provides.
Note that flink.CatalogFactory here means org.apache.flink.table.factories.CatalogFactory, a native Flink class.
Later we will also meet Paimon's own org.apache.paimon.catalog.CatalogFactory class.
Throughout this article the two are distinguished as flink.CatalogFactory and paimon.CatalogFactory.
2.1 FlinkCatalogFactory
2.1.1 create Catalog
Let's start with the FlinkCatalogFactory class. After the matching CatalogFactory is discovered via SPI, its factory.createCatalog method is executed:
public class FlinkCatalogFactory implements org.apache.flink.table.factories.CatalogFactory {
public static final String IDENTIFIER = "paimon";
public FlinkCatalog createCatalog(Context context) {
return createCatalog(
context.getName(),
CatalogContext.create(
Options.fromMap(context.getOptions()), new FlinkFileIOLoader()),
context.getClassLoader());
}
public static FlinkCatalog createCatalog(
String catalogName, CatalogContext context, ClassLoader classLoader) {
return new FlinkCatalog(
//note this CatalogFactory: it is paimon.CatalogFactory
CatalogFactory.createCatalog(context, classLoader),
catalogName,
context.options().get(DEFAULT_DATABASE),
classLoader,
context.options());
}
}
Following the FlinkCatalog class, we can see that its main methods all delegate to catalog.XXX, where catalog is the CatalogFactory.createCatalog(context, classLoader) instance passed in above. In other words, FlinkCatalog is just a wrapper class; the actual catalog operations are performed by the org.apache.paimon.catalog.Catalog created via paimon.CatalogFactory.
public class FlinkCatalog extends AbstractCatalog {
public FlinkCatalog(
Catalog catalog,
String name,
String defaultDatabase,
ClassLoader classLoader,
Options options) {
this.catalog = catalog;
...
}
public List<String> listTables(String databaseName)
throws DatabaseNotExistException, CatalogException {
try {
return catalog.listTables(databaseName);
} catch (Catalog.DatabaseNotExistException e) {
throw new DatabaseNotExistException(getName(), e.database());
}
}
...
}
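The delegation pattern FlinkCatalog follows can be reduced to a small sketch: the outer catalog forwards every call to the inner one and translates exceptions into the outer API's types. All classes below are simplified stand-ins, not the real Flink/Paimon types:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the wrapper pattern: the outer catalog holds an inner catalog,
// forwards each call, and translates the inner exception types.
public class CatalogWrapperDemo {
    static class InnerDatabaseNotExist extends Exception {}
    static class OuterDatabaseNotExist extends Exception {}

    interface InnerCatalog {
        List<String> listTables(String db) throws InnerDatabaseNotExist;
    }

    static class OuterCatalog {
        private final InnerCatalog inner;
        OuterCatalog(InnerCatalog inner) { this.inner = inner; }

        List<String> listTables(String db) throws OuterDatabaseNotExist {
            try {
                return inner.listTables(db); // the inner catalog does the real work
            } catch (InnerDatabaseNotExist e) {
                throw new OuterDatabaseNotExist(); // translate to the outer API's exception
            }
        }
    }

    public static List<String> demo() {
        OuterCatalog outer = new OuterCatalog(db -> Arrays.asList("tableA", "tableB"));
        try {
            return outer.listTables("default");
        } catch (OuterDatabaseNotExist e) {
            return Arrays.asList();
        }
    }
}
```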
Going back to the outer CatalogFactory.createCatalog(context, classLoader) call, the CatalogFactory here is the paimon.CatalogFactory class. Let's keep following:
static Catalog createCatalog(CatalogContext context, ClassLoader classLoader) {
Options options = context.options();
String metastore = options.get(METASTORE);
//as before, discover the matching CatalogFactory via the SPI mechanism
CatalogFactory catalogFactory =
FactoryUtil.discoverFactory(classLoader, CatalogFactory.class, metastore);
try {
return catalogFactory.create(context);
} catch (UnsupportedOperationException ignore) {
}
... // fallback creation path elided
}
Just as before, let's look at the subclasses of paimon.CatalogFactory:
Let's pick the most commonly used one, HiveCatalogFactory, and look at its implementation. First a usage example, then the source, which is not complicated:
-- creates an org.apache.paimon.hive.HiveCatalog
CREATE CATALOG my_hive WITH (
'type' = 'paimon',
'metastore' = 'hive'
);
/** Factory to create {@link HiveCatalog}. */
public class HiveCatalogFactory implements CatalogFactory {
public static final String IDENTIFIER = "hive";
...
@Override
public Catalog create(CatalogContext context) {
//delegates to HiveCatalog.createHiveCatalog
return HiveCatalog.createHiveCatalog(context);
}
}
...
public static Catalog createHiveCatalog(CatalogContext context) {
HiveConf hiveConf = createHiveConf(context);
Options options = context.options();
String warehouseStr = options.get(CatalogOptions.WAREHOUSE);
if (warehouseStr == null) {
warehouseStr =
hiveConf.get(METASTOREWAREHOUSE.varname, METASTOREWAREHOUSE.defaultStrVal);
}
Path warehouse = new Path(warehouseStr);
Path uri =
warehouse.toUri().getScheme() == null
? new Path(FileSystem.getDefaultUri(hiveConf))
: warehouse;
FileIO fileIO;
try {
fileIO = FileIO.get(uri, context);
fileIO.checkOrMkdirs(warehouse);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
//the HiveCatalog is created here
Catalog catalog =
new HiveCatalog(
fileIO,
hiveConf,
options.get(HiveCatalogFactory.METASTORE_CLIENT_CLASS),
options,
warehouse.toUri().toString());
PrivilegeManager privilegeManager =
new FileBasedPrivilegeManager(
warehouse.toString(),
fileIO,
context.options().get(PrivilegedCatalog.USER),
context.options().get(PrivilegedCatalog.PASSWORD));
if (privilegeManager.privilegeEnabled()) {
catalog = new PrivilegedCatalog(catalog, privilegeManager);
}
return catalog;
}
...
public HiveCatalog(
FileIO fileIO,
HiveConf hiveConf,
String clientClassName,
Options options,
String warehouse) {
super(fileIO, options);
this.hiveConf = hiveConf;
this.clientClassName = clientClassName;
this.warehouse = warehouse;
boolean needLocationInProperties =
hiveConf.getBoolean(
LOCATION_IN_PROPERTIES.key(), LOCATION_IN_PROPERTIES.defaultValue());
if (needLocationInProperties) {
locationHelper = new TBPropertiesLocationHelper();
} else {
// set the warehouse location to the hiveConf
hiveConf.set(HiveConf.ConfVars.METASTOREWAREHOUSE.varname, warehouse);
locationHelper = new StorageLocationHelper();
}
//when the HiveCatalog is created, a client is built for connecting to the Hive cluster
this.client = createClient(hiveConf, clientClassName);
}
...
public List<String> listDatabases() {
try {
//access the Hive cluster through the client
return client.getAllDatabases();
} catch (TException e) {
throw new RuntimeException("Failed to list all databases", e);
}
}
Let's take stock. We have covered two similarly named classes, flink.CatalogFactory and paimon.CatalogFactory, together with the catalogs they create: FlinkCatalog (implementing org.apache.flink.table.catalog.Catalog) and HiveCatalog (implementing org.apache.paimon.catalog.Catalog).
FlinkCatalog is the outer wrapper class; its methods actually delegate to the inner paimon.HiveCatalog.
Later we will also meet Flink's own org.apache.flink.table.catalog.hive.HiveCatalog class.
The two are distinguished below as flink.HiveCatalog and paimon.HiveCatalog.
2.1.2 create table
As we saw in section 1.3, creating a table fetches the corresponding catalog from CatalogManager.getCatalog and calls catalog.createTable() to create the table. Here that means calling FlinkCatalog.createTable():
@Override
public void createTable(ObjectPath tablePath, CatalogBaseTable table, boolean ignoreIfExists)
throws TableAlreadyExistException, DatabaseNotExistException, CatalogException {
Identifier identifier = toIdentifier(tablePath);
Map<String, String> options = new HashMap<>(table.getOptions());
//validate and build the Paimon options; if the table being created is not a Paimon table, this throws immediately
Schema paimonSchema = buildPaimonSchema(identifier, (CatalogTable) table, options);
try {
catalog.createTable(identifier, paimonSchema, ignoreIfExists);
} catch (Catalog.TableAlreadyExistException e) {
...
}
}
//paimon.AbstractCatalog
public void createTable(Identifier identifier, Schema schema, boolean ignoreIfExists)
throws TableAlreadyExistException, DatabaseNotExistException {
checkNotBranch(identifier, "createTable");
checkNotSystemTable(identifier, "createTable");
validateIdentifierNameCaseInsensitive(identifier);
validateFieldNameCaseInsensitive(schema.rowType().getFieldNames());
validateAutoCreateClose(schema.options());
if (!databaseExists(identifier.getDatabaseName())) {
throw new DatabaseNotExistException(identifier.getDatabaseName());
}
if (tableExists(identifier)) {
if (ignoreIfExists) {
return;
}
throw new TableAlreadyExistException(identifier);
}
copyTableDefaultOptions(schema.options());
//calls into paimon.HiveCatalog to create the Paimon table via the Hive metastore
createTableImpl(identifier, schema);
}
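The existence handling above can be condensed into a runnable sketch, with an in-memory map standing in for the real storage (a stand-in, not the real AbstractCatalog):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of AbstractCatalog.createTable's existence handling: if the table
// already exists, either return quietly (ignoreIfExists) or fail.
public class CreateTableDemo {
    private final Map<String, String> tables = new HashMap<>();

    public void createTable(String name, String schema, boolean ignoreIfExists) {
        if (tables.containsKey(name)) {
            if (ignoreIfExists) {
                return; // IF NOT EXISTS semantics: silently keep the existing table
            }
            throw new IllegalStateException("Table already exists: " + name);
        }
        tables.put(name, schema); // would delegate to createTableImpl in the real code
    }

    public String schemaOf(String name) { return tables.get(name); }
}
```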
2.1.3 get table
Likewise, getting a table ends up calling FlinkCatalog.getTable():
private CatalogTable getTable(ObjectPath tablePath, @Nullable Long timestamp)
throws TableNotExistException {
Table table;
try {
//delegates to the inner paimon catalog
table = catalog.getTable(toIdentifier(tablePath));
} catch (Catalog.TableNotExistException e) {
throw new TableNotExistException(getName(), tablePath);
}
...
if (table instanceof FileStoreTable) {
return toCatalogTable(table);
} else {
return new SystemCatalogTable(table);
}
}
//paimon.AbstractCatalog
public Table getTable(Identifier identifier) throws TableNotExistException {
if (isSystemDatabase(identifier.getDatabaseName())) {
...
} else if (isSpecifiedSystemTable(identifier)) {
...
} else {
try {
return getDataTable(identifier);
} catch (TableNotExistException e) {
return getFormatTable(identifier);
}
}
}
private FileStoreTable getDataTable(Identifier identifier) throws TableNotExistException {
Preconditions.checkArgument(identifier.getSystemTableName() == null);
TableSchema tableSchema = getDataTableSchema(identifier);
//build the table
return FileStoreTableFactory.create(
fileIO,
getTableLocation(identifier),
tableSchema,
new CatalogEnvironment(
identifier,
Lock.factory(
lockFactory().orElse(null), lockContext().orElse(null), identifier),
metastoreClientFactory(identifier).orElse(null),
lineageMetaFactory));
}
...
//FileStoreTableFactory.create -> FileStoreTableFactory.createWithoutFallbackBranch
//the call chain eventually creates either AppendOnlyFileStoreTable or PrimaryKeyFileStoreTable depending on whether primary keys exist
private static FileStoreTable createWithoutFallbackBranch(
FileIO fileIO,
Path tablePath,
TableSchema tableSchema,
Options dynamicOptions,
CatalogEnvironment catalogEnvironment) {
FileStoreTable table =
tableSchema.primaryKeys().isEmpty()
? new AppendOnlyFileStoreTable(
fileIO, tablePath, tableSchema, catalogEnvironment)
: new PrimaryKeyFileStoreTable(
fileIO, tablePath, tableSchema, catalogEnvironment);
return table.copy(dynamicOptions.toMap());
}
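That dispatch can be sketched with stand-in classes (simplified, not the real Paimon table types):

```java
import java.util.List;

// Sketch of the dispatch in createWithoutFallbackBranch: a table with no
// primary keys becomes an append-only table, otherwise a primary-key table.
public class TableFactoryDemo {
    interface FileStoreTable { String kind(); }

    static class AppendOnlyTable implements FileStoreTable {
        public String kind() { return "append-only"; }
    }

    static class PrimaryKeyTable implements FileStoreTable {
        public String kind() { return "primary-key"; }
    }

    // no primary keys -> append-only table; otherwise -> primary-key table
    public static FileStoreTable create(List<String> primaryKeys) {
        return primaryKeys.isEmpty() ? new AppendOnlyTable() : new PrimaryKeyTable();
    }
}
```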
We'll dig into the details of FileStoreTable in the next chapter.
2.2 FlinkGenericCatalogFactory
Why does the FlinkGenericCatalogFactory class exist? Have a look at these issues:
github.com/apache/paim…
github.com/apache/paim…
In short: the FlinkCatalog class above throws in createTable whenever the table being created is not a Paimon table. But both kinds of tables are Flink tables built on the Hive metastore, real-world jobs do mix Hive and Paimon tables in the same job, and Paimon's FlinkCatalog cannot query Hive-format and Paimon-format tables at the same time. So why not build a class that supports both Hive and Paimon tables?
Hence FlinkGenericCatalogFactory was born.
2.2.1 create Catalog
Let's start from FlinkGenericCatalogFactory.createCatalog():
public FlinkGenericCatalog createCatalog(Context context) {
//note: the HiveCatalogFactory and HiveCatalog created here are flink.HiveCatalog, not the paimon.HiveCatalog from the previous section
CatalogFactory hiveFactory = createHiveCatalogFactory(context.getClassLoader());
Context filteredContext = filterContextOptions(context, hiveFactory);
Catalog catalog = hiveFactory.createCatalog(filteredContext);
//pass in the freshly created flink.HiveCatalog
return createCatalog(
context.getClassLoader(), context.getOptions(), context.getName(), catalog);
}
public static FlinkGenericCatalog createCatalog(
ClassLoader cl, Map<String, String> optionMap, String name, Catalog flinkCatalog) {
Options options = Options.fromMap(optionMap);
//metastore=hive is hard-coded here, so what gets created below is a paimon.HiveCatalog
options.set(CatalogOptions.METASTORE, "hive");
//the catalog inside this FlinkCatalog is a paimon.HiveCatalog
FlinkCatalog paimon =
new FlinkCatalog(
org.apache.paimon.catalog.CatalogFactory.createCatalog(
CatalogContext.create(options, new FlinkFileIOLoader()), cl),
name,
options.get(DEFAULT_DATABASE),
cl,
options);
//so we pass in one FlinkCatalog and one flink.HiveCatalog
return new FlinkGenericCatalog(paimon, flinkCatalog);
}
Looking at the FlinkGenericCatalog implementation, many operations act on both catalogs: the flink.HiveCatalog issues requests to the Hive HMS, while the FlinkCatalog operates on Paimon. For example:
public class FlinkGenericCatalog extends AbstractCatalog {
...
public void createTable(ObjectPath tablePath, CatalogBaseTable table, boolean ignoreIfExists)
throws TableAlreadyExistException, DatabaseNotExistException, CatalogException {
String connector = table.getOptions().get(CONNECTOR.key());
if (connector == null) {
throw new RuntimeException(
"FlinkGenericCatalog can not create table without 'connector' key.");
}
//if connector=paimon, route to paimon.createTable
if (FlinkCatalogFactory.IDENTIFIER.equals(connector)) {
paimon.createTable(tablePath, table, ignoreIfExists);
} else {
flink.createTable(tablePath, table, ignoreIfExists);
}
}
public CatalogBaseTable getTable(ObjectPath tablePath)
throws TableNotExistException, CatalogException {
//try paimon.getTable first, fall back to flink.getTable
try {
return paimon.getTable(tablePath);
} catch (TableNotExistException e) {
return flink.getTable(tablePath);
}
}
...
}
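The two routing rules can be condensed into a runnable sketch, with two in-memory maps standing in for the paimon and flink catalogs:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of FlinkGenericCatalog's routing: creation is dispatched on the
// table's 'connector' option; lookup tries the paimon side first and falls
// back to the flink (hive) side.
public class GenericRoutingDemo {
    private final Map<String, String> paimonTables = new HashMap<>();
    private final Map<String, String> hiveTables = new HashMap<>();

    // create: route by the table's 'connector' option
    public void createTable(String name, Map<String, String> options) {
        String connector = options.get("connector");
        if (connector == null) {
            throw new RuntimeException("can not create table without 'connector' key");
        }
        if ("paimon".equals(connector)) {
            paimonTables.put(name, connector);
        } else {
            hiveTables.put(name, connector);
        }
    }

    // get: try the paimon side first, fall back to the hive side
    public String getTable(String name) {
        String found = paimonTables.get(name);
        return found != null ? found : hiveTables.get(name);
    }
}
```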
3. Summary
That wraps up the Paimon Catalog material. To summarize:
- flink.CatalogFactory has two subclasses, both of which are outer wrapper classes:
  - FlinkCatalogFactory: supports Paimon tables only
  - FlinkGenericCatalogFactory: supports both Paimon and Hive tables
- paimon.CatalogFactory has three subclasses, which are the catalogs that do the actual work:
  - HiveCatalogFactory
  - FileSystemCatalogFactory
  - JdbcCatalogFactory
- When getting a table, FlinkCatalog creates a FileStoreTable, which has two subclasses:
  - AppendOnlyFileStoreTable when the table has no primary key
  - PrimaryKeyFileStoreTable when the table has a primary key