Troubleshooting "No suitable driver found" when submitting a flink-cdc job


1. Submitting the job via bin/flink run fails

The job uses flink-mysql-cdc to pull data from MySQL into the Paimon data lake. The Flink version is 1.16.1 and the Paimon version is flink-1.16-0.5-SNAPSHOT. On the surface, the problem appeared after introducing the Hive dependency.

./bin/flink run -d -c org.apache.paimon.flink.action.FlinkActions \
./lib/paimon-flink-1.16-0.5-SNAPSHOT.jar \
mysql-sync-database \
--warehouse hdfs://xbstar00:9000/warehouse/paimon/test \
--database xex_mini3 \
--including-tables 'bxxxx|xxxm' \
--table-prefix ods_ \
--mysql-conf hostname=192.168.1.xxx \
--mysql-conf username=root \
--mysql-conf password=xxxx \
--mysql-conf database-name=xex_xxx \
--table-conf bucket=1 \
--table-conf changelog-producer=input \
--table-conf sink.parallelism=1

Submitting the job produced the following error:

 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: No suitable driver found for jdbc:mysql://192.168.1.214:3306/
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
	at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:843)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:240)
	at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1087)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1165)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1165)
Caused by: java.sql.SQLException: No suitable driver found for jdbc:mysql://192.168.1.214:3306/
	at java.sql.DriverManager.getConnection(DriverManager.java:689)
	at java.sql.DriverManager.getConnection(DriverManager.java:247)
	at org.apache.paimon.flink.action.cdc.mysql.MySqlActionUtils.getConnection(MySqlActionUtils.java:72)
	at org.apache.paimon.flink.action.cdc.mysql.MySqlSyncDatabaseAction.getMySqlSchemaList(MySqlSyncDatabaseAction.java:255)
	at org.apache.paimon.flink.action.cdc.mysql.MySqlSyncDatabaseAction.build(MySqlSyncDatabaseAction.java:165)
	at org.apache.paimon.flink.action.cdc.mysql.MySqlSyncDatabaseAction.run(MySqlSyncDatabaseAction.java:440)
	at org.apache.paimon.flink.action.FlinkActions.main(FlinkActions.java:47)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355)
	... 11 more

2. Starting the investigation

  1. First, I confirmed that the MySQL dependency really does exist under lib: the com.mysql.cj.jdbc.Driver class is inside flink-sql-connector-mysql-cdc-2.3.0.jar (see the sketch after the listing).
[root@xbstar00 flink-1.16.1]# ls lib
flink-cep-1.16.1.jar                            flink-table-api-java-uber-1.16.1.jar
flink-connector-files-1.16.1.jar                flink-table-planner-loader-1.16.1.jar
flink-csv-1.16.1.jar                            flink-table-runtime-1.16.1.jar
flink-dist-1.16.1.jar                           hive-exec-2.3.9.jar
flink-json-1.16.1.jar                           log4j-1.2-api-2.17.1.jar
flink-scala_2.12-1.16.1.jar                     log4j-api-2.17.1.jar
flink-shaded-hadoop-2-uber-2.7.5-8.0.jar        log4j-core-2.17.1.jar
flink-shaded-zookeeper-3.5.9.jar                log4j-slf4j-impl-2.17.1.jar
flink-sql-connector-hive-2.3.9_2.12-1.16.2.jar  paimon-flink-1.16-0.5-SNAPSHOT.jar
flink-sql-connector-mysql-cdc-2.3.0.jar
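As a quick sanity check, the presence of the class inside the jar can also be confirmed programmatically. A minimal sketch (not part of the original debugging session; the jar path assumes Flink's installation directory as the working directory):

```java
import java.util.jar.JarFile;

public class CheckDriverInJar {
    public static void main(String[] args) throws Exception {
        // a compiled class sits in the jar under its package path
        try (JarFile jar = new JarFile("lib/flink-sql-connector-mysql-cdc-2.3.0.jar")) {
            boolean present = jar.getJarEntry("com/mysql/cj/jdbc/Driver.class") != null;
            System.out.println("com.mysql.cj.jdbc.Driver present: " + present);
        }
    }
}
```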
  2. The whole-database sync started normally back when it only wrote to HDFS, before the HiveCatalog was integrated, so the problem was narrowed down to flink-sql-connector-hive-2.3.9_2.12-1.16.2.jar. The previously working command:
${FLINK16_HOME}/bin/flink run-application \  
-c org.apache.paimon.flink.action.FlinkActions \  
./lib/paimon-flink-1.16-0.5-SNAPSHOT.jar \  
mysql-sync-database \  
--warehouse hdfs://xbstar00:9000/warehouse/paimon/test \  
--database xex_mini \  
--including-tables 'bxxx|xxom' \  
--table-prefix ods_ \  
--mysql-conf hostname=192.168.1.xxx \  
--mysql-conf username=root \  
--mysql-conf password=**** \  
--mysql-conf database-name=xex_*** \  
--table-conf bucket=1 \  
--table-conf changelog-producer=input \  
--table-conf sink.parallelism=1
  3. However! In my experience the Hive and MySQL jars do not conflict, and after unpacking both and comparing them, I found no conflicting classes either.

  4. What's more, at this point Flink's sql-client could load and use all of these dependencies just fine.

  5. Enter Flink's sql-client:
[root@xbstar00 flink-1.16.1]# bin/sql-client.sh
Command history file path: /root/.flink-sql-history

Flink SQL> SET 'execution.runtime-mode' = 'batch';
[INFO] Session property has been set.
  6. Create a Paimon catalog:
Flink SQL> CREATE
> CATALOG paimon_catalog
> WITH ( 'type' = 'paimon',
>     'warehouse' = 'hdfs://192.168.1.180:9000/warehouse/paimon/test/');
[INFO] Execute statement succeed.
  7. Create a Hive catalog:
Flink SQL> CREATE CATALOG myhive WITH (
>   'type' = 'hive',
>   'hive-conf-dir' = '/home/hadoop/app/apache-hive-2.3.9-bin/conf'
> );
[INFO] Execute statement succeed.
  8. Create a table with mysql-cdc:
Flink SQL> use catalog myhive;
[INFO] Execute statement succeed.
Flink SQL> create table if not exists ods_test
> (
>     id   int,
>     name string,
>     primary key (id) not enforced
> ) with (
>     'hostname' = '192.168.1.xxx',
>     'port' = '3306',
>     'username' = 'root',
>     'password' = '****',
>     'database-name' = 'xex_xxx',
>     'table-name' = 'xxx',
>     'server-time-zone' = 'Asia/Shanghai',
>     'connector' = 'mysql-cdc'
> );
[INFO] Execute statement succeed.
  9. Create the same kind of table under a Paimon catalog (hive metastore):
Flink SQL> CREATE
> CATALOG paimon_hive_catalog
> WITH ( 'type' = 'paimon',
>     'metastore' = 'hive',
>     'uri' = 'thrift://192.168.1.180:9083',
>     'warehouse' = 'hdfs://192.168.1.180:9000/warehouse/paimon/test/');
[INFO] Execute statement succeed.
Flink SQL> use `default`;
[INFO] Execute statement succeed.
Flink SQL> create table if not exists ods_test2
> (
>     id   int,
>     name string,
>     primary key (id) not enforced
> ) with (
>     'hostname' = '192.168.1.xxx',
>     'port' = '3306',
>     'username' = 'root',
>     'password' = '****',
>     'database-name' = 'xex_xxx',
>     'table-name' = 'xxx',
>     'server-time-zone' = 'Asia/Shanghai',
>     'connector' = 'mysql-cdc'
> );
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.catalog.exceptions.CatalogException: Paimon Catalog only supports paimon tables , and you don't need to specify  'connector'= 'paimon' when using Paimon Catalog
 You can create TEMPORARY table instead if you want to create the table of other connector.
  10. Under both the Paimon catalog and the Hive catalog the tables returned data normally, so the dependencies themselves seemed fine; this looked like a class-loading problem.

3. As soon as flink-sql-connector-hive-2.3.9_2.12-1.16.2.jar is added to lib, CDC jobs that do not use the hive metastore at all fail with the same error

After much trial and error, I found that the job runs normally as long as the mysql-cdc jar sorts before the Hive jar in lib, and fails whenever the Hive jar sorts first; renaming the jars with prefixes controls that order:

[root@xbstar00 flink-1.16.1]# ls lib
a08-flink-sql-connector-hive-2.3.9_2.12-1.16.2.jar  flink-shaded-zookeeper-3.5.9.jar
a09-flink-sql-connector-mysql-cdc-2.3.0.jar         flink-table-api-java-uber-1.16.1.jar
flink-cep-1.16.1.jar                                flink-table-planner-loader-1.16.1.jar
flink-connector-files-1.16.1.jar                    flink-table-runtime-1.16.1.jar
flink-csv-1.16.1.jar                                log4j-1.2-api-2.17.1.jar
flink-dist-1.16.1.jar                               log4j-api-2.17.1.jar
flink-json-1.16.1.jar                               log4j-core-2.17.1.jar
flink-scala_2.12-1.16.1.jar                         log4j-slf4j-impl-2.17.1.jar
flink-shaded-hadoop-2-uber-2.7.5-8.0.jar            paimon-flink-1.16-0.5-SNAPSHOT.jar
[root@xbstar00 flink-1.16.1]#

4. This rang a bell: flink-cdc 2.1 had a similar issue, <# After upgrading to 2.1, the job starts locally but fails on the Flink cluster == No suitable driver #628>

  1. First, simply try a fix: add the following code to org.apache.paimon.flink.action.FlinkActions.

Note that the Paimon project contains two classes with this name, one under paimon-flink/paimon-flink-action and one under paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/action.
Although the one under paimon-flink-common carries a @Deprecated annotation, the class that ends up packaged in paimon-flink-1.16-0.5-SNAPSHOT.jar is still the one from common, so that is the one to modify.

try {
    Class.forName("com.mysql.cj.jdbc.Driver");
} catch (Exception e) {
    e.printStackTrace();
}
  2. Repackage and swap the new Paimon jar into flink/lib for testing.

Run mvn install for paimon-flink-common first, then mvn package for paimon-flink-1.16.

  3. With the jar in flink/lib, run the job again (Flink must be restarted); the error persists:
java.sql.SQLException: No suitable driver found for jdbc:mysql://192.168.1.214:3306/
  4. Tried again without putting the jar in lib; same result.
  5. How does Class.forName register a driver? See 【JDBC篇】Class.forName原理剖析:
     In projects we often use reflection, Class.forName("com.mysql.jdbc.Driver"), to load the Driver class into memory, replacing the verbose object-creation boilerplate. Why does this work? The source shows that Driver contains a static initializer block; when the class enters memory and is initialized, the static block runs.
static {
    try {
        java.sql.DriverManager.registerDriver(new Driver()); // register the driver
    } catch (SQLException E) {
        throw new RuntimeException("Can't register driver!");
    }
}

Why can Class.forName register a driver?

"Registering a driver" here means adding a java.sql.Driver implementation (for MySQL, that driver is com.mysql.cj.jdbc.Driver) to DriverManager.registeredDrivers, the collection that stores driver information.
Class.forName("com.mysql.cj.jdbc.Driver") only loads the class and never explicitly registers anything, so why does registration still happen? Opening com.mysql.cj.jdbc.Driver, we can see that the static block performs the registration, and the static block executes when the class is loaded and initialized. That is why Class.forName() registers the driver.
(Quoted from the CSDN post by 「南斋孤鹤」, CC 4.0 BY-SA; original: blog.csdn.net/m0_64231944…)
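To see this mechanism in isolation, here is a minimal self-contained sketch (hypothetical class names, not from the Paimon or MySQL code) showing that merely loading a class with Class.forName fires its static initializer:

```java
public class StaticInitDemo {
    static class FakeDriver {
        static {
            // a real JDBC driver would call DriverManager.registerDriver(...) here
            System.out.println("static block executed: driver registers itself now");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("before Class.forName");
        // the 'true' flag requests initialization, so the static block runs here
        Class.forName("StaticInitDemo$FakeDriver", true, StaticInitDemo.class.getClassLoader());
        System.out.println("after Class.forName");
    }
}
```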

  6. Then why, with the Driver registered, does it still report "not found"? The failing method, MySqlActionUtils.getConnection, really does connect through DriverManager:
static Connection getConnection(Configuration mySqlConfig) throws Exception {
    DriverManager.setLogWriter(new PrintWriter(System.out));
    return DriverManager.getConnection(
            String.format(
                    "jdbc:mysql://%s:%d/",
                    mySqlConfig.get(MySqlSourceOptions.HOSTNAME),
                    mySqlConfig.get(MySqlSourceOptions.PORT)),
            mySqlConfig.get(MySqlSourceOptions.USERNAME),
            mySqlConfig.get(MySqlSourceOptions.PASSWORD));
}
  7. Stepping into DriverManager, we can see that a ClassLoader is consulted when connecting, so the problem may lie with Flink's ClassLoader. For Flink's class-loading mechanism, see 再谈双亲委派模型与Flink的类加载策略:
// Worker method called by the public getConnection() methods.
private static Connection getConnection(
        String url, java.util.Properties info, Class<?> caller) throws SQLException {
    /*
     * When callerCl is null, we should check the application's
     * (which is invoking this class indirectly)
     * classloader, so that the JDBC driver class outside rt.jar
     * can be loaded from here.
     */
    ClassLoader callerCL = caller != null ? caller.getClassLoader() : null;
    if (callerCL == null || callerCL == ClassLoader.getPlatformClassLoader()) {
        callerCL = Thread.currentThread().getContextClassLoader();
    }

    if (url == null) {
        throw new SQLException("The url cannot be null", "08001");
    }

    println("DriverManager.getConnection(\"" + url + "\")");

    ensureDriversInitialized();

    // Walk through the loaded registeredDrivers attempting to make a connection.
    // Remember the first exception that gets raised so we can reraise it.
    SQLException reason = null;

    for (DriverInfo aDriver : registeredDrivers) {
        // If the caller does not have permission to load the driver then
        // skip it.
        if (isDriverAllowed(aDriver.driver, callerCL)) {
            try {
                println("    trying " + aDriver.driver.getClass().getName());
                Connection con = aDriver.driver.connect(url, info);
                if (con != null) {
                    // Success!
                    println("getConnection returning " + aDriver.driver.getClass().getName());
                    return (con);
                }
            } catch (SQLException ex) {
                if (reason == null) {
                    reason = ex;
                }
            }
        } else {
            println("    skipping: " + aDriver.getClass().getName());
        }
    }

    // if we got here nobody could connect.
    if (reason != null) {
        println("getConnection failed: " + reason);
        throw reason;
    }

    println("getConnection: no suitable driver found for " + url);
    throw new SQLException("No suitable driver found for " + url, "08001");
}
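The isDriverAllowed check above is what enforces class-loader visibility. A simplified paraphrase of the JDK source (a sketch, not the verbatim code): it reloads the driver's class through the caller's ClassLoader and demands identity with the registered instance's class:

```java
private static boolean isDriverAllowed(Driver driver, ClassLoader classLoader) {
    if (driver == null) {
        return false;
    }
    Class<?> aClass = null;
    try {
        // resolve the driver class through the caller's ClassLoader
        aClass = Class.forName(driver.getClass().getName(), true, classLoader);
    } catch (Exception ex) {
        // not visible to this loader; aClass stays null
    }
    // identity comparison: the same class name loaded by two different
    // loaders yields two distinct Class objects and fails this check
    return aClass == driver.getClass();
}
```

So even a registered driver is skipped if the caller's loader resolves a different Class object for the same name.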

5. Investigating the ClassLoader

  1. Add the following code to FlinkActions' main method:
System.out.println(com.mysql.cj.jdbc.Driver.class.getClassLoader());  
System.out.println(FlinkActions.class.getClassLoader());
  2. Repackage and test, while also trying different values of the classloader.resolve-order property in flink-conf.yaml (the default is child-first). The results:
| resolve-order | Driver's ClassLoader | FlinkActions' ClassLoader |
| --- | --- | --- |
| child-first | sun.misc.Launcher$AppClassLoader@61bbe9ba | org.apache.flink.util.ChildFirstClassLoader@48d61b48 |
| parent-first | sun.misc.Launcher$AppClassLoader@61bbe9ba | org.apache.flink.util.FlinkUserCodeClassLoaders$ParentFirstClassLoader@48d61b48 |
  3. In both cases DriverManager cannot obtain the Driver, since com.mysql.cj.jdbc.Driver's ClassLoader is always the AppClassLoader; unifying the two ClassLoaders might therefore solve the problem.
  4. Some searching showed the loading can be redirected (see Flink类加载机制与--classpath参数动态加载外部类分析), so under child-first mode the class loaders can perhaps be unified:
[root@xbstar00 flink-1.16.1]# bin/flink run -h
Action "run" compiles and runs a program.

  Syntax: run [OPTIONS] <jar-file> <arguments>
  "run" action options:
     -c,--class <classname>                     Class with the program entry
                                                point ("main()" method). Only
                                                needed if the JAR file does not
                                                specify the class in its
                                                manifest.
     -C,--classpath <url>                       Adds a URL to each user code
                                                classloader  on all nodes in the
                                                cluster. The paths must specify
                                                a protocol (e.g. file://) and be
                                                accessible on all nodes (e.g. by
                                                means of a NFS share). You can
                                                use this option multiple times
                                                for specifying more than one
                                                URL. The protocol must be
                                                supported by the {@link
                                                java.net.URLClassLoader}.
     -d,--detached                              If present, runs the job in
                                                detached mode

Add the --classpath parameter to the submit command:

./bin/flink run -d -c org.apache.paimon.flink.action.FlinkActions -C file:///home/hadoop/app/test/flink-1.16.1/lib/a09-flink-sql-connector-mysql-cdc-2.3.0.jar  ./paimon-flink-1.16-0.5-SNAPSHOT.jar   mysql-sync-database   --warehouse ... ...
...
...
org.apache.flink.util.ChildFirstClassLoader@48d61b48
org.apache.flink.util.ChildFirstClassLoader@48d61b48
DriverManager.getConnection("jdbc:mysql://192.168.1.214:3306/")
    trying com.mysql.cj.jdbc.Driver
getConnection returning com.mysql.cj.jdbc.Driver

Both classes are now loaded by org.apache.flink.util.ChildFirstClassLoader@48d61b48, because -C adds the jar URL to the user-code classloader, which in child-first mode loads the driver class itself instead of delegating to the AppClassLoader. The job finally submits successfully.

6. Then why does merely reordering the jars also fix it, without changing any configuration, even though the ClassLoaders are still not the same?

Moreover, there is no "skipping" log from DriverManager, which means that when the error occurs, execution never even enters the for loop (on how to enable these logs, see DriverManager初始化的日志怎么打印?):

for (DriverInfo aDriver : registeredDrivers) {...}
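An empty loop means registeredDrivers itself is empty. A diagnostic sketch (added here for illustration, not from the original session) makes that visible from user code; DriverManager.getDrivers() applies the same caller-ClassLoader filter as getConnection():

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.Enumeration;

public class ListDrivers {
    public static void main(String[] args) {
        // enumerate every registered driver visible to this class
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        if (!drivers.hasMoreElements()) {
            System.out.println("no drivers visible to " + ListDrivers.class.getClassLoader());
        }
        while (drivers.hasMoreElements()) {
            Driver d = drivers.nextElement();
            System.out.println(d.getClass().getName()
                    + " loaded by " + d.getClass().getClassLoader());
        }
    }
}
```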

Also, loading the MySQL driver inside org.apache.paimon.flink.action.cdc.mysql.MySqlActionUtils#getConnection fixes it as well:

static Connection getConnection(Configuration mySqlConfig) throws Exception {
    DriverManager.setLogWriter(new PrintWriter(System.out));
    try {
        Class.forName("com.mysql.cj.jdbc.Driver");
    } catch (Exception e) {
        e.printStackTrace();
    }
    return DriverManager.getConnection(
            String.format(
                    "jdbc:mysql://%s:%d/",
                    mySqlConfig.get(MySqlSourceOptions.HOSTNAME),
                    mySqlConfig.get(MySqlSourceOptions.PORT)),
            mySqlConfig.get(MySqlSourceOptions.USERNAME),
            mySqlConfig.get(MySqlSourceOptions.PASSWORD));
}

So the root cause of the error is that the com.mysql.cj.jdbc.Driver class was never loaded at all. But why does swapping the jar order make it load, when the code contains no explicit import of it either? To be continued...
If anyone understands this, please enlighten me~

Update

7. Case closed

The answer lies in how JDBC's DriverManager loads drivers through the SPI ServiceLoader; see JDBC DriverManager驱动加载与SPI ServiceLoader.

  1. Inspecting flink-sql-connector-hive-2.3.9_2.12-1.16.2.jar shows that it does declare a Driver of its own, via a META-INF/services/java.sql.Driver service entry.

  2. Now it gets interesting: the declared class is not in that jar at all, but in flink-table-planner_2.12-1.16.2.jar. Yet flink/lib on the server does not contain that jar; it ships flink-table-planner-loader-1.16.1.jar instead.

  3. And the loader jar does not contain this Driver either, so the problem is with the sql-hive jar itself: it declares org.apache.calcite.jdbc.Driver without the class being present. Simply putting a jar that contains org.apache.calcite.jdbc.Driver into lib fixes the error.

  4. So the key difference is that the local environment used flink-table-planner_2.12, while the server environment used flink-table-planner-loader-1.16.1.jar.

  5. As for why both planner and planner-loader normally run fine, see the section on reorganizing the table modules and introducing flink-table-planner-loader in the Flink 1.15 release notes (# Flink1.15 发布最新版本说明).
  6. It was later verified that either swapping planner-loader for planner, or directly adding any jar that contains org.apache.calcite.jdbc.Driver, resolves the problem.
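Putting the pieces together (my reading of the JDK's DriverManager initialization, sketched below rather than quoted verbatim): registeredDrivers is populated by a ServiceLoader scan over every META-INF/services/java.sql.Driver file on the classpath, visited in classpath order, which for flink/lib evidently follows the sorted jar names (hence the a08/a09 renaming trick). The hive connector jar declares org.apache.calcite.jdbc.Driver, whose class is absent, so instantiating it throws a ServiceConfigurationError; the JDK swallows the error and stops scanning, and every driver entry after the broken one, including com.mysql.cj.jdbc.Driver, is silently never registered:

```java
import java.sql.Driver;
import java.util.Iterator;
import java.util.ServiceLoader;

public class SpiOrderDemo {
    public static void main(String[] args) {
        // DriverManager's driver initialization does essentially this:
        // visit all META-INF/services/java.sql.Driver entries in classpath order...
        ServiceLoader<Driver> loadedDrivers = ServiceLoader.load(Driver.class);
        Iterator<Driver> it = loadedDrivers.iterator();
        try {
            while (it.hasNext()) {
                // next() instantiates the provider; a declared-but-missing class
                // such as org.apache.calcite.jdbc.Driver throws
                // ServiceConfigurationError right here
                System.out.println("loaded " + it.next().getClass().getName());
            }
        } catch (Throwable t) {
            // ...and the JDK swallows the error OUTSIDE the loop, so every entry
            // after the broken one is never loaded or registered
            System.out.println("scan aborted: " + t);
        }
    }
}
```

With the mysql-cdc jar sorted first, com.mysql.cj.jdbc.Driver is registered before the scan hits the broken calcite entry, which is exactly why reordering the jars was enough.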

Summary

All this carnage from a single missing class dependency!
Takeaway: a much deeper understanding of Flink's class-loading mechanism. Since the class genuinely was never loaded, every approach above resolves the dependency problem, whether an explicit Class.forName or unifying loading under child-first. At heart, though, the failure lies in how DriverManager's driver-loading mechanism behaves: its SPI scan gives up at the first broken service entry.