This article applies to Flink 1.10-1.12. Flink 1.13 and later appear to have changed how the Hadoop configuration directory is handled, so verify the details against your own version before following these steps.
Flink on K8s in APP mode
In APP mode of Flink on K8s, reading from Hadoop requires the Hadoop dependency jar first. I dropped flink-shaded-hadoop-2-uber-2.8.3-10.0.jar into Flink's lib directory; if you run a different Hadoop version, add the matching shaded jar instead.
The HDFS configuration files also have to be baked into the image at build time, essentially just core-site.xml and hdfs-site.xml. The default Hadoop configuration directory inside the container is /etc/hadoop/conf; put the two files there and Flink picks them up on its own.
COPY --chown=flink:flink $hadoop_conf/ /etc/hadoop/conf
Reading Hive additionally requires pointing the SQL at the directory that holds hive-site.xml, so hive-site.xml must be baked into the image as well. It can live in any directory Flink can read; just reference that directory in the catalog definition.
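For reference, a minimal Dockerfile sketch of the image build described above; the base image tag, the build-context paths ($hadoop_conf, hive-conf/) and the /opt/hive-conf target directory are assumptions, adjust them to your environment:

FROM flink:1.12.7-scala_2.12
# build arg pointing at the local Hadoop config directory in the build context (assumed name)
ARG hadoop_conf=hadoop-conf
# shaded Hadoop dependency onto Flink's classpath (use the jar matching your Hadoop version)
COPY --chown=flink:flink flink-shaded-hadoop-2-uber-2.8.3-10.0.jar /opt/flink/lib/
# core-site.xml and hdfs-site.xml into the default directory Flink reads
COPY --chown=flink:flink $hadoop_conf/ /etc/hadoop/conf
# hive-site.xml can go into any directory Flink can read; reference it later via 'hive-conf-dir'
COPY --chown=flink:flink hive-conf/ /opt/hive-conf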
The reason for writing catalog.dbName.tableName is simply convenience. The example on the Flink website uses USE CATALOG catalogName and then refers to tableName, which is more cumbersome: once you are done you have to switch back to Flink's default catalog, otherwise the other sources cannot be used. Fully qualifying the name as catalogName.dbName.tableName avoids the switching entirely, as the sketch below shows.
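To make the difference concrete, a small sketch of the two styles (myhive, test and the table names are just placeholders; default_catalog and default_database are Flink's built-in defaults):

-- Flink-docs style: switch catalogs, then remember to switch back
USE CATALOG myhive;
INSERT INTO test.titanic_sink SELECT * FROM default_catalog.default_database.titanic_source;
USE CATALOG default_catalog;

-- fully qualified names: no switching, other sources stay usable
INSERT INTO myhive.test.titanic_sink SELECT * FROM titanic_source;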
An example of the SQL:
CREATE CATALOG myhive WITH (
    'type' = 'hive',
    'default-database' = 'test',
    'hive-conf-dir' = '<directory containing hive-site.xml inside the container>'
);

CREATE TABLE titanic_source (
    passengerid INT,
    survived INT,
    pclass INT,
    name STRING,
    sex STRING,
    age DOUBLE,
    sibsp INT,
    parch INT,
    ticket STRING,
    fare DOUBLE,
    cabin STRING,
    embarked STRING
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://ip:port/db_name',
    'username' = 'root',
    'password' = 'xxx',
    'table-name' = 'tablename'
);

INSERT INTO myhive.test.titanic_sink SELECT * FROM titanic_source;
Flink on K8s in SESSION mode
In SESSION mode of Flink on K8s, accessing Hadoop only requires that the submitting machine has a working Hadoop client; accessing Hive additionally requires the hive-site.xml configuration file.
This is because in SESSION mode the JobGraph is built locally and then executed on the cluster; all configuration is read while the JobGraph is being built, so the cluster only needs the Hadoop jars and the configuration files can be omitted there. Of course, having the configuration files in the image does no harm, and keeps APP mode working as well.
The SQL is the same as above. The difference is that in SESSION mode the configuration is read from the web (submitting) machine, i.e. 'hive-conf-dir' = '<local configuration directory>': it points at the local directory containing hive-site.xml, and the local Hadoop environment must be usable.
CREATE CATALOG myhive WITH (
    'type' = 'hive',
    'default-database' = 'test',
    'hive-conf-dir' = '<directory containing the local hive-site.xml>'
);

CREATE TABLE titanic_source (
    `passengerid` INT,
    `survived` INT,
    `pclass` INT,
    `name` STRING,
    `sex` STRING,
    `age` DOUBLE,
    `sibsp` INT,
    `parch` INT,
    `ticket` STRING,
    `fare` DOUBLE,
    `cabin` STRING,
    `embarked` STRING
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://ip:port/db_name',
    'username' = 'root',
    'password' = 'xxx',
    'table-name' = 'tablename'
);

INSERT INTO myhive.test.titanic_sink SELECT * FROM titanic_source;
Configuration file contents
Contents of hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://ip:port/dbName</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>HiveServer2 Thrift host (IP or domain name)</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>port</value> <!-- HiveServer2 remote connection port, default 10000 -->
<description>Port number of HiveServer2 Thrift interface.
Can be overridden by setting $HIVE_SERVER2_THRIFT_PORT</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>HDFS warehouse directory</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>datanucleus.autoStartMechanism</name>
<value>SchemaTable</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://ip:port</value>
</property>
</configuration>
Hadoop configuration
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ip:port</value>
</property>
<!-- Change the default location where Hadoop stores its data -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-2.10.1/tmp</value>
</property>
<property>
<name>hadoop.http.filter.initializers</name>
<value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
<property>
<name>hadoop.http.authentication.type</name>
<value>simple</value>
</property>
<property>
<name>hadoop.http.authentication.token.validity</name>
<value>3600</value>
</property>
<property>
<name>hadoop.http.authentication.signature.secret.file</name>
<value>/home/package/hadoop-2.10.1/etc/hadoop/secret</value>
</property>
<property>
<name>hadoop.http.authentication.cookie.domain</name>
<value></value>
</property>
<property>
<name>hadoop.http.authentication.simple.anonymous.allowed</name>
<value>false</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/hadoop-2.10.1/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/hadoop-2.10.1/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Number of replicas for each HDFS data block; the default is 3</description>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
<description>disable HDFS permission checks</description>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>ip:port</value>
</property>
</configuration>