
This article looks at the interactive Scala shell that ships with Apache Flink. The shell can be run against a local setup or against a cluster, and the code you write is then executed on top of it.
Scala REPL
The start-scala-shell.sh script lives under the bin directory of the Flink installation. Start the shell in local mode with:
bin/start-scala-shell.sh local
For detailed usage, see the Scala REPL docs: https://github.com/Jonathan-Wei/Flink-Docs-CN/blob/master/06%20%E9%83%A8%E7%BD%B2-%E6%93%8D%E4%BD%9C/07%20Scala%20REPL.md
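Besides local mode, the same script can attach to a running remote cluster or bring up a YARN session. A few illustrative invocations (flag names vary across Flink versions, so treat these as examples and check the script's usage output):

# attach to a running remote cluster by JobManager host and port
bin/start-scala-shell.sh remote <hostname> <portnumber>

# start a new YARN session for the shell (-n set the container count in older 1.x releases)
bin/start-scala-shell.sh yarn -n 2

# put extra user jars on the classpath in any mode
bin/start-scala-shell.sh local --addclasspath <path/to/jar>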
start-scala-shell.sh
Here the interesting part is what the script ultimately invokes. As the snippet below shows, it ends up launching FlinkShell.scala:
if ${EXTERNAL_LIB_FOUND}
then
    java -Dscala.color -cp "$FLINK_CLASSPATH" $log_setting org.apache.flink.api.scala.FlinkShell $@ --addclasspath "$EXT_CLASSPATH"
else
    java -Dscala.color -cp "$FLINK_CLASSPATH" $log_setting org.apache.flink.api.scala.FlinkShell $@
fi
FlinkShell.scala
Starting from the main() method, let's look at the concrete logic.
cmd("local") action { (_, c) => c.copy(executionMode = ExecutionMode.LOCAL)}text "Starts Flink scala shell with a local Flink cluster" children( ...)cmd("remote") action { (_, c) => c.copy(executionMode = ExecutionMode.REMOTE)} text "Starts Flink scala shell connecting to a remote cluster" children( ...)cmd("yarn") action { (_, c) => c.copy(executionMode = ExecutionMode.YARN, yarnConfig = None)} text "Starts Flink scala shell connecting to a yarn cluster" children( ...)// parse argumentsparser.parse(args, Config()) match { case Some(config) => startShell(config) case _ => println("Could not parse program arguments")}
As the code shows, there are three startup modes: local, remote, and yarn. Once a mode is specified, executionMode is set to the corresponding value, and the shell is finally launched via startShell.
The startShell method
val (repl, cluster) = try {
  val (host, port, cluster) = fetchConnectionInfo(configuration, config)
  val conf = cluster match {
    case Some(Left(_)) => configuration
    case Some(Right(yarnCluster)) => yarnCluster.getFlinkConfiguration
    case None => configuration
  }
  println(s"\nConnecting to Flink cluster (host: $host, port: $port).\n")

  val repl = bufferedReader match {
    case Some(reader) =>
      val out = new StringWriter()
      new FlinkILoop(host, port, conf, config.externalJars, reader, new JPrintWriter(out))
    case None =>
      new FlinkILoop(host, port, conf, config.externalJars)
  }

  (repl, cluster)
} catch {
  case e: IllegalArgumentException =>
    println(s"Error: ${e.getMessage}")
    sys.exit()
}
The code above does two things. First, it fetches the connection information (host, port, cluster) via fetchConnectionInfo and resolves the configuration; second, it constructs a FlinkILoop instance repl from that configuration. Processing is then started via repl.process(settings):
try {
  repl.process(settings)
} finally {
  repl.closeInterpreter()
  cluster match {
    case Some(Left(miniCluster)) => miniCluster.close()
    case Some(Right(yarnCluster)) =>
      yarnCluster.shutDownCluster()
      yarnCluster.shutdown()
    case _ =>
  }
}
Now let's look at the key fetchConnectionInfo method.
def fetchConnectionInfo(
    configuration: Configuration,
    config: Config
  ): (String, Int, Option[Either[MiniCluster, ClusterClient[_]]]) = {
  config.executionMode match {
    case ExecutionMode.LOCAL => // Local mode
      val config = configuration
      config.setInteger(JobManagerOptions.PORT, 0)

      val miniClusterConfig = new MiniClusterConfiguration.Builder()
        .setConfiguration(config)
        .build()
      val cluster = new MiniCluster(miniClusterConfig)
      cluster.start()

      val port = cluster.getRestAddress.getPort
      println(s"\nStarting local Flink cluster (host: localhost, port: $port).\n")
      ("localhost", port, Some(Left(cluster)))

    case ExecutionMode.REMOTE => // Remote mode
      if (config.host.isEmpty || config.port.isEmpty) {
        throw new IllegalArgumentException("<host> or <port> is not specified!")
      }
      (config.host.get, config.port.get, None)

    case ExecutionMode.YARN => // YARN mode
      config.yarnConfig match {
        case Some(yarnConfig) => // if there is information for new cluster
          deployNewYarnCluster(
            configuration,
            config.configDir.getOrElse(CliFrontend.getConfigurationDirectoryFromEnv),
            yarnConfig)
        case None => // there is no information for new cluster. Then we use yarn properties.
          fetchDeployedYarnClusterInfo(
            configuration,
            config.configDir.getOrElse(CliFrontend.getConfigurationDirectoryFromEnv))
      }

    case ExecutionMode.UNDEFINED => // Wrong input
      throw new IllegalArgumentException("please specify execution mode:\n" +
        "[local | remote <host> <port> | yarn]")
  }
}
fetchConnectionInfo pattern-matches on the ExecutionMode:
- LOCAL mode

val cluster = new MiniCluster(miniClusterConfig)
cluster.start()
- REMOTE mode

if (config.host.isEmpty || config.port.isEmpty) {
  throw new IllegalArgumentException("<host> or <port> is not specified!")
}
(config.host.get, config.port.get, None)
Remote mode simply extracts the host and port from the arguments and hands them to the caller.
- YARN mode

config.yarnConfig match {
  case Some(yarnConfig) => // if there is information for new cluster
    deployNewYarnCluster(
      configuration,
      config.configDir.getOrElse(CliFrontend.getConfigurationDirectoryFromEnv),
      yarnConfig)
  case None => // there is no information for new cluster. Then we use yarn properties.
    fetchDeployedYarnClusterInfo(
      configuration,
      config.configDir.getOrElse(CliFrontend.getConfigurationDirectoryFromEnv))
}
If yarnConfig is present, deployNewYarnCluster is called to bring up a new cluster; otherwise the information in the YARN properties file is used and fetchDeployedYarnClusterInfo is called.
deployNewYarnCluster
val frontend = new CliFrontend(
  configuration,
  CliFrontend.loadCustomCommandLines(configuration, configurationDirectory))
val commandOptions = CliFrontendParser.getRunCommandOptions
val commandLineOptions = CliFrontendParser.mergeOptions(commandOptions, frontend.getCustomCommandLineOptions())
val commandLine = CliFrontendParser.parse(commandLineOptions, args.toArray, true)

val customCLI = frontend.getActiveCustomCommandLine(commandLine)
val clusterDescriptor = customCLI.createClusterDescriptor(commandLine)

val clusterSpecification = customCLI.getClusterSpecification(commandLine)
val cluster = clusterDescriptor.deploySessionCluster(clusterSpecification)

val inetSocketAddress = AkkaUtils.getInetSocketAddressFromAkkaURL(
  cluster.getClusterConnectionInfo.getAddress)
val address = inetSocketAddress.getAddress.getHostAddress
val port = inetSocketAddress.getPort

(address, port, Some(Right(cluster)))
fetchDeployedYarnClusterInfo
val commandLine = CliFrontendParser.parse(
  CliFrontendParser.getRunCommandOptions,
  args.toArray,
  true)

val frontend = new CliFrontend(
  configuration,
  CliFrontend.loadCustomCommandLines(configuration, configurationDirectory))
val customCLI = frontend.getActiveCustomCommandLine(commandLine)
val clusterDescriptor = customCLI
  .createClusterDescriptor(commandLine)
  .asInstanceOf[ClusterDescriptor[Any]]

val clusterId = customCLI.getClusterId(commandLine)
val cluster = clusterDescriptor.retrieve(clusterId)

if (cluster == null) {
  throw new RuntimeException("Yarn Cluster could not be retrieved.")
}

val jobManager = AkkaUtils.getInetSocketAddressFromAkkaURL(
  cluster.getClusterConnectionInfo.getAddress)

(jobManager.getHostString, jobManager.getPort, None)
Both methods ultimately obtain the host and port through AkkaUtils.getInetSocketAddressFromAkkaURL. They differ in how the cluster is obtained: deployNewYarnCluster deploys a new session cluster via clusterDescriptor.deploySessionCluster, while fetchDeployedYarnClusterInfo first determines the clusterId and passes it to clusterDescriptor.retrieve() to attach to an already running cluster.
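To make that last step concrete, here is a minimal sketch of recovering a host/port pair from an Akka URL. It uses plain java.net.URI rather than Flink's AkkaUtils, and the sample URL is only an illustration:

import java.net.{InetSocketAddress, URI}

// e.g. "akka.tcp://flink@jm-host:6123/user/jobmanager" -> jm-host:6123
def addressFromAkkaUrl(akkaUrl: String): InetSocketAddress = {
  val uri = new URI(akkaUrl)  // the authority part is "flink@jm-host:6123"
  require(uri.getHost != null && uri.getPort != -1, s"Malformed Akka URL: $akkaUrl")
  new InetSocketAddress(uri.getHost, uri.getPort)
}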
FlinkILoop
Back to the repl mentioned earlier: it is simply an instance of FlinkILoop. FlinkILoop holds the environment objects for both LOCAL mode and REMOTE mode.
LOCAL mode
// local environment
val (scalaBenv: ExecutionEnvironment, scalaSenv: StreamExecutionEnvironment) = {
  val scalaBenv = new ExecutionEnvironment(remoteBenv)
  val scalaSenv = new StreamExecutionEnvironment(remoteSenv)
  (scalaBenv, scalaSenv)
}
Batch env: ExecutionEnvironment. Streaming env: StreamExecutionEnvironment.
Local mode was covered in "Flink Source Code Analysis | Starting from an Example: Understanding the Local Task Execution Flow", so we won't repeat it here.
REMOTE mode
// remote environment
private val (remoteBenv: ScalaShellRemoteEnvironment,
             remoteSenv: ScalaShellRemoteStreamEnvironment) = {
  // allow creation of environments
  ScalaShellRemoteEnvironment.resetContextEnvironments()

  // create our environment that submits against the cluster (local or remote)
  val remoteBenv = new ScalaShellRemoteEnvironment(
    host,
    port,
    this,
    clientConfig,
    this.getExternalJars(): _*)
  val remoteSenv = new ScalaShellRemoteStreamEnvironment(
    host,
    port,
    this,
    clientConfig,
    getExternalJars(): _*)

  // prevent further instantiation of environments
  ScalaShellRemoteEnvironment.disableAllContextAndOtherEnvironments()

  (remoteBenv, remoteSenv)
}
Batch env: ScalaShellRemoteEnvironment. Streaming env: ScalaShellRemoteStreamEnvironment.
Once the interactive shell is running, we write program logic on its command line and trigger it with env.execute().
Code entry point
Let's take the example from the official documentation.
Scala-Flink> val text = benv.fromElements(
  "To be, or not to be,--that is the question:--",
  "Whether 'tis nobler in the mind to suffer",
  "The slings and arrows of outrageous fortune",
  "Or to take arms against a sea of troubles,")
Scala-Flink> val counts = text
  .flatMap { _.toLowerCase.split("\\W+") }
  .map { (_, 1) }.groupBy(0).sum(1)
Scala-Flink> counts.print()
Scala-Flink> benv.execute("MyProgram")
After the shell starts, the environment variables benv and senv are already bound; we simply write code against them and call the execute method.
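The streaming environment senv works the same way. A streaming counterpart of the word count above, adapted from the Flink documentation (note that in streaming, print() only registers a sink; nothing runs until senv.execute() is called):

Scala-Flink> val textStreaming = senv.fromElements(
  "To be, or not to be,--that is the question:--",
  "Whether 'tis nobler in the mind to suffer",
  "The slings and arrows of outrageous fortune",
  "Or to take arms against a sea of troubles,")
Scala-Flink> val countsStreaming = textStreaming
  .flatMap { _.toLowerCase.split("\\W+") }
  .map { (_, 1) }.keyBy(0).sum(1)
Scala-Flink> countsStreaming.print()
Scala-Flink> senv.execute("Streaming Wordcount")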
Now let's look at the internals of ScalaShellRemoteEnvironment and ScalaShellRemoteStreamEnvironment.
ScalaShellRemoteEnvironment extends RemoteEnvironment
@Override
public JobExecutionResult execute(String jobName) throws Exception {
    PlanExecutor executor = getExecutor();
    Plan p = createProgramPlan(jobName);

    // Session management is disabled, revert this commit to enable
    //p.setJobId(jobID);
    //p.setSessionTimeout(sessionTimeout);

    JobExecutionResult result = executor.executePlan(p);
    this.lastJobExecutionResult = result;
    return result;
}
- Obtain the PlanExecutor executor object via the getExecutor method.
- Create the program Plan via createProgramPlan.
- Execute the plan via executor.executePlan and return the result (see the sketch below).
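A minimal sketch of these three steps driven by hand outside the shell, assuming the Flink 1.x-era PlanExecutor API (removed in later Flink versions); the host, port, and jar URL are placeholders:

import java.net.URL
import java.util.Collections
import org.apache.flink.api.common.{Plan, PlanExecutor}
import org.apache.flink.configuration.Configuration

// Ship an already-built Plan to a remote cluster and wait for the result.
def runPlanRemotely(plan: Plan, host: String, port: Int, shellJar: URL): Unit = {
  val executor = PlanExecutor.createRemoteExecutor(
    host, port, new Configuration(),
    Collections.singletonList(shellJar),  // jar with the REPL-compiled user classes
    Collections.emptyList[URL]())         // no extra classpath entries
  val result = executor.executePlan(plan) // blocks until the job finishes
  println(s"Job finished in ${result.getNetRuntime} ms")
}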
ScalaShellRemoteStreamEnvironment extends RemoteStreamEnvironment
@Override
public JobExecutionResult execute(String jobName) throws ProgramInvocationException {
    StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    transformations.clear();
    return executeRemotely(streamGraph, jarFiles);
}
RemoteStreamEnvironment.execute() first obtains the StreamGraph, then calls executeRemotely, which the subclass ScalaShellRemoteStreamEnvironment overrides.
ScalaShellRemoteStreamEnvironment.executeRemotely
protected JobExecutionResult executeRemotely(StreamGraph streamGraph, List<URL> jarFiles) throws ProgramInvocationException {
    URL jarUrl;
    try {
        jarUrl = flinkILoop.writeFilesToDisk().getAbsoluteFile().toURI().toURL();
    } catch (MalformedURLException e) {
        throw new ProgramInvocationException("Could not write the user code classes to disk.",
            streamGraph.getJobGraph().getJobID(), e);
    }

    List<URL> allJarFiles = new ArrayList<>(jarFiles.size() + 1);
    allJarFiles.addAll(jarFiles);
    allJarFiles.add(jarUrl);

    return super.executeRemotely(streamGraph, allJarFiles);
}
After writing the REPL-compiled classes to disk and collecting the URL of the resulting jar together with the URLs of the additional jars, it calls the executeRemotely method of the parent class RemoteStreamEnvironment.
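For intuition, here is a small, hypothetical sketch of what "write the REPL-compiled classes to disk as a jar" amounts to; the real logic lives in FlinkILoop.writeFilesToDisk, and the directory and file names below are made up:

import java.io.FileOutputStream
import java.nio.file.{Files, Path}
import java.util.jar.{JarEntry, JarOutputStream}

// Pack every file under classDir into a jar, preserving relative paths as entry names.
def packClassesToJar(classDir: Path, jarFile: Path): Unit = {
  val out = new JarOutputStream(new FileOutputStream(jarFile.toFile))
  try {
    Files.walk(classDir)
      .filter(p => Files.isRegularFile(p))
      .forEach { file =>
        val entryName = classDir.relativize(file).toString.replace('\\', '/')
        out.putNextEntry(new JarEntry(entryName))
        Files.copy(file, out)   // stream the class file bytes into the jar
        out.closeEntry()
      }
  } finally out.close()
}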
RemoteStreamEnvironment.executeRemotely
final ClusterClient<?> client;
try {
    client = new RestClusterClient<>(configuration, "RemoteStreamEnvironment");
}
catch (Exception e) {
    throw new ProgramInvocationException("Cannot establish connection to JobManager: " + e.getMessage(),
        streamGraph.getJobGraph().getJobID(), e);
}

client.setPrintStatusDuringExecution(getConfig().isSysoutLoggingEnabled());

try {
    return client.run(streamGraph, jarFiles, globalClasspaths, usercodeClassLoader).getJobExecutionResult();
}
catch (ProgramInvocationException e) {
    throw e;
}
It mainly does two things:
- Obtain a ClusterClient; here it is a RestClusterClient.
- Call RestClusterClient.run() to run the job and obtain the JobExecutionResult.
The remaining flow is similar to the flows covered in the earlier articles; the main difference is that the final job submission goes through RestClusterClient's submitJob. We won't dig deeper here; try tracing it yourself, since the broad strokes match what was covered before.
Related articles
Flink Source Code Analysis | Starting from an Example: Understanding the Local Task Execution Flow
Flink Source Code Analysis | Starting from an Example: Understanding the Cluster Task Execution Flow
Flink Source Code Analysis | Starting from an Example: Understanding the Flink on YARN Task Execution Flow
