Big Data Development with Spark in Practice (Part 30)

1. WordCount Program

  1. Read all of the contents of a file and count how many times each word appears.
1.1 Creating the Scala Project
  1. Create a regular IntelliJ IDEA Maven Java project.

  2. Once the project is created, open Project Structure.

  3. Add the Scala dependency.

    (Screenshot: adding the dependency)

  4. The global Scala dependency is now in place.

    (Screenshot: Scala global dependency)

  5. Create a scala directory, then right-click it and mark it as Sources Root.

  6. Add the Spark dependency (note: the Scala version of the jars you depend on locally must match the Scala version the Spark artifact was built against).

    e.g. I am depending on Scala 2.12 here, so the jar must also be the spark-core_2.12 build (a fuller pom sketch follows the snippet below).

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.1.2</version>
    </dependency>
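
    As a minimal sketch of that version pairing (the scala-library version shown is only an example of a 2.12.x release; use whichever 2.12.x version your Scala SDK actually provides), the dependencies block could look like this:

    <dependencies>
      <!-- Scala standard library: the 2.12.x line must match the _2.12 suffix below -->
      <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.15</version>
      </dependency>
      <!-- Spark core built against Scala 2.12 -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.1.2</version>
      </dependency>
    </dependencies>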
    
1.2 Writing the Scala WordCount Code
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author xys
 * @version WordCountScala.scala, 2022-11-13
 */
object WordCountScala {
  def main(args: Array[String]): Unit = {
    //1. Configure the SparkContext
    val conf = new SparkConf()
    //Set the application name and run the job locally
    conf.setAppName("WordCountScala").setMaster("local")
    val context = new SparkContext(conf)

    //2. Load the data. File contents:
    //hello your
    //hello me
    val linesRDD = context.textFile("/Users/strivelearn/Desktop/spark.txt")

    //3. Split each line into individual words
    val wordsRDD = linesRDD.flatMap(_.split(" "))

    //4. Map each word to a (word, 1) pair
    val pairRDD = wordsRDD.map(word => (word, 1))

    //5. Group and aggregate by key (i.e. by word)
    val wordCountRDD = pairRDD.reduceByKey((x, y) => x + y)

    //6. Print the results to the console
    wordCountRDD.foreach(wordCount => println(wordCount._1 + "--" + wordCount._2))

    //7. Stop the SparkContext
    context.stop()
  }
}

Execution result:

WordCountScala
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/11/13 23:47:32 INFO SparkContext: Running Spark version 3.1.2
22/11/13 23:47:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/13 23:47:33 INFO ResourceUtils: ==============================================================
22/11/13 23:47:33 INFO ResourceUtils: No custom resources configured for spark.driver.
22/11/13 23:47:33 INFO ResourceUtils: ==============================================================
22/11/13 23:47:33 INFO SparkContext: Submitted application: WordCountScala
22/11/13 23:47:33 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/11/13 23:47:33 INFO ResourceProfile: Limiting resource is cpu
22/11/13 23:47:33 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/11/13 23:47:33 INFO SecurityManager: Changing view acls to: strivelearn
22/11/13 23:47:33 INFO SecurityManager: Changing modify acls to: strivelearn
22/11/13 23:47:33 INFO SecurityManager: Changing view acls groups to: 
22/11/13 23:47:33 INFO SecurityManager: Changing modify acls groups to: 
22/11/13 23:47:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(strivelearn); groups with view permissions: Set(); users  with modify permissions: Set(strivelearn); groups with modify permissions: Set()
22/11/13 23:47:33 INFO Utils: Successfully started service 'sparkDriver' on port 53653.
22/11/13 23:47:33 INFO SparkEnv: Registering MapOutputTracker
22/11/13 23:47:33 INFO SparkEnv: Registering BlockManagerMaster
22/11/13 23:47:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/11/13 23:47:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/11/13 23:47:33 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/13 23:47:33 INFO DiskBlockManager: Created local directory at /private/var/folders/nn/61tcd7zs6qbb72zlwpkv2qkw0000gn/T/blockmgr-e09d0fc0-f621-4efb-864d-42ea9eed15e3
22/11/13 23:47:33 INFO MemoryStore: MemoryStore started with capacity 4.1 GiB
22/11/13 23:47:33 INFO SparkEnv: Registering OutputCommitCoordinator
22/11/13 23:47:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/11/13 23:47:34 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.109:4040
22/11/13 23:47:34 INFO Executor: Starting executor ID driver on host 192.168.0.109
22/11/13 23:47:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53654.
22/11/13 23:47:34 INFO NettyBlockTransferService: Server created on 192.168.0.109:53654
22/11/13 23:47:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/11/13 23:47:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.109:53654 with 4.1 GiB RAM, BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 293.9 KiB, free 4.1 GiB)
22/11/13 23:47:34 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.0 KiB, free 4.1 GiB)
22/11/13 23:47:34 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.109:53654 (size: 27.0 KiB, free: 4.1 GiB)
22/11/13 23:47:34 INFO SparkContext: Created broadcast 0 from textFile at WordCountScala.scala:18
22/11/13 23:47:35 INFO FileInputFormat: Total input files to process : 1
22/11/13 23:47:35 INFO SparkContext: Starting job: foreach at WordCountScala.scala:30
22/11/13 23:47:35 INFO DAGScheduler: Registering RDD 3 (map at WordCountScala.scala:24) as input to shuffle 0
22/11/13 23:47:35 INFO DAGScheduler: Got job 0 (foreach at WordCountScala.scala:30) with 1 output partitions
22/11/13 23:47:35 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCountScala.scala:30)
22/11/13 23:47:35 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
22/11/13 23:47:35 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
22/11/13 23:47:35 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountScala.scala:24), which has no missing parents
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.8 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.0 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.109:53654 (size: 4.0 KiB, free: 4.1 GiB)
22/11/13 23:47:35 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1388
22/11/13 23:47:35 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountScala.scala:24) (first 15 tasks are for partitions Vector(0))
22/11/13 23:47:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
22/11/13 23:47:35 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.0.109, executor driver, partition 0, PROCESS_LOCAL, 4499 bytes) taskResourceAssignments Map()
22/11/13 23:47:35 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/11/13 23:47:35 INFO HadoopRDD: Input split: file:/Users/strivelearn/Desktop/spark.txt:0+19
22/11/13 23:47:35 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1248 bytes result sent to driver
22/11/13 23:47:35 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 403 ms on 192.168.0.109 (executor driver) (1/1)
22/11/13 23:47:35 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
22/11/13 23:47:35 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCountScala.scala:24) finished in 0.518 s
22/11/13 23:47:35 INFO DAGScheduler: looking for newly runnable stages
22/11/13 23:47:35 INFO DAGScheduler: running: Set()
22/11/13 23:47:35 INFO DAGScheduler: waiting: Set(ResultStage 1)
22/11/13 23:47:35 INFO DAGScheduler: failed: Set()
22/11/13 23:47:35 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountScala.scala:27), which has no missing parents
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.4 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.6 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.109:53654 (size: 2.6 KiB, free: 4.1 GiB)
22/11/13 23:47:35 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1388
22/11/13 23:47:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountScala.scala:27) (first 15 tasks are for partitions Vector(0))
22/11/13 23:47:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks resource profile 0
22/11/13 23:47:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (192.168.0.109, executor driver, partition 0, NODE_LOCAL, 4271 bytes) taskResourceAssignments Map()
22/11/13 23:47:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
22/11/13 23:47:35 INFO ShuffleBlockFetcherIterator: Getting 1 (66.0 B) non-empty blocks including 1 (66.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
22/11/13 23:47:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
22/11/13 23:47:35 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.0.109:53654 in memory (size: 4.0 KiB, free: 4.1 GiB)
22/11/13 23:47:35 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1310 bytes result sent to driver
22/11/13 23:47:35 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 137 ms on 192.168.0.109 (executor driver) (1/1)
22/11/13 23:47:35 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
22/11/13 23:47:35 INFO DAGScheduler: ResultStage 1 (foreach at WordCountScala.scala:30) finished in 0.154 s
22/11/13 23:47:35 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/11/13 23:47:35 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/11/13 23:47:35 INFO DAGScheduler: Job 0 finished: foreach at WordCountScala.scala:30, took 0.892652 s
hello--2
your--1
me--1
22/11/13 23:47:35 INFO SparkUI: Stopped Spark web UI at http://192.168.0.109:4040
22/11/13 23:47:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/11/13 23:47:36 INFO MemoryStore: MemoryStore cleared
22/11/13 23:47:36 INFO BlockManager: BlockManager stopped
22/11/13 23:47:36 INFO BlockManagerMaster: BlockManagerMaster stopped
22/11/13 23:47:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/11/13 23:47:36 INFO SparkContext: Successfully stopped SparkContext
22/11/13 23:47:36 INFO ShutdownHookManager: Shutdown hook called
22/11/13 23:47:36 INFO ShutdownHookManager: Deleting directory /private/var/folders/nn/61tcd7zs6qbb72zlwpkv2qkw0000gn/T/spark-ca60e48f-d332-4193-a336-23f5868a4aa1

Process finished with exit code 0
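
A note on step 6: foreach runs inside the tasks, so the println output appears wherever the task executes, which is fine in local mode but not on a cluster. A small optional variation, sketched under the assumption that the output directory /Users/strivelearn/Desktop/wordcount_out does not exist yet, sorts the counts in descending order, prints them on the driver, and also writes them to disk; it would replace step 6 above:

    //6a. Sort by count in descending order (sortBy introduces an extra shuffle)
    val sortedRDD = wordCountRDD.sortBy(_._2, ascending = false)

    //6b. Collect the (small) result to the driver and print it there
    sortedRDD.collect().foreach { case (word, count) => println(word + "--" + count) }

    //6c. Optionally persist the result as text files; the target directory must not already exist
    sortedRDD.saveAsTextFile("/Users/strivelearn/Desktop/wordcount_out")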

1.3 Implementing It in Java
package com.strivelearn.java;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

/**
 * @author xys
 * @version WordCountJava.java, 2022-11-13
 */
public class WordCountJava {
    public static void main(String[] args) {
        //1. Create the JavaSparkContext
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("WordCountJava");
        sparkConf.setMaster("local");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

        //2. Load the data
        JavaRDD<String> linesRDD = javaSparkContext.textFile("/Users/strivelearn/Desktop/spark.txt");
        //3. Split each line into individual words
        //Note the generics of FlatMapFunction: the first type parameter is the input type, the second is the output type
        JavaRDD<String> wordsRDD = linesRDD.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" "))
                             .iterator();
            }
        });
        //4. Map each word to a (word, 1) pair
        //Note the generics of PairFunction: the first type parameter is the input type,
        //the second is the type of the first element of the output tuple, and the third is the type of its second element
        //Whenever a ...ByKey operation follows, the preceding step must use mapToPair
        JavaPairRDD<String, Integer> pairRDD = wordsRDD.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });
        //5. Group and aggregate by key
        JavaPairRDD<String, Integer> wordCountRDD = pairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) throws Exception {
                return i1 + i2;
            }
        });
        //6. Print the results
        wordCountRDD.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> tup) throws Exception {
                System.out.println(tup._1 + "--" + tup._2);
            }
        });
        //7. Stop the context
        javaSparkContext.stop();
    }
}

2. Spark Job Submission Methods

  1. Run the job directly in IDEA, which is convenient for debugging code locally.

  2. Use spark-submit to submit the job to the cluster [this is what is used in real work]; see the Scala sketch after this list for the changes needed before submitting your own jar.

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /root/software/spark-3.3.1-bin-hadoop3/examples/jars/spark-examples_2.12-3.3.1.jar 2

  3. Use spark-shell, which is convenient for debugging code against the cluster environment.

    Start spark-shell in YARN mode with the following command:

    spark-shell --master yarn --deploy-mode client
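
Before packaging the WordCount job for spark-submit (option 2 above), the hard-coded setMaster("local") and the local file path have to go, because the master and the input location are supplied at submit time. A minimal cluster-ready sketch (the object name, jar name, and HDFS input path below are assumptions for illustration, not taken from this setup):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountCluster {
  def main(args: Array[String]): Unit = {
    //No setMaster here; the master is chosen on the spark-submit command line instead
    val conf = new SparkConf().setAppName("WordCountCluster")
    val context = new SparkContext(conf)

    //Take the input path from the arguments, e.g. a file that has been uploaded to HDFS
    val inputPath = args(0)

    context.textFile(inputPath)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreach(t => println(t._1 + "--" + t._2))

    context.stop()
  }
}

It would then be submitted roughly like this (jar name and input path assumed):

spark-submit --class WordCountCluster --master yarn --deploy-mode cluster wordcount.jar hdfs://bigdata01:9000/input/spark.txt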

3. Enabling the Spark History Server

  1. Go to Spark's conf directory.

    cd /root/software/spark-3.3.1-bin-hadoop3/conf

  2. Rename spark-defaults.conf.template to spark-defaults.conf.

    mv spark-defaults.conf.template spark-defaults.conf

  3. Edit spark-defaults.conf.

    First create the HDFS directory: hdfs dfs -mkdir -p /tmp/logs/root/logs

    spark.eventLog.enabled=true

    spark.eventLog.compress=true

    spark.eventLog.dir=hdfs://bigdata01:9000/tmp/logs/root/logs

    spark.history.fs.logDirectory=hdfs://bigdata01:9000/tmp/logs/root/logs

    spark.yarn.historyServer.address=bigdata01:18080

  4. Edit spark-env.sh.

    Add one more configuration line:

    export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://bigdata01:9000/tmp/logs/root/logs"

  5. Start the history server with the script under Spark's sbin directory:

    sbin/start-history-server.sh

  6. Verify with jps (a HistoryServer process should be listed).

With this in place, once a submitted job finishes, its history can be reached from the YARN web UI:

http://192.168.234.100:8088/cluster/apps

Note that the Hadoop MapReduce JobHistory web service also needs to be enabled; it is configured in mapred-site.xml under Hadoop's etc/hadoop directory:

<!-- JobHistory server web UI address -->
<property> 
    <name>mapreduce.jobhistory.webapp.address</name> 
    <value>bigdata01:19888</value> 
</property>

Start the JobHistory server:

mapred --daemon start historyserver

master:9870   # HDFS (NameNode web UI)
master:8088   # YARN (ResourceManager web UI)
master:19888  # JobHistory server web UI

In the end, the logs of past runs can be viewed:

http://bigdata01:19888/jobhistory/logs//192.168.234.100:34905/container_1668841833138_0001_01_000002/container_1668841833138_0001_01_000002/root/stdout?start=-4096
