1. WordCount Program
- Read the entire contents of a file and count how many times each word appears.
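For example, given a file containing the two lines "hello your" and "hello me", the expected output is hello--2, your--1, me--1, which is exactly what the run at the end of section 1.2 produces.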
1.1 Create a Scala Project
- Create a regular IDEA Maven Java project.
- After it is created, open Project Structure and add the Scala SDK as a global dependency.
- Create a scala source directory and right-click it to mark it as Sources Root.
- Add the Spark dependency (note: the Scala version of your local dependency jar must match the Scala version the Spark artifact was built for; see the version-check sketch after this list). For example, I am using Scala 2.12 here, so the dependency must also be spark-core_2.12:

  <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.1.2</version>
  </dependency>
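To double-check that the Scala version on your classpath really matches the _2.12 suffix of the Spark artifact, you can run a small sanity check like the one below (a minimal sketch; the object name is mine, not part of the original setup):

import org.apache.spark.{SparkConf, SparkContext}

object VersionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VersionCheck").setMaster("local"))
    // Should print something like "version 2.12.x" and "3.1.2" respectively
    println("Scala: " + scala.util.Properties.versionString)
    println("Spark: " + sc.version)
    sc.stop()
  }
}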
1.2 Write the Scala WordCount Code
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author xys
 * @version WordCountScala.java, 2022-11-13
 */
object WordCountScala {
  def main(args: Array[String]): Unit = {
    // 1. Configure the SparkContext
    val conf = new SparkConf()
    // Set the application name and run the job locally
    conf.setAppName("WordCountScala").setMaster("local")
    val context = new SparkContext(conf)
    // 2. Load the data. File contents:
    // hello your
    // hello me
    val linesRDD = context.textFile("/Users/strivelearn/Desktop/spark.txt")
    // 3. Split each line into individual words
    val wordsRDD = linesRDD.flatMap(_.split(" "))
    // 4. Map each word to the form (word, 1)
    val pairRDD = wordsRDD.map(word => (word, 1))
    // 5. Group by key (i.e. the word) and aggregate the counts
    val wordCountRDD = pairRDD.reduceByKey((x, y) => x + y)
    // 6. Print the result to the console
    wordCountRDD.foreach(wordCount => println(wordCount._1 + "--" + wordCount._2))
    // 7. Stop the SparkContext
    context.stop()
  }
}
Execution result:
WordCountScala
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/11/13 23:47:32 INFO SparkContext: Running Spark version 3.1.2
22/11/13 23:47:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/13 23:47:33 INFO ResourceUtils: ==============================================================
22/11/13 23:47:33 INFO ResourceUtils: No custom resources configured for spark.driver.
22/11/13 23:47:33 INFO ResourceUtils: ==============================================================
22/11/13 23:47:33 INFO SparkContext: Submitted application: WordCountScala
22/11/13 23:47:33 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/11/13 23:47:33 INFO ResourceProfile: Limiting resource is cpu
22/11/13 23:47:33 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/11/13 23:47:33 INFO SecurityManager: Changing view acls to: strivelearn
22/11/13 23:47:33 INFO SecurityManager: Changing modify acls to: strivelearn
22/11/13 23:47:33 INFO SecurityManager: Changing view acls groups to:
22/11/13 23:47:33 INFO SecurityManager: Changing modify acls groups to:
22/11/13 23:47:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(strivelearn); groups with view permissions: Set(); users with modify permissions: Set(strivelearn); groups with modify permissions: Set()
22/11/13 23:47:33 INFO Utils: Successfully started service 'sparkDriver' on port 53653.
22/11/13 23:47:33 INFO SparkEnv: Registering MapOutputTracker
22/11/13 23:47:33 INFO SparkEnv: Registering BlockManagerMaster
22/11/13 23:47:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/11/13 23:47:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/11/13 23:47:33 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/13 23:47:33 INFO DiskBlockManager: Created local directory at /private/var/folders/nn/61tcd7zs6qbb72zlwpkv2qkw0000gn/T/blockmgr-e09d0fc0-f621-4efb-864d-42ea9eed15e3
22/11/13 23:47:33 INFO MemoryStore: MemoryStore started with capacity 4.1 GiB
22/11/13 23:47:33 INFO SparkEnv: Registering OutputCommitCoordinator
22/11/13 23:47:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/11/13 23:47:34 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.109:4040
22/11/13 23:47:34 INFO Executor: Starting executor ID driver on host 192.168.0.109
22/11/13 23:47:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53654.
22/11/13 23:47:34 INFO NettyBlockTransferService: Server created on 192.168.0.109:53654
22/11/13 23:47:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/11/13 23:47:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.109:53654 with 4.1 GiB RAM, BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.109, 53654, None)
22/11/13 23:47:34 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 293.9 KiB, free 4.1 GiB)
22/11/13 23:47:34 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.0 KiB, free 4.1 GiB)
22/11/13 23:47:34 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.109:53654 (size: 27.0 KiB, free: 4.1 GiB)
22/11/13 23:47:34 INFO SparkContext: Created broadcast 0 from textFile at WordCountScala.scala:18
22/11/13 23:47:35 INFO FileInputFormat: Total input files to process : 1
22/11/13 23:47:35 INFO SparkContext: Starting job: foreach at WordCountScala.scala:30
22/11/13 23:47:35 INFO DAGScheduler: Registering RDD 3 (map at WordCountScala.scala:24) as input to shuffle 0
22/11/13 23:47:35 INFO DAGScheduler: Got job 0 (foreach at WordCountScala.scala:30) with 1 output partitions
22/11/13 23:47:35 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCountScala.scala:30)
22/11/13 23:47:35 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
22/11/13 23:47:35 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
22/11/13 23:47:35 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountScala.scala:24), which has no missing parents
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.8 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.0 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.109:53654 (size: 4.0 KiB, free: 4.1 GiB)
22/11/13 23:47:35 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1388
22/11/13 23:47:35 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountScala.scala:24) (first 15 tasks are for partitions Vector(0))
22/11/13 23:47:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
22/11/13 23:47:35 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.0.109, executor driver, partition 0, PROCESS_LOCAL, 4499 bytes) taskResourceAssignments Map()
22/11/13 23:47:35 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/11/13 23:47:35 INFO HadoopRDD: Input split: file:/Users/strivelearn/Desktop/spark.txt:0+19
22/11/13 23:47:35 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1248 bytes result sent to driver
22/11/13 23:47:35 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 403 ms on 192.168.0.109 (executor driver) (1/1)
22/11/13 23:47:35 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/11/13 23:47:35 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCountScala.scala:24) finished in 0.518 s
22/11/13 23:47:35 INFO DAGScheduler: looking for newly runnable stages
22/11/13 23:47:35 INFO DAGScheduler: running: Set()
22/11/13 23:47:35 INFO DAGScheduler: waiting: Set(ResultStage 1)
22/11/13 23:47:35 INFO DAGScheduler: failed: Set()
22/11/13 23:47:35 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountScala.scala:27), which has no missing parents
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.4 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.6 KiB, free 4.1 GiB)
22/11/13 23:47:35 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.109:53654 (size: 2.6 KiB, free: 4.1 GiB)
22/11/13 23:47:35 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1388
22/11/13 23:47:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountScala.scala:27) (first 15 tasks are for partitions Vector(0))
22/11/13 23:47:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks resource profile 0
22/11/13 23:47:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (192.168.0.109, executor driver, partition 0, NODE_LOCAL, 4271 bytes) taskResourceAssignments Map()
22/11/13 23:47:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
22/11/13 23:47:35 INFO ShuffleBlockFetcherIterator: Getting 1 (66.0 B) non-empty blocks including 1 (66.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
22/11/13 23:47:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
22/11/13 23:47:35 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.0.109:53654 in memory (size: 4.0 KiB, free: 4.1 GiB)
22/11/13 23:47:35 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1310 bytes result sent to driver
22/11/13 23:47:35 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 137 ms on 192.168.0.109 (executor driver) (1/1)
22/11/13 23:47:35 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/11/13 23:47:35 INFO DAGScheduler: ResultStage 1 (foreach at WordCountScala.scala:30) finished in 0.154 s
22/11/13 23:47:35 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/11/13 23:47:35 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
22/11/13 23:47:35 INFO DAGScheduler: Job 0 finished: foreach at WordCountScala.scala:30, took 0.892652 s
hello--2
your--1
me--1
22/11/13 23:47:35 INFO SparkUI: Stopped Spark web UI at http://192.168.0.109:4040
22/11/13 23:47:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/11/13 23:47:36 INFO MemoryStore: MemoryStore cleared
22/11/13 23:47:36 INFO BlockManager: BlockManager stopped
22/11/13 23:47:36 INFO BlockManagerMaster: BlockManagerMaster stopped
22/11/13 23:47:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/11/13 23:47:36 INFO SparkContext: Successfully stopped SparkContext
22/11/13 23:47:36 INFO ShutdownHookManager: Shutdown hook called
22/11/13 23:47:36 INFO ShutdownHookManager: Deleting directory /private/var/folders/nn/61tcd7zs6qbb72zlwpkv2qkw0000gn/T/spark-ca60e48f-d332-4193-a336-23f5868a4aa1
Process finished with exit code 0
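For reference, the same pipeline can also be written more compactly by chaining the transformations. This sketch is equivalent to the program above (same input path, only the object name is new):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountScalaCompact {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountScalaCompact").setMaster("local"))
    sc.textFile("/Users/strivelearn/Desktop/spark.txt")
      .flatMap(_.split(" "))           // split lines into words
      .map(word => (word, 1))          // pair each word with 1
      .reduceByKey(_ + _)              // sum the counts per word
      .foreach(wc => println(wc._1 + "--" + wc._2))
    sc.stop()
  }
}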
1.3 Java Implementation
package com.strivelearn.java;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

/**
 * @author xys
 * @version WordCountJava.java, 2022-11-13
 */
public class WordCountJava {
    public static void main(String[] args) {
        // 1. Create the JavaSparkContext
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("WordCountJava");
        sparkConf.setMaster("local");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        // 2. Load the data
        JavaRDD<String> linesRDD = javaSparkContext.textFile("/Users/strivelearn/Desktop/spark.txt");
        // 3. Split each line into individual words
        // Note the generics of FlatMapFunction: the first parameter is the input type, the second is the output type
        JavaRDD<String> wordsRDD = linesRDD.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });
        // 4. Map each word to the form (word, 1)
        // Note the generics of PairFunction: the first parameter is the input type,
        // the second is the type of the first element of the output tuple,
        // and the third is the type of its second element.
        // Whenever a ...ByKey operation is needed later, the data must first go through mapToPair.
        JavaPairRDD<String, Integer> pairRDD = wordsRDD.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });
        // 5. Group by key and aggregate the counts
        JavaPairRDD<String, Integer> wordCountRDD = pairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) throws Exception {
                return i1 + i2;
            }
        });
        // 6. Print the result
        wordCountRDD.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> tup) throws Exception {
                System.out.println(tup._1 + "--" + tup._2);
            }
        });
        // 7. Stop the context
        javaSparkContext.stop();
    }
}
2. Spark Job Submission Methods
- Run directly in IDEA, which is convenient for debugging code in the local environment.
- Submit to the cluster with spark-submit [this is what is used in real work]; see the sketch after this list for adapting the WordCount program for cluster submission.

  ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /root/software/spark-3.3.1-bin-hadoop3/examples/jars/spark-examples_2.12-3.3.1.jar 2

- Use spark-shell, which is convenient for debugging code interactively on the cluster.
  To start spark-shell in YARN mode, run:

  spark-shell --master yarn --deploy-mode client
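When submitting to the cluster, the WordCount program above needs two small adjustments: do not hard-code setMaster("local") (let --master decide), and take the input path from the command line instead of a local file. A minimal sketch of such a cluster-friendly version (the object name is a placeholder I chose for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountScalaCluster {
  def main(args: Array[String]): Unit = {
    // The master is NOT set here; it is supplied by spark-submit via --master
    val conf = new SparkConf().setAppName("WordCountScalaCluster")
    val context = new SparkContext(conf)
    // The input path (e.g. an HDFS path) is passed as the first program argument
    context.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .foreach(wc => println(wc._1 + "--" + wc._2))
    context.stop()
  }
}

It could then be packaged and submitted along the lines of the SparkPi example above, e.g. spark-submit --class WordCountScalaCluster --master yarn --deploy-mode cluster wordcount.jar hdfs://bigdata01:9000/path/to/spark.txt (the jar name and HDFS path here are assumptions, not from the original article).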
3. Enabling the Spark History Server
- Go to Spark's conf directory:

  cd /root/software/spark-3.3.1-bin-hadoop3/conf

- Rename spark-defaults.conf.template:

  mv spark-defaults.conf.template spark-defaults.conf

- Edit spark-defaults.conf.
  First create the HDFS directory: hdfs dfs -mkdir -p /tmp/logs/root/logs

  spark.eventLog.enabled=true
  spark.eventLog.compress=true
  spark.eventLog.dir=hdfs://bigdata01:9000/tmp/logs/root/logs
  spark.history.fs.logDirectory=hdfs://bigdata01:9000/tmp/logs/root/logs
  spark.yarn.historyServer.address=bigdata01:18080
- Edit spark-env.sh and add one line of configuration:

  export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://bigdata01:9000/tmp/logs/root/logs"

- Start the history server with the script in Spark's sbin directory:

  sbin/start-history-server.sh
- Check the process with jps.
  Now, once a submitted job completes, its history can be reached from the YARN web UI:
  http://192.168.234.100:8088/cluster/apps
  Note that this also requires the MapReduce JobHistory web service to be running; it is configured in mapred-site.xml under Hadoop's etc/hadoop directory:

  <!-- JobHistory server web UI address -->
  <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>bigdata01:19888</value>
  </property>

  Start the JobHistory server:

  mapred --daemon start historyserver

  Useful web UIs:
  master:9870   # HDFS
  master:8088   # YARN
  master:19888  # JobHistory (job) UI

Finally, you can see the logs of historical job runs:
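As a side note, the same event-log settings from spark-defaults.conf can also be set per application in code. A hedged sketch (the HDFS path reuses the one configured above; the object name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountWithEventLog {
  def main(args: Array[String]): Unit = {
    // Equivalent to the spark-defaults.conf entries above, but scoped to this application only
    val conf = new SparkConf()
      .setAppName("WordCountWithEventLog")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.compress", "true")
      .set("spark.eventLog.dir", "hdfs://bigdata01:9000/tmp/logs/root/logs")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}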