Spark Streaming (Part 1)



Offline (batch):
    MapReduce
    Hive
    Spark Core
    Spark SQL
Real-time (streaming):
    Storm
    Spark Streaming (micro-batch)
    Flink (true streaming)

Plus other components for data collection, scheduling, and so on ...

Offline (batch):

like an elevator: it gathers a full load and moves it in one trip.

Real-time:

like an escalator: it carries data continuously, piece by piece, as it arrives.

1. Spark Streaming Concepts

1.1 Introduction to Real-Time Processing

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

[Figure: Spark Streaming architecture (streaming-arch.png)]
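As a taste of the API described above, here is a minimal sketch of a windowed word count that is pushed out to the filesystem. It is only an illustration: the socket host/port, the output path, the window sizes, and the application name are placeholder assumptions, not part of the original article.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: host, port, output path and window sizes are placeholder assumptions.
val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCountSketch")
val ssc = new StreamingContext(conf, Seconds(2))

val lines = ssc.socketTextStream("localhost", 9999)                 // ingest from a TCP socket
val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)) // 30s window, sliding every 10s
counts.saveAsTextFiles("/tmp/streaming-wordcounts")                 // push each batch's result to the filesystem

ssc.start()
ssc.awaitTermination()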

1.2 Spark Streaming Program Entry Point

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
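Two details of this snippet are worth spelling out: Seconds(1) is the batch interval, i.e. how often incoming data is grouped into a micro-batch, and local[2] matters because a receiver-based source (such as a socket) occupies one thread, so at least two local threads are needed for the job to both receive and process data.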

1.3 Introduction to DStreams

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval, as shown in the following figure.

[Figure: a DStream as a continuous series of RDDs, one per interval (streaming-dstream.png)]
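The paragraph below refers to "the earlier example of converting a stream of lines to words", which comes from the Spark documentation's quick start. For context, a minimal sketch of that step (host and port are placeholders, and ssc is the StreamingContext from 1.2):

// Sketch: ssc is the StreamingContext created in 1.2; host/port are placeholders.
val lines = ssc.socketTextStream("localhost", 9999)  // the lines DStream
val words = lines.flatMap(_.split(" "))              // the words DStream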

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.

[Figure: DStream operations translated to operations on the underlying RDDs (streaming-dstream-ops.png)]
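To make the DStream-to-RDD mapping concrete, foreachRDD exposes the RDD behind each batch. A small sketch, reusing the hypothetical words DStream from the snippet above:

// Each batch interval produces one RDD; foreachRDD lets you work with it directly.
words.foreachRDD { (rdd, batchTime) =>
  println(s"Batch at $batchTime contains ${rdd.count()} words")
}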

2. Spark Streaming Quick Start

2.1 pom.xml Configuration

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.1.2</version>
    <scope>provided</scope>
</dependency>

Error encountered:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/StreamingContext
	at com.aa.sparkscala.streaming.StreamingWordCountScala$.main(StreamingWordCountScala.scala:17)
	at com.aa.sparkscala.streaming.StreamingWordCountScala.main(StreamingWordCountScala.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.StreamingContext
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 2 more


Solution:

The problem lies in the <scope>provided</scope> line: provided means the dependency is only available at compile and test time, and the jar is expected to be supplied by the runtime environment. When the program is run directly, nothing supplies the Spark Streaming jar, hence the error. The fix is simply to remove this line from pom.xml.

For reference, the available dependency scopes are listed below:

compile: the default; the jar is visible in all phases (compile, test, and run) and is packaged with the project when it is released.
provided: available at compile and test time; at runtime the jar is expected to be provided by the server/container.
runtime: needed at test and run time, not for compilation.
test: used only during testing; it has no effect at compile or run time and is not included when the project is released.
system: not resolved from a Maven repository; you must explicitly specify the path to the jar, which makes the project less portable.

So the correct pom configuration is as follows:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.1.2</version>
    <!--<scope>provided</scope>-->
</dependency>
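As a side note (general Maven/Spark practice, not part of the original fix): provided is still the right choice when the job is submitted to a cluster with spark-submit, because the cluster supplies the Spark jars; it only breaks local IDE runs. A common compromise is to keep provided and switch it off for local runs, for example via a Maven profile.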

2.2 Scala Version

package com.aa.sparkscala.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Author AA
 * @Date 2021/12/8 15:59
 * @Project bigdatapre
 * @Package com.aa.sparkscala.streaming
 */
object StreamingWordCountScala {
  def main(args: Array[String]): Unit = {
    //1. Initialize the entry point
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCountScala")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))

    //2. Get the data stream, i.e., the data source
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.22.138", 9999)

    //3. Process the data
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val wordRes: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)

    //4. Output the results
    wordRes.print()

    //5. Start the job
    ssc.start()
    ssc.awaitTermination()  // wait for the job to terminate
    ssc.stop()
  }
}
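To try it out, start a socket server on the target host first, for example with nc -lk 9999, and then type words into that terminal. Every 3 seconds print() shows the counts for that batch only; the counts are per batch, not cumulative. The IP 192.168.22.138 is the author's test machine, so replace it with your own host.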

2.3 Java Version

package com.aa.sparkjava.streaming;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @Author AA
 * @Date 2021/12/8 16:30
 * @Project bigdatapre
 * @Package com.aa.sparkjava.streaming
 */
public class StreamingWordCountJava {
    public static void main(String[] args) throws Exception {
        //1. Initialize the entry point
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCountJava");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        //2. Get the data source
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("192.168.22.138", 9999);

        //3. Process the data
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> pairs = words.mapToPair(word -> new Tuple2<>(word, 1));
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i, j) -> i + j);

        //4. Output the results
        wordCounts.print();

        //5. Start the program
        jssc.start();
        jssc.awaitTermination();
        jssc.stop();
    }
}
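A note on the Java API: it mirrors the Scala one through JavaStreamingContext, JavaDStream and JavaPairDStream. flatMap has to return an Iterator, which is why the lambda wraps the split result in Arrays.asList(...).iterator(), and key-value operations such as reduceByKey are only available on JavaPairDStream, obtained here via mapToPair.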

3. Common Spark Streaming Input Sources

3.1 Socket Source

package com.aa.sparkscala.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Author AA
 * @Date 2021/12/8 15:59
 * @Project bigdatapre
 * @Package com.aa.sparkscala.streaming
 */
object StreamingWordCountScala {
  def main(args: Array[String]): Unit = {
    //1. Initialize the entry point
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCountScala")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))

    //2. Get the data stream, i.e., the data source
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.22.138", 9999)

    //3. Process the data
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val wordRes: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)

    //4. Output the results
    wordRes.print()

    //5. Start the job
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}

3.2 HDFS Source

package com.aa.sparkscala.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Author AA
 * @Date 2021/12/8 15:59
 * @Project bigdatapre
 * @Package com.aa.sparkscala.streaming
 * An HDFS data source for streaming. It is rarely used in practice; it is enough to know that it exists.
 * Every Seconds(3) interval, the job processes only the files newly created in the monitored directory during that interval.
 */
object WordCountFromHDFS {
  def main(args: Array[String]): Unit = {
    //1. Initialize the entry point
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCountScala")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))

    //2. Get the data stream, i.e., the data source; here the source is a directory
    /**
     * This counts only the data in files newly added to the directory within each Seconds(3) batch interval, not everything already in the directory.
     */
    val lines: DStream[String] = ssc.textFileStream("hdfs://hadoop10/worddir")

    //3. Process the data
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    val wordRes: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)

    //4. Output the results
    wordRes.print()

    //5. Start the job
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
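A usage note on textFileStream: it only picks up files that appear in the directory after the stream has started, and those files should be moved or renamed into the directory atomically (for example written elsewhere and then moved with hdfs dfs -mv), so that half-written files are not read. Unlike the socket source, this source has no receiver, so it does not tie up a receiver thread. The path hdfs://hadoop10/worddir is the author's test cluster path.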


Disclaimer:
        The code and related statements in this article were written based on my own understanding, and the images are screenshots from my own practice or diagrams of the corresponding technologies. If you have any objections, please contact me for removal. Thank you. Please credit the source when reposting. Thank you.

        落叶飘雪