02-Spark Quick Start

I. Create a Maven Project

  1. Add the Scala plugin

    Open IDEA, find the Scala plugin in the plugin marketplace, and install it.

  2. Create a Maven project (any name will do) and add the following dependency and build plugins to its pom.xml (Spark 3.0 is used as the example):

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <!-- This plugin compiles the Scala sources into class files -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <!-- Bind the Scala compile goals to Maven's compile/test-compile phases -->
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
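
    With the assembly plugin bound to the package phase as above, running mvn clean package should compile the Scala sources and produce a fat jar under target/ (named roughly <artifactId>-<version>-jar-with-dependencies.jar, depending on your project coordinates), which is what you would later hand to spark-submit when running on a cluster.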
    
  3. In the IDEA project, create an input directory and, inside it, a data file named word.txt:

    hello world
    hello spark
    
  4. Copy the following code and run it (it counts the occurrences of each word):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
        def main(args: Array[String]): Unit = {
            // Create the Spark configuration object
            val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")

            // Create the Spark context (the connection object)
            val sc: SparkContext = new SparkContext(sparkConf)

            // Read the file data
            val fileRDD: RDD[String] = sc.textFile("input/word.txt")

            // Split each line of the file into words
            val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))

            // Transform the structure: word => (word, 1)
            val word2OneRDD: RDD[(String, Int)] = wordRDD.map((_, 1))

            // Group by word and aggregate the counts
            val word2CountRDD: RDD[(String, Int)] = word2OneRDD.reduceByKey(_ + _)

            // Collect the aggregated result back to the driver
            val word2Count: Array[(String, Int)] = word2CountRDD.collect()

            // Print the result
            word2Count.foreach(println)

            // Stop the Spark connection
            sc.stop()
        }
    }
    
  5. Run the program and check the printed result.
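
    With the sample word.txt above, the console output should contain roughly the following three tuples (their order may vary, since the data is processed in parallel):

    (hello,2)
    (world,1)
    (spark,1)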

  6. Possible exceptions and optimizations. When the program runs, it may report the following error:

    java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    

    This means that Hadoop is not configured in the Windows environment and the winutils.exe file is missing, so it has to be set up manually: download winutils.exe, put it in a local Hadoop bin directory, and set the HADOOP_HOME environment variable to point at that Hadoop directory (a programmatic workaround is sketched right after the log4j configuration below).

    In addition, the INFO output is very verbose, so we can reduce the logging to make the result easier to read. Create a log4j.properties file in the resources directory and add the following logging configuration:

    log4j.rootCategory=ERROR, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    
    # Set the default spark-shell log level to ERROR. When running the spark-shell, the
    # log level for this class is used to overwrite the root logger's log level, so that
    # the user can have different defaults for the shell and regular Spark apps.
    log4j.logger.org.apache.spark.repl.Main=ERROR
    
    # Settings to quiet third party logs that are too verbose
    log4j.logger.org.spark_project.jetty=ERROR
    log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
    log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
    log4j.logger.org.apache.parquet=ERROR
    log4j.logger.parquet=ERROR
    
    # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
    log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
    log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
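
    If you prefer not to modify Windows environment variables for the winutils issue above, a common workaround is to set hadoop.home.dir programmatically at the start of main, before the SparkContext is created. This is only a sketch, and the path below is an assumption; point it at whichever directory actually contains bin\winutils.exe:

    // Hypothetical local Hadoop directory; it must contain bin\winutils.exe
    System.setProperty("hadoop.home.dir", "C:\\hadoop")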
    

The overall directory structure ends up as follows.
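
Based on the steps above, the layout should look roughly like this (placing WordCount.scala under src/main/scala is an assumption based on the standard Maven/Scala source layout used by scala-maven-plugin):

    ├── pom.xml
    ├── input
    │   └── word.txt
    └── src
        └── main
            ├── resources
            │   └── log4j.properties
            └── scala
                └── WordCount.scala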