1. Create a Maven Project
-
Add the Scala plugin
Open IDEA, search for Scala in the plugin marketplace, and install it.
-
Create any Maven project and add the following dependency and build plugins to pom.xml (Spark 3.0 is used as the example):
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- This plugin compiles Scala code into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- Bind to Maven's compile and test-compile phases -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- This plugin packages the project and its dependencies into a single jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
-
In the IDEA project, create an input folder, and inside it create the data file word.txt with the following content:
hello world hello spark
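If you would rather create the data file from code than by hand, the step can be sketched as below. This is only an illustration: the `CreateInput` object and its `writeSample` helper are hypothetical names, and the sketch assumes the program runs with the project root as its working directory (typically the default when running from IDEA).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper, not part of the original tutorial.
object CreateInput {
  // Writes the sample text to <baseDir>/input/word.txt and returns the file path.
  def writeSample(baseDir: Path): Path = {
    val dir = baseDir.resolve("input")
    Files.createDirectories(dir) // does nothing if the directory already exists
    val file = dir.resolve("word.txt")
    Files.write(file, "hello world hello spark\n".getBytes(StandardCharsets.UTF_8))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the working directory is the project root.
    println(writeSample(Paths.get("")))
  }
}
```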
-
Copy and run the following code (it counts how many times each word occurs):
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration object
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    // Create the Spark context (the connection object)
    val sc: SparkContext = new SparkContext(sparkConf)
    // Read the file data
    val fileRDD: RDD[String] = sc.textFile("input/word.txt")
    // Split each line into words
    val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
    // Convert the data structure: word => (word, 1)
    val word2OneRDD: RDD[(String, Int)] = wordRDD.map((_, 1))
    // Group the pairs by word and sum their counts
    val word2CountRDD: RDD[(String, Int)] = word2OneRDD.reduceByKey(_ + _)
    // Collect the aggregated result into driver memory
    val word2Count: Array[(String, Int)] = word2CountRDD.collect()
    // Print the result
    word2Count.foreach(println)
    // Close the Spark connection
    sc.stop()
  }
}
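To check the logic without starting Spark, the same split, pair, and aggregate steps can be mirrored with plain Scala collections. This is a minimal sketch for illustration only: `LocalWordCount` is a hypothetical name, and `groupBy` followed by a sum stands in for `reduceByKey`, which only exists on RDDs.

```scala
// Hypothetical local stand-in for the Spark job, not part of the original tutorial.
object LocalWordCount {
  // Counts words with plain Scala collections, mirroring the RDD steps.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ")) // tokenize, like fileRDD.flatMap
      .map((_, 1))           // word => (word, 1)
      .groupBy(_._1)         // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum the ones per word

  def main(args: Array[String]): Unit =
    count(Seq("hello world hello spark")).foreach(println)
}
```

Because the result is a `Map`, the printing order is not guaranteed, just as the order of the collected RDD result is not.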
-
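Printed output: each word with its count. For the sample file this is (hello,2), (world,1), and (spark,1); the order of the lines may vary.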
-
Possible exceptions and optimizations
When the program runs, it may report the following error:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
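This means Hadoop is not configured in the Windows environment: winutils.exe is missing, and the "null" in the path shows that no Hadoop home directory is set. Configure it manually: download a winutils.exe that matches your Hadoop version, place it in a bin folder under a Hadoop directory, and set the HADOOP_HOME environment variable to that directory.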
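As an alternative to the environment variable, Hadoop also honors the `hadoop.home.dir` JVM system property, which can be set in code before the SparkContext is created. The sketch below is hedged: `HadoopHome.configure` is a hypothetical helper, and `C:\hadoop` is a placeholder path that must be replaced with the directory that actually contains `bin\winutils.exe`.

```scala
// Hypothetical helper, not part of the original tutorial.
object HadoopHome {
  // Sets hadoop.home.dir on Windows when neither the property nor HADOOP_HOME is present.
  // Returns true only if this call set the property.
  def configure(hadoopDir: String,
                osName: String = sys.props.getOrElse("os.name", "")): Boolean = {
    val isWindows = osName.toLowerCase.startsWith("windows")
    val alreadySet = sys.props.contains("hadoop.home.dir") || sys.env.contains("HADOOP_HOME")
    if (isWindows && !alreadySet) {
      System.setProperty("hadoop.home.dir", hadoopDir)
      true
    } else false
  }

  def main(args: Array[String]): Unit = {
    // Placeholder path (assumption): it must contain bin\winutils.exe.
    println(configure("C:\\hadoop"))
  }
}
```

In the WordCount program, the call would go at the top of `main`, before `new SparkContext(sparkConf)`.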
In addition, because the INFO output is very verbose, we can reduce the logging to make the result easier to see. Create a log4j.properties file in the resources directory and add the following logging configuration:
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
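The resulting project layout: pom.xml and input/word.txt sit in the project root, and log4j.properties is in the resources directory.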