Creating a new Spark project in IDEA


1. New Project: create a Maven project

  • Add Archetype

    • JDK and Scala version compatibility: docs.scala-lang.org/overviews/j…

      After two days of trial and error I found that, as of 2023-04-14, neither JDK 20 nor JDK 19 works; JDK 1.8 is still the most reliable choice!!! (You can pin the compiler level in pom.xml; see the sketch after this list.)

    • In my setup, maven-archetype-scala did not show up in the archetype list

      • Option 1: add the archetype manually. Option 2: create the project from any archetype (even webapp) and create the scala folder yourself afterwards.
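
    Because archetypes and IDE defaults vary, it is safer to pin the Java level in the POM itself. A minimal sketch, assuming the project is built with JDK 1.8:

      <properties>
        <!-- pin source and target bytecode level to Java 8 -->
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
      </properties>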

2. Add Scala


  • If you only run Spark locally, almost any version combination works
  • If you need to connect to other systems (a cluster, Kafka, etc.), keep the versions consistent; for example, Spark 3.3.2 is prebuilt for Scala 2.12/2.13, so Scala 2.12.15 pairs with the _2.12 artifacts

3. Create Scala files

1. Right-click the project and choose Add Framework Support, then add Scala.
   Only after this can you create Scala files in the project.

4. Settings

The scala folder must be marked as a Sources root; otherwise running the Scala code fails with: Error: Could not find or load main class.
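Marking the folder only affects IDEA; if you also build from the Maven command line, you can declare the Scala source directory and a Scala compiler plugin in pom.xml. A minimal sketch, assuming the standard src/main/scala layout (the plugin version is an assumption; check for the latest release):

    <build>
      <!-- tell Maven where the Scala sources live -->
      <sourceDirectory>src/main/scala</sourceDirectory>
      <plugins>
        <!-- scala-maven-plugin compiles Scala sources during the Maven build -->
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>4.8.1</version>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>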

5. Configure settings.xml to use the Aliyun mirror


```xml
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">

  <mirrors>
    <!-- mirror
     | Specifies a repository mirror site to use instead of a given repository. The repository that
     | this mirror serves has an ID that matches the mirrorOf element of this mirror. IDs are used
     | for inheritance and direct lookup purposes, and must be unique across the set of mirrors.
     |
    <mirror>
      <id>mirrorId</id>
      <mirrorOf>repositoryId</mirrorOf>
      <name>Human Readable Name for this Mirror.</name>
      <url>http://my.repository.com/repo/path</url>
    </mirror>
     -->
    <mirror>
      <id>alimaven</id>
      <mirrorOf>central</mirrorOf>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
    </mirror>
    <!-- id renamed from the duplicate "alimaven": mirror IDs must be unique -->
    <mirror>
      <id>alimaven-public</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>central</id>
      <name>Maven Repository Switchboard</name>
      <url>http://repo1.maven.org/maven2/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>repo2</id>
      <mirrorOf>central</mirrorOf>
      <name>Human Readable Name for this Mirror.</name>
      <url>http://repo2.maven.org/maven2/</url>
    </mirror>
    <mirror>
      <id>ibiblio</id>
      <mirrorOf>central</mirrorOf>
      <name>Human Readable Name for this Mirror.</name>
      <url>http://mirrors.ibiblio.org/pub/mirrors/maven2/</url>
    </mirror>
    <mirror>
      <id>jboss-public-repository-group</id>
      <mirrorOf>central</mirrorOf>
      <name>JBoss Public Repository Group</name>
      <url>http://repository.jboss.org/nexus/content/groups/public</url>
    </mirror>
    <mirror>
      <id>google-maven-central</id>
      <name>Google Maven Central</name>
      <url>https://maven-central.storage.googleapis.com</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <!-- mirror of the central repository in China -->
    <mirror>
      <id>maven.net.cn</id>
      <name>one of the central mirrors in China</name>
      <url>http://maven.net.cn/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```
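
Maven reads this file from ~/.m2/settings.xml (per user) or from conf/settings.xml under the Maven installation; in IDEA, point the Maven "User settings file" option at your copy so the IDE picks up the mirror.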

6. Configure pom.xml

  • All version numbers must correspond to one another; after you save (Ctrl+S) and reimport, Maven downloads the dependencies automatically

    1. Set the properties

      <properties>
        <scala.binary.version>2.12</scala.binary.version>
        <scala.version>2.12.15</scala.version>
        <spark.version>3.3.2</spark.version>
        <hadoop.version>2.6.0</hadoop.version>
      </properties>
      
    2. Configure the dependencies

      <dependencies>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>3.8.1</version>
          <scope>test</scope>
        </dependency>
      
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_${scala.binary.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_${scala.binary.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_${scala.binary.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-hive_${scala.binary.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql-kafka-0-10_${scala.binary.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
      </dependencies>
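
    For cluster deployment with spark-submit (optional; not needed for local runs), a common pattern is to mark the Spark artifacts as provided, since the cluster already ships them, e.g.:

      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <!-- provided: supplied by the cluster at runtime, excluded from the packaged jar -->
        <scope>provided</scope>
      </dependency>

    If you do this, tick "Add dependencies with provided scope to classpath" in the IDEA run configuration so local runs still work.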
      

7. Develop a Spark application and test it locally

  1. WordCount.scala

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    object WordCount {
      def main(args: Array[String]): Unit = {
        val inputFile = "/Users/bml/Documents/spark/Data01.txt"
        // Run locally with a single thread; appName shows up in the Spark UI
        val conf = new SparkConf().setAppName("WordCount").setMaster("local")
        val sc = new SparkContext(conf)
        val textFile = sc.textFile(inputFile)
        // Split each line into words, map each word to (word, 1), then sum the counts per word
        val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
        wordCount.foreach(println)
        sc.stop()
      }
    }
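
    As a sanity check: if Data01.txt contained, say, the two lines "hello spark" and "hello scala" (a hypothetical input), the job would print unordered pairs such as:

      (spark,1)
      (hello,2)
      (scala,1)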
    

8. Configure Spark to connect to MySQL via JDBC

  1. Let Maven download the connector automatically

    mvnrepository.com/artifact/my…

    Search for mysql and copy the dependency snippet for the version that matches your MySQL server

  2. Add it to pom.xml

    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>8.0.32</version>
    </dependency>
    
  3. Maven downloads it automatically

  4. Scala file

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    object RDDtoDF {

      def main(args: Array[String]): Unit = {

        // Run locally; the SparkSession below reuses this SparkContext
        val conf = new SparkConf()
        conf.setMaster("local")
          .setAppName("RDDtoDF")
        val sc = new SparkContext(conf)
        val spark = SparkSession.builder.getOrCreate()
        // Read the input file passed as the first command-line argument
        val filePath = args(0)
        val rdd = spark.sparkContext.textFile(filePath)
        // Convert the RDD to a DataFrame: each line is "id,name,age"
        import spark.implicits._
        val df = rdd.map(_.split(",")).map(row => (row(0).toInt, row(1), row(2).toInt))
          .toDF("id", "name", "age")
        df.show()
        // Print every row of the DataFrame
        df.collect().foreach(row =>
          println(s"id:${row.getAs[Int]("id")},name:${row.getAs[String]("name")},age:${row.getAs[Int]("age")}")
        )

        // Stop the SparkSession
        spark.stop()
      }

    }
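
    Note that the program above never actually touches MySQL; it only converts an RDD to a DataFrame. A minimal sketch of the JDBC round trip itself, assuming a hypothetical local MySQL with database spark, table student, and root/123456 credentials (all names and credentials are placeholders):

    import java.util.Properties

    import org.apache.spark.sql.SparkSession

    object JDBCExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local").appName("JDBCExample").getOrCreate()
        import spark.implicits._

        // Sample rows to write
        val df = Seq((1, "Alice", 23), (2, "Bob", 24)).toDF("id", "name", "age")

        // Hypothetical connection details; adjust URL, credentials, and table name
        val url = "jdbc:mysql://localhost:3306/spark?useSSL=false"
        val props = new Properties()
        props.put("user", "root")
        props.put("password", "123456")
        props.put("driver", "com.mysql.cj.jdbc.Driver")

        // Append the DataFrame rows to the MySQL table over JDBC
        df.write.mode("append").jdbc(url, "student", props)

        // Read the table back to verify the write
        spark.read.jdbc(url, "student", props).show()

        spark.stop()
      }
    }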