Big Data Framework Learning: Spark (3)



Spark SQL

Requirements: 1. write a custom SQL command; 2. demonstrate the effect of SQL optimization; 3. define custom rules.

Modify SqlBase.g4: add a SHOW VERSION alternative to the statement rule, register VERSION in the ansiNonReserved and nonReserved lists, and define the keyword token (note that 'V' is accepted as a shorthand):

statement
| SHOW VERSION                                                     #showVersion

ansiNonReserved
| VERSION

nonReserved
| VERSION

//--SPARK-KEYWORD-LIST-START
VERSION: 'VERSION' | 'V';
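The ANTLR-generated parser and visitor pick up the new #showVersion alternative once the sources are regenerated. The full package build below does this anyway; to regenerate just the catalyst module where SqlBase.g4 lives, something like the following should work (a sketch, not verified on every Spark branch):

build/mvn -pl sql/catalyst generate-sources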

Modify SparkSqlParser.scala, adding the visitor method for the new context to its SparkSqlAstBuilder:

override def visitShowVersion(ctx: ShowVersionContext): LogicalPlan = withOrigin(ctx) {
  // Map the parsed SHOW VERSION statement onto a runnable command node
  ShowVersionCommand()
}

The ShowVersionCommand class:

package org.apache.spark.sql.execution.command

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.types.StringType


case class ShowVersionCommand() extends LeafRunnableCommand {

  // The command returns a single string column named "version"
  override val output: Seq[Attribute] =
    Seq(AttributeReference("version", StringType)())

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val sparkVersion = sparkSession.version
    val javaVersion = System.getProperty("java.version")
    // releaseVersion is an Option[String], hence the getOrElse below
    val scalaVersion = scala.util.Properties.releaseVersion
    val output = "Spark Version: %s, Java Version: %s, Scala Version: %s"
      .format(sparkVersion, javaVersion, scalaVersion.getOrElse(""))
    Seq(Row(output))
  }
}

Build

build/mvn clean package -DskipTests -Phive -Phive-thriftserver

Launching the spark-sql shell then shows output like:

22/06/23 17:20:15 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore xuejiameng@192.168.54.71

Spark master: local[*], Application Id: local-1655976010880

spark-sql> show version;

The new command returns a single row in the format built in run() above: the Spark, Java, and Scala versions.
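The same check can be done programmatically rather than through the shell (a minimal sketch in Java, assuming the patched build is on the classpath):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("ShowVersionCheck")
        .getOrCreate();
// SHOW VERSION only parses on the patched build
spark.sql("SHOW VERSION").show(false);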

Task 2: start with a simple test class to see how to connect to MySQL through Spark SQL from Java.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSQLTest3 {
    public static void main(String[] args) {
        // Java version: a single-threaded local session is enough for this test
        SparkSession spark = SparkSession
                .builder()
                .master("local")
                .appName("SparkSQLTest3")
                .getOrCreate();

        // Load the MySQL table through the JDBC data source;
        // the subquery alias keeps the data addressable as "city"
        Dataset<Row> jdbcDF = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://127.0.0.1:3306/test")
                .option("dbtable", "(SELECT * FROM city) city")
                .option("user", "root")
                .option("password", "root")
                .load();

        jdbcDF.printSchema();
        jdbcDF.show();

        spark.stop();
    }
}
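Running this needs the MySQL JDBC driver on the classpath; with Maven that is the mysql:mysql-connector-java artifact (the version below is an assumption, match it to your server):

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.29</version>
</dependency>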

[Screenshot: the printed schema and query output of the test program]

Adding one configuration makes Spark log every optimizer rule that changes the plan:

spark.conf().set("spark.sql.planChangeLog.level","WARN");
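The log can get noisy; since Spark 3.1 it can also be narrowed to specific rules via spark.sql.planChangeLog.rules (a sketch, listing the three rules discussed next; the fully qualified names are from the Catalyst optimizer package):

spark.conf().set("spark.sql.planChangeLog.rules",
        "org.apache.spark.sql.catalyst.optimizer.CombineFilters,"
        + "org.apache.spark.sql.catalyst.optimizer.CollapseProject,"
        + "org.apache.spark.sql.catalyst.optimizer.BooleanSimplification");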

CombineFilters: merges two adjacent Filter operators into one, which helps push predicates down.

CollapseProject: collapses two adjacent Project operators into one when their expressions are deterministic and independent.

BooleanSimplification: simplifies boolean expressions, e.g. removing an always-true 1 = 1 conjunct.

A simple example that triggers all three: the 1 = 1 conjunct feeds BooleanSimplification, the nested WHERE clauses feed CombineFilters, and the nested column lists feed CollapseProject:

jdbcDF.createOrReplaceTempView("city");

spark.sql("select NAME, CODE from (select NAME, CODE from city where 1 = 1 and CODE >= 101010100) tmp where tmp.CODE <= 20000000").explain(true);

explain(true) prints the parsed, analyzed, and optimized logical plans along with the physical plan, so the rule rewrites are visible.

[Screenshots: plan change log showing the rewritten plans]

All three rules fire on this query. Next, register the same data under a second name and write a query across both tables:

jdbcDF.createOrReplaceTempView("city2");

spark.sql("SELECT DISTINCT a.NAME FROM ("
        + "SELECT * FROM city WHERE CODE > 101010100 AND 1 = 1) a WHERE a.id = 2 EXCEPT "
        + "SELECT NAME FROM city2 WHERE CODE > 101280101").explain(true);

[Screenshots: plan change log output for the combined query]

The remaining rules are covered as well; the DISTINCT and EXCEPT here give Catalyst's set-operation rewrites something to do on top of the filter and projection rules above.
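As a closing tip, the optimized plan can also be pulled off a Dataset directly instead of reading explain output (a sketch; queryExecution() is a developer-facing API):

Dataset<Row> df = spark.sql("SELECT NAME FROM city WHERE 1 = 1 AND CODE > 101010100");
// The optimized logical plan, i.e. what remains after rules like BooleanSimplification ran
System.out.println(df.queryExecution().optimizedPlan());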