1. About TPC
TPC (Transaction Processing Performance Council) is a non-profit organization founded by dozens of member companies and headquartered in the United States. TPC publishes a series of certified benchmarks for measuring database performance, and the TPC-X benchmarks have become industry standards for this purpose. The main OLAP-oriented benchmark is TPC-DS.
TPC's official website is tpc.org, and a line on the homepage reads: "The TPC is a non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable data to the industry." The TPC thus plays two roles: 1. creating solid benchmarks, and 2. running a solid review process.
Spark's TPC-DS query scripts can be found here:
https://github.com/apache/spark/tree/master/sql/core/src/test/resources/tpcds
2. Environment Setup
Install the JDK:
[root@mv161p120 software]# rpm -ivh ./jdk-8u231-linux-x64.rpm
Preparing... ################################# [100%]
Updating / installing...
1:jdk1.8-2000:1.8.0_231-fcs ################################# [100%]
Unpacking JAR files...
tools.jar...
plugin.jar...
javaws.jar...
deploy.jar...
rt.jar...
jsse.jar...
charsets.jar...
localedata.jar...
[root@mv161p120 software]# java -version
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
Create a user:
[root@mv161p120 software]# useradd spark
[root@mv161p120 software]# sudo su - spark
Download and extract Spark:
[spark@mv161p120 ~]$ wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
--2022-08-11 23:34:33-- https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 224374704 (214M) [application/x-gzip]
Saving to: ‘spark-3.1.1-bin-hadoop2.7.tgz’
100%[===========================================================================================================================================================================>] 224,374,704 1.91MB/s in 98s
[spark@mv161p120 ~]$ ls
spark-3.1.1-bin-hadoop2.7.tgz
[spark@mv161p120 ~]$ tar -zxvf spark-3.1.1-bin-hadoop2.7.tgz
Download the test jars:
[spark@mv161p120 ~]$ wget https://repo1.maven.org/maven2/org/apache/spark/spark-catalyst_2.12/3.1.1/spark-catalyst_2.12-3.1.1-tests.jar
[spark@mv161p120 ~]$ wget https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.1/spark-core_2.12-3.1.1-tests.jar
[spark@mv161p120 ~]$ wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.12/3.1.1/spark-sql_2.12-3.1.1-tests.jar
3. Clone the Code
[spark@mv161p120 ~]$ git clone https://github.com/maropu/spark-tpcds-datagen.git
Cloning into 'spark-tpcds-datagen'...
remote: Enumerating objects: 4686, done.
remote: Counting objects: 100% (505/505), done.
remote: Compressing objects: 100% (143/143), done.
remote: Total 4686 (delta 333), reused 500 (delta 332), pack-reused 4181
Receiving objects: 100% (4686/4686), 36.14 MiB | 2.21 MiB/s, done.
Resolving deltas: 100% (2760/2760), done.
[spark@mv161p120 ~]$
Configure the environment variables:
[spark@mv161p120 ~]$ vim .bash_profile
Add the following lines:
SPARK_HOME=/home/spark/spark-3.1.1-bin-hadoop2.7
PATH=$PATH:$SPARK_HOME/bin
export PATH SPARK_HOME
Run source to make the environment variables take effect:
[spark@mv161p120 ~]$ source .bash_profile
4. Generate the Data
[spark@mv161p120 ~]$ mkdir -p tpcds-data-1g
[spark@mv161p120 ~]$ spark-tpcds-datagen/bin/dsdgen --output-location tpcds-data-1g
/home/spark/spark-tpcds-datagen/bin/../target/spark-tpcds-datagen_2.12-0.1.0-SNAPSHOT-with-dependencies.jar not found, so use pre-compiled /home/spark/spark-tpcds-datagen/bin/../assembly/spark-tpcds-datagen_2.12-0.1.0-SNAPSHOT-with-dependencies.jar
Using `spark-submit` from path: /home/spark/spark-3.1.1-bin-hadoop2.7
22/08/12 00:03:43 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.161.120 instead (on interface eth0)
22/08/12 00:03:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/08/12 00:03:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
... some log output omitted ...
22/08/12 00:04:08 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
"type" : "struct",
"fields" : [ {
"name" : "cs_sold_date_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_sold_time_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ship_date_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_bill_customer_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_bill_cdemo_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_bill_hdemo_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_bill_addr_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ship_customer_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ship_cdemo_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ship_hdemo_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ship_addr_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_call_center_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_catalog_page_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ship_mode_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_warehouse_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_item_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_promo_sk",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_order_number",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_quantity",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_wholesale_cost",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_list_price",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_sales_price",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ext_discount_amt",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ext_sales_price",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ext_wholesale_cost",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ext_list_price",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ext_tax",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_coupon_amt",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_ext_ship_cost",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_net_paid",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_net_paid_inc_tax",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_net_paid_inc_ship",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_net_paid_inc_ship_tax",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
}, {
"name" : "cs_net_profit",
"type" : "decimal(7,2)",
"nullable" : true,
"metadata" : { }
} ]
}
and corresponding Parquet message type:
message spark_schema {
optional int32 cs_sold_date_sk;
optional int32 cs_sold_time_sk;
optional int32 cs_ship_date_sk;
optional int32 cs_bill_customer_sk;
optional int32 cs_bill_cdemo_sk;
optional int32 cs_bill_hdemo_sk;
optional int32 cs_bill_addr_sk;
optional int32 cs_ship_customer_sk;
optional int32 cs_ship_cdemo_sk;
optional int32 cs_ship_hdemo_sk;
optional int32 cs_ship_addr_sk;
optional int32 cs_call_center_sk;
optional int32 cs_catalog_page_sk;
optional int32 cs_ship_mode_sk;
optional int32 cs_warehouse_sk;
optional int32 cs_item_sk;
optional int32 cs_promo_sk;
optional int32 cs_order_number;
optional int32 cs_quantity;
optional int32 cs_wholesale_cost (DECIMAL(7,2));
optional int32 cs_list_price (DECIMAL(7,2));
optional int32 cs_sales_price (DECIMAL(7,2));
optional int32 cs_ext_discount_amt (DECIMAL(7,2));
optional int32 cs_ext_sales_price (DECIMAL(7,2));
optional int32 cs_ext_wholesale_cost (DECIMAL(7,2));
optional int32 cs_ext_list_price (DECIMAL(7,2));
optional int32 cs_ext_tax (DECIMAL(7,2));
optional int32 cs_coupon_amt (DECIMAL(7,2));
optional int32 cs_ext_ship_cost (DECIMAL(7,2));
optional int32 cs_net_paid (DECIMAL(7,2));
optional int32 cs_net_paid_inc_tax (DECIMAL(7,2));
optional int32 cs_net_paid_inc_ship (DECIMAL(7,2));
optional int32 cs_net_paid_inc_ship_tax (DECIMAL(7,2));
optional int32 cs_net_profit (DECIMAL(7,2));
}
... remaining log output omitted ...
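Before running the benchmark, you can sanity-check the generated data from a spark-shell. This is a minimal sketch; it assumes dsdgen wrote each table as Parquet under tpcds-data-1g/<table_name> in the spark user's home directory, as the catalog_sales schema in the log above suggests:

// Paste into spark-shell (on PATH via the SPARK_HOME set earlier).
// Assumption: each table lives as Parquet under /home/spark/tpcds-data-1g/<table_name>.
val catalogSales = spark.read.parquet("/home/spark/tpcds-data-1g/catalog_sales")

// The schema should match the Catalyst schema printed in the generation log above.
catalogSales.printSchema()

// A rough row-count sanity check for the 1 GB scale factor.
println(catalogSales.count())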
5. Run the SQL Queries
[spark@mv161p120 ~]$ spark-submit --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --jars spark-core_2.12-3.1.1-tests.jar,spark-catalyst_2.12-3.1.1-tests.jar spark-sql_2.12-3.1.1-tests.jar --data-location tpcds-data-1g --query-filter "q1"
22/08/12 11:21:42 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.161.120 instead (on interface eth0)
22/08/12 11:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/08/12 11:21:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/08/12 11:21:44 INFO SparkContext: Running Spark version 3.1.1
... some log output omitted ...
22/08/12 11:22:37 INFO OverwriteByExpressionExec: Data source write support org.apache.spark.sql.execution.datasources.noop.NoopBatchWrite$@689afbb0 committed.
Stopped after 2 iterations, 6170 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Linux 3.10.0-862.el7.x86_64
QEMU Virtual CPU version (cpu64-rhel6)
TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
q1 2575 3085 721 0.2 5591.6 1.0X
22/08/12 11:22:37 INFO SparkContext: Invoking stop() from shutdown hook
22/08/12 11:22:37 INFO SparkUI: Stopped Spark web UI at http://192.168.161.120:4040
22/08/12 11:22:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/08/12 11:22:37 INFO MemoryStore: MemoryStore cleared
22/08/12 11:22:37 INFO BlockManager: BlockManager stopped
22/08/12 11:22:37 INFO BlockManagerMaster: BlockManagerMaster stopped
22/08/12 11:22:37 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/08/12 11:22:37 INFO SparkContext: Successfully stopped SparkContext
22/08/12 11:22:37 INFO ShutdownHookManager: Shutdown hook called
22/08/12 11:22:37 INFO ShutdownHookManager: Deleting directory /tmp/spark-c19b5c42-a30f-4b54-8a57-2afb6828a425
22/08/12 11:22:37 INFO ShutdownHookManager: Deleting directory /tmp/spark-940573ac-0b6c-42e8-a5fb-6c9d36b65758
6. Notes on the Relevant Optimizer Rules
ColumnPruning
Column pruning is implemented in Spark SQL by the ColumnPruning rule. A table being queried may have many columns, but most queries do not need to read all of them; column pruning drops the columns the query does not reference, so the amount of data scanned shrinks.
This reduces the amount of data moved over the network and held in memory, and for columnar formats such as Parquet it greatly improves scan efficiency.
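As an illustration, here is a minimal sketch you can paste into spark-shell, assuming the store_sales Parquet data generated above. Only the two referenced columns should appear in the FileScan node's ReadSchema in the physical plan (the exact plan text varies by Spark version):

// Paste into spark-shell. Assumes store_sales was generated under tpcds-data-1g above.
val storeSales = spark.read.parquet("/home/spark/tpcds-data-1g/store_sales")

// Only ss_store_sk and ss_net_profit are referenced, so ColumnPruning trims the scan:
// the FileScan node should list just these two columns in its ReadSchema.
storeSales
  .groupBy("ss_store_sk")
  .sum("ss_net_profit")
  .explain(true)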
PushDownPredicates
PushDownPredicates is the predicate pushdown rule: it pushes filter conditions as far down the plan as possible, ideally right next to the data source.
Predicate pushdown can move a filter condition below a JOIN, so the data is filtered as it is scanned; the amount of data participating in the JOIN drops significantly, and so does the JOIN time.
In the execution plan, this shows up as the Filter being pushed below the Join and executed first.
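A minimal sketch of this behavior, again assuming the Parquet data generated above: the filter on d_year is written after the join, but the optimized plan should show it pushed below the join onto the date_dim scan (the exact plan shape may vary):

// Paste into spark-shell. Assumes store_sales and date_dim were generated under tpcds-data-1g above.
val storeSales = spark.read.parquet("/home/spark/tpcds-data-1g/store_sales")
val dateDim    = spark.read.parquet("/home/spark/tpcds-data-1g/date_dim")

// The filter is written after the join, but PushDownPredicates moves it below the join,
// so date_dim is filtered while it is being scanned.
storeSales
  .join(dateDim, storeSales("ss_sold_date_sk") === dateDim("d_date_sk"))
  .filter(dateDim("d_year") === 2000)
  .explain(true)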