1. Overview
Trino on Kubernetes combines the Trino query engine with the Kubernetes container orchestration platform, so that Trino can be deployed, managed, and run on a Kubernetes cluster.
Trino (formerly known as PrestoSQL) is a high-performance distributed SQL query engine designed for large datasets and complex queries. Kubernetes is a popular open-source container orchestration platform that automates the deployment, scaling, and management of containers.
Deploying Trino on Kubernetes brings several advantages:
- Elastic scaling: Kubernetes can automatically add or remove Trino instances based on workload demand, so the cluster scales with query load and improves performance and resource utilization.
- High availability: Kubernetes provides fault tolerance and failure recovery. Running multiple Trino instances in the cluster gives a highly available architecture: if one instance fails, the others take over and keep the system available.
- Resource management: Kubernetes schedules and manages the compute, storage, and network resources that Trino instances use. With appropriate resource requests and limits, you can control the resource consumption of Trino queries and prevent contention.
- Simplified deployment and management: Kubernetes offers declarative configuration and automated rollout, which simplifies deploying and operating Trino. Using standard Kubernetes tooling and APIs, Trino instances can easily be created, configured, and monitored.
- Ecosystem integration: Kubernetes has a rich ecosystem and integrates seamlessly with other tools and platforms, for example storage systems (such as Hadoop HDFS and Amazon S3) and other data processing engines (such as Apache Spark), enabling seamless data access and processing.
Note that running Trino on Kubernetes requires appropriate configuration and tuning to ensure performance and reliability. For large-scale or complex query workloads, you may also need to consider optimizations such as data sharding, partitioning, and data locality.
In short, Trino on Kubernetes offers a flexible, scalable, and efficient way to deploy and manage the Trino query engine, letting it adapt to the query demands of big-data environments.
This article only walks through the deployment process. For more on Trino itself, see my earlier articles:
- 大数据Hadoop之——基于内存型SQL查询引擎Presto(Presto-Trino环境部署)
- 【大数据】Presto(Trino)SQL 语法进阶
- 【大数据】Presto(Trino)REST API 与执行计划介绍
- 【大数据】Presto(Trino)配置参数以及 SQL语法
For a single-node container deployment, see this article of mine: 【大数据】通过 docker-compose 快速部署 Presto(Trino)保姆级教程
2. k8s Deployment
Setting up the k8s environment itself is not repeated here; the key prerequisite is Hadoop on k8s. If you are unsure how to deploy a k8s environment, refer to my earlier articles on the topic.
3. Orchestrating and Deploying Trino
1) Build the image (Dockerfile)
FROM registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/centos:7.7.1908
RUN rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
# Set the default locale (ENV persists into the image; RUN export would not)
ENV LANG zh_CN.UTF-8
# Create the user and group matching user 10000:10000 in the YAML manifests
RUN groupadd --system --gid=10000 hadoop && useradd --system --home-dir /home/hadoop --uid=10000 --gid=hadoop hadoop -m
# Install sudo
RUN yum -y install sudo ; chmod 640 /etc/sudoers
# Grant the hadoop user passwordless sudo
RUN echo "hadoop ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
RUN yum -y install net-tools telnet wget nc
RUN mkdir /opt/apache/
# Add and configure the JDK
ADD zulu20.30.11-ca-jdk20.0.1-linux_x64.tar.gz /opt/apache/
ENV JAVA_HOME /opt/apache/zulu20.30.11-ca-jdk20.0.1-linux_x64
ENV PATH $JAVA_HOME/bin:$PATH
# Add and configure the trino server
ENV TRINO_VERSION 416
ADD trino-server-${TRINO_VERSION}.tar.gz /opt/apache/
ENV TRINO_HOME /opt/apache/trino
RUN ln -s /opt/apache/trino-server-${TRINO_VERSION} $TRINO_HOME
# Create the config directory and the catalog directory for data sources
RUN mkdir -p ${TRINO_HOME}/etc/catalog
# Add the trino CLI
COPY trino-cli-416-executable.jar $TRINO_HOME/bin/trino-cli
# Copy bootstrap.sh
COPY bootstrap.sh /opt/apache/
RUN chmod +x /opt/apache/bootstrap.sh ${TRINO_HOME}/bin/trino-cli
RUN chown -R hadoop:hadoop /opt/apache
WORKDIR $TRINO_HOME
The contents of bootstrap.sh:
#!/usr/bin/env sh

# Wait until host $1 is listening on port $2; skipped when either is empty
wait_for() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        echo "Waiting for $1 to listen on $2..."
        while ! nc -z "$1" "$2"; do echo waiting...; sleep 1s; done
    fi
}

# Optionally wait for a host/port, then start the Trino launcher
start_trino() {
    wait_for "$1" "$2"
    ${TRINO_HOME}/bin/launcher run --verbose
}

# Expected arguments: <trino-coordinator|trino-worker> [wait-host wait-port]
case $1 in
trino-coordinator)
    shift
    start_trino "$@"
    ;;
trino-worker)
    shift
    start_trino "$@"
    ;;
*)
    echo "Please pass a valid service start command (trino-coordinator|trino-worker)~"
    ;;
esac
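For clarity, here is how the entrypoint is meant to be invoked: a worker first waits for the coordinator, while the coordinator starts immediately. The coordinator service name and port below are assumptions that must match the Service defined in the chart:
# Coordinator: no wait arguments, starts right away
/opt/apache/bootstrap.sh trino-coordinator
# Worker: blocks until the coordinator answers on port 8080, then starts
/opt/apache/bootstrap.sh trino-worker trino-coordinator 8080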
Build the image:
docker build -t registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s:416 . --no-cache
### Parameter explanation
# -t: the image name (and tag)
# . : the build context; the Dockerfile is in the current directory
# -f: path to the Dockerfile (when it is not ./Dockerfile)
# --no-cache: build without using the layer cache
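Since the cluster pulls the image from the registry (see imagePullSecrets in values.yaml below), push it after building. This assumes you have push access to this repository; otherwise re-tag the image to your own registry first:
docker push registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s:416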
2) values.yaml configuration
# Default values for trino.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image:
  repository: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart version.
  tag: 416

imagePullSecrets:
  - name: registry-credentials

server:
  workers: 1
  node:
    environment: production
    dataDir: /opt/apache/trino/data
    pluginDir: /opt/apache/trino/plugin
  log:
    trino:
      level: INFO
  config:
    path: /opt/apache/trino/etc
    http:
      port: 8080
    https:
      enabled: false
      port: 8443
      keystore:
        path: ""
    # Trino supports multiple authentication types: PASSWORD, CERTIFICATE, OAUTH2, JWT, KERBEROS
    # For more info: https://trino.io/docs/current/security/authentication-types.html
    authenticationType: ""
    query:
      maxMemory: "1GB"
      maxMemoryPerNode: "512MB"
    memory:
      heapHeadroomPerNode: "512MB"
  exchangeManager:
    name: "filesystem"
    baseDir: "/tmp/trino-local-file-system-exchange-manager"
  workerExtraConfig: ""
  coordinatorExtraConfig: ""
  autoscaling:
    enabled: false
    maxReplicas: 5
    targetCPUUtilizationPercentage: 50

accessControl: {}
# type: configmap
# refreshPeriod: 60s
# # Rules file is mounted to /etc/trino/access-control
# configFile: "rules.json"
# rules:
#   rules.json: |-
#     {
#       "catalogs": [
#         {
#           "user": "admin",
#           "catalog": "(mysql|system)",
#           "allow": "all"
#         },
#         {
#           "group": "finance|human_resources",
#           "catalog": "postgres",
#           "allow": true
#         },
#         {
#           "catalog": "hive",
#           "allow": "all"
#         },
#         {
#           "user": "alice",
#           "catalog": "postgresql",
#           "allow": "read-only"
#         },
#         {
#           "catalog": "system",
#           "allow": "none"
#         }
#       ],
#       "schemas": [
#         {
#           "user": "admin",
#           "schema": ".*",
#           "owner": true
#         },
#         {
#           "user": "guest",
#           "owner": false
#         },
#         {
#           "catalog": "default",
#           "schema": "default",
#           "owner": true
#         }
#       ]
#     }

additionalNodeProperties: {}
additionalConfigProperties: {}
additionalLogProperties: {}
additionalExchangeManagerProperties: {}
eventListenerProperties: {}

# additionalCatalogs: {}
additionalCatalogs:
  mysql: |-
    connector.name=mysql
    connection-url=jdbc:mysql://mysql-primary.mysql:3306
    connection-user=root
    connection-password=WyfORdvwVm
  hive: |-
    connector.name=hive
    hive.metastore.uri=thrift://hadoop-hadoop-hive-metastore.hadoop:9083
    hive.allow-drop-table=true
    hive.allow-rename-table=true
    #hive.config.resources=/tmp/core-site.xml,/tmp/hdfs-site.xml

# Array of EnvVar (https://v1-18.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#envvar-v1-core)
env: []

initContainers: {}
# coordinator:
#   - name: init-coordinator
#     image: busybox:1.28
#     imagePullPolicy: IfNotPresent
#     command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
# worker:
#   - name: init-worker
#     image: busybox:1.28
#     command: ['sh', '-c', 'echo The worker is running! && sleep 3600']

securityContext:
  runAsUser: 10000
  runAsGroup: 10000

service:
  # type: ClusterIP
  type: NodePort
  port: 8080
  nodePort: 31880

nodeSelector: {}
tolerations: []
affinity: {}

auth: {}
# Set username and password
# https://trino.io/docs/current/security/password-file.html#file-format
# passwordAuth: "username:encrypted-password-with-htpasswd"

serviceAccount:
  # Specifies whether a service account should be created
  create: false
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""
  # Annotations to add to the service account
  annotations: {}

secretMounts: []

coordinator:
  jvm:
    maxHeapSize: "2G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  additionalJVMConfig: {}
  resources: {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

worker:
  jvm:
    maxHeapSize: "2G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  additionalJVMConfig: {}
  resources: {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi
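Before installing, you can render the chart locally to sanity-check what these values produce; this writes every generated manifest, including the catalog ConfigMap shown next, to a file for inspection:
helm template trino ./ -n trino > rendered.yaml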
3) Trino catalog ConfigMap template
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "trino.catalog" . }}
  labels:
    app: {{ template "trino.name" . }}
    chart: {{ template "trino.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
    role: catalogs
data:
  tpch.properties: |
    connector.name=tpch
    tpch.splits-per-node=4
  tpcds.properties: |
    connector.name=tpcds
    tpcds.splits-per-node=4
{{- range $catalogName, $catalogProperties := .Values.additionalCatalogs }}
  {{ $catalogName }}.properties: |
{{- $catalogProperties | nindent 4 }}
{{- end }}
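After the release is installed, you can check that the catalogs were rendered into the ConfigMap. The exact ConfigMap name is an assumption here, since it comes from the chart's trino.catalog template; list the ConfigMaps first to find it:
kubectl get configmap -n trino
kubectl get configmap trino-catalog -n trino -o yaml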
Only the core deployment configuration is shown here; a git download link is provided at the bottom. Feel free to leave a comment or message me with any questions~
4) Install
cd trino-on-kubernetes
# Install
helm install trino ./ -n trino --create-namespace
# Upgrade
# helm upgrade trino ./ -n trino
# Uninstall
# helm uninstall trino -n trino
# Check
kubectl get pods,svc -n trino
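Because the Service is exposed as a NodePort (31880 in values.yaml above), the Trino HTTP endpoint and web UI should be reachable from outside the cluster; replace <node-ip> with the IP of any cluster node:
curl http://<node-ip>:31880/v1/info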
5) Test and verify
coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`
# Log in to the CLI
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop
# List catalogs
show catalogs;
select * from system.runtime.nodes;
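The tpch catalog defined in the ConfigMap above needs no external data source, so it is handy for a first end-to-end sanity check of the cluster:
-- tpch.tiny.nation is a built-in generated table (25 rows)
select count(*) from tpch.tiny.nation;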
4. Configuring the k8s Hive Data Source
For Hive on k8s, see my article: Hadoop on k8s 快速部署进阶精简篇
Add the data source to the trino-on-kubernetes/values.yaml file:
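For example, a hive entry under additionalCatalogs, mirroring the one shown in section 3; the metastore URI must match your own Hadoop deployment:
additionalCatalogs:
  hive: |-
    connector.name=hive
    hive.metastore.uri=thrift://hadoop-hadoop-hive-metastore.hadoop:9083
    hive.allow-drop-table=true
    hive.allow-rename-table=true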
Update the configuration and restart the Trino pods:
helm upgrade trino ./ -n trino
# Restart: a modified ConfigMap is not reloaded automatically, so the pods must be restarted for it to take effect
kubectl delete pod -n trino `kubectl get pods -n trino|awk 'NR!=1{print $1}'`
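Instead of deleting pods by hand, a rolling restart of the Deployments achieves the same thing; the Deployment names below are assumptions based on the chart and should be checked with kubectl get deploy -n trino:
kubectl rollout restart deployment/trino-coordinator deployment/trino-worker -n trino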
coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`
# Log in to the CLI
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop
# List catalogs
show catalogs;
# List schemas in the hive catalog
show schemas from hive;
# List tables
show tables from hive.default;
# Create a schema
create schema hive.test;
# Create a table
CREATE TABLE hive.test.movies (
  movie_id bigint,
  title varchar,
  rating real, -- real is similar to a float type
  genres varchar,
  release_year int
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['release_year'] -- the partition column must be the last column in the column list above
);
# Load data into the Hive table
INSERT INTO hive.test.movies
VALUES
  (1, 'Toy Story', 8.3, 'Animation|Adventure|Comedy', 1995),
  (2, 'Jumanji', 6.9, 'Action|Adventure|Family', 1995),
  (3, 'Grumpier Old Men', 6.5, 'Comedy|Romance', 1995);
# Query the data
select * from hive.test.movies;
5. Quick Deployment: Core Steps (jump straight here if you only care about deployment)
If you just want a quick deployment, you can skip everything above and run the following steps directly:
1) Install git
# Install git
yum -y install git
2) Download the trino deployment package
git clone git@github.com:HBigdata/trino-on-kubernetes.git
cd trino-on-kubernetes
3) Configure data sources
Review values.yaml and edit the additionalCatalogs section as needed (see section 4 above):
cat -n values.yaml
4) Configure resource requests and limits, as sketched below.
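A minimal values.yaml sketch, reusing the placeholder sizes from the chart's own comments; real Trino workloads will need considerably more memory:
coordinator:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 128Mi
# and likewise under worker.resources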
5) Adjust the Trino configuration (JVM memory settings), for example:
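The heap sizes below repeat the chart defaults from section 3. As a rule of thumb, the JVM heap must fit inside the container memory limit, and query.maxMemoryPerNode plus memory.heapHeadroomPerNode must stay below the heap size:
coordinator:
  jvm:
    maxHeapSize: "2G"
worker:
  jvm:
    maxHeapSize: "2G"
server:
  config:
    query:
      maxMemoryPerNode: "512MB"
    memory:
      heapHeadroomPerNode: "512MB"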
6) Deploy
# git clone git@github.com:HBigdata/trino-on-kubernetes.git
# cd trino-on-kubernetes
# Install
helm install trino ./ -n trino --create-namespace
# Upgrade
helm upgrade trino ./ -n trino
# Uninstall
helm uninstall trino -n trino
7) Test and verify
coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`
# Log in to the CLI
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop
# List catalogs
show catalogs;
# List schemas in the hive catalog
show schemas from hive;
# List tables
show tables from hive.default;
# Create a schema
create schema hive.test;
# Create a table
CREATE TABLE hive.test.movies (
  movie_id bigint,
  title varchar,
  rating real, -- real is similar to a float type
  genres varchar,
  release_year int
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['release_year'] -- the partition column must be the last column in the column list above
);
# Load data into the Hive table
INSERT INTO hive.test.movies
VALUES
  (1, 'Toy Story', 8.3, 'Animation|Adventure|Comedy', 1995),
  (2, 'Jumanji', 6.9, 'Action|Adventure|Family', 1995),
  (3, 'Grumpier Old Men', 6.5, 'Comedy|Romance', 1995);
# Query the data
select * from hive.test.movies;
That completes the deployment and usability demo of Trino on k8s. If you have any questions, follow my WeChat official account 大数据与云原生技术分享 to join the group chat, or message me directly. If this article helped you, please give it a like, share, and bookmark~