Hive: data skew, managed and external tables

Data skew when joining Hive tables

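1. Cause of the skew: map output is assigned to reducers by the hash of its key. Uneven key distribution,
	the nature of the business data itself, or a poorly thought-out table design can leave the reducers
	with very different amounts of data.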
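2. Solutions:
	1. Parameter tuning
		hive.map.aggr = true              (partial aggregation on the map side)
		hive.groupby.skewindata = true    (adds an extra MR job that spreads skewed group-by keys before the final aggregation)
	2. SQL tuning
		① Use the table whose join key is most evenly distributed as the driving table. Prune columns and
		  filter early so that less data reaches the join.
		② Joining a large table with a small table:
			Use a map join so that the small dimension table (under about 1,000 records) is loaded into memory first.
			The join then completes on the map side, with no reduce phase.
		③ Joining two large tables:
			Replace null keys with a string plus a random number so that the skewed rows are spread across different
			reducers. Null keys never match anything in the join, so the final result is unaffected.
		④ count distinct over many rows that share one special value:
			Handle empty values separately. If count distinct is the only computation, simply filter the empty
			values out and add 1 to the final result (so the empty value is still counted once). If other
			computations need a group by, process the records with empty values separately first, then union
			them with the other results.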
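
The tuning steps above can be sketched in HiveQL roughly as follows. This is an illustration only: the tables fact_orders, dim_city, log_events and users and all of their columns are hypothetical names, user_id is assumed to be a string, and no real user_id is assumed to start with 'null_'. Recent Hive versions typically ignore the MAPJOIN hint and convert small-table joins automatically (hive.auto.convert.join), so the hint mainly matters on older setups.

```sql
-- 1. Session parameters from the notes
SET hive.map.aggr = true;            -- partial aggregation on the map side
SET hive.groupby.skewindata = true;  -- extra job that spreads skewed GROUP BY keys

-- 2. Large table joined with a small dimension table:
--    ask Hive to hold dim_city (alias d) in memory and join on the map side
SELECT /*+ MAPJOIN(d) */ f.order_id, d.city_name
FROM fact_orders f
JOIN dim_city d ON f.city_id = d.city_id;

-- 3. Two large tables with many NULL join keys:
--    give each NULL key a random value so the rows land on different reducers;
--    the salted keys match nothing in users, so the output is unchanged
SELECT l.event_id, u.user_name
FROM log_events l
LEFT JOIN users u
  ON CASE WHEN l.user_id IS NULL
          THEN concat('null_', cast(rand() AS string))
          ELSE l.user_id
     END = u.user_id;

-- 4. count distinct where one special value ('' here) dominates:
--    filter it out, then add 1 so it is still counted once
--    (assumes '' does occur and should count as a distinct value)
SELECT count(DISTINCT user_id) + 1 AS distinct_users
FROM log_events
WHERE user_id <> '';
```

Note that count(DISTINCT ...) already ignores NULL, so the +1 in the last query only makes sense for a non-NULL special value such as the empty string.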

Differences between Hive managed (internal) and external tables

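1. At creation: for a managed (internal) table, loaded data is moved into the path the data warehouse points to
	(or into the LOCATION given when the table is created);
	create table test (name string , age string) location '/input/table_data';
	An external table only records the path where the data lives and does not change the data's location.
	create external table etest (name string , age string);
	Because no LOCATION is given, this creates an etest directory under /user/hive/warehouse/.
	load data inpath '/input/edata' into table etest;
	(Note that load data inpath moves the HDFS files into the table's directory for either kind of table.)
2. 	At deletion: dropping a managed table deletes both its metadata and its data,
	while dropping an external table deletes only the metadata and leaves the data in place. This makes external tables the safer choice.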
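
A minimal sketch of the drop behaviour, assuming a hypothetical table etest_logs and tab-delimited files that already sit under /input/edata. Here the external table is given a LOCATION so that it points at the data in place, which is the more common way to use one:

```sql
-- External table that only records where the data lives (path is an assumption)
CREATE EXTERNAL TABLE etest_logs (name STRING, age STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/input/edata';

-- The "Table Type" row of the output shows EXTERNAL_TABLE or MANAGED_TABLE
DESCRIBE FORMATTED etest_logs;

-- Removes only the metastore entry; the files under /input/edata stay on HDFS
DROP TABLE etest_logs;
```

Running the same DROP TABLE against a managed table would remove the table's directory as well.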
	
Partitioned tables:
	A table with a single partition column:
		create table score(s_id string,c_id string,s_score int) 
		partitioned by (month string) row format delimited fields terminated by '\t';
	A table with several partition columns:
		create table score2(s_id string,c_id string,s_score int)
		partitioned by (year string,month string,day string)
		row format delimited fields terminated by '\t';
	Load data into a partition:
		load data local inpath '/bigdata/logs/score.csv' into table score partition(month='201806');
	List partitions:
		show partitions score;
	Add one partition:
		alter table score add partition(month='201805');
	Add several partitions at once:
		alter table score add partition(month='201804') partition(month='201803');
	Drop a partition:
		alter table score drop partition(month='201806');
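
As a rough sketch of how the partitions above are used, the statements below load a file into one three-level partition of score2 and then query score with a partition filter. The partition values are made-up examples, and the file path is reused from the load statement above.

```sql
-- Every partition column of score2 needs a value when loading statically
LOAD DATA LOCAL INPATH '/bigdata/logs/score.csv'
INTO TABLE score2 PARTITION (year='2018', month='06', day='01');

-- Filtering on the partition column lets Hive read only the month=201806
-- directory instead of scanning the whole table
SELECT s_id, c_id, s_score
FROM score
WHERE month = '201806';

-- Each partition appears as one line, e.g. year=2018/month=06/day=01
SHOW PARTITIONS score2;
```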
Bucketed tables:
	Settings before creating a bucketed table:
		set hive.enforce.bucketing=true;
		-- one reducer per bucket (4 buckets)
		set mapreduce.job.reduces=4;
		
	## Create the bucketed table:
		create table user_buckets_demo(id int,name string)
		clustered by(id) into 4 buckets
		row format delimited fields terminated by '\t';
	## Create a plain table:
		create table user_demo(id int, name string)
		row format delimited fields terminated by '\t';
	Load data into the plain table user_demo:
		load data local inpath '/bigdata/logs/user_bucket.txt'
		overwrite into table user_demo;
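
Rows are only distributed into buckets when they are written through a query, since a plain load just copies the file. A minimal sketch, assuming the plain table user_demo already holds the rows:

```sql
SET hive.enforce.bucketing = true;

-- Hive hashes id and writes each row to one of the 4 bucket files
INSERT OVERWRITE TABLE user_buckets_demo
SELECT id, name FROM user_demo;

-- Sample a single bucket (about a quarter of the rows) instead of scanning all of them
SELECT id, name
FROM user_buckets_demo TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```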

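Differences between partitioned and bucketed tables:
	1. A partition is a directory; a bucket is a file
	2. Partitions are declared with partitioned by; the partition column is a pseudo-column whose type must be specified
	   Buckets are declared with a clustered by clause; the bucketing column is a real column, and the number of buckets must be specified
	3. The number of partitions can keep growing; the number of buckets is fixed once declared
	4. Purpose
		Partitioning avoids full table scans
		Bucketing preserves the bucket structure in the results of bucketed queries
		Bucketing makes MR jobs more efficient for sampling and joins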

Hive functions: differences between UDF, UDAF and UDTF

UDF: one row in, one row out
UDAF: many rows in, one row out
UDTF: one row in, many rows out
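
Hive's built-in functions fall into the same three shapes, so they make a convenient illustration. The sketch below reuses the user_demo(id, name) table and assumes, purely for the sake of the example, that name may hold a comma-separated list:

```sql
-- UDF shape: one value in, one value out, evaluated once per row
SELECT id, upper(name) AS name_upper
FROM user_demo;

-- UDAF shape: many rows in, one row out
SELECT count(*) AS row_count, max(id) AS max_id
FROM user_demo;

-- UDTF shape: one row in, possibly many rows out;
-- LATERAL VIEW joins each generated row back to its source row
SELECT u.id, t.part
FROM user_demo u
LATERAL VIEW explode(split(u.name, ',')) t AS part;
```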

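Adding and modifying table columns

Adding columns to a Hive table
Syntax: `alter table <table_name> add columns (<column_name> <column_type>)`
```
alter table idl.i_all_day_uid_feature add columns(
uid_data_labt_cust_type string comment 'customer segment'
,uid_is_qult_cmp string comment 'customer certification'
,uid_qult_cmp_type string comment 'organization category'
);

-- move the existing column cases so that it follows uid_qult_cmp_type
ALTER TABLE idl.i_all_day_uid_feature change cases cases array<map<string,string>> after uid_qult_cmp_type;
```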

Changing a column's type
Syntax: `alter table <table_name> change column <old_column_name> <new_column_name> <new_column_type>`  
For example: `alter table idl.i_all_day_uid_feature change column last_phone_un_reg_time last_phone_un_reg_time string;`