Hive Bucketed Tables & Partitioned Tables

Technology, 2022-07-10


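Partitioned tables

A limitation of Hive: unlike MySQL, Hive has no index mechanism, so a query scans the whole table by brute force.

Since there is no index, Hive offers partitioned tables instead. Declare a partition column when creating the table, and Hive creates one subdirectory per partition under the table's directory. With daily partitions, for example, the table directory contains one directory per day. In essence, a partitioned table splits data into directories.

Creating a partitioned table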

create table dept_partition( deptno int, dname string, loc string ) partitioned by (day string) row format delimited fields terminated by '\t';

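In the syntax above, the partition column must not have the same name as any regular column of the table.

Loading data into a partitioned table: the target partition must be specified in the load statement.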

load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');

Querying a partitioned table

Querying a single partition:

select * from dept_partition where day = '20200401';
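One way to check that a filter on the partition column really limits the scan to a single partition directory is to ask Hive which inputs the query depends on. The statement below is a minimal sketch using `explain dependency`; the result should list only the matching partition among the input partitions, though the exact output format depends on the Hive version.

```sql
-- Show which tables and partitions the query would read, without running it
explain dependency
select * from dept_partition where day = '20200401';
```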

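Querying multiple partitions, method 1:

select * from dept_partition where day = '20200401' or day = '20200402';

Method 2:

select * from dept_partition where day = '20200401' union select * from dept_partition where day = '20200402' union select * from dept_partition where day = '20200403';

union stacks result sets vertically (more rows), whereas join combines tables horizontally (more columns). Note that union also removes duplicate rows; union all keeps them.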
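The difference between union and union all only shows up when the two result sets can contain identical rows. The sketch below selects only deptno and dname (leaving out the day column, which differs between partitions) so that duplicates become possible; it assumes both partitions have been loaded.

```sql
-- union: identical (deptno, dname) rows from the two partitions appear once
select deptno, dname from dept_partition where day = '20200401'
union
select deptno, dname from dept_partition where day = '20200402';

-- union all: every row from both partitions is kept, duplicates included
select deptno, dname from dept_partition where day = '20200401'
union all
select deptno, dname from dept_partition where day = '20200402';
```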

Listing the partitions of a partitioned table

show partitions dept_partition;

Dropping a partition

alter table dept_partition drop partition(day=' ');

Adding a partition

alter table dept_partition add partition(day='20200404');

Adding multiple partitions (no commas between the partition clauses)

alter table dept_partition add partition(day='20200405') partition(day='20200406');

Dropping multiple partitions (the partition clauses must be separated by commas)

alter table dept_partition drop partition(day = '20200404'),partition (day = '20200405');

Two-level partitioning

Two-level partitioning handles the case where even a single day's data is very large.

create table dept_partition2(deptno int,dname string,loc string) partitioned by (day string,hour string) row format delimited fields terminated by '\t';

Query

select * from dept_partition2 where day= '20200401' and hour = '12';

Loading data into a two-level partition

load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='12');

Adding two-level partitions

alter table dept_partition2 add partition(day='20200403',hour='01') partition(day='20200403',hour = '02');

Dropping two-level partitions

alter table dept_partition2 drop partition(day='20200403',hour='01'),partition(day='20200403',hour = '02');

Three ways to keep partition data and metadata in sync

1 Upload the data first, then repair the table

msck repair table dept_partition2;

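2 Upload the data first, then add the partition manually

3 Load directly: specifying the partition values in the load statement both uploads the data and creates the corresponding partition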
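The three approaches can be sketched side by side from the Hive CLI. The warehouse path /user/hive/warehouse/dept_partition2 is an assumption (the default location for a managed table in the default database) and the hour values are chosen only for illustration; adjust both to your environment.

```sql
-- Way 1: upload into a new partition directory, then let Hive discover it
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20200401/hour=13;
dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/dept_partition2/day=20200401/hour=13;
msck repair table dept_partition2;

-- Way 2: upload the file, then register the partition by hand
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20200401/hour=14;
dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/dept_partition2/day=20200401/hour=14;
alter table dept_partition2 add partition(day='20200401', hour='14');

-- Way 3: load creates the directory and the partition metadata in one step
load data local inpath '/opt/module/hive/datas/dept_20200401.log'
into table dept_partition2 partition(day='20200401', hour='15');
```

In the first two ways the files reach HDFS without Hive's knowledge, so queries will not see them until the metastore learns about the partition.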

Dynamic partition settings

Required settings

(1) Enable dynamic partitioning (default true, i.e. enabled)

hive.exec.dynamic.partition=true

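(2) Switch to non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic)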

hive.exec.dynamic.partition.mode=nonstrict

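The following settings are optional

(3) The maximum total number of dynamic partitions that can be created across all nodes running the MR job. Default 1000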

hive.exec.max.dynamic.partitions=1000

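(4) The maximum number of dynamic partitions that can be created on each node running the MR job. Set this according to the actual data. For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter must be set above 365; with the default of 100 the job fails.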

hive.exec.max.dynamic.partitions.pernode=100

(5) The maximum number of HDFS files that can be created by the whole MR job. Default 100000

hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. Usually left unset. Default false

hive.error.on.empty.partition=false
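With these settings in place, a dynamic partition insert lets Hive derive the partition value from the data instead of from a literal in the statement. The following is a minimal sketch: the target table dept_partition_dy is hypothetical and introduced only for illustration, and the rows come from the dept_partition table created earlier.

```sql
-- Hypothetical target table, partitioned by loc
create table dept_partition_dy(id int, name string) partitioned by (loc string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- No value is given for loc in the partition clause.
-- The last column of the select list supplies the partition value for each row.
insert into table dept_partition_dy partition(loc)
select deptno, dname, loc from dept_partition;

-- One partition per distinct loc value should now exist
show partitions dept_partition_dy;
```

The dynamic partition columns must come last in the select list, in the same order as in the partition clause, because Hive matches them by position rather than by name.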

Bucketed tables split data into files (the bucketing column must be one of the table's own columns)

Creating a bucketed table

create table stu_buck(id int, name string) clustered by(id) into 4 buckets row format delimited fields terminated by '\t';

How can you tell whether a table is partitioned or bucketed? (Show the detailed table information)

desc formatted stu_buck;

[Image not available: 分桶字段.png (bucketing column)]

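Loading data into a bucketed table. The statement below may fail: in newer Hive versions the load runs as an MR job, so the data has to be loaded from an HDFS path instead.

load data local inpath '/opt/module/hive/datas/tmp/student.txt' into table stu_buck;

The correct way is:

load data inpath '<HDFS path>' into table stu_buck;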
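A commonly used alternative to loading files straight into a bucketed table is to load them into a plain staging table and then populate the bucketed table with insert ... select, which lets Hive distribute the rows into buckets itself. The sketch below assumes a hypothetical staging table stu_staging; it is not part of the original setup.

```sql
-- Hypothetical staging table with the same columns, not bucketed
create table stu_staging(id int, name string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive/datas/tmp/student.txt' into table stu_staging;

-- On Hive 1.x, hive.enforce.bucketing may need to be set to true first (assumption about your version)
-- The insert runs a job that distributes rows into the 4 bucket files by id
insert into table stu_buck
select id, name from stu_staging;

-- Read back only the rows that fall into the first of the 4 buckets
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```

Sampling by bucket is one of the main reasons to bucket a table: the query can read a single bucket file instead of the whole table.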
