Hive Tuning

License: CC 4.0 BY-SA

This article is an original work from cpucode.blog.csdn.net. Reposting is welcome; please keep the attribution.

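This article explains how to read Hive execution plans and covers fetch task conversion, local mode, small-table/big-table join optimization, MapJoin parameter tuning, GROUP BY optimization, COUNT(DISTINCT) handling, avoiding Cartesian products, parallel execution, and resource allocation strategies. It also describes how to improve query performance by setting parameters and adjusting MapReduce tasks.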

Execution Plan (EXPLAIN)

Basic syntax

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query

Viewing the execution plan

A query that does not generate an MR job:

explain select * from emp;

Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            ListSink

A query that generates an MR job:

explain select deptno, avg(sal) avg_sal from emp group by deptno;

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: emp
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: sal (type: double), deptno (type: int)
              outputColumnNames: sal, deptno
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(sal), count(sal)
                keys: deptno (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: double), _col2 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), count(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), (_col1 / _col2) (type: double)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Viewing the detailed execution plan

explain extended select * from emp;

explain extended select deptno, avg(sal) avg_sal from emp group by deptno;

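Fetch Task Conversion

Fetch task conversion: for certain queries, Hive can return results without running MapReduce.

For example, the following query only needs to read the files in the storage directory of the employees table and print them:

SELECT * FROM employees;

In hive-default.xml.template, hive.fetch.task.conversion defaults to more; older Hive versions default to minimal.

With hive.fetch.task.conversion set to more, full-table selects, single-column selects, and LIMIT queries all skip MapReduce.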

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>

Set to none

With this setting, every query below runs MapReduce:

set hive.fetch.task.conversion=none;

select * from emp;

select ename from emp;

select ename from emp limit 3;

Set to more

With this setting, none of the queries below run MapReduce:

set hive.fetch.task.conversion=more;

select * from emp;

select ename from emp;

select ename from emp limit 3;

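Local Mode

When the input data of a Hive query is very small, local mode can process all tasks on a single machine. This avoids the situation where launching the distributed job takes longer than the job itself runs.

Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate.

Enable local MR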

set hive.exec.mode.local.auto=true;

Set the maximum input size for local MR. Local MR is used when the input size is below this value. Default: 134217728 (128 MB)

set hive.exec.mode.local.auto.inputbytes.max=50000000;

Set the maximum number of input files for local MR. Local MR is used when the number of input files is below this value. Default: 4

set hive.exec.mode.local.auto.input.files.max=10;
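The two thresholds above can be read as a simple decision rule. The following Python sketch is only an illustration of that rule, not Hive's actual code: the function name use_local_mode is hypothetical, and it assumes the strict "less than" comparison described above while ignoring any other conditions Hive may check.

```python
def use_local_mode(input_bytes, input_files,
                   max_bytes=134217728, max_files=4, auto=True):
    """Simplified check mirroring the two thresholds described in the text."""
    if not auto:
        # hive.exec.mode.local.auto=false: never switch to local mode
        return False
    # Both the input size and the file count must stay below their limits
    return input_bytes < max_bytes and input_files < max_files


# A tiny table in one file qualifies with the default limits
print(use_local_mode(7020, 1))
# 200 MB of input exceeds the default 128 MB limit
print(use_local_mode(200 * 1024 * 1024, 1))
# 8 files only qualify after raising the file limit to 10
print(use_local_mode(7020, 8, max_files=10))
```

Raising either limit makes more queries eligible for local mode, at the cost of running larger jobs on a single machine.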

Enable local mode and run the query

set hive.exec.mode.local.auto=true;

select * from emp cluster by deptno;

Disable local mode and run the query

set hive.exec.mode.local.auto=false;

select * from emp cluster by deptno;

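HQL Syntax Optimization

Small Table JOIN Big Table (MapJoin)

Putting the table with relatively scattered keys and less data on the left side of the join reduces the chance of out-of-memory errors. You can also use a map join to load small dimension tables (fewer than 1000 rows) into memory first and complete the join on the map side.

Recent Hive versions optimize both small-table JOIN big-table and big-table JOIN small-table, so it no longer matters whether the small table is on the left or the right.

Setting MapJoin parameters

Enable automatic MapJoin conversion. Default: true

set hive.auto.convert.join = true;

Threshold that separates small tables from big tables. Default: 25 MB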

set hive.mapjoin.smalltable.filesize = 25000000;

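How MapJoin works

Task A: a local task (executed locally on the client) scans the small table b, converts it into a hash table, writes it to a local file, and then loads that file into the DistributedCache.
Task B: an MR job without a reduce phase launches map tasks to scan the big table a. In the map phase, each record of a is joined against the hash table of b in the DistributedCache, and the result is output directly.
Because MapJoin has no reduce phase, the map tasks write the result files directly, so there are as many result files as map tasks.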
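The two tasks can be sketched in plain Python to show why no reduce phase is needed. This is a conceptual illustration only: the function names and the tag column are made up for the example, and a real MapJoin ships the hash table to every map task rather than keeping it in one process.

```python
def build_hash_table(small_rows):
    """Task A: index the small table by its join key."""
    table = {}
    for row in small_rows:
        table.setdefault(row["id"], []).append(row)
    return table


def map_side_join(big_rows, hash_table):
    """Task B: stream the big table and probe the in-memory hash table."""
    for row in big_rows:
        # Rows without a match are dropped, as in an inner join
        for match in hash_table.get(row["id"], []):
            yield {"id": row["id"], "keyword": row["keyword"], "tag": match["tag"]}


# Hypothetical sample rows; "tag" is an illustrative column
small = [{"id": 1, "tag": "a"}, {"id": 2, "tag": "b"}]
big = [
    {"id": 1, "keyword": "hive"},
    {"id": 3, "keyword": "spark"},
    {"id": 2, "keyword": "yarn"},
]

joined = list(map_side_join(big, build_hash_table(small)))
print(joined)
```

Each big-table row is resolved with a single dictionary lookup, so nothing has to be shuffled by key to a reducer.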

Creating the tables

Create the big table

create table bigtable(
id bigint, 
t bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
)
row format delimited fields terminated by '\t';

Create the small table

create table smalltable(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string
) 
row format delimited fields terminated by '\t';

Create the table that stores the join result

create table jointable(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string
)
row format delimited fields terminated by '\t';

Load the data

load data local inpath '/opt/module/hive/datas/bigtable' into table bigtable;

load data local inpath '/opt/module/hive/datas/smalltable' into table smalltable;

Join

Small table JOIN big table

insert overwrite table jointable
select
	b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from
	smalltable s
    join
		bigtable b
	on
		b.id = s.id;

Big table JOIN small table

insert overwrite table jointable
select 
	b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from
	bigtable b
	join smalltable s
    on
        s.id = b.id;

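Big Table JOIN Big Table

Filtering null keys

A join sometimes times out because certain keys (such as null keys) have too many rows. All rows with the same key are sent to the same reducer, which then runs out of memory.

Configure the history server (in mapred-site.xml)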

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>cpu101:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>cpu101:19888</value>
</property>

Start the history server

mr-jobhistory-daemon.sh start historyserver

http://cpu101:19888/jobhistory

Create the null-id table; the raw data table (bigtable) and the result table (jointable) were created above

create table nullidtable(
id bigint, 
t bigint,
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
)
row format delimited fields terminated by '\t';

Load the null-id data into its table

load data local inpath '/opt/module/hive-3.1.2/datas/nullid' into table nullidtable;

Test without filtering null ids

insert overwrite table jointable
select n.*
from nullidtable n 
	left join bigtable o 
	on
		n.id = o.id;

Test with null ids filtered out

insert overwrite table jointable
select n.*
from (
    select * 
    from 
    	nullidtable 
    where 
    	id is not null 
) n  
	left join bigtable o 
	on n.id = o.id;

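Converting null keys

Sometimes there are many rows with a null key, but they are not bad data and must appear in the join result. In that case, replace each null key with a random value so that those rows are spread evenly across different reducers.

Without randomly distributing null keys

Set 5 reducers

set mapreduce.job.reduces = 5;

JOIN the two tables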

insert overwrite table jointable
select 
	n.*
from
	nullidtable n
	left join bigtable b 
	on n.id = b.id;

When data is skewed, some reducers consume far more resources than the others.

Randomly distributing null keys

Set 5 reducers

set mapreduce.job.reduces = 5;

JOIN the two tables

insert overwrite table jointable
select
	n.*
from
    nullidtable n 
    full join bigtable o 
    on 
		nvl(n.id, rand()) = o.id;

There is no data skew, and resource consumption is balanced across the reducers.
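The effect of replacing null keys with random values can be simulated outside Hive. The Python sketch below is a rough model under stated assumptions: it partitions rows with hash(key) % reducers, sends null keys to reducer 0, and uses a made-up data set of 10000 null keys plus ids 1 to 1000. It does not reproduce Hive's real partitioner.

```python
import random
from collections import Counter


def partition(keys, num_reducers):
    """Count rows per reducer with hash partitioning; null keys go to reducer 0 here."""
    counts = Counter()
    for k in keys:
        reducer = 0 if k is None else hash(k) % num_reducers
        counts[reducer] += 1
    return [counts[i] for i in range(num_reducers)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical data: 10000 rows with a null key and 1000 rows with normal ids
keys = [None] * 10000 + list(range(1, 1001))

# Same idea as nvl(n.id, rand()): swap each null key for a random value
salted = [rng.random() if k is None else k for k in keys]

skewed_load = partition(keys, 5)
balanced_load = partition(salted, 5)
print("without random keys:", skewed_load)
print("with random keys:   ", balanced_load)
```

In the first run one reducer receives every null-key row, while in the second run those rows are spread over all five reducers.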

SMB (Sort Merge Bucket Join)

Create a second big table

create table bigtable2(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable2;

Test a direct JOIN of the two big tables

insert overwrite table jointable
select
	b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s
    join bigtable2 b
    on b.id = s.id;

Create bucketed table 1. The number of buckets should not exceed the number of available CPU cores.

create table bigtable_buck1(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string
)
clustered by(id)
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

insert into bigtable_buck1 select * from bigtable;

Create bucketed table 2. The number of buckets should not exceed the number of available CPU cores.

create table bigtable_buck2(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string
)
clustered by(id)
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

insert into bigtable_buck2 select * from bigtable;

Set the parameters

set hive.optimize.bucketmapjoin = true;

set hive.optimize.bucketmapjoin.sortedmerge = true;

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Test

insert overwrite table jointable
select 
	b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
    join bigtable_buck2 b
    on b.id = s.id;

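Group By

By default, all rows with the same key are sent from the map phase to a single reducer. When one key has too much data, the job becomes skewed.

Not every aggregation has to be done on the reduce side. Many aggregations can be partially computed on the map side first, with the final result produced on the reduce side.

Enabling map-side aggregation parameters

Whether to aggregate on the map side. Default: true

set hive.map.aggr = true;

Number of rows on which map-side aggregation is performed

set hive.groupby.mapaggr.checkinterval = 100000;

Balance the load when data is skewed. Default: false

set hive.groupby.skewindata = true;

When this parameter is true, the generated query plan contains two MR jobs.

In the first MR job, the map output is distributed randomly to the reducers. Each reducer performs a partial aggregation and outputs the result. As a consequence, rows with the same GROUP BY key may go to different reducers, which balances the load.

The second MR job distributes the pre-aggregated results to the reducers by GROUP BY key (which guarantees that rows with the same GROUP BY key reach the same reducer) and completes the final aggregation.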
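The two-job plan can be imitated with a small Python simulation. This is a sketch of the idea rather than Hive's implementation: the function two_stage_count, the fixed random seed, and the sample keys (one hot key plus three rare ones) are all assumptions made for the example.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers, seed=0):
    """Count rows per key in two stages, like the two MR jobs described in the text."""
    rng = random.Random(seed)

    # Job 1: rows are sent to a random reducer, and each reducer pre-aggregates
    partials = [Counter() for _ in range(num_reducers)]
    for k in keys:
        partials[rng.randrange(num_reducers)][k] += 1

    # Job 2: partial results are grouped by key and merged into final counts
    final = Counter()
    for partial in partials:
        final.update(partial)

    stage1_load = [sum(p.values()) for p in partials]
    return dict(final), stage1_load


# Hypothetical skewed input: one hot key and three rare keys
keys = ["hot"] * 9000 + ["a", "b", "c"] * 100
totals, stage1_load = two_stage_count(keys, 5)
print(totals)
print(stage1_load)
```

The first stage spreads the hot key across all reducers, and the second stage only has to merge a handful of partial counts per key.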

select deptno 
from emp 
group by deptno;

Optimized:

set hive.groupby.skewindata = true;

select deptno 
from emp 
group by deptno;

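Count(Distinct) Deduplicated Counting

COUNT DISTINCT is completed by a single reduce task. When that reducer has to process too much data, the whole job can fail.

COUNT DISTINCT is usually replaced with a GROUP BY followed by a COUNT. Watch out for the data skew that GROUP BY itself can cause.

Create a big table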

create table bigtable(
id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
row format delimited fields terminated by '\t';

Load the data

load data local inpath '/opt/module/datas/bigtable' into table bigtable;

Set the number of reducers to 5

set mapreduce.job.reduces = 5;

Run the distinct-id count

select count(distinct id) from bigtable;

Deduplicate ids with GROUP BY

select count(id) 
from (select id from bigtable group by id) a;

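Cartesian Product

Avoid Cartesian products whenever possible, because Hive can use only 1 reducer to compute a Cartesian product.

A Cartesian product occurs when:

a join has no ON condition
the ON condition is invalid

Row and Column Filtering

Column handling: in a SELECT, take only the columns you need. If the table is partitioned, use partition filters whenever possible, and avoid SELECT *.

Row handling: with partition pruning and outer joins, if the filter on the secondary table is written in the WHERE clause, the tables are fully joined first and filtered afterwards.

Test joining the two tables first and then filtering with WHERE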

select 
	o.id 
from bigtable b
	join bigtable o
	on o.id = b.id
where 
	o.id <= 10;

Filter in a subquery first, then join the tables

select 
	b.id 
from bigtable b
join (
    select id 
    from bigtable 
    where id <= 10 
) o 
	on b.id = o.id;

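Setting a Reasonable Number of Maps and Reduces

Normally, a job generates n map tasks from its input directory.

The determining factors are:

the total number of input files
the size of the input files
the block size configured on the cluster

More maps is not always better

When a job has many small files (far smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task. The time to start and initialize a map task is then far greater than the time spent on actual processing, which wastes resources.

Reducing the number of maps

Making each map process a block of close to 128 MB is not enough; the complexity of the work also matters.

For example, a 127 MB file is normally handled by one map. But if the file has only one or two small columns and tens of millions of records, and the map logic is fairly complex, a single map task takes a long time.

Increasing the number of maps

Increase the number of maps for complex files

When the input files are large, the task logic is complex, and the maps run slowly, consider increasing the number of maps so that each map processes less data, which improves the job's efficiency.

To increase the number of maps:

computeSplitSize = Math.max(minSize, Math.min(maxSize, blockSize)) = blockSize = 128 MB

Adjusting maxSize: when maxSize < blockSize, the number of maps increases.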
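The split-size formula is easy to try out with concrete numbers. The Python sketch below mirrors the max/min expression above; the helper estimate_splits and the 7020-byte file size are assumptions for illustration, and the estimate ignores details of the real input format such as the slack allowed on the last split or the combining of files.

```python
import math


def compute_split_size(min_size, max_size, block_size):
    """Same expression as the formula in the text: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def estimate_splits(file_size, split_size):
    """Rough number of splits (and therefore maps) for one file."""
    return math.ceil(file_size / split_size)


BLOCK = 128 * 1024 * 1024  # 128 MB block size
FILE_SIZE = 7020           # hypothetical small input file, in bytes

# With a very large maxSize, the split size equals the block size
default_split = compute_split_size(1, 2**63 - 1, BLOCK)
# With maxSize = 100 bytes, the split size drops to 100 bytes
small_split = compute_split_size(1, 100, BLOCK)

print(default_split, estimate_splits(FILE_SIZE, default_split))
print(small_split, estimate_splits(FILE_SIZE, small_split))
```

Lowering maxSize below the block size shrinks each split, so the same file is cut into many more pieces.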

Run the query

select count(*) from emp;

Set the maximum split size to 100 bytes

set mapreduce.input.fileinputformat.split.maxsize = 100;

select count(*) from emp;

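Merging small files

Merge small files before the maps run to reduce the number of maps. CombineHiveInputFormat can merge small files and is the system default format.

HiveInputFormat cannot merge small files.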

set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Settings for merging small files at the end of a MapReduce job:

Merge small files at the end of a map-only job. Default: true

SET hive.merge.mapfiles = true;

Merge small files at the end of a map-reduce job. Default: false

SET hive.merge.mapredfiles = true;

Size of the merged files. Default: 256 MB

SET hive.merge.size.per.task = 268435456;

When the average size of the output files is below this value, a separate map-reduce job is started to merge the files

SET hive.merge.smallfiles.avgsize = 16777216;
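The average-size rule can be expressed as a one-line check. The Python sketch below only illustrates that rule as described above; the function needs_merge and the sample file sizes are hypothetical, and the check assumes that merging is already enabled through the hive.merge.* switches.

```python
def needs_merge(output_file_sizes, avg_size_threshold=16777216):
    """Return True when the average output file size is below the threshold."""
    if not output_file_sizes:
        return False  # nothing was written, so there is nothing to merge
    average = sum(output_file_sizes) / len(output_file_sizes)
    return average < avg_size_threshold


MB = 1024 * 1024

# Three small files average 2 MB, which is below the 16 MB threshold
print(needs_merge([2 * MB, 3 * MB, 1 * MB]))
# Two large files average 190 MB, so no merge job is needed
print(needs_merge([200 * MB, 180 * MB]))
```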

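Setting a Reasonable Number of Reduces

Amount of data processed by each reducer. Default: 256 MB

hive.exec.reducers.bytes.per.reducer=256000000

Maximum number of reducers per job. Default: 1009

hive.exec.reducers.max=1009

Formula for the number of reducers

N = min(parameter 2, total input size / parameter 1)

Here parameter 1 is hive.exec.reducers.bytes.per.reducer and parameter 2 is hive.exec.reducers.max.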
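A quick calculation shows how the formula behaves for different input sizes. The Python sketch below uses the two defaults listed above; the function estimate_reducers is hypothetical, and rounding up and using at least one reducer are assumptions added so that the result is a usable whole number.

```python
import math


def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=256000000,
                      max_reducers=1009):
    """N = min(max_reducers, total input / bytes per reducer), rounded up, at least 1."""
    wanted = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


# About 1 GB of input needs a handful of reducers
print(estimate_reducers(1_000_000_000))
# 1 TB of input hits the cap set by hive.exec.reducers.max
print(estimate_reducers(10**12))
# A tiny input still gets one reducer
print(estimate_reducers(7020))
```

Lowering bytes_per_reducer raises the reducer count for the same input, up to the cap.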

Alternatively, change the value in Hadoop's mapred-default.xml file

Or set the number of reducers for each job

set mapreduce.job.reduces = 15;

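More reducers is not always better

Starting and initializing too many reducers also costs time and resources.
There are as many output files as reducers. If many small files are produced and then used as the input of the next job, that job suffers from the too-many-small-files problem.
When setting the number of reducers, use enough reducers for large data volumes, and keep the amount of data handled by each reducer appropriate.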

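Parallel Execution

Hive converts a query into n stages.

These stages include:

MapReduce stages
sampling stages
merge stages
limit stages
other stages

By default, Hive executes only one stage at a time.

A given job may contain many stages, and not all of them depend on each other. Stages that are independent can run in parallel, which shortens the total execution time of the job.

The more stages of a job run in parallel, the higher the cluster utilization.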
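The time saved by running independent stages together can be estimated from the stage dependencies. The Python sketch below compares sequential and parallel execution for a made-up plan: the stage names, durations, and dependencies are hypothetical, and it assumes there are always enough threads and cluster resources to start every ready stage immediately.

```python
from functools import lru_cache

# Hypothetical stages: name -> (duration in seconds, stages it depends on)
STAGES = {
    "stage1": (30, []),
    "stage2": (40, []),
    "stage3": (20, ["stage1", "stage2"]),
}


@lru_cache(maxsize=None)
def finish_time(stage):
    """Earliest finish time of a stage when independent stages run in parallel."""
    duration, deps = STAGES[stage]
    start = max((finish_time(d) for d in deps), default=0)
    return start + duration


# One stage at a time: the durations simply add up
sequential = sum(duration for duration, _ in STAGES.values())
# In parallel: the job ends when the slowest dependency chain ends
parallel = max(finish_time(s) for s in STAGES)
print(sequential, parallel)
```

Here stage1 and stage2 overlap, so the total time is set by the longest chain of dependent stages rather than by the sum of all stages.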

Enable parallel execution of tasks

set hive.exec.parallel = true;

Maximum degree of parallelism allowed for a single SQL statement. Default: 8

set hive.exec.parallel.thread.number = 16;

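Strict Mode

Strict mode prevents certain dangerous operations.

Partitioned tables queried without a partition filter

For a partitioned table, a query is not allowed to run unless the WHERE clause contains a filter on a partition column that limits the range. In other words, users are not allowed to scan all partitions.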

set hive.strict.checks.no.partition.filter = true;

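Reason: partitioned tables usually hold very large data sets that grow quickly. A query without a partition filter can consume enormous resources.

ORDER BY without LIMIT

Queries that use ORDER BY must also use LIMIT.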

set hive.strict.checks.orderby.no.limit = true;

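Reason: to perform the sort, ORDER BY sends all result rows to a single reducer. Forcing users to add a LIMIT keeps that reducer from receiving too much data.

Cartesian product

Restrict queries that produce a Cartesian product.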

set hive.strict.checks.cartesian.product = true;

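MySQL: when a JOIN query uses WHERE instead of ON, the optimizer converts the WHERE condition into an ON condition.

Hive: there is no such optimization, so the query causes problems if the tables are large enough.

JVM Reuse

Use it when there are many small files.

Hive on Spark

Example: a single server with 128 GB of memory and 32 threads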

Executor parameters

Number of CPU cores

Number of CPU cores available to each executor

spark.executor.cores

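This value should not be too large, because Hive stores its data on HDFS, and HDFS sometimes handles highly concurrent writes poorly, which can easily lead to race conditions.

In practice, a value between 3 and 6 is reasonable.

Suppose:

A single server node has 32 CPU cores available. To leave headroom for basic system services and components such as HDFS, the YARN NodeManager parameter yarn.nodemanager.resource.cpu-vcores is usually set to 28, which means YARN can use 28 of the cores.

In this case, setting spark.executor.cores to 4 is the best fit: exactly 7 executors can be allocated with no cores wasted.

Suppose:

yarn.nodemanager.resource.cpu-vcores is 26. Then setting spark.executor.cores to 5 is the best fit, leaving only 1 core unused.

Because each executor runs in one YARN container, spark.executor.cores must also not exceed the maximum number of cores a single container can request, that is, the value of yarn.scheduler.maximum-allocation-vcores.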
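The choice of 4 cores for 28 vcores and 5 cores for 26 vcores follows from minimizing the leftover cores within the 3 to 6 range. The Python sketch below reproduces that reasoning; the function pick_executor_cores is hypothetical, and breaking ties in favor of fewer cores per executor is an assumption, not a rule from the text.

```python
def pick_executor_cores(node_vcores, candidates=range(3, 7)):
    """Pick the cores-per-executor value in 3..6 that wastes the fewest vcores."""
    # Sort key: leftover cores first, then the smaller core count on ties
    best = min(candidates, key=lambda c: (node_vcores % c, c))
    executors = node_vcores // best
    leftover = node_vcores % best
    return best, executors, leftover


# 28 vcores: returns (cores per executor, executors per node, leftover cores)
print(pick_executor_cores(28))
# 26 vcores
print(pick_executor_cores(26))
```

The result still has to be checked against yarn.scheduler.maximum-allocation-vcores, which this sketch does not model.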

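On-heap and off-heap memory

On-heap memory available to each executor

spark.executor.memory

Off-heap memory available to each executor

spark.yarn.executor.memoryOverhead

The larger the on-heap memory, the more data an executor can cache, so operations such as map joins run faster. However, it also makes GC more costly.

The default value of spark.yarn.executor.memoryOverhead is executorMemory * 0.10, with a minimum of 384 MB per executor.

The Hive documentation gives a rule of thumb for the total memory of an executor:

yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)

In other words, memory is allocated in proportion to the number of cores.

Of the total memory calculated this way, 80% to 85% goes to on-heap memory and the rest to off-heap memory.

Suppose a single node in the cluster has 128 GB of physical memory and yarn.nodemanager.resource.memory-mb (the amount of host memory a single NodeManager can use) is set to 100 GB. Each executor then gets about 100 * (4 / 28) ≈ 14 GB.

With an 8 : 2 split, spark.executor.memory ends up at about 11.2 GB and spark.yarn.executor.memoryOverhead at about 2.8 GB.

With this configuration, each host can run up to 7 executors at a time.

Each executor can run up to 4 tasks (one per core), so each task has 3.5 GB (14 / 4) of memory on average. All tasks running in an executor share the same heap space.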

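set spark.executor.memory=11468m;

set spark.yarn.executor.memoryOverhead=2867m;

The values are written in MB (11.2 GB and 2.8 GB, rounded down) because Spark size strings do not accept fractional values such as 11.2g.

Likewise, the sum of these two memory parameters must not exceed the maximum memory a single container can request, that is, the value of yarn.scheduler.maximum-allocation-mb.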
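The memory arithmetic can be checked with a few lines of Python. This sketch applies the proportional formula and an 80 / 20 split; the function executor_memory_gb is hypothetical, and because it does not round the total down to 14 GB first, its figures come out slightly higher than the rounded 11.2 GB and 2.8 GB used in the text.

```python
def executor_memory_gb(node_memory_gb, executor_cores, node_vcores, heap_fraction=0.8):
    """Split a node's YARN memory across executors in proportion to their cores."""
    total = node_memory_gb * executor_cores / node_vcores
    heap = total * heap_fraction  # candidate for spark.executor.memory
    overhead = total - heap       # candidate for the memory overhead setting
    return total, heap, overhead


# 100 GB usable by the NodeManager, 4 cores per executor, 28 vcores per node
total, heap, overhead = executor_memory_gb(100, 4, 28)
print(f"total={total:.2f} GB heap={heap:.2f} GB overhead={overhead:.2f} GB")
print(f"per task={total / 4:.2f} GB")
```

Whatever values are chosen, heap plus overhead must stay within yarn.scheduler.maximum-allocation-mb.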

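Executor instances

The number of executor instances started for a query depends on the resource allocation of each node and on the number of nodes in the cluster.

spark.executor.instances

Suppose there are 10 nodes, each with 32 cores and 128 GB of memory, configured as above (7 executors per node). In theory, spark.executor.instances can be set to 70 to make maximum use of cluster resources.

In practice it is usually set somewhat lower (about half of the theoretical value is recommended, for example 40), because the driver also consumes resources and a YARN cluster often has to host other workloads as well.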

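Dynamic allocation

spark.dynamicAllocation.enabled

Allocating a fixed number of executors can be inflexible, especially when a Hive cluster serves analytics for many users.

It is therefore recommended to set spark.dynamicAllocation.enabled to true to enable dynamic executor allocation.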

Sample configuration for reference

set hive.execution.engine=spark;
set spark.executor.memory=11468m;
set spark.yarn.executor.memoryOverhead=2867m;
set spark.executor.cores=4;
set spark.executor.instances=40;
set spark.dynamicAllocation.enabled=true;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;