Hive window functions: over()

七年·

Last modified 2024-04-24 15:02:51

License: CC 4.0 BY-SA

Category: hive. Tags: big data, sql, hive

First published 2020-07-10 12:39:49

Article link: https://blog.csdn.net/qq_28603127/article/details/107234295

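This article takes a close look at SQL window functions: the basic use of OVER(), combining PARTITION BY with ORDER BY, and analytic functions such as ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), and NTILE(). The examples show how to apply these functions to more complex data analysis and processing.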

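The window clause over() exposes the result of an aggregate function as a column on every row, which replaces queries that would otherwise have to join back to an aggregated result.
Data: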

CREATE TABLE `test`(
  `id` int, 
  `class` string, 
  `name` string, 
  `score` int);
  
insert into test values
(1,'a','zhang',90),
(2,'a','li',80),
(3,'a','zhao',95),
(4,'a','qian',85),
(5,'a','sun',60),
(6,'a','zhou',70),
(7,'b','wu',75),
(8,'b','zheng',50),
(9,'b','wang',65),
(10,'b','luo',100);

1 over()

To show each person's score next to the total score in a single result, using a join:

select a.name,a.score,b.total_score from test a join 
(select sum(score)as total_score from test )b;

A window function makes this simpler:

select name,score,sum(score)over() as total_score from test;

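sum(score)over() as total_score computes sum(score) over all rows and writes that value into the total_score column of every row.
This is not limited to sum(); other aggregate functions such as min, max, count, and avg work the same way.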
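As a minimal sketch of the point about other aggregates, the snippet below runs them with an empty over() on the article's test data. It uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made only so the example can be run locally; the alias names are illustrative.

```python
import sqlite3

# SQLite (3.25+) stands in for Hive here; this is an assumption for local testing.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (1, "a", "zhang", 90), (2, "a", "li", 80), (3, "a", "zhao", 95),
    (4, "a", "qian", 85), (5, "a", "sun", 60), (6, "a", "zhou", 70),
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
])

# Every aggregate is computed over the whole table and repeated on each row.
rows = conn.execute("""
    select name, score,
           min(score) over () as min_score,
           max(score) over () as max_score,
           count(*)   over () as cnt,
           avg(score) over () as avg_score
    from test
    order by id
""").fetchall()

for row in rows:
    print(row)
```

Every row carries the same four aggregate values (50, 100, 10, 77.0), because an empty over() treats the whole table as one window.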

2 over(partition by column order by column)

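The previous query treated all students as one group. To show every student's score while keeping the total score separate for each class:
Using a join: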

select a.class,a.name,a.score,b.total_score from test a join 
(select class,sum(score)as total_score from test group by class )b 
on a.class = b.class;

Using a window function:

select class,name,score,sum(score)over(partition by class) as total_score from test;

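over(partition by class) partitions the rows by class and aggregates within each partition: sum(score)=480 for class a and sum(score)=290 for class b, and total_score shows the sum(score) of each row's own class.
order by inside over sorts by a column just like order by in ordinary SQL, but it also sets a frame range, and that changes the value of the aggregate function in front of it.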

select class,name,score,sum(score)over(partition by class order by score) as total_score from test;

After adding order by inside over, total_score changes as well.
Window clause

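Row-based:
(Each partition is treated as the full set of rows; the starting row is the first row of the partition and the ending row is the last row of the partition.)
 From the starting row to the current row:
 rows between unbounded preceding and current row
 From the row before the current row to the current row:
 rows between 1 preceding and current row
 From 1 row before the current row to 1 row after it:
 rows between 1 preceding and 1 following
 From the current row to 1 row after it:
 rows between current row and 1 following
 From 1 row after the current row to the last row:
 rows between 1 following and unbounded following
 From the starting row to the last row:
 rows between unbounded preceding and unbounded following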
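The row-based frames above can be sketched on class b, whose rows sorted by score are zheng 50, wang 65, wu 75, luo 100. The snippet uses Python's sqlite3 as a stand-in for Hive (an assumption for local testing), and the aliases running, neighbors, and rest are illustrative names.

```python
import sqlite3

# SQLite (3.25+) stands in for Hive here; this is an assumption for local testing.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (1, "a", "zhang", 90), (2, "a", "li", 80), (3, "a", "zhao", 95),
    (4, "a", "qian", 85), (5, "a", "sun", 60), (6, "a", "zhou", 70),
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
])

rows = conn.execute("""
    select name, score,
           -- first row of the partition up to the current row
           sum(score) over (partition by class order by score
                            rows between unbounded preceding and current row) as running,
           -- previous row, current row, next row
           sum(score) over (partition by class order by score
                            rows between 1 preceding and 1 following) as neighbors,
           -- everything after the current row
           sum(score) over (partition by class order by score
                            rows between 1 following and unbounded following) as rest
    from test
    where class = 'b'
    order by score
""").fetchall()

for row in rows:
    print(row)
```

For the last row (luo) the frame "1 following and unbounded following" is empty, so sum returns NULL, which Python shows as None.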

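Value-based:
Each partition is again treated as the full set of rows, but rows are selected by value.
Value-based frames depend on whether order by is asc or desc. "current row" always stands for the current value x.
With asc (smallest to largest), "unbounded preceding" means negative infinity, "1 preceding" means x-1, "1 following" means x+1, and "unbounded following" means positive infinity.
With desc (largest to smallest), "unbounded preceding" means positive infinity, "1 preceding" means x+1, "1 following" means x-1, and "unbounded following" means negative infinity.
How to read this: "unbounded preceding" is the start, which is negative infinity in ascending order and positive infinity in descending order. "1 preceding" moves backward from the current row: in ascending order the earlier values are smaller, so it means x-1; in descending order the earlier values are larger, so it means x+1, and so on.

 asc: >= negative infinity and <= x   desc: <= positive infinity and >= x
 range between unbounded preceding and current row
 asc: >= x-1 and <= x   desc: >= x and <= x+1
 range between 1 preceding and current row
 asc: >= x-1 and <= x+1   desc: <= x+1 and >= x-1
 range between 1 preceding and 1 following
 asc: >= x and <= x+1   desc: <= x and >= x-1
 range between current row and 1 following
 asc: >= x+1   desc: <= x-1
 range between 1 following and unbounded following
 asc and desc: all values
 range between unbounded preceding and unbounded following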
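To make the asc/desc rules concrete, here is a small sketch on class a. The scores in the test data are at least 5 apart, so an offset of 1 would only ever match the current value; the example therefore uses an offset of 10, which is a substitution made for illustration. It runs on Python's sqlite3 as a stand-in for Hive (an assumption; numeric range offsets need SQLite 3.28 or later).

```python
import sqlite3

# SQLite (3.28+ for numeric range offsets) stands in for Hive; an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (1, "a", "zhang", 90), (2, "a", "li", 80), (3, "a", "zhao", 95),
    (4, "a", "qian", 85), (5, "a", "sun", 60), (6, "a", "zhou", 70),
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
])

rows = conn.execute("""
    select name, score,
           -- asc: "10 preceding" is x-10, so the frame covers scores in [x-10, x]
           sum(score) over (partition by class order by score asc
                            range between 10 preceding and current row) as asc_sum,
           -- desc: "10 preceding" is x+10, so the frame covers scores in [x, x+10]
           sum(score) over (partition by class order by score desc
                            range between 10 preceding and current row) as desc_sum
    from test
    where class = 'a'
    order by score
""").fetchall()

for row in rows:
    print(row)
```

For qian (x=85) the ascending frame holds 80 and 85 (sum 165), while the descending frame holds 85, 90, and 95 (sum 270), matching the x-1 versus x+1 rule above with 10 in place of 1.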

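When over has no order by, the default frame is row-based and runs from the first row of the partition to its last row (rows between unbounded preceding and unbounded following).
When over has an order by but no explicit window clause, the default frame is value-based and runs from the partition's starting value (positive infinity for desc, negative infinity for asc) to the current value (range between unbounded preceding and current row).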

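The total_score produced by the order by score query above would also match rows between unbounded preceding and current row, because no two scores in the same class are equal. To be clear, the frame actually applied is range between unbounded preceding and current row.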
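The difference between the two frames only becomes visible when scores tie. The sketch below adds a hypothetical extra row (11, 'b', 'he', 100), which is not part of the original data, so that class b has two rows with 100. It runs on Python's sqlite3 as a stand-in for Hive (an assumption for local testing).

```python
import sqlite3

# SQLite (3.25+) stands in for Hive here; this is an assumption for local testing.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
    (11, "b", "he", 100),  # hypothetical row, added only to create a tie
])

rows = conn.execute("""
    select name, score,
           -- no window clause: the default frame applies
           sum(score) over (partition by class order by score) as default_frame,
           sum(score) over (partition by class order by score
                            range between unbounded preceding and current row) as range_frame,
           sum(score) over (partition by class order by score
                            rows between unbounded preceding and current row) as rows_frame
    from test
    order by score, id
""").fetchall()

for row in rows:
    print(row)
```

default_frame and range_frame agree on every row, and both tied rows get 390 because a value-based frame includes all rows with the current value. With the row-based frame the tied rows get 290 and 390, and which of the two receives which value is not determined by the query.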

To sort by score and still show the sum for the whole partition:
Option 1:

select class,name,score,sum(score)over(partition by class) as total_score
from test order by class,score desc;

Option 2:

select class,name,score,sum(score)over(
partition by class order by score desc rows between unbounded preceding and unbounded following
) as total_score from test ;

Some analytic functions

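ntile() -- does not support a window clause

ntile(n)over(partition by x1 order by x2) as num
Partition by x1 and sort by x2, then split each partition into n buckets; num holds the number of the bucket the current row falls into.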

select class,name,score,ntile(2)over(partition by class order by score ) as num from test;
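As a quick sketch of the bucket numbering, the snippet below applies ntile(2) and ntile(3) to the six rows of class a, which divide evenly in both cases. It uses Python's sqlite3 as a stand-in for Hive (an assumption for local testing); the aliases halves and thirds are illustrative.

```python
import sqlite3

# SQLite (3.25+) stands in for Hive here; this is an assumption for local testing.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (1, "a", "zhang", 90), (2, "a", "li", 80), (3, "a", "zhao", 95),
    (4, "a", "qian", 85), (5, "a", "sun", 60), (6, "a", "zhou", 70),
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
])

rows = conn.execute("""
    select name, score,
           ntile(2) over (partition by class order by score) as halves,
           ntile(3) over (partition by class order by score) as thirds
    from test
    where class = 'a'
    order by score
""").fetchall()

for row in rows:
    print(row)
```

The three lowest scores land in bucket 1 of halves and the three highest in bucket 2; thirds assigns two rows to each of its three buckets.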

row_number()/rank()/dense_rank() -- do not support a window clause

row_number()over(partition by x1 order by x2) as num
Ranking functions: partition by x1, sort by x2, and write each row's position within its partition into num.

select class,name,score,row_number()over(partition by class order by score desc) as num from test;

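Note: I added one more record with score=100 to class b above, so class b now has two rows with 100; row_number() still gives rows with equal scores different ranks.
rank()over and dense_rank()over order rows just like row_number(), but rows with equal values share the same rank. The difference between rank() and dense_rank() is whether the ranks after a tie are skipped: rank() leaves a gap, dense_rank() does not.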

select class,name,score,
row_number()over(partition by class order by score desc) as num,
rank()over(partition by class order by score desc) as num1,
dense_rank()over(partition by class order by score desc) as num2
from test;
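The three ranking functions can be compared side by side once class b contains a tie. The extra row's id and name are not given in the article, so (11, 'b', 'he', 100) below is a hypothetical stand-in. The snippet runs on Python's sqlite3 rather than Hive, which is an assumption for local testing.

```python
import sqlite3

# SQLite (3.25+) stands in for Hive here; this is an assumption for local testing.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
    (11, "b", "he", 100),  # hypothetical row standing in for the added score=100 record
])

rows = conn.execute("""
    select name, score,
           row_number() over (partition by class order by score desc) as num,
           rank()       over (partition by class order by score desc) as num1,
           dense_rank() over (partition by class order by score desc) as num2
    from test
    order by score desc, num
""").fetchall()

for row in rows:
    print(row)
```

The two rows with 100 get row_number 1 and 2 (which one is first is not determined by the query), rank 1 and 1, and dense_rank 1 and 1. The next row, wu with 75, gets rank 3 but dense_rank 2.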

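lag()/lead() -- do not support a window clause

lag(col,n,default): within the partition, returns the value of col from the nth row before the current row; if no such row exists, returns default.
lead(col,n,default): within the partition, returns the value of col from the nth row after the current row; if no such row exists, returns default.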

select class,name,score,
lag(name,1,'wu')over(partition by class order by score desc) as num,
lead(name,1,'wu')over(partition by class order by score desc) as num1 from test;
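The query above uses 'wu' as the default, which is also a real name in class b, so a default is hard to tell apart from an actual neighbor. The sketch below repeats the idea with 'none' as the default (an illustrative choice) on Python's sqlite3, used here as a stand-in for Hive under the assumption that lag and lead behave the same for this case.

```python
import sqlite3

# SQLite (3.25+) stands in for Hive here; this is an assumption for local testing.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (1, "a", "zhang", 90), (2, "a", "li", 80), (3, "a", "zhao", 95),
    (4, "a", "qian", 85), (5, "a", "sun", 60), (6, "a", "zhou", 70),
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
])

rows = conn.execute("""
    select name, score,
           -- name of the student ranked just above (previous row in desc order)
           lag(name, 1, 'none')  over (partition by class order by score desc) as prev_name,
           -- name of the student ranked just below (next row in desc order)
           lead(name, 1, 'none') over (partition by class order by score desc) as next_name
    from test
    where class = 'b'
    order by score desc
""").fetchall()

for row in rows:
    print(row)
```

Only the first row has no predecessor and only the last row has no successor, so 'none' appears exactly once in each column.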

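first_value()/last_value() -- support a window clause

first_value(col): within the partition and in the current sort order, returns the first value of col in the frame, which by default runs from the first row to the current row.
last_value(col): within the partition and in the current sort order, returns the last value of col in the frame. With the default frame (first row to current row) that is the current row's own value, so the query below sets the frame to the whole partition.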

select class,name,score,
first_value(name)over(partition by class order by score desc) as num,
last_value(name)over(partition by class order by score desc rows between unbounded preceding  
and unbounded following) as num1 from test;
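To show why the explicit frame matters for last_value, the sketch below puts the default frame and the full-partition frame side by side on class b. It uses Python's sqlite3 as a stand-in for Hive (an assumption for local testing); the alias names are illustrative.

```python
import sqlite3

# SQLite (3.25+) stands in for Hive here; this is an assumption for local testing.
conn = sqlite3.connect(":memory:")
conn.execute("create table test (id int, class text, name text, score int)")
conn.executemany("insert into test values (?, ?, ?, ?)", [
    (1, "a", "zhang", 90), (2, "a", "li", 80), (3, "a", "zhao", 95),
    (4, "a", "qian", 85), (5, "a", "sun", 60), (6, "a", "zhou", 70),
    (7, "b", "wu", 75), (8, "b", "zheng", 50), (9, "b", "wang", 65),
    (10, "b", "luo", 100),
])

rows = conn.execute("""
    select name, score,
           first_value(name) over (partition by class order by score desc) as top_name,
           -- default frame ends at the current row, so this returns the current name
           last_value(name)  over (partition by class order by score desc) as last_default,
           -- full-partition frame returns the name with the lowest score
           last_value(name)  over (partition by class order by score desc
                                   rows between unbounded preceding
                                   and unbounded following) as last_full
    from test
    where class = 'b'
    order by score desc
""").fetchall()

for row in rows:
    print(row)
```

last_default simply echoes each row's own name, while last_full returns zheng, the lowest scorer in class b, on every row.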
