Hive Get the Max/Min Value Rows

Most of time, we need to find the max or min value of particular columns as well as other columns. For example, we have following employee table.

1
2
3
4
5
6
7
8
9
10
11
> SELECT name,sex_age.sex AS sex,sex_age.age AS age FROM employee;
+----------+---------+------+

| name | sex | age |
+----------+---------+------+

| Michael | Male | 30 |
| Will | Male | 35 |
| Shelley | Female | 27 |
| Lucy | Female | 57 |
| Steven | Male | 30 |
+----------+---------+------+

5 rows selected (75.887 seconds)

We want to know Who is oldest of males or females? There are three solutions available.

Solution 1

The most frequent way of doing it is to to firstly find the MAX of age in each SEX group and do SELF JOIN by matching SEX and the MAX age as follows. This will create two stages of jobs and NOT efficient.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
> SELECT employee.sex_age.sex, employee.sex_age.age, name 
> FROM
> employee JOIN
> (
> SELECT
> max(sex_age.age) as max_age, sex_age.sex as sex
> FROM employee
> GROUP BY sex_age.sex

> ) maxage
> ON employee.sex_age.age = maxage.max_age
> AND employee.sex_age.sex = maxage.sex;
+--------------+------+-------+

| sex_age.sex | age | name |
+--------------+------+-------+

| Female | 57 | Lucy |
| Male | 35 | Will |
+--------------+------+-------+

2 rows selected (94.043 seconds)

Solution 2

Once Hive 0.11.0 introduced analytics functions, we can use ROW_NUMBER to solve the problem as well, but only trigger one MapReduce job.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> SELECT sex, age, name
> FROM
> (
> SELECT sex_age.sex AS sex,
> ROW_NUMBER() OVER (PARTITION BY sex_age.sex ORDER BY sex_age.age DESC) AS row_num,
> sex_age.age as age,name

> FROM employee
> ) t WHERE row_num = 1;
+---------+------+-------+

| sex | age | name |
+---------+------+-------+

| Female | 57 | Lucy |
| Male | 35 | Will |
+---------+------+-------+

2 rows selected (61.655 seconds)

Solution 3

Actually, there is a better way of doing it as follows through MAX/MIN STRUCT function added by Hive-1128 since Hive 0.6.0, although it is not documented anywhere in the Hive Wiki.

1
2
3
4
5
6
7
8
9
10
11
12
> SELECT sex_age.sex, 
> max(struct(sex_age.age, name)).col1 as age,

> max(struct(sex_age.age, name)).col2 as name
> FROM employee
> GROUP BY sex_age.sex;

+--------------+------+-------+
| sex_age.sex | age | name |
+--------------+------+-------+

| Female | 57 | Lucy |
| Male | 35 | Will |
+--------------+------+-------+

2 rows selected (47.659 seconds)

The above job only trigger one MapReduce job. We still need to use the Group By clause. However, we can use MAX/MIN STRUCT function to show all other columns in the same line of MAX/MIN value. By default, Group By clause does not allow columns shown in the SELECT list if it is not Group By column.

Summary

The solution 3 is better in terms of performance, query complexity, and version supports.