Most of time, we need to find the max or min value of particular columns as well as other columns. For example, we have following employee table.
> SELECT name,sex_age.sex AS sex,sex_age.age AS age FROM employee;
We want to know Who is oldest of males or females? There are three solutions available.
The most frequent way of doing it is to to firstly find the MAX of age in each SEX group and do SELF JOIN by matching SEX and the MAX age as follows. This will create two stages of jobs and NOT efficient.
> SELECT employee.sex_age.sex, employee.sex_age.age, name
Once Hive 0.11.0 introduced analytics functions, we can use ROW_NUMBER to solve the problem as well, but only trigger one MapReduce job.
> SELECT sex, age, name
Actually, there is a better way of doing it as follows through MAX/MIN STRUCT function added by Hive-1128 since Hive 0.6.0, although it is not documented anywhere in the Hive Wiki.
> SELECT sex_age.sex,
The above job only trigger one MapReduce job. We still need to use the Group By clause. However, we can use MAX/MIN STRUCT function to show all other columns in the same line of MAX/MIN value. By default, Group By clause does not allow columns shown in the SELECT list if it is not Group By column.
The solution 3 is better in terms of performance, query complexity, and version supports.