Hive Sorting and Ordering

There are following key words used in Hive to sort data with following difference:

  • ORDER BY (ASC|DESC) : This is similar to the traditional SQL operator. Sorted order is maintained across all of the output from every reducer. It performs the global sort using only one reducer, so it takes long time to return result. The Usage with LIMIT is strongly recommended for order by. When hive.mapred.mode = strict and you do not specify “limit”, there are error out.

  • SORT BY (ASC|DESC) : This dictates which columns to sort by when ordering reducer input records. That means it complete sort before data sending to reducer. “Sort by” does not perform global sort and only make sure locally sorted in each reducer unless you set mapred.reduce.tasks=1. In this case, it is equal to result of “order by”

  • DISTRIBUTE BY : Rows with matching column values will partition to the same reducer. When used alone, it does not guarantee sorted input to the reducer.

  • CLUSTER BY: This is a shorthand operator to perform = SORT BY + DISTRIBUTE BY operations on a group of columns. And, it is sorted locally in reducer.