Hive vs. Pig


Both are top-level Apache projects for processing data in Hadoop. Here, I try to compare their differences. Below is a picture I found (I cannot find the original link, but there is a mirror here).

In addition, I want to share some more differences. To be honest, I prefer Hive. MapReduce, Hive, and Pig are like Java, Ruby, and Python.

Pig

  • Pig is easier to install than Hive. Of course, it is mostly used through shell interaction.
  • ETL jobs are easier in Pig. For example, you do not need to write separate “Create table” scripts to hold data. Piggybank and Facebook's library bring more functions.
  • The “illustrate” function is very useful: it shows sample data for each step and each scenario. Hive does not support this.
  • Pig’s order by supports a global sort across multiple reducers. Hive’s does not; it uses only one reducer.
  • Pig’s skew join handles skewed data easily by launching two MapReduce jobs (one for sampling, the other for the reduce-side join). To enable it, add using ‘skewed’ at the end of the join statement. Hive does not support this yet, but it is being worked on.
  • Pig supports Avro. Hive does not yet, but support is coming soon (see here).
  • pig -x local performs better for unit tests, since it only runs a LocalJobRunner (local MapReduce).
  • Pig’s user group: Yahoo! (90% of its MapReduce jobs are done in Pig), Twitter (80%), LinkedIn, Salesforce, Nokia, AOL
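
To make the multi-reducer global sort mentioned above more concrete, here is a minimal Python sketch of the general idea: sample the keys, pick range boundaries from the sample, send each record to the "reducer" that owns its range, and let each reducer sort only its own share. This is an illustration of the technique, not Pig's actual code; the function names, the default reducer count, and the sample size are all made up for the example.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 range boundaries from a sample of the keys."""
    ordered = sorted(sample)
    # Integer arithmetic keeps every index inside the sample.
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def range_partition(records, split_points):
    """Route each record to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        partitions[bisect.bisect_right(split_points, record)].append(record)
    return partitions


def global_sort(records, num_reducers=4, sample_size=100, seed=0):
    """Sort records by simulating a sampling pass plus a range-partitioned sort."""
    if not records:
        return []
    # Pass 1 (hypothetical sampling job): look at a small random sample.
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_reducers)
    # Pass 2: each simulated reducer sorts only its own key range.
    sorted_parts = [sorted(p) for p in range_partition(records, split_points)]
    # Ranges do not overlap, so concatenating in reducer order is globally sorted.
    return [record for part in sorted_parts for record in part]
```

Because the key ranges do not overlap, no final merge on a single reducer is needed, which is what lets the sort scale beyond one reducer. A poor sample only makes the partitions uneven; the output is still correctly sorted.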

Hive

  • It is SQL-like. For SQL developers, there is almost no learning curve.
  • It is supported by Hue, which makes it more popular.
  • Hive provides more powerful statistics functions.
  • Hive can integrate with HBase, so you can query HBase data through it. Pig cannot, but it can use HBaseStorage() to load data from HBase.
  • Hive’s user group: Facebook, CNET, Digg, Last.fm