Build Big Data Warehouse With Apache Hive

The presentation above is the fourth topic I have covered in my series of talks, Ten Tools for Ten Big Data Areas. Apache Hive is one of the earliest SQL-on-Hadoop approaches. Although its legacy design is based on MapReduce, Hive evolves fast and keeps improving, with sub-second queries on top of Hadoop on its future roadmap.

The following are areas where Hive could consider evolving in the future.

  1. Dynamic Schema - Hive’s metastore presents schemaless data through the traditional schema view familiar from an RDBMS. This is a good approach for easing the transition from a legacy database. However, there is a growing demand for creating the schema dynamically, meaning we do not have to define a schema before accessing the data, especially semi-structured and structured data. This is very useful for ad hoc analysis, and other tools already work this way, such as Pig, Spark DataFrame, and Apache Drill.

  2. Smart Engine - We define a smart engine by two qualities: variety and transparency. For variety, Hive already supports different computing engines, such as MapReduce, Tez, and Spark. For transparency, we expect switching between engines to become dynamic or even transparent. Each engine has its pros and cons. If we could dynamically specify which engine to use by adding specific SQL keywords or hints, it would be an awesome feature for Apache Hive. As a result, we could use a single framework (Hive) for different use cases. Even smarter, the framework could pick the best engine for each query on the fly.

  3. Multi-source Support - This is where we expect Hive to become a unified SQL layer over a variety of data sources, so that we can easily blend data across different types of sources.

  4. Others - Standard SQL:2011 support, stored-procedure-like UDFs, Live Long and Process (LLAP), leveraging Hadoop caching features, metastore performance improvements, and advanced transaction support.
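
To make the hint idea from item 2 a bit more concrete, here is a minimal Python sketch of how a client-side wrapper might behave. It assumes a hypothetical `/*+ ENGINE(...) */` hint, which Hive itself does not define, and translates it into the `hive.execution.engine` setting that Hive does provide. The function name `plan_statements` and the default engine are my own choices for illustration, not part of any Hive API.

```python
import re

# Values accepted by Hive's `hive.execution.engine` property.
SUPPORTED_ENGINES = {"mr", "tez", "spark"}

# Hypothetical hint syntax, e.g. /*+ ENGINE(tez) */ (an assumption, not a Hive feature).
# Trailing whitespace is consumed so the remaining SQL keeps single spacing.
ENGINE_HINT = re.compile(r"/\*\+\s*ENGINE\((\w+)\)\s*\*/\s*", re.IGNORECASE)


def plan_statements(query, default_engine="mr"):
    """Return the statements to send to Hive for one (possibly hinted) query."""
    match = ENGINE_HINT.search(query)
    engine = match.group(1).lower() if match else default_engine
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    # Strip the hypothetical hint so Hive only sees plain SQL.
    sql = ENGINE_HINT.sub("", query).strip()
    # Select the engine for the session, then run the query.
    return [f"SET hive.execution.engine={engine}", sql]


if __name__ == "__main__":
    for statement in plan_statements("SELECT /*+ ENGINE(tez) */ COUNT(*) FROM logs"):
        print(statement)
```

The sketch only rewrites text, so it would also strip a hint that happens to appear inside a string literal. A truly smart engine, as described above, would make this choice inside Hive's planner rather than in a wrapper.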