Apache HAWQ is Landing on HDP

News

Last week, Hortonworks announced an expansion of their strategic relationship with Pivotal. This will bring together Hortonworks’ expertise and support for data management and processing with Pivotal’s top analytics engine for Apache® Hadoop™. The full announcement is here.

As I understand it, Pivotal is essentially giving up its own Pivotal HD distribution and will mainly resell the Hortonworks Data Platform (HDP) instead. In return, Hortonworks will support Pivotal’s leading product, Apache HAWQ (open sourced last year), and include it in a future version of HDP around Q2 of 2016. This is obviously a great deal for Hortonworks, converting a competitor into a friend among open source big data service providers. In addition, HDP finally gets a chance to include an MPP-style SQL-on-Hadoop solution in its distribution, allowing it to compete well with Cloudera Impala.

Nowadays, Apache Hive together with Apache Tez is still the leading tool for batch-oriented jobs, and considering their maturity and stability, this is not likely to change in the short term. On the other hand, data analytics increasingly calls for near-real-time, more flexible, and better-performing access patterns rather than extremely high reliability (that is to say, less frequent failures are tolerable in ad-hoc use cases). As a result, Apache Impala + Apache Kudu, which are mainly contributed and distributed by Cloudera, are catching more and more people’s attention. However, you are not likely to see them shipped in the HDP distribution because of the underlying competition. The embrace of Apache HAWQ will change the game for Hortonworks.

History - MPP vs. Batch

MPP solutions are built on a shared-nothing architecture. Each MPP executor has its own CPU, memory, and disk, and these resources are dedicated to the executor at run time. When data needs to move between executors, there is a managed exchange through the network, usually called synchronization, which shuffles the data across the cluster to perform joins and aggregations.
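To make that shuffle step concrete, here is a minimal sketch in plain Python (not any real engine’s API; the executor count, tables, and routing function are made up for illustration) of hash-partitioning rows by join key so that matching keys land on the same executor, which can then join its partitions locally:

```python
from collections import defaultdict

NUM_EXECUTORS = 4  # hypothetical cluster size, for illustration only

def shuffle(rows, key_index):
    """Route each row to the executor that owns its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % NUM_EXECUTORS].append(row)
    return partitions

# Made-up tables: orders(customer_id, item) and customers(customer_id, name).
orders = [(1, "book"), (2, "pen"), (3, "desk"), (1, "lamp")]
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]

order_parts = shuffle(orders, 0)
customer_parts = shuffle(customers, 0)

# After the shuffle, each executor joins only its own partitions;
# no further network traffic is needed (shared-nothing execution).
for executor in sorted(set(order_parts) | set(customer_parts)):
    names = {cid: name for cid, name in customer_parts.get(executor, [])}
    joined = [(cid, names[cid], item)
              for cid, item in order_parts.get(executor, []) if cid in names]
    print(f"executor {executor}: {joined}")
```

The resource isolation behind MPP solutions works perfectly fine most of the time, except for the two scenarios below.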

  • Performance bottlenecks: When data synchronization is required, which is quite common in big data analytics, the slowest nodes in the MPP cluster become the bottleneck for the running job. All the other tasks may simply wait for the slow one to complete, which drags down whole-cluster performance, or even fails the query because of an OOM issue. This is the same straggler problem that MapReduce-style frameworks try to solve through speculative execution (see the first sketch after this list).

  • Concurrency bottlenecks: MPP systems have a known concurrency limitation: the concurrency they support depends on the workload rather than on the scale of processing units/resources. For example, clusters of 50 nodes and 500 nodes would support the same level of concurrency. Here is an article from the Yahoo team mentioning the same issue in Apache Impala. A modern MPP system typically hits this limit at around 50 machines in a cluster, without much scalability beyond that. This is an area where MapReduce-style frameworks can leverage their independent task-processing model and scale well as more computing resources are added (see the second sketch after this list).
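To illustrate the straggler effect, here is a toy simulation in Python (the timings and the 1.5x-median re-launch rule are illustrative assumptions, not any real scheduler’s policy): a synchronized MPP stage finishes only when its slowest node does, while a speculative backup copy caps the damage a single straggler can do:

```python
import random

random.seed(42)

def stage_time(node_times):
    """A synchronized MPP stage is gated by its slowest node."""
    return max(node_times)

def with_speculation(node_times, factor=1.5):
    """Re-launch any task slower than `factor` x the median on another
    node; backup and original race, so a straggling task effectively
    finishes around the typical (median) node time."""
    median = sorted(node_times)[len(node_times) // 2]
    return max(min(t, median) if t > factor * median else t
               for t in node_times)

# 50 nodes: most finish in ~10s, one pathological straggler takes 60s.
times = [random.uniform(9, 11) for _ in range(49)] + [60.0]
print(f"synchronized stage: {stage_time(times):5.1f}s")
print(f"with speculation:   {with_speculation(times):5.1f}s")
```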
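And here is a deliberately simplified contrast for the concurrency point (the slot counts and the cluster-wide cap are made-up toy numbers, not measurements of any system): in the MPP model every query fans out to all executors, so a cluster-wide cap on concurrent queries does not grow with node count, while independent batch tasks gain slots with every node added:

```python
def mpp_concurrency(nodes, cluster_limit=20):
    # Every query touches every node, so adding nodes adds no capacity
    # for extra concurrent queries; the cluster-wide cap stays fixed.
    return cluster_limit

def batch_concurrency(nodes, slots_per_node=4, tasks_per_job=8):
    # Independent tasks: more nodes means more task slots, so more
    # jobs can make progress at the same time.
    return nodes * slots_per_node // tasks_per_job

for n in (50, 500):
    print(f"{n:4d} nodes -> MPP: {mpp_concurrency(n):3d} concurrent queries,"
          f" batch: {batch_concurrency(n):4d} concurrent jobs")
```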

Future

With the coming release of HDP and the closer strategic relationship with Pivotal, Apache HAWQ will introduce a completely new design that combines the advantages of both MPP and batch. Let’s look forward to seeing how it turns out soon :).