Happy New Year 2016


It is the end of 2015, so HAPPY NEW YEAR 2016! It is time to wrap up my writing calendar with a summary of Sparkera, myself, and the big data ecosystem.

In 2015, I published 21 articles on this blog, which finally got its name, SPARKERA - an era when a little spark can start big fires. This year, I also successfully migrated the blog from Jekyll to Hexo and renovated it with a new UI and domain name. I had my first book, on Apache Hive, published, and I started my career as an independent consultant in search of more challenges and exciting moments. In addition, I got more chances to share and teach my experience to many other people through various courses, meetings, and talks. The year 2015 was not easy, but it was productive.

In 2015, there were also great ideas and projects in the big data ecosystem that we cannot neglect, as follows.

  • Apache Kylin - a really innovative idea, providing a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and HBase. This tool has many use cases for enterprise users, and it is a good solution when an enterprise wants to build a data warehouse on top of the Hadoop ecosystem.
  • Apache Kafka - a high-throughput distributed messaging system built for great scalability, high availability, and ease of use, and it is big data friendly as well.
  • Apache Kudu - a smart solution providing super fast columnar storage whose access patterns sit between those of HDFS and HBase. Kudu integrates well with the Hadoop ecosystem. It is suitable for data warehouse environments, and it will be interesting to see how it performs against columnar storage formats such as Parquet and ORC.
  • Apache NiFi - an easy to use, powerful, and reliable dataflow system. It combines the functionality of Apache Flume and ETL tools. Hortonworks has packaged this tool as a product called Hortonworks DataFlow (HDF).
  • Apache HDFS Cache and Truncate - These two features have been expected for a long time, and they point to Hadoop’s continued strength as the core component of the big data ecosystem.
  • Apache YARN - new features supporting node labels, long-running jobs, and Docker.
  • Spark SQL and DataFrame - Spark SQL brings Spark closer to enterprise use cases through the DataFrame, which adds a schema and query optimization on top of regular RDDs. According to benchmarks, DataFrames deliver a significant performance boost over regular RDDs.
  • Hive on Spark - lets Hive users and applications leverage the powerful in-memory computing engine of Apache Spark.
  • GemFire and HAWQ - Very few enterprises would open source their leading products, but Pivotal did it, and did it for all of its big data products. This is a great time to see the gap between enterprise-ready products and open source software. It will also bring challenges and competition for solutions such as Impala, Hive, Spark SQL, and HBase.
  • Apache Zeppelin - There are very few open source visualization tools available, especially ones that support many frameworks or systems. Its native support for Apache Spark and its interactive code-to-result experience are attracting lots of community users.

Well, once again at the end, I look forward to an amazing and beautiful 2016 for me and you!