Light Big Data With Apache Spark

The presentation above covers the third topic in my series of talks on the Ten Tools for Ten Big Data Areas. Apache Spark is one of the greatest open source big data projects around today.

There are plenty of articles and reports about how great Apache Spark is. I have also heard many people claim that Spark will replace Hadoop and do everything in big data. I agree with some of this: Spark does a great job in many areas. However, I disagree with some of the claims about Spark, and that is what I want to talk about today, because most people do not realise that Spark may not work very well in certain areas.

First, Spark is not going to replace Hadoop. Hadoop is a big data platform, while Spark is an application. To put it in terms of the iPhone ecosystem, the relationship is like that between iOS and apps. Hadoop, and especially YARN, is gradually becoming more important in its role as a general-purpose platform for running all kinds of big data applications. Spark, on the other hand, is a powerful big data application that can do a lot, but it cannot “rule” the whole big data ecosystem. We still have plenty of use cases that call for other big data applications.
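To make the platform-versus-application point concrete, here is a minimal sketch of a Spark job written as an ordinary application. It would be packaged as a jar and handed to the cluster with spark-submit (for example with `--master yarn`), so YARN is the platform and Spark is just one application running on it. The object name and the HDFS paths are placeholders of my own, not anything from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal Spark application: Spark provides the computing engine,
// while the Hadoop/YARN cluster provides the platform it runs on.
object WordCountOnYarn {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountOnYarn")
    val sc = new SparkContext(conf)

    // hdfs:///data/input.txt is a placeholder path on HDFS
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/output")
    sc.stop()
  }
}
```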

Second, Spark is not good at everything. Below are some areas where I think there are better options than Spark.

  • Spark has to rely on HDFS or other file systems to store data; it is mainly a computing engine. Spark is built around the RDD, which is an immutable dataset. As a result, Spark is not a good fit for use cases where you need to modify data in place (see the first sketch after this list).

  • Spark Streaming uses a micro-batch execution model to simulate data streaming. As a result, it struggles when the batch interval drops below about 0.5 seconds. In that case you may need a truly real-time streaming framework such as Apache Storm or Flink (see the second sketch after this list).

  • Spark runs on the JVM and relies on Java’s garbage collector. Because the JVM is designed for general-purpose workloads, it does not always give Spark the flexibility or memory efficiency it needs. The Spark team has recognised this and started Project Tungsten, which builds Spark’s own memory management into recent releases.

  • For data ETL (extract, transform, and load), you may not always need Spark’s speed; reliability and failure recovery often matter more. In those cases, MapReduce’s stable batch-mode processing can be perfectly adequate.

  • Spark aims to replace MapReduce, but it offers no backward compatibility with existing MapReduce jobs, and many legacy MapReduce jobs cannot be retired from production overnight.

  • MLlib in Spark still needs improvement, both in the range of algorithms it supports and in accuracy.

  • GraphX in Spark is still young (compared with Apache Giraph), and some functions are only available in the Scala API.
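As a sketch of the immutability point above: every transformation on an RDD returns a new RDD rather than changing the existing one, so “updating” data really means deriving a fresh dataset. The numbers here are just a toy example I made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImmutableRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ImmutableRddSketch").setMaster("local[*]"))

    val prices = sc.parallelize(Seq(10.0, 20.0, 30.0))

    // There is no way to change an element of `prices` in place.
    // "Modifying" the data means computing a brand-new RDD from the old one.
    val discounted = prices.map(_ * 0.9)

    println(prices.collect().mkString(", "))      // 10.0, 20.0, 30.0 (unchanged)
    println(discounted.collect().mkString(", "))  // 9.0, 18.0, 27.0

    sc.stop()
  }
}
```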
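And a minimal sketch of the micro-batch model behind the streaming point: Spark Streaming gathers incoming data into small batches at a fixed interval and processes each batch as an RDD, which is why pushing the interval well below half a second becomes impractical. The host and port of the socket source are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchSketch").setMaster("local[2]")

    // Every 1 second, Spark Streaming turns the data received so far into a
    // small batch and processes it. Shrinking this interval much further is
    // where the micro-batch simulation of streaming breaks down.
    val ssc = new StreamingContext(conf, Seconds(1))

    // localhost:9999 is a placeholder socket source
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```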

Although Apache Spark still has areas to improve, there is no doubt it is a great big data application stack, and I believe it has a bright future. With the intense competition and rapid evolution in the big data ecosystem, let’s look forward to seeing whether this little ‘spark’ can start a prairie fire.