Develop Spark WordCount


It is quite common to set up an Apache Spark development environment through an IDE. Since I do not cover the IDE setup in much detail in my Spark course, here are detailed steps for developing the well-known Spark word count example using the Scala API in Eclipse.

Environment

Download Software Needed

  1. Download the proper Scala version and install it
  2. Download the Eclipse Scala IDE from the link above

Create Scala Project

  1. Open the Scala Eclipse IDE. From the top menu, choose File -> New -> Project -> Maven Project, using the information below as an example.

    Choose `Create a simple project (skip archetype selection)`
    Choose `Use default Workspace location`

  2. Click the Next button to go to the POM settings page and fill in the information below. Then click Finish.

    Group Id = ca.sparkera.spark 
    Artifact Id = WordCount
  3. Open the pom.xml file in the Eclipse working area and replace its content using the code here. Save the file; Eclipse will then automatically download the proper jar files and build the workspace.
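For reference, a minimal sketch of what the dependencies section of that POM typically looks like for this tutorial's setup; the coordinates and versions below are assumptions chosen to match the Spark 1.6.0 / Scala 2.10 combination used later in this post, not the exact linked file:

```xml
<dependencies>
  <!-- Scala standard library; version assumed to match the installed Scala 2.10 -->
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.6</version>
  </dependency>
  <!-- Spark core compiled for Scala 2.10; version assumed to match the target Spark -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
  </dependency>
</dependencies>
```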

  4. Add Scala Nature to this project by right-clicking on the project -> Configure -> Add Scala Nature.

  5. Update the Scala compiler version for Spark by right-clicking on the project -> Properties -> Scala Compiler -> Use Project Settings -> Scala Installation: Latest 2.10 bundle (dynamic).

  6. Refactor the source folder src/main/java to src/main/scala by right-clicking -> Refactor -> Rename. Create a package under it and name it ca.sparkera.spark.

  7. Create a Scala object under the package created above and name it WordCount.scala by right-clicking on the package -> New -> Scala Object and entering WordCount as the object name.

  8. Paste the code below as the content of WordCount.scala:

    package ca.sparkera.spark

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object WordCount {
      def main(args: Array[String]): Unit = {

        // Check for the required input path parameter - optional
        if (args.length < 1) {
          System.err.println("Usage: WordCount <file>")
          System.exit(1)
        }

        // Configuration for a Spark application
        val conf = new SparkConf()
        conf.setAppName("SparkWordCount").setMaster("local")

        // Create the Spark context
        val sc = new SparkContext(conf)

        // Create an RDD by reading the file whose path is given as a command-line parameter
        val rdd = sc.textFile(args(0))

        // WordCount: split lines into words, count each word, then swap the
        // pairs to sort by count in descending order, and swap back for output
        rdd.flatMap(_.split(" ")).
          map((_, 1)).
          reduceByKey(_ + _).
          map(x => (x._2, x._1)).
          sortByKey(false).
          map(x => (x._2, x._1)).
          saveAsTextFile("SparkWordCountResult")

        // Stop the context
        sc.stop()
      }
    }
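To see what the transformation chain produces without a Spark runtime, the same logic can be mimicked with plain Scala collections. This is a minimal sketch for illustration only (the sample lines and the `WordCountLocal` object are made up for this example):

```scala
object WordCountLocal {
  // Mirrors the RDD pipeline: split into words, count each word,
  // then sort by count in descending order.
  def wordCount(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split(" "))                              // like rdd.flatMap(_.split(" "))
      .groupBy(identity)                                  // like map((_, 1)).reduceByKey(_ + _)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy(p => -p._2)                                 // like the swap / sortByKey(false) / swap

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be")).foreach(println)
}
```

Running it on `Seq("to be or not to be")` puts `(to,2)` and `(be,2)` before the words that occur once, matching the descending-count order the Spark job writes out.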
  9. Run the application by right-clicking on WordCount.scala -> Run As -> Run Configurations -> Arguments, and add the input file path, such as /Users/will/Downloads/testdata/

  10. Check the output files containing the word count result in the location shown in the console output.

    ls -l /Users/will/workspace/WordCount/SparkWordCountResult
    total 1248
    -rw-r--r-- 1 will staff 0 6 Mar 16:30 _SUCCESS
    -rw-r--r-- 1 will staff 42669 6 Mar 16:30 part-00000
    -rw-r--r-- 1 will staff 15600 6 Mar 16:30 part-00001
    -rw-r--r-- 1 will staff 129997 6 Mar 16:30 part-00002
    -rw-r--r-- 1 will staff 443519 6 Mar 16:30 part-00003

Note

  • Without using Maven, we can alternatively add the downloaded Spark assembly jar file (at ../spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar) to the Scala project build path as an external jar library.
  • In a real project, we usually export the above Scala code as a jar file, copy it to the Spark cluster, and submit it using spark-submit.
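As a hedged sketch of that workflow (the jar path, master URL, and input path below are placeholders for your environment; also note that the `setMaster("local")` call in the code above should be removed or overridden when submitting to a cluster):

```
# Package the project from its root directory
mvn package

# Copy the jar to the cluster, then submit it; all paths here are examples
spark-submit \
  --class ca.sparkera.spark.WordCount \
  --master yarn \
  target/WordCount-0.0.1-SNAPSHOT.jar \
  hdfs:///user/will/testdata/
```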
  • I also share a website for searching Maven configurations.