Spark and ScalOps

My friend Nathan asked me to take a look at Spark and ScalOps. So I read about them, and here are my reading notes.

To my understanding, the idea behind Spark and ScalOps is to provide a powerful distributed computing API in the functional programming paradigm, so that we can write a distributed job in a few lines of code. This is what real programmers want. Because the pure functional paradigm forces us to shape our programs as sequences of side-effect-free computations, the programs become naturally easy to parallelize using operations like map and reduce. This has been pointed out many times by Lisp advocates like Paul Graham and Peter Norvig, and has been impressively demonstrated by the heavy use of the map operation in Python and Ruby. Spark and ScalOps, however, take the final step: parallelizing the programs.
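To make the point concrete, here is a minimal word-count sketch in plain Scala (not the actual Spark API): every step is a side-effect-free map, flatMap, or reduction over a collection, which is exactly the shape that a framework like Spark can distribute across machines.

```scala
// A word count built only from side-effect-free collection operations.
// Because no step mutates shared state, each stage could in principle
// be partitioned and run in parallel, as Spark does with RDDs.
object WordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                 // split each line into words
      .map(word => (word, 1))                // emit a count of 1 per word
      .groupBy(_._1)                         // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // reduce

  def main(args: Array[String]): Unit = {
    val lines = Seq("spark makes jobs short", "scala makes spark short")
    println(wordCount(lines)("short")) // prints 2
  }
}
```

In Spark the same pipeline looks almost identical, except the collection is a distributed dataset and the map/reduce stages run on a cluster.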

I am a little worried about Scala, though. Scala is cute: it supports the functional programming paradigm in addition to object orientation, and it has an extension that implements the CSP concurrent programming model. These features have been inherited by the Go programming language, which supports functional programming via first-class functions, and CSP via goroutines and channels. However, Go programs are compiled into native code, whereas Scala programs are compiled into bytecode and run on the JVM. This means that Scala programs suffer from the terrible Java garbage collection technology, which often eats up the memory of Hadoop clusters and thus prevents new jobs from starting.
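The CSP pattern mentioned above can be illustrated in plain Scala on the JVM. This is only a toy sketch: a blocking queue stands in for a channel and an ordinary thread for a lightweight process. Scala's real CSP-style support came from its separate actors library (and later Akka), and Go's goroutines are far cheaper than JVM threads; the names here are mine, not any library's API.

```scala
import java.util.concurrent.LinkedBlockingQueue

// A toy CSP-style sketch: one thread sends values over a "channel"
// (a blocking queue), another receives them. The two sides share no
// state except the channel, which is the essence of the CSP model.
object CspSketch {
  def sumViaChannel(n: Int): Int = {
    val channel = new LinkedBlockingQueue[Int]()
    val producer = new Thread(() => (1 to n).foreach(channel.put)) // sender
    producer.start()
    var sum = 0
    for (_ <- 1 to n) sum += channel.take() // receiver blocks on the channel
    producer.join()
    sum
  }

  def main(args: Array[String]): Unit =
    println(sumViaChannel(5)) // prints 15
}
```

In Go the same pattern is written with `go func() { ... }()` and a `chan int`, with the runtime multiplexing goroutines onto OS threads instead of paying for one JVM thread per process.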

Let us say a few words about Scala and Go. In a recent performance comparison, Scala beat Go. However, Scala was created in 2001 and has been optimized ever since, whereas Go was in its first year when the comparison was done. In general, Scala is as expressive as Python, and researchers love Scala the way they love Python. Go, however, was designed as an alternative to C/C++ for engineers. This leaves Go plenty of room for performance improvements.

This makes me believe that Scala and Spark were designed and developed by researchers, for researchers; whereas Hadoop and Google MapReduce were built by engineers, for engineers. Although Scala and Spark show us what concurrent programming will look like in the coming decades, and I would like to experiment with them, I won't try to put them into a production environment.