Statistics With Spark

Josh - 07 Mar 2014

Lately I've been writing a lot of Spark Jobs that perform some statistical analysis on datasets. One of the things I didn't realize right away - is that RDD's have built in support for basic statistic functions like mean, variance, sample variance, standard deviation.

These operations are avaible on RDD's of Double via DoubleRDDFunctions. You can access these functions like so:

  import org.apache.spark.SparkContext._ // implicit conversions in here

  val myRDD = newRDD().map { _.toDouble }
  myRDD.sampleVariance // divides by n-1
  myRDD.sampleStdDev  // divides by n-1 

Getting It All At Once

If you're interested in calling multiple stats functions at the same time, it's a better idea to get them all in a single pass. Spark provides the stats method in DoubleRDDFunctions for that; it also provides the total count of the RDD as well.


Means and standard deviation are a decent starting point when you're looking at a new dataset; but you have to be careful because measures of central tendency - they hide the distribution from you. If you're looking at something like response latency - there could very well be dragons lurking there.

Fortunately Spark also includes a nifty little histogram method which you can use:

val myRDD = newRDD().map { _.toDouble }
myRDD.histogram(10) // 10 evenly spaced buckets, between myRDD.min ->  myRDD.max 
myRDD.histogram(new Array(0.0, 10.0, 20.0, 30.0)) // manually specify the buckets

Beyond The Box

Spark provides a very basic, but useful starting point. If you want access to more advanced statistical methods like classification or regression - check out MLLib. However at the time of writing its still a very young project and you might have to implement things on your own.

In our case, we ended up implementing some basic z & chi-squared tests for some of our bidding algorithms like:

  • Comparing binomial proportions of two samples
  • Comparing means of two samples
  • Comparing two distributions drawn from different samples

These are still very much in there infancy, but it's possible we might open source them at a later date.

For now, when implementing things yourself it's nice to keep operations as general as possible, so I'd recommend trying to keep all your hand-rolled stats method to operations on RDD's of primatives (or matrices).

comments powered by Disqus