Addition of two RDD[mllib.linalg.Vector]'s

krishna · Jan 30, 2015

I need to add two matrices that are stored in two files.

latest1.txt and latest2.txt both contain the following:

1 2 3
4 5 6
7 8 9

I am reading those files as follows:

scala> import org.apache.spark.mllib.linalg.Vectors

scala> val rows = sc.textFile("latest1.txt").map { line =>
    val values = line.split(' ').map(_.toDouble)
    Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}

scala> val r1 = rows
r1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:14

scala> val rows = sc.textFile("latest2.txt").map { line =>
    val values = line.split(' ').map(_.toDouble)
    Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}

scala> val r2 = rows
r2: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:14

I want to add r1 and r2. Is there any way to add these two RDD[mllib.linalg.Vector]s in Apache Spark?

Answer

StephenBoesch picture StephenBoesch · Jan 31, 2015

This is actually a good question. I work with mllib regularly and did not realize these basic linear algebra operations are not easily accessible.

The point is that the underlying Breeze vectors have all of the linear algebra operations you would expect, including of course the basic element-wise addition that you specifically mentioned.

However, the Breeze implementation is hidden from the outside world via:

private[mllib]

So then, from the outside world/public API perspective, how do we access those primitives?

Some of them are already exposed, e.g. the squared distance:

/**
 * Returns the squared distance between two Vectors.
 * @param v1 first Vector.
 * @param v2 second Vector.
 * @return squared distance between two Vectors.
 */
def sqdist(v1: Vector, v2: Vector): Double = { 
  ...
}
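
As a small illustration of calling one of these exposed primitives, the sketch below assumes a Spark release in which `Vectors.sqdist` is part of the public API and that the mllib jar is on the classpath. The values `a` and `b` are made up for the example.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Illustrative inputs, not taken from the question's files
val a = Vectors.dense(1.0, 2.0, 3.0)
val b = Vectors.dense(4.0, 5.0, 6.0)

// Squared Euclidean distance: (1-4)^2 + (2-5)^2 + (3-6)^2 = 27.0
val d = Vectors.sqdist(a, b)
```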

However, the selection of such methods is limited, and in fact it does not include basic operations such as element-wise addition, subtraction, multiplication, etc.

So here is the best I could see:

  • Convert the vectors to Breeze
  • Perform the vector operations in Breeze
  • Convert back from Breeze to an mllib Vector

Here is some sample code:

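import breeze.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
// Wrap the underlying arrays in Breeze dense vectors
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)

// Add element-wise in Breeze, then convert back to an mllib Vector
val vectout = Vectors.dense((bv1 + bv2).toArray)
vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]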
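
To carry the same idea over to the two RDDs from the question, one possible sketch is to pair the rows up by position with `RDD.zip` and add each pair through Breeze. The helper names `addVectors` and `addRows` are hypothetical, not part of Spark. Note that `zip` assumes both RDDs have the same number of partitions and the same number of elements in each partition, which should hold when two files of identical layout are read the same way but is not guaranteed in general. The result is dense, so any sparsity in the inputs is lost.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper: element-wise sum of two mllib vectors via Breeze.
// Both inputs are densified with toArray before the addition.
def addVectors(x: Vector, y: Vector): Vector = {
  require(x.size == y.size, "vectors must have the same length")
  Vectors.dense((new BDV(x.toArray) + new BDV(y.toArray)).toArray)
}

// Hypothetical helper: pair rows by position and add each pair.
// RDD.zip assumes identical partitioning and per-partition element counts.
def addRows(a: RDD[Vector], b: RDD[Vector]): RDD[Vector] =
  a.zip(b).map { case (x, y) => addVectors(x, y) }
```

With the RDDs from the question this would be called as `val sum = addRows(r1, r2)`, and `sum.collect().foreach(println)` would print the summed rows. If the two RDDs cannot be relied on to line up partition by partition, an alternative is to key each row with `zipWithIndex` and `join` on the index, at the cost of a shuffle.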