SparkR vs sparklyr

koVex · Sep 14, 2016 · Viewed 19.9k times

Does anyone have an overview of the advantages and disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Having tried both, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr inputs). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?

Best

Answer

Alex Vorobiev · Oct 11, 2016

The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark:

https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
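As a rough sketch of what that looks like, the snippet below follows the example in the SparkR documentation linked above: dapply runs an R function on each partition of a Spark DataFrame. It assumes a local Spark installation that SparkR can find; the column name waiting_secs is only illustrative.

```r
library(SparkR)

# Start (or connect to) a Spark session; assumes Spark is installed locally.
sparkR.session()

# Turn the built-in faithful data set into a Spark DataFrame.
df <- createDataFrame(faithful)

# dapply needs the schema of the result declared up front.
schema <- structType(
  structField("eruptions", "double"),
  structField("waiting", "double"),
  structField("waiting_secs", "double")
)

# The function receives each partition as an ordinary R data.frame,
# so any R code can be used inside it.
df1 <- dapply(df, function(x) {
  cbind(x, waiting_secs = x$waiting * 60)  # minutes to seconds
}, schema)

head(collect(df1))

sparkR.session.stop()
```

The function body here is trivial, but the same pattern should work for anything that takes a data.frame and returns one matching the declared schema.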

Since sparklyr translates R to SQL, you can only use a very small set of functions in mutate statements:

http://spark.rstudio.com/dplyr.html#sql_translation
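To see the translation for yourself, something like the following should work. It is a minimal sketch assuming a local Spark install that sparklyr can connect to; the table name "mtcars" and the derived columns are only illustrative.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is installed).
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Arithmetic and common math functions have SQL translations;
# show_query() prints the generated SQL instead of running it.
mtcars_tbl %>%
  mutate(kpl = mpg * 0.425, log_hp = log(hp)) %>%
  show_query()

spark_disconnect(sc)
```

As far as I know, an R function without a known translation is passed through to Spark SQL unchanged, so it only works if Spark happens to have a function of the same name.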

That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).
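A wrapper function in this sense is an R function that calls into Spark's Scala API through invoke. The sketch below is modelled on the line-counting example in the extensions guide; count_lines is a hypothetical helper name and the temp file is only there to make the snippet self-contained. It assumes a local Spark connection.

```r
library(sparklyr)
library(dplyr)  # for %>%

sc <- spark_connect(master = "local")

# Hypothetical helper: count the lines of a text file by calling
# SparkContext.textFile(path, minPartitions) and then RDD.count().
count_lines <- function(sc, path) {
  spark_context(sc) %>%
    invoke("textFile", path, 1L) %>%
    invoke("count")
}

# Write a small file to count; in local mode Spark can read it directly.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "c"), tmp)
count_lines(sc, tmp)

spark_disconnect(sc)
```

Note that this reaches existing Spark (JVM) methods from R; it is not the same as running arbitrary R code on the workers.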

Other than that, sparklyr is the winner (in my opinion). Aside from the obvious advantage of using familiar dplyr functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html), plus the Extensions mentioned above.