Does anyone have an overview of the advantages and disadvantages of SparkR vs. sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. After trying both out, I find SparkR a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr inputs). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?
Best
The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark:
https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
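As a rough illustration of what that looks like, here is a minimal sketch using SparkR's dapply, which applies an R function to each partition of a Spark DataFrame. It assumes a local Spark 2.x installation that SparkR can find; the mtcars columns and the mpg_per_cyl name are chosen only for the example.

```r
library(SparkR)

# Assumption: Spark 2.x is installed locally and discoverable (e.g. via SPARK_HOME).
sparkR.session(master = "local[*]")

# Turn a small built-in R data frame into a distributed Spark DataFrame.
df <- createDataFrame(mtcars[, c("mpg", "cyl")])

# dapply needs the schema of the returned data declared up front.
schema <- structType(
  structField("mpg", "double"),
  structField("cyl", "double"),
  structField("mpg_per_cyl", "double")
)

# The function receives each partition as an ordinary R data.frame,
# so any R code can run here, not only what can be expressed in SQL.
result <- dapply(df, function(part) {
  part$mpg_per_cyl <- part$mpg / part$cyl
  part
}, schema)

head(collect(result))

sparkR.session.stop()
```

The function body runs in R worker processes on the executors, so any packages it uses would have to be installed on every node.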
Since sparklyr translates R to SQL, you can only use a very small set of functions in mutate
statements:
http://spark.rstudio.com/dplyr.html#sql_translation
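To make the translation visible, here is a small sketch, assuming a local Spark installation that sparklyr can connect to; the table name "cars" and the derived column are made up for the example. show_query prints the SQL that the dplyr pipeline is turned into, which is why only functions with a SQL counterpart work inside mutate.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available to sparklyr.
sc <- spark_connect(master = "local")

# Copy a small built-in data frame into Spark; "cars" is an arbitrary table name.
cars_tbl <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# The pipeline is lazy: nothing runs on Spark until the result is collected.
query <- cars_tbl %>%
  mutate(mpg_per_cyl = round(mpg / cyl, 1)) %>%
  filter(mpg_per_cyl > 4)

# Print the SQL generated from the dplyr verbs above.
show_query(query)

# Execute on Spark and bring the result back as a local data frame.
local_result <- collect(query)
head(local_result)

spark_disconnect(sc)
```

A function that has no SQL translation is typically passed through to Spark SQL unchanged, so it fails at execution time unless Spark itself happens to define a function with that name.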
That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).
Other than that, sparklyr is the winner (in my opinion). Aside from the obvious advantage of using familiar dplyr
functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html), plus the Extensions mentioned above.