Tutorial by Examples

Setup Spark context in R

To start working with Spark's distributed dataframes, you must connect your R program to an existing Spark cluster.

    library(SparkR)
    sc <- sparkR.init()               # connection to Spark context
    sqlContext <- sparkRSQL.init(sc)  # connection to SQL context

Here is information on how...
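A minimal sketch of the same setup with explicit options and a clean shutdown; the master URL and application name below are placeholders, not values from this tutorial:

    library(SparkR)

    # "local[2]" and the appName are illustrative placeholders
    sc <- sparkR.init(master = "local[2]", appName = "SparkR-example")
    sqlContext <- sparkRSQL.init(sc)

    # ... work with distributed dataframes ...

    sparkR.stop()   # shut down the Spark context when finished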
What: Caching can optimize computation in Spark. Caching stores data in memory and is a special case of persistence. Here we explain what happens when you cache an RDD in Spark.

Why: Basically, caching saves an interim partial result - usually after transformations - of your original data. So, ...
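A minimal sketch of caching a SparkR DataFrame (mtcars is used purely as an illustrative dataset):

    # Mark a DataFrame for caching so repeated actions reuse the in-memory copy
    df <- createDataFrame(sqlContext, mtcars)
    df <- cache(df)   # caching is a special case of persistence
    count(df)         # first action materialises the cached data
    count(df)         # subsequent actions read from memory
    unpersist(df)     # release the cached data when no longer needed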
From dataframe:

    mtrdd <- createDataFrame(sqlContext, mtcars)

From csv: For CSVs, you need to add the csv package to the environment before initiating the Spark context:

    Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shel...
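Once the spark-csv package is on the classpath, a CSV file can be loaded with read.df. A minimal sketch; the file path and the header/inferSchema options are placeholders for your own data:

    # "path/to/file.csv" is a hypothetical path
    csvDF <- read.df(sqlContext,
                     "path/to/file.csv",
                     source = "com.databricks.spark.csv",
                     header = "true",
                     inferSchema = "true")
    head(csvDF)   # inspect the first rows of the loaded DataFrame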
