Caching can speed up repeated computation in Spark. Caching stores data in memory and is a special case of persistence. This answer explains what happens when you cache an RDD in Spark (the example below uses a DataFrame, which is cached the same way).
Basically, caching saves an intermediate result of your original data, usually the output of earlier transformations. When you then use the cached RDD, Spark serves the already transformed data from memory instead of recomputing those transformations.
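A minimal sketch of this idea (it assumes a SparkR sqlContext as initialized in the example further below, and a hypothetical numeric column named label in the data):

df <- read.df(sqlContext, "/train.csv", source = "com.databricks.spark.csv", inferSchema = "true")
positives <- filter(df, df$label > 0)  # transformation: lazy, nothing is computed yet
cache(positives)                       # mark the transformed result for in-memory storage
count(positives)                       # first action: runs the filter and fills the cache
count(positives)                       # second action: reuses the filtered rows from memory

Only the second count() benefits from the cache; the first one still has to compute the filter, and fills the cache as a side effect.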
Here is an example of how to quickly access large data (here a 3 GB CSV file) from in-memory storage when it is accessed more than once:
library(SparkR)

# The next line is needed for the direct CSV import:
Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shell"')

sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Loading the 3 GB CSV file:
train <- read.df(sqlContext, "/train.csv",
                 source = "com.databricks.spark.csv",
                 inferSchema = "true")

cache(train)

system.time(head(train))
# output: time elapsed: 125 s. This first action triggers the actual caching.

system.time(head(train))
# output: time elapsed: 0.2 s (!)
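Since cache() is just persistence at Spark's default storage level, the more general persist() lets you pick the level explicitly. A short sketch, using storage level names as defined by Spark:

persist(train, "MEMORY_AND_DISK")  # keep what fits in memory, spill the rest to disk
unpersist(train)                   # release the storage once the data is no longer needed

MEMORY_AND_DISK still avoids recomputation when the data does not fit entirely in RAM, at the cost of slower disk reads for the spilled partitions.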