R Language: Parallel processing with the parallel package

Example

The base package parallel supports parallel computation through forking and sockets, and provides random-number generation suited to parallel work.
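As a rough sketch of these three facilities, the snippet below runs the same toy function through a forked backend (mclapply) and a socket cluster (parLapply), then seeds reproducible per-worker random-number streams with clusterSetRNGStream. The function square, the worker count of 2, and the seed 123 are illustrative assumptions, not part of the original example. Forking is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)

square <- function(x) x^2  # hypothetical toy function

# Forking: mclapply() forks the current R process.
# On Windows forking is unsupported, so mc.cores must be 1.
fork_cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- mclapply(1:4, square, mc.cores = fork_cores)

# Sockets: makeCluster() starts separate R worker processes (PSOCK by default).
cl <- makeCluster(2)
socketed <- parLapply(cl, 1:4, square)

# Random-number generation: give each worker its own reproducible stream.
clusterSetRNGStream(cl, iseed = 123)
draws_a <- parSapply(cl, 1:4, function(i) runif(1))
clusterSetRNGStream(cl, iseed = 123)
draws_b <- parSapply(cl, 1:4, function(i) runif(1))

stopCluster(cl)
```

Both backends return a list with one element per input, and resetting the streams with the same seed on the same cluster should reproduce the same draws.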

Detect the number of cores present on the localhost:

parallel::detectCores(all.tests = FALSE, logical = TRUE)
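detectCores() counts logical cores by default and can return NA when the count cannot be determined, so it may be worth choosing the worker count defensively. The following is a minimal sketch of one way to do that; the name n_workers and the choice to leave one core free are assumptions made for illustration.

```r
library(parallel)

n_logical  <- detectCores(logical = TRUE)   # logical cores, including hyper-threads
n_physical <- detectCores(logical = FALSE)  # physical cores, where the platform reports them

# detectCores() may return NA; fall back to one worker,
# otherwise leave one core free for the rest of the system
n_workers <- max(1L, n_logical - 1L, na.rm = TRUE)
```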

Create a cluster of the cores on the localhost:

parallelCluster <- parallel::makeCluster(parallel::detectCores())

First, a function appropriate for parallelization must be created. Consider the mtcars dataset. A regression on mpg could be improved by creating a separate regression model for each level of cyl.

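# work on a copy of the built-in mtcars dataset
data <- mtcars
# column whose levels define the separate models
yfactor <- 'cyl'
zlevels <- sort(unique(data[[yfactor]]))
datay <- data[,1]     # response: mpg
dataz <- data[,2]     # grouping variable: cyl
datax <- data[,3:11]  # predictors: disp through carb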


fitmodel <- function(zlevel, datax, datay, dataz) {
  glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel])
}

Write a loop that iterates over every value in zlevels. This still runs serially, but it is an important step because it pins down the exact process that will be parallelized.

for (zlevel in zlevels) {
  print("*****")
  print(zlevel)
  print(fitmodel(zlevel, datax, datay, dataz))
}
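The serial loop can also be written with lapply(), which returns a list with one fitted model per level. That is the same shape parLapply() produces later, so it makes a convenient reference result. The sketch below rebuilds the inputs so that it runs on its own; the name serial_models is an assumption introduced for illustration.

```r
# Rebuild the inputs so this sketch is self-contained
datay <- mtcars[, 1]      # response: mpg
dataz <- mtcars[, 2]      # grouping variable: cyl
datax <- mtcars[, 3:11]   # remaining columns as predictors
zlevels <- sort(unique(dataz))

fitmodel <- function(zlevel, datax, datay, dataz) {
  glm.fit(x = datax[dataz == zlevel, ], y = datay[dataz == zlevel])
}

# Serial equivalent of the loop: one fitted model per level of cyl
serial_models <- lapply(zlevels, fitmodel,
                        datax = datax, datay = datay, dataz = dataz)
names(serial_models) <- zlevels

# Inspect the coefficients of the model for 4-cylinder cars
serial_models[["4"]]$coefficients
```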

Curry this function so that zlevel is its only argument and the data arguments are fixed:

worker <- function(zlevel) {
  fitmodel(zlevel, datax, datay, dataz)
}

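The worker processes of a socket cluster are separate R sessions and cannot see objects in the global environment of the calling session. Luckily, each function call creates a local environment, and that environment is sent to the workers together with any function defined inside it. Creating a wrapper function therefore allows for parallelization. The function to be applied also needs to be defined within that environment.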

wrapper <- function(datax, datay, dataz) {
  # force evaluation of all parameters not supplied by the parallel apply call
  force(datax)
  force(datay)
  force(dataz)
  # these variables are now in an environment accessible by the parallel function
  
  # function to be applied also in the environment
  fitmodel <- function(zlevel, datax, datay, dataz) {
    glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel])
  }
  
  # calling in this environment iterating over single parameter zlevel
  worker <- function(zlevel) {
    fitmodel(zlevel,datax, datay, dataz)
  }
  return(worker) 
}
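The same closure pattern can be seen in a much smaller form. The sketch below is an illustration only: make_adder and offset are hypothetical names, and the cluster size of 2 is an assumption. The value of offset is never placed in the global environment of the workers; it travels with the returned function because it lives in that function's enclosing environment.

```r
library(parallel)

make_adder <- function(offset) {
  force(offset)            # evaluate the argument now, before the function is shipped
  function(x) x + offset   # closure: carries offset in its enclosing environment
}

cl <- makeCluster(2)
res <- parLapply(cl, 1:3, make_adder(10))
stopCluster(cl)

unlist(res)  # 11 12 13
```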

Now run the wrapper function on the cluster created earlier.

models <- parallel::parLapply(parallelCluster, zlevels,
                              wrapper(datax, datay, dataz))

Always stop the cluster when finished.

parallel::stopCluster(parallelCluster)
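One way to make sure a cluster is always stopped, even when the parallel call fails, is to create it inside a function and register the shutdown with on.exit(). This is a minimal sketch rather than part of the original example; run_in_parallel and its workers argument are hypothetical names.

```r
library(parallel)

run_in_parallel <- function(values, fun, workers = 2L) {
  cl <- makeCluster(workers)
  # runs when the function exits, whether normally or through an error
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, values, fun)
}

squares <- run_in_parallel(1:4, function(x) x^2)
```

The cluster here lives only for the duration of one call, which is simple but repeats the startup cost; for many calls it may be cheaper to create the cluster once and stop it at the end.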

The parallel package includes parallel counterparts of most of the apply() family, prefixed with par, such as parLapply(), parSapply(), and parApply().
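As a short illustration of two of these counterparts, the sketch below uses parSapply() and parApply() on toy inputs. The cluster size of 2 and the names roots, m, and row_totals are assumptions made for the example; the results are expected to match the serial sapply() and apply() calls on the same inputs.

```r
library(parallel)

cl <- makeCluster(2)

# parallel counterpart of sapply(1:4, sqrt)
roots <- parSapply(cl, 1:4, sqrt)

# parallel counterpart of apply(m, 1, sum): one sum per row
m <- matrix(1:6, nrow = 2)
row_totals <- parApply(cl, m, 1, sum)

stopCluster(cl)
```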


