A major problem with parallelization is the use of RNG seeds. The random number stream advances with each operation that draws from it, counted from either the start of the session or the most recent set.seed(). Since parallel processes arise from the same function, they can all start from the same seed, possibly producing identical results! In that case the work is merely duplicated across the cores, which provides no advantage over running in serial.
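To make the duplication concrete, here is a minimal sketch that uses a serial stand-in for four workers. The helper draw_with_seed() and the seed values are hypothetical, chosen only for illustration.

```r
# Serial stand-in for four workers; no cluster is started here.
# draw_with_seed() is a hypothetical helper, not part of any package.
draw_with_seed <- function(seed, n = 3) {
  set.seed(seed)  # every caller that passes the same seed restarts the same stream
  runif(n)
}

# Four "workers" that all start from seed 42.
same_seed <- lapply(1:4, function(i) draw_with_seed(42))

# Four "workers" that each start from a different seed.
distinct_seed <- lapply(1:4, function(i) draw_with_seed(42 + i))

# Count how many distinct result vectors each setup produced.
n_unique_same <- length(unique(same_seed))
n_unique_distinct <- length(unique(distinct_seed))
print(c(same = n_unique_same, distinct = n_unique_distinct))
```

With a shared seed, all four result vectors collapse into one. Handing out different integer seeds removes the duplication, but on its own it does not guarantee that the streams are statistically independent.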
A set of seeds must be generated and sent to each parallel process. This is done automatically in some packages (parallel, snow, etc.), but must be explicitly addressed in others.
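library(parallel)
RNGkind("L'Ecuyer-CMRG")   # nextRNGStream() requires this generator
set.seed(123)              # arbitrary starting seed, for illustration
s <- .Random.seed
for (i in seq_len(numofcores)) {   # numofcores: number of workers, defined elsewhere
  s <- nextRNGStream(s)    # advance to the next independent stream
  # send s to worker i as .Random.seed
}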
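The "send s to worker i" step can be sketched as follows, assuming a local cluster created with makeCluster() from the parallel package. The worker count, the seed value, and the variable names cl and draws are assumptions made for this example.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")   # the generator that nextRNGStream() works with
set.seed(2024)             # arbitrary seed, for illustration only

numofcores <- 2            # assumed worker count for this sketch
cl <- makeCluster(numofcores)

s <- .Random.seed
for (i in seq_len(numofcores)) {
  s <- nextRNGStream(s)
  # cl[i] is a one-worker sub-cluster; install the stream as that
  # worker's .Random.seed in its global environment.
  clusterCall(cl[i], function(seed) {
    assign(".Random.seed", seed, envir = .GlobalEnv)
    NULL
  }, s)
}

# Each worker now draws from its own stream.
draws <- clusterEvalQ(cl, runif(3))
stopCluster(cl)
print(draws)
```

Because every worker received a different stream, the vectors in draws should differ from one another, unlike the shared-seed case.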
Seeds can also be set for reproducibility.
clusterSetRNGStream(cl = parallelcluster, iseed = 123)  # iseed: any integer; 123 is an arbitrary example
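As a rough check that this gives reproducible results, the sketch below resets the streams with the same iseed before two identical runs. The two-worker cluster, the seed value 123, and the names first and second are assumptions for illustration.

```r
library(parallel)

parallelcluster <- makeCluster(2)  # assumed: a small local cluster

# First run: seed the workers' streams, then draw on the workers.
clusterSetRNGStream(cl = parallelcluster, iseed = 123)
first <- parSapply(parallelcluster, 1:4, function(i) runif(1))

# Second run: reset the streams with the same iseed and repeat.
clusterSetRNGStream(cl = parallelcluster, iseed = 123)
second <- parSapply(parallelcluster, 1:4, function(i) runif(1))

stopCluster(parallelcluster)

# Compare the two runs.
print(identical(first, second))
```

The comparison relies on parSapply() splitting the tasks across workers the same way both times. With load-balanced functions such as clusterApplyLB(), the assignment of tasks to workers can vary between runs, so the same guarantee would not hold.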