Julia Language Parallel Processing When to use @parallel vs. pmap


Example

The Julia documentation advises that

pmap() is designed for the case where each function call does a large amount of work. In contrast, @parallel for can handle situations where each iteration is tiny, perhaps merely summing two numbers.

There are several reasons for this. First, pmap incurs greater start up costs initiating jobs on workers. Thus, if the jobs are very small, these startup costs may become inefficient. Conversely, however, pmap does a "smarter" job of allocating jobs amongst workers. In particular, it builds a queue of jobs and sends a new job to each worker whenever that worker becomes available. @parallel by contrast, divvies up all work to be done amongst the workers when it is called. As such, if some workers take longer on their jobs than others, you can end up with a situation where most of your workers have finished and are idle while a few remain active for an inordinate amount of time, finishing their jobs. Such a situation, however, is less likely to occur with very small and simple jobs.

The following illustrates this: suppose we have two workers, one of which is slow and the other of which is twice as fast. Ideally, we would want to give the fast worker twice as much work as the slow worker. (or, we could have fast and slow jobs, but the principal is the exact same). pmap will accomplish this, but @parallel won't.

For each test, we initialize the following:

addprocs(2)

@everywhere begin
    function parallel_func(idx)
        workernum = myid() - 1 
        sleep(workernum)
        println("job $idx")
    end
end

Now, for the @parallel test, we run the following:

@parallel for idx = 1:12
    parallel_func(idx)
end

And get back print output:

julia>     From worker 2:    job 1
    From worker 3:    job 7
    From worker 2:    job 2
    From worker 2:    job 3
    From worker 3:    job 8
    From worker 2:    job 4
    From worker 2:    job 5
    From worker 3:    job 9
    From worker 2:    job 6
    From worker 3:    job 10
    From worker 3:    job 11
    From worker 3:    job 12

It's almost sweet. The workers have "shared" the work evenly. Note that each worker has completed 6 jobs, even though worker 2 is twice as fast as worker 3. It may be touching, but it is inefficient.

For for the pmap test, I run the following:

pmap(parallel_func, 1:12)

and get the output:

From worker 2:    job 1
From worker 3:    job 2
From worker 2:    job 3
From worker 2:    job 5
From worker 3:    job 4
From worker 2:    job 6
From worker 2:    job 8
From worker 3:    job 7
From worker 2:    job 9
From worker 2:    job 11
From worker 3:    job 10
From worker 2:    job 12

Now, note that worker 2 has performed 8 jobs and worker 3 has performed 4. This is exactly in proportion to their speed, and what we want for optimal efficiency. pmap is a hard task master - from each according to their ability.