cudaParallel reduction (e.g. how to sum an array)


Remarks

Parallel reduction algorithm typically refers to an algorithm which combines an array of elements, producing a single result. Typical problems that fall into this category are:

  • summing up all elements in an array
  • finding a maximum in an array

In general, the parallel reduction can be applied for any binary associative operator, i.e. (A*B)*C = A*(B*C). With such operator *, the parallel reduction algorithm repetedely groups the array arguments in pairs. Each pair is computed in parallel with others, halving the overall array size in one step. The process is repeated until only a single element exists.

If the operator is commutative (i.e. A*B = B*A) in addition to being associative, the algorithm can pair in a different pattern. From theoretical standpoint it makes no difference, but in practice it gives a better memory access pattern:

Not all associative operators are commutative - take matrix multiplication for example.