apache-spark Repartition an RDD


Sometimes we want to repartition an RDD, for example because it comes from a file that wasn't created by us, and the number of partitions defined from the creator is not the one we want.

The two most known functions to achieve this are:



coalesce(numPartitions, shuffle=False)

As a rule of thumb, use the first when you want to repartition your RDD in a greater number of partitions and the second to reduce your RDD, in a smaller number of partitions. Spark - repartition() vs coalesce().

For example:

data = sc.textFile(file)
data = data.coalesce(100) // requested number of #partitions

will decrease the number of partitions of the RDD called 'data' to 100, given that this RDD has more than 100 partitions when it got read by textFile().

And in a similar way, if you want to have more than the current number of partitions for your RDD, you could do (given that your RDD is distributed in 200 partitions for example):

data = sc.textFile(file)
data = data.repartition(300) // requested number of #partitions