How does an RDD get partitioned?
By default, a partition is created for each HDFS block, which by default is 64 MB. Read more here.
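If you want to see this on your own setup, here is a minimal sketch, assuming an existing SparkContext sc (as in the sessions below) and a hypothetical file path of your own; the actual counts will depend on your data and configuration:

# Sketch: inspect how many partitions Spark created by default.
# For an RDD built from a collection, the default is sc.defaultParallelism;
# for a file, it is roughly one partition per HDFS block.
rdd = sc.parallelize(range(1000))
print(rdd.getNumPartitions())             # typically equals sc.defaultParallelism

lines = sc.textFile("hdfs:///some/path")  # hypothetical path, replace with your own
print(lines.getNumPartitions())           # roughly file_size / block_size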
How to balance my data across partitions?
First, take a look at the three ways one can repartition data:
Pass a second parameter, the desired minimum number of partitions for your RDD, into textFile(), but be careful:
In [14]: lines = sc.textFile("data")
In [15]: lines.getNumPartitions() Out[15]: 1000
In [16]: lines = sc.textFile("data", 500)
In [17]: lines.getNumPartitions() Out[17]: 1434
In [18]: lines = sc.textFile("data", 5000)
In [19]: lines.getNumPartitions() Out[19]: 5926
As you can see, [16] doesn't do what one might expect: the parameter is only a minimum, and the number of partitions the RDD already has is greater than the 500 we requested, so the partition count is not reduced.
Use repartition(), like this:
In [22]: lines = lines.repartition(10)

In [23]: lines.getNumPartitions()
Out[23]: 10
Warning: This will invoke a shuffle and should be used when you want to increase the number of partitions your RDD has.
From the docs:
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Use coalesce(), like this:
In [25]: lines = lines.coalesce(2)

In [26]: lines.getNumPartitions()
Out[26]: 2
Here, Spark knows that you will shrink the RDD and takes advantage of it. Read more about repartition() vs coalesce().
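To make the relationship between the two concrete, here is a small sketch, again assuming an existing SparkContext sc: in the RDD API, repartition(n) is essentially coalesce(n, shuffle=True), while a plain coalesce(n) only merges existing partitions and so avoids a full shuffle.

# Sketch: two ways of changing the partition count.
rdd = sc.parallelize(range(1000), 100)

shrunk = rdd.coalesce(10)              # merges partitions, no full shuffle
print(shrunk.getNumPartitions())       # 10

reshuffled = rdd.repartition(10)       # same as rdd.coalesce(10, shuffle=True)
print(reshuffled.getNumPartitions())   # 10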
But will all this guarantee that your data will be perfectly balanced across your partitions? Not really, as I found out in How to balance my data across the partitions?
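One way to check how (un)balanced the result actually is, is to count the records in every partition. A minimal sketch, assuming an existing SparkContext sc and the same "data" file as above, using glom(), which turns each partition into a list:

# Sketch: count how many elements ended up in each partition.
rdd = sc.textFile("data").repartition(10)
sizes = rdd.glom().map(len).collect()  # one length per partition
print(sizes)                           # the lengths are rarely perfectly equal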