The number of partitions is critical to an application's performance, and even to whether it terminates successfully at all.
A Resilient Distributed Dataset (RDD) is Spark's main abstraction. An RDD is split into partitions: each partition is a slice, or chunk, of the dataset.
The greater the number of partitions, the smaller each partition is.
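To make that relationship concrete, here is a minimal sketch in plain Python (no Spark required) that splits a hypothetical 1,000-record dataset into chunks, mimicking what partitioning does; `split_into_partitions` is a purely illustrative helper, not a Spark API:

```python
# Illustrative only: shows how partition size shrinks as the
# partition count grows. split_into_partitions is a hypothetical
# helper mimicking Spark's behavior, not a Spark API.

def split_into_partitions(records, num_partitions):
    """Distribute records round-robin into num_partitions chunks."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

records = list(range(1000))  # a stand-in for a 1,000-record dataset

for n in (1, 10, 100):
    parts = split_into_partitions(records, n)
    biggest = max(len(p) for p in parts)
    print(n, "partitions -> largest partition holds", biggest, "records")
    # 1 partition holds all 1000 records; 100 partitions hold 10 each
```

With a single partition, one chunk carries the entire dataset, which is exactly the failure mode discussed below.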
However, note that a very large number of partitions puts a lot of pressure on the Hadoop Distributed File System (HDFS), which has to keep track of a significant amount of metadata.
The number of partitions is also related to memory usage; in my personal experience, a memoryOverhead error can be caused by an unsuitable partition count.
A common pitfall for new users is to transform their RDD into an RDD with only one partition, which usually looks like this:

```python
data = sc.textFile(file)
data = data.coalesce(1)
```
That is usually a very bad idea, since it tells Spark to put all the data in just one partition! Remember that:
A task in Spark operates on one partition at a time (and loads that partition's data into memory).
As a result, you are asking Spark to handle all the data at once, which usually leads to memory-related errors (an OutOfMemoryError, for example) or even a NullPointerException.
So, unless you know what you are doing, avoid repartitioning your RDD into just one partition!
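Instead of collapsing everything into one partition, a common rule of thumb is to aim for partitions roughly the size of an HDFS block (128 MB is a typical block size). The sketch below is plain Python arithmetic illustrating that heuristic; the `suggested_partitions` helper and the 10 GiB dataset size are my own illustrative assumptions, not Spark defaults or APIs:

```python
# A rough, illustrative heuristic for picking a partition count:
# target partitions near the HDFS block size rather than one giant
# partition. suggested_partitions is a hypothetical helper; the
# numbers are assumptions for illustration only.
import math

def suggested_partitions(dataset_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Return a partition count yielding ~target-sized partitions (at least 1)."""
    return max(1, math.ceil(dataset_bytes / target_partition_bytes))

dataset_size = 10 * 1024 ** 3  # a hypothetical 10 GiB dataset
print(suggested_partitions(dataset_size))  # -> 80 partitions of ~128 MB each
```

You could then pass such a number to `coalesce` or `repartition` rather than hard-coding 1, keeping each partition small enough to fit comfortably in an executor's memory.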