Tutorial by Examples

How does an RDD get partitioned? By default, a partition is created for each HDFS block, which is 64 MB by default. How do I balance my data across partitions? First, take a look at the three ways one can repartition the data: pass a second parameter, the desired minimum number of partitions, when creating the RDD; or use repartition() or coalesce(), both covered below.
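A minimal sketch of the first option, assuming a live SparkContext `sc` (as in the pyspark shell used in the examples below) and a hypothetical HDFS path:

# In the pyspark shell, `sc` already exists; otherwise create one first:
# from pyspark import SparkContext
# sc = SparkContext("local[4]", "partitioning-example")

# Hypothetical input path; minPartitions requests at least 8 partitions
rdd = sc.textFile("hdfs:///path/to/input.txt", minPartitions=8)
print(rdd.getNumPartitions())  # at least 8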
As mentioned in "Remarks", a partition is a part/slice/chunk of an RDD. Below is a minimal example of how to request a minimum number of partitions for your RDD:

In [1]: mylistRDD = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

In [2]: mylistRDD.getNumPartitions()
Out[2]: 2

Note in [1] how we passed 2 as a second parameter to sc.parallelize(): that is the minimum number of partitions we want for the RDD, and [2] confirms it.
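To see how the elements are actually laid out, glom() (a standard RDD method that returns the contents of each partition as a list) can be used; a small sketch continuing the session above:

In [3]: mylistRDD.glom().collect()
Out[3]: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

Each inner list is one partition, so the ten elements were split evenly across the two partitions we asked for.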
Sometimes we want to repartition an RDD, for example because it comes from a file that wasn't created by us, and the number of partitions defined by its creator is not the one we want. The two best-known functions to achieve this are:

repartition(numPartitions)

and:

coalesce(numPartitions, shuffle=False)
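A short sketch of both, assuming the same interactive `sc`: repartition() always shuffles and can increase or decrease the partition count, while coalesce() avoids a full shuffle by default and is mainly useful for decreasing it.

rdd = sc.parallelize(range(100), 10)

print(rdd.repartition(20).getNumPartitions())  # 20: grown, with a shuffle
print(rdd.coalesce(5).getNumPartitions())      # 5: shrunk, no full shuffle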
As a rule of thumb, you would want your RDD to have as many partitions as the product of the number of executors and the number of cores used, multiplied by 3 (or perhaps 4). Of course, that is a heuristic, and it really depends on your application, dataset and cluster configuration. Example:

In [1]: data = sc.textFile(…
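A sketch of that heuristic with hypothetical cluster figures (4 executors with 4 cores each) and a hypothetical input path:

num_executors = 4        # hypothetical: taken from your cluster configuration
cores_per_executor = 4   # hypothetical
target_partitions = num_executors * cores_per_executor * 3  # the x3 rule of thumb

data = sc.textFile("hdfs:///path/to/input.txt", minPartitions=target_partitions)
print(data.getNumPartitions())  # at least 48 with these figures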
To show the contents of an RDD, it has to be printed:

myRDD.foreach(println)

To limit the number of rows printed:

myRDD.take(num_of_rows).foreach(println)
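The same idea as a PySpark sketch: take() brings a bounded number of rows back to the driver, where print() shows them (on a cluster, foreach would print on the executors rather than the driver):

for row in myRDD.take(10):  # 10 is an arbitrary limit
    print(row)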
