Tutorial by Examples

How does an RDD get partitioned? By default, a partition is created for each HDFS block, which is 64 MB by default. How do I balance my data across partitions? First, take a look at the three ways one can repartition the data: pass a second parameter, the desired minimum number of partitions, when creating the RDD; or use repartition() or coalesce(), both covered below.
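A minimal sketch of the first option, assuming a live SparkContext `sc` (as in the pyspark shell used in the examples below) and a hypothetical HDFS path:

# In the pyspark shell, `sc` already exists; otherwise create one first:
# from pyspark import SparkContext
# sc = SparkContext("local[4]", "partitioning-example")

# Hypothetical input path; minPartitions requests at least 8 partitions
rdd = sc.textFile("hdfs:///path/to/input.txt", minPartitions=8)
print(rdd.getNumPartitions())  # at least 8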
As mentioned in "Remarks", a partition is a part/slice/chunk of an RDD. Below is a minimal example of how to request a minimum number of partitions for your RDD:

In [1]: mylistRDD = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

In [2]: mylistRDD.getNumPartitions()
Out[2]: 2

Note in [1] how we passed 2 as a second parameter to sc.parallelize(): that is the minimum number of partitions we want for the RDD, and [2] confirms it.
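To see how the elements are actually laid out, glom() (a standard RDD method that returns the contents of each partition as a list) can be used; a small sketch continuing the session above:

In [3]: mylistRDD.glom().collect()
Out[3]: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

Each inner list is one partition, so the ten elements were split evenly across the two partitions we asked for.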
Sometimes we want to repartition an RDD, for example because it comes from a file that wasn't created by us, and the number of partitions defined by its creator is not the one we want. The two best-known functions to achieve this are:

repartition(numPartitions)

and:

coalesce(numPartitions, shuffle=False)
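A short sketch of both, assuming the same interactive `sc`: repartition() always shuffles and can increase or decrease the partition count, while coalesce() avoids a full shuffle by default and is mainly useful for decreasing it.

rdd = sc.parallelize(range(100), 10)

print(rdd.repartition(20).getNumPartitions())  # 20: grown, with a shuffle
print(rdd.coalesce(5).getNumPartitions())      # 5: shrunk, no full shuffle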
As a rule of thumb, you would want your RDD to have as many partitions as the product of the number of executors and the number of cores used, multiplied by 3 (or perhaps 4). Of course, that is a heuristic, and it really depends on your application, dataset and cluster configuration. Example:

In [1]: data = sc.textFile(…
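A sketch of that heuristic with hypothetical cluster figures (4 executors with 4 cores each) and a hypothetical input path:

num_executors = 4        # hypothetical: taken from your cluster configuration
cores_per_executor = 4   # hypothetical
target_partitions = num_executors * cores_per_executor * 3  # the x3 rule of thumb

data = sc.textFile("hdfs:///path/to/input.txt", minPartitions=target_partitions)
print(data.getNumPartitions())  # at least 48 with these figures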
To show the contents of an RDD, it has to be printed:

myRDD.foreach(println)

To limit the number of rows printed:

myRDD.take(num_of_rows).foreach(println)
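The same idea as a PySpark sketch: take() brings a bounded number of rows back to the driver, where print() shows them (on a cluster, foreach would print on the executors rather than the driver):

for row in myRDD.take(10):  # 10 is an arbitrary limit
    print(row)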
