apache-spark Tutorial => Rule of Thumb about number of partitions

Example

As a rule of thumb, an RDD should have roughly as many partitions as the number of executors multiplied by the number of cores used per executor, multiplied by 3 (or perhaps 4). This is only a heuristic: the best value depends on your application, dataset, and cluster configuration.

Example:

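In [1]: data = sc.textFile(file)  # `file` holds the path to the input

In [2]: conf = sc.getConf()  # public accessor, instead of the private sc._conf

In [3]: total_cores = int(conf.get('spark.executor.instances')) * int(conf.get('spark.executor.cores'))  # both settings must be set explicitly

In [4]: data = data.coalesce(total_cores * 3)  # coalesce can only reduce the partition count; use repartition to increase it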
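
The arithmetic behind the heuristic can be sketched without a running cluster. The helper names below (target_partitions, resize_method) and the cluster figures are hypothetical, chosen only for illustration; they are not part of the Spark API. The second helper reflects the fact that coalesce without a shuffle can only lower the partition count, so raising it calls for repartition.

```python
def target_partitions(num_executors, cores_per_executor, factor=3):
    """Return executors * cores per executor * factor, per the rule of thumb."""
    if num_executors < 1 or cores_per_executor < 1 or factor < 1:
        raise ValueError("all arguments must be positive integers")
    return num_executors * cores_per_executor * factor


def resize_method(current_partitions, target):
    """Name the RDD method that can move `current_partitions` to `target`.

    coalesce() without a shuffle can only lower the partition count,
    so raising it requires repartition().
    """
    if target < current_partitions:
        return "coalesce"
    if target > current_partitions:
        return "repartition"
    return None  # already at the target, nothing to do


# Hypothetical cluster: 10 executors with 4 cores each.
target = target_partitions(10, 4)  # 10 * 4 * 3 = 120
method = resize_method(current_partitions=32, target=target)
print(target, method)  # prints "120 repartition"
```

With a real RDD, the current count would come from data.getNumPartitions(), and the chosen method could then be applied as data.coalesce(target) or data.repartition(target). Keep in mind that repartition triggers a full shuffle, so it is worth doing only when the extra parallelism pays for that cost.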
