apache-spark Tutorial => Serialize and Send python RDD to scala code

Example

In this step you serialize the Python RDD so that it can be handed to the JVM. The example relies on Spark's internal py4j gateway (underscore-prefixed, non-public attributes such as `_jvm` and `_jrdd`) to call a function from your own jar.

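from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

# Assumes an existing SparkContext named sc.
rdd = sc.parallelize(range(10000))
# Re-encode the RDD as batches of pickled objects, the layout SerDe.pythonToJava reads.
reserialized_rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
# Convert the reserialized RDD (not the original rdd) to a JavaRDD; True marks the data as batched.
rdd_java = rdd.ctx._jvm.SerDe.pythonToJava(reserialized_rdd._jrdd, True)

_jvm = sc._jvm  # View of the JVM exposed through the py4j gateway.
# myclass.apps.etc stands in for the package and class in your own jar.
_jvm.myclass.apps.etc.doSomethingByPythonRDD(rdd_java)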
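
The purpose of the `_reserialize` step can be illustrated without Spark. With the batched flag set to `True`, `SerDe.pythonToJava` treats each record as one pickled batch (a list) of objects and flattens the batches on the JVM side. The sketch below mimics that round trip in plain Python. It is only an illustration: the helper names `pickle_in_batches` and `unpickle_and_flatten` are hypothetical and not part of PySpark, and the fixed batch size is a simplifying assumption (the real `AutoBatchedSerializer` adjusts the batch size as it goes).

```python
import pickle
from itertools import islice


def pickle_in_batches(items, batch_size):
    """Hypothetical helper: yield pickled lists of up to batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch)


def unpickle_and_flatten(payloads):
    """Hypothetical helper: unpickle each batch and yield its elements one by one."""
    for payload in payloads:
        yield from pickle.loads(payload)


# Ten elements with a batch size of 4 give batches of 4, 4 and 2 items.
payloads = list(pickle_in_batches(range(10), batch_size=4))
restored = list(unpickle_and_flatten(payloads))
```

If the RDD were passed without this batched pickle layout, the receiving side would have no consistent format to decode, which is why the reserialized RDD, and not the original one, should be handed to `pythonToJava`.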
