Please try to provide minimal example input data in a format that answers can use directly, without tedious and time-consuming parsing: for example, an input file or a local collection, together with all the code required to create distributed data structures.
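A minimal sketch of what such input could look like, assuming a small local collection is enough to show the problem. The column names and values are made up for illustration. In spark-shell the `spark` session already exists, so the builder lines can be dropped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session so the snippet runs without a cluster.
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("minimal-example")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: a few rows are usually enough.
val df: DataFrame = Seq(
  (1L, "a", 2.0),
  (2L, "b", 3.5),
  (3L, "a", -1.0)
).toDF("id", "key", "value")
```

Anyone answering can paste this as is and get the same DataFrame, with no files to download or parse.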
When applicable, always include type information: a StructType or the output of Dataset.printSchema. Output from Dataset.show or print can look good, but it doesn't tell us anything about the underlying types.
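To illustrate the difference, here is a small sketch with made-up data in which one column holds numbers stored as strings. The session setup and column names are assumptions for the example only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("schema-example")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: "value" contains digits, but as strings.
val df: DataFrame = Seq(("1", 1), ("2", 2)).toDF("value", "n")

df.show()        // both columns look numeric in the rendered table
df.printSchema() // shows that "value" is a string and "n" is an integer
```

Posting the printSchema output alongside the data removes any guessing about whether a column is a string, a number, or something nested.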
If a particular problem occurs only at scale, use random data generators. Spark provides some useful utilities in org.apache.spark.mllib.random.RandomRDDs and org.apache.spark.graphx.util.GraphGenerators.
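As a rough sketch of how these generators might be used, the snippet below builds a large RDD of random doubles and a random graph. The sizes, partition count, and seed are arbitrary choices for illustration; a fixed seed is used so that others generate the same data.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("random-data-example")
  .getOrCreate()

// One million i.i.d. standard normal samples in 10 partitions, seed 42.
val normals: RDD[Double] =
  RandomRDDs.normalRDD(spark.sparkContext, 1000000L, 10, 42L)

// Random graph with 1000 vertices and log-normally distributed out-degrees.
val graph: Graph[Long, Int] =
  GraphGenerators.logNormalGraph(spark.sparkContext, numVertices = 1000)
```

Adjust the size until the problem shows up, and include the exact parameters in the question.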
Please use type annotations when possible. While your compiler can easily keep track of the types, it is not so easy for mere mortals. For example:
val lines: RDD[String] = rdd.map(someFunction)
or
def f(x: String): Int = ???
are better than:
val lines = rdd.map(someFunction)
and
def f(x: String) = ???
respectively.