apache-spark Tutorial => Example data and code

Example

Example Data

Please provide minimal example input data in a format that answers can use directly, without tedious and time-consuming parsing: for example, an input file or a local collection, together with all the code required to create the distributed data structures.

When applicable always include type information:

In the RDD-based API, use type annotations when necessary.
In the DataFrame-based API, provide schema information as a StructType or as the output of Dataset.printSchema.

Output from Dataset.show or print can look good but doesn't tell us anything about underlying types.
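As an illustration of the points above, here is a minimal sketch of example data built from a local collection with an explicit StructType. The object name ExampleData, the column names, and the sample rows are hypothetical and chosen only for illustration; the sketch assumes Spark 2.x or later with spark-sql on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical example data; names and values are illustrative only.
object ExampleData {
  // Explicit schema: column names, types, and nullability are visible to the reader.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  // A small local collection that anyone can paste into a shell.
  val rows: Seq[Row] = Seq(Row(1, "foo"), Row(2, null), Row(3, "bar"))

  // Builds the distributed DataFrame from the local rows and the schema.
  def build(spark: SparkSession): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
}
```

Calling ExampleData.build(spark).printSchema() on an existing SparkSession then prints the column types and nullability, which Dataset.show alone would not reveal.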

If a particular problem occurs only at scale, use random data generators (Spark provides some useful utilities in org.apache.spark.mllib.random.RandomRDDs and org.apache.spark.graphx.util.GraphGenerators).
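A rough sketch of how these generators might be used to produce reproducible data at scale follows. The object name RandomExampleData, the partition count, and the seed are assumptions made for illustration; the sketch assumes spark-mllib and spark-graphx are on the classpath.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

// Hypothetical helpers; sizes, partition counts, and seeds are illustrative only.
object RandomExampleData {
  // Standard-normal doubles; a fixed seed lets others regenerate the same data.
  def randomDoubles(sc: SparkContext, size: Long): RDD[Double] =
    RandomRDDs.normalRDD(sc, size, numPartitions = 4, seed = 42L)

  // Random graph whose out-degrees follow a log-normal distribution.
  def randomGraph(sc: SparkContext, numVertices: Int): Graph[Long, Int] =
    GraphGenerators.logNormalGraph(sc, numVertices, seed = 42L)
}
```

Stating the size and the seed in the question lets readers reproduce the problem without downloading any data.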

Code

Please use type annotations when possible. While your compiler can easily keep track of the types, it is not so easy for mere mortals. For example:

val lines: RDD[String] = rdd.map(someFunction)

def f(x: String): Int = ???

are better than:

val lines = rdd.map(someFunction)

and

def f(x: String) = ???

respectively.
