There are many ways of creating DataFrames. They can be created from local lists, from distributed RDDs, or by reading from data sources.
By importing the Spark SQL implicits, one can create a DataFrame from a local Seq, Array or RDD, as long as the contents are of a Product sub-type (tuples and case classes are well-known examples of Product sub-types). For example:
import sqlContext.implicits._
val df = Seq(
  (1, "First Value", java.sql.Date.valueOf("2010-01-01")),
  (2, "Second Value", java.sql.Date.valueOf("2010-02-01"))
).toDF("int_column", "string_column", "date_column")
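Since case classes are also Product sub-types, an RDD of case class instances can be converted the same way, with the column names taken from the field names. A minimal sketch (the case class Record and its fields are only illustrative, not part of the example above):

case class Record(id: Int, label: String)

val recordsRdd = sc.parallelize(Seq(
  Record(1, "First Value"),
  Record(2, "Second Value")
))

// Column names come from the case class fields: id, label
val recordsDf = recordsRdd.toDF()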
Another option is to use the createDataFrame method available on SQLContext. Like toDF, this allows creation from local lists or RDDs of Product sub-types, but the column names are not set in the same step. For example:
val df1 = sqlContext.createDataFrame(Seq(
  (1, "First Value", java.sql.Date.valueOf("2010-01-01")),
  (2, "Second Value", java.sql.Date.valueOf("2010-02-01"))
))
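Here the columns receive the default tuple names (_1, _2 and _3). If explicit names are wanted, they can be set afterwards, for instance with toDF; a small sketch reusing df1 from above:

// Rename the default tuple columns after creation
val df1Renamed = df1.toDF("int_column", "string_column", "date_column")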
Additionally, this approach allows creation from RDDs of Row instances, as long as a schema parameter is passed along to define the resulting DataFrame's schema. Example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Define the schema explicitly
val schema = StructType(List(
  StructField("integer_column", IntegerType, nullable = false),
  StructField("string_column", StringType, nullable = true),
  StructField("date_column", DateType, nullable = true)
))

val rdd = sc.parallelize(Seq(
  Row(1, "First Value", java.sql.Date.valueOf("2010-01-01")),
  Row(2, "Second Value", java.sql.Date.valueOf("2010-02-01"))
))

val df = sqlContext.createDataFrame(rdd, schema)
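To confirm that the schema was applied as intended, one can inspect it with printSchema (the output shown in the comments is indicative):

df.printSchema()
// root
//  |-- integer_column: integer (nullable = false)
//  |-- string_column: string (nullable = true)
//  |-- date_column: date (nullable = true)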
Probably the most common way to create a DataFrame is from a data source. One can create it from a Parquet file in HDFS, for example:
val df = sqlContext.read.parquet("hdfs:/path/to/file")
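Other built-in sources go through the same DataFrameReader API, for instance JSON, or the generic format/load form; the paths below are only illustrative:

// JSON source
val jsonDf = sqlContext.read.json("hdfs:/path/to/file.json")

// Generic form, specifying the format explicitly
val parquetDf = sqlContext.read.format("parquet").load("hdfs:/path/to/file")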