Spark data frame
In Apache Spark, a Dataset is a distributed collection of data. The Dataset is a new interface, added in Spark 1.6, that provides the benefits of RDDs together with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, and so on). The Dataset API is available only in Scala and Java; it is not available for Python or R.
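As a minimal sketch of what this looks like in practice, assuming a SparkSession named spark is already available (one is instantiated later in this section) and that the Person case class and its sample values are purely illustrative, a Dataset can be built from case class instances and transformed functionally:
import spark.implicits._

// Hypothetical case class used only for illustration
case class Person(name: String, age: Int)

// Construct a typed Dataset from JVM objects
val people = Seq(Person("Yin", 30), Person("Michael", 25)).toDS()

// Apply functional transformations on the typed objects
val names = people.filter(_.age > 26).map(_.name)
names.show()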
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. A DataFrame can be constructed from structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R.
To work with a Spark DataFrame, we first need to instantiate a SparkSession:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL")
  .config("spark.some.config.option", "")
  .getOrCreate()

import spark.implicits._
Next, we create a DataFrame from a JSON file using the spark.read.json function:
scala> val df = spark.read.json("/home/ubuntu/work/ml-resources/spark-ml/Chapter_01/data/example_one.json")
Note that Spark implicits are being used here to implicitly convert RDDs to DataFrame types:
org.apache.spark.sql
Class SparkSession.implicits$
Object org.apache.spark.sql.SQLImplicits
Enclosing class: SparkSession

Implicit methods available in Scala for converting common Scala objects into DataFrames.
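For instance, once spark.implicits._ is in scope, a local Scala collection of tuples or an existing RDD can be converted into a DataFrame with toDF. The following is a small sketch; the column names and sample values are made up for illustration:
import spark.implicits._

// Convert a local Scala collection of tuples into a DataFrame with named columns
val citiesDF = Seq(("Columbus", "Ohio"), ("San Jose", "California")).toDF("city", "state")

// The same implicit conversion works for an existing RDD
val peopleRDD = spark.sparkContext.parallelize(Seq(("Yin", 30), ("Michael", 25)))
val peopleDF = peopleRDD.toDF("name", "age")

peopleDF.show()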
Returning to the JSON example, the output of the spark.read.json call will be similar to the following listing:
df: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, name: string]
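The nested schema inferred from the JSON file can also be printed in tree form with the standard printSchema method (output omitted here):
scala> df.printSchema()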
Now, let's see how the data is actually loaded into the DataFrame:
scala> df.show
+-----------------+-------+
|          address|   name|
+-----------------+-------+
|  [Columbus,Ohio]|    Yin|
|[null,California]|Michael|
+-----------------+-------+
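As a brief, illustrative follow-up (not part of the original listing), standard DataFrame operations such as select and filter can now be applied to the loaded data; the nested address fields are reachable with dot notation (output omitted):
// Select the top-level name column together with the nested city field
df.select("name", "address.city").show()

// Filter rows on the nested state field
df.filter($"address.state" === "California").show()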