DataFrame API and SQL API
A DataFrame can be created in several ways:
- By executing SQL queries
- Loading external data such as Parquet, JSON, CSV, text, Hive, JDBC, and so on
- Converting RDDs to DataFrames (see the sketch after this list)
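As a quick illustration of the last approach, an RDD of tuples can be converted with toDF. This is a minimal sketch; the sample rows and the names rowsRDD and fromRDD are our own, and a SparkSession named spark (as provided by spark-shell) is assumed:
scala> import spark.implicits._ // brings the RDD-to-DataFrame conversions into scope
scala> val rowsRDD = spark.sparkContext.parallelize(Seq(("Alabama", 2010, 4785492), ("Alaska", 2010, 714031))) // hypothetical sample rows
scala> val fromRDD = rowsRDD.toDF("State", "Year", "Population") // column names supplied explicitly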
A DataFrame can also be created by loading a CSV file. We will look at a CSV file, statesPopulation.csv, and load it as a DataFrame.
The CSV file contains US state populations for the years 2010 to 2016, in the following format:
State      | Year | Population
Alabama    | 2010 | 4785492
Alaska     | 2010 | 714031
Arizona    | 2010 | 6408312
Arkansas   | 2010 | 2921995
California | 2010 | 37332685
Since the CSV file has a header, we can quickly load it into a DataFrame and let Spark infer the schema.
scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
statesDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]
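Schema inference costs an extra pass over the data, so the schema can also be supplied explicitly. The following is a sketch against the same file; the names statesSchema and statesWithSchemaDF are our own:
scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
scala> val statesSchema = StructType(Seq( // explicit schema instead of inference
     |   StructField("State", StringType, nullable = true),
     |   StructField("Year", IntegerType, nullable = true),
     |   StructField("Population", IntegerType, nullable = true)))
scala> val statesWithSchemaDF = spark.read.option("header", "true").schema(statesSchema).csv("statesPopulation.csv")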
Once the DataFrame is loaded, we can examine its schema:
scala> statesDF.printSchema
root
|-- State: string (nullable = true)
|-- Year: integer (nullable = true)
|-- Population: integer (nullable = true)
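The first creation method listed above, executing SQL queries, works against this same DataFrame once it is registered as a temporary view. A minimal sketch follows; the view name states and the query itself are our own choices:
scala> statesDF.createOrReplaceTempView("states") // register the DataFrame so SQL can reference it
scala> val populations2010DF = spark.sql("SELECT State, Population FROM states WHERE Year = 2010") // spark.sql returns a new DataFrame
scala> populations2010DF.show(5)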