Using SparkSession and SQL
Spark exposes many SQL-like operations that can be performed on a data frame. For example, we could load a data frame with product sales information from a CSV file:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("productsales.csv")
df.show()
The example:
- Starts a SparkSession (needed for most data access)
- Uses the session to read a CSV-formatted file that contains a header record
- Displays the initial rows
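Because the header option is set, Spark takes the column names from the first record of the file. A quick sketch for confirming what was loaded (the output depends on your file):

df.printSchema()   # shows each column's name and type; without
                   # schema inference, CSV columns load as strings
print(df.columns)  # the column names as a plain Python list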

We have a few interesting columns in the sales data:
- Actual sales for the products by division
- Predicted sales for the products by division
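With actual and predicted sales side by side, a natural next step is computing the gap between them. A minimal sketch, assuming the columns are named ACTUAL and PREDICT (hypothetical names; substitute the headers in your file):

from pyspark.sql import functions as F

# CSV columns load as strings, so cast them before doing arithmetic
# (ACTUAL and PREDICT are assumed column names)
gap = df.withColumn(
    "DIFF",
    F.col("ACTUAL").cast("double") - F.col("PREDICT").cast("double"),
)
gap.select("PRODUCT", "ACTUAL", "PREDICT", "DIFF").show()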
If this were a bigger file, we could use Spark SQL to determine the extent of the product list. The following groups the data frame by product and counts the rows in each group:
df.groupBy("PRODUCT").count().show()

The data frame groupBy function works very similarly to the SQL GROUP BY clause: it collects the rows in the dataset according to the values in the named column.
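Because the session also accepts literal SQL, the same aggregation can be expressed as a query by registering the data frame as a temporary view (the view name sales is an arbitrary choice):

# Register the data frame under a name so SQL can reference it
df.createOrReplaceTempView("sales")

# The SQL equivalent of df.groupBy("PRODUCT").count()
spark.sql("SELECT PRODUCT, COUNT(*) AS count FROM sales GROUP BY PRODUCT").show()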