0% found this document useful (0 votes)
1K views

DataFrame Operations Using A Json File

This Python code uses Spark SQL to read employee data from a JSON file into a DataFrame. It then filters the DataFrame to only rows where the stream is "JAVA" and writes the filtered DataFrame to a new Parquet file. It first reads the JSON, displays the DataFrame, coalesces and writes it to a Parquet file. Then it reads the Parquet, filters for "JAVA" stream, displays and writes the filtered DataFrame to a new Parquet file.

Uploaded by

Arpita Das
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as TXT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1K views

DataFrame Operations Using A Json File

This Python code uses Spark SQL to read employee data from a JSON file into a DataFrame. It then filters the DataFrame to only rows where the stream is "JAVA" and writes the filtered DataFrame to a new Parquet file. It first reads the JSON, displays the DataFrame, coalesces and writes it to a Parquet file. Then it reads the Parquet, filters for "JAVA" stream, displays and writes the filtered DataFrame to a new Parquet file.

Uploaded by

Arpita Das
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as TXT, PDF, TXT or read online on Scribd
You are on page 1/ 1

#Put your code here

from pyspark.sql import SparkSession


spark = SparkSession \
.builder \
.appName("Data Frame EMPLOYEE") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.json("emp.json")
df.show()
df.coalesce(1).write.parquet("Employees")
pf = spark.read.parquet("Employees")
dfNew = pf.filter(pf.stream=='JAVA')
dfNew.show()
dfNew.coalesce(1).write.parquet("JavaEmployees")

You might also like