Dataframe Basic Operations (Python)
Creating a SparkSession
The SparkSession is the entry point to programming with Spark SQL.
It allows you to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read
parquet files.
SparkSession.builder: The builder attribute is a class attribute of SparkSession that provides a way to configure
and create a SparkSession instance.
appName("Example App"): The appName method sets the name of the Spark application. This name will appear
in the Spark web UI and can help you identify your application among others running on the same cluster.
config("spark.some.config.option", "some-value"): The config method allows you to set various configuration
options for the Spark session. In this example, " spark.some.config.option " is a placeholder for an actual
configuration key, and "some-value" is the value for that configuration. You can set multiple configuration options
by chaining multiple config calls.
getOrCreate(): The getOrCreate method either retrieves an existing SparkSession if one already exists or creates a
new one if it does not. This ensures that you do not accidentally create multiple SparkSession instances in your
application.
Note: In Databricks, you do not need to create or override the SparkSession, as it is automatically created for each
notebook or job executed against the cluster. Databricks manages the SparkSession and SparkContext for you,
ensuring optimal configuration and resource usage.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Spark DataFrames") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Creating DataFrame
1. From a Python List of Tuples
%python
# List of tuples
data = [("John", 25), ("Doe", 30), ("Jane", 22)]
# Creating DataFrame
df_list = spark.createDataFrame(data, ["Name", "Age"])
# Display the DataFrame
df_list.show()
df_list: pyspark.sql.dataframe.DataFrame = [Name: string, Age: long]
+----+---+
|Name|Age|
+----+---+
|John| 25|
| Doe| 30|
|Jane| 22|
+----+---+
2. From a List of Dictionaries
%python
# List of dictionaries
data = [{"Name": "Alice", "Id": 1}, {"Name": "Bob", "Id": 2}, {"Name": "Cathy", "Id": 3}]
# Creating DataFrame
df_dict = spark.createDataFrame(data)
# Display the DataFrame
df_dict.show()
df_dict: pyspark.sql.dataframe.DataFrame = [Id: long, Name: string]
+---+-----+
| Id| Name|
+---+-----+
| 1|Alice|
| 2| Bob|
| 3|Cathy|
+---+-----+
3. From a List of Rows
%python
from pyspark.sql import Row
# List of Rows
data = [ Row(Name="Cathy", Id=1),
Row(Name="David", Id=2),
Row(Name="Eva", Id=3),
Row(Name="Frank", Id=4)]
# Creating DataFrame
df_row = spark.createDataFrame(data)
# Display the DataFrame
df_row.show()
df_row: pyspark.sql.dataframe.DataFrame = [Name: string, Id: long]
+-----+---+
| Name| Id|
+-----+---+
|Cathy| 1|
|David| 2|
| Eva| 3|
|Frank| 4|
+-----+---+
4. Creating a DataFrame from an RDD
%python
# Import necessary modules
from pyspark.sql import Row
# Create an RDD
rdd = spark.sparkContext.parallelize([
Row(Name="Alice", Age=25),
Row(Name="Bob", Age=30),
Row(Name="Cathy", Age=22),
Row(Name="David", Age=35),
Row(Name="Eva", Age=28),
Row(Name="Frank", Age=40)
])
# Convert RDD to DataFrame
df_rdd = spark.createDataFrame(rdd)
# Display the DataFrame
df_rdd.show()
df_rdd: pyspark.sql.dataframe.DataFrame = [Name: string, Age: long]
+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
| Bob| 30|
|Cathy| 22|
|David| 35|
| Eva| 28|
|Frank| 40|
+-----+---+
5. Reading an external file
spark.read: This is the entry point for reading data in Spark. It returns a DataFrameReader object that is used to read
data from various sources.
.format("csv"): Specifies the format of the data source. In this case, it indicates that the data is in CSV (Comma-
Separated Values) format.
.option("header", "true"): This option tells Spark that the first row of the CSV file contains the column names. If this
option is set to false, Spark will treat the first row as data. "true" means that the CSV file has a header row.
.option("inferSchema", "true"): This option tells Spark to automatically infer the data types of each column in the
CSV file. If this option is set to false, all columns will be read as strings (default behavior). "true" means that Spark will
try to infer the schema (data types) of the columns based on the data.
.load("/FileStore/tables/retail_db/customers"):
This method specifies the path to the CSV file or directory containing CSV files that you want to read.
customer_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("dbfs:/FileStore/tables/customers_300mb.csv")
customer_df: pyspark.sql.dataframe.DataFrame = [customer_id: integer, name: string ... 5 more fields]
6. Using StructType & StructField
%python
#employee data and schemas
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, DateType
from datetime import date
# Create dummy data as a list of lists
emp_data = [
[1, 101, "John Doe", 30, "M", 60000.0, date(2020, 1, 15)],
[2, 102, "Jane Smith", 25, "F", 65000.0, date(2019, 3, 10)],
[3, 101, "Mike Johnson", 35, "M", 70000.0, date(2018, 5, 20)],
[4, 103, "Emily Davis", 28, "F", 72000.0, date(2021, 7, 30)],
[5, 102, "Robert Brown", 40, "M", 80000.0, date(2017, 9, 25)],
[6, 101, "Linda Wilson", 32, "F", 68000.0, date(2020, 11, 5)],
[7, 103, "David Lee", 29, "M", 75000.0, date(2019, 12, 15)]]
# Define the schema
emp_schema = StructType([
StructField("empid", StringType(), True),
StructField("deptid", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("gender", StringType(), True),
StructField("salary", FloatType(), True),
StructField("hiredate", DateType(), True)
])
# Create DataFrame
df = spark.createDataFrame(emp_data, emp_schema)
#df = spark1.createDataFrame(data = emp_data, schema = emp_schema)
# Display the DataFrame
df.show()
df: pyspark.sql.dataframe.DataFrame = [empid: string, deptid: integer ... 5 more fields]
+-----+------+------------+---+------+-------+----------+
|empid|deptid| name|age|gender| salary| hiredate|
+-----+------+------------+---+------+-------+----------+
| 1| 101| John Doe| 30| M|60000.0|2020-01-15|
| 2| 102| Jane Smith| 25| F|65000.0|2019-03-10|
| 3| 101|Mike Johnson| 35| M|70000.0|2018-05-20|
| 4| 103| Emily Davis| 28| F|72000.0|2021-07-30|
| 5| 102|Robert Brown| 40| M|80000.0|2017-09-25|
| 6| 101|Linda Wilson| 32| F|68000.0|2020-11-05|
| 7| 103| David Lee| 29| M|75000.0|2019-12-15|
+-----+------+------------+---+------+-------+----------+
Basic DataFrame Operations
1. show() & display()
In Databricks, show() and display() are used to visualize DataFrames, but they have different functionalities:
show(): This is a method available on Spark DataFrames that prints the first n rows to the console. It is useful for
quick inspection of data but does not provide rich formatting or interactivity. You can specify the number of rows to
display, and it defaults to 20 rows if not specified.
display(): This is a Databricks-specific function that provides a rich, interactive view of the DataFrame. It is more
suitable for use within notebooks as it allows for better visualization, including sorting, filtering, and graphical
representation of data.
customer_df.show(5)
+-----------+----------+------+-----------+-------+-----------------+---------+
|customer_id| name| city| state|country|registration_date|is_active|
+-----------+----------+------+-----------+-------+-----------------+---------+
| 0|Customer_0| Pune|Maharashtra| India| 2023-01-19| true|
| 1|Customer_1| Pune|West Bengal| India| 2023-08-10| true|
| 2|Customer_2| Delhi|Maharashtra| India| 2023-08-05| true|
| 3|Customer_3|Mumbai| Telangana| India| 2023-06-04| true|
| 4|Customer_4| Delhi| Karnataka| India| 2023-03-15| false|
+-----------+----------+------+-----------+-------+-----------------+---------+
only showing top 5 rows
customer_df.display()
#display(customer_df)
(display() renders the DataFrame as an interactive, sortable table in the notebook output)
2. Columns & Prinschema()
In Spark, columns and printSchema() are used to inspect the structure of a DataFrame, but they serve different
purposes:
columns: This attribute returns a list of the column names in the DataFrame.
printSchema(): This method prints the schema of the DataFrame, including column names and data types, in a
tree format.
customer_df.columns
['customer_id',
'name',
'city',
'state',
'country',
'registration_date',
'is_active']
customer_df.printSchema()
root
|-- customer_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- country: string (nullable = true)
|-- registration_date: date (nullable = true)
|-- is_active: boolean (nullable = true)
3. Select specific columns
customer_df.select("name","city").show()
+-----------+---------+
| name| city|
+-----------+---------+
| Customer_0| Pune|
| Customer_1| Pune|
| Customer_2| Delhi|
| Customer_3| Mumbai|
| Customer_4| Delhi|
| Customer_5| Kolkata|
| Customer_6| Kolkata|
| Customer_7| Mumbai|
| Customer_8| Pune|
| Customer_9| Delhi|
|Customer_10|Hyderabad|
|Customer_11| Delhi|
|Customer_12| Delhi|
|Customer_13| Pune|
|Customer_14| Chennai|
|Customer_15|Hyderabad|
|Customer_16| Chennai|
|Customer_17| Pune|
|Customer_18| Chennai|
|Customer_19| Chennai|
+-----------+---------+
only showing top 20 rows
4. Filter rows
customer_df.filter(customer_df.city=="Hyderabad").show()
+-----------+------------+---------+-----------+-------+-----------------+---------+
|customer_id| name| city| state|country|registration_date|is_active|
+-----------+------------+---------+-----------+-------+-----------------+---------+
| 21| Customer_21|Hyderabad| Tamil Nadu| India| 2023-09-16| true|
| 25| Customer_25|Hyderabad|West Bengal| India| 2023-08-22| true|
| 34| Customer_34|Hyderabad| Telangana| India| 2023-10-20| true|
| 37| Customer_37|Hyderabad| Gujarat| India| 2023-03-13| false|
| 38| Customer_38|Hyderabad| Karnataka| India| 2023-06-19| false|
| 40| Customer_40|Hyderabad|Maharashtra| India| 2023-07-29| false|
| 44| Customer_44|Hyderabad| Telangana| India| 2023-08-18| false|
| 84| Customer_84|Hyderabad|Maharashtra| India| 2023-04-08| false|
| 100|Customer_100|Hyderabad|Maharashtra| India| 2023-12-30| false|
| 110|Customer_110|Hyderabad|Maharashtra| India| 2023-03-14| false|
| 118|Customer_118|Hyderabad| Gujarat| India| 2023-01-27| false|
| 134|Customer_134|Hyderabad|West Bengal| India| 2023-06-25| true|
| 137|Customer_137|Hyderabad| Tamil Nadu| India| 2023-03-11| true|
| 138|Customer_138|Hyderabad| Delhi| India| 2023-12-26| true|
| 149|Customer_149|Hyderabad| Karnataka| India| 2023-09-21| false|
| 150|Customer_150|Hyderabad|Maharashtra| India| 2023-11-10| false|
| 171|Customer_171|Hyderabad|West Bengal| India| 2023-12-24| true|
| 173|Customer_173|Hyderabad| Gujarat| India| 2023-05-30| false|
+-----------+------------+---------+-----------+-------+-----------------+---------+
only showing top 20 rows
customer_df.where(customer_df.city=="Hyderabad").show()
+-----------+------------+---------+-----------+-------+-----------------+---------+
|customer_id| name| city| state|country|registration_date|is_active|
+-----------+------------+---------+-----------+-------+-----------------+---------+
| 21| Customer_21|Hyderabad| Tamil Nadu| India| 2023-09-16| true|
| 25| Customer_25|Hyderabad|West Bengal| India| 2023-08-22| true|
| 34| Customer_34|Hyderabad| Telangana| India| 2023-10-20| true|
| 37| Customer_37|Hyderabad| Gujarat| India| 2023-03-13| false|
| 38| Customer_38|Hyderabad| Karnataka| India| 2023-06-19| false|
| 40| Customer_40|Hyderabad|Maharashtra| India| 2023-07-29| false|
| 44| Customer_44|Hyderabad| Telangana| India| 2023-08-18| false|
| 84| Customer_84|Hyderabad|Maharashtra| India| 2023-04-08| false|
| 100|Customer_100|Hyderabad|Maharashtra| India| 2023-12-30| false|
| 110|Customer_110|Hyderabad|Maharashtra| India| 2023-03-14| false|
| 118|Customer_118|Hyderabad| Gujarat| India| 2023-01-27| false|
| 134|Customer_134|Hyderabad|West Bengal| India| 2023-06-25| true|
| 137|Customer_137|Hyderabad| Tamil Nadu| India| 2023-03-11| true|
| 138|Customer_138|Hyderabad| Delhi| India| 2023-12-26| true|
| 149|Customer_149|Hyderabad| Karnataka| India| 2023-09-21| false|
| 150|Customer_150|Hyderabad|Maharashtra| India| 2023-11-10| false|
| 171|Customer_171|Hyderabad|West Bengal| India| 2023-12-24| true|
| 173|Customer_173|Hyderabad| Gujarat| India| 2023-05-30| false|
+-----------+------------+---------+-----------+-------+-----------------+---------+
only showing top 20 rows
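filter() and where() are aliases for the same operation. Conditions can also be combined with & (and), | (or), and ~ (not) on col() expressions; a minimal sketch, assuming the same customer_df:
%python
from pyspark.sql.functions import col
# Combine two conditions: active customers located in Hyderabad
customer_df.filter((col("city") == "Hyderabad") & (col("is_active") == True)).show(5)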
5. Create or replace new column
The withColumn method is used to create a new column or replace an existing column in a DataFrame.
df.withColumn("name","defination")
%python
from pyspark.sql.functions import col, concat, lit
# col: A function to reference a column in a DataFrame.
# concat: A function to concatenate multiple columns or strings.
# lit: A function to create a column with a literal value.
# Example: Adding a new column
df_with_new_column = customer_df.withColumn("full name", concat(col("name"), lit(" Singh")))
# Display the DataFrame
df_with_new_column.show()
df_with_new_column: pyspark.sql.dataframe.DataFrame = [customer_id: integer, name: string ... 6 more fields]
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
|customer_id| name| city| state|country|registration_date|is_active| full name|
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
| 0| Customer_0| Pune|Maharashtra| India| 2023-01-19| true| Customer_0 Singh|
| 1| Customer_1| Pune|West Bengal| India| 2023-08-10| true| Customer_1 Singh|
| 2| Customer_2| Delhi|Maharashtra| India| 2023-08-05| true| Customer_2 Singh|
| 3| Customer_3| Mumbai| Telangana| India| 2023-06-04| true| Customer_3 Singh|
| 4| Customer_4| Delhi| Karnataka| India| 2023-03-15| false| Customer_4 Singh|
| 5| Customer_5| Kolkata|West Bengal| India| 2023-08-19| true| Customer_5 Singh|
| 6| Customer_6| Kolkata| Tamil Nadu| India| 2023-04-21| false| Customer_6 Singh|
| 7| Customer_7| Mumbai| Telangana| India| 2023-05-23| true| Customer_7 Singh|
| 8| Customer_8| Pune| Tamil Nadu| India| 2023-07-17| true| Customer_8 Singh|
| 9| Customer_9| Delhi| Karnataka| India| 2023-06-02| true| Customer_9 Singh|
| 10|Customer_10|Hyderabad| Delhi| India| 2023-02-23| true|Customer_10 Singh|
| 11|Customer_11| Delhi|West Bengal| India| 2023-11-08| true|Customer_11 Singh|
| 12|Customer_12| Delhi| Delhi| India| 2023-06-27| false|Customer_12 Singh|
| 13|Customer_13| Pune|Maharashtra| India| 2023-02-03| true|Customer_13 Singh|
| 14|Customer_14| Chennai| Karnataka| India| 2023-04-06| true|Customer_14 Singh|
| 15|Customer_15|Hyderabad|West Bengal| India| 2023-03-31| true|Customer_15 Singh|
| 16|Customer_16| Chennai|Maharashtra| India| 2023-04-26| true|Customer_16 Singh|
| 17|Customer_17| Pune| Delhi| India| 2023-04-14| false|Customer_17 Singh|
| 18|Customer_18| Chennai|Maharashtra| India| 2023-02-04| false|Customer_18 Singh|
| 19|Customer_19| Chennai| Karnataka| India| 2023-01-22| true|Customer_19 Singh|
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
only showing top 20 rows
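The example above only adds a column; passing an existing column name to withColumn replaces that column instead. A minimal sketch of the replace case, assuming the same customer_df:
%python
from pyspark.sql.functions import upper, col
# Overwrite the existing "name" column with its upper-cased value
df_upper_name = customer_df.withColumn("name", upper(col("name")))
df_upper_name.show(5)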
withColumnRenamed
The withColumnRenamed method is used to rename a single column in a DataFrame.
%python
# Example: Renaming a column
df_renamed_column = df_with_new_column.withColumnRenamed("full name", "Full Name")
# Display the DataFrame
df_renamed_column.show()
df_renamed_column: pyspark.sql.dataframe.DataFrame = [customer_id: integer, name: string ... 6 more fields]
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
|customer_id| name| city| state|country|registration_date|is_active| Full Name|
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
| 0| Customer_0| Pune|Maharashtra| India| 2023-01-19| true| Customer_0 Singh|
| 1| Customer_1| Pune|West Bengal| India| 2023-08-10| true| Customer_1 Singh|
| 2| Customer_2| Delhi|Maharashtra| India| 2023-08-05| true| Customer_2 Singh|
| 3| Customer_3| Mumbai| Telangana| India| 2023-06-04| true| Customer_3 Singh|
| 4| Customer_4| Delhi| Karnataka| India| 2023-03-15| false| Customer_4 Singh|
| 5| Customer_5| Kolkata|West Bengal| India| 2023-08-19| true| Customer_5 Singh|
| 6| Customer_6| Kolkata| Tamil Nadu| India| 2023-04-21| false| Customer_6 Singh|
| 7| Customer_7| Mumbai| Telangana| India| 2023-05-23| true| Customer_7 Singh|
| 8| Customer_8| Pune| Tamil Nadu| India| 2023-07-17| true| Customer_8 Singh|
| 9| Customer_9| Delhi| Karnataka| India| 2023-06-02| true| Customer_9 Singh|
| 10|Customer_10|Hyderabad| Delhi| India| 2023-02-23| true|Customer_10 Singh|
| 11|Customer_11| Delhi|West Bengal| India| 2023-11-08| true|Customer_11 Singh|
| 12|Customer_12| Delhi| Delhi| India| 2023-06-27| false|Customer_12 Singh|
| 13|Customer_13| Pune|Maharashtra| India| 2023-02-03| true|Customer_13 Singh|
| 14|Customer_14| Chennai| Karnataka| India| 2023-04-06| true|Customer_14 Singh|
| 15|Customer_15|Hyderabad|West Bengal| India| 2023-03-31| true|Customer_15 Singh|
| 16|Customer_16| Chennai|Maharashtra| India| 2023-04-26| true|Customer_16 Singh|
| 17|Customer_17| Pune| Delhi| India| 2023-04-14| false|Customer_17 Singh|
| 18|Customer_18| Chennai|Maharashtra| India| 2023-02-04| false|Customer_18 Singh|
| 19|Customer_19| Chennai| Karnataka| India| 2023-01-22| true|Customer_19 Singh|
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
only showing top 20 rows
6. Dropping a Column
The drop method is used to remove one or more columns from a DataFrame.
# Dropping a single column
df_dropped_column = df_renamed_column.drop("Full Name")
# Display the DataFrame
df_dropped_column.show()
df_dropped_column: pyspark.sql.dataframe.DataFrame = [customer_id: integer, name: string ... 5 more fields]
+-----------+-----------+---------+-----------+-------+-----------------+---------+
|customer_id| name| city| state|country|registration_date|is_active|
+-----------+-----------+---------+-----------+-------+-----------------+---------+
| 0| Customer_0| Pune|Maharashtra| India| 2023-01-19| true|
| 1| Customer_1| Pune|West Bengal| India| 2023-08-10| true|
| 2| Customer_2| Delhi|Maharashtra| India| 2023-08-05| true|
| 3| Customer_3| Mumbai| Telangana| India| 2023-06-04| true|
| 4| Customer_4| Delhi| Karnataka| India| 2023-03-15| false|
| 5| Customer_5| Kolkata|West Bengal| India| 2023-08-19| true|
| 6| Customer_6| Kolkata| Tamil Nadu| India| 2023-04-21| false|
| 7| Customer_7| Mumbai| Telangana| India| 2023-05-23| true|
| 8| Customer_8| Pune| Tamil Nadu| India| 2023-07-17| true|
| 9| Customer_9| Delhi| Karnataka| India| 2023-06-02| true|
| 10|Customer_10|Hyderabad| Delhi| India| 2023-02-23| true|
| 11|Customer_11| Delhi|West Bengal| India| 2023-11-08| true|
| 12|Customer_12| Delhi| Delhi| India| 2023-06-27| false|
| 13|Customer_13| Pune|Maharashtra| India| 2023-02-03| true|
| 14|Customer_14| Chennai| Karnataka| India| 2023-04-06| true|
| 15|Customer_15|Hyderabad|West Bengal| India| 2023-03-31| true|
| 16|Customer_16| Chennai|Maharashtra| India| 2023-04-26| true|
| 17|Customer_17| Pune| Delhi| India| 2023-04-14| false|
| 18|Customer_18| Chennai|Maharashtra| India| 2023-02-04| false|
| 19|Customer_19| Chennai| Karnataka| India| 2023-01-22| true|
+-----------+-----------+---------+-----------+-------+-----------------+---------+
only showing top 20 rows
Dropping Multiple Columns
%python
# Dropping multiple columns
df_dropped_columns = df_renamed_column.drop("name", "country")
# Display the DataFrame
df_dropped_columns.show()
df_dropped_columns: pyspark.sql.dataframe.DataFrame = [customer_id: integer, city: string ... 4 more fields]
+-----------+---------+-----------+-----------------+---------+-----------------+
|customer_id| city| state|registration_date|is_active| Full Name|
+-----------+---------+-----------+-----------------+---------+-----------------+
| 0| Pune|Maharashtra| 2023-01-19| true| Customer_0 Singh|
| 1| Pune|West Bengal| 2023-08-10| true| Customer_1 Singh|
| 2| Delhi|Maharashtra| 2023-08-05| true| Customer_2 Singh|
| 3| Mumbai| Telangana| 2023-06-04| true| Customer_3 Singh|
| 4| Delhi| Karnataka| 2023-03-15| false| Customer_4 Singh|
| 5| Kolkata|West Bengal| 2023-08-19| true| Customer_5 Singh|
| 6| Kolkata| Tamil Nadu| 2023-04-21| false| Customer_6 Singh|
| 7| Mumbai| Telangana| 2023-05-23| true| Customer_7 Singh|
| 8| Pune| Tamil Nadu| 2023-07-17| true| Customer_8 Singh|
| 9| Delhi| Karnataka| 2023-06-02| true| Customer_9 Singh|
| 10|Hyderabad| Delhi| 2023-02-23| true|Customer_10 Singh|
| 11| Delhi|West Bengal| 2023-11-08| true|Customer_11 Singh|
| 12| Delhi| Delhi| 2023-06-27| false|Customer_12 Singh|
| 13| Pune|Maharashtra| 2023-02-03| true|Customer_13 Singh|
| 14| Chennai| Karnataka| 2023-04-06| true|Customer_14 Singh|
| 15|Hyderabad|West Bengal| 2023-03-31| true|Customer_15 Singh|
| 16| Chennai|Maharashtra| 2023-04-26| true|Customer_16 Singh|
| 17| Pune| Delhi| 2023-04-14| false|Customer_17 Singh|
| 18| Chennai|Maharashtra| 2023-02-04| false|Customer_18 Singh|
| 19| Chennai| Karnataka| 2023-01-22| true|Customer_19 Singh|
+-----------+---------+-----------+-----------------+---------+-----------------+
only showing top 20 rows
7. Removing Duplicate Rows
%python
# Removing duplicate rows
df_distinct = df_renamed_column.distinct()
# Display the DataFrame
df_distinct.show()
df_distinct: pyspark.sql.dataframe.DataFrame = [customer_id: integer, name: string ... 6 more fields]
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
|customer_id| name| city| state|country|registration_date|is_active| Full Name|
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
| 5| Customer_5| Kolkata|West Bengal| India| 2023-08-19| true| Customer_5 Singh|
| 6| Customer_6| Kolkata| Tamil Nadu| India| 2023-04-21| false| Customer_6 Singh|
| 3| Customer_3| Mumbai| Telangana| India| 2023-06-04| true| Customer_3 Singh|
| 16|Customer_16| Chennai|Maharashtra| India| 2023-04-26| true|Customer_16 Singh|
| 12|Customer_12| Delhi| Delhi| India| 2023-06-27| false|Customer_12 Singh|
| 20|Customer_20| Pune| Karnataka| India| 2023-02-19| false|Customer_20 Singh|
| 11|Customer_11| Delhi|West Bengal| India| 2023-11-08| true|Customer_11 Singh|
| 4| Customer_4| Delhi| Karnataka| India| 2023-03-15| false| Customer_4 Singh|
| 19|Customer_19| Chennai| Karnataka| India| 2023-01-22| true|Customer_19 Singh|
| 7| Customer_7| Mumbai| Telangana| India| 2023-05-23| true| Customer_7 Singh|
| 14|Customer_14| Chennai| Karnataka| India| 2023-04-06| true|Customer_14 Singh|
| 1| Customer_1| Pune|West Bengal| India| 2023-08-10| true| Customer_1 Singh|
| 13|Customer_13| Pune|Maharashtra| India| 2023-02-03| true|Customer_13 Singh|
+-----------+-----------+---------+-----------+-------+-----------------+---------+-----------------+
only showing top 20 rows
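distinct() removes rows that are identical across all columns. A related method, dropDuplicates(), can deduplicate on a subset of columns instead; a minimal sketch, assuming the same df_renamed_column:
%python
# Keep one row per (city, state) combination; the surviving row for each pair is arbitrary
df_dedup_subset = df_renamed_column.dropDuplicates(["city", "state"])
df_dedup_subset.show()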
Aggregation
Will cover in detail tomorrow
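The count-by-city table below is the result of a grouped aggregation; a minimal sketch of the kind of call that produces it, assuming the same customer_df:
%python
# Group by city and count the rows in each group
customer_df.groupBy("city").count().show()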
+---------+------+
| city| count|
+---------+------+
|Bangalore|661013|
| Chennai|660249|
| Mumbai|661241|
|Ahmedabad|660218|
| Kolkata|660174|
| Pune|660737|
| Delhi|661025|
|Hyderabad|662281|
+---------+------+