How to Clean Data Using SQL

Uploaded by

Yaswanth Reddy


Introduction

Data cleaning is a critical step in any data analysis or data science project. Without proper data
cleaning, your analysis may lead to inaccurate or misleading results. This enhanced guide covers
essential SQL data cleaning techniques with practical examples, step-by-step strategies, and real-
world input/output demonstrations.

1. Handling Missing Values

Problem: Missing values can lead to inaccurate analysis or cause errors during joins and
aggregations.

Solution: Use COALESCE() or IFNULL() to replace missing values with defaults.

Example with Data:

-- Input Data (users table)

/*
| user_id | email |
|---------|---------------------|
| 1 | john@example.com |
| 2 | NULL |
| 3 | sarah@example.com |
| 4 | NULL |
| 5 | mike@example.com |
*/

SELECT user_id, COALESCE(email, 'unknown') AS cleaned_email

FROM users;

Output:
/*
| user_id | cleaned_email |
|---------|---------------------|
| 1 | john@example.com |
| 2 | unknown |
| 3 | sarah@example.com |
| 4 | unknown |
| 5 | mike@example.com |
*/
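
Before choosing a default, it can help to count how many values are actually missing. The following is a minimal sketch that runs the query above against the same sample rows. Using SQLite through Python's standard sqlite3 module is an assumption made here so the example is runnable; the guide itself does not name a database.

```python
import sqlite3

# In-memory SQLite database (an assumption; the guide does not name a database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "john@example.com"), (2, None), (3, "sarah@example.com"),
     (4, None), (5, "mike@example.com")],
)

# Count missing emails before deciding on a default
missing_count = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]

# Replace NULLs at query time; the stored rows are left unchanged
cleaned = conn.execute(
    "SELECT user_id, COALESCE(email, 'unknown') FROM users ORDER BY user_id"
).fetchall()

print(missing_count)  # 2
print(cleaned[1])     # (2, 'unknown')
```

Because COALESCE() is applied in the SELECT, the original NULLs remain in the table and can still be distinguished from real values later.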

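2. Removing Duplicates

Problem: Duplicates in data can distort results and lead to incorrect conclusions.

Solution: Use ROW_NUMBER() to rank the rows within each group and keep only the first one. The
example below treats multiple orders from the same user as duplicates and keeps the most recent.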

Example with Data:

-- Input Data (orders table)

/*
| order_id | user_id | created_at | amount |
|----------|---------|---------------------|--------|
| 101 | 1 | 2023-01-01 10:00:00 | 100 |
| 102 | 1 | 2023-01-02 11:00:00 | 150 |
| 103 | 2 | 2023-01-01 09:00:00 | 200 |
| 104 | 3 | 2023-01-03 12:00:00 | 120 |
| 105 | 3 | 2023-01-03 13:00:00 | 130 |
*/

WITH RankedRows AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY created_at DESC
        ) AS row_num
    FROM orders
)
SELECT order_id, user_id, created_at, amount
FROM RankedRows
WHERE row_num = 1;

Output:

| order_id | user_id | created_at | amount |
|----------|---------|---------------------|--------|
| 102 | 1 | 2023-01-02 11:00:00 | 150 |
| 103 | 2 | 2023-01-01 09:00:00 | 200 |
| 105 | 3 | 2023-01-03 13:00:00 | 130 |
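
The SELECT above only filters the result; the extra rows stay in the table. One way to remove them permanently is to drive a DELETE from the same ranking. This is a sketch under assumptions: it uses SQLite (3.25 or later, for window functions) through Python's sqlite3 module, and it assumes order_id uniquely identifies a row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, user_id INTEGER, created_at TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (101, 1, "2023-01-01 10:00:00", 100),
    (102, 1, "2023-01-02 11:00:00", 150),
    (103, 2, "2023-01-01 09:00:00", 200),
    (104, 3, "2023-01-03 12:00:00", 120),
    (105, 3, "2023-01-03 13:00:00", 130),
])

# Delete every row that is not the most recent one for its user
conn.execute("""
    DELETE FROM orders
    WHERE order_id IN (
        SELECT order_id FROM (
            SELECT order_id,
                   ROW_NUMBER() OVER (
                       PARTITION BY user_id ORDER BY created_at DESC
                   ) AS row_num
            FROM orders
        )
        WHERE row_num > 1
    )
""")

remaining_ids = [r[0] for r in conn.execute("SELECT order_id FROM orders ORDER BY order_id")]
print(remaining_ids)  # [102, 103, 105]
```

Running the SELECT version first and checking which rows would be dropped is a sensible precaution before any DELETE.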

3. Standardizing Data Formats

Problem: Inconsistent data formats can cause issues in comparisons or analysis.

Solution: Use LOWER(), UPPER(), and TRIM() to standardize text.

Example with Data:

-- Input Data (customers table)

/*
| customer_id | first_name |
|-------------|------------|
| 1 | JOHN |
| 2 | Mary |
| 3 | peter |
| 4 | " alice " |
| 5 | BOB |
*/

SELECT
customer_id,
TRIM(LOWER(first_name)) AS standardized_name
FROM customers;

Output:
/*
| customer_id | standardized_name |
|-------------|-------------------|
| 1 | john |
| 2 | mary |
| 3 | peter |
| 4 | alice |
| 5 | bob |
*/
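
To see why this matters for grouping and counting, the sketch below compares the number of distinct names before and after standardization. The rows are hypothetical (three spellings of one name plus a second name), and SQLite via Python's sqlite3 module is an assumption made so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT)")
# Hypothetical rows: three spellings of the same name plus one other name
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "JOHN"), (2, "john "), (3, " John"), (4, "Mary")])

raw_distinct, clean_distinct = conn.execute("""
    SELECT COUNT(DISTINCT first_name),
           COUNT(DISTINCT TRIM(LOWER(first_name)))
    FROM customers
""").fetchone()

print(raw_distinct, clean_distinct)  # 4 2
```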

4. Handling Outliers

Problem: Outliers can distort analysis results.

Solution: Identify and either remove or cap outliers.

Example with Data:

-- Input Data (orders table)

/*
| order_id | amount |
|----------|--------|
| 101 | 100 |
| 102 | 150 |
| 103 | 200 |
| 104 | 1200 | -- Outlier
| 105 | 130 |
*/

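-- Identifying outliers
-- With only five rows, no value can lie 3 standard deviations above the mean
-- (the 3-sigma threshold here is about 1776), so this small example uses 1.5.
-- On larger datasets, 3 is a common choice.
SELECT order_id, amount
FROM orders
WHERE amount > (SELECT AVG(amount) + 1.5 * STDDEV_SAMP(amount) FROM orders);

-- Capping outliers
-- MySQL rejects an UPDATE whose subquery reads the table being updated;
-- there, compute the threshold in a separate step first.
UPDATE orders
SET amount = (SELECT AVG(amount) + 1.5 * STDDEV_SAMP(amount) FROM orders)
WHERE amount > (SELECT AVG(amount) + 1.5 * STDDEV_SAMP(amount) FROM orders);

SELECT * FROM orders;

Output (after capping):

| order_id | amount |
|----------|--------|
| 101 | 100 |
| 102 | 150 |
| 103 | 200 |
| 104 | 1066 | -- Capped value (mean 356 + 1.5 * sample std. dev. 473.2, rounded)
| 105 | 130 |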
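
The threshold arithmetic is easy to check by hand on the five sample amounts. The sketch below uses Python's statistics module rather than SQL and assumes the sample standard deviation (dividing by n - 1); some databases default to the population version, which gives slightly different numbers.

```python
import statistics

amounts = [100, 150, 200, 1200, 130]  # sample data from the orders table above

mean = statistics.mean(amounts)        # 356
sample_sd = statistics.stdev(amounts)  # sample standard deviation (n - 1)

def flagged(multiplier):
    """Return the amounts above mean + multiplier * sample_sd."""
    threshold = mean + multiplier * sample_sd
    return [a for a in amounts if a > threshold]

print(round(mean + 3 * sample_sd))  # 1776
print(flagged(3))                   # []
print(flagged(1.5))                 # [1200]
```

On a sample this small, a 3-standard-deviation rule flags nothing, because the outlier itself inflates the standard deviation. A lower multiplier or a percentile-based rule may suit small tables better.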

5. Date Format Standardization

Problem: Inconsistent date formats can cause issues in time-based analysis.

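Solution: Parse each text value with TO_DATE(), using a format mask that matches that value, and
then use EXTRACT() to pull out components such as year and month.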

Example with Data:

-- Input Data (orders table)

/*
| order_id | order_date |
|----------|--------------|
| 101 | 01-01-2023 |
| 102 | 2023/02/15 |
| 103 | March 3 2023 |
| 104 | 04-04-2023 |
| 105 | 2023-05-05 |
*/

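-- Standardizing dates (PostgreSQL syntax): choose the format mask per row.
-- A single mask such as 'YYYY-MM-DD' cannot parse all five inputs.
-- Assumption: dash-separated values that start with two digits are DD-MM-YYYY.
WITH parsed AS (
    SELECT
        order_id,
        CASE
            WHEN order_date ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' THEN TO_DATE(order_date, 'YYYY-MM-DD')
            WHEN order_date ~ '^[0-9]{4}/[0-9]{2}/[0-9]{2}$' THEN TO_DATE(order_date, 'YYYY/MM/DD')
            WHEN order_date ~ '^[0-9]{2}-[0-9]{2}-[0-9]{4}$' THEN TO_DATE(order_date, 'DD-MM-YYYY')
            WHEN order_date ~ '^[A-Za-z]+ [0-9]{1,2} [0-9]{4}$' THEN TO_DATE(order_date, 'Month DD YYYY')
        END AS standardized_date  -- NULL when no pattern matches
    FROM orders
)
-- Extracting components
SELECT
    order_id,
    standardized_date,
    EXTRACT(YEAR FROM standardized_date) AS year,
    EXTRACT(MONTH FROM standardized_date) AS month
FROM parsed;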

Output:

| order_id | standardized_date | year | month |
|----------|-------------------|------|-------|
| 101 | 2023-01-01 | 2023 | 1 |
| 102 | 2023-02-15 | 2023 | 2 |
| 103 | 2023-03-03 | 2023 | 3 |
| 104 | 2023-04-04 | 2023 | 4 |
| 105 | 2023-05-05 | 2023 | 5 |
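
When the database offers no flexible date parser, the mixed formats can be normalized before loading instead. The following is a rough sketch in Python, not part of the original guide: it tries a list of candidate formats in order. The function name parse_order_date is hypothetical, and reading 01-01-2023 as day-month-year is an assumption.

```python
from datetime import datetime

# Candidate formats, tried in order. Treating 01-01-2023 as day-month-year
# is an assumption; use "%m-%d-%Y" instead if the source is month-first.
FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y", "%B %d %Y"]

def parse_order_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

raw = ["01-01-2023", "2023/02/15", "March 3 2023", "04-04-2023", "2023-05-05"]
parsed = [parse_order_date(r) for r in raw]
print([d.isoformat() for d in parsed])
# ['2023-01-01', '2023-02-15', '2023-03-03', '2023-04-04', '2023-05-05']
```

Values that match none of the formats come back as None, which makes unparseable dates easy to find and review.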

6. Correcting Data Entry Errors

Problem: Manual data entry often leads to formatting errors.

Solution: Use REGEXP (MySQL syntax; PostgreSQL uses the ~ operator) to detect values that do not
match the expected pattern, then correct them.

Example with Data:

-- Input Data (customers table)

/*
| customer_id | phone_number |
|-------------|---------------|
| 1 | 1234567890 |
| 2 | 234-567-8901 |
| 3 | 34567890 |
| 4 | (456)7890123 |
| 5 | 56789O1234 | -- Contains letter O
*/

-- Finding invalid phone numbers

SELECT customer_id, phone_number
FROM customers
WHERE phone_number NOT REGEXP '^[0-9]{10}$';

Output:

| customer_id | phone_number |
|-------------|---------------|
| 2 | 234-567-8901 |
| 3 | 34567890 |
| 4 | (456)7890123 |
| 5 | 56789O1234 |
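
Detection is only half of the job. Some of the flagged values differ from a valid number only by punctuation, so stripping that punctuation first narrows the list to rows that need manual review. This is a sketch assuming SQLite via Python's sqlite3 module; SQLite has no built-in REGEXP, so a small user function (named regexp here) is registered for it.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Return 1 when value matches pattern, else 0 (NULL never matches)."""
    return int(value is not None and re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
# In SQLite, "X REGEXP Y" calls the user function regexp(Y, X)
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE customers (customer_id INTEGER, phone_number TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    (1, "1234567890"), (2, "234-567-8901"), (3, "34567890"),
    (4, "(456)7890123"), (5, "56789O1234"),
])

# Strip common punctuation, then keep only the values that are still invalid
still_invalid = conn.execute("""
    SELECT customer_id
    FROM (
        SELECT customer_id,
               REPLACE(REPLACE(REPLACE(REPLACE(
                   phone_number, '-', ''), '(', ''), ')', ''), ' ', '') AS digits
        FROM customers
    )
    WHERE digits NOT REGEXP '^[0-9]{10}$'
    ORDER BY customer_id
""").fetchall()

print(still_invalid)  # [(3,), (5,)]
```

Rows 2 and 4 become valid once punctuation is removed. Row 3 is too short and row 5 contains a letter, so neither can be repaired automatically with confidence.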

7. Handling Null Values in Aggregations

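Problem: Aggregate functions skip NULLs, which can give results you did not intend. SUM() returns
NULL when every input is NULL, and AVG() divides only by the number of non-NULL rows.

Solution: Use COALESCE() to state explicitly how nulls should be treated.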

Example with Data:

-- Input Data (orders table)

/*
| order_id | amount |
|----------|--------|
| 101 | 100 |
| 102 | NULL |
| 103 | 200 |
| 104 | NULL |
| 105 | 150 |
*/

SELECT SUM(COALESCE(amount, 0)) AS total_amount FROM orders;

Output:

| total_amount |
|--------------|
| 450 |
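
The effect of COALESCE() is easier to see with AVG() and COUNT() than with SUM(). The sketch below runs several aggregates over the same five rows, assuming SQLite via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(101, 100), (102, None), (103, 200), (104, None), (105, 150)])

row = conn.execute("""
    SELECT SUM(amount),               -- NULLs skipped
           AVG(amount),               -- 450 / 3 non-NULL rows
           AVG(COALESCE(amount, 0)),  -- 450 / 5 rows
           COUNT(amount),             -- non-NULL rows only
           COUNT(*)                   -- all rows
    FROM orders
""").fetchone()

print(row)  # (450, 150.0, 90.0, 3, 5)
```

Whether a missing amount should count as zero or be left out of the average depends on what NULL means in the data, so the choice is worth making deliberately.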

8. Removing Leading/Trailing Spaces

Problem: Extra spaces can cause comparison issues.

Solution: Use TRIM() to remove unnecessary whitespace.

Example with Data:

-- Input Data (employees table)

/*
| emp_id | first_name |
|--------|-------------|
| 1 | " John " |
| 2 | " Mary " |
| 3 | "Peter " |
| 4 | " Alice" |
| 5 | "Bob " |
*/

SELECT emp_id, TRIM(first_name) AS trimmed_name FROM employees;

Output:

| emp_id | trimmed_name |
|--------|--------------|
| 1 | John |
| 2 | Mary |
| 3 | Peter |
| 4 | Alice |
| 5 | Bob |
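
The SELECT above trims at query time only. To fix the stored values so that later comparisons work without TRIM(), an UPDATE can be used. This is a sketch assuming SQLite via Python's sqlite3 module, where string comparison is exact by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, first_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, " John "), (2, " Mary "), (3, "Peter "), (4, " Alice"), (5, "Bob ")])

# An exact match fails because of the padding
raw_matches = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE first_name = 'John'").fetchone()[0]

# Clean the stored values once, touching only rows that actually change
conn.execute("UPDATE employees SET first_name = TRIM(first_name) "
             "WHERE first_name <> TRIM(first_name)")

clean_matches = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE first_name = 'John'").fetchone()[0]

print(raw_matches, clean_matches)  # 0 1
```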

9. Splitting Combined Columns into Multiple Columns

Problem: Data often comes combined in a single column (e.g., full names, addresses) and needs
to be split for analysis.

Solution: Use SUBSTRING(), SPLIT_PART(), or similar functions to separate values.

Example with Data:

-- Input Data (customers table)

/*
| customer_id | full_name |
|-------------|-----------------|
| 1 | John Smith |
| 2 | Mary Johnson |
| 3 | Peter Parker |
| 4 | Alice Williams |
| 5 | Bob Brown |
*/

SELECT
    customer_id,
    SUBSTRING(full_name, 1, POSITION(' ' IN full_name) - 1) AS first_name,
    SUBSTRING(full_name, POSITION(' ' IN full_name) + 1) AS last_name
FROM customers;

Output:

| customer_id | first_name | last_name |
|-------------|------------|-----------|
| 1 | John | Smith |
| 2 | Mary | Johnson |
| 3 | Peter | Parker |
| 4 | Alice | Williams |
| 5 | Bob | Brown |
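
The query above assumes every name contains a space. A single-word name has no space to split on, so it needs a guard. The sketch below shows one way to handle that case, assuming SQLite via Python's sqlite3 module, where INSTR() and SUBSTR() play the roles of POSITION() and SUBSTRING(). The row "Cher" is a hypothetical addition to exercise the edge case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, full_name TEXT)")
# "Cher" is a hypothetical single-word name added to exercise the edge case
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "John Smith"), (2, "Mary Johnson"), (3, "Cher")])

rows = conn.execute("""
    SELECT customer_id,
           CASE WHEN INSTR(full_name, ' ') > 0
                THEN SUBSTR(full_name, 1, INSTR(full_name, ' ') - 1)
                ELSE full_name END AS first_name,
           CASE WHEN INSTR(full_name, ' ') > 0
                THEN SUBSTR(full_name, INSTR(full_name, ' ') + 1)
                ELSE NULL END AS last_name
    FROM customers
    ORDER BY customer_id
""").fetchall()

print(rows)
# [(1, 'John', 'Smith'), (2, 'Mary', 'Johnson'), (3, 'Cher', None)]
```

Names with more than one space (middle names, multi-word surnames) still split at the first space only, so everything after it lands in last_name.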

10. Handling Inconsistent Categorical Values

Problem: Categorical data (e.g., product categories) may have inconsistent labels (e.g.,
"Electronics" vs. "ELECTRONICS").

Solution: Standardize categories using CASE statements or UPDATE queries.

Example with Data:

-- Input Data (products table)

/*
| product_id | category |
|------------|-----------------|
| 1 | Electronics |
| 2 | ELECTRONICS |
| 3 | books |
| 4 | Books |
| 5 | stationery |
*/

SELECT
product_id,
CASE
WHEN LOWER(category) LIKE '%electronic%' THEN 'Electronics'
WHEN LOWER(category) LIKE '%book%' THEN 'Books'
WHEN LOWER(category) LIKE '%stationery%' THEN 'Stationery'
ELSE category
END AS standardized_category
FROM products;

Output:

| product_id | standardized_category |
|------------|-----------------------|
| 1 | Electronics |
| 2 | Electronics |
| 3 | Books |
| 4 | Books |
| 5 | Stationery |
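
The Solution line also mentions UPDATE queries. A minimal sketch of that variant, which rewrites the stored labels with the same CASE expression, is shown here. SQLite via Python's sqlite3 module is an assumption made so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, category TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "Electronics"), (2, "ELECTRONICS"), (3, "books"),
                  (4, "Books"), (5, "stationery")])

# Rewrite the stored labels; unmatched categories are left as they are
conn.execute("""
    UPDATE products
    SET category = CASE
        WHEN LOWER(category) LIKE '%electronic%' THEN 'Electronics'
        WHEN LOWER(category) LIKE '%book%' THEN 'Books'
        WHEN LOWER(category) LIKE '%stationery%' THEN 'Stationery'
        ELSE category
    END
""")

categories = [r[0] for r in conn.execute(
    "SELECT DISTINCT category FROM products ORDER BY category")]
print(categories)  # ['Books', 'Electronics', 'Stationery']
```

Substring patterns such as '%book%' are broad and would also match a label like "Notebooks", so it is worth listing the distinct values before and after the update.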

Conclusion

This guide provides practical, real-world examples of data cleaning techniques in SQL. Each
concept is demonstrated with sample input data and the corresponding output after cleaning,
making the techniques more tangible and easier to understand. By following these methods, you
can ensure your data is clean, consistent, and ready for analysis.
