0% found this document useful (0 votes)
3 views12 pages

How to Clean Data Using SQL

Uploaded by

Yaswanth Reddy
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views12 pages

How to Clean Data Using SQL

Uploaded by

Yaswanth Reddy
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

How to Clean Data Using SQL

How to Clean Data Using SQL

Introduction

Data cleaning is a critical step in any data analysis or data science project. Without proper data
cleaning, your analysis may lead to inaccurate or misleading results. This enhanced guide covers
essential SQL data cleaning techniques with practical examples, step-by-step strategies, and real-
world input/output demonstrations.

1. Handling Missing Values

Problem: Missing values can lead to inaccurate analysis or cause errors during joins and
aggregations.

Solution: Use COALESCE() or IFNULL() to replace missing values with defaults.

Example with Data:

-- Input Data (users table)


/*
| user_id | email |
|---------|---------------------|
| 1 | john@example.com |
| 2 | NULL |
| 3 | sarah@example.com |
| 4 | NULL |
| 5 | mike@example.com |
*/

SELECT user_id, COALESCE(email, 'unknown') AS cleaned_email


FROM users;

Output:
/*
| user_id | cleaned_email |
|---------|---------------------|
| 1 | john@example.com |
| 2 | unknown |
| 3 | sarah@example.com |
| 4 | unknown |
| 5 | mike@example.com |

*/

2. Removing Duplicates
Problem: Duplicates in data can distort results and lead to incorrect conclusions.

Solution: Use ROW_NUMBER() to eliminate duplicate rows.

Example with Data:

-- Input Data (orders table)


/*
| order_id | user_id | created_at | amount |
|----------|---------|---------------------|--------|
| 101 | 1 | 2023-01-01 10:00:00 | 100 |
| 102 | 1 | 2023-01-02 11:00:00 | 150 |
| 103 | 2 | 2023-01-01 09:00:00 | 200 |
| 104 | 3 | 2023-01-03 12:00:00 | 120 |
| 105 | 3 | 2023-01-03 13:00:00 | 130 |
*/

WITH RankedRows AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY
created_at DESC) AS row_num
FROM orders
)
SELECT order_id, user_id, created_at, amount
FROM RankedRows
WHERE row_num = 1;
Output:

/*

| order_id | user_id | created_at | amount |


|----------|---------|---------------------|--------|
| 102 | 1 | 2023-01-02 11:00:00 | 150 |
| 103 | 2 | 2023-01-01 09:00:00 | 200 |
| 105 | 3 | 2023-01-03 13:00:00 | 130 |

*/

3. Standardizing Data Formats


Problem: Inconsistent data formats can cause issues in comparisons or analysis.

Solution: Use LOWER(), UPPER(), and TRIM() to standardize text.

Example with Data:

-- Input Data (customers table)


/*
| customer_id | first_name |
|-------------|------------|
| 1 | JOHN |
| 2 | Mary |
| 3 | peter |
| 4 | " alice " |
| 5 | BOB |
*/

SELECT
customer_id,
TRIM(LOWER(first_name)) AS standardized_name
FROM customers;

Output:
/*

| customer_id | standardized_name |
|-------------|-------------------|
| 1 | john |
| 2 | mary |
| 3 | peter |
| 4 | alice |
| 5 | bob |

*/

4. Handling Outliers
Problem: Outliers can distort analysis results.

Solution: Identify and either remove or cap outliers.

Example with Data:

-- Input Data (orders table)


/*
| order_id | amount |
|----------|--------|
| 101 | 100 |
| 102 | 150 |
| 103 | 200 |
| 104 | 1200 | -- Outlier
| 105 | 130 |
*/

-- Identifying outliers
SELECT order_id, amount
FROM orders
WHERE amount > (SELECT AVG(amount) + 3 * STDDEV(amount) FROM orders);

-- Capping outliers
UPDATE orders
SET amount = (SELECT AVG(amount) + 3 * STDDEV(amount) FROM orders)
WHERE amount > (SELECT AVG(amount) + 3 * STDDEV(amount) FROM orders);

SELECT * FROM orders;

Output (after capping):

/*

| order_id | amount |
|----------|--------|
| 101 | 100 |
| 102 | 150 |
| 103 | 200 |
| 104 | 356 | -- Capped value
| 105 | 130 |

*/

5. Date Format Standardization


Problem: Inconsistent date formats can cause issues in time-based analysis.

Solution: Use TO_DATE() or EXTRACT() functions.

Example with Data:

-- Input Data (orders table)


/*
| order_id | order_date |
|----------|--------------|
| 101 | 01-01-2023 |
| 102 | 2023/02/15 |
| 103 | March 3 2023 |
| 104 | 04-04-2023 |
| 105 | 2023-05-05 |
*/

-- Standardizing dates
SELECT
order_id,
TO_DATE(order_date, 'YYYY-MM-DD') AS standardized_date
FROM orders;

-- Extracting components
SELECT
order_id,
EXTRACT(YEAR FROM TO_DATE(order_date, 'YYYY-MM-DD')) AS year,
EXTRACT(MONTH FROM TO_DATE(order_date, 'YYYY-MM-DD')) AS month
FROM orders;

Output:

/*

| order_id | standardized_date | year | month |


|----------|-------------------|------|-------|
| 101 | 2023-01-01 | 2023 | 1 |
| 102 | 2023-02-15 | 2023 | 2 |
| 103 | 2023-03-03 | 2023 | 3 |
| 104 | 2023-04-04 | 2023 | 4 |
| 105 | 2023-05-05 | 2023 | 5 |

*/

6. Correcting Data Entry Errors


Problem: Manual data entry often leads to formatting errors.

Solution: Use REGEXP to detect and correct errors.

Example with Data:

-- Input Data (customers table)


/*
| customer_id | phone_number |
|-------------|---------------|
| 1 | 1234567890 |
| 2 | 234-567-8901 |
| 3 | 34567890 |
| 4 | (456)7890123 |
| 5 | 56789O1234 | -- Contains letter O
*/

-- Finding invalid phone numbers


SELECT customer_id, phone_number
FROM customers
WHERE phone_number NOT REGEXP '^[0-9]{10}$';

Output:

/*

| customer_id | phone_number |
|-------------|---------------|
| 2 | 234-567-8901 |
| 3 | 34567890 |
| 4 | (456)7890123 |
| 5 | 56789O1234 |

*/

7. Handling Null Values in Aggregations


Problem: Null values in aggregations can cause incorrect results.

Solution: Use COALESCE() to handle nulls.

Example with Data:

-- Input Data (orders table)


/*
| order_id | amount |
|----------|--------|
| 101 | 100 |
| 102 | NULL |
| 103 | 200 |
| 104 | NULL |
| 105 | 150 |
*/

SELECT SUM(COALESCE(amount, 0)) AS total_amount FROM orders;

Output:

/*

| total_amount |
|--------------|
| 450 |

*/

8. Removing Leading/Trailing Spaces


Problem: Extra spaces can cause comparison issues.

Solution: Use TRIM() to remove unnecessary whitespace.

Example with Data:

-- Input Data (employees table)


/*
| emp_id | first_name |
|--------|-------------|
| 1 | " John " |
| 2 | " Mary " |
| 3 | "Peter " |
| 4 | " Alice" |
| 5 | "Bob " |
*/

SELECT emp_id, TRIM(first_name) AS trimmed_name FROM employees;


Output:

/*

| emp_id | trimmed_name |
|--------|--------------|
| 1 | John |
| 2 | Mary |
| 3 | Peter |
| 4 | Alice |
| 5 | Bob |

*/

9. Splitting Combined Columns into Multiple Columns

Problem: Data often comes combined in a single column (e.g., full names, addresses) and needs
to be split for analysis.

Solution: Use SUBSTRING(), SPLIT_PART(), or similar functions to separate values.

Example with Data:

-- Input Data (customers table)


/*
| customer_id | full_name |
|-------------|-----------------|
| 1 | John Smith |
| 2 | Mary Johnson |
| 3 | Peter Parker |
| 4 | Alice Williams |
| 5 | Bob Brown |
*/

SELECT
customer_id,
SUBSTRING(full_name, 1, POSITION(' ' IN full_name) - 1) AS
first_name,
SUBSTRING(full_name, POSITION(' ' IN full_name) + 1) AS last_name
FROM customers;
Output:

/*

| customer_id | first_name | last_name |


|-------------|------------|-----------|
| 1 | John | Smith |
| 2 | Mary | Johnson |
| 3 | Peter | Parker |
| 4 | Alice | Williams |
| 5 | Bob | Brown |

*/

10. Handling Inconsistent Categorical Values

Problem: Categorical data (e.g., product categories) may have inconsistent labels (e.g.,
"Electronics" vs. "ELECTRONICS").

Solution: Standardize categories using CASE statements or UPDATE queries.

Example with Data:

-- Input Data (products table)


/*
| product_id | category |
|------------|-----------------|
| 1 | Electronics |
| 2 | ELECTRONICS |
| 3 | books |
| 4 | Books |
| 5 | stationery |
*/

SELECT
product_id,
CASE
WHEN LOWER(category) LIKE '%electronic%' THEN 'Electronics'
WHEN LOWER(category) LIKE '%book%' THEN 'Books'
WHEN LOWER(category) LIKE '%stationery%' THEN 'Stationery'
ELSE category
END AS standardized_category
FROM products;

Output:

/*

| product_id | standardized_category |
|------------|-----------------------|
| 1 | Electronics |
| 2 | Electronics |
| 3 | Books |
| 4 | Books |
| 5 | Stationery |

*/

Conclusion

This guide provides practical, real-world examples of data cleaning techniques in SQL. Each
concept is demonstrated with sample input data and the corresponding output after cleaning,
making the techniques more tangible and easier to understand. By following these methods, you
can ensure your data is clean, consistent, and ready for analysis.

You might also like