0% found this document useful (0 votes)
1 views

EDA_SQL_Document

EDA
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1 views

EDA_SQL_Document

EDA
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 3

Exploratory Data Analysis (EDA) Using SQL

1. Understanding the Dataset

- Data Overview:

Use SQL queries like SELECT TOP (5) * FROM table_name; or SELECT COLUMN_NAME,

DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'table_name';

to quickly understand the structure of your dataset. Identify whether the data types are appropriate

(categorical, numerical, dates).

2. Data Cleaning

- Check for Missing Values:

Identify missing values with a query like:

SELECT COUNT(*) AS missing_count FROM table_name WHERE column_name IS NULL;

- Check for Duplicates:

Find duplicate records using:

SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name HAVING

COUNT(*) > 1;

- Outlier Detection:

For numerical data, outliers can be detected using standard deviation:

SELECT AVG(column_name), STDEV(column_name) FROM table_name;

Filter out rows where values fall beyond a certain threshold:

SELECT * FROM table_name WHERE column_name > (AVG(column_name) + 3 *

STDEV(column_name));
- Data Type Corrections:

Use ALTER TABLE statements to ensure columns have the correct data type, e.g.:

ALTER TABLE table_name ALTER COLUMN column_name INT;

3. Descriptive Statistics

- Summary Statistics:

Get summary statistics (mean, min, max, etc.) with:

SELECT MIN(column_name), MAX(column_name), AVG(column_name), COUNT(*) FROM

table_name;

- For categorical data, get frequency distribution:

SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name;

4. Data Relationships

- Correlation Analysis:

Use aggregate SQL functions to mimic correlation.

- Scatter Plots:

SQL cannot directly create plots, but you can retrieve the necessary data for visualization.

5. Data Visualization:

SQL doesn't produce charts directly. Export results for visualization using external tools.

6. Handling Categorical Variables

- Encoding:
Use CASE statements to manually assign numerical values to categories.

- Frequency Analysis:

Analyze category frequency distribution with GROUP BY.

7. Feature Engineering:

Use SQL to create new features, bins, or calculated columns.

8. Outlier Treatment:

Identify and manage outliers using threshold-based queries.

9. Dimensionality Reduction:

SQL supports column filtering via SELECT.

10. Summarizing Findings:

Use GROUP BY and aggregations to reveal trends.

You might also like