What Is Exploratory Data Analysis (EDA)


What is Exploratory Data Analysis (EDA)?

EDA is the process of exploring and summarizing the main characteristics of a
dataset to uncover patterns, relationships, and trends.
It helps in formulating questions and making data-driven decisions.

Importance of EDA:

● Provides an initial understanding of the dataset.
● Helps in identifying data quality issues, such as missing values, outliers,
and inconsistencies.
● Guides the selection of appropriate statistical techniques and models.
● Helps in feature engineering and variable selection.
● Enables the discovery of meaningful insights and actionable conclusions.

Steps in EDA
1. Data Cleaning
2. Descriptive Statistics
3. Data Visualization
4. Data Distribution
5. Correlation Analysis
6. Outlier Detection or Anomaly Detection
7. Data Transformation

Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset. When
combining multiple data sources, there are many opportunities for data to be
duplicated or mislabeled.

Importance: Clean data increases overall productivity and gives your
decision-making the highest-quality information available.

Data Cleaning Techniques
Handling Missing Data: Fill in (impute) missing values or remove the affected
records.

Outlier Detection: Find and address unusual data that can affect analysis or
models.

Data Standardization: Make data consistent for easier analysis and comparison.

Data Validation: Check data against rules to ensure accuracy and reliability.
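
The techniques above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive cleaning pipeline. The records, the field names (id, city, age), and the 0 to 120 age rule are all assumptions made up for the example.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "city": " new york ", "age": 34},
    {"id": 2, "city": "Boston", "age": None},
    {"id": 2, "city": "Boston", "age": None},   # exact duplicate
    {"id": 3, "city": "BOSTON", "age": 29},
    {"id": 4, "city": "Chicago", "age": -5},    # fails the validation rule
]

def clean(records):
    rows = [dict(r) for r in records]  # work on copies, leave input untouched

    # Standardization: trim whitespace and use one capitalization style.
    for r in rows:
        r["city"] = r["city"].strip().title()

    # Duplicates: keep the first occurrence of each identical record.
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Validation: an age outside 0..120 is treated as missing (assumed rule).
    for r in unique:
        if r["age"] is not None and not 0 <= r["age"] <= 120:
            r["age"] = None

    # Missing data: fill gaps with the median of the known ages.
    known = [r["age"] for r in unique if r["age"] is not None]
    fill = median(known)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill
    return unique

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Filling with the median is only one option. Dropping the incomplete records is equally valid when few rows are affected.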

Descriptive Statistics
Descriptive statistics is the branch of statistics that focuses on summarizing and
describing the main features (attributes or variables) of a dataset.

Importance: Descriptive statistics condense large amounts of data into a few
numbers or graphs, which makes it easier to grasp the main features and patterns
of your data and to spot outliers or errors.
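
As a small illustration, the sketch below summarizes a made-up list of daily order counts with Python's built-in statistics module. The data and the variable names are assumptions chosen for the example.

```python
import statistics

# Hypothetical sample: daily order counts with one unusually large value.
orders = [12, 15, 11, 14, 13, 15, 90]

summary = {
    "count": len(orders),
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "mode": statistics.mode(orders),
    "stdev": statistics.stdev(orders),  # sample standard deviation
    "min": min(orders),
    "max": max(orders),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Here the mean sits well above the median, which is the kind of signal that points to an outlier (the value 90) worth a closer look.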

Data Visualization
Data visualization is the graphical representation of data and information. It
involves creating visual elements such as charts, graphs, and maps to help
people understand complex data patterns.

Importance of Data Visualization

1. Enhances Data Understanding

2. Identification of Patterns and Trends

3. Quick Problem Identification

4. Narrative Building

Types of Data and Visualization
Different data types require different visualization techniques to convey
insights accurately.

Data Distribution
Data distribution refers to how data is spread out or clustered around certain
values or ranges. It is a way to organize the trends and patterns observed in a
dataset so that they are easier to understand.
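
One simple way to look at a distribution is to count how many values fall into each fixed-width bin. The sketch below prints a text histogram. The scores and the bin width of 10 are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical exam scores.
scores = [52, 58, 61, 64, 67, 68, 70, 71, 73, 75, 77, 79, 82, 85, 93]
bin_width = 10  # assumed bin size

# Map each score to the lower edge of its bin, then count per bin.
bins = Counter((s // bin_width) * bin_width for s in scores)

# Print one row per bin, with one '#' per value in that bin.
for lower in sorted(bins):
    upper = lower + bin_width - 1
    print(f"{lower:3d}-{upper:3d} | {'#' * bins[lower]}")
```

The row lengths show where the values cluster (here, the 70s) and how thinly the tails are populated.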

Correlation Analysis
Correlation analysis is a form of bivariate analysis. It is primarily concerned
with finding out whether a relationship exists between two variables and then
determining the magnitude and direction of that relationship.
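
A common measure for this is the Pearson correlation coefficient, which ranges from -1 to 1. Its sign gives the direction and its absolute value the strength of a linear relationship. The sketch below computes it by hand. The function name pearson_r and the sample data are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence makes the denominator zero.
    return cov / sqrt(var_x * var_y)

# Hypothetical data: study hours against a score and against a cost.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
cost = [10, 8, 6, 4, 2]

r_pos = pearson_r(hours, score)  # strong positive relationship
r_neg = pearson_r(hours, cost)   # perfect negative relationship
print(f"hours vs score: {r_pos:.3f}")
print(f"hours vs cost:  {r_neg:.3f}")
```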

Outlier/Anomaly Detection
Outlier detection is the process of detecting unusual data points that lie far
away from the average values. Generally, it is a way to identify rare items in a
dataset.
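
One widely used rule of thumb, shown here as an assumption rather than the only option, is the 1.5 × IQR rule. It flags any value that lies more than 1.5 interquartile ranges below the first quartile or above the third quartile. The sample data is made up for the example.

```python
import statistics

# Hypothetical sample: daily order counts with one unusually large value.
orders = [12, 15, 11, 14, 13, 15, 90]

# Quartiles: quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(orders, n=4)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (conventional multiplier).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in orders if x < lower_fence or x > upper_fence]
print(f"fences: [{lower_fence}, {upper_fence}]  outliers: {outliers}")
```

Because quartiles are barely affected by extreme values, this rule is less easily distorted by the outliers themselves than a rule based on the mean and standard deviation.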

Data Transformation
Data transformation is the process of converting data from one format, such as a
database file, XML document, or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into a
cleansed, validated, and ready-to-use format.
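
As an illustration of a format conversion, the sketch below turns raw CSV text into cleaned JSON using only the standard library. The column names, the sample rows, and the helper to_clean_records are hypothetical. A real pipeline would read from a file or a database instead.

```python
import csv
import io
import json

# Hypothetical raw CSV export; the second row has no spend value.
raw_csv = """name,signup_date,spend
 alice ,2023-01-05,120.50
Bob,2023-02-11,
carol,2023-03-02,80
"""

def to_clean_records(text):
    """Parse CSV text into a list of cleaned, typed dictionaries."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            # Trim whitespace and normalize capitalization.
            "name": row["name"].strip().title(),
            "signup_date": row["signup_date"],
            # An empty spend becomes None (null in JSON) instead of "".
            "spend": float(row["spend"]) if row["spend"].strip() else None,
        })
    return records

records = to_clean_records(raw_csv)
as_json = json.dumps(records, indent=2)
print(as_json)
```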
