Exploratory Data Analysis - Python

The document discusses different types of data including structured, unstructured, natural language, machine-generated, graph-based, audio/video/images, and streaming data. It also outlines the typical data science process which involves defining goals, data retrieval, data cleaning/integration/transformation, exploratory data analysis, building models, and presenting findings.


Data

In computing, data is information that has been translated into a form that is efficient
for movement or processing.

Data Science

Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today. It adds methods from computer science to
the repertoire of statistics.

Benefits and uses of data science


• Data science and big data are used almost everywhere in both commercial and
noncommercial settings. Commercial companies in almost every industry use
data science and big data to gain insights into their customers, processes, staff,
competition, and products.
• Many companies use data science to offer customers a better user experience,
as well as to cross-sell, up-sell, and personalize their offerings.
• Governmental organizations are also aware of data’s value. Many governmental
organizations not only rely on internal data scientists to discover valuable
information, but also share their data with the public.
• Nongovernmental organizations (NGOs) use it to raise money and defend their
causes.
• Universities use data science in their research but also to enhance the study
experience of their students. The rise of massive open online courses (MOOC)
produces a lot of data, which allows universities to study how this type of
learning can complement traditional classes.

Facets of data
In data science and big data you’ll come across many different types of data, and each
of them tends to require different tools and techniques. The main categories of data
are these:

• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming

Let’s explore all these interesting data types.


Structured data

• Structured data is data that depends on a data model and resides in a fixed field
within a record. As such, it’s often easy to store structured data in tables within
databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query
data that resides in databases.
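To make this concrete, here is a minimal sketch of querying structured data with SQL from Python, using the built-in sqlite3 module and pandas; the database file and the people table are hypothetical.

import sqlite3
import pandas as pd

# Connect to a local SQLite database (hypothetical file name)
conn = sqlite3.connect('company.db')

# Structured data sits in fixed fields, so SQL can query it directly
query = "SELECT name, age FROM people WHERE age > 30"

# Load the query result straight into a pandas DataFrame for further analysis
df = pd.read_sql_query(query, conn)
print(df.head())

conn.close()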

Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is your
regular email.

Natural language
• Natural language is a special type of unstructured data; it’s challenging
to process because it requires knowledge of specific data science
techniques and linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize
well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of
every piece of text

Machine-generated data
• Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without human
intervention.
• Machine-generated data is becoming a major data resource and will
continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its
high volume and speed. Examples of machine data are web server
logs, call detail records, network event logs, and telemetry.


Graph-based or network data

• “Graph data” can be a confusing term because any data can be shown in a graph
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects
• The graph structures use nodes, edges, and properties to represent and store
graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and
the shortest path between two people.
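As an illustration, here is a minimal sketch of graph-based data in Python, assuming the third-party networkx library is installed; the people and friendships are made up.

import networkx as nx

# Build a tiny social network: nodes are people, edges are friendships
g = nx.Graph()
g.add_edges_from([
    ("Alice", "Bob"),
    ("Bob", "Carol"),
    ("Carol", "Dave"),
    ("Alice", "Eve"),
    ("Eve", "Dave"),
])

# Shortest path between two people (fewest hops)
print(nx.shortest_path(g, "Alice", "Dave"))

# A simple influence metric: degree centrality of each person
print(nx.degree_centrality(g))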

Audio, image, and video


• Audio, image, and video are data types that pose specific challenges to
a data scientist
• Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers
• MLBAM (Major League Baseball Advanced Media) announced in 2014
that they’ll increase video capture to approximately 7 TB per game for
the purpose of live, in-game analytics.
• Recently a company called DeepMind succeeded at creating an
algorithm that’s capable of learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning

Streaming data

• The data flows into the system when an event happens instead of being loaded
into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events,
and the stock market.

Data Science Process


The typical data science process consists of six steps through which you’ll iterate.

Steps of the Data Science Process:


Step 1: Defining research goals and creating a project charter
• Spend time understanding the goals and context of your research. Continue asking
questions and devising examples until you grasp the exact business expectations,
identify how your project fits in the bigger picture, appreciate how your research is
going to change the business, and understand how they’ll use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
1. A clear research goal
2. The project mission and context
3. How you’re going to perform your analysis
4. What resources you expect to use
5. Proof that it’s an achievable project, or proof of concepts
6. Deliverables and a measure of success
7. A timeline
Step 2: Retrieving Data
Start with data stored within the company
• Finding data even within your own company can sometimes be a challenge.
• This data can be stored in official data repositories such as databases, data marts,
data warehouses, and data lakes maintained by a team of IT professionals.
• Getting access to the data may take time and involve company policies.
Step 3: Cleansing, integrating, and transforming data
Cleaning:
• Data cleansing is a subprocess of the data science process that focuses on
removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
• The first type of error is the interpretation error, such as incorrect use of terminology,
like saying that a person’s age is greater than 300 years.
• The second type of error points to inconsistencies between data sources or
against your company’s standardized values. An example of this class of errors
is putting “Female” in one table and “F” in another when they represent the
same thing: that the person is female.
Integrating:
• Combining Data from different Data Sources.
• Your data comes from several different places, and in this substep we focus on
integrating these different sources.
• You can perform two operations to combine information from different data
sets. The first operation is joining and the second operation is appending or
stacking.
Joining Tables:
• Joining tables allows you to combine the information of one observation found
in one table with the information that you find in another table.
Appending Tables:
• Appending or stacking tables is effectively adding observations from one table
to another table. Both operations are sketched below.
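A minimal sketch of both operations with pandas, using small hypothetical tables:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [250, 120, 80]})

# Joining: combine the information of observations found in both tables via a shared key
joined = customers.merge(orders, on="customer_id", how="left")

# Appending (stacking): add the observations of one table below another table with the same columns
more_customers = pd.DataFrame({"customer_id": [4], "name": ["Dee"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)

print(joined)
print(stacked)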
Transforming Data
• Certain models require their data to be in a certain shape.
Reducing the Number of Variables
• Sometimes you have too many variables and need to reduce the number
because they don’t add new information to the model.
• Having too many variables in your model makes the model difficult to handle,
and certain techniques don’t perform well when you overload them with too
many input variables.
• Dummy variables can only take two values: true (1) or false (0). They’re used to
indicate the absence or presence of a categorical effect that may explain the observation.
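For example, a minimal sketch of creating dummy variables with pandas, using a hypothetical gender column:

import pandas as pd

df = pd.DataFrame({"gender": ["Female", "Male", "Female"],
                   "purchase": [100, 80, 120]})

# Turn the categorical column into 0/1 dummy variables
dummies = pd.get_dummies(df, columns=["gender"])
print(dummies)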
Step 4: Exploratory Data Analysis
• During exploratory data analysis you take a deep dive into the data.
• Information becomes much easier to grasp when shown in a picture, therefore you
mainly use graphical techniques to gain an understanding of your data and the
interactions between variables.
• Common visualizations include the bar plot, line plot, scatter plot, multiple plots,
Pareto diagram, link-and-brush diagram, histogram, and box-and-whisker plot.
Step 5: Build the Models
• Building models is the next step, with the goal of making better predictions,
classifying objects, or gaining an understanding of the system being modeled.
Step 6: Presenting findings and building applications on top of them
• The last stage of the data science process is where your soft skills will be most
useful, and yes, they’re extremely important.
• This involves presenting your results to the stakeholders and industrializing your
analysis process for repetitive reuse and integration with other tools.

Usage of Data Science Process


The Data Science Process is a systematic approach to solving data-related problems
and consists of the following steps:
1. Problem Definition: Clearly defining the problem and identifying the goal of the
analysis.
2. Data Collection: Gathering and acquiring data from various sources, including data
cleaning and preparation.
3. Data Exploration: Exploring the data to gain insights and identify trends, patterns,
and relationships.
4. Data Modeling: Building mathematical models and algorithms to solve problems
and make predictions.
5. Evaluation: Evaluating the model’s performance and accuracy using appropriate
metrics.
6. Deployment: Deploying the model in a production environment to make
predictions or automate decision-making processes.
7. Monitoring and Maintenance: Monitoring the model’s performance over time and
making updates as needed to improve accuracy.
Issues of Data Science Process
1. Data Quality and Availability: Data quality can affect the accuracy of the models
developed and therefore, it is important to ensure that the data is accurate,
complete, and consistent. Data availability can also be an issue, as the data required
for analysis may not be readily available or accessible.
2. Bias in Data and Algorithms: Bias can exist in data due to sampling techniques,
measurement errors, or imbalanced datasets, which can affect the accuracy of
models. Algorithms can also perpetuate existing societal biases, leading to unfair
or discriminatory outcomes.
3. Model Overfitting and Underfitting: Overfitting occurs when a model is too
complex and fits the training data too well, but fails to generalize to new data. On
the other hand, underfitting occurs when a model is too simple and is not able to
capture the underlying relationships in the data.
4. Model Interpretability: Complex models can be difficult to interpret and
understand, making it challenging to explain the model’s decisions and predictions.
This can be an issue when it comes to making business decisions or gaining
stakeholder buy-in.
5. Privacy and Ethical Considerations: Data science often involves the collection and
analysis of sensitive personal information, leading to privacy and ethical concerns.
It is important to consider privacy implications and ensure that data is used in a
responsible and ethical manner.
6. Technical Challenges: Technical challenges can arise during the data science
process such as data storage and processing, algorithm selection, and
computational scalability.

Exploratory Data Analysis (EDA)


Exploratory Data Analysis (EDA) is a process of describing the data using statistical
and visualization techniques to bring important aspects of that data into focus for
further analysis. It involves understanding the data sets by summarizing their
main characteristics, often plotting them visually. This step is especially important
when we arrive at modeling the data in order to apply machine learning.

EDA makes use of histograms, box plots, scatter plots, and many other techniques, and
exploring the data often takes considerable time. Through the process of EDA, we can
define or refine the problem statement for our data set, which is very important. EDA
involves a comprehensive range of activities, including data integration, analysis,
cleaning, transformation, and dimension reduction.

This approach commonly utilizes data visualization techniques to gain insights and
identify relevant patterns, anomalies, and hypotheses, ultimately facilitating the
manipulation of data sources in order to obtain desired answers. EDA plays a crucial
role in assisting data scientists in making informed decisions, testing hypotheses, and
validating assumptions.
Significance of EDA
Exploratory Data Analysis (EDA) holds immense significance in the realm of data
science and analytics for several reasons:

 Understanding the Data: EDA helps in understanding the structure,
relationships, and patterns present in the data. It gives insights into the data's
characteristics, such as its distribution, central tendency, and variability.
 Data Cleaning: Through EDA, data inconsistencies, missing values, outliers, and
other anomalies can be identified and addressed. This step is crucial for
ensuring the quality and reliability of the data used for analysis.
 Feature Selection: EDA aids in identifying the most relevant features or
variables for analysis. By examining their distributions and relationships with the
target variable, unnecessary or redundant features can be eliminated, leading
to more efficient and effective models.
 Detecting Patterns and Relationships: EDA techniques such as scatter plots,
correlation analysis, and clustering can reveal underlying patterns, trends, and
relationships within the data. This helps in formulating hypotheses and guiding
further analysis.
 Model Assumptions: EDA helps in validating assumptions required by different
modeling techniques. For example, normality assumptions for linear regression
or independence assumptions for time series analysis can be checked through
EDA.
 Communication and Visualization: EDA often involves creating visualizations
such as histograms, box plots, and heatmaps to represent the data graphically.
These visualizations not only aid in understanding the data but also in
communicating findings and insights to stakeholders effectively.
 Decision Making: EDA provides a foundation for informed decision-making. By
gaining a deeper understanding of the data, stakeholders can make better
decisions regarding business strategies, resource allocation, and problem-
solving.

In summary, EDA is a critical initial step in the data analysis process, enabling data
scientists and analysts to gain insights, clean and prepare the data, identify patterns
and relationships, and ultimately make informed decisions based on data-driven
evidence.
EDA techniques
1. Univariate Analysis: Univariate analysis examines individual variables to
understand their distributions and summary statistics.
2. Bivariate Analysis: This aspect of EDA explores the relationship between two
variables, uncovering patterns through techniques like scatter plots and
correlation analysis.
3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to more
than two variables. It aims to understand the complex interactions and
dependencies among multiple variables in a data set. Techniques such as
heatmaps, parallel coordinates, factor analysis, and principal component
analysis (PCA) are used for multivariate analysis (see the sketch after this list).
4. Visualization Techniques: EDA relies heavily on visualization methods to
depict data distributions, trends, and associations using various charts and
graphs.
5. Outlier Detection: EDA involves identifying outliers within the data, anomalies
that deviate significantly from the rest, employing tools such as box plots and
z-score analysis.
6. Statistical Tests: EDA often includes performing statistical tests to validate
hypotheses or discern significant differences between groups, adding depth to
the analysis process.
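A minimal sketch of the first three techniques with pandas, Matplotlib, and Seaborn, using a small made-up data set:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical numeric data set
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47],
    "income": [32000, 45000, 61000, 38000, 80000, 72000],
    "spend": [1200, 1800, 2500, 1500, 3900, 3300],
})

# Univariate: distribution and summary statistics of a single variable
print(df["age"].describe())

# Bivariate: relationship between two variables as a scatter plot
sns.scatterplot(x="income", y="spend", data=df)
plt.show()

# Multivariate: correlation matrix of all variables shown as a heatmap
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()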

Exploratory Data Analysis Tools


 Python with Libraries:
 Pandas: Pandas is a powerful data manipulation and analysis library in
Python. It provides data structures and functions for efficiently handling
and analysing structured data.
 NumPy: NumPy is a fundamental package for scientific computing in
Python. It provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions to operate
on these arrays.
 Matplotlib and Seaborn: These libraries are used for creating static,
animated, and interactive visualizations in Python. Matplotlib provides a
MATLAB-like interface, while Seaborn offers a higher-level interface for
creating attractive and informative statistical graphics.
 Plotly and Bokeh: These libraries are used for creating interactive
visualizations and dashboards in Python. They provide capabilities for
building web-based plots with interactivity and responsiveness.
 R Programming:
 R and RStudio: R is a programming language and environment
specifically designed for statistical computing and graphics. RStudio is
an integrated development environment (IDE) for R, providing a user-
friendly interface for data analysis and visualization.
 ggplot2: ggplot2 is a popular data visualization package in R, known for
its declarative syntax and powerful capabilities for creating customized
and publication-quality graphics.
 Microsoft Excel:

Excel is a widely used spreadsheet software that offers basic data analysis and
visualization capabilities. It provides features such as pivot tables, charts, and
functions for summarizing, filtering, and analysing data.

 RapidMiner:

RapidMiner is a data science platform that provides an integrated environment
for data preparation, machine learning, and predictive analytics. It offers visual
workflows and a range of built-in tools for EDA, modeling, and deployment.

 MATLAB:

MATLAB is a prominent commercial software package, especially among engineers, due
to its exceptional capabilities in mathematical computations. It can be applied
in EDA, although it necessitates a fundamental understanding of MATLAB
programming. Its solid mathematical foundation makes it a viable option for
data analysis tasks.

Python and R stand out as the most prevalent tools in data science for conducting
EDA. Python, an interpreted, object-oriented programming language, is a powerful
tool for EDA, helping with tasks such as identifying missing values before machine
learning. R, an open-source statistical computing language, is widely adopted by
statisticians for data science and facilitates statistical observation and analysis.

Visual Aids for EDA


Visual aids play a crucial role in Exploratory Data Analysis (EDA) by helping to
understand data distributions, patterns, and relationships. Here are some common
visual aids used in EDA:

 Histograms: Useful for showing the distribution of a single variable. They provide
insights into the shape, central tendency, variability, and presence of outliers.
 Box plots: Helpful for visualizing the distribution of a numerical variable across
different categories. They display the median, quartiles, and potential outliers.
 Scatter plots: Effective for exploring relationships between two continuous
variables. They can reveal patterns, correlations, clusters, or outliers.
 Bar charts: Suitable for displaying the distribution of a categorical variable.
They show the frequency or proportion of each category.
 Pie charts: Similar to bar charts, pie charts are useful for displaying the
composition of categorical variables. Each category is represented as a slice of
the pie, with its size proportional to its frequency or proportion.
 Heatmaps: Ideal for visualizing the correlation matrix between variables. They
use colour gradients to represent the strength and direction of correlations.
 Line plots: Useful for visualizing trends over time or across ordered categories.
They are commonly used in time series analysis or to track changes in a variable
over different conditions.
 Violin plots: Combines elements of box plots and kernel density plots to show
the distribution of a variable, including its probability density.

Steps Involved in Exploratory Data Analysis


EDA plays a vital role in comprehending and deriving valuable information from
datasets. It comprises various essential stages to proficiently examine and delve into
the data. Here are the main exploratory data analysis steps:

1. Setting Up Your Environment


Set up Python and import the necessary packages.

Installing Python:
• Python Installation: The most recent version of Python can be downloaded
from the official website, python.org, if it is not already installed on your
computer.

• Python Environment Management: To manage your Python packages, think
about utilizing virtual environments.

Bringing in Required Python Packages

An extensive library ecosystem is available for data analysis in Python. Three essential
libraries for EDA will be used:

• Pandas: pandas is a robust data manipulation library that offers functions and
data structures for handling structured data. It can be installed with pip:

“pip install pandas”


• Matplotlib: A flexible Python visualization toolkit for creating static, animated, and
interactive graphics. Matplotlib can be installed with:

“pip install matplotlib”

• Seaborn: A high-level interface for making visually appealing and informative
statistical graphics, built on top of Matplotlib. Seaborn can be installed using:

“pip install seaborn”

Open your Python environment and use the following commands in a Python script to
import these packages:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

Your EDA toolkit is built on these three libraries. Now that you have Python and
these packages installed, you can begin examining and evaluating your data.

2. Loading the Data


The first step in the EDA process is to obtain your dataset and load it into your Python
environment. It's common for you to work with a variety of data formats, such as SQL
databases, Excel, and CSV. We'll walk you through loading data from various sources
and running preliminary analyses to make sense of your dataset in this section.

• Reading Data from CSV Files

Pandas makes it simple to read data stored in CSV (Comma-Separated Values) files:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('your_data.csv')

• Examining the First Several Rows and Basic Data

It is important to look at the data's structure and content as soon as you have loaded it:

# Display the first few rows of the dataset

print(data.head())

# Display last few rows of the dataset


print(data.tail())

# Display the item using index

data.iloc[:, 5:10]

iloc selects rows and columns by position. The call above selects all rows (:) and the
columns from index 5 (column 6) to index 9 (column 10), and displays them.

# Select rows from index 4 to index 9 (inclusive)

data.iloc[4:10]

# Get basic information about the dataset, including data types and missing values

print(data.info())

This code shows the first few rows of your dataset, along with important details such as
the number of non-null entries and the data types of each column. These preliminary
checks help you judge the quality of the data and give you an understanding of what
you have.

# Display the number of observations (rows) and features (columns) in the dataset

print(data.shape)

# Display the number of non-null values in each column

print(data.count())

3. Pre-processing and Data Cleaning


Data cleaning and preprocessing are essential steps in the Exploratory Data Analysis
(EDA) process that come after loading your data. We'll go over key methods in this
section to make sure your dataset is prepared for insightful analysis.

• Missing Values Handling

Missing data can lead to inaccurate insights in your analysis, so appropriate
handling of missing values is essential. Pandas offers multiple
approaches to deal with missing data:

• Finding the Missing Values


Using the isna() and isnull() methods, you can determine which values in your dataset
are missing:

# Check for missing values in the entire dataset

missing_values = data.isnull().sum()

# Check for missing values in a specific column (e.g., 'column_name')

missing_values = data['column_name'].isnull().sum()
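Once located, missing values can be dropped or filled in. A minimal sketch of both approaches, reusing the hypothetical 'column_name' column from above:

# Drop every row that contains at least one missing value
data_dropped = data.dropna()

# Or fill missing values in a specific column, for example with the column mean
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())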

Eliminating Duplicates

Duplicate records in your dataset can lead to misleading results. To find
and eliminate duplicates:

# Identify and remove duplicate rows

data.drop_duplicates(inplace=True)

4. Overview of Basic Data


After loading and cleaning your data, it's time to familiarize yourself with the
fundamental features of the dataset. We'll go over how to create summary statistics
and show the data distribution in this section.

• Generating Summary Statistics

Pandas offers a quick and easy way to calculate summary statistics for your data. For
numerical columns, the describe() function provides a brief summary of important
statistics like mean, median, standard deviation, and quartiles:

# Generate summary statistics for numerical columns

summary_stats = data.describe()

5. Matplotlib and Seaborn Data Visualization


Exploratory Data Analysis (EDA) relies heavily on effective data visualization, and Matplotlib
and Seaborn are two powerful Python libraries for this purpose. These libraries are used
to build scatter plots, box plots, and histograms to analyse and understand the
correlations and other relationships between variables.

An Introduction to Seaborn and Matplotlib


• Matplotlib:

A complete Python visualization toolkit for static, animated, and interactive graphics
is called Matplotlib. It provides many customization options to create different kinds
of plots, ranging from simple charts to intricate visualizations.

• Seaborn:

Built on top of Matplotlib, Seaborn offers a high-level interface for producing visually
appealing and informative statistical graphics. By offering functions for common
tasks, it streamlines the process of producing intricate visualizations.

• How to Make Histograms

Histograms show the distribution of numerical data. Using Matplotlib, you can create a
histogram as follows:

import matplotlib.pyplot as plt

# Create a histogram

plt.hist(data['numeric_column'], bins=20, color='blue', alpha=0.7)

plt.xlabel('Value')

plt.ylabel('Frequency')

plt.title('Histogram of Numeric Column')

plt.show()
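Scatter plots, mentioned above, are created in much the same way; a minimal Seaborn sketch, assuming two hypothetical numeric columns 'x_column' and 'y_column' in the dataset:

import seaborn as sns
import matplotlib.pyplot as plt

# Create a scatter plot to inspect the relationship between two numeric columns
sns.scatterplot(x='x_column', y='y_column', data=data)
plt.title('Scatter Plot of y_column vs x_column')
plt.show()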

6. Recognizing and Managing Outliers


Data points that significantly differ from the majority of the data are known as
outliers, and they can have a big impact on the outcomes of your machine learning
and data analysis processes.

Finding the Outliers

Visualizations like box or scatter plots are frequently the first step in the outlier
detection process.

• Box Plots
Box plots, which show the distribution of the data and highlight data points outside
the "whiskers" (outliers), can be useful in locating possible outliers.

import seaborn as sns

import matplotlib.pyplot as plt

# Create a box plot to identify outliers

sns.boxplot(x=data['numeric_column'], color='purple')

plt.title('Box Plot of Numeric Column')

plt.show()
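Beyond recognizing outliers, one common (though not the only) way to manage them is the interquartile range (IQR) rule that box plot whiskers are based on; a minimal sketch for the hypothetical 'numeric_column':

# Compute the interquartile range of the column
q1 = data['numeric_column'].quantile(0.25)
q3 = data['numeric_column'].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles fall outside the box plot whiskers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows whose value lies inside the whiskers
data_no_outliers = data[(data['numeric_column'] >= lower) &
                        (data['numeric_column'] <= upper)]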

Summary:

Exploratory Data Analysis (EDA) is a critical starting point in any data science project,
enabling us to gain a deep understanding of our data's quality, patterns, and
relationships. Through the use of essential Python libraries like Pandas, Matplotlib, and
Seaborn, we can efficiently load, clean, and visualize data, revealing hidden insights
and guiding feature engineering and modeling decisions. EDA empowers us to identify
and address issues like missing data and outliers, enabling the creation of informative
data visualizations and ultimately enhancing our ability to make data-driven decisions
and effectively communicate findings to stakeholders.
