Statistical Data Analysis: Studi Independen - 2022
DATA ANALYSIS
DATA SCIENCE
GROUP 1
Concerned with specifying the target population.
Arranges, analyzes and presents the data in a meaningful way.
Concluding outcomes are represented in the form of charts, tables and graphs.

Makes inferences from the sample and generalizes them to the population.
Correlates, tests and anticipates future outcomes.
Final outcomes are probability scores.
WHAT IS EDA?
Exploratory data analysis, popularly known as EDA, is the process of
performing initial investigations on a dataset to discover its
structure and content. It is often known as Data Profiling.
EDA is where we get a basic understanding of the data at hand,
which then helps us in the further process of Data Cleaning & Data
Preparation.
There are two types of analysis: univariate and bivariate. In
univariate analysis, you analyze a single attribute. In bivariate
analysis, you analyze an attribute together with the target attribute.
Univariate non-graphical
This is the simplest form of data analysis, where the data being analyzed consists of just one variable.
Since it is a single variable, the analysis does not deal with causes or relationships. The main purpose of
univariate analysis is to describe the data and find patterns that exist within it.
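As a minimal sketch of univariate non-graphical analysis, the snippet below describes a single variable with a few summary statistics. It uses only Python's standard `statistics` module; the `ages` list is a hypothetical sample made up for illustration, not data from this course.

```python
import statistics

# Hypothetical sample: ages of ten survey respondents (illustrative values only).
ages = [23, 25, 25, 29, 31, 34, 34, 34, 40, 45]

# Describe the single variable with basic summary statistics.
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

With a pandas DataFrame, `df["column"].describe()` would give a similar overview in one call.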
Univariate graphical
Non-graphical methods don’t provide a full picture of the data. Graphical
methods are therefore required. Common types of univariate graphics include:
Stem-and-leaf plots, which show all data values and the shape of the
distribution.
Histograms, bar plots in which each bar represents the frequency (count)
or proportion (count/total count) of cases for a range of values.
Box plots, which graphically depict the five-number summary of minimum,
first quartile, median, third quartile, and maximum.
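The sketch below shows, under stated assumptions, the numbers behind two of these graphics: the five-number summary that a box plot draws and a text version of a stem-and-leaf plot. The `scores` list and both helper functions are hypothetical names introduced for illustration. Quartiles are computed with the "inclusive" method of `statistics.quantiles`; other tools may use a different quartile convention and return slightly different values.

```python
import statistics
from collections import defaultdict

# Hypothetical sample of exam scores (illustrative values only).
scores = [52, 55, 61, 63, 67, 68, 70, 72, 74, 75, 81, 88, 93]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) and collect units digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

print(five_number_summary(scores))
for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row of the stem-and-leaf output keeps every data value visible while also showing the shape of the distribution.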
Multivariate non-graphical
Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables of the data through cross-tabulation
or statistics.
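Cross-tabulation can be sketched with the standard library alone: count how often each combination of two categorical variables occurs. The `records` below are hypothetical (gender, purchased) pairs invented for illustration.

```python
from collections import Counter

# Hypothetical records: (gender, purchased) pairs, illustrative values only.
records = [
    ("F", "yes"), ("F", "no"), ("M", "yes"), ("M", "yes"),
    ("F", "yes"), ("M", "no"), ("F", "yes"), ("M", "no"),
]

# Count each (row, column) combination.
counts = Counter(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Build the cross-tabulation as a nested dict: crosstab[row][column] -> count.
crosstab = {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Print it as a small table.
print("      " + "  ".join(f"{c:>4}" for c in cols))
for r in rows:
    print(f"{r:<6}" + "  ".join(f"{crosstab[r][c]:>4}" for c in cols))
```

With pandas, `pd.crosstab(df["a"], df["b"])` produces the same kind of table from two DataFrame columns.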
Multivariate graphical
Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most
used graphic is a grouped bar plot or bar chart with each group representing one level of one of the
variables and each bar within a group representing the levels of the other variable.
Other common types of multivariate graphics include:
Scatter plots, which plot data points on a horizontal and a vertical axis
to show how one variable relates to another.
Bubble charts, which display multiple circles (bubbles)
in a two-dimensional plot.
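SYNTAX:
import seaborn as sns
# Count duplicate rows (df is assumed to be an existing pandas DataFrame)
df.duplicated().sum()
# Correlation matrix of the numeric columns
df.corr(numeric_only=True)
# Correlation plot
sns.heatmap(df.corr(numeric_only=True))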
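By default, `df.corr()` computes the Pearson correlation coefficient for each pair of columns. As a rough sketch of what a single cell of that matrix means, the coefficient can be computed by hand for two variables. The `pearson` helper and the `hours` and `score` lists are hypothetical, introduced only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam score (illustrative values only).
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]

r = pearson(hours, score)
print(f"Pearson r = {r:.3f}")
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one; the heatmap simply colors each pair of columns by this value.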
WHAT IS SEABORN?
Seaborn is a Python data visualization library based on matplotlib.
It provides a high-level interface for drawing attractive and informative
statistical graphics.
[Figure: PROFIT ESTIMATE, chart of values by month]
EXAMPLE
import seaborn as sns
sns.set_theme(style="whitegrid")
penguins = sns.load_dataset("penguins")