Statistical Data Analysis: Studi Independen - 2022


STATISTICAL

DATA ANALYSIS

DATA SCIENCE
GROUP 1

STUDI INDEPENDEN - 2022


OUR TEAM

FARHAN IMAM ANNISA SALWA AHLA MIKHAEL ADITHA DJEREMIAH


NAUFAL DWI ARI AMANIA SEMBIRING MELIALA CHRISTOFEL
WHAT IS STATISTICAL DATA ANALYSIS?
Statistical data analysis is the process of applying statistical methods to
data. It covers collecting the data, interpreting it and, finally, validating
the results.
Statistical tools such as SAS, SPSS and Stata can handle problems ranging from
simple to complex, depending on the nature of the study.

TYPES OF STATISTICAL DATA ANALYSIS

1. Summary or descriptive statistics
2. Inferential statistics
DESCRIPTIVE STATISTICS
Descriptive statistics describe the features of a data set by generating
summaries of data samples. They are usually presented as a short summary
that explains the contents of the data.

Descriptive statistics are broken down into measures of central tendency
and measures of variability (spread). Measures of central tendency
include :
1. mean
2. median
3. mode
while measures of variability include :
1. standard deviation
2. variance
3. minimum and maximum values
4. kurtosis
5. skewness
Strictly speaking, kurtosis and skewness describe the shape of the
distribution rather than its spread.
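The measures listed above can be computed directly with pandas. The following is a minimal sketch, assuming a small hypothetical series of exam scores (the data and the variable names are invented for illustration and are not part of the original slides):

```python
import pandas as pd

# Hypothetical sample: eight exam scores, invented for illustration.
scores = pd.Series([55, 60, 60, 65, 70, 75, 80, 95], name="score")

# Measures of central tendency
mean_score = scores.mean()
median_score = scores.median()
mode_scores = scores.mode().tolist()  # mode() can return more than one value

# Measures of variability (pandas uses the sample formulas, ddof=1)
std_score = scores.std()
var_score = scores.var()
min_score, max_score = scores.min(), scores.max()

# Measures of shape
skew_score = scores.skew()  # positive when the right tail is longer
kurt_score = scores.kurt()  # excess kurtosis: 0 for a normal distribution

print("mean:", mean_score, "median:", median_score, "mode:", mode_scores)
print("std:", std_score, "variance:", var_score)
print("min:", min_score, "max:", max_score)
print("skewness:", skew_score, "kurtosis:", kurt_score)
```

Most of these values also appear in the output of scores.describe(), which is often the quickest first look at a numeric column.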
DIFFERENCES BETWEEN DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS

DESCRIPTIVE STATISTICS
1. Concerned with describing the target population.
2. Organizes, analyzes and presents the data in a meaningful way.
3. Final results are presented as charts, tables and graphs.
4. Explains data that is already known.

INFERENTIAL STATISTICS
1. Draws inferences from the sample and generalizes them to the population.
2. Correlates, tests and predicts future outcomes.
3. Final results are probability scores.
4. Tries to reach conclusions about the population that go beyond the
available data.
WHAT IS EDA?
Exploratory data analysis, popularly known as EDA, is the process of
performing initial investigations on a dataset to discover its structure
and content. It is often known as data profiling.
EDA is where we get a basic understanding of the data in hand, which
then helps in the later steps of data cleaning and data preparation.
There are two basic types of analysis: univariate and bivariate. In
univariate analysis you analyze a single attribute. In bivariate analysis
you analyze an attribute together with the target attribute. Each type can
be carried out with non-graphical or graphical methods, and bivariate
analysis generalizes to multivariate analysis of more than two variables.
Univariate non-graphical
This is the simplest form of data analysis, where the data being analyzed consists of just one variable.
Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of
univariate analysis is to describe the data and find patterns that exist within it.
Univariate graphical
Non-graphical methods don’t provide a full picture of the data, so graphical
methods are also required. Common types of univariate graphics include :
1. Stem-and-leaf plots, which show all data values and the shape of the
distribution.
2. Histograms, bar plots in which each bar represents the frequency (count)
or proportion (count/total count) of cases for a range of values.
3. Box plots, which graphically depict the five-number summary of minimum,
first quartile, median, third quartile, and maximum.
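As an illustration of the histogram and box plot described above, here is a minimal sketch using matplotlib. The data values are hypothetical, and the non-interactive "Agg" backend is an assumption made so that the code runs without a display. In a notebook you would normally drop that line and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements, invented for illustration.
values = pd.Series([2, 3, 3, 4, 4, 4, 5, 5, 6, 9], name="value")

# The five-number summary that a box plot draws:
# minimum, first quartile, median, third quartile, maximum
five_number = values.quantile([0, 0.25, 0.5, 0.75, 1.0]).tolist()

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: each bar counts the cases that fall in one range of values
counts, bin_edges, _ = ax_hist.hist(values.to_numpy(), bins=4)
ax_hist.set_title("Histogram")

# Box plot of the same variable
ax_box.boxplot(values.to_numpy())
ax_box.set_title("Box plot")

print("five-number summary:", five_number)
print("histogram counts:", counts)
```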
Multivariate non-graphical
Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables of the data through cross-tabulation
or statistics.
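A cross-tabulation and a simple two-variable statistic can be sketched with pandas as follows. The survey table, its column names and its values are all hypothetical, chosen only to show the two techniques:

```python
import pandas as pd

# Hypothetical survey data, invented for illustration.
survey = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "owns_car": ["yes", "no", "no", "yes", "yes", "yes"],
    "age": [23, 35, 41, 29, 52, 38],
    "income": [30, 42, 50, 39, 61, 45],
})

# Cross-tabulation: counts for every combination of two categorical variables
table = pd.crosstab(survey["gender"], survey["owns_car"])

# A statistic relating two numeric variables: Pearson correlation
age_income_corr = survey["age"].corr(survey["income"])

print(table)
print("correlation between age and income:", age_income_corr)
```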
Multivariate graphical
Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most
used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the
variables and each bar within a group representing the levels of the other variable.
Other common types of multivariate graphics include :
1. Scatter plot, which plots data points on a horizontal and a vertical axis
to show how two variables are related.
2. Multivariate chart, which is a graphical representation of the relationships
between factors and a response.
3. Run chart, which is a line graph of data plotted over time.
4. Bubble chart, which is a data visualization that displays multiple circles (bubbles)
in a two-dimensional plot.
5. Heat map, which is a graphical representation of data where values are
depicted by color.
STEPS OF EDA
1 LOAD DATA
#Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
#Load the data
df = pd.read_csv('file_name')
#View the data
df.head()

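2 BASIC INFORMATION ABOUT DATA
The df.info() function gives basic information about the dataset: the column
names, non-null counts, data types and memory usage. The df.describe() function
adds summary statistics for the numeric columns. For any data, it is good to
start here.
SYNTAX :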
#Basic information
df.info()
#Describe the data
df.describe()
3 DUPLICATE VALUES
The df.duplicated().sum() expression returns the number of duplicate rows in
the data, if any are present.
SYNTAX :
#Find the duplicates
df.duplicated().sum()
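A short sketch of how duplicate detection behaves, assuming a hypothetical orders table (the data and column names are invented for illustration). Note that duplicated() compares whole rows by default. The subset argument restricts the comparison to chosen columns.

```python
import pandas as pd

# Hypothetical orders table, invented for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 3],
    "amount": [10.0, 25.0, 25.0, 40.0, 45.0],
})

# Rows that repeat an earlier row in every column
n_duplicates = int(orders.duplicated().sum())

# Rows that repeat an earlier order_id, whatever the amount
n_duplicate_ids = int(orders.duplicated(subset="order_id").sum())

# Drop the fully duplicated rows and renumber the index
deduplicated = orders.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_duplicates)
print("duplicate order ids:", n_duplicate_ids)
print(deduplicated)
```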

4 UNIQUE VALUES IN THE DATA
The unique() function returns the distinct values in a particular column,
and nunique() returns how many there are.
SYNTAX :
#unique values
df['data_name'].unique()
#number of unique values
df['data_name'].nunique()
OUTPUT OF unique() :
array([..,..], dtype=..)
5 FIND THE NULL VALUES
Finding the null values is one of the most important steps in EDA, because
ensuring the quality of the data is paramount.
SYNTAX :
#Find null values
df.isnull().sum()

6 REPLACE THE NULL VALUES
The replace() function can replace all the null values with a specific value.
SYNTAX :
#Replace null values with the number 0 (the string '0' would turn numeric columns into text)
df.replace(np.nan, 0, inplace=True)
#Check the changes now
df.isnull().sum()
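Replacing every null with one constant is not always the best choice. As a hedged alternative, the sketch below fills each column with a value suited to its type, using fillna(). The readings table and the choice of the median and of the label "unknown" are assumptions made for illustration, not a recommendation from the original slides.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value per column.
readings = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.0, 25.0],
    "city": ["Jakarta", "Bandung", None, "Medan"],
})

# Count the nulls in each column before filling
null_counts = readings.isnull().sum()

# Fill a copy so the original data stays available for comparison
filled = readings.copy()
filled["temperature"] = filled["temperature"].fillna(filled["temperature"].median())
filled["city"] = filled["city"].fillna("unknown")

print(null_counts)
print(filled)
```

Filling the numeric column with its median keeps the column numeric, which later steps such as box plots and correlation rely on.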
7 KNOW THE DATATYPES
Knowing the datatypes of the columns you are exploring is very important,
and it is easy to check.
SYNTAX :
#Datatypes
df.dtypes

8 FILTER THE DATA
A boolean condition on a column selects only the rows where the condition is
true.
SYNTAX :
#Filter data
df[df['data_name']==1].head()
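Filtering works through a boolean mask, and several conditions can be combined. The following sketch assumes a hypothetical passengers table (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical passenger data, invented for illustration.
passengers = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1],
    "age": [22, 38, 26, 35, 54],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Single condition: the comparison builds a True/False mask, one entry per row
survivors = passengers[passengers["survived"] == 1]

# Several conditions: wrap each in parentheses and combine with & (and) or | (or)
older_survivors = passengers[(passengers["survived"] == 1) & (passengers["age"] > 25)]

# The same filter written with query()
same_result = passengers.query("survived == 1 and age > 25")

print(survivors)
print(older_survivors)
```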
9 A QUICK BOX PLOT
You can create a box plot for any numerical column using a single line of
code.
SYNTAX :
#Boxplot
df[['data_name']].boxplot()

10 CORRELATION PLOT
To find the correlation among the variables, we can use the corr() function.
This gives a fair idea of the correlation strength between different numeric variables.
You can also visualize the correlation matrix using the seaborn library.

SYNTAX :
#Correlation (numeric columns only)
df.corr(numeric_only=True)
#Correlation plot
sns.heatmap(df.corr(numeric_only=True))
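To make the correlation matrix concrete, here is a minimal sketch on a hypothetical housing table in which the relationships are exact by construction. The data and column names are invented for illustration, and the numeric_only argument assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical housing data, invented so that the correlations are easy to read.
houses = pd.DataFrame({
    "area_m2": [50, 70, 90, 110, 130],
    "price": [100, 140, 180, 220, 260],  # exactly 2 * area_m2
    "age_years": [30, 25, 20, 15, 10],   # falls linearly as area grows
    "district": ["A", "B", "A", "C", "B"],
})

# Pearson correlation between every pair of numeric columns.
# The text column "district" is skipped because of numeric_only=True.
corr_matrix = houses.corr(numeric_only=True)

print(corr_matrix)
# To draw it as a heat map: sns.heatmap(corr_matrix, annot=True)
```

Values near 1 mean the two variables rise together, values near -1 mean one falls as the other rises, and values near 0 mean little linear relationship.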
WHAT IS SEABORN ?
Seaborn is a Python data visualization library based on matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
[Figure: chart titled PROFIT ESTIMATE, with months on the horizontal axis and values from 0 to 40 on the vertical axis]
EXAMPLE : GROUPED BARPLOTS
import seaborn as sns
sns.set_theme(style="whitegrid")

penguins = sns.load_dataset("penguins")

# Draw a nested barplot by species and sex
g = sns.catplot(
    data=penguins, kind="bar",
    x="species", y="body_mass_g", hue="sex",
    errorbar="sd", palette="dark", alpha=.6,
    height=6
)
g.despine(left=True)
g.set_axis_labels("", "Body mass (g)")
g.legend.set_title("")
DIFFERENCE BETWEEN MATPLOTLIB AND SEABORN

MATPLOTLIB
1. It is used for making basic graphs. Datasets are visualised with bar
graphs, histograms, pie charts, scatter plots, line plots and so on.
2. It uses comparatively complex and lengthy syntax. Example: syntax for a
bar graph is matplotlib.pyplot.bar(x_axis, y_axis).
3. Matplotlib works efficiently with data frames and arrays. It treats
figures and axes as objects. It contains various stateful APIs for plotting,
so methods like plot() can work without parameters.

SEABORN
1. Seaborn contains a number of patterns and plots for data visualization.
It uses attractive themes. It helps in compiling whole data into a single
plot. It also shows the distribution of data.
2. It uses comparatively simple syntax which is easier to learn and
understand. Example: syntax for a bar graph is
seaborn.barplot(x=x_axis, y=y_axis).
3. Seaborn is more functional and organized than Matplotlib and treats the
whole dataset as a single unit. Seaborn is not so stateful, so parameters
are required when calling its plotting methods.
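The syntax difference can be seen by drawing the same bar chart with both libraries. This is a minimal sketch: the sales table is hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical monthly profit figures, invented for illustration.
sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "profit": [10, 20, 30]})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(8, 3))

# matplotlib: pass the x and y values themselves
ax_mpl.bar(sales["month"], sales["profit"])
ax_mpl.set_title("matplotlib")

# seaborn: pass the whole DataFrame and name the columns to use
sns.barplot(data=sales, x="month", y="profit", ax=ax_sns)
ax_sns.set_title("seaborn")

fig.tight_layout()
```

Because seaborn draws onto matplotlib axes, the two libraries can be mixed in one figure, as done here.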
