Sodapdf

This document discusses how to perform data analysis with Python. It covers analyzing numerical data with NumPy, tabular data with Pandas, data visualization with Matplotlib, and exploratory data analysis. NumPy is used to create and manipulate multidimensional arrays and is fundamental for scientific computing in Python. Pandas is used for working with tabular data. Matplotlib is used for data visualization. The six steps of data analysis are also outlined.

Uploaded by

Marcos Wilker

Data Analysis with Python

In this article, we will discuss how to do data analysis with Python. We will cover analyzing numerical data with NumPy, tabular data
with Pandas, data visualization with Matplotlib, and exploratory data analysis.

Data Analysis With Python

Data Analysis is the technique of collecting, transforming, and organizing data to make future predictions and informed data-driven
decisions. It also helps to find possible solutions for a business problem. There are six steps for Data Analysis. They are:

Ask or Specify Data Requirements
Prepare or Collect Data
Clean and Process
Analyze
Share
Act or Report

Note: To know more about these steps refer to our Six Steps of Data Analysis Process tutorial.

Analyzing Numerical Data with NumPy

NumPy is an array processing package in Python and provides a high-performance multidimensional array object and tools for
working with these arrays. It is the fundamental package for scientific computing with Python.

Arrays in NumPy

A NumPy array is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, the
number of dimensions of the array is called the rank of the array (available as the ndim attribute). A tuple of integers giving the size
of the array along each dimension is known as the shape of the array.
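
As a small illustration of these terms, the sketch below builds a 2-D array and inspects its number of dimensions and its shape. The array values are arbitrary and chosen only for the example.

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns (values chosen arbitrarily)
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.ndim)    # number of dimensions (the "rank")
print(arr.shape)   # size of the array along each dimension
print(arr[1, 2])   # element at row 1, column 2, indexed by a tuple
```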

Creating NumPy Array

NumPy arrays can be created in multiple ways, with various ranks. They can also be created from Python sequences such as lists and
tuples. The type of the resulting array is deduced from the type of the elements in the sequences. NumPy also offers several functions to
create arrays with initial placeholder content. These minimize the need to grow arrays, which is an expensive operation.
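
A minimal sketch of these creation routes follows, assuming nothing beyond NumPy itself. The variable names are illustrative only.

```python
import numpy as np

from_list = np.array([1, 2, 3])        # all integers -> integer dtype
from_tuple = np.array((1.5, 2, 3))     # one float present -> float64
nested = np.array([[1, 2], [3, 4]])    # nested lists -> 2-D array

# Placeholder-content constructors allocate the final size up front,
# so the array does not have to be grown later
ones = np.ones((2, 3))
filled = np.full((2, 2), 7)

print(from_list.dtype, from_tuple.dtype, nested.shape)
print(ones)
print(filled)
```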

Create Array using numpy.empty(shape, dtype=float, order='C')

Python3

import numpy as np

b = np.empty(2, dtype = int)
print("Matrix b : \n", b)

a = np.empty([2, 2], dtype = int)
print("\nMatrix a : \n", a)

c = np.empty([3, 3])
print("\nMatrix c : \n", c)

Output:

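The printed values are arbitrary: numpy.empty() allocates the arrays without initializing their entries, so the output differs from run to run.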

Create Array using numpy.zeros(shape, dtype = float, order = 'C')

Python3

import numpy as np

b = np.zeros(2, dtype = int)
print("Matrix b : \n", b)

a = np.zeros([2, 2], dtype = int)
print("\nMatrix a : \n", a)

c = np.zeros([3, 3])
print("\nMatrix c : \n", c)

Output:

Matrix b :
[0 0]

Matrix a :
[[0 0]
[0 0]]

Matrix c :
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]

Operations on Numpy Arrays

Arithmetic Operations
Addition:

Python3

import numpy as np

# Defining both the matrices
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])

# Performing addition using arithmetic operator
add_ans = a+b
print(add_ans)

# Performing addition using numpy function
add_ans = np.add(a, b)
print(add_ans)

# The + operator chains naturally across more than two arrays
c = np.array([1, 2, 3, 4])
add_ans = a+b+c
print(add_ans)

# np.add() takes exactly two inputs (a third positional
# argument is treated as the output array), so nest the calls
add_ans = np.add(np.add(a, b), c)
print(add_ans)

Output:

[ 7 77 23 130]
[ 7 77 23 130]
[ 8 79 26 134]
[ 8 79 26 134]

Subtraction:

Python3

import numpy as np

# Defining both the matrices
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])

# Performing subtraction using arithmetic operator
sub_ans = a-b
print(sub_ans)

# Performing subtraction using numpy function
sub_ans = np.subtract(a, b)
print(sub_ans)

Output:

[ 3 67 3 70]
[ 3 67 3 70]

Multiplication:

Python3

import numpy as np

# Defining both the matrices
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])

# Performing multiplication using arithmetic
# operator
mul_ans = a*b
print(mul_ans)

# Performing multiplication using numpy function
mul_ans = np.multiply(a, b)
print(mul_ans)

Output:

[ 10 360 130 3000]
[ 10 360 130 3000]

Division:

Python3

import numpy as np

# Defining both the matrices
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])

# Performing division using arithmetic operators
div_ans = a/b
print(div_ans)

# Performing division using numpy functions
div_ans = np.divide(a, b)
print(div_ans)

Output:

[ 2.5 14.4 1.3 3.33333333]
[ 2.5 14.4 1.3 3.33333333]

For more information, refer to our NumPy – Arithmetic Operations Tutorial

NumPy Array Indexing

Indexing can be done in NumPy by using an array as an index. A slice returns a view of the original array, whereas indexing with an
index array returns a copy. NumPy arrays can be indexed with other arrays or with any other sequence except tuples, which are
interpreted as a multidimensional index (one entry per dimension). The last element is indexed by -1, the second last by -2, and so on.

Python NumPy Array Indexing

Python3

# Python program to demonstrate

# the use of index arrays.
import numpy as np

# Create a sequence of integers from
# 10 to 1 with a step of -2
a = np.arange(10, 1, -2)
print("\n A sequential array with a negative step: \n",a)

# Indexes are specified inside the np.array method.
newarr = a[np.array([3, 1, 2 ])]
print("\n Elements at these indices are:\n",newarr)

Output:

A sequential array with a negative step:

[10 8 6 4 2]

Elements at these indices are:

[4 8 6]
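
The difference between a view and a copy can be checked directly. The following sketch reuses the array from the example above; np.shares_memory() is used here only to make the distinction visible.

```python
import numpy as np

a = np.arange(10, 1, -2)         # [10  8  6  4  2]

view = a[1:4]                    # basic slice -> view on a's data
copy = a[np.array([1, 2, 3])]    # index array -> independent copy

view[0] = 99                     # writes through to a[1]
copy[1] = -1                     # leaves a untouched

print(a)
print(np.shares_memory(a, view), np.shares_memory(a, copy))
```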

NumPy Array Slicing

Consider the syntax x[obj], where x is the array and obj is the index. In the case of basic slicing, the index is a slice object. Basic
slicing occurs when obj is:

a slice object of the form start:stop:step

an integer
or a tuple of slice objects and integers

Arrays generated by basic slicing are always views of the original array.

Python3

# Python program for basic slicing.

import numpy as np

# Arrange elements from 0 to 19
a = np.arange(20)
print("\n Array is:\n ",a)

# a[start:stop:step]
print("\n a[-8:17:1] = ",a[-8:17:1])

# The : operator means all elements till the end.
print("\n a[10:] = ",a[10:])

Output:

Array is:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]

a[-8:17:1] = [12 13 14 15 16]

a[10:] = [10 11 12 13 14 15 16 17 18 19]

Ellipsis can also be used along with basic slicing. An ellipsis (...) expands to the number of : objects needed to make a selection
tuple of the same length as the number of dimensions of the array.

Python3

# Python program for indexing using basic slicing with ellipsis

import numpy as np

# A 3 dimensional array.
b = np.array([[[1, 2, 3],[4, 5, 6]],
[[7, 8, 9],[10, 11, 12]]])

print(b[...,1]) #Equivalent to b[: ,: ,1 ]

Output:

[[ 2 5]
[ 8 11]]

NumPy Array Broadcasting

The term broadcasting refers to how NumPy treats arrays with different dimensions during arithmetic operations. Subject to
certain constraints, the smaller array is broadcast across the larger array so that they have compatible shapes.

Let’s assume that we have a large data set in which each datum is a list of parameters. In NumPy we have a 2-D array, where each row
is a datum and the number of rows is the size of the data set. Suppose we want to apply some sort of scaling to all these data, where
every parameter gets its own scaling factor; that is, every parameter is multiplied by some factor.

Just to have a clear understanding, let’s count calories in foods using a macro-nutrient breakdown. Roughly put, the caloric parts of
food are made of fats (9 calories per gram), protein (4 CPG), and carbs (4 CPG). So if we list some foods (our data), and for each food
list its macro-nutrient breakdown (parameters), we can then multiply each nutrient by its caloric value (apply scaling) to compute the
caloric breakdown of every food item.

With this transformation, we can now compute all kinds of useful information. For example, what is the total number of calories
present in some food, or, given a breakdown of my dinner, how many calories did I get from protein, and so on.

Let’s see a naive way of producing this computation with Numpy:

Python3

import numpy as np

macros = np.array([
    [0.8, 2.9, 3.9],
    [52.4, 23.6, 36.5],
    [55.2, 31.7, 23.9],
    [14.4, 11, 4.9]
])

# Create a new array filled with zeros,
# of the same shape as macros.
result = np.zeros_like(macros)

# Calories per gram of fat, protein, and carbs,
# matching the values given in the text above.
cal_per_macro = np.array([9, 4, 4])

# Now multiply each row of macros by
# cal_per_macro. In Numpy, `*` is
# element-wise multiplication between two arrays.
for i in range(macros.shape[0]):
    result[i, :] = macros[i, :] * cal_per_macro

result

Output:

array([[  7.2,  11.6,  15.6],
       [471.6,  94.4, 146. ],
       [496.8, 126.8,  95.6],
       [129.6,  44. ,  19.6]])
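
The explicit loop is not required. As a sketch of the broadcasting alternative, the 1-D array of factors can be multiplied against the whole 2-D array in one expression. The factors below follow the calorie values quoted in the text (9, 4, 4), and the column order fat, protein, carbs is an assumption.

```python
import numpy as np

macros = np.array([
    [0.8, 2.9, 3.9],
    [52.4, 23.6, 36.5],
    [55.2, 31.7, 23.9],
    [14.4, 11, 4.9],
])

# Assumed column order: fat, protein, carbs (calories per gram)
cal_per_macro = np.array([9, 4, 4])

# Shapes (4, 3) and (3,): the 1-D array is broadcast across all rows
result = macros * cal_per_macro

# Total calories per food: sum across the nutrient columns
total_per_food = result.sum(axis=1)

print(result)
print(total_per_food)
```

Besides being shorter, the broadcast form avoids the Python-level loop, which matters when the data set has many rows.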

Broadcasting Rules: Broadcasting two arrays together follows these rules:

If the arrays don’t have the same rank then prepend the shape of the lower rank array with 1s until both shapes have the same
length.
The two arrays are compatible in a dimension if they have the same size in the dimension or if one of the arrays has size 1 in that
dimension.
The arrays can be broadcast together if they are compatible with all dimensions.
After broadcasting, each array behaves as if it had a shape equal to the element-wise maximum of shapes of the two input arrays.
In any dimension where one array had a size of 1 and the other array had a size greater than 1, the first array behaves as if it were
copied along that dimension.
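
These rules can be tried out without building any arrays. The sketch below assumes NumPy 1.20 or later, where np.broadcast_shapes() is available; it returns the broadcast result shape and raises ValueError for incompatible shapes.

```python
import numpy as np

# Shapes are aligned from the trailing dimension backwards
outer_shape = np.broadcast_shapes((3, 1), (2,))   # (3, 1) with (1, 2)
row_shape = np.broadcast_shapes((2, 3), (3,))     # (2, 3) with (1, 3)
scalar_shape = np.broadcast_shapes((2, 3), ())    # scalar case

print(outer_shape, row_shape, scalar_shape)

# Incompatible: trailing sizes 3 and 2 differ and neither is 1
try:
    np.broadcast_shapes((2, 3), (2,))
    compatible = True
except ValueError:
    compatible = False

print(compatible)
```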

Python3

import numpy as np

v = np.array([12, 24, 36])
w = np.array([45, 55])

# To compute an outer product we first
# reshape v to a column vector of shape 3x1
# then broadcast it against w to yield an output
# of shape 3x2 which is the outer product of v and w
print(np.reshape(v, (3, 1)) * w)

X = np.array([[12, 22, 33], [45, 55, 66]])

# X has shape 2x3 and v has shape (3, )
# so they broadcast to 2x3,
print(X + v)

# Add a vector to each column of a matrix. X has
# shape 2x3 and w has shape (2, ). If we transpose X
# then it has shape 3x2 and can be broadcast against w
# to yield a result of shape 3x2.

# Transposing this yields the final result
# of shape 2x3, which is the matrix X with
# the vector w added to each column.
print((X.T + w).T)

# Another solution is to reshape w to be a column
# vector of shape 2X1 we can then broadcast it
# directly against X to produce the same output.
print(X + np.reshape(w, (2, 1)))

# Multiply a matrix by a constant, X has shape 2x3.
# Numpy treats scalars as arrays of shape();
# these can be broadcast together to shape 2x3.
print(X * 2)

Output:

[[ 540 660]
[1080 1320]
[1620 1980]]
[[ 24 46 69]
[ 57 79 102]]
[[ 57 67 78]
[100 110 121]]
[[ 57 67 78]
[100 110 121]]
[[ 24 44 66]
[ 90 110 132]]

Note: For more information, refer to our Python NumPy Tutorial.

Analyzing Data Using Pandas

Python Pandas is used for relational or labeled data and provides various data structures for manipulating such data and time series.
This library is built on top of the NumPy library. This module is generally imported as:

import pandas as pd

Here, pd is an alias for Pandas. It is not necessary to import the library under this alias; it just reduces the amount of code
written every time a method or property is called. Pandas provides two main data structures for manipulating
data. They are:

Series
Dataframe

Series:

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The
axis labels are collectively called the index. A Pandas Series is comparable to a column in an Excel sheet. Labels need not be unique but
must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing
operations involving the index.

Pandas Series

A Series can be created with the Series() constructor, either from data loaded from existing storage (SQL databases, CSV files, Excel
files, etc.) or from data structures such as lists and dictionaries.

Python Pandas Creating Series

Python3

import pandas as pd
import numpy as np

# Creating empty series (an explicit dtype avoids the
# default-dtype warning raised by pandas 1.x)
ser = pd.Series(dtype="object")

print(ser)

# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])

ser = pd.Series(data)
print(ser)

Output:

Pandas series

Dataframe:

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and
columns); that is, data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal
components: the data, the rows, and the columns.

Pandas Dataframe

It can be created using the DataFrame() constructor and, just like a Series, it can also be created from different file types and data structures.

Python Pandas Creating Dataframe

Python3

import pandas as pd

# Calling DataFrame constructor
df = pd.DataFrame()
print(df)

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
df

Output:

Creating Dataframe from python list

Creating Dataframe from CSV

We can create a dataframe from the CSV files using the read_csv() function.

Note: This dataset can be downloaded from here.

Python Pandas read CSV

Python3

import pandas as pd

# Reading the CSV file
df = pd.read_csv("Iris.csv")

# Printing top 5 rows
df.head()

Output:

head of a dataframe

Filtering DataFrame

The Pandas dataframe.filter() function is used to subset rows or columns of a dataframe according to labels in the specified index. Note
that this routine does not filter a dataframe on its contents; the filter is applied to the labels of the index.

Python Pandas Filter Dataframe

Python3

import pandas as pd

# Reading the CSV file
df = pd.read_csv("Iris.csv")

# applying filter function
df.filter(["Species", "SepalLengthCm", "SepalWidthCm"]).head()

Output:

Applying filter on dataset
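
To make the distinction between labels and contents concrete, here is a small sketch on a hypothetical three-row frame that only borrows the Iris column names. It selects columns by label with filter() and then rows by value with a boolean mask.

```python
import pandas as pd

# Hypothetical frame with Iris-like column names (values are made up)
df_demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# filter() works on labels: keep columns whose name contains "Sepal"
sepal_cols = df_demo.filter(like="Sepal")

# Filtering on contents uses a boolean mask instead
long_sepals = df_demo[df_demo["SepalLengthCm"] > 6.0]

print(sepal_cols.columns.tolist())
print(long_sepals)
```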

Sorting DataFrame

To sort a data frame in pandas, the sort_values() function is used. It can sort the data frame in ascending
or descending order.
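
A minimal sketch of both sort directions, using a small made-up frame rather than the Iris file so that it runs on its own:

```python
import pandas as pd

# Hypothetical rows; only the column names follow the Iris dataset
df_demo = pd.DataFrame({
    "Species": ["Iris-virginica", "Iris-setosa", "Iris-versicolor"],
    "SepalLengthCm": [6.3, 5.1, 7.0],
})

# Ascending order is the default
ascending = df_demo.sort_values(by="SepalLengthCm")

# Pass ascending=False for descending order
descending = df_demo.sort_values(by="SepalLengthCm", ascending=False)

print(ascending)
print(descending)
```

sort_values() returns a new data frame and leaves the original untouched unless inplace=True is passed.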

Python Pandas Sorting Dataframe in Ascending Order

Output:

Sorted dataset based on a column value

Pandas GroupBy

Groupby is a pretty simple concept. We can create a grouping of categories and apply a function to the categories. In real data science
projects, you’ll be dealing with large amounts of data and trying things over and over, so for efficiency, we use the Groupby concept.
Groupby mainly refers to a process involving one or more of the following steps:

Splitting: We split the data into groups by applying some conditions on the dataset.
Applying: We apply a function to each group independently.
Combining: We combine the results of applying the function to each group into a single data structure.

The following steps illustrate the process involved in the Groupby concept:

1. Group the unique values from the Team column

2. Now there’s a bucket for each group

3. Toss the other data into the buckets

4. Apply a function on the weight column of each bucket.
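
The four steps can be sketched in a few lines. The Team and weight column names come from the steps above, while the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical data with the Team and weight columns used in the steps
players = pd.DataFrame({
    "Team": ["A", "B", "A", "C", "B"],
    "weight": [80, 72, 90, 65, 78],
})

# Steps 1-3: split the rows into one bucket per unique Team value
buckets = players.groupby("Team")

# Step 4: apply a function (here the mean) to the weight column of each bucket
mean_weight = buckets["weight"].mean()

print(mean_weight)
```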

Python Pandas GroupBy

Python3

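# importing pandas module

import pandas as pd

# display() is built into Jupyter; this import makes it
# available when the code runs as a plain script
from IPython.display import display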

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                  'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32,
                 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                     'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                           'B.Tech', 'B.com', 'Msc', 'MA']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

print("Original Dataframe")
display(df)

# applying groupby() function to
# group the data on Name value.
gk = df.groupby('Name')

# Let's print the first entries
# in all the groups formed.
print("After Creating Groups")
gk.first()

Output:

pandas groupby

Applying function to group:

After splitting the data into groups, we apply a function to each group. This typically involves one of the following operations:

Aggregation: We compute a summary statistic (or statistics) about each group. For example, compute
group sums or means.
Transformation: We perform some group-specific computations and return a like-indexed object. For example,
fill NAs within groups with a value derived from each group.
Filtration: We discard some groups, according to a group-wise computation that evaluates to True or False. For
example, filter out data based on the group sum or mean.
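
The three operations can be compared side by side. This is a sketch on made-up data (one age is deliberately missing), not the employee table used elsewhere in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical data; one Age value is missing on purpose
df_demo = pd.DataFrame({
    "Name": ["Jai", "Anuj", "Jai", "Anuj", "Abhi"],
    "Age": [27.0, 24.0, np.nan, 36.0, 32.0],
})
groups = df_demo.groupby("Name")["Age"]

# Aggregation: one summary value per group
mean_age = groups.mean()

# Transformation: like-indexed result; fill each NaN with its group mean
filled = groups.transform(lambda s: s.fillna(s.mean()))

# Filtration: keep only the groups that have more than one row
multi = df_demo.groupby("Name").filter(lambda g: len(g) > 1)

print(mean_age)
print(filled)
print(multi)
```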

Pandas Aggregation

Aggregation is a process in which we compute a summary statistic about each group. The aggregated function returns a single
aggregated value for each group. After splitting data into groups using groupby function, several aggregation operations can be
performed on the grouped data.

Python Pandas Aggregation

Python3

# importing pandas module

import pandas as pd

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                  'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32,
                 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                     'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                                  'B.Tech', 'B.com', 'Msc', 'MA']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

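# performing aggregation using the aggregate method;
# only the numeric Age column is summed, since summing
# the string columns would just concatenate them

grp1 = df.groupby('Name')

grp1['Age'].aggregate('sum')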

Output:

Use of sum aggregate function on dataset

Concatenating DataFrame
To concatenate dataframes, we use the concat() function. This function does all the
heavy lifting of performing concatenation operations along an axis of Pandas objects while performing optional set logic (union
or intersection) of the indexes (if any) on the other axes.

Python Pandas Concatenate Dataframe

Python3

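# importing pandas module

import pandas as pd

# display() is built into Jupyter; this import makes it
# available when the code runs as a plain script
from IPython.display import display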

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)


display(df, df1)

# concatenating the two dataframes side by side (axis=1)
res = pd.concat([df, df1], axis=1)

res

Output:

Merging DataFrame
When we need to combine very large DataFrames, joins serve as a powerful way to perform these operations swiftly. Joins can only
be done on two DataFrames at a time, denoted as left and right tables. The key is the common column that the two DataFrames will
be joined on. It’s a good practice to use keys that have unique values throughout the column to avoid unintended duplication of row
values. Pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame
objects.

There are four basic ways to handle the join (inner, left, right, and outer), depending on which rows must retain their data.

Python Pandas Merge Dataframe

Python3

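# importing pandas module

import pandas as pd

# display() is built into Jupyter; this import makes it
# available when the code runs as a plain script
from IPython.display import display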

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)


display(df, df1)

# using .merge() function
res = pd.merge(df, df1, on='key')

res

Output:

Merging two datasets
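
The example above uses the default inner join. To show how the four join types differ, the sketch below merges two small hypothetical frames whose keys only partly overlap and records which keys survive each join.

```python
import pandas as pd

# Hypothetical frames: K0 exists only on the left, K3 only on the right
left = pd.DataFrame({"key": ["K0", "K1", "K2"],
                     "Name": ["Jai", "Princi", "Gaurav"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "Address": ["Kanpur", "Allahabad", "Kannuaj"]})

keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    res = pd.merge(left, right, on="key", how=how)
    keys_by_how[how] = res["key"].tolist()
    print(how, keys_by_how[how])
```

Rows without a match on the other side are kept by the left, right, and outer joins, with NaN filled in for the missing columns.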

Joining DataFrame
To join dataframes, we use the .join() function, which combines the columns of two potentially differently
indexed DataFrames into a single result DataFrame.

Python Pandas Join Dataframe

Python3

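# importing pandas module

import pandas as pd

# display() is built into Jupyter; this import makes it
# available when the code runs as a plain script
from IPython.display import display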

# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32]}

# Define a dictionary containing employee data
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
        'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3'])

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])


display(df, df1)

# joining dataframe
res = df.join(df1)

res

Output:

Joining two datasets

For more information, refer to our Pandas Merging, Joining, and Concatenating tutorial

For a complete guide on Pandas refer to our Pandas Tutorial.

Visualization with Matplotlib

Matplotlib is an easy-to-use and powerful visualization library in Python. It is built on NumPy arrays, designed to work with the
broader SciPy stack, and offers several plot types such as line, bar, scatter, and histogram.

Pyplot

Pyplot is a Matplotlib module that provides a MATLAB-like interface. Pyplot provides functions that interact with the figure i.e.
creates a figure, decorates the plot with labels, and creates a plotting area in a figure.

Python3

# Python program to show pyplot module

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.axis([0, 6, 0, 20])
plt.show()

Output:

Bar chart

A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose lengths or heights are
proportional to the values they represent. The bar plots can be plotted horizontally or vertically. A bar chart describes the
comparisons between the discrete categories. It can be created using the bar() method.

Python Matplotlib Bar Chart

Here we will use the iris dataset only

Python3

import matplotlib.pyplot as plt

import pandas as pd

df = pd.read_csv("Iris.csv")

# This will plot a simple bar chart
plt.bar(df['Species'], df['SepalLengthCm'])

# Title to the plot
plt.title("Iris Dataset")

# Adding the legends
plt.legend(["bar"])
plt.show()

Output:

Bar chart using matplotlib library

Histograms

A histogram is used to represent data grouped into ranges. It is a type of bar plot where the X-axis represents the bin
ranges while the Y-axis gives information about frequency. To create a histogram, the first step is to bin the range of values, that is,
divide the whole range of values into a series of intervals, and then count the values that fall into each interval. Bins are
consecutive, non-overlapping intervals of a variable. The hist() function is used to compute and create a histogram
of x.
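
The binning step can be inspected on its own with np.histogram(), which returns the counts and bin edges without drawing anything. The values below are hypothetical, and the bin count and range are chosen only for the illustration.

```python
import numpy as np

# Hypothetical measurements
values = np.array([4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3, 6.7, 7.9])

# Four equal-width, consecutive, non-overlapping bins between 4.0 and 8.0
counts, edges = np.histogram(values, bins=4, range=(4.0, 8.0))

print(edges)    # bin boundaries
print(counts)   # number of values falling into each bin
```

Each bin includes its left edge and excludes its right edge, except for the last bin, which includes both.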

Python Matplotlib Histogram

Python3

import matplotlib.pyplot as plt

import pandas as pd

df = pd.read_csv("Iris.csv")

plt.hist(df["SepalLengthCm"])

# Title to the plot
plt.title("Histogram")

# Adding the legends
plt.legend(["SepalLengthCm"])
plt.show()

Output:

Histplot using matplotlib library

Scatter Plot

Scatter plots are used to observe the relationship between variables, using dots to represent the relationship between them. The
scatter() method in the matplotlib library is used to draw a scatter plot.

Python Matplotlib Scatter Plot

Python3

import matplotlib.pyplot as plt

import pandas as pd

df = pd.read_csv("Iris.csv")

plt.scatter(df["Species"], df["SepalLengthCm"])

# Title to the plot
plt.title("Scatter Plot")

# Adding the legends
plt.legend(["SepalLengthCm"])
plt.show()

Output:

Scatter plot using matplotlib library

Box Plot

A boxplot is also known as a box and whisker plot. It is a very good visual representation when it comes to measuring the
data distribution. It clearly plots the median values, outliers, and the quartiles. Understanding data distribution is another important
factor which leads to better model building. If data has outliers, a box plot is a recommended way to identify them and take necessary
actions. The box and whiskers chart shows how data is spread out. Five pieces of information are generally included in the chart:

The minimum is shown at the far left of the chart, at the end of the left ‘whisker’
First quartile, Q1, is the far left of the box
The median is shown as a line in the center of the box
Third quartile, Q3, is shown at the far right of the box
The maximum is at the far right of the chart, at the end of the right ‘whisker’
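
The quantities behind the box can be computed directly with NumPy. This sketch uses hypothetical data with one deliberately extreme value. It also applies the 1.5 x IQR rule that Matplotlib's boxplot() uses by default to decide which points are drawn individually as outliers instead of being covered by a whisker.

```python
import numpy as np

# Hypothetical data; 6.0 is deliberately far from the rest
data = np.array([2.0, 2.5, 3.0, 3.0, 3.2, 3.4, 3.8, 6.0])

# Quartiles and interquartile range (the width of the box)
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(data.min(), q1, median, q3, data.max())
print(outliers)
```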

Representation of box plot

Inter quartile range

Illustrating box plot

Python Matplotlib Box Plot

Python3

import matplotlib.pyplot as plt

import pandas as pd

df = pd.read_csv("Iris.csv")

plt.boxplot(df["SepalWidthCm"])

# Title to the plot
plt.title("Box Plot")

# Adding the legends
plt.legend(["SepalWidthCm"])
plt.show()

Output:

Boxplot using matplotlib library

Correlation Heatmaps

A 2-D heatmap is a data visualization tool that represents the magnitude of a phenomenon in the form of colors. A correlation
heatmap is a heatmap that shows a 2-D correlation matrix between two discrete dimensions, using colored cells to represent data,
usually on a monochromatic scale. The values of the first dimension appear as the rows of the table while the second dimension is a
column. The color of each cell represents the correlation coefficient between the corresponding pair of variables. This makes correlation
heatmaps ideal for data analysis, since they make patterns easily readable and highlight the differences and variation in the same data.
A correlation heatmap, like a regular heatmap, is assisted by a colorbar, making data easily readable and comprehensible.

Note: The data here has to be passed through the corr() method to generate a correlation heatmap. In older versions of pandas, corr()
silently dropped non-numeric columns; in pandas 2.0 and later, pass numeric_only=True to exclude them.

Python Matplotlib Correlation Heatmap

Python3

import matplotlib.pyplot as plt

import pandas as pd

df = pd.read_csv("Iris.csv")

# numeric_only=True excludes the non-numeric Species column
plt.imshow(df.corr(numeric_only=True), cmap='autumn', interpolation='nearest')

plt.title("Heat Map")
plt.show()

Output:

Heatmap using matplotlib library

For more information on data visualization refer to our below tutorials –

Data Visualization using Matplotlib

Data Visualization with Python Seaborn
Data Visualisation in Python using Matplotlib and Seaborn
Using Plotly for Interactive Data Visualization in Python
Interactive Data Visualization with Bokeh

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a technique for analyzing data using visual techniques. With this technique, we can get
detailed information about the statistical summary of the data. We will also be able to deal with duplicate values and outliers, and
see some trends or patterns present in the dataset.

Note: We will be using Iris Dataset.

Getting Information about the Dataset

We will use the shape attribute to get the shape of the dataset.

Shape of Dataframe

Python3

df.shape

Output:

(150, 6)

We can see that the dataframe contains 6 columns and 150 rows.

Now, let’s also look at the columns and their data types. For this, we will use the info() method.

Information about Dataset

Python3

df.info()

Output:

information about the dataset

We can see that only one column has categorical data and all the other columns are of the numeric type with non-Null entries.

Let’s get a quick statistical summary of the dataset using the describe() method. The describe() function applies basic statistical
computations on the dataset, such as extreme values, count of data points, and standard deviation. Any missing value or NaN value is
automatically skipped. The describe() function gives a good picture of the distribution of data.

Description of dataset

Python3

df.describe()

Output:

Description about the dataset

We can see the count of each column along with their mean value, standard deviation, minimum and maximum values.

Checking Missing Values

We will check if our data contains any missing values or not. Missing values can occur when no information is provided for one or
more items or for a whole unit. We will use the isnull() method.

python code for missing value

Python3

df.isnull().sum()

Output:

Missing values in the dataset

We can see that no column has any missing value.

Checking Duplicates

Let’s see if our dataset contains any duplicates or not. The Pandas drop_duplicates() method helps in removing duplicates from the data
frame. Here it is applied to the Species column, which keeps the first row of each species.
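
As a complementary sketch, exact duplicate rows can be counted before anything is removed. The frame below is hypothetical and only borrows the Iris column names; duplicated() flags rows that repeat an earlier row in every column.

```python
import pandas as pd

# Hypothetical rows; the first two are identical on purpose
df_demo = pd.DataFrame({
    "SepalLengthCm": [5.1, 5.1, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-virginica"],
})

# Count rows that repeat an earlier row in every column
n_duplicates = df_demo.duplicated().sum()

# Remove the full-row duplicates
deduped = df_demo.drop_duplicates()

# Distinct values of a single column
species = df_demo["Species"].unique()

print(n_duplicates)
print(deduped)
print(species)
```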

Pandas function for dropping duplicates

Python3

data = df.drop_duplicates(subset="Species")

data

Output:

Dropping duplicate value in the dataset

We can see that there are only three unique species. Let’s see if the dataset is balanced or not, i.e. whether all the species contain
equal numbers of rows. We will use the value_counts() function, which returns a Series containing counts of unique
values.

Python code for value counts in the column

Python3

df.value_counts("Species")

Output:

value count in the dataset

We can see that all the species contain an equal number of rows, so we should not delete any entries.

Relation between variables

We will see the relationship between the sepal length and sepal width and also between petal length and petal width.

Comparing Sepal Length and Sepal Width

Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',
hue='Species', data=df, )

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:

Scatter plot using seaborn library

From the above plot, we can infer that –

Species Setosa has smaller sepal lengths but larger sepal widths.
Versicolor Species lies in the middle of the other two species in terms of sepal length and width
Species Virginica has larger sepal lengths but smaller sepal widths.

Comparing Petal Length and Petal Width

Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm',
hue='Species', data=df, )

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:

Scatter plot of petal length and petal width

From the above plot, we can infer that –

Species Setosa has the smallest petal lengths and widths.
Species Versicolor lies between the other two species in terms of petal length and width.
Species Virginica has the largest petal lengths and widths.

Let’s plot the relationships between all pairs of columns using a pairplot, which is useful for multivariate analysis.

Python code for pairplot

Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df.drop(['Id'], axis=1),
             hue='Species', height=2)

plt.show()

Output:

Pairplot for the dataset

We can see many types of relationships in this plot. For example, the species Setosa has the smallest petal widths and lengths. It also has the smallest sepal lengths but larger sepal widths. Similar information can be gathered about the other species.
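
Observations read from a plot can be cross-checked with a per-species summary. The following is a minimal sketch using groupby() on a hypothetical frame; the name toy and the measurements are made up for illustration and are not rows from Iris.csv.

```python
import pandas as pd

# Hypothetical measurements (made up), two rows per species.
toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-virginica", "Iris-virginica"],
    "PetalLengthCm": [1.4, 1.6, 4.5, 5.5],
    "PetalWidthCm": [0.2, 0.4, 1.5, 1.7],
})

# Mean of each measurement within each species.
summary = toy.groupby("Species")[["PetalLengthCm", "PetalWidthCm"]].mean()

print(summary)
```

On the real dataset the same call could be run on df with all four measurement columns to compare species side by side.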

Handling Correlation

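Pandas DataFrame.corr() computes the pairwise correlation of the columns in the dataframe. Any NA values are automatically excluded. Non-numeric columns are ignored only when numeric_only=True is passed; from pandas 2.0 onward the default is numeric_only=False, so a text column such as Species raises an error unless it is dropped or the flag is set.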

Example:

Python3

data.corr(method='pearson', numeric_only=True)

Output:

correlation between columns in the dataset
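
To make the behaviour of corr() concrete, here is a small sketch on a hypothetical frame with one text column. The column names and values are made up, and the numeric_only argument assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical data (made up): y rises with x, z falls with x.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
    "label": ["a", "b", "a", "b"],  # text column, not correlatable
})

# numeric_only=True skips the text column instead of raising an error.
corr = toy.corr(method="pearson", numeric_only=True)

print(corr)
```

The result is a square table indexed by the numeric column names, with values between -1 and 1 and ones on the diagonal.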

Heatmaps

A heatmap is a data visualization technique that represents the values of a two-dimensional matrix as colors. Here it is used to show the correlation between all numerical variables in the dataset. In simpler terms, we can plot the correlation matrix found above as a heatmap.

Python code for heatmap

Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

# Correlate the numeric columns only, then drop the Id row and column
sns.heatmap(df.corr(method='pearson', numeric_only=True).drop(
    ['Id'], axis=1).drop(['Id'], axis=0),
    annot=True)

plt.show()

Output:

Heatmap for correlation in the dataset

From the above graph, we can see that –

Petal width and petal length have high correlations.
Petal length and sepal width have good correlations.
Petal width and sepal length have good correlations.
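
Instead of reading pairs off the heatmap by eye, the correlation matrix can be ranked programmatically. The sketch below assumes a hypothetical frame with made-up columns a, b and c; it lists each pair of columns once, ordered by the absolute value of the correlation.

```python
from itertools import combinations

import pandas as pd

# Hypothetical data (made up): b is a multiple of a, c mostly falls as a rises.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr(method="pearson")

# Visit each unordered pair of columns once and sort by |correlation|.
pairs = sorted(
    ((first, second, corr.loc[first, second])
     for first, second in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for first, second, value in pairs:
    print(f"{first} vs {second}: {value:.2f}")
```

Sorting by absolute value keeps strong negative relationships near the top, which a quick look at heatmap colors can miss.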

Handling Outliers

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing them works the same way as removing any other data item from a pandas dataframe.

Let’s consider the iris dataset and let’s plot the boxplot for the SepalWidthCm column.

Python code for boxplot

Python3

# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

sns.boxplot(x='SepalWidthCm', data=df)

plt.show()

Output:

Boxplot for sepalwidth column

In the above graph, the values above 4 and below 2 are acting as outliers.
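
The points a boxplot draws beyond its whiskers are, with the default settings, the values lying more than 1.5 times the interquartile range outside the quartiles. As a rough sketch, the same fences can be computed directly; the Series values below is hypothetical and its numbers are made up, not taken from the SepalWidthCm column.

```python
import pandas as pd

# Hypothetical widths (made up) with one low and one high extreme value.
values = pd.Series([2.0, 2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence)
print(outliers)
```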

Removing Outliers

To remove an outlier, follow the same process as removing any other entry from the dataset by its exact position. This works because every detection method ends with a list of the data items that satisfy that method’s definition of an outlier.

We will detect the outliers using the IQR (interquartile range, Q3 - Q1) and then remove them. We will also draw the boxplot to check whether the outliers have been removed.

Python3

# Importing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

# IQR (the 'method' keyword needs NumPy >= 1.22; older versions call it 'interpolation')
Q1 = np.percentile(df['SepalWidthCm'], 25,
                   method='midpoint')

Q3 = np.percentile(df['SepalWidthCm'], 75,
                   method='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Row positions at or above the upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))

# Row positions at or below the lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))

# Removing the outliers (positions match labels because the index is the default 0..n-1)
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)

plt.show()

Output:

boxplot using seaborn library
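
Dropping rows by position relies on the dataframe having its default integer index. An alternative sketch, assuming a hypothetical helper named drop_iqr_outliers and made-up data, filters with a boolean mask so it works with any index. Note that between() keeps values lying exactly on a bound, whereas the code above treats them as outliers.

```python
import pandas as pd


def drop_iqr_outliers(frame, column, k=1.5):
    """Return the rows of frame whose column value lies within k * IQR of the quartiles.

    Hypothetical helper written for illustration; it is not part of pandas.
    """
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # Boolean mask: True for rows inside the fences (bounds included).
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]


# Hypothetical data (made up) with a non-default, letter-based index.
toy = pd.DataFrame(
    {"SepalWidthCm": [2.0, 2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4]},
    index=list("abcdefgh"),
)

cleaned = drop_iqr_outliers(toy, "SepalWidthCm")

print("Old Shape: ", toy.shape)
print("New Shape: ", cleaned.shape)
```

Because the function returns a new frame instead of modifying the original in place, the unfiltered data stays available for comparison.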

For more information about EDA, refer to our below tutorials –

What is Exploratory Data Analysis ?

Exploratory Data Analysis in Python | Set 1
Exploratory Data Analysis in Python | Set 2
Exploratory Data Analysis on Iris Dataset

Last Updated: 23 May, 2023


Article Contributed By: GeeksforGeeks

Improved By: surindertarika1234, ruhelaa48, abhishek0719kadiyan, suryapra400t
