Data Science - Unit II
• The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis". The library was created by Wes McKinney in 2008.
Features of Pandas
• Pandas provides fast, flexible data structures, such as DataFrames,
which are designed to work with structured data very easily and
intuitively.
• Pandas (Python data analysis) is a must in the data science life cycle.
• It is one of the most popular and widely used Python libraries for data
science, along with NumPy and Matplotlib.
Pandas Major Applications
Example
• Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# load the dictionary into a DataFrame object
df = pd.DataFrame(data)
print(df)
• Result
   calories  duration
0       420        50
1       380        40
2       390        45
Named Indexes
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# the index argument gives each row a name
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
• Result
      calories  duration
day1       420        50
day2       380        40
day3       390        45
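Once the rows are named, a single row can be looked up by its label. The following is a minimal sketch, assuming the same `data` dictionary and the labels `day1` to `day3` shown in the result above; `df.loc` is the standard pandas label-based accessor.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Pass the row names through the index argument.
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Look up one row by its label; the result is a pandas Series.
day2 = df.loc["day2"]
print(day2)
```

Here `day2["calories"]` holds the calories value of the row named `day2`.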
Let us assume that we are creating a data
frame with student’s data.
Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Its output is as follows:
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
# importing the pandas library
import pandas as pd
# creating a dataframe object
student_register = pd.DataFrame()
# assigning values to the
# rows and columns of the
# dataframe
student_register['Name'] = ['Abhijit',
'Smriti',
'Akash',
'Roshni']
student_register['Age'] = [20, 19, 20, 14]
student_register['Student'] = [False, True,
True, False]
student_register
Add a new student to the data frame
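One possible way to add a student is to build a one-row DataFrame and stack it under the register with `pd.concat`. This is only a sketch: the new student's values ("Rahul", 21, True) and the name `new_student` are made up for illustration.

```python
import pandas as pd

student_register = pd.DataFrame({
    "Name": ["Abhijit", "Smriti", "Akash", "Roshni"],
    "Age": [20, 19, 20, 14],
    "Student": [False, True, True, False],
})

# Hypothetical new student; the values are illustrative only.
new_student = pd.DataFrame([{"Name": "Rahul", "Age": 21, "Student": True}])

# ignore_index=True renumbers the rows 0..4 after the two frames are stacked.
student_register = pd.concat([student_register, new_student], ignore_index=True)
print(student_register)
```

`pd.concat` returns a new DataFrame, so the result is assigned back to `student_register`.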
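Syntax : DataFrame.to_markdown()
import pandas as pd
# data frame rebuilt from the output shown below
df = pd.DataFrame({"A": [1, 2, 3], "B": [1.1, 2.2, 3.3]}, index=["a", "a", "b"])
gfg = df.to_markdown()
print(gfg)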
Output :
|    |   A |   B |
|:---|----:|----:|
| a  |   1 | 1.1 |
| a  |   2 | 2.2 |
| b  |   3 | 3.3 |
Example 2
# import pandas
import pandas as pd
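# data frame rebuilt from the output shown below
df = pd.DataFrame({"A": [3, 4, 5], "B": ["c", "d", "e"]}, index=["I", "II", "III"])
gfg = df.to_markdown()
print(gfg)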
Output :
|     |   A | B   |
|:----|----:|:----|
| I   |   3 | c   |
| II  |   4 | d   |
| III |   5 | e   |
Add a new column in Pandas Data Frame
Using a Dictionary
Pandas is a Python library used for data analysis and manipulation.
There are several ways to add a new column to a data frame. Here we
add a new column by using a dictionary.
Let's take an example!
# Python program to illustrate
# Add a new column in Pandas
import pandas as pd
data_frame = pd.DataFrame([[i] for i in range(7)], columns =['data'])
print (data_frame)
Output:
   data
0     0
1     1
2     2
3     3
4     4
5     5
6     6
Map function: add a column "new_data_1" that holds the weekday name for
each value in the column named "data".
Call map() and pass the dictionary. It looks up each value as a key and
returns the associated value for that key.
# Python program to illustrate
# Add a new column in Pandas
# Data Frame Using a Dictionary
import pandas as pd
data_frame = pd.DataFrame([[i] for i in range(7)], columns =['data'])
# Introducing weeks as dictionary
weeks = {0:'Sunday', 1:'Monday', 2:'Tuesday', 3:'Wednesday',
4:'Thursday', 5:'Friday', 6:'Saturday'}
# Mapping the dictionary keys to the data frame.
data_frame['new_data_1'] = data_frame['data'].map(weeks)
print (data_frame)
Output:
   data new_data_1
0     0     Sunday
1     1     Monday
2     2    Tuesday
3     3  Wednesday
4     4   Thursday
5     5     Friday
6     6   Saturday
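The lookup only succeeds for values that are keys in the dictionary. The sketch below shows what happens otherwise, assuming a small data frame with a value (9) that has no weekday. The column name `new_data_2` and the placeholder "unknown" are hypothetical choices for illustration.

```python
import pandas as pd

data_frame = pd.DataFrame({"data": [0, 6, 9]})
weeks = {0: "Sunday", 1: "Monday", 2: "Tuesday", 3: "Wednesday",
         4: "Thursday", 5: "Friday", 6: "Saturday"}

# 9 is not a key in weeks, so map() leaves NaN in that row.
data_frame["new_data_1"] = data_frame["data"].map(weeks)

# Optionally replace the failed lookups with a placeholder label.
data_frame["new_data_2"] = data_frame["new_data_1"].fillna("unknown")
print(data_frame)
```

Checking for NaN after map() is a simple way to spot values that the dictionary does not cover.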
import pandas as pd
data = pd.DataFrame({"x1":["x", "y", "x", "y", "x", "x"],  # Create pandas DataFrame
                     "x2":range(15, 21),
                     "x3":["a", "b", "c", "d", "e", "f"],
                     "x4":range(20, 8, - 2)})
print(data)
Example 1: Remove Column from pandas DataFrame
data_drop = data.drop("x3", axis = 1)  # Drop variable from DataFrame
print(data_drop)  # Print updated DataFrame
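Example 2: Add New Column to pandas DataFrame
x5 = ["foo", "bar", "foo", "bar", "foo", "bar"]  # Create list
print(x5)  # Print list
data_add = data.assign(x5 = x5)  # Add new column
print(data_add)  # Print updated DataFrame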
import numpy as np
arr = np.array( [[ 1, 2, 3], [ 4, 2, 5]] )
print("Array is of type: ", type(arr))
print("No. of dimensions: ", arr.ndim)
print("Shape of array: ", arr.shape)
print("Size of array: ", arr.size)
print("Array stores elements of type: ", arr.dtype)
• Output :
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64
• NumPy is the widely used array-processing library in Python.
# transpose of array
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])
print("Original array:\n", a)
print("Transpose of array:\n", a.T)
NumPy arrays provide many unary operations as methods, including sum,
min and max. These functions can also be applied row-wise or
column-wise by setting an axis parameter.
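A short sketch of the axis parameter, using a small example array chosen only for illustration: `axis=1` works along each row and `axis=0` works along each column.

```python
import numpy as np

arr = np.array([[1, 5, 6],
                [4, 7, 2],
                [3, 1, 9]])

print("Largest element:", arr.max())            # over the whole array
print("Row-wise maximum:", arr.max(axis=1))     # one value per row
print("Column-wise minimum:", arr.min(axis=0))  # one value per column
print("Sum of all elements:", arr.sum())
```

Without an axis argument the operation runs over every element and returns a single number.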
# Python program to demonstrate
# binary operators in numpy
import numpy as np
a = np.array([[1, 2],
[3, 4]])
b = np.array([[4, 3],
[2, 1]])
# add arrays
print ("Array sum:\n", a + b)
# multiply arrays (elementwise multiplication)
print ("Array multiplication:\n", a*b)
# matrix multiplication
print ("Matrix multiplication:\n", a.dot(b))
Output:
Array sum:
[[5 5]
[5 5]]
Array multiplication:
[[4 6]
[6 4]]
Matrix multiplication:
[[ 8 5]
[20 13]]
SciPy Introduction
• SciPy is a scientific computation library that uses NumPy underneath.
• It provides more utility functions for optimization, stats and signal processing.
• If SciPy uses NumPy underneath, why can we not just use NumPy?
• SciPy has optimized and added functions that are frequently used in
NumPy and Data Science.
Constants in SciPy
• These constants can be helpful when you are working with Data
Science.
Unit Categories
The units are placed under these categories:
• Metric
• Binary
• Mass
• Angle
• Time
• Length
• Pressure
• Volume
• Speed
• Temperature
• Energy
• Power
• Force
Mass
• Return the specified unit in kg (e.g. gram returns 0.001)
from scipy import constants
print(constants.gram) #0.001
print(constants.metric_ton) #1000.0
print(constants.grain) #6.479891e-05
print(constants.lb) #0.45359236999999997
print(constants.pound) #0.45359236999999997
print(constants.oz) #0.028349523124999998
print(constants.ounce) #0.028349523124999998
print(constants.stone) #6.3502931799999995
print(constants.long_ton) #1016.0469088
Time:
Return the specified unit in seconds (e.g. hour returns 3600.0)
from scipy import constants
print(constants.minute) #60.0
print(constants.hour) #3600.0
print(constants.day) #86400.0
print(constants.week) #604800.0
print(constants.year) #31536000.0
print(constants.Julian_year) #31557600.0
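Because each constant is the size of one unit expressed in SI units, it can be used directly as a conversion factor. A minimal sketch, with made-up input values (150 pounds, 7200 seconds):

```python
from scipy import constants

# Multiplying by a constant converts into SI units (here: pounds to kg).
weight_lb = 150
weight_kg = weight_lb * constants.lb
print(weight_kg)

# Dividing by a constant converts out of SI units (here: seconds to hours).
duration_s = 7200
duration_h = duration_s / constants.hour
print(duration_h)
```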
SciPy Sparse Data
What is Sparse Data
Sparse data is data that has mostly unused elements (elements that
don't carry any information). It can be an array like this one:
[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
Sparse data: a data set where most of the item values are zero.
Dense array: the opposite of a sparse array; most of the values are
not zero.
How to Work With Sparse Data
SciPy has a module, scipy.sparse that provides functions to deal with sparse data.
CSC - Compressed Sparse Column. For efficient arithmetic, fast column slicing.
CSR - Compressed Sparse Row. For fast row slicing, faster matrix vector products
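To make the two formats more concrete, the sketch below looks at the three arrays that a CSR matrix stores internally. The 3x3 input array is an assumed example; `data`, `indices` and `indptr` are attributes of SciPy's compressed sparse matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([[0, 0, 0],
                [0, 0, 1],
                [1, 0, 2]])
mat = csr_matrix(arr)

print(mat.data)     # the non-zero values, stored row by row
print(mat.indices)  # the column index of each stored value
print(mat.indptr)   # where each row starts inside data and indices

# The same matrix in CSC form stores its values column by column.
csc = mat.tocsc()
print(csc.indices)  # the row index of each stored value
print(csc.indptr)   # where each column starts
```

Only the three non-zero values are stored, which is why row slicing is fast in CSR and column slicing is fast in CSC.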
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2])  # array matching the result below
print(csr_matrix(arr))
Result
(0, 5) 1
(0, 6) 1
(0, 8) 2
From the result we can see that there are 3 items with value.
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 2]])  # example array (assumed values)
print(csr_matrix(arr).count_nonzero())
Removing zero-entries from the matrix with
the eliminate_zeros() method:
Example
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 2]])  # example array (assumed values)
mat = csr_matrix(arr)
mat.eliminate_zeros()
print(mat)
SciPy Interpolation
• The function interp1d() is used to interpolate a distribution with 1
variable.
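from scipy.interpolate import interp1d
import numpy as np
xs = np.arange(10)
ys = 2*xs + 1
interp_func = interp1d(xs, ys)
# evaluate the interpolation at new x-values (this range is an assumed example)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)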
Result:
SciPy provides us with the module scipy.sparse.csgraph for working with
graph data structures.
Adjacency Matrix
An adjacency matrix is an n x n matrix, where n is the number of elements in a graph.
Each value represents the connection between two elements.
    A B C
A:[0 1 2]
B:[1 0 0]
C:[2 0 0]
Connected Components
Find all of the connected components with the
connected_components() method.
Example
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.sparse import csr_matrix
arr = np.array([
[0, 1, 2],
[1, 0, 0],
[2, 0, 0]
])
newarr = csr_matrix(arr)
print(connected_components(newarr))
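`connected_components()` returns two values: the number of components and an array that labels each node with its component. The sketch below unpacks both for the graph above, then repeats the call on a second graph with an extra isolated node D. The second graph is an assumption added for contrast.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Same graph as above: A is linked to B and to C.
graph = csr_matrix(np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0],
]))
n_components, labels = connected_components(graph)
print(n_components, labels)

# Hypothetical variant: a fourth node D with no connections at all.
graph_with_d = csr_matrix(np.array([
    [0, 1, 2, 0],
    [1, 0, 0, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 0],
]))
n_components_d, labels_d = connected_components(graph_with_d)
print(n_components_d, labels_d)
```

Nodes that share a label belong to the same component, so the isolated node receives a label of its own.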
SciPy Spatial Data
Working with Spatial Data
Spatial data refers to data that is represented in a geometric space.
SciPy provides us with the module scipy.spatial, which has functions for working with
spatial data.
Example
Create a triangulation from following points:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt
points = np.array([
[2, 4],
[3, 4],
[3, 0],
[2, 2],
[4, 1]
])
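simplices = Delaunay(points).simplices
# draw the triangles and mark the input points
plt.triplot(points[:, 0], points[:, 1], simplices)
plt.scatter(points[:, 0], points[:, 1], color='r')
plt.show()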
Matplotlib
Import Matplotlib
Once Matplotlib is installed, import it in your applications by adding the
import module statement:
import matplotlib
Example
import matplotlib
print(matplotlib.__version__)
Matplotlib Pyplot
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, and
are usually imported under the plt alias:
import matplotlib.pyplot as plt
Now the Pyplot package can be referred to as plt.
Example
Draw a line in a diagram from position (0,0) to position (6,250):
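A minimal sketch of this example: the x-coordinates of both end points go in one array and the y-coordinates in another. The `Agg` backend line is an assumption so that the snippet also runs without a display. In an interactive session, leave it out and call `plt.show()` at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 6])    # x-coordinates of the two end points
ypoints = np.array([0, 250])  # y-coordinates of the two end points

# plot() returns a list of line objects; there is exactly one here.
line, = plt.plot(xpoints, ypoints)
```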
• If we need to plot a line from (1, 3) to (8, 10), we have to pass two
arrays [1, 8] and [3, 10] to the plot function.
Example
Draw a line in a diagram from position (1, 3) to position (8, 10):
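import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
# draw two lines by calling plt.plot() once for each line
y1 = np.array([3, 8, 1, 10])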
y2 = np.array([6, 2, 7, 11])
plt.plot(y1)
plt.plot(y2)
plt.show()
Line Width
You can use the keyword argument linewidth or the shorter lw to
change the width of the line.
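A short sketch of both spellings of the argument. The y-values and the widths 4.5 and 1.0 are arbitrary illustration values, and the `Agg` backend line is an assumption so that the snippet runs without a display. In an interactive session, leave it out and call `plt.show()` at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

# linewidth and lw are two names for the same keyword argument.
thick_line, = plt.plot(ypoints, linewidth=4.5)
thin_line, = plt.plot(ypoints + 2, lw=1.0)

print(thick_line.get_linewidth(), thin_line.get_linewidth())
```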
import matplotlib.pyplot as plt
import numpy as np
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two
arrays of the same length, one for the values of the x-axis, and one for
values on the y-axis:
Example
A simple scatter plot:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
Matplotlib Pie Charts
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
plt.pie(y)
plt.show()
Sorting
• Sorting is the process of arranging data into meaningful order so that
you can analyze it more effectively.
• For example, you might want to order sales data by calendar month
so that you can produce a graph of sales performance.
• For instance, text data can be sorted into alphabetical order.
What are some examples of data sorting?
Some common examples include sorting alphabetically (A to Z or Z to
A), by value (largest to smallest or smallest to largest), by day of the
week (Mon, Tue, Wed..), or by month names (Jan, Feb..) etc.
What are real-life examples of sorting?
• The contact list in your phone is sorted, which means you can easily
access your desired contact from your phone since the data is
arranged in that manner for you. In other words, “it is sorted”.
• While shopping on Flipkart or Amazon, you sort items based on your
choice, that is, price low to high or high to low.
SORTING
This section discusses how to sort a Pandas DataFrame using various methods in Python.
# importing pandas library
import pandas as pd
# creating and initializing a nested list
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
# creating a pandas dataframe
df = pd.DataFrame(age_list, columns=['Country', 'Year',
'Population', 'Continent'])
df
OUTPUT
In order to sort the data frame in pandas, function sort_values() is
used.
Pandas sort_values() can sort the data frame in Ascending or
Descending order.
# Sorting by column 'Country'
df.sort_values(by=['Country'])
Sorting the data frame by several columns: descending by Country, then ascending by Continent
df.sort_values(by=['Country', 'Continent'],
ascending=[False, True])
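The sketch below shows the `ascending` argument on a reduced version of the table, with only four rows and three columns kept to make the result easy to follow. `sort_values()` returns a new DataFrame and leaves `df` unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Brazil", "China", "France", "India"],
    "Year": [1962, 1957, 1957, 1952],
    "Population": [76039390, 637408000, 44310863, 372000000],
})

# Descending order: largest population first.
by_population = df.sort_values(by="Population", ascending=False)
print(by_population)

# Two keys: oldest year first, ties in Year broken by Country from A to Z.
by_year = df.sort_values(by=["Year", "Country"], ascending=[True, True])
print(by_year)
```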
Grouping
• Grouping data is the process of organizing data into related sets.
• Grouping data can be helpful for data analysis and for understanding
patterns in data
What does grouping mean in data?
• Grouped data means the data (or information) given in the form of
class intervals such as 0-20, 20-40 and so on.
• Ungrouped data is defined as the data given as individual points (i.e.
values or numbers) such as 15, 63, 34, 20, 25, and so on.
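As a sketch of the difference, the ungrouped values above can be turned into grouped data with `pandas.cut()`. The class boundaries 40-60 and 60-80 are assumed here to cover the value 63. By default each interval includes its upper boundary, so 20 is counted in the 0-20 class.

```python
import pandas as pd

values = pd.Series([15, 63, 34, 20, 25])  # the ungrouped values from the text

# Class boundaries 0-20, 20-40, 40-60, 60-80 (the last two are assumed).
classes = pd.cut(values, bins=[0, 20, 40, 60, 80])

# Count how many values fall into each class interval.
frequency = classes.value_counts().sort_index()
print(frequency)
```

The result is a frequency table with one row per class interval, including intervals that contain no values.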
Purpose of grouping data
• Data is grouped so that it becomes understandable and can be
interpreted.
• Grouped data is helpful to make calculations of certain values which
will help in describing and analyzing the data
Aggregation and Grouping
Example2
Pandas DataFrame.groupby()
• In Pandas, the groupby() function lets us split and rearrange data,
which is useful on real-world data sets.
• Its primary task is to split the data into various groups.
• These groups are categorized based on some criteria.
• The objects can be split along any of their axes.
This operation consists of the following steps for aggregating/grouping
the data:
• Splitting datasets
• Analyzing data
• Aggregating or combining data
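We can also apply a function to each subset. The applied function can
perform one of the following operations: aggregation (computing a
summary statistic), transformation (a group-specific operation) or
filtration (discarding groups based on a condition).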
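import pandas as pd
df = pd.DataFrame({'Course': ['B.Sc', 'B.Ed', 'M.Phill', 'BA'], 'Percentage': [82, 98, 91, 87]})  # rows rebuilt from the output below
grouped = df.groupby('Course')
print(grouped['Percentage'].agg('mean'))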
Output
Course
B.Ed       98.0
B.Sc       82.0
BA         87.0
M.Phill    91.0
Name: Percentage, dtype: float64
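The sketch below runs an aggregation, a transformation and a filtration on the same groups. The data is hypothetical (two students per course), and the column name `Diff` and the threshold of 70 are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: two students per course.
df = pd.DataFrame({
    "Course": ["B.Sc", "B.Sc", "BA", "BA"],
    "Percentage": [80, 90, 60, 70],
})
grouped = df.groupby("Course")["Percentage"]

# Aggregation: one summary row per group.
summary = grouped.agg(["mean", "max"])
print(summary)

# Transformation: one value per original row (distance from the group mean).
df["Diff"] = df["Percentage"] - grouped.transform("mean")
print(df)

# Filtration: keep only the groups whose mean is above 70.
high = df.groupby("Course").filter(lambda g: g["Percentage"].mean() > 70)
print(high)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration drops whole groups.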
Advantage of grouping
Properly structured, group projects can reinforce skills that are relevant
to both group and individual work, including the ability to:
• Break complex tasks into parts and steps.
• Plan and manage time.
• Refine understanding through discussion and explanation.
Plotting
# OutPut:
Courses Fee Duration
0 Spark 22000.0 30days