Python 5 Unit
Python 5 Unit
It also has functions for working in domain of linear algebra, fourier transform, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.
In Python we have lists that serve the purpose of arrays, but they are slow to process.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray, it provides a lot of supporting functions that make
working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.
NumPy arrays are stored at one continuous place in memory unlike lists, so processes can access and
manipulate them very efficiently.
This is the main reason why NumPy is faster than lists. Also it is optimized to work with latest CPU
architectures.
NumPy is a Python library and is written partially in Python, but most of the parts that require fast
computation are written in C or C++.
Installation of NumPy
If you have Python and PIP already installed on a system, then installation of
NumPy is very easy.
import numpy
Example
import numpy
print(arr)
output: [1 2 3 4 5]
NumPy as np
NumPy is usually imported under the np alias.
import numpy as np
Example
import numpy as np
print(arr)
output: [1 2 3 4 5]
Example
import numpy as np
print(np.__version__)
import numpy as np
print(arr)
print(type(arr))
type(): This built-in Python function tells us the type of the object passed to it. Like in above code it
shows that arr is numpy.ndarray type.
To create an ndarray, we can pass a list, tuple or any array-like object into the array() method, and
it will be converted into an ndarray:
Example
import numpy as np
print(arr)
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array
is a 0-D array.
Example
import numpy as np
arr = np.array(42)
print(arr)
output: 42
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
Example
import numpy as np
print(arr)
output: [1 2 3 4 5]
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
NumPy has a whole sub module dedicated towards matrix operations called numpy.mat
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
print(arr)
output:
[[1 2 3]
[4 5 6]]
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called 3-D array.
These are often used to represent a 3rd order tensor.
Example
Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
output:
[[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]]]
NumPy Arrays provides the ndim attribute that returns an integer that tells us how many dimensions
the array have.
Example
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
output:
3
Higher Dimensional Arrays
When the array is created, you can define the number of dimensions by using the ndmin argument.
Example
import numpy as np
print(arr)
number of dimensions : 5
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.
Example
import numpy as np
print(arr[0])
output:
Example
import numpy as np
print(arr[1])
output:
2
Example
Get third and fourth elements from the following array and add them.
import numpy as np
print(arr[2] + arr[3])
output:7
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
Think of 2-D arrays like a table with rows and columns, where the dimension represents the row
and the index represents the column.
Example
import numpy as np
Example
import numpy as np
Output:
To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.
Example
Access the third element of the second array of the first array:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
Explanation:
The first number represents the first dimension, which contains two arrays:
and:
The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
[4, 5, 6]
The third number represents the third dimension, which contains three values:
Negative Indexing
Example
import numpy as np
Slicing arrays
Slicing in python means taking elements from one given index to another
given index.
Example
Slice elements from index 1 to index 5 from the following array:
import numpy as np
print(arr[1:5])
output: [2 3 4 5]
Note: The result includes the start index, but excludes the end index.
Example
Slice elements from index 4 to the end of the array:
import numpy as np
print(arr[4:])
output: [5 6 7]
Example
Slice elements from the beginning to index 4 (not included):
import numpy as np
print(arr[:4])
output: [1 2 3 4]
• strings - used to represent text data, the text is given under quote
marks. e.g. "ABCD"
• integer - used to represent integer numbers. e.g. -1, -2, -3
• float - used to represent real numbers. e.g. 1.2, 42.42
• boolean - used to represent True or False.
• complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 +
2.5j
Below is a list of all data types in NumPy and the characters used to
represent them.
• i - integer
• b - boolean
• u - unsigned integer
• f - float
• c - complex float
• m - timedelta
• M - datetime
• O - object
• S - string
• U - unicode string
• V - fixed chunk of memory for other type ( void )
import numpy as np
print(arr.dtype)
output:
int64
Example
Get the data type of an array containing strings:
import numpy as np
print(arr.dtype)
output: <U6
We use the array() function to create arrays, this function can take an optional argument: dtype that
allows us to define the expected data type of the array elements:
Example
import numpy as np
print(arr)
print(arr.dtype)
output:
|S1
Creating Arrays With a Defined Data Type
We use the array() function to create arrays, this function can take an
optional argument: dtype that allows us to define the expected data type of
the array elements:
Example
Create an array with data type string:
import numpy as np
print(arr)
print(arr.dtype)
output
[b'1' b'2' b'3' b'4']
|S1
Example
Create an array with data type 4 bytes integer:
import numpy as np
print(arr)
print(arr.dtype)
output
[1 2 3 4]
int32
Example
A non integer string like 'a' can not be converted to integer (will raise an
error):
import numpy as np
The astype() function creates a copy of the array, and allows you to specify
the data type as a parameter.
The data type can be specified using a string, like 'f' for float, 'i' for integer
etc. or you can use the data type directly like float for float and int for
integer.
Example
Change data type from float to integer by using 'i' as parameter value:
import numpy as np
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
output:
[1 2 3]
int32
Example
Change data type from float to integer by using int as parameter value:
import numpy as np
newarr = arr.astype(int)
print(newarr)
print(newarr.dtype)
output:
[1 2 3]
int64
Example
Change data type from integer to boolean:
import numpy as np
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)
output:
Bool
In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.
We pass a sequence of arrays that we want to join to the concatenate() function, along with the axis.
If axis is not explicitly passed, it is taken as 0.
Example
import numpy as np
print(arr)
output:
[1 2 3 4 5 6]
Example
import numpy as np
print(arr)
output:
[[1 2 5 6]
[3 4 7 8]]
If axis=0
[[1 2]
[3 4]
[5 6]
[7 8]]
Stacking is same as concatenation, the only difference is that stacking is done along a new axis.
We can concatenate two 1-D arrays along the second axis which would result in putting them one
over the other, ie. stacking.
We pass a sequence of arrays that we want to join to the stack() method along with the axis. If axis is
not explicitly passed it is taken as 0.
Example
import numpy as np
print(arr)
output:
[[1 4]
[2 5]
[3 6]]
If axis=0
[[1 2 3]
[4 5 6]]
Joining merges multiple arrays into one and Splitting breaks one array into multiple.
We use array_split() for splitting arrays, we pass it the array we want to split and the number of
splits.
Example
import numpy as np
newarr = np.array_split(arr, 3)
print(newarr)
output:
import numpy as np
newarr = np.array_split(arr, 2)
print(newarr)
output:
In 4 parts
import numpy as np
newarr = np.array_split(arr, 4)
print(newarr)
output:
In 5 parts
import numpy as np
newarr = np.array_split(arr, 5)
print(newarr)
output:
In 6 parts
import numpy as np
newarr = np.array_split(arr, 6)
print(newarr)
output:
Note: We also have the method split() available but it will not adjust the elements when elements
are less in source array for splitting like in example above, array_split() worked properly but split()
would fail.
If you split an array into 3 arrays, you can access them from the result just
like any array element:
Example
Access the splitted arrays:
import numpy as np
newarr = np.array_split(arr, 3)
print(newarr[0])
print(newarr[1])
print(newarr[2])
output:
[1 2]
[3 4]
[5 6]
Splitting 2-D Arrays
Use the array_split() method, pass in the array you want to split and the number of splits you want to
do.
Example
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)
output:
[array([[1, 2],
[11, 12]])]
Example
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12],
[13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3)
print(newarr)
output:
[array([[1, 2, 3],
[4, 5, 6]]), array([[ 7, 8, 9],
[10, 11, 12]]), array([[13, 14, 15],
[16, 17, 18]])]
In addition, you can specify which axis you want to do the split around.
The example below also returns three 2-D arrays, but they are split along the
row (axis=1).
Example
Split the 2-D array into three 2-D arrays along rows.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12],
[13, 14, 15], [16, 17, 18]])
print(newarr)
output:
[array([[ 1],
[ 4],
[ 7],
[10],
[13],
[16]]), array([[ 2],
[ 5],
[ 8],
[11],
[14],
[17]]), array([[ 3],
[ 6],
[ 9],
[12],
[15],
[18]])]
NumPy Searching Arrays
Searching Arrays
You can search an array for a certain value, and return the indexes
that get a match.
To search an array, use the where() method.
Example
Find the indexes where the value is 4:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
output:
(array([3, 5, 6]),)
Example
Find the indexes where the values are even:
import numpy as np
x = np.where(arr%2 == 0)
print(x)
output:
(array([1, 3, 5, 7]),)
Example
Find the indexes where the values are odd:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 1)
print(x)
output:
(array([0, 2, 4, 6]),)
Search Sorted
There is a method called searchsorted() which performs a binary search
in the array, and returns the index where the specified value would be
inserted to maintain the search order.
The searchsorted() method is assumed to be used on sorted arrays.
Example
Find the indexes where the value 7 should be inserted:
import numpy as np
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7)
print(x)
output:
1
Pandas are also able to delete rows that are not relevant, or contains wrong
values, like empty or NULL values. This is called cleaning the data.
Installation of Pandas
If you have Python and PIP already installed on a system, then installation of
Pandas is very easy.
If this command fails, then use a python distribution that already has Pandas
installed like, Anaconda, Spyder etc.
Import Pandas
Once Pandas is installed, import it in your applications by adding the
import keyword:
import pandas
Now Pandas is imported and ready to use.
Example:
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
output:
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
Pandas as pd
Pandas is usually imported under the pd alias.
alias: In Python alias are an alternate name for referring to the same
thing.
Create an alias with the as keyword while importing:
import pandas as pd
Now the Pandas package can be referred to as pd instead of pandas.
Example
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
output:
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type
Example
Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
output:
0 1
1 7
2 2
dtype: int64
Labels
If nothing else is specified, the values are labeled with their index number. First value has
index 0, second value has index 1 etc.
This label can be used to access a specified value.
Example
Return the first value of the Series:
print(myvar[0])
output:
1
Create Labels
With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
output:
x 1
y 7
z 2
dtype: int64
When you have created labels, you can access an item by referring to the label.
Example
Return the value of "y":
print(myvar["y"])
output:
7
Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
Example
Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
output:
day1 420
day2 380
day3 390
dtype: int64
Note: The keys of the dictionary become the labels.
To select only some of the items in the dictionary, use the index argument and specify
only the items you want to include in the Series.
Example
Create a Series using only data from "day1" and "day2":
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
output:
day1 420
day2 380
dtype: int64
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
Example
Create a DataFrame from two Series:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
output:
calories duration
0 420 50
1 380 40
2 390 45
Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table
with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
output:
calories duration
0 420 50
1 380 40
2 390 45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas use the loc attribute to return one or more specified row(s)
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
Result
calories 420
duration 50
Name: 0, dtype: int64
Note: This example returns a Pandas Series.
Example
Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
Result
calories duration
0 420 50
1 380 40
Note: When using [], the result is a Pandas DataFrame.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df.loc[[]])
output:
Empty DataFrame
Columns: [calories, duration]
Index: []
Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Result
calories duration
day1 420 50
day2 380 40
day3 390 45
Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).
Example
Return "day2":
#refer to the named index:
print(df.loc["day2"])
Result
calories 380
duration 40
Name: day2, dtype: int64
Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.
Example
Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
[169 rows x 4 columns]
A simple way to store big data sets is to use CSV files (comma separated
files).
CSV files contains plain text and is a well know format that can be read by
everyone including Pandas.
max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the pd.options.display.max_rows
statement.
Example
Check the number of maximum returned rows:
import pandas as pd
print(pd.options.display.max_rows)
output:
60
max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the pd.options.display.max_rows
statement.
Example
Check the number of maximum returned rows:
import pandas as pd
print(pd.options.display.max_rows)
op: 60
Note: In my system the number is 60, which means that if the DataFrame
contains more than 60 rows, the print(df) statement will return only the
headers and the first and last 5 rows.
Example
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
12 60 106 128 345.3
13 60 104 132 379.3
14 60 98 123 275.0
15 60 98 120 215.2
…
161 45 90 130 260.4
162 45 95 130 270.0
163 45 100 140 280.9
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
Dictionary as JSON
JSON = Python Dictionary
If your JSON code is not in a file, but in a Python Dictionary, you can load it
into a DataFrame directly:
Example
Load a Python Dictionary into a DataFrame:
import pandas as pd
data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df = pd.DataFrame(data)
print(df)
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
Pandas - Analyzing DataFrames
Viewing the Data
One of the most used method for getting a quick overview of the DataFrame, is the head()
method.
The head() method returns the headers and a specified number of rows, starting from the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
Note: if the number of rows is not specified, the head() method will return the top 5 rows.
Example
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the
bottom.
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
output:
Duration Pulse Maxpulse Calories
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
Null Values
The info() method also tells us how many Non-Null values there are present in each column, and in
our data set it seems like there are 164 of 169 Non-Null values in the "Calories" column.
Which means that there are 5 rows with no value at all, in the "Calories" column, for whatever
reason.
Empty values, or Null values, can be bad when analyzing data, and you should consider removing
rows with empty values. This is a step towards what is called cleaning data, and you will learn more
about that in the next chapters.
Pandas - Cleaning Data
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Sample dataset
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
The data set contains some empty cells ("Date" in row 22, and "Calories" in
row 18 and 28).
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
Example
Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Note: By default, the dropna() method returns a new DataFrame, and will not change the original.
If you want to change the original DataFrame, use the inplace = True argument:
Example
Remove all rows with NULL values:
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Note: Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will remove all
rows containing NULL values from the original DataFrame.
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
Example
Replace NULL values with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 130.0
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 130 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 130.0
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Replace Only For Specified Columns
The example above replaces all empty cells in the whole Data Frame.
To only replace empty values for one column, specify the column name for the DataFrame:
Example
Replace NULL values in the "Calories" columns with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df["Calories"].fillna(130, inplace = True)
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 130.0
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 130.0
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Replace Using Mean, Median, or Mode
A common way to replace empty cells, is to calculate the mean, median or mode value of the column.
Pandas uses the mean() median() and mode() methods to calculate the respective values for a
specified column:
Example
Calculate the MEAN, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.10
1 60 '2020/12/02' 117 145 479.00
2 60 '2020/12/03' 103 135 340.00
3 45 '2020/12/04' 109 175 282.40
4 45 '2020/12/05' 117 148 406.00
5 60 '2020/12/06' 102 127 300.00
6 60 '2020/12/07' 110 136 374.00
7 450 '2020/12/08' 104 134 253.30
8 30 '2020/12/09' 109 133 195.10
9 60 '2020/12/10' 98 124 269.00
10 60 '2020/12/11' 103 147 329.30
11 60 '2020/12/12' 100 120 250.70
12 60 '2020/12/12' 100 120 250.70
13 60 '2020/12/13' 106 128 345.30
14 60 '2020/12/14' 104 132 379.30
15 60 '2020/12/15' 98 123 275.00
16 60 '2020/12/16' 98 120 215.20
17 60 '2020/12/17' 100 120 300.00
18 45 '2020/12/18' 90 112 304.68
19 60 '2020/12/19' 103 123 323.00
20 45 '2020/12/20' 97 125 243.00
21 60 '2020/12/21' 108 131 364.20
22 45 NaN 100 119 282.00
23 60 '2020/12/23' 130 101 300.00
24 45 '2020/12/24' 105 132 246.00
25 60 '2020/12/25' 102 126 334.50
26 60 2020/12/26 100 120 250.00
27 60 '2020/12/27' 92 118 241.00
28 60 '2020/12/28' 103 132 304.68
29 60 '2020/12/29' 100 132 280.00
30 60 '2020/12/30' 102 129 380.30
31 60 '2020/12/31' 92 115 243.00
Mean = the average value (the sum of all values divided by number of values).
Example
Calculate the MEDIAN, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"].fillna(x, inplace = True)
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 291.2
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 291.2
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Median = the value in the middle, after you have sorted all values ascending.
Example
Calculate the MODE, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df["Calories"].fillna(x, inplace = True)
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 300.0
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 300.0
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Mode = the value that appears most frequently.
Pandas - Cleaning Data of Wrong Format
Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
To fix it, you have two options: remove the rows, or convert all cells in the columns into the same
format.
Convert Into a Correct Format
In our Data Frame, we have two cells with the wrong format. Check out row 22 and 26, the 'Date'
column should be a string that represents a date:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Let's try to convert all cells in the 'Date' column into dates.
Pandas has a to_datetime() method for this:
Example
Convert to date:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
output:
Duration Date Pulse Maxpulse Calories
0 60 2020-12-01 110 130 409.1
1 60 2020-12-02 117 145 479.0
2 60 2020-12-03 103 135 340.0
3 45 2020-12-04 109 175 282.4
4 45 2020-12-05 117 148 406.0
5 60 2020-12-06 102 127 300.0
6 60 2020-12-07 110 136 374.0
7 450 2020-12-08 104 134 253.3
8 30 2020-12-09 109 133 195.1
9 60 2020-12-10 98 124 269.0
10 60 2020-12-11 103 147 329.3
11 60 2020-12-12 100 120 250.7
12 60 2020-12-12 100 120 250.7
13 60 2020-12-13 106 128 345.3
14 60 2020-12-14 104 132 379.3
15 60 2020-12-15 98 123 275.0
16 60 2020-12-16 98 120 215.2
17 60 2020-12-17 100 120 300.0
18 45 2020-12-18 90 112 NaN
19 60 2020-12-19 103 123 323.0
20 45 2020-12-20 97 125 243.0
21 60 2020-12-21 108 131 364.2
22 45 NaT 100 119 282.0
23 60 2020-12-23 130 101 300.0
24 45 2020-12-24 105 132 246.0
25 60 2020-12-25 102 126 334.5
26 60 2020-12-26 100 120 250.0
27 60 2020-12-27 92 118 241.0
28 60 2020-12-28 103 132 NaN
29 60 2020-12-29 100 132 280.0
30 60 2020-12-30 102 129 380.3
31 60 2020-12-31 92 115 243.0
As you can see from the result, the date in row 26 was fixed, but the empty date in row 22 got a NaT
(Not a Time) value, in other words an empty value. One way to deal with empty values is simply
removing the entire row.
Removing Rows
The result from the converting in the example above gave us a NaT value, which can be handled as a
NULL value, and we can remove the row by using the dropna() method.
Example
Remove rows with a NULL value in the "Date" column:
df.dropna(subset=['Date'], inplace = True)
output:
Duration Date Pulse Maxpulse Calories
0 60 2020-12-01 110 130 409.1
1 60 2020-12-02 117 145 479.0
2 60 2020-12-03 103 135 340.0
3 45 2020-12-04 109 175 282.4
4 45 2020-12-05 117 148 406.0
5 60 2020-12-06 102 127 300.0
6 60 2020-12-07 110 136 374.0
7 450 2020-12-08 104 134 253.3
8 30 2020-12-09 109 133 195.1
9 60 2020-12-10 98 124 269.0
10 60 2020-12-11 103 147 329.3
11 60 2020-12-12 100 120 250.7
12 60 2020-12-12 100 120 250.7
13 60 2020-12-13 106 128 345.3
14 60 2020-12-14 104 132 379.3
15 60 2020-12-15 98 123 275.0
16 60 2020-12-16 98 120 215.2
17 60 2020-12-17 100 120 300.0
18 45 2020-12-18 90 112 NaN
19 60 2020-12-19 103 123 323.0
20 45 2020-12-20 97 125 243.0
21 60 2020-12-21 108 131 364.2
23 60 2020-12-23 130 101 300.0
24 45 2020-12-24 105 132 246.0
25 60 2020-12-25 102 126 334.5
26 60 2020-12-26 100 120 250.0
27 60 2020-12-27 92 118 241.0
28 60 2020-12-28 103 132 NaN
29 60 2020-12-29 100 132 280.0
30 60 2020-12-30 102 129 380.3
31 60 2020-12-31 92 115 243.0
Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if
someone registered "199" instead of "1.99".
Sometimes you can spot wrong data by looking at the data set, because you have an expectation of
what it should be.
If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the other
rows the duration is between 30 and 60.
It doesn't have to be wrong, but taking in consideration that this is the data set of someone's workout
sessions, we conclude with the fact that this person did not work out in 450 minutes.
How can we fix wrong values, like the one for "Duration" in row 7?
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we could
just insert "45" in row 7:
Example
Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 45 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
For small data sets you might be able to replace the wrong data one by one, but not for big data sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for
legal values, and replace any values that are outside of the boundaries.
Example
Loop through all values in the "Duration" column.
If the value is higher than 120, set it to 120:
for x in df.index:
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
output:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 120 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
Removing Rows
Another way of handling wrong data is to remove the rows that contains wrong data.
This way you do not have to find out what to replace them with, and there is a good chance you do
not need them to do your analyses.
Example
Delete rows where "Duration" is higher than 120:
for x in df.index:
if df.loc[x, "Duration"] > 120:
df.drop(x, inplace = True)
Pandas - Removing Duplicates
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
By taking a look at our test data set, we can assume that row 11 and 12 are duplicates.
Plotting
We can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on the
screen.
Example
Import pyplot from Matplotlib and visualize our DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
output:
Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
In the example below we will use "Duration" for the x-axis and "Calories" for
the y-axis.
x = 'Duration', y = 'Calories'
Example
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
plt.show()
Result
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
plt.show()
Result
Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram needs only one column.
A histogram shows us the frequency of each interval, e.g. how many workouts lasted between 50 and
60 minutes?
In the example below we will use the "Duration" column to create the histogram:
Example
df["Duration"].plot(kind = 'hist')
Result
Note: The histogram tells us that there were over 100 workouts that lasted between 50
and 60 minutes.
Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
Example
df["Duration"].plot(kind = 'hist')
Result
Note: The histogram tells us that there were over 100 workouts that lasted
between 50 and 60 minutes.
What is SciPy?
SciPy is a scientific computation library that uses NumPy underneath.
SciPy stands for Scientific Python.
It provides more utility functions for optimization, stats and signal processing.
Like NumPy, SciPy is open source so we can use it freely.
SciPy was created by NumPy's creator Travis Olliphant.
Why Use SciPy?
If SciPy uses NumPy underneath, why can we not just use NumPy?
SciPy has optimized and added functions that are frequently used in NumPy and Data Science.
Which Language is SciPy Written in?
SciPy is predominantly written in Python, but a few segments are written in C.
Installation of SciPy
If you have Python and PIP already installed on a system, then installation of SciPy is very easy.
Install it using this command:
C:\Users\Your Name>pip install scipy
If this command fails, then use a Python distribution that already has SciPy installed like, Anaconda,
Spyder etc.
Import SciPy
Once SciPy is installed, import the SciPy module(s) you want to use in your applications by adding
the from scipy import module statement:
from scipy import constants
Now we have imported the constants module from SciPy, and the application is ready to use it:
Example
How many cubic meters are in one liter:
from scipy import constants
print(constants.liter)
output:
0.001
Example
How many cubic meters are in one liter:
from scipy import constants
print(constants.liter)
output:
0.001
constants: SciPy offers a set of mathematical constants, one of them is liter which returns 1 liter as
cubic meters.
Checking SciPy Version
The version string is stored under the __version__ attribute.
Example
import scipy
print(scipy.__version__)
output:
0.18.1
Constants in SciPy
As SciPy is more focused on scientific implementations, it provides many built-in scientific
constants.
These constants can be helpful when you are working with Data Science.
PI is an example of a scientific constant.
Example
Print the constant value of PI:
from scipy import constants
print(constants.pi)
output:
3.141592653589793
Constant Units
A list of all units under the constants module can be seen using the dir() function.
Example
List all constants:
from scipy import constants
print(dir(constants))
output:
['Avogadro', 'Boltzmann', 'Btu', 'Btu_IT', 'Btu_th', 'C2F', 'C2K', 'ConstantWarning', 'F2C', 'F2K', 'G',
'Julian_year', 'K2C', 'K2F', 'N_A', 'Planck', 'R', 'Rydberg', 'Stefan_Boltzmann', 'Tester', 'Wien',
'__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__',
'__path__', '__spec__', '_obsolete_constants', 'absolute_import', 'acre', 'alpha', 'angstrom', 'arcmin',
'arcminute', 'arcsec', 'arcsecond', 'astronomical_unit', 'atm', 'atmosphere', 'atomic_mass', 'atto', 'au',
'bar', 'barrel', 'bbl', 'c', 'calorie', 'calorie_IT', 'calorie_th', 'carat', 'centi', 'codata', 'constants',
'convert_temperature', 'day', 'deci', 'degree', 'degree_Fahrenheit', 'deka', 'division', 'dyn', 'dyne', 'e',
'eV', 'electron_mass', 'electron_volt', 'elementary_charge', 'epsilon_0', 'erg', 'exa', 'exbi', 'femto',
'fermi', 'find', 'fine_structure', 'fluid_ounce', 'fluid_ounce_US', 'fluid_ounce_imp', 'foot', 'g', 'gallon',
'gallon_US', 'gallon_imp', 'gas_constant', 'gibi', 'giga', 'golden', 'golden_ratio', 'grain', 'gram',
'gravitational_constant', 'h', 'hbar', 'hectare', 'hecto', 'horsepower', 'hour', 'hp', 'inch', 'k', 'kgf', 'kibi',
'kilo', 'kilogram_force', 'kmh', 'knot', 'lambda2nu', 'lb', 'lbf', 'light_year', 'liter', 'litre', 'long_ton', 'm_e',
'm_n', 'm_p', 'm_u', 'mach', 'mebi', 'mega', 'metric_ton', 'micro', 'micron', 'mil', 'mile', 'milli', 'minute',
'mmHg', 'mph', 'mu_0', 'nano', 'nautical_mile', 'neutron_mass', 'nu2lambda', 'ounce', 'oz', 'parsec',
'pebi', 'peta', 'physical_constants', 'pi', 'pico', 'point', 'pound', 'pound_force', 'precision',
'print_function', 'proton_mass', 'psi', 'pt', 'short_ton', 'sigma', 'speed_of_light', 'speed_of_sound',
'stone', 'survey_foot', 'survey_mile', 'tebi', 'tera', 'test', 'ton_TNT', 'torr', 'troy_ounce', 'troy_pound', 'u',
'unit', 'value', 'week', 'yard', 'year', 'yobi', 'yotta', 'zebi', 'zepto', 'zero_Celsius', 'zetta']