Python Pandas
Python Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. In this session we are going to learn about
In [1]:
# import pandas library
import pandas as pd
In [2]:
df = pd.DataFrame({'team': ['India', 'South Africa', 'New Zealand', 'England'],
df
0 India 10 0.5 5
3 England 5 -0.6 2
In [3]:
df.shape
Out[3]: (4, 4)
In [4]:
df.columns
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
In [6]:
df.head(2)
0 India 10 0.5 5
In [7]:
df.tail(3)
3 England 5 -0.6 2
In [8]:
df.wins
Out[8]: 0 5
1 4
2 2
3 2
In [9]:
df.wins >2
Out[9]: 0 True
1 True
2 False
3 False
Name: wins, dtype: bool
In [10]:
df['wins']
Out[10]: 0 5
1 4
2 2
3 2
In [11]:
# creating a nested list
nested_list = [[1,2,3],[10,20,30],[100,200,300]]
# creating DataFrame
A B C
0 1 2 3
1 10 20 30
In [12]:
pwd
Out[12]: 'C:\\Users\\isdc0\\Downloads'
In [13]:
Data = pd.read_csv("titanic.csv")
Data.head(2)
Out[13]: PassengerId Survived Pclass Lname Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
In [14]:
# Load data, available in other folder
Data.sample(2)
Out[14]: Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
912 34.96 42 36913.51 160.49 Right-sized mobile initiative West James 1 Macedonia 2016-06-01 16:10:30 1
78 56.14 38 32689.04 113.53 Devolved tangible approach Lake Edward 1 Ireland 2016-02-03 07:59:16 1
In [15]:
# Load excel file
df.head()
0 1 tv OK
1 2 laptop :-)
3 4 tv Superb!
4 5 laptop NaN
In [16]:
# Load txt file using pandas
print(df1)
1 third line
3 and more..
In [17]:
df1
1 third line
3 and more..
In [18]:
df_new=df[['Review','ID']]
df_new
Out[18]: Review ID
0 OK 1
1 :-) 2
2 poor quality 3
3 Superb! 4
4 NaN 5
5 over priced 6
6 repancement 7
7 bad quality 8
In [19]:
# write as a csv file
df.to_csv("E:/Aug 2022/Programming for Analytics/Learning Resources/Subset.csv")
In [20]:
# write as a excel file
The is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many
customers of the company are wholesalers.
InvoiceNo Invoice number. If this code starts with letter 'c', it indicates a cancellation.
StockCode Product (item) code.
Description Product (item) name.
Quantity The quantities of each product (item) per transaction.
InvoiceDate Invoice Date and time.
UnitPrice Unit price. Numeric, Product price per unit in sterling.
CustomerID Customer number.
Country Country name.
In [21]:
# let us look at loading the data set
df.head()
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
In [22]:
df.head(2)
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
In [23]:
df.tail(2)
601 536412 21448 12 DAISY PEGS IN WOOD BOX 2 2010-12-01 11:49:00 1.65 17920 United Kingdom
602 536412 22968 ROSE COTTAGE KEEPSAKE BOX 4 2010-12-01 11:49:00 9.95 17920 United Kingdom
In [24]:
# information about the records in the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
In [25]:
# summary statistics of dataframe; for numerical data
df.describe()
In [26]:
# summary statistics of dataframe; for object data
df.describe(include=[object])
top 536401 85123A WHITE HANGING HEART T-LIGHT HOLDER United Kingdom
freq 64 8 8 567
In [27]:
df.columns
'CustomerID', 'Country'],
dtype='object')
In [28]:
# rename column names (take care of inplace)
df.head()
Out[28]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
Slicing, filtering
In [29]:
# subset of the data with columns Quantity and UnitPrice
subset_cols = ['Quantity','UnitPrice']
subset_1 = df[subset_cols]
subset_1.head()
0 6 2.55
1 6 3.39
2 8 2.75
3 6 3.39
4 6 3.39
In [30]:
# indexing and slicing using iloc
df.iloc[10:20,4:6]
In [31]:
# filter the dataframe based on a criteria
df[df['Quantity']>5]
Out[31]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
560 536412 21880 RED RETROSPOT TAPE 10 2010-12-01 11:49:00 0.65 17920 United Kingdom
564 536412 22961 JAM MAKING SET PRINTED 24 2010-12-01 11:49:00 1.45 17920 United Kingdom
573 536412 22243 5 HOOK HANGER RED MAGIC TOADSTOOL 6 2010-12-01 11:49:00 1.65 17920 United Kingdom
577 536412 22077 6 RIBBONS RUSTIC CHARM 7 2010-12-01 11:49:00 1.65 17920 United Kingdom
581 536412 85049E SCANDINAVIAN REDS RIBBONS 12 2010-12-01 11:49:00 1.25 17920 United Kingdom
In [32]:
# fetch the records where quantity is exact 1000 and 2000 (isin and or operator)
df_sub = df[df['Quantity'].isin([1000,2000,3000])]
In [33]:
df_sub.to_csv("subset_data.csv")
Derived Data
In [34]:
# create a new derived column for invoice amount and tax
df.head()
# This means all values in the given column are multiplied by the value 0.08 at once.
Out[34]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country InvoiceAmount VAT
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom 15.30 1.2240
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom 22.00 1.7600
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272
Aggregating data
In [35]:
# unique
items = list(df['Description'].unique())
len(items)
Out[35]: 386
In [36]:
countries = list(df['Country'].unique())
len(countries)
Out[36]: 4
In [37]:
# value counts
order_country = df['Country'].value_counts()
order_country
France 20
Australia 14
Netherlands 2
In [38]:
# calculate total, mean, and median of invoice amount
df['VAT'].sum()
Out[38]: 1383.328
In [39]:
# aggregating data for multiple columns
df[['InvoiceAmount','VAT']].median()
VAT 1.2
dtype: float64
Sorting data
In [40]:
# sort the dataframe by invoice amount
In [41]:
df.sort_values(by = ['Country','Quantity'],ascending = True)
Out[41]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country InvoiceAmount VAT
208 536389 22191 IVORY DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600
207 536389 22192 BLUE DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600
204 536389 22193 RED DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600
203 536389 85014A BLACK/BLUE POLKADOT UMBRELLA 3 2010-12-01 10:03:00 5.95 12431 Australia 17.85 1.4280
201 536389 35004G SET OF 3 GOLD FLYING DUCKS 4 2010-12-01 10:03:00 6.35 12431 Australia 25.40 2.0320
... ... ... ... ... ... ... ... ... ... ...
180 536387 22779 WOODEN OWLS LIGHT GARLAND 192 2010-12-01 09:58:00 3.37 16029 United Kingdom 647.04 51.7632
179 536387 22780 LIGHT GARLAND BUTTERFILES PINK 192 2010-12-01 09:58:00 3.37 16029 United Kingdom 647.04 51.7632
219 536390 20668 DISCO BALL CHRISTMAS DECORATION 288 2010-12-01 10:19:00 0.10 17511 United Kingdom 28.80 2.3040
181 536387 22466 FAIRY TALE COTTAGE NIGHTLIGHT 432 2010-12-01 09:58:00 1.45 16029 United Kingdom 626.40 50.1120
182 536387 21731 RED TOADSTOOL LED NIGHT LIGHT 432 2010-12-01 09:58:00 1.25 16029 United Kingdom 540.00 43.2000
Concatenate dataframes
In [42]:
# products dataframes
df_product = pd.DataFrame({'productCode':[1001,1002,1003,1004],
'productDesc':['Laptop','Phone','Keyboard','Camera']})
df_product
0 1001 Laptop
1 1002 Phone
2 1003 Keyboard
3 1004 Camera
In [43]:
df_product1 = pd.DataFrame({'productCode':[1005,1006,1007,1008],
'productDesc':['Mouse','Speakers','Monitor','UPS']})
df_product1
0 1005 Mouse
1 1006 Speakers
2 1007 Monitor
3 1008 UPS
In [44]:
df_product_master = pd.concat([df_product,df_product1], ignore_index=True)
df_product_master
0 1001 Laptop
1 1002 Phone
2 1003 Keyboard
3 1004 Camera
4 1005 Mouse
5 1006 Speakers
6 1007 Monitor
7 1008 UPS
Merging dataframes
In [45]:
df_product
0 1001 Laptop
1 1002 Phone
2 1003 Keyboard
3 1004 Camera
In [46]:
df_invoice = pd.DataFrame({'productCode':[1001,1001,1004,1002,1006],
'invoiceAmount':[1800,899,99,499,1299]})
df_invoice
0 1001 1800
1 1001 899
2 1004 99
3 1002 499
4 1006 1299
In [47]:
df_invoice.merge(right = df_product)
2 1004 99 Camera
Descriptive statistics -
Measure the Central Tendancy and Disperson of the Data
In [48]:
# Computation of measures of central tendency
# Mean
mean = df['UnitPrice'].mean()
print(mean)
3.4034991708126117
In [49]:
#median
median = df['UnitPrice'].median()
print(median)
2.1
In [50]:
# Mode
mode = df['UnitPrice'].mode()
print(mode)
0 1.65
dtype: float64
In [51]:
# Minimum Value
print(df['UnitPrice'].min())
# Maximum Value
print(df['UnitPrice'].max())
# Range
print(df['UnitPrice'].max() - df['UnitPrice'].min())
# Varience
print(df['UnitPrice'].var())
# SD
print(df['UnitPrice'].std())
0.1
165.0
164.9
54.6173550112117
7.390355540243764
In [52]:
# Computation of measures of shape of distribution
# 1. Skewness
print(df['UnitPrice'].skew())
# 2. Kurtosis
print(df['UnitPrice'].kurt())
17.846805793784075
382.0981389464283
In [ ]: