0% found this document useful (0 votes)
49 views

Python Pandas

This document provides an overview of handling structured data using Pandas. It discusses DataFrames and Series, how to create DataFrames from various data sources like dictionaries, lists, CSV and Excel files. It also covers common data manipulation operations like indexing, filtering, sorting and grouping data in DataFrames. The document uses an example retail dataset to demonstrate working with real data and various Pandas functionalities.

Uploaded by

Henish N Jain
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
49 views

Python Pandas

This document provides an overview of handling structured data using Pandas. It discusses DataFrames and Series, how to create DataFrames from various data sources like dictionaries, lists, CSV and Excel files. It also covers common data manipulation operations like indexing, filtering, sorting and grouping data in DataFrames. The document uses an example retail dataset to demonstrate working with real data and various Pandas functionalities.

Uploaded by

Henish N Jain
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 1

Handling Structured Data using Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. In this session we are going to learn about

DataFrames and Series in Python Pandas


Creating DataFrames and loading data into Pandas
Indexing, Slicing data
Sorting, Filtering, Groupby, Pivot functionalities
Merging dataframes

Series and DataFrame


When we worked with tabular data, pandas help us explore, clean and process the data. This tabular data is called as a DataFrame in pandas. A dataframe can be visually represented as follows

In [1]:
# import pandas library

import pandas as pd

Create dataframe from dictionary of lists

In [2]:
df = pd.DataFrame({'team': ['India', 'South Africa', 'New Zealand', 'England'],

'points': [10, 8, 3, 5],

'runrate': [0.5, 1.4, 2, -0.6],

'wins': [5, 4, 2, 2]})

df

Out[2]: team points runrate wins

0 India 10 0.5 5

1 South Africa 8 1.4 4

2 New Zealand 3 2.0 2

3 England 5 -0.6 2

In [3]:
df.shape

Out[3]: (4, 4)

In [4]:
df.columns

Out[4]: Index(['team', 'points', 'runrate', 'wins'], dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4 entries, 0 to 3

Data columns (total 4 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 team 4 non-null object

1 points 4 non-null int64

2 runrate 4 non-null float64

3 wins 4 non-null int64

dtypes: float64(1), int64(2), object(1)

memory usage: 256.0+ bytes

In [6]:
df.head(2)

Out[6]: team points runrate wins

0 India 10 0.5 5

1 South Africa 8 1.4 4

In [7]:
df.tail(3)

Out[7]: team points runrate wins

1 South Africa 8 1.4 4

2 New Zealand 3 2.0 2

3 England 5 -0.6 2

In [8]:
df.wins

Out[8]: 0 5

1 4

2 2

3 2

Name: wins, dtype: int64

In [9]:
df.wins >2

Out[9]: 0 True
1 True
2 False
3 False
Name: wins, dtype: bool

In [10]:
df['wins']

Out[10]: 0 5

1 4

2 2

3 2

Name: wins, dtype: int64

Create dataframe from list of lists

In [11]:
# creating a nested list

nested_list = [[1,2,3],[10,20,30],[100,200,300]]

# creating DataFrame

df = pd.DataFrame(nested_list, columns= ['A','B','C'])

print(df)# displaying resultant DataFrame

A B C

0 1 2 3

1 10 20 30

2 100 200 300

Loading data using pandas ( pd.read_csv)


When data and notebook in the same folder

In [12]:
pwd

Out[12]: 'C:\\Users\\isdc0\\Downloads'

In [13]:
Data = pd.read_csv("titanic.csv")

Data.head(2)

Out[13]: PassengerId Survived Pclass Lname Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

0 1 0 3 Braund Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

1 2 1 1 Cumings Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C

In [14]:
# Load data, available in other folder

Data = pd.read_csv("E:/Aug 2022/Programming for Analytics/Learning Resources/advertising.csv")

Data.sample(2)

Out[14]: Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad

912 34.96 42 36913.51 160.49 Right-sized mobile initiative West James 1 Macedonia 2016-06-01 16:10:30 1

78 56.14 38 32689.04 113.53 Devolved tangible approach Lake Edward 1 Ireland 2016-02-03 07:59:16 1

In [15]:
# Load excel file

df = pd.read_excel("E:/Aug 2022/Programming for Analytics/Learning Resources/sample.xlsx")

df.head()

Out[15]: ID Item Review

0 1 tv OK

1 2 laptop :-)

2 3 camera poor quality

3 4 tv Superb!

4 5 laptop NaN

In [16]:
# Load txt file using pandas

df1 = pd.read_csv("E:/Aug 2022/Programming for Analytics/Learning Resources/Text Folder/test.txt")

print(df1)

First line: This is a test file

0 This is second line

1 third line

2 and subsiquent lines..

3 and more..

Writing the files to a csv/excel using pandas

In [17]:
df1

Out[17]: First line: This is a test file

0 This is second line

1 third line

2 and subsiquent lines..

3 and more..

In [18]:
df_new=df[['Review','ID']]

df_new

Out[18]: Review ID

0 OK 1

1 :-) 2

2 poor quality 3

3 Superb! 4

4 NaN 5

5 over priced 6

6 repancement 7

7 bad quality 8

8 waiting for refund 9

In [19]:
# write as a csv file
df.to_csv("E:/Aug 2022/Programming for Analytics/Learning Resources/Subset.csv")

In [20]:
# write as a excel file

df.to_excel("E:/Aug 2022/Programming for Analytics/Learning Resources/Subset1.xlsx")

Data Manipulation - Working with Retail Data


To learn about dataframes there is no better way than to work with real data and hence we will learn all the functionalities in pandas using this online retail data.

The is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many
customers of the company are wholesalers.

InvoiceNo Invoice number. If this code starts with letter 'c', it indicates a cancellation.
StockCode Product (item) code.
Description Product (item) name.
Quantity The quantities of each product (item) per transaction.
InvoiceDate Invoice Date and time.
UnitPrice Unit price. Numeric, Product price per unit in sterling.
CustomerID Customer number.
Country Country name.

In [21]:
# let us look at loading the data set

# complete list https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io

df = pd.read_excel("E:/Aug 2022/Programming for Analytics/Learning Resources/Online Retail.xlsx")

# preview your dataframe head and tail

df.head()

Out[21]: InvoiceNo StockCode Description Quantity ID UnitPrice CustomerID Country

0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom

1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom

3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

In [22]:
df.head(2)

Out[22]: InvoiceNo StockCode Description Quantity ID UnitPrice CustomerID Country

0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom

1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

In [23]:
df.tail(2)

Out[23]: InvoiceNo StockCode Description Quantity ID UnitPrice CustomerID Country

601 536412 21448 12 DAISY PEGS IN WOOD BOX 2 2010-12-01 11:49:00 1.65 17920 United Kingdom

602 536412 22968 ROSE COTTAGE KEEPSAKE BOX 4 2010-12-01 11:49:00 9.95 17920 United Kingdom

In [24]:
# information about the records in the dataframe

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 603 entries, 0 to 602

Data columns (total 8 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 InvoiceNo 603 non-null object

1 StockCode 603 non-null object

2 Description 603 non-null object

3 Quantity 603 non-null int64

4 ID 603 non-null datetime64[ns]

5 UnitPrice 603 non-null float64

6 CustomerID 603 non-null int64

7 Country 603 non-null object

dtypes: datetime64[ns](1), float64(1), int64(2), object(4)

memory usage: 37.8+ KB

In [25]:
# summary statistics of dataframe; for numerical data

df.describe()

Out[25]: Quantity UnitPrice CustomerID

count 603.000000 603.000000 603.000000

mean 14.215589 3.403499 16050.829187

std 35.563101 7.390356 1814.393502

min -24.000000 0.100000 12431.000000

25% 2.000000 1.250000 14307.000000

50% 6.000000 2.100000 16098.000000

75% 12.000000 3.750000 17850.000000

max 432.000000 165.000000 18074.000000

In [26]:
# summary statistics of dataframe; for object data

df.describe(include=[object])

Out[26]: InvoiceNo StockCode Description Country

count 603 603 603 603

unique 46 385 386 4

top 536401 85123A WHITE HANGING HEART T-LIGHT HOLDER United Kingdom

freq 64 8 8 567

In [27]:
df.columns

Out[27]: Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'ID', 'UnitPrice',

'CustomerID', 'Country'],

dtype='object')

In [28]:
# rename column names (take care of inplace)

df.rename(columns = {'ID':"invoice date"},inplace = True)

df.head()

Out[28]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country

0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom

1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom

3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

Slicing, filtering
In [29]:
# subset of the data with columns Quantity and UnitPrice

subset_cols = ['Quantity','UnitPrice']

subset_1 = df[subset_cols]

subset_1.head()

Out[29]: Quantity UnitPrice

0 6 2.55

1 6 3.39

2 8 2.75

3 6 3.39

4 6 3.39

In [30]:
# indexing and slicing using iloc

df.iloc[10:20,4:6]

Out[30]: invoice date UnitPrice

10 2010-12-01 08:34:00 2.10

11 2010-12-01 08:34:00 2.10

12 2010-12-01 08:34:00 3.75

13 2010-12-01 08:34:00 1.65

14 2010-12-01 08:34:00 4.25

15 2010-12-01 08:34:00 4.95

16 2010-12-01 08:34:00 9.95

17 2010-12-01 08:34:00 5.95

18 2010-12-01 08:34:00 5.95

19 2010-12-01 08:34:00 7.95

In [31]:
# filter the dataframe based on a criteria

# fetch all the records where quantity is greater than 2000

df[df['Quantity']>5]

Out[31]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country

0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom

1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom

3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom

... ... ... ... ... ... ... ... ...

560 536412 21880 RED RETROSPOT TAPE 10 2010-12-01 11:49:00 0.65 17920 United Kingdom

564 536412 22961 JAM MAKING SET PRINTED 24 2010-12-01 11:49:00 1.45 17920 United Kingdom

573 536412 22243 5 HOOK HANGER RED MAGIC TOADSTOOL 6 2010-12-01 11:49:00 1.65 17920 United Kingdom

577 536412 22077 6 RIBBONS RUSTIC CHARM 7 2010-12-01 11:49:00 1.65 17920 United Kingdom

581 536412 85049E SCANDINAVIAN REDS RIBBONS 12 2010-12-01 11:49:00 1.25 17920 United Kingdom

328 rows × 8 columns

In [32]:
# fetch the records where quantity is exact 1000 and 2000 (isin and or operator)

#df[(df['Quantity'] == 1000) | (df['Quantity'] == 2000) | (df['Quantity'] == 3000)]

df_sub = df[df['Quantity'].isin([1000,2000,3000])]

In [33]:
df_sub.to_csv("subset_data.csv")

Derived Data
In [34]:
# create a new derived column for invoice amount and tax

df['InvoiceAmount'] = df['Quantity'] * df['UnitPrice']

df['VAT'] = df['InvoiceAmount'] * 0.08

df.head()

# The calculation of the values is done element_wise.

# This means all values in the given column are multiplied by the value 0.08 at once.

# You do not need to use a loop to iterate each of the rows!

Out[34]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country InvoiceAmount VAT

0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom 15.30 1.2240

1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272

2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom 22.00 1.7600

3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272

4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272

Aggregating data
In [35]:
# unique

items = list(df['Description'].unique())

len(items)

Out[35]: 386

In [36]:
countries = list(df['Country'].unique())

len(countries)

Out[36]: 4

In [37]:
# value counts

order_country = df['Country'].value_counts()

order_country

Out[37]: United Kingdom 567

France 20

Australia 14

Netherlands 2

Name: Country, dtype: int64

In [38]:
# calculate total, mean, and median of invoice amount

df['VAT'].sum()

Out[38]: 1383.328

In [39]:
# aggregating data for multiple columns

# calculate median of InvoiceAmount and Quantity

df[['InvoiceAmount','VAT']].median()

Out[39]: InvoiceAmount 15.0

VAT 1.2

dtype: float64

Sorting data
In [40]:
# sort the dataframe by invoice amount

df.sort_values(by='InvoiceAmount',ascending = False,inplace = True)

In [41]:
df.sort_values(by = ['Country','Quantity'],ascending = True)

Out[41]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country InvoiceAmount VAT

208 536389 22191 IVORY DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600

207 536389 22192 BLUE DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600

204 536389 22193 RED DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600

203 536389 85014A BLACK/BLUE POLKADOT UMBRELLA 3 2010-12-01 10:03:00 5.95 12431 Australia 17.85 1.4280

201 536389 35004G SET OF 3 GOLD FLYING DUCKS 4 2010-12-01 10:03:00 6.35 12431 Australia 25.40 2.0320

... ... ... ... ... ... ... ... ... ... ...

180 536387 22779 WOODEN OWLS LIGHT GARLAND 192 2010-12-01 09:58:00 3.37 16029 United Kingdom 647.04 51.7632

179 536387 22780 LIGHT GARLAND BUTTERFILES PINK 192 2010-12-01 09:58:00 3.37 16029 United Kingdom 647.04 51.7632

219 536390 20668 DISCO BALL CHRISTMAS DECORATION 288 2010-12-01 10:19:00 0.10 17511 United Kingdom 28.80 2.3040

181 536387 22466 FAIRY TALE COTTAGE NIGHTLIGHT 432 2010-12-01 09:58:00 1.45 16029 United Kingdom 626.40 50.1120

182 536387 21731 RED TOADSTOOL LED NIGHT LIGHT 432 2010-12-01 09:58:00 1.25 16029 United Kingdom 540.00 43.2000

603 rows × 10 columns

Concatenate dataframes
In [42]:
# products dataframes

df_product = pd.DataFrame({'productCode':[1001,1002,1003,1004],

'productDesc':['Laptop','Phone','Keyboard','Camera']})
df_product

Out[42]: productCode productDesc

0 1001 Laptop

1 1002 Phone

2 1003 Keyboard

3 1004 Camera

In [43]:
df_product1 = pd.DataFrame({'productCode':[1005,1006,1007,1008],

'productDesc':['Mouse','Speakers','Monitor','UPS']})

df_product1

Out[43]: productCode productDesc

0 1005 Mouse

1 1006 Speakers

2 1007 Monitor

3 1008 UPS

In [44]:
df_product_master = pd.concat([df_product,df_product1], ignore_index=True)

df_product_master

Out[44]: productCode productDesc

0 1001 Laptop

1 1002 Phone

2 1003 Keyboard

3 1004 Camera

4 1005 Mouse

5 1006 Speakers

6 1007 Monitor

7 1008 UPS

Merging dataframes

In [45]:
df_product

Out[45]: productCode productDesc

0 1001 Laptop

1 1002 Phone

2 1003 Keyboard

3 1004 Camera

In [46]:
df_invoice = pd.DataFrame({'productCode':[1001,1001,1004,1002,1006],

'invoiceAmount':[1800,899,99,499,1299]})

df_invoice

Out[46]: productCode invoiceAmount

0 1001 1800

1 1001 899

2 1004 99

3 1002 499

4 1006 1299

In [47]:
df_invoice.merge(right = df_product)

Out[47]: productCode invoiceAmount productDesc

0 1001 1800 Laptop

1 1001 899 Laptop

2 1004 99 Camera

3 1002 499 Phone

Descriptive statistics -
Measure the Central Tendancy and Disperson of the Data

In [48]:
# Computation of measures of central tendency

# Mean

mean = df['UnitPrice'].mean()

print(mean)

3.4034991708126117

In [49]:
#median

median = df['UnitPrice'].median()

print(median)

2.1

In [50]:
# Mode

mode = df['UnitPrice'].mode()

print(mode)

0 1.65

dtype: float64

Computation of measures of dispersion or variability

In [51]:
# Minimum Value

print(df['UnitPrice'].min())

# Maximum Value

print(df['UnitPrice'].max())

# Range

print(df['UnitPrice'].max() - df['UnitPrice'].min())

# Varience

print(df['UnitPrice'].var())

# SD

print(df['UnitPrice'].std())

0.1

165.0
164.9
54.6173550112117

7.390355540243764

In [52]:
# Computation of measures of shape of distribution

# 1. Skewness

print(df['UnitPrice'].skew())

# skewness=1 for normal distribution; more positive skewness in this case

# 2. Kurtosis

print(df['UnitPrice'].kurt())

# it is greater than 3 and so excess kurtosis > 0.

#Hence, we can conclude that the Revenue curve is a leptokurtic curve

17.846805793784075

382.0981389464283

Skewness and Kurtosis Qualitative check

In [ ]:

You might also like