Topic_4_Basic_programming_concepts

Table of Contents

1 Basic Programming
1.1 Python Programming
1.2 Learning Outcomes
1.3 Data Series
2 DataFrame (DF)
2.1 Creating DataFrame
2.1.1 Create DF with Given n-D Array
2.1.2 Create DF with Random n-D Array
2.1.3 Create DF from CSV
2.1.4 Create DF from XLSX
2.2 Accessing Elements
2.2.1 Accessing via .loc
2.2.2 Accessing via .iloc
2.3 DF Slicing
2.3.1 Single Column Slice
2.3.2 Slicing via .loc
2.3.3 Slicing via .iloc
2.4 DF Sorting
2.5 DF Selecting
2.5.1 Example: Displaying Data
2.6 Additional DF Methods
2.6.1 Count Rows and Columns
2.6.2 Apply Lambda Functions
2.6.3 Apply General Functions
3 Data Visualization
3.1 Why Visualization
3.2 Typical Visualizations
3.3 Pandas Plots
3.4 Seaborn Plots
3.4.1 Example: Seaborn's Basic Plots
3.4.2 Example: Seaborn's Advanced Plots
4 References:
4.1 Data Sets:
4.2 Others:

Basic Programming
Python Programming and DataFrames

Python Programming
1. In ICS 104, we saw Python as a general-purpose programming language.
2. Although Python can be used in many areas that require programming, in this chapter we focus on its use for data reading and visualization.
3. Specifically, we highlight the Pandas and Seaborn libraries.

Learning Outcomes
1. Build, read and/or manipulate DataFrames.
2. Learn DataFrame slicing, sorting and selection methods.
3. Visualize data using histogram, box plots, scatter plots, etc.

Data Series
1. In Pandas, a Series is a labeled/indexed one-dimensional array.
2. The elements of a Series can be integers, floats, strings, objects, etc.
3. The index may not be unique (a short sketch follows this list).
4. A collection of Series is called a DataFrame.
5. Our focus in this chapter will be on Data Frames.
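As a brief illustration of the points above, here is a minimal Series sketch; the values and labels are made up for demonstration:

import pandas as pd

# a labeled one-dimensional array; note the repeated label 'a'
s = pd.Series([10, 3.5, 'text'], index=['a', 'b', 'a'])

print(s)          # values shown together with their labels
print(s['a'])     # a non-unique label returns all matching entries (a Series)
print(s.iloc[0])  # positional access returns a single element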

DataFrame (DF)
A DataFrame is a collection of Series, where each column is a Series.

Creating DataFrame
In order to create a DataFrame, the following are typically needed:

1. Data in an n-dimensional array format.


2. Optional column titles/headers.

An optional row index can be given too. In addition to the above, there are many other parameters, which are described in the pandas documentation.

Create DF with Given n-D Array


Data (a 2 × 2 array):

    [[a11, a12],
     [a21, a22]]

col_names: ['col 1', 'col 2']


row_index: ['row 1', 'row 2']

In [1]: import pandas as pd


df = pd.DataFrame([['a11', 'a12'], ['a21', 'a22']],
index=['row 1', 'row 2'],
columns=['col 1', 'col 2'])

display(df)

col 1 col 2

row 1 a11 a12

row 2 a21 a22

Create DF with Random n-D Array


Data: a random array of 1000 rows and 3 columns with numbers between 0 and 1.
col_names: ['col 1', 'col 2', 'col 3']

In [2]: import pandas as pd


import numpy as np

df = pd.DataFrame(np.random.rand(1000,3),
columns=['col 1', 'col 2', 'col 3'])

display(df)

col 1 col 2 col 3

0 0.900379 0.450864 0.915527

1 0.107990 0.368869 0.516130

2 0.342405 0.492546 0.788363

3 0.597415 0.860259 0.241468

4 0.095641 0.791185 0.959279

... ... ... ...

995 0.247713 0.397768 0.911072

996 0.023047 0.423800 0.863465

997 0.714096 0.294497 0.737279

998 0.494320 0.450755 0.293238

999 0.552515 0.312760 0.456961

1000 rows × 3 columns

Create DF from CSV


Data: Basic-1.csv
col_names: taken from the first row of the file

1. Display the above data.


2. Obtain the summary statistics:
For numeric data (count, mean, std, min, max)
For non-numeric data (count, unique-values, most occurring and its frequency)

In [3]: #1. Display the above data.


import pandas as pd
df = pd.read_csv('data/Basic-1.csv', delimiter=',')#the path to the data file.

## Path relative to the .ipynb file


# df = pd.read_csv('Basic-101.csv', delimiter=',',header=None,names=['Gender','Job Type',"Province"])

# display data
display(df)

#2. Obtain the summary statistics:


# summary statistics for non-numeric columns are shown by default only when all columns are non-numeric
display(df.describe())

Gender Job Type Province

0 M Pink-collar Hejaz

1 M White-collar Central

2 M Pink-collar Hejaz

3 M Pink-collar Hejaz

4 M Gold-collar Eastern

... ... ... ...

1273 M Blue-collar Central

1274 M Pink-collar Hejaz

1275 F Blue-collar Hejaz

1276 F White-collar Central

1277 F Blue-collar Eastern

1278 rows × 3 columns

Gender Job Type Province

count 1278 1278 1278

unique 2 4 3

top M White-collar Central

freq 875 568 531

Create DF from XLSX


Data: Basic-2.xlsx
col_names: There are 17 columns, defined respectively as:

1. dur: duration of agreement


2. wage1.wage : wage increase in first year of contract
3. wage2.wage : wage increase in second year of contract
4. wage3.wage : wage increase in third year of contract
5. cola : cost of living allowance
6. hours.hrs : number of working hours during week
7. pension : employer contributions to pension plan
8. stby_pay : standby pay
9. shift_diff : shift differential : supplement for work on II and III shift
10. educ_allw.boolean : education allowance
11. holidays : number of statutory holidays
12. vacation : number of paid vacation days
13. lngtrm_disabil.boolean : employer's help during employee longterm disability
14. dntl_ins : employers contribution towards the dental plan
15. bereavement.boolean : employer's financial contribution towards covering the costs of bereavement
16. empl_hplan : employer's contribution towards the health plan
17. overall acceptability of the contract (good/bad), as seen in the last column of the data below

For the above data, do the following:

1. Display the above data.


2. Obtain the summary statistics:
For numeric data (count, mean, std, min, max)
For non-numeric data (count, unique-values, most occurring and its frequency)

In [3]: # 1. Display the above data.


import pandas as pd

df = pd.read_excel('data/Basic-2.xlsx')

display(df.head())
#df.tail()
#df.sample(5)

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17

0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good

1 2.0 4.5 5.8 NaN NaN 35.0 ret_allw NaN NaN yes 11.0 below average NaN full NaN full good

2 NaN NaN NaN NaN NaN 38.0 empl_contr NaN 5.0 NaN 11.0 generous yes half yes half good

3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good

4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good

In [5]: # 2. Obtain the summary statistics:

display(df.describe())
# by default, describe() summarizes only the numeric columns when the columns have mixed types
# df.describe(include='all') # can be used, but too many NaNs

c1 c2 c3 c4 c6 c8 c9 c11

count 39.000000 39.000000 30.000000 12.000000 37.000000 7.000000 24.000000 38.000000

mean 2.102564 3.620513 3.913333 3.766667 37.810811 6.142857 4.583333 11.105263

std 0.753758 1.331438 1.280822 1.415070 2.716593 4.845223 4.754098 1.371318

min 1.000000 2.000000 2.000000 2.000000 27.000000 2.000000 0.000000 9.000000

25% 2.000000 2.500000 2.625000 2.250000 37.000000 2.000000 3.000000 10.000000

50% 2.000000 3.500000 4.000000 4.600000 38.000000 4.000000 4.000000 11.000000

75% 3.000000 4.500000 4.500000 5.000000 40.000000 10.000000 5.000000 12.000000

max 3.000000 6.900000 7.000000 5.100000 40.000000 13.000000 25.000000 15.000000

In [6]: # to get summary of non-numeric columns

# Identify non-numeric columns


cat_columns= df.select_dtypes(include='object').columns

# Show summary of non-numeric columns


df[cat_columns].describe()

## or in one step
# df.select_dtypes(include='object').describe()

Out[6]: c5 c7 c10 c12 c13 c14 c15 c16 c17

count 24 18 18 37 16 25 20 24 40

unique 3 3 2 3 2 3 2 3 2

top none none no below average yes half yes full good

freq 14 8 11 14 11 11 18 12 26

Accessing Elements
The .loc (label-based) and .iloc (integer position-based) accessors can be used for accessing the elements of a DF.

Accessing via .loc


Consider the dataset given in Basic-1.csv. Display the element in the second row, second column.

In [7]: import pandas as pd


df = pd.read_csv('data/Basic-1.csv', delimiter=',')
print(df.loc[1,'Job Type'])

White-collar

Accessing via .iloc


Consider the dataset given in Basic-1.csv. Display the element in the second row, second column.

In [8]: import pandas as pd


df = pd.read_csv('data/Basic-1.csv', delimiter=',')
display(df.head())
print(df.iloc[1,1])
print(df.loc[1,'Job Type'])

Gender Job Type Province

0 M Pink-collar Hejaz

1 M White-collar Central

2 M Pink-collar Hejaz

3 M Pink-collar Hejaz

4 M Gold-collar Eastern

White-collar
White-collar

DF Slicing
1. Slicing means extracting a part of the DF.
2. The part could be a contiguous block or non-contiguous ("broken") chunks of the DF (a short sketch follows this list).
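A minimal sketch of contiguous versus non-contiguous selection, using a small made-up DataFrame rather than the course data files:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['r1', 'r2', 'r3'],
                  columns=['c1', 'c2', 'c3', 'c4'])

# contiguous slice: rows r1..r2 and columns c2..c3 (ends are inclusive with .loc)
print(df.loc['r1':'r2', 'c2':'c3'])

# non-contiguous ("broken") chunks: pass explicit lists of labels or positions
print(df.loc[['r1', 'r3'], ['c1', 'c4']])
print(df.iloc[[0, 2], [0, 3]])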

Single Column Slice


A single column can be extracted using the following methods:

1. df.COL_NAME (attribute access)
2. df['COL_NAME'] (bracket access)

Consider the dataset given in Basic-2.xlsx. Display first column.

In [9]: import pandas as pd

df = pd.read_excel('data/Basic-2.xlsx')
# df.head()
display(df['c1'])

## Other styles to select a column


# display(df.iloc[:,0])
# display(df.loc[:,'c1'])

## The following code can be used to create column names when the dataset does not contain any column names.
# col_name=[''.join(['c',str(i+1)]) for i in range(17)]
## Following options can be used to add column names to the dataset
# header=None, names=col_name

0 1.0
1 2.0
2 NaN
3 3.0
4 3.0
5 2.0
6 3.0
7 3.0
8 2.0
9 1.0
10 3.0
11 2.0
12 2.0
13 3.0
14 1.0
15 2.0
16 1.0
17 1.0
18 1.0
19 2.0
20 2.0
21 2.0
22 3.0
23 2.0
24 1.0
25 3.0
26 2.0
27 2.0
28 2.0
29 3.0
30 3.0
31 3.0
32 2.0
33 2.0
34 3.0
35 2.0
36 1.0
37 1.0
38 3.0
39 2.0
Name: c1, dtype: float64

Slicing via .loc


Consider the dataset given in Basic-2.xlsx. Display the first 10 rows of the second, third, fourth and fifth columns.

In [10]: import pandas as pd


df = pd.read_excel('data/Basic-2.xlsx')

display(df.head())

display(df.loc[0:9,'c2':'c5']) # ends are inclusive

# display(df.loc[0:9:2,'c2':'c5':2]) # ends are inclusive and steps can be used too

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17

0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good

1 2.0 4.5 5.8 NaN NaN 35.0 ret_allw NaN NaN yes 11.0 below average NaN full NaN full good

2 NaN NaN NaN NaN NaN 38.0 empl_contr NaN 5.0 NaN 11.0 generous yes half yes half good

3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good

4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good

c2 c3 c4 c5

0 5.0 NaN NaN NaN

1 4.5 5.8 NaN NaN

2 NaN NaN NaN NaN

3 3.7 4.0 5.0 tc

4 4.5 4.5 5.0 NaN

5 2.0 2.5 NaN NaN

6 4.0 5.0 5.0 tc

7 6.9 4.8 2.3 NaN

8 3.0 7.0 NaN NaN

9 5.7 NaN NaN none

Slicing via .iloc


Consider the dataset given in Basic-2.xlsx. Display the first 10 rows of the second, third, fourth and fifth columns.

In [11]: import pandas as pd


df = pd.read_excel('data/Basic-2.xlsx')

display(df.iloc[:10,1:5])

c2 c3 c4 c5

0 5.0 NaN NaN NaN

1 4.5 5.8 NaN NaN

2 NaN NaN NaN NaN

3 3.7 4.0 5.0 tc

4 4.5 4.5 5.0 NaN

5 2.0 2.5 NaN NaN

6 4.0 5.0 5.0 tc

7 6.9 4.8 2.3 NaN

8 3.0 7.0 NaN NaN

9 5.7 NaN NaN none

In [12]: ## Example of loc and iloc


df = pd.DataFrame([['a11', 'a12'], ['a21', 'a22']],
index=['row 1', 'row 2'],
columns=['col 1', 'col-2'])

display(df)

# display(df.loc[:,'col 1']) # Select one column

# display(df['col 1']) # Select one column

# display(df.col-2) # attribute access does not work for column names with spaces or special characters

# display(df.loc['row 2','col 1'])

# display(df.loc['row 2',:])

# display(df['row 2']) # raises KeyError: [] selects columns, not rows

# display(df.iloc[:,0]) # Select one column

# display(df.iloc[1,0]) # row 2 column 1

col 1 col-2

row 1 a11 a12

row 2 a21 a22

DF Sorting
1. DataFrames can be sorted w.r.t values in the columns.
2. One or more columns can be used for sorting.

Consider the dataset given in Basic-2.xlsx. Sort the rows according to the first column, in ascending order. Repeat the sort in descending order.

In [13]: import pandas as pd

#Read data
df = pd.read_excel('data/Basic-2.xlsx')
display(df.head())

# # sort by single column in ascending order


df.sort_values(by=['c1'], inplace=True)
display(df.head())

# # sort by single column in descending order


df.sort_values(by=['c1'], inplace=True, ascending=False)
display(df.head())

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17

0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good

1 2.0 4.5 5.8 NaN NaN 35.0 ret_allw NaN NaN yes 11.0 below average NaN full NaN full good

2 NaN NaN NaN NaN NaN 38.0 empl_contr NaN 5.0 NaN 11.0 generous yes half yes half good

3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good

4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17

0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good

16 1.0 2.8 NaN NaN NaN 35.0 NaN NaN 2.0 NaN 12.0 below average NaN NaN NaN NaN good

14 1.0 3.0 NaN NaN none 36.0 NaN NaN 10.0 no 11.0 generous NaN NaN NaN NaN good

24 1.0 6.0 NaN NaN NaN 38.0 NaN 8.0 3.0 NaN 9.0 generous NaN NaN NaN NaN good

17 1.0 2.1 NaN NaN tc 40.0 ret_allw 2.0 3.0 no 9.0 below average yes half NaN none bad

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17

29 3.0 2.0 2.5 NaN NaN 35.0 none NaN NaN NaN 10.0 average NaN NaN yes full bad

7 3.0 6.9 4.8 2.3 NaN 40.0 NaN NaN 3.0 NaN 12.0 below average NaN NaN NaN NaN good

3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good

4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good

34 3.0 2.0 2.5 2.1 tc 40.0 none 2.0 1.0 no 10.0 below average no half yes full bad

Consider the dataset given in Basic-2.xlsx. Sort the rows by the first column in ascending order and then by the sixth column in descending order. Display only the first 6 columns.

In [14]: # sort by multiple columns

df.sort_values(by=['c1','c6'], inplace=True, ascending=[True, False])

display(df.iloc[:,0:6])

c1 c2 c3 c4 c5 c6

9 1.0 5.7 NaN NaN none 40.0

17 1.0 2.1 NaN NaN tc 40.0

36 1.0 2.0 NaN NaN tc 40.0

0 1.0 5.0 NaN NaN NaN 40.0

24 1.0 6.0 NaN NaN NaN 38.0

37 1.0 2.8 NaN NaN none 38.0

18 1.0 2.0 NaN NaN none 38.0

14 1.0 3.0 NaN NaN none 36.0

16 1.0 2.8 NaN NaN NaN 35.0

12 2.0 3.5 4.0 NaN none 40.0

39 2.0 4.5 4.0 NaN none 40.0

21 2.0 2.5 3.0 NaN NaN 40.0

23 2.0 4.5 4.0 NaN NaN 40.0

35 2.0 2.0 2.0 NaN none 40.0

33 2.0 4.0 5.0 NaN none 40.0

8 2.0 3.0 7.0 NaN NaN 38.0

11 2.0 6.4 6.4 NaN NaN 38.0

32 2.0 2.5 2.5 NaN NaN 38.0

20 2.0 4.3 4.4 NaN NaN 38.0

15 2.0 4.5 4.0 NaN none 37.0

28 2.0 5.0 4.0 NaN none 37.0

1 2.0 4.5 5.8 NaN NaN 35.0

5 2.0 2.0 2.5 NaN NaN 35.0

19 2.0 4.0 5.0 NaN tcf 35.0

27 2.0 3.0 3.0 NaN none 33.0

26 2.0 4.5 4.5 NaN tcf NaN

7 3.0 6.9 4.8 2.3 NaN 40.0

4 3.0 4.5 4.5 5.0 NaN 40.0

34 3.0 2.0 2.5 2.1 tc 40.0

30 3.0 4.5 4.5 5.0 none 40.0

25 3.0 2.0 2.0 2.0 none 40.0

31 3.0 3.0 2.0 2.5 tc 40.0

38 3.0 2.0 2.5 2.0 NaN 37.0

13 3.0 3.5 4.0 5.1 tcf 37.0

10 3.0 3.5 4.0 4.6 none 36.0

29 3.0 2.0 2.5 NaN NaN 35.0

22 3.0 3.5 4.0 4.6 tcf 27.0

3 3.0 3.7 4.0 5.0 tc NaN

6 3.0 4.0 5.0 5.0 tc NaN

2 NaN NaN NaN NaN NaN 38.0

DF Selecting
Selecting is similar to slicing. Typically, it involves the following:

1. We are typically interested in extracting rows based on conditions.


2. Conditions are based on the values of one or more columns.

Consider the dataset given in Basic-1.csv. Display all rows related to White-collar jobs.

In [15]: import pandas as pd


df = pd.read_csv('data/Basic-1.csv', delimiter=',')
display(df.head())

seleted_rows = df['Job Type']=='White-collar'


display(df.loc[seleted_rows,:])

Gender Job Type Province

0 M Pink-collar Hejaz

1 M White-collar Central

2 M Pink-collar Hejaz

3 M Pink-collar Hejaz

4 M Gold-collar Eastern

Gender Job Type Province

1 M White-collar Central

5 M White-collar Hejaz

8 M White-collar Eastern

9 F White-collar Hejaz

10 M White-collar Hejaz

... ... ... ...

1266 M White-collar Hejaz

1267 M White-collar Hejaz

1271 F White-collar Central

1272 M White-collar Central

1276 F White-collar Central

568 rows × 3 columns

Consider the dataset given in Basic-1.csv.

1. Display all rows related to White-collar jobs in Eastern Province.


2. Display the statistical summary of Gender column in the above selection.

In [16]: import pandas as pd


df = pd.read_csv('data/Basic-1.csv', delimiter=',')

seleted_rows = (df['Province']=='Eastern') & (df['Job Type']=='White-collar')


display(df.loc[seleted_rows,:])

# Summary
display(df.loc[seleted_rows,:].describe().iloc[:,0])
# display(df.loc[seleted_rows,:].describe())
# display(df.loc[seleted_rows,:].describe().loc[:,'Gender'])

Gender Job Type Province

8 M White-collar Eastern

18 F White-collar Eastern

27 M White-collar Eastern

31 M White-collar Eastern

37 M White-collar Eastern

... ... ... ...

1226 M White-collar Eastern

1232 M White-collar Eastern

1238 F White-collar Eastern

1241 M White-collar Eastern

1256 M White-collar Eastern

138 rows × 3 columns

count 138
unique 2
top M
freq 98
Name: Gender, dtype: object

Example: Displaying Data


Question-A: Consider the dataset given in Basic-1.csv.

1. Display all rows related to White-collar and Blue-collar jobs.


2. Display all rows related to White-collar and Blue-collar jobs for Females.
3. Display the statistical summary of Province column in the above selection.
4. Among all Females with White-collar or Blue-collar jobs, what proportion works in the Eastern Province?

In [17]: #1. Display all rows related to White-collar and Blue-collar jobs.

import pandas as pd
df = pd.read_csv('data/Basic-1.csv', delimiter=',')
display(df)

seleted_rows = df['Job Type'].isin(['White-collar','Blue-collar'])

# seleted_rows = (df['Job Type']=='White-collar') | (df['Job Type']=='Blue-collar')

display(df.loc[seleted_rows,:])

Gender Job Type Province

0 M Pink-collar Hejaz

1 M White-collar Central

2 M Pink-collar Hejaz

3 M Pink-collar Hejaz

4 M Gold-collar Eastern

... ... ... ...

1273 M Blue-collar Central

1274 M Pink-collar Hejaz

1275 F Blue-collar Hejaz

1276 F White-collar Central

1277 F Blue-collar Eastern

1278 rows × 3 columns

Gender Job Type Province

1 M White-collar Central

5 M White-collar Hejaz

6 M Blue-collar Eastern

7 M Blue-collar Central

8 M White-collar Eastern

... ... ... ...

1272 M White-collar Central

1273 M Blue-collar Central

1275 F Blue-collar Hejaz

1276 F White-collar Central

1277 F Blue-collar Eastern

984 rows × 3 columns

In [18]: # 2. Display all rows related to White-collar and Blue-collar jobs for Females.

import pandas as pd
df = pd.read_csv('data/Basic-1.csv', delimiter=',')

seleted_rows = df['Job Type'].isin(['White-collar','Blue-collar']) & (df['Gender']=='F')


display(df.loc[seleted_rows,:])

# 3. Display the statistical summary of Province column in the above selection.


df.loc[seleted_rows,'Province'].describe()

Gender Job Type Province

9 F White-collar Hejaz

11 F Blue-collar Central

13 F White-collar Central

14 F White-collar Hejaz

15 F White-collar Central

... ... ... ...

1254 F White-collar Hejaz

1271 F White-collar Central

1275 F Blue-collar Hejaz

1276 F White-collar Central

1277 F Blue-collar Eastern

311 rows × 3 columns

Out[18]: count 311


unique 3
top Central
freq 136
Name: Province, dtype: object

In [19]: # 4. Among all the White-collar and Blue-collar job Females, what proportion works in Eastern Province.
ndf = df.loc[seleted_rows,'Province']
# display(ndf)
# print(ndf.value_counts())

proportion = ndf.value_counts()['Eastern']/len(ndf.index)*100 # select the 'Eastern' count by label (more robust than positional indexing)

## to get length of dataframe


# print(len(ndf.index))
# print(ndf.shape[0])
# print(len(ndf))

print(f'The proportion of Females working as blue/white collar in Eastern province is: {proportion: 0.2f}%')

The proportion of Females working as blue/white collar in Eastern province is: 22.83%

Additional DF Methods

Count Rows and Columns


Consider the dataset given in Basic-2.xlsx.

1. Count the number of rows and columns.


2. Count the number of non-null rows for each column.

In [1]: import pandas as pd


df = pd.read_excel('data/Basic-2.xlsx')
display(df.head())

print(f'The number of rows are {len(df.index)}, and the number of columns are {len(df.columns)}')

print(f'The number of non-null rows for each column are:\n{df.count()}')

# print(f'The number of null rows for each column are:\n{df.isna().sum()}')

print(df.shape)

File "C:\Users\TAHIR~1.MAH\AppData\Local\Temp/ipykernel_29224/563380489.py", line 11


print(df.shape)
^
IndentationError: unexpected indent
Apply Lambda Functions
Lambda functions can be applied to each row or column using the .apply() method (a small row-wise sketch follows).
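The solved examples below apply lambdas column by column; as a small sketch of the row-wise case (made-up data, not one of the course files), axis=1 passes one row at a time to the function:

import pandas as pd

df = pd.DataFrame({'math': [70, 85], 'reading': [72, 90], 'writing': [74, 88]})

# axis=1: the lambda receives one row at a time
df['total'] = df.apply(lambda row: row['math'] + row['reading'] + row['writing'], axis=1)

# axis=0 (the default): the lambda receives one column at a time
col_ranges = df.apply(lambda col: col.max() - col.min())

print(df)
print(col_ranges)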

Consider the data in file Basic-2-Clean.csv.

1. In columns c1, c6 and c11, convert every number to the nearest integer.
2. In column c12, replace the space between two words by underscore.

In [21]: # 1. In columns c1, c6 and c11, convert every number to the nearest integer.
import pandas as pd
df = pd.read_csv('data/Basic-2-Clean.csv')
display(df.head())

df['c1']=df['c1'].apply(lambda x: int(round(x))) # apply for single column


# df['c1']= (lambda x: round(x))(df['c1']) #another style

df.loc[:,['c6','c11']]=df.loc[:,['c6','c11']].applymap(lambda x: int(round(x)) ) #for multiple columns

display(df.head())

c1 c2 c3 c5 c6 c9 c11 c12 c14 c16 c17

0 1.000000 5.000000 3.913333 none 40.000000 2.000000 11.000000 average half full good

1 2.000000 4.500000 5.800000 none 35.000000 4.583333 11.000000 below average full full good

2 2.102564 3.620513 3.913333 none 38.000000 5.000000 11.000000 generous half half good

3 3.000000 3.700000 4.000000 tc 37.810811 4.583333 11.105263 below average half full good

4 3.000000 4.500000 4.500000 none 40.000000 4.583333 12.000000 average half half good

c1 c2 c3 c5 c6 c9 c11 c12 c14 c16 c17

0 1 5.000000 3.913333 none 40 2.000000 11 average half full good

1 2 4.500000 5.800000 none 35 4.583333 11 below average full full good

2 2 3.620513 3.913333 none 38 5.000000 11 generous half half good

3 3 3.700000 4.000000 tc 38 4.583333 11 below average half full good

4 3 4.500000 4.500000 none 40 4.583333 12 average half half good

In [22]: # 2. In column c12, replace the space between two words by underscore.
display(df.head())
df['c12']=df['c12'].apply(lambda x: x.replace(' ','_'))
display(df.head())

c1 c2 c3 c5 c6 c9 c11 c12 c14 c16 c17

0 1 5.000000 3.913333 none 40 2.000000 11 average half full good

1 2 4.500000 5.800000 none 35 4.583333 11 below average full full good

2 2 3.620513 3.913333 none 38 5.000000 11 generous half half good

3 3 3.700000 4.000000 tc 38 4.583333 11 below average half full good

4 3 4.500000 4.500000 none 40 4.583333 12 average half half good

c1 c2 c3 c5 c6 c9 c11 c12 c14 c16 c17

0 1 5.000000 3.913333 none 40 2.000000 11 average half full good

1 2 4.500000 5.800000 none 35 4.583333 11 below_average full full good

2 2 3.620513 3.913333 none 38 5.000000 11 generous half half good

3 3 3.700000 4.000000 tc 38 4.583333 11 below_average half full good

4 3 4.500000 4.500000 none 40 4.583333 12 average half half good

Apply General Functions


Similar to lambda functions, user-defined or built-in functions can be applied to each row or column using the .apply() method. Moreover, the .applymap() method applies a function element-wise, which is convenient for transforming multiple columns at once.

Consider the data in file Basic-2-Clean.csv.


In columns c2, c3 and c9, round the values to the nearest 0, 0.5 or 1 (i.e., to the nearest half). If any value is negative, replace it with zero.

In [23]: def custom_round(x):
    if x <= 0:
        return 0
    int_x = int(x)
    if x <= int_x + 0.25:
        return int_x
    elif (x > int_x + 0.25) and (x <= int_x + 0.75):
        return int_x + 0.5
    else:
        return int_x + 1

import pandas as pd
df = pd.read_csv('data/Basic-2-Clean.csv')
display(df.head())
# df['c2']=df['c2'].apply(custom_round) # for single column
df.loc[:,['c2','c3','c9']]=df.loc[:,['c2','c3','c9']].applymap(custom_round) # for multiple columns
display(df.head())

c1 c2 c3 c5 c6 c9 c11 c12 c14 c16 c17

0 1.000000 5.000000 3.913333 none 40.000000 2.000000 11.000000 average half full good

1 2.000000 4.500000 5.800000 none 35.000000 4.583333 11.000000 below average full full good

2 2.102564 3.620513 3.913333 none 38.000000 5.000000 11.000000 generous half half good

3 3.000000 3.700000 4.000000 tc 37.810811 4.583333 11.105263 below average half full good

4 3.000000 4.500000 4.500000 none 40.000000 4.583333 12.000000 average half half good

c1 c2 c3 c5 c6 c9 c11 c12 c14 c16 c17

0 1.000000 5.0 4.0 none 40.000000 2.0 11.000000 average half full good

1 2.000000 4.5 6.0 none 35.000000 4.5 11.000000 below average full full good

2 2.102564 3.5 4.0 none 38.000000 5.0 11.000000 generous half half good

3 3.000000 3.5 4.0 tc 37.810811 4.5 11.105263 below average half full good

4 3.000000 4.5 4.5 none 40.000000 4.5 12.000000 average half half good

Data Visualization
Graphical representation of data

Why Visualization
1. When the data is small (there is no magic threshold, but say 5 rows and 2 columns), one may extract meaningful patterns just by looking at it.
2. However, when the data is large (again, no magic threshold, but say 5000 rows and 2000 columns), mere observation is unlikely to reveal patterns.
3. Visualization is therefore typically the first step in data analysis.

Typical Visualizations
1. Histogram
2. Box Plots
3. Scatter Plots
4. Line Plots

Pandas Plots
1. Pandas provides built-in basic plots via the numpy and matplotlib libraries.
2. Most pandas plots are available under df.plot, where the plot type is chosen via the kind option (a minimal sketch follows this list).
3. In addition, dedicated methods such as histogram (df.hist()) and boxplot (df.boxplot()) are also available.
4. However, pandas' basic plots are limited, so in this course we use the seaborn library for plotting.
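Before moving on to Seaborn, a minimal sketch of pandas' built-in plotting on random data (the DataFrame and column names here are made up for illustration):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(100, 2), columns=['x', 'y'])

df.plot(kind='scatter', x='x', y='y')  # plot type selected via the kind option
df.hist()                              # dedicated histogram method
df.boxplot()                           # dedicated boxplot method
plt.show()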

In the following cells, we will look at the most typical plots that can be constructed with Seaborn.

Seaborn Plots
1. The Seaborn library provides a variety of plots, including histograms, box plots, line plots, scatter plots, count plots, violin plots, swarm plots, pair plots, catplots, etc.
2. These plots can be accessed through .countplot, .violinplot, .catplot, etc.
3. Typical parameters include

x, y, and/or hue

where x and y refer to the x and y axes, and hue defines subsets of the data that are drawn in different colors (see the sketch after this list).
4. It also offers extensive customization.
5. Seaborn plots are aesthetically better than the basic plots from the pandas library.
6. The library depends on the numpy and matplotlib libraries.
7. It integrates well with pandas DataFrames.
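A small sketch of the x, y and hue parameters on made-up data (the real course examples follow in the next cells):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'score':  [60, 75, 80, 90, 55, 85, 70, 95],
                   'group':  ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
                   'status': ['new', 'old', 'new', 'old', 'new', 'old', 'old', 'new']})

# x and y choose the axes; hue splits each group into color-coded subsets
sns.boxplot(x='group', y='score', hue='status', data=df)
plt.show()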

In the following cells, we will look at the most typical plots from the seaborn library.

Example: Seaborn's Basic Plots


Question-B: Consider the data in file Basic-3.csv.

1. Read the data and display information about it.


2. Add a new column to the DataFrame that represents the total score, i.e., the sum of the math, reading, and writing scores.
3. Draw histograms of both numeric and non-numeric columns.
4. Draw box plots for the numeric columns, differentiated by the test preparation course column.
5. Draw three overlapping scatter plots of the columns math score, reading score and writing score w.r.t. Total score. Set the markers' transparency level (alpha) to 0.5.
6. Draw a scatter plot of the columns math score and writing score, where the marker size is based on the column Total score. In addition, reduce the marker transparency (alpha) to 0.5.
7. Plot Total score in ascending order on the x axis, and the corresponding pairwise absolute differences of math score, reading score and writing score on the y axis.

In [24]: # 1. Read and display information the data.

import pandas as pd
df = pd.read_csv('data/Basic-3.csv')

display(df.head())
display(df.info()) # info() prints its summary directly and returns None (hence the 'None' shown after the output)

gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score

0 female group B bachelor's degree standard none 72 72 74

1 female group C some college standard completed 69 90 88

2 female group B master's degree standard none 90 95 93

3 male group A associate's degree free/reduced none 47 57 44

4 male group C some college standard none 76 78 75

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
None

In [25]: # 2. Add a new column to dataframe that represents total score.


# The total score is sum of score of math, reading, and writing scores

df['Total score']=df['math score']+df['reading score']+df['writing score']

df

Out[25]: gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score Total score

0 female group B bachelor's degree standard none 72 72 74 218

1 female group C some college standard completed 69 90 88 247

2 female group B master's degree standard none 90 95 93 278

3 male group A associate's degree free/reduced none 47 57 44 148

4 male group C some college standard none 76 78 75 229

... ... ... ... ... ... ... ... ... ...

995 female group E master's degree standard completed 88 99 95 282

996 male group C high school free/reduced none 62 55 55 172

997 female group C high school free/reduced completed 59 71 65 195

998 female group D some college standard completed 68 78 77 223

999 female group D some college free/reduced none 77 86 86 249

1000 rows × 9 columns

In [26]: %matplotlib inline


# 3. (a) Draw histograms of both numeric and non-numeric columns.

# Load all the libraries


import matplotlib.pyplot as plt
import seaborn as sns

# Identify numeric columns


num_columns = df.select_dtypes(exclude='object').columns

for c in num_columns:
    plt.figure()
    sns.histplot(x=c, bins=10, data=df);
    plt.show()

## To draw a single column


# plt.figure()
# sns.histplot(x='math score',bins=10,data=df);
# plt.show()

In [27]: # (b) For non-numeric columns, count plot option may be used.

# Identify non-numeric columns


obj_columns = df.select_dtypes('object').columns

# Plot for each identified column


for c in obj_columns:
    plt.figure()
    # sns.histplot(y=c, data=df);
    sns.countplot(y=c, data=df);
    plt.show()

## To draw a single column


# plt.figure()
# sns.countplot(y='gender',data=df);
# plt.show()

In [28]: # 4. Draw box-plots for the numeric columns, and differentiate by *test preparation course* column.

import matplotlib.pyplot as plt


import seaborn as sns

# Identify numeric columns


num_columns = df.select_dtypes(exclude='object').columns

for c in num_columns:
    plt.figure()
    sns.boxplot(y=c, x='test preparation course', data=df);
    plt.show()

# plt.figure()
# sns.boxplot(y='math score',x='test preparation course',data=df);
# plt.show()

In [29]: # 5. Draw three overlapping scatter plots of columns *math score, reading score* and *writing score*
#w.r.t *Total score*. Set the markers transparency level (alpha) to 0.5.

plt.figure(figsize=(10,8))
sns.scatterplot(x='Total score', y='math score', color='red', label='math score', alpha=0.5, data=df)
sns.scatterplot(x='Total score', y='reading score',color='blue',label='reading score', alpha=0.5, data=df)
sns.scatterplot(x='Total score', y='writing score',color='green',label='writing score', alpha=0.5, data=df)
plt.ylabel("Subject Scores");
plt.xlabel("Total Scores");
plt.title("Scores Comparison");
plt.show()

In [30]: # 6. Draw scatter plot of columns *math score* and *writing score*,
#where the size of the marker is based on column *Total score*.
#In addition to that, reduce the marker transparency (alpha) to 0.5.

plt.figure(figsize=(10,10))
sns.scatterplot(x='math score', y='writing score',
size='Total score',sizes=(20,200),
alpha=0.5, color='red',
data=df)
plt.show()

In [31]: # 7. Plot *Total score* in ascending order on the x axis,


# and the corresponding pairwise absolute differences of *math score, reading score* and *writing score*
# on the y axis.

# add new columns


import numpy as np # np was imported earlier in the notebook; repeated here so the cell is self-contained

df['m-r']=np.abs(df['math score']-df['reading score'])
df['m-w']=np.abs(df['math score']-df['writing score'])
df['w-r']=np.abs(df['writing score']-df['reading score'])

# select the data


sel_columns=['m-r','m-w','w-r','Total score']
ndf = df.loc[:,sel_columns]
ndf.set_index('Total score',inplace=True)
ndf.sort_index(inplace=True)
# display(ndf.head())

plt.figure(figsize=(20,10))
sns.lineplot(markers=True, data=ndf) # lineplot draws the y values at each value of x and shows an estimate of the central tendency with a confidence interval
plt.show()

# # the same plot can be made using the following code:


# plt.figure(figsize=(20,10))
# sns.lineplot(y='m-r',x='Total score', data=df)
# sns.lineplot(y='m-w',x='Total score', data=df)
# sns.lineplot(y='w-r',x='Total score', data=df)
# plt.show()

## Or to display the points without line


# plt.figure(figsize=(20,10))
# sns.scatterplot(data=ndf)
# plt.show()

Example: Seaborn's Advanced Plots


Question-C: Consider the data in file Basic-4-Clean.csv.

1. Read the data and display information about it. Remove all rows containing NA values.
2. Depict the Category column by count. In another plot, depict the Category counts differentiated by Type.
3. Draw a box plot of the Installs vs Rating columns, ordered by Installs.
4. Draw a violin plot of the Content Rating vs Rating columns, differentiated w.r.t. Type.
5. Plot a bar chart depicting the count of Categories that have Installs above 0.5B.
6. Depict the Size of the Categories that have Installs above 0.5B.
7. For all the paid apps, depict Genres by count differentiated by Content Rating. In another plot, depict Genres by Rating.

In [3]: # 1. Read and display information the data. Remove all rows containing NA values.

import pandas as pd
df = pd.read_csv('data/Basic-4-Clean.csv')
print(df.info())
#print(df.shape)# Can also be obtained from info
#print(df.count())# Can also be obtained from info
display(df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7723 entries, 0 to 7722
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 7723 non-null object
1 Category 7723 non-null object
2 Rating 7723 non-null float64
3 Reviews 7723 non-null int64
4 Size 7723 non-null float64
5 Installs 7723 non-null object
6 Type 7723 non-null object
7 Price 7723 non-null float64
8 Content Rating 7723 non-null object
9 Genres 7723 non-null object
10 Last Updated 7723 non-null object
11 Current Ver 7723 non-null object
12 Android Ver 7723 non-null object
dtypes: float64(3), int64(1), object(9)
memory usage: 784.5+ KB
None
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver

2028 Night Photo Frame PHOTOGRAPHY 4.5 4400 8.3 1000000+ Free 0.0 Everyone Photography 28-Jul-18 2.0.1 4.0.3 and up

1913 The wall SHOPPING 4.4 35563 4.2 1000000+ Free 0.0 Everyone Shopping 23-Jul-18 10.2.1 4.1 and up

3291 Remote For ATT U-verse TV - NOW FREE TOOLS 2.5 20 26000.0 5000+ Free 0.0 Everyone Tools 25-Jul-18 6.1.5 4.2 and up

3860 Heroes Arena GAME 4.2 336386 88000.0 10000000+ Free 0.0 Everyone 10+ Action 24-Jul-18 2.0.1 4.0.3 and up

880 Six Packs for Man–Body Building with No Equipment HEALTH_AND_FITNESS 4.6 2880 7.1 100000+ Free 0.0 Everyone Health & Fitness 24-Jul-18 1.0.8 4.1 and up

In [33]: # 2. Depict *Category* column by count. In another plot, depict *Category* counts (differentiated) by *Type*.

#plot category counts


import seaborn as sns
plt.figure(figsize=(10,10))
ax = sns.countplot(y="Category", data=df)
plt.title('Apps available')
plt.show()

#plot category with type


import seaborn as sns
plt.figure(figsize=(10,10))
ax = sns.countplot(y='Category',hue='Type',data=df)
plt.title('Apps popularity')
plt.show()

In [34]: #3. Draw a boxplot plot of *Installs* vs *Rating* columns, ordered by *Installs*.

#adding a new column


df['Installs_num']=df['Installs'].apply(lambda x:x.replace('+','')).apply(pd.to_numeric)

#the box plot


ax = plt.figure(figsize=(10,5))
sns.boxplot(x="Installs", y="Rating", data=df.sort_values(by=['Installs_num']))
plt.title("Installs vs Rating")
plt.xticks(rotation=80)
plt.show()

In [35]: #4. Draw a violinplot plot of *Content Rating* vs *Rating* columns, differentiated w.r.t *Type* .

import seaborn as sns


plt.figure(figsize=(10,10))
sns.violinplot(x ='Content Rating', y ='Rating', hue='Type', data = df)
plt.title('Apps ratings')
plt.show()
# the violin plot also depicts the distribution.

# # contrast with box plots


# plt.figure(figsize=(10,10))
# sns.boxplot(x ='Content Rating', y ='Rating', hue='Type', data = df)
# plt.title('Apps ratings')
# plt.show()

In [36]: #5. Plot a barplot depicting count of *Category* that have *Installs* above 0.5B.
import seaborn as sns
seleted_rows = (df['Installs']=='1000000000+') | (df['Installs']=='500000000+')

plt.figure(figsize=(10,2))
sns.countplot(y ='Category', data = df.loc[seleted_rows,:])
plt.title('App Categories in 0.5B+ & above')
plt.show()

In [37]: #6. Depicting the *Size* of *Category* that have *Installs* above 0.5B.

import seaborn as sns

plt.figure(figsize=(5,5))
# ax=sns.swarmplot(x ='Category', y ='Size', data = df.loc[seleted_rows,:])
# ax=sns.violinplot(x ='Category', y ='Size', data = df.loc[seleted_rows,:])
ax=sns.boxplot(x ='Category', y ='Size', data = df.loc[seleted_rows,:])
plt.xticks(rotation=80)
plt.title('Top Apps sizes in Kb')
plt.show()

In [38]: # 7. For all the paid apps, depict *Genres* by count differentiated by *Content Rating*.
# In another plot depict *Genres* by *Rating*.

import seaborn as sns


seleted_rows = df['Type']=='Paid'

plt.figure(figsize=(10,20))
sns.countplot(y ='Genres',hue='Content Rating',data = df.loc[seleted_rows,:])
plt.title('Typical Paid Apps Genre Content')
plt.show()

plt.figure(figsize=(10,20))
sns.boxplot(y ='Genres',x='Rating',data = df.loc[seleted_rows,:])
plt.title('Typical Paid Apps Genre Prices')
plt.show()
References:
Data Sets:
1. Basic-2: Labor Relations Data, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Published in: Bergadano,
F., Matwin, S., Michalski, R., Zhang, J., Measuring Quality of Concept Descriptions, Procs. of the 3rd European Working Sessions on Learning, Glasgow, October 1988.
2. Basic-3: https://www.kaggle.com/spscientist/students-performance-in-exams
3. Basic-4: modified, https://www.kaggle.com/lava18/google-play-store-apps

Others:
1. Series: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
2. DataFrames: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
3. Read CSV: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
4. Read XLSX: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
5. Visualization: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
6. Seaborn: https://seaborn.pydata.org/
