Topic_4_Basic_programming_concepts
Topic_4_Basic_programming_concepts
1 Basic Programming
1.1 Python Programming
1.2 Learning Outcomes
1.3 Data Series
2 DataFrame (DF)
2.1 Creating DataFrame
2.1.1 Create DF with Given n-D Array
2.1.2 Create DF with Random n-D Array
2.1.3 Create DF from CSV
2.1.4 Create DF from XLSX
2.2 Accessing Elements
2.2.1 Accessing via .loc
2.2.2 Accessing via .iloc
2.3 DF Slicing
2.3.1 Single Column Slice
2.3.2 Slicing via .loc
2.3.3 Slicing via .iloc
2.4 DF Sorting
2.5 DF Selecting
2.5.1 Example: Displaying Data
2.6 Addition DF Methods
2.6.1 Count Rows and Columns
2.6.2 Apply Lambda Functions
2.6.3 Apply General Functions
3 Data Visualization
3.1 Why Visualization
3.2 Typical Visualizations
3.3 Pandas Plots
3.4 Seaborn Plots
3.4.1 Example: Seaborn's Basic Plots
3.4.2 Example: Seaborn's Advanced Plots
4 References:
4.1 Data Sets:
4.2 Others:
Basic Programming
Python Programming and DataFrames
Python Programming
1. In ICS 104, we have seen Python as a general purpose programming language.
2. Although, Python can be used in many areas that require programming, in this chapter we see the usage of Python w.r.t data reading and visualization.
3. Specifically, we highlight the usage of Pandas and Seaborn library.
Learning Outcomes
1. Build, read and/or manipulate DataFrames.
2. Learn DataFrame slicing, sorting and selection methods.
3. Visualize data using histogram, box plots, scatter plots, etc.
Data Series
1. In Pandas, series are labeled/indexed one-dimensional arrays.
2. The elements of the series could be integer, float, string, objects, etc.
3. The index may not be unique.
4. A collection of Data Series is called as Data Frame.
5. Our focus in this chapter will be on Data Frames.
DataFrame (DF)
Collection of series, where each column is a series.
Creating DataFrame
In order to create a data from, following things are typically needed:
An optional row index can be given too. In addition to the above, there are many parameters, which can be read from pydata docs.
[ ]
a11 a12
(1)
a21 a22
display(df)
col 1 col 2
df = pd.DataFrame(np.random.rand(1000,3),
columns=['col 1', 'col 2', 'col 3'])
display(df)
# display data
display(df)
0 M Pink-collar Hejaz
1 M White-collar Central
2 M Pink-collar Hejaz
3 M Pink-collar Hejaz
4 M Gold-collar Eastern
unique 2 4 3
df = pd.read_excel('data/Basic-2.xlsx')
display(df.head())
#df.tail()
#df.sample(5)
0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good
1 2.0 4.5 5.8 NaN NaN 35.0 ret_allw NaN NaN yes 11.0 below average NaN full NaN full good
2 NaN NaN NaN NaN NaN 38.0 empl_contr NaN 5.0 NaN 11.0 generous yes half yes half good
3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good
4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good
display(df.describe())
# by default the describe gives summary of the numeric columns, when there are mixed columns
# df.describe(include='all') # can be used, but too many NaNs
c1 c2 c3 c4 c6 c8 c9 c11
## or in one step
# df.select_dtypes(include='object').describe()
count 24 18 18 37 16 25 20 24 40
unique 3 3 2 3 2 3 2 3 2
top none none no below average yes half yes full good
freq 14 8 11 14 11 11 18 12 26
Accessing Elements
.loc and .iloc methods can be used for accessing the elements of a DF.
White-collar
0 M Pink-collar Hejaz
1 M White-collar Central
2 M Pink-collar Hejaz
3 M Pink-collar Hejaz
4 M Gold-collar Eastern
White-collar
White-collar
DF Slicing
1. Slicing is nothing but extracting a part of the DF.
2. The part could be contiguous or broken chunks of the DF.
1. df.COL_NAME
2. df[COL_NAME]
df = pd.read_excel('data/Basic-2.xlsx')
# df.head()
display(df['c1'])
## Following code can be used to create column names when dataset does not contain any column names.
# col_name=[''.join(['c',str(i+1)]) for i in range(17)]
## Following options can be used to add column names to the dataset
# header=None, names=col_name
0 1.0
1 2.0
2 NaN
3 3.0
4 3.0
5 2.0
6 3.0
7 3.0
8 2.0
9 1.0
10 3.0
11 2.0
12 2.0
13 3.0
14 1.0
15 2.0
16 1.0
17 1.0
18 1.0
19 2.0
20 2.0
21 2.0
22 3.0
23 2.0
24 1.0
25 3.0
26 2.0
27 2.0
28 2.0
29 3.0
30 3.0
31 3.0
32 2.0
33 2.0
34 3.0
35 2.0
36 1.0
37 1.0
38 3.0
39 2.0
Name: c1, dtype: float64
display(df.head())
0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good
1 2.0 4.5 5.8 NaN NaN 35.0 ret_allw NaN NaN yes 11.0 below average NaN full NaN full good
2 NaN NaN NaN NaN NaN 38.0 empl_contr NaN 5.0 NaN 11.0 generous yes half yes half good
3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good
4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good
c2 c3 c4 c5
display(df.iloc[:10,1:5])
c2 c3 c4 c5
display(df)
# display(df.loc['row 2',:])
col 1 col-2
DF Sorting
1. DataFrames can be sorted w.r.t values in the columns.
2. One or more columns can be used for sorting.
Consider the dataset given in Basic-2.xlsx. Sort the rows according to the first column, in ascending order. Repeat the sort in descending order.
#Read data
df = pd.read_excel('data/Basic-2.xlsx')
display(df.head())
0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good
1 2.0 4.5 5.8 NaN NaN 35.0 ret_allw NaN NaN yes 11.0 below average NaN full NaN full good
2 NaN NaN NaN NaN NaN 38.0 empl_contr NaN 5.0 NaN 11.0 generous yes half yes half good
3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good
4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good
0 1.0 5.0 NaN NaN NaN 40.0 NaN NaN 2.0 NaN 11.0 average NaN NaN yes NaN good
16 1.0 2.8 NaN NaN NaN 35.0 NaN NaN 2.0 NaN 12.0 below average NaN NaN NaN NaN good
14 1.0 3.0 NaN NaN none 36.0 NaN NaN 10.0 no 11.0 generous NaN NaN NaN NaN good
24 1.0 6.0 NaN NaN NaN 38.0 NaN 8.0 3.0 NaN 9.0 generous NaN NaN NaN NaN good
17 1.0 2.1 NaN NaN tc 40.0 ret_allw 2.0 3.0 no 9.0 below average yes half NaN none bad
29 3.0 2.0 2.5 NaN NaN 35.0 none NaN NaN NaN 10.0 average NaN NaN yes full bad
7 3.0 6.9 4.8 2.3 NaN 40.0 NaN NaN 3.0 NaN 12.0 below average NaN NaN NaN NaN good
3 3.0 3.7 4.0 5.0 tc NaN NaN NaN NaN yes NaN NaN NaN NaN yes NaN good
4 3.0 4.5 4.5 5.0 NaN 40.0 NaN NaN NaN NaN 12.0 average NaN half yes half good
34 3.0 2.0 2.5 2.1 tc 40.0 none 2.0 1.0 no 10.0 below average no half yes full bad
Consider the dataset given in Basic-2.xlsx. Sort the rows according to the first and then sixth column, where ascending in the first and descending in the sixth column. Display only first 6 columns.
display(df.iloc[:,0:6])
c1 c2 c3 c4 c5 c6
DF Selecting
Selecting is similar to the slicing. Typically, it deals with the following:
Consider the dataset given in Basic-1.csv. Display all rows related to White-collar jobs.
0 M Pink-collar Hejaz
1 M White-collar Central
2 M Pink-collar Hejaz
3 M Pink-collar Hejaz
4 M Gold-collar Eastern
1 M White-collar Central
5 M White-collar Hejaz
8 M White-collar Eastern
9 F White-collar Hejaz
10 M White-collar Hejaz
# Summary
display(df.loc[seleted_rows,:].describe().iloc[:,0])
# display(df.loc[seleted_rows,:].describe())
# display(df.loc[seleted_rows,:].describe().loc[:,'Gender'])
8 M White-collar Eastern
18 F White-collar Eastern
27 M White-collar Eastern
31 M White-collar Eastern
37 M White-collar Eastern
count 138
unique 2
top M
freq 98
Name: Gender, dtype: object
In [17]: #1. Display all rows related to White-collar and Blue-collar jobs.
import pandas as pd
df = pd.read_csv('data/Basic-1.csv', delimiter=',')
display(df)
display(df.loc[seleted_rows,:])
0 M Pink-collar Hejaz
1 M White-collar Central
2 M Pink-collar Hejaz
3 M Pink-collar Hejaz
4 M Gold-collar Eastern
1 M White-collar Central
5 M White-collar Hejaz
6 M Blue-collar Eastern
7 M Blue-collar Central
8 M White-collar Eastern
In [18]: # 2. Display all rows related to White-collar and Blue-collar jobs for Females.
import pandas as pd
df = pd.read_csv('data/Basic-1.csv', delimiter=',')
9 F White-collar Hejaz
11 F Blue-collar Central
13 F White-collar Central
14 F White-collar Hejaz
15 F White-collar Central
In [19]: # 4. Among all the White-collar and Blue-collar job Females, what proportion works in Eastern Province.
ndf = df.loc[seleted_rows,'Province']
# display(ndf)
# print(ndf.value_counts())
proportion = ndf.value_counts()[2]/len(ndf.index)*100
print(f'The proportion of Females working as blue/white collar in Eastern province is: {proportion: 0.2f}%')
The proportion of Females working as blue/white collar in Eastern province is: 22.83%
Addition DF Methods
print(f'The number of rows are {len(df.index)}, and the number of columns are {len(df.columns)}')
print(df.shape)
1. In columns c1, c6 and c11, convert every number to the nearest integer.
2. In column c12, replace the space between two words by underscore.
In [21]: # 1. In columns c1, c6 and c11, convert every number to the nearest integer.
import pandas as pd
df = pd.read_csv('data/Basic-2-Clean.csv')
display(df.head())
display(df.head())
0 1.000000 5.000000 3.913333 none 40.000000 2.000000 11.000000 average half full good
1 2.000000 4.500000 5.800000 none 35.000000 4.583333 11.000000 below average full full good
2 2.102564 3.620513 3.913333 none 38.000000 5.000000 11.000000 generous half half good
3 3.000000 3.700000 4.000000 tc 37.810811 4.583333 11.105263 below average half full good
4 3.000000 4.500000 4.500000 none 40.000000 4.583333 12.000000 average half half good
In [22]: # 2. In column c12, replace the space between two words by underscore.
display(df.head())
df['c12']=df['c12'].apply(lambda x: x.replace(' ','_'))
display(df.head())
import pandas as pd
df = pd.read_csv('data/Basic-2-Clean.csv')
display(df.head())
# df['c2']=df['c2'].apply(custom_round) # for single column
df.loc[:,['c2','c3','c9']]=df.loc[:,['c2','c3','c9']].applymap(custom_round) # for multiple columns
display(df.head())
0 1.000000 5.000000 3.913333 none 40.000000 2.000000 11.000000 average half full good
1 2.000000 4.500000 5.800000 none 35.000000 4.583333 11.000000 below average full full good
2 2.102564 3.620513 3.913333 none 38.000000 5.000000 11.000000 generous half half good
3 3.000000 3.700000 4.000000 tc 37.810811 4.583333 11.105263 below average half full good
4 3.000000 4.500000 4.500000 none 40.000000 4.583333 12.000000 average half half good
0 1.000000 5.0 4.0 none 40.000000 2.0 11.000000 average half full good
1 2.000000 4.5 6.0 none 35.000000 4.5 11.000000 below average full full good
2 2.102564 3.5 4.0 none 38.000000 5.0 11.000000 generous half half good
3 3.000000 3.5 4.0 tc 37.810811 4.5 11.105263 below average half full good
4 3.000000 4.5 4.5 none 40.000000 4.5 12.000000 average half half good
Data Visualization
Graphical representation of data
Why Visualization
1. When data is small (no magic numbers, but say 5 rows and 2 columns), then one may extract meaningful patterns by looking at the data.
2. However, when we have large data (again no magic numbers, but say 5000 rows and 2000 columns), then mere observation may not be fruitful in identifying patterns.
3. Visualization is the first typical step in data analysis.
Typical Visualizations
1. Histogram
2. Box Plots
3. Scatter Plots
4. Line Plots
Pandas Plots
1. Pandas provide in-built basic plots using the numpy and matplotlib libraries.
2. Several plots from pandas are under df.plot, which can be accessed via kind option.
3. In addition to that, specific plots like histogram (df.hist()), and boxplot (df.boxplot()) are also available.
4. However, pandas basic plots are not enough, and in the course we look at the seaborn library for plotting.
In the following cells, we will look at most typical plots that can be constructed using seaborn plots.
Seaborn Plots
1. Seaborn library provides a variety of plots, inlcluding histogram, boxplots, lineplots, scatterplots, countplot, violinplot, swarmplot, pairplot, catplot etc.
2. These plots can be accessed through .countplot, .violinplot, catplot, etc.
3. Typical parameters include
x, y, and/or hue
where x, y refers to the x and y axis, and hue defines subsets of the data, which will be drawn on separate facets in the grid.
4. It also comes with huge customization.
5. The Seaborn Plots are aesthetically better than basic plots from pandas library.
6. The library requires numpy and matplotlib libraries.
7. It is integrated with pandas data.
In the following cell, we will look at most typical plots from the seaborn library.
import pandas as pd
df = pd.read_csv('data/Basic-3.csv')
display(df.head())
display(df.info())
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
None
df
Out[25]: gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score Total score
... ... ... ... ... ... ... ... ... ...
for c in num_columns:
plt.figure()
sns.histplot(x=c,bins=10,data=df);
plt.show()
In [27]: # (b) For non-numeric columns, count plot option may be used.
In [28]: # 4. Draw box-plots for the numeric columns, and differentiate by *test preparation course* column.
for c in num_columns:
plt.figure()
sns.boxplot(y=c,x='test preparation course',data=df);
plt.show()
# plt.figure()
# sns.boxplot(y='math score',x='test preparation course',data=df);
# plt.show()
In [29]: # 5. Draw three overlapping scatter plots of columns *math score, reading score* and *writing score*
#w.r.t *Total score*. Set the markers transparency level (alpha) to 0.5.
plt.figure(figsize=(10,8))
sns.scatterplot(x='Total score', y='math score', color='red', label='math score', alpha=0.5, data=df)
sns.scatterplot(x='Total score', y='reading score',color='blue',label='reading score', alpha=0.5, data=df)
sns.scatterplot(x='Total score', y='writing score',color='green',label='writing score', alpha=0.5, data=df)
plt.ylabel("Subject Scores");
plt.xlabel("Total Scores");
plt.title("Scores Comparison");
plt.show()
In [30]: # 6. Draw scatter plot of columns *math score* and *writing score*,
#where the size of the marker is based on column *Total score*.
#In addition to that, reduce the marker transparency (alpha) to 0.5.
-
plt.figure(figsize=(10,10))
sns.scatterplot(x='math score', y='writing score',
size='Total score',sizes=(20,200),
alpha=0.5, color='red',
data=df)
plt.show()
plt.figure(figsize=(20,10))
sns.lineplot(markers=True, data=ndf) #the plot `y values at each value of x and shows an estimate of the central tendency and a confidence interval for that estimate
plt.show()
1. Read and display information the data. Remove all rows containing NA values.
2. Depict Category column by count. In another plot, depict Category counts (differentiated) by Type.
3. Draw a boxplot plot of Installs vs Rating columns, ordered by Installs.
4. Draw a violinplot plot of Content Rating vs Rating columns, differentiated w.r.t Type .
5. Plot a barplot depicting count of Category that have Installs above 0.5B.
6. Depicting the Size of Category that have Installs above 0.5B.
7. For all the paid apps, depict Genres by count diferentiated by Content Rating. In another plot depict Genres by Rating.
In [3]: # 1. Read and display information the data. Remove all rows containing NA values.
import pandas as pd
df = pd.read_csv('data/Basic-4-Clean.csv')
print(df.info())
#print(df.shape)# Can also be obtained from info
#print(df.count())# Can also be obtained from info
display(df.sample(5))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7723 entries, 0 to 7722
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 7723 non-null object
1 Category 7723 non-null object
2 Rating 7723 non-null float64
3 Reviews 7723 non-null int64
4 Size 7723 non-null float64
5 Installs 7723 non-null object
6 Type 7723 non-null object
7 Price 7723 non-null float64
8 Content Rating 7723 non-null object
9 Genres 7723 non-null object
10 Last Updated 7723 non-null object
11 Current Ver 7723 non-null object
12 Android Ver 7723 non-null object
dtypes: float64(3), int64(1), object(9)
memory usage: 784.5+ KB
None
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
2028 Night Photo Frame PHOTOGRAPHY 4.5 4400 8.3 1000000+ Free 0.0 Everyone Photography 28-Jul-18 2.0.1 4.0.3 and up
1913 The wall SHOPPING 4.4 35563 4.2 1000000+ Free 0.0 Everyone Shopping 23-Jul-18 10.2.1 4.1 and up
3291 Remote For ATT U-verse TV - NOW FREE TOOLS 2.5 20 26000.0 5000+ Free 0.0 Everyone Tools 25-Jul-18 6.1.5 4.2 and up
3860 Heroes Arena GAME 4.2 336386 88000.0 10000000+ Free 0.0 Everyone 10+ Action 24-Jul-18 2.0.1 4.0.3 and up
880 Six Packs for Man–Body Building with No Equipment HEALTH_AND_FITNESS 4.6 2880 7.1 100000+ Free 0.0 Everyone Health & Fitness 24-Jul-18 1.0.8 4.1 and up
In [33]: # 2. Depict *Category* column by count. In another plot, depict *Category* counts (differentiated) by *Type*.
In [34]: #3. Draw a boxplot plot of *Installs* vs *Rating* columns, ordered by *Installs*.
In [35]: #4. Draw a violinplot plot of *Content Rating* vs *Rating* columns, differentiated w.r.t *Type* .
In [36]: #5. Plot a barplot depicting count of *Category* that have *Installs* above 0.5B.
import seaborn as sns
seleted_rows = (df['Installs']=='1000000000+') | (df['Installs']=='500000000+')
plt.figure(figsize=(10,2))
sns.countplot(y ='Category', data = df.loc[seleted_rows,:])
plt.title('App Categories in 0.5B+ & above')
plt.show()
In [37]: #6. Depicting the *Size* of *Category* that have *Installs* above 0.5B.
plt.figure(figsize=(5,5))
# ax=sns.swarmplot(x ='Category', y ='Size', data = df.loc[seleted_rows,:])
# ax=sns.violinplot(x ='Category', y ='Size', data = df.loc[seleted_rows,:])
ax=sns.boxplot(x ='Category', y ='Size', data = df.loc[seleted_rows,:])
plt.xticks(rotation=80)
plt.title('Top Apps sizes in Kb')
plt.show()
In [38]: # 7. For all the paid apps, depict *Genres* by count diferentiated by *Content Rating*.
# In another plot depict *Genres* by *Rating*.
plt.figure(figsize=(10,20))
sns.countplot(y ='Genres',hue='Content Rating',data = df.loc[seleted_rows,:])
plt.title('Typical Paid Apps Genre Content')
plt.show()
plt.figure(figsize=(10,20))
sns.boxplot(y ='Genres',x='Rating',data = df.loc[seleted_rows,:])
plt.title('Typical Paid Apps Genre Prices')
plt.show()
References:
Data Sets:
1. Basic-2: Labor Relations Data, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Published in: Bergadano,
F., Matwin, S., Michalski, R., Zhang, J., Measuring Quality of Concept Descriptions, Procs. of the 3rd European Working Sessions on Learning, Glasgow, October 1988.
2. Basic-3: https://www.kaggle.com/spscientist/students-performance-in-exams
3. Basic-4: modified, https://www.kaggle.com/lava18/google-play-store-apps
Others:
1. Series: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
2. DataFrames: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
3. Read CSV: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
4. Read XLSX: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
5. Visualization: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
6. Seaborn: https://seaborn.pydata.org/