Pandas
Pandas
Pandas
What is Pandas
Pandas is a fast, powerful, flexible and easy to use open source data
analysis and manipulation tool,built on top of the Python
programming language.
Prudhvi Vardhan Notes
Pandas Series
A Pandas Series is like a column in a table. It is a 1-D array holding data of any type.
Importing Pandas
Out[6]: 0 India
1 Pakistan
2 USA
3 Nepal
4 Srilanka
dtype: object
In [7]: # integers
marks= [13,24,56,78,100]
pd.Series(marks)
Out[7]: 0 13
1 24
2 56
3 78
4 100
dtype: int64
Out[8]: maths 67
english 57
science 89
hindi 100
dtype: int64
Out[10]: maths 67
english 57
science 89
hindi 100
Name: Jack Marks, dtype: int64
In [11]: marks = {
'maths':67,
'english':57,
'science':89,
'hindi':100
}
marks_series = pd.Series(marks,name="jack Marks")
In [12]: marks_series
Out[12]: maths 67
english 57
science 89
hindi 100
Name: jack Marks, dtype: int64
Series Attributes
In [13]: # size
marks_series.size
Out[13]: 4
In [14]: # dtype
marks_series.dtype
Out[14]: dtype('int64')
In [15]: # name
marks_series.name
unique is an attribute of a Pandas Series that returns an array of the unique values in the
Series.
In [16]: # is_unique
marks_series.is_unique
Out[16]: True
Out[17]: False
In [18]: # index
marks_series.index
In [19]: # values
marks_series.values
In [20]: type(marks_series.values)
Out[20]: numpy.ndarray
Pandas.read_csv
In [23]: type(sub)
Out[23]: pandas.core.frame.DataFrame
In [30]: sub.head(5)
0 48
1 57
2 40
3 43
4 44
In [32]: type(sub)
Out[32]: pandas.core.series.Series
In [33]: sub
Out[33]: 0 48
1 57
2 40
3 43
4 44
...
360 231
361 226
362 155
363 144
364 172
Name: Subscribers gained, Length: 365, dtype: int64
In [57]: kl
Out[57]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [38]: movies
Out[38]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
Series Methods
In [40]: # Head
sub.head()
Out[40]: 0 48
1 57
2 40
3 43
4 44
Name: Subscribers gained, dtype: int64
In [41]: # tail
kl.tail()
Out[41]: match_no
211 0
212 20
213 73
214 25
215 7
Name: runs, dtype: int64
Out[43]: movie
Enemmy Sunil Shetty
Name: lead, dtype: object
Out[45]: match_no
87 0
211 0
207 0
206 0
91 0
...
164 100
120 100
123 108
126 109
128 113
Name: runs, Length: 215, dtype: int64
Out[50]: 113
Out[55]: match_no
87 0
211 0
207 0
206 0
91 0
...
164 100
120 100
123 108
126 109
128 113
Name: runs, Length: 215, dtype: int64
Out[60]: movie
1920 (film) Rajniesh Duggall
1920: London Sharman Joshi
1920: The Evil Returns Vicky Ahuja
1971 (2007 film) Manoj Bajpayee
2 States (2014 film) Arjun Kapoor
...
Zindagi 50-50 Veena Malik
Zindagi Na Milegi Dobara Hrithik Roshan
Zindagi Tere Naam Mithun Chakraborty
Zokkomon Darsheel Safary
Zor Lagaa Ke...Haiya! Meghan Jadhav
Name: lead, Length: 1500, dtype: object
In [61]: movies.sort_index(ascending=False)
Out[61]: movie
Zor Lagaa Ke...Haiya! Meghan Jadhav
Zokkomon Darsheel Safary
Zindagi Tere Naam Mithun Chakraborty
Zindagi Na Milegi Dobara Hrithik Roshan
Zindagi 50-50 Veena Malik
...
2 States (2014 film) Arjun Kapoor
1971 (2007 film) Manoj Bajpayee
1920: The Evil Returns Vicky Ahuja
1920: London Sharman Joshi
1920 (film) Rajniesh Duggall
Name: lead, Length: 1500, dtype: object
In [62]: # count
kl.count()
Out[62]: 215
Out[66]: 49510
Out[67]: 0
Statical Methods
In [68]: # mean
sub.mean()
Out[68]: 135.64383561643837
In [72]: # median
kl.median()
Out[72]: 24.0
mode(): The mode is the value that appears most frequently in the Series.
In [74]: # mode
print(movies.mode())
0 Akshay Kumar
dtype: object
Out[71]: 62.67502303725269
Out[75]: 3928.1585127201556
In [76]: # min
sub.min()
Out[76]: 33
In [77]: # max
sub.max()
Out[77]: 396
In [79]: # describe
movies.describe()
In [80]: kl.describe()
In [81]: sub.describe()
Series Indexing
In [83]: # integer indexing
x = pd.Series([12,13,14,35,46,57,58,79,9])
x[1]
Out[83]: 13
In [86]: movies[0]
In [87]: sub[0]
Out[87]: 48
In [90]: # slicing
kl[4:10]
Out[90]: match_no
5 1
6 9
7 34
8 0
9 21
10 3
Name: runs, dtype: int64
In [96]: movies[-5:]
Out[96]: movie
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, dtype: object
In [97]: movies[::2]
Out[97]: movie
Uri: The Surgical Strike Vicky Kaushal
The Accidental Prime Minister (film) Anupam Kher
Evening Shadows Mona Ambegaonkar
Fraud Saiyaan Arshad Warsi
Manikarnika: The Queen of Jhansi Kangana Ranaut
...
Raaz (2002 film) Dino Morea
Waisa Bhi Hota Hai Part II Arshad Warsi
Kaante Amitabh Bachchan
Aankhen (2002 film) Amitabh Bachchan
Company (film) Ajay Devgn
Name: lead, Length: 750, dtype: object
Out[98]: match_no
1 1
8 0
22 38
11 10
2 23
Name: runs, dtype: int64
Out[99]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
Out[101]: maths 67
english 57
science 89
hindi 100
Name: jack Marks, dtype: int64
In [102]: marks_series[1]=88
marks_series
Out[102]: maths 67
english 88
science 89
hindi 100
Name: jack Marks, dtype: int64
Out[103]: maths 67
english 88
science 89
hindi 100
social 90
Name: jack Marks, dtype: int64
Out[111]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Shah Rukh Khan
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
In [115]: movies
Out[115]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Jack
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
In [117]: # len/type/dir/sorted/max/min
print(len(sub))
print(type(sub))
365
<class 'pandas.core.series.Series'>
In [122]: print(dir(sub))
print(sorted(sub))
In [123]: print(min(sub))
print(max(sub))
33
396
In [126]: dict(marks_series)
Out[126]: {'maths': 67, 'english': 88, 'science': 89, 'hindi': 100, 'social': 90}
Out[129]: True
Out[133]: True
In [138]: # looping
for i in movies:
print(i)
Vicky Kaushal
Vicky Ahuja
Anupam Kher
Emraan Hashmi
Mona Ambegaonkar
Geetika Vidya Ohlyan
Arshad Warsi
Radhika Apte
Kangana Ranaut
Nawazuddin Siddiqui
Ali Asgar
Ranveer Singh
Prit Kamani
Ajay Devgn
Sushant Singh Rajput
Amitabh Bachchan
Abhimanyu Dasani
Talha Arshad Reshi
Nawazuddin Siddiqui
G i A l
Out[140]: maths 33
english 12
science 11
hindi 0
social 10
Name: jack Marks, dtype: int64
In [141]: 100+marks_series
Out[143]: match_no
1 False
2 False
3 False
4 False
5 False
...
211 False
212 False
213 True
214 False
215 False
Name: runs, Length: 215, dtype: bool
Out[146]: 50
Out[148]: 9
In [149]: # Count number of day when I had more than 200 subs a day
sub[sub>=200].size
Out[149]: 59
In [160]: num_mov[num_mov>=20].size
Out[160]: 7
In [162]: sub.plot()
Out[162]: <AxesSubplot:>
In [164]: movies.value_counts().head(20).plot(kind="pie")
Out[164]: <AxesSubplot:ylabel='lead'>
In [165]: movies.value_counts().head(20).plot(kind="bar")
Out[165]: <AxesSubplot:>
In [166]: # astype
# between
# clip
# drop_duplicates
# isnull
# dropna
# fillna
# isin
# apply
# copy
In [175]: # astype
import sys
sys.getsizeof(kl)
Out[175]: 11752
In [176]: kl
Out[176]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [177]: (kl.astype("int16"))
Out[177]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int16
In [178]: sys.getsizeof(kl.astype("int16"))
Out[178]: 10462
In [181]: # between
kl[kl.between(50,60)]
Out[181]: match_no
15 50
34 58
44 56
57 57
71 51
73 58
80 57
85 56
103 51
122 52
129 54
131 54
137 55
141 58
144 57
182 50
197 51
198 53
209 58
Name: runs, dtype: int64
In [182]: kl[kl.between(50,60)].size
Out[182]: 19
In [183]: # clip
sub.clip(100,200)
Out[183]: 0 100
1 100
2 100
3 100
4 100
...
360 200
361 200
362 155
363 144
364 172
Name: Subscribers gained, Length: 365, dtype: int64
In [186]: # drop duplicates #### drop_duplicates(): Returns a Series with duplicates rem
dele = pd.Series([1,2,33,3,3,3,1,23,33,22,33,11])
dele
Out[186]: 0 1
1 2
2 33
3 3
4 3
5 3
6 1
7 23
8 33
9 22
10 33
11 11
dtype: int64
In [188]: dele.drop_duplicates()
Out[188]: 0 1
1 2
2 33
3 3
7 23
9 22
11 11
dtype: int64
In [189]: dele.drop_duplicates(keep='last')
Out[189]: 1 2
5 3
6 1
7 23
9 22
10 33
11 11
dtype: int64
In [190]: movies.drop_duplicates()
Out[190]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Rules: Pyaar Ka Superhit Formula Tanuja
Right Here Right Now (film) Ankit
Talaash: The Hunt Begins... Rakhee Gulzar
The Pink Mirror Edwin Fernandes
Hum Tumhare Hain Sanam Jack
Name: lead, Length: 567, dtype: object
In [191]: dele.duplicated().sum()
Out[191]: 5
In [193]: kl.duplicated().sum()
Out[193]: 137
In [194]: dele.count()
Out[194]: 12
isin(values): Returns a boolean Series indicating whether each element in the Series is
in the provided values
In [198]: # isnull
kl.isnull().sum()
Out[198]: 0
In [199]: dele.isnull().sum()
Out[199]: 0
In [200]: # dropna
dele.dropna()
Out[200]: 0 1
1 2
2 33
3 3
4 3
5 3
6 1
7 23
8 33
9 22
10 33
11 11
dtype: int64
In [202]: # fillna
dele.fillna(0)
dele.fillna(dele.mean())
Out[202]: 0 1
1 2
2 33
3 3
4 3
5 3
6 1
7 23
8 33
9 22
10 33
11 11
dtype: int64
In [205]: # isin
kl
Out[205]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
Out[207]: match_no
82 99
86 49
Name: runs, dtype: int64
In [209]: kl[kl.isin([49,99])]
Out[209]: match_no
82 99
86 49
Name: runs, dtype: int64
In [210]: # apply
movies
Out[210]: movie
Uri: The Surgical Strike Vicky Kaushal
Battalion 609 Vicky Ahuja
The Accidental Prime Minister (film) Anupam Kher
Why Cheat India Emraan Hashmi
Evening Shadows Mona Ambegaonkar
...
Hum Tumhare Hain Sanam Jack
Aankhen (2002 film) Amitabh Bachchan
Saathiya (film) Vivek Oberoi
Company (film) Ajay Devgn
Awara Paagal Deewana Akshay Kumar
Name: lead, Length: 1500, dtype: object
Out[212]: movie
Uri: The Surgical Strike [Vicky, Kaushal]
Battalion 609 [Vicky, Ahuja]
The Accidental Prime Minister (film) [Anupam, Kher]
Why Cheat India [Emraan, Hashmi]
Evening Shadows [Mona, Ambegaonkar]
...
Hum Tumhare Hain Sanam [Jack]
Aankhen (2002 film) [Amitabh, Bachchan]
Saathiya (film) [Vivek, Oberoi]
Company (film) [Ajay, Devgn]
Awara Paagal Deewana [Akshay, Kumar]
Name: lead, Length: 1500, dtype: object
Out[213]: movie
Uri: The Surgical Strike Vicky
Battalion 609 Vicky
The Accidental Prime Minister (film) Anupam
Why Cheat India Emraan
Evening Shadows Mona
...
Hum Tumhare Hain Sanam Jack
Aankhen (2002 film) Amitabh
Saathiya (film) Vivek
Company (film) Ajay
Awara Paagal Deewana Akshay
Name: lead, Length: 1500, dtype: object
Out[214]: movie
Uri: The Surgical Strike VICKY
Battalion 609 VICKY
The Accidental Prime Minister (film) ANUPAM
Why Cheat India EMRAAN
Evening Shadows MONA
...
Hum Tumhare Hain Sanam JACK
Aankhen (2002 film) AMITABH
Saathiya (film) VIVEK
Company (film) AJAY
Awara Paagal Deewana AKSHAY
Name: lead, Length: 1500, dtype: object
In [215]: sub
Out[215]: 0 48
1 57
2 40
3 43
4 44
...
360 231
361 226
362 155
363 144
364 172
Name: Subscribers gained, Length: 365, dtype: int64
In [216]: sub.mean()
Out[216]: 135.64383561643837
In [229]: # Copy
kl
Out[229]: match_no
1 1
2 23
3 13
4 12
5 1
..
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [231]: new[1]=100
In [232]: new
Out[232]: match_no
1 100
2 23
3 13
4 12
5 1
Name: runs, dtype: int64
In [233]: kl
Out[233]: match_no
1 100
2 23
3 13
4 12
5 1
...
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [241]: new[1]=20
In [242]: new
Out[242]: match_no
1 20
2 23
3 13
4 12
5 1
Name: runs, dtype: int64
In [250]: kl
Out[250]: match_no
1 100
2 23
3 13
4 12
5 1
...
211 0
212 20
213 73
214 25
215 7
Name: runs, Length: 215, dtype: int64
In [ ]:
In [ ]:
Creating DataFrame
Out[2]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 50 2
Out[3]:
name iq marks package
0 peter 100 80 10
1 saint 90 70 7
3 parle 80 50 2
4 samme 13 11 15
5 dave 90 80 100
In [4]: students.set_index('name',inplace=True)
students
Out[4]:
iq marks package
name
peter 100 80 10
saint 90 70 7
parle 80 50 2
samme 13 11 15
dave 90 80 100
Out[5]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_
Battalion Battalion
1 tt9472208 NaN https://en.wikipedia.org/wiki/Battalion_609 Battalion 609 0
609 609
The
The The
Accidental
Accidental Accidental
2 Prime tt6986710 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/The_Accidental_P... 0
Prime Prime
Minister
Minister Minister
(film)
Why Why
Why Cheat
3 Cheat tt8108208 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Why_Cheat_India Cheat 0
India
India India
Out[6]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam W
Narendra
2022- Rajasthan Gujarat Modi Rajasthan
0 1312200 Ahmedabad 2022 Final bat N Gujarat Titans Wi
05-29 Royals Titans Stadium, Royals
Ahmedabad
Narendra
Royal
2022- Rajasthan Modi Rajasthan Rajasthan
1 1312199 Ahmedabad 2022 Qualifier 2 Challengers field N Wi
05-27 Royals Stadium, Royals Royals
Bangalore
Ahmedabad
Eden
2022- Rajasthan Gujarat Gujarat
3 1312197 Kolkata 2022 Qualifier 1 Gardens, field N Gujarat Titans Wi
05-24 Royals Titans Titans
K lk t
In [7]: # shape
movies.shape
In [8]: ipl.shape
Out[8]: (950, 20)
In [9]: # dtype
movies.dtypes
In [10]: ipl.dtypes
Out[10]: ID int64
City object
Date object
Season object
MatchNumber object
Team1 object
Team2 object
Venue object
TossWinner object
TossDecision object
SuperOver object
WinningTeam object
WonBy object
Margin float64
method object
Player_of_Match object
Team1Players object
Team2Players object
Umpire1 object
Umpire2 object
dtype: object
In [11]: # index
movies.index
In [12]: ipl.index
In [13]: # Columns
movies.columns
In [14]: ipl.columns
In [15]: # Values
students.values
In [16]: ipl.values
Out[16]: array([[1312200, 'Ahmedabad', '2022-05-29', ...,
"['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pandya', 'DA Miller', 'R Tewatia', 'Rashid Khan', 'R Sai Kishore',
'LH Ferguson', 'Yash Dayal', 'Mohammed Shami']",
'CB Gaffaney', 'Nitin Menon'],
[1312199, 'Ahmedabad', '2022-05-27', ...,
"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D Padikkal', 'SO Hetmyer', 'R Parag', 'R Ashwin', 'TA Boult', 'YS C
hahal', 'M Prasidh Krishna', 'OC McCoy']",
'CB Gaffaney', 'Nitin Menon'],
[1312198, 'Kolkata', '2022-05-25', ...,
"['Q de Kock', 'KL Rahul', 'M Vohra', 'DJ Hooda', 'MP Stoinis', 'E Lewis', 'KH Pandya', 'PVD Chameera', 'Mohsin
Khan', 'Avesh Khan', 'Ravi Bishnoi']",
'J Madanagopal', 'MA Gough'],
...,
[335984, 'Delhi', '2008-04-19', ...,
"['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif', 'DS Lehmann', 'RA Jadeja', 'M Rawat', 'D Salunkhe', 'SK Warne',
'SK Trivedi', 'MM Patel']",
'Aleem Dar', 'GA Pratapkumar'],
[335983, 'Chandigarh', '2008-04-19', ...,
"['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dhoni', 'SK Raina', 'JDP Oram', 'S Badrinath', 'Joginder Sharma',
'P Amarnath', 'MS Gony', 'M Muralitharan']",
'MR Benson', 'SL Shastri'],
[335982, 'Bangalore', '2008-04-18', ...,
"['SC Ganguly', 'BB McCullum', 'RT Ponting', 'DJ Hussey', 'Mohammad Hafeez', 'LR Shukla', 'WP Saha', 'AB Agarka
r', 'AB Dinda', 'M Kartik', 'I Sharma']",
'Asad Rauf', 'RE Koertzen']], dtype=object)
Out[17]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam WonBy Ma
Chennai Brabourne
2022- Punjab Chennai
63 1304057 Mumbai 2022 11 Super Stadium, field N Punjab Kings Runs 5
04-03 Kings Super Kings
Kings Mumbai
In [18]: # info
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title_x 1629 non-null object
1 imdb_id 1629 non-null object
2 poster_path 1526 non-null object
3 wiki_link 1629 non-null object
4 title_y 1629 non-null object
5 original_title 1629 non-null object
6 is_adult 1629 non-null int64
7 year_of_release 1629 non-null int64
8 runtime 1629 non-null object
9 genres 1629 non-null object
10 imdb_rating 1629 non-null float64
11 imdb_votes 1629 non-null int64
12 story 1609 non-null object
13 summary 1629 non-null object
14 tagline 557 non-null object
15 actors 1624 non-null object
16 wins_nominations 707 non-null object
17 release_date 1522 non-null object
dtypes: float64(1), int64(3), object(14)
memory usage: 229.2+ KB
In [19]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int64
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null object
4 MatchNumber 950 non-null object
5 Team1 950 non-null object
6 Team2 950 non-null object
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: float64(1), int64(1), object(18)
memory usage: 148.6+ KB
In [20]: # describe
movies.describe()
Out[20]:
is_adult year_of_release imdb_rating imdb_votes
In [21]: ipl.describe()
Out[21]:
ID Margin
In [22]: # isnull
movies.isnull()
Out[22]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_of_release runtime genres imdb_rating imdb_votes story summar
0 False False False False False False False False False False False False False Fals
1 False False True False False False False False False False False False False Fals
2 False False False False False False False False False False False False False Fals
3 False False False False False False False False False False False False False Fals
4 False False True False False False False False False False False False False Fals
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1624 False False False False False False False False False False False False False Fals
1625 False False False False False False False False False False False False False Fals
1626 False False True False False False False False False False False False False Fals
1627 False False False False False False False False False False False False False Fals
1628 False False False False False False False False False False False False False Fals
In [23]: movies.isnull().sum()
Out[23]: title_x 0
imdb_id 0
poster_path 103
wiki_link 0
title_y 0
original_title 0
is_adult 0
year_of_release 0
runtime 0
genres 0
imdb_rating 0
imdb_votes 0
story 20
summary 0
tagline 1072
actors 5
wins_nominations 922
release_date 107
dtype: int64
In [24]: # duplicated
movies.duplicated().sum()
Out[24]: 0
In [25]: # rename
students
Out[25]:
iq marks package
name
peter 100 80 10
saint 90 70 7
parle 80 50 2
samme 13 11 15
dave 90 80 100
In [26]: students.rename(columns={'marks':'percent','package':'lpa'},inplace=True)
In [ ]: students.drop(columns='name',inplace=True)
Maths Method
In [28]: # sum -> Axis Argument
students.sum(axis=1)
Out[28]: name
peter 190
saint 167
noeum 234
parle 132
samme 39
dave 270
dtype: int64
In [29]: # mean
students.mean()
Out[29]: iq 82.166667
percent 65.166667
lpa 24.666667
dtype: float64
In [30]: students.min(axis=1)
Out[30]: name
peter 10
saint 7
noeum 14
parle 2
samme 11
dave 80
dtype: int64
In [31]: students.var()
Out[31]: iq 1332.166667
percent 968.166667
lpa 1384.666667
dtype: float64
In [33]: type(movies['title_x'])
Out[33]: pandas.core.series.Series
In [35]: type(movies[['year_of_release','actors','title_x']].head(2))
Out[35]: pandas.core.frame.DataFrame
In [36]: ipl[['City','Team1','Team2' ]]
Out[36]:
City Team1 Team2
In [37]: student_dict = {
'name':['peter','saint','noeum','parle','samme','dave'],
'iq':[100,90,120,80,13,90],
'marks':[80,70,100,50,11,80],
'package':[10,7,14,2,15,100]
}
students=pd.DataFrame(student_dict)
students.set_index('name',inplace=True)
In [38]: students
Out[38]:
iq marks package
name
peter 100 80 10
saint 90 70 7
parle 80 50 2
samme 13 11 15
dave 90 80 100
In [39]: # single_row
movies.iloc[1]
Out[40]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult y
Thackeray
9 tt7777196 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Thackeray_(film) Thackeray Thackeray 0
(film)
In [41]: movies.iloc[5:12:2]
Out[41]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_
Thackeray
9 tt7777196 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Thackeray_(film) Thackeray Thackeray 0
(film)
Out[42]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam WonBy
Narendra
2022- Rajasthan Gujarat Modi Rajasthan
0 1312200 Ahmedabad 2022 Final bat N Gujarat Titans Wickets
05-29 Royals Titans Stadium, Royals
Ahmedabad
Wankhede
2022- Sunrisers Punjab Sunrisers
4 1304116 Mumbai 2022 70 Stadium, bat N Punjab Kings Wickets
05-22 Hyderabad Kings Hyderabad
Mumbai
Wankhede
2022- Delhi Mumbai Mumbai Mumbai
5 1304115 Mumbai 2022 69 Stadium, field N Wickets
05-21 Capitals Indians Indians Indians
Mumbai
Out[43]:
iq marks package
name
peter 100 80 10
saint 90 70 7
parle 80 50 2
samme 13 11 15
dave 90 80 100
In [44]: students.loc['parle']
Out[44]: iq 80
marks 50
package 2
Name: parle, dtype: int64
In [45]: students.loc['saint':'samme':2]
Out[45]:
iq marks package
name
saint 90 70 7
parle 80 50 2
Out[46]:
iq marks package
name
saint 90 70 7
dave 90 80 100
In [47]: students.iloc[[0,4,3]]
Out[47]:
iq marks package
name
peter 100 80 10
samme 13 11 15
parle 80 50 2
In [48]: movies.iloc[0:3,0:3]
Out[48]:
title_x imdb_id poster_path
In [49]: movies.loc[0:2,'title_x':'poster_path']
Out[49]:
title_x imdb_id poster_path
Filtering a DataFrame
In [50]: ipl.head(2)
Out[50]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam WonB
Narendra
2022- Rajasthan Gujarat Modi Rajasthan
0 1312200 Ahmedabad 2022 Final bat N Gujarat Titans Wicket
05-29 Royals Titans Stadium, Royals
Ahmedabad
Narendra
Royal
2022- Rajasthan Modi Rajasthan Rajasthan
1 1312199 Ahmedabad 2022 Qualifier 2 Challengers field N Wicket
05-27 Royals Stadium, Royals Royals
Bangalore
Ahmedabad
Out[51]:
Season WinningTeam
Out[52]:
Season WinningTeam
Out[53]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam WonB
Narendra
2022- Rajasthan Gujarat Modi Rajasthan
0 1312200 Ahmedabad 2022 Final bat N Gujarat Titans Wicket
05-29 Royals Titans Stadium, Royals
Ahmedabad
Narendra
Royal
2022- Rajasthan Modi Rajasthan Rajasthan
1 1312199 Ahmedabad 2022 Qualifier 2 Challengers field N Wicket
05-27 Royals Stadium, Royals Royals
Bangalore
Ahmedabad
In [54]: ipl[ipl['SuperOver']=='Y'].shape[0]
Out[54]: 14
Out[55]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam Wo
Maharashtra
Rising
2017- Delhi Cricket Rising Pune Delhi
364 1082599 Pune 2017 9 Pune field N R
04-11 Daredevils Association Supergiant Daredevils
Supergiant
Stadium
In [56]:
ipl[(ipl['City'] == 'Kolkata') & (ipl['WinningTeam'] == 'Chennai Super Kings')].shape[0]
Out[56]: 5
Out[57]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam WonBy Ma
Sheikh
Abu 2020- Delhi Mumbai Delhi Mumbai
168 1216529 2020/21 27 Zayed bat N Wickets
Dhabi 10-11 Capitals Indians Capitals Indians
Stadium
Wankhede
2021- Delhi Rajasthan Rajasthan Rajasthan
127 1254064 Mumbai 2021 7 Stadium, field N Wickets
04-15 Capitals Royals Royals Royals
Mumbai
Out[58]: 51.473684210526315
Out[59]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_of
Junglee
24 (2019 tt7463730 https://upload.wikimedia.org/wikipedia/en/e/e2... https://en.wikipedia.org/wiki/Junglee_(2019_film) Junglee Junglee 0
film)
Hey Hey
390 tt4512230 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Hey_Bro Hey Bro 0
Bro Bro
Out[61]:
title_x imdb_id poster_path wiki_link title_y original_titl
In [62]: movies['country']='India'
movies.sample(2)
Out[62]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_of
Wanted
915 (2009 tt1084972 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Wanted_(2009_film) Wanted Wanted 0
film)
Shaadi Shaadi
Mein Mein Shaadi Mein
241 tt7469726 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Shaadi_Mein_Zaro... 0
Zaroor Zaroor Zaroor Aana
Aana Aana
In [63]: movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title_x 1629 non-null object
1 imdb_id 1629 non-null object
2 poster_path 1526 non-null object
3 wiki_link 1629 non-null object
4 title_y 1629 non-null object
5 original_title 1629 non-null object
6 is_adult 1629 non-null int64
7 year_of_release 1629 non-null int64
8 runtime 1629 non-null object
9 genres 1629 non-null object
10 imdb_rating 1629 non-null float64
11 imdb_votes 1629 non-null int64
12 story 1609 non-null object
13 summary 1629 non-null object
14 tagline 557 non-null object
15 actors 1624 non-null object
16 wins_nominations 707 non-null object
17 release_date 1522 non-null object
18 country 1629 non-null object
dtypes: float64(1), int64(3), object(15)
memory usage: 241.9+ KB
In [68]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int64
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null object
4 MatchNumber 950 non-null object
5 Team1 950 non-null object
6 Team2 950 non-null object
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: float64(1), int64(1), object(18)
memory usage: 148.6+ KB
In [69]: ipl['ID']=ipl['ID'].astype('int32')
In [70]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int32
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null object
4 MatchNumber 950 non-null object
5 Team1 950 non-null object
6 Team2 950 non-null object
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: float64(1), int32(1), object(18)
memory usage: 144.9+ KB
In [72]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int32
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null category
4 MatchNumber 950 non-null object
5 Team1 950 non-null object
6 Team2 950 non-null object
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: category(1), float64(1), int32(1), object(17)
memory usage: 139.0+ KB
In [74]: ipl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int32
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null category
4 MatchNumber 950 non-null object
5 Team1 950 non-null category
6 Team2 950 non-null category
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: category(3), float64(1), int32(1), object(15)
memory usage: 127.4+ KB
Value Counts
Out[143]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 70 14
4 80 70 14
In [76]: marks.value_counts()
In [77]: # find which player has won most potm -> in finals and qualifiers
ipl.sample(2)
Out[77]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam WonBy
Kolkata
2018- Sunrisers Eden Sunrisers Sunrisers
304 1136570 Kolkata 2018 10 Knight field N Wickets
04-14 Hyderabad Gardens Hyderabad Hyderabad
Riders
Himachal
Kings Pradesh
2013- Delhi Delhi Kings XI
561 598028 Dharamsala 2013 67 XI Cricket field N Runs
05-16 Daredevils Daredevils Punjab
Punjab Association
Stadium
Out[79]: <AxesSubplot:ylabel='TossDecision'>
Sort values
In [81]: x = pd.Series([12,14,1,56,89])
x
Out[81]: 0 12
1 14
2 1
3 56
4 89
dtype: int64
In [82]: x.sort_values(ascending=True)
Out[82]: 2 1
0 12
1 14
3 56
4 89
dtype: int64
In [83]: movies.sample(2)
Out[83]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_of_r
Hope Hope
Hope Aur
107 Aur tt8324474 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Hope_Aur_Hum Aur 0
Hum
Hum Hum
Tere Tere
Naal Naal Tere Naal
666 Love tt2130242 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Tere_Naal_Love_H... Love Love Ho 0
Ho Ho Gaya
Gaya Gaya
Out[84]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult y
Zor Lagaa
Zor Lagaa Zor Lagaa
939 tt1479857 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Zor_Lagaa_Ke...H... Ke... 0
Ke...Haiya! Ke... Haiya!
Haiya!
Zindagi
Zindagi Zindagi Tere
670 tt2164702 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Zindagi_Tere_Naam Tere 0
Tere Naam Naam
Naam
1971
1039 tt0983990 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/1971_(2007_film) 1971 1971 0
(2007 film)
1920: The
1920: Evil 1920: Evil
723 Evil tt2222550 https://upload.wikimedia.org/wikipedia/en/e/e7... https://en.wikipedia.org/wiki/1920:_The_Evil_R... 0
Returns Returns
Returns
1920: 1920
287 tt5638500 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/1920_London 1920 London 0
London London
16
1498 December tt0313844 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/16_December_(film) 16-Dec 16-Dec 0
(film)
Out[85]:
name college branch cgpa package
In [87]: students
Out[87]:
name college branch cgpa package
Yeh Yeh
Yeh Zindagi
1625 Zindagi tt0298607 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S... Zindagi 0
Ka Safar
Ka Safar Ka Safar
rank(Series)
In [89]: batsman=pd.read_csv("batsman_runs_ipl.csv")
In [90]: batsman.head(2)
Out[90]:
batter batsman_run
1 A Badoni 161
In [91]: batsman['batsman_run'].rank(ascending=False)
Out[91]: 0 166.5
1 226.0
2 535.0
3 329.0
4 402.5
...
600 594.0
601 343.0
602 547.5
603 27.0
604 256.0
Name: batsman_run, Length: 605, dtype: float64
Out[92]:
batter batsman_run batsman_rank
In [93]: marks = {
'maths':67,
'english':57,
'science':89,
'hindi':100
}
marks_series = pd.Series(marks)
marks_series
Out[93]: maths 67
english 57
science 89
hindi 100
dtype: int64
In [94]: marks_series.sort_index()
Out[94]: english 57
hindi 100
maths 67
science 89
dtype: int64
In [95]: movies.sort_index(ascending=False)
Out[95]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult
Sabse Sabse
Sabse Bada
1626 Bada tt0069204 NaN https://en.wikipedia.org/wiki/Sabse_Bada_Sukh Bada 0
Sukh
Sukh Sukh
Yeh Yeh
Yeh Zindagi
1625 Zindagi tt0298607 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S... Zindagi 0
Ka Safar
Ka Safar Ka Safar
Why Why
Why Cheat
3 Cheat tt8108208 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Why_Cheat_India Cheat 0
India
India India
The
The The
Accidental
Accidental Accidental
2 Prime tt6986710 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/The_Accidental_P... 0
Prime Prime
Minister
Minister Minister
(film)
Battalion Battalion
1 tt9472208 NaN https://en.wikipedia.org/wiki/Battalion_609 Battalion 609 0
609 609
In [140]: # This code drops two columns from the 'batsman' dataframe: 'level_0' and 'index'.
# The 'inplace=True' parameter ensures that the original dataframe is modified instead of creating a new one.
batsman.drop(['index'], axis=1, inplace=True)
In [102]: batsman
Out[102]:
batter batsman_run batsman_rank
2 A Chandila 4 535.0
3 A Chopra 53 329.0
4 A Choudhary 25 402.5
Out[103]:
index batter batsman_run
batsman_rank
535.0 2 A Chandila 4
329.0 3 A Chopra 53
402.5 4 A Choudhary 25
In [104]: batsman
Out[104]:
batter batsman_run batsman_rank
2 A Chandila 4 535.0
3 A Chopra 53 329.0
4 A Choudhary 25 402.5
Out[105]:
index 0
0 maths 67
1 english 57
2 science 89
3 hindi 100
In [106]: type(marks_series.reset_index())
Out[106]: pandas.core.frame.DataFrame
In [107]: movies.set_index('title_x',inplace=True)
In [108]: movies
Out[108]:
imdb_id poster_path wiki_link title_y original_title is_adult year_
title_x
Battalion Battalion
tt9472208 NaN https://en.wikipedia.org/wiki/Battalion_609 Battalion 609 0
609 609
The
The The
Accidental
Accidental Accidental
Prime tt6986710 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/The_Accidental_P... 0
Prime Prime
Minister
Minister Minister
(film)
Why Why
Why Cheat
Cheat tt8108208 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Why_Cheat_India Cheat 0
India
India India
Yeh Yeh
Yeh Zindagi
Zindagi tt0298607 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S... Zindagi 0
Ka Safar
Ka Safar Ka Safar
Sabse Sabse
Sabse Bada
Bada tt0069204 NaN https://en.wikipedia.org/wiki/Sabse_Bada_Sukh Bada 0
Sukh
Sukh Sukh
In [110]: movies
Out[110]:
imdb link wiki_link title_y original_title is_adult year_
title_x
Battalion Battalion
tt9472208 NaN https://en.wikipedia.org/wiki/Battalion_609 Battalion 609 0
609 609
The
The The
Accidental
Accidental Accidental
Prime tt6986710 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/The_Accidental_P... 0
Prime Prime
Minister
Minister Minister
(film)
Why Why
Why Cheat
Cheat tt8108208 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Why_Cheat_India Cheat 0
India
India India
Yeh Yeh
Yeh Zindagi
Zindagi tt0298607 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S... Zindagi 0
Ka Safar
Ka Safar Ka Safar
Sabse Sabse
Sabse Bada
Bada tt0069204 NaN https://en.wikipedia.org/wiki/Sabse_Bada_Sukh Bada 0
Sukh
Sukh Sukh
Out[111]:
imdb link wiki_link title_y original_title is_adult year_
title_x
Battalion Battalion
tt9472208 NaN https://en.wikipedia.org/wiki/Battalion_609 Battalion 609 0
609 609
The
The The
Accidental
Accidental Accidental
Prime tt6986710 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/The_Accidental_P... 0
Prime Prime
Minister
Minister Minister
(film)
Why Why
Why Cheat
Cheat tt8108208 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Why_Cheat_India Cheat 0
India
India India
Yeh Yeh
Yeh Zindagi
Zindagi tt0298607 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S... Zindagi 0
Ka Safar
Ka Safar Ka Safar
Sabse Sabse
Sabse Bada
Bada tt0069204 NaN https://en.wikipedia.org/wiki/Sabse_Bada_Sukh Bada 0
Sukh
Sukh Sukh
unique
In [112]: # unique(series)
temp = pd.Series([1,1,2,2,3,3,4,4,5,5,np.nan,np.nan])
print(temp)
temp.unique()
0 1.0
1 1.0
2 2.0
3 2.0
4 3.0
5 3.0
6 4.0
7 4.0
8 5.0
9 5.0
10 NaN
11 NaN
dtype: float64
In [113]: ipl['Season'].unique()
Out[113]: ['2022', '2021', '2020/21', '2019', '2018', ..., '2012', '2011', '2009/10', '2009', '2007/08']
Length: 15
Categories (15, object): ['2007/08', '2009', '2009/10', '2011', ..., '2019', '2020/21', '2021', '2022']
nunique : returns the number of unique elements in a pandas Series or DataFrame. It doesn't count the missing values (NaNs) by default. If you
want to count the missing values, you can set the argument "dropna" to False.
unique: returns the unique elements in a pandas Series or DataFrame. It counts the missing values (NaNs) by default. If you don't want to count
the missing values, you can use the "dropna" argument and set it to True.
In [114]: len(ipl['Season'].unique())
Out[114]: 15
In [115]: # nunique(series + dataframe) -> does not count nan -> dropna parameter
ipl['Season'].nunique()
Out[115]: 15
isnull(series + dataframe)
In [116]: students
Out[116]:
name college branch cgpa package
In [117]: students['name'].isnull()
Out[117]: 0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 True
8 False
9 True
Name: name, dtype: bool
Out[118]: 0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 False
8 True
9 False
Name: name, dtype: bool
In [119]: students['name'][students['name'].notnull()]
Out[119]: 0 nitish
1 ankit
2 rupesh
4 mrityunjay
6 rishabh
8 aditya
Name: name, dtype: object
In [120]: # hasnans(series)
students['college'].hasnans
Out[120]: True
In [121]: students.isnull()
Out[121]:
name college branch cgpa package
In [122]: students.notnull()
Out[122]:
name college branch cgpa package
Out[123]: 0 nitish
1 ankit
2 rupesh
4 mrityunjay
6 rishabh
8 aditya
Name: name, dtype: object
In [124]: students.dropna(how='any')
Out[124]:
name college branch cgpa package
In [125]: students.dropna(how='all')
Out[125]:
name college branch cgpa package
In [126]: students.dropna(subset=['name'])
Out[126]:
name college branch cgpa package
In [127]: students.dropna(subset=['name','college'])
Out[127]:
name college branch cgpa package
In [128]: students
Out[128]:
name college branch cgpa package
In [129]: students['name'].fillna('unknown')
Out[129]: 0 nitish
1 ankit
2 rupesh
3 unknown
4 mrityunjay
5 unknown
6 rishabh
7 unknown
8 aditya
9 unknown
Name: name, dtype: object
In [130]: students.fillna('0')
Out[130]:
name college branch cgpa package
3 0 0 0 0 0
9 0 git 0 0 0
In [131]:
students['package'].fillna(students['package'].mean())
Out[131]: 0 4.000000
1 5.000000
2 6.000000
3 6.428571
4 6.000000
5 7.000000
6 8.000000
7 9.000000
8 6.428571
9 6.428571
Name: package, dtype: float64
Out[132]: 0 nitish
1 ankit
2 rupesh
3 rupesh
4 mrityunjay
5 mrityunjay
6 rishabh
7 rishabh
8 aditya
9 aditya
Name: name, dtype: object
drop_duplicates
In [135]: marks
In [144]: marks
Out[144]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 70 14
4 80 70 14
In [145]: marks.drop_duplicates()
Out[145]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
3 80 70 14
In [146]: marks.drop_duplicates(keep='last')
Out[146]:
iq marks package
0 100 80 10
1 90 70 7
2 120 100 14
4 80 70 14
Out[148]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision SuperOver WinningTeam WonBy M
Chennai Brabourne
2022- Rajasthan Chennai Rajasthan
6 1304114 Mumbai 2022 68 Super Stadium, bat N Wickets
05-20 Royals Super Kings Royals
Kings Mumbai
Holkar
2018- Rajasthan Kings XI Kings XI Kings XI
276 1136598 Indore 2018 38 Cricket field N Wickets
05-06 Royals Punjab Punjab Punjab
Stadium
Out[149]:
enue TossWinner TossDecision ... WinningTeam WonBy Margin method Player_of_Match Team1Players Team2Players Umpire1 Umpire2
['MEK
MA ['SR Watson',
Hussey', 'M
aram Rajasthan Chennai 'R Dravid', 'AL
bat ... Wickets 8.0 NaN MEK Hussey Vijay', 'SK SS Hazare RB Tiffin
ium, Royals Super Kings Menaria', 'J
Raina', 'JA
pauk Bot...
Mork...
In [165]: ipl.sample(2)
Out[165]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision ... Margin method Player_of_Matc
Punjab
Cricket
2008- Kings XI Mumbai Mumbai
940 335991 Chandigarh 2007/08 10 Association field ... 66.0 NaN KC Sangakka
04-25 Punjab Indians Indians
Stadium,
Mohali
2 rows × 23 columns
Out[166]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision ... Margin method Player_of_Match T
Royal Arun
2019- Delhi Delhi
208 1178421 Delhi 2019 46 Challengers Jaitley bat ... 16.0 NaN S Dhawan
04-28 Capitals Capitals
Bangalore Stadium
Royal Feroz
2013- Delhi Delhi
571 598054 Delhi 2013 57 Challengers Shah field ... 4.0 NaN JD Unadkat
05-10 Daredevils Daredevils
Bangalore Kotla
Royal Feroz
2012- Delhi Delhi '
638 548372 Delhi 2012 67 Challengers Shah field ... 21.0 NaN CH Gayle
05-17 Daredevils Daredevils
Bangalore Kotla
Royal Feroz
2010- Delhi Delhi
801 419140 Delhi 2009/10 35 Challengers Shah bat ... 37.0 NaN PD Collingwood
04-04 Daredevils Daredevils
Bangalore Kotla
9 rows × 23 columns
Out[168]:
ID City Date Season MatchNumber Team1 Team2 Venue TossWinner TossDecision ... Margin method Player_of_Match Tea
['P
Royal Arun
2019- Delhi Delhi D
208 1178421 Delhi 2019 46 Challengers Jaitley bat ... 16.0 NaN S Dhawan
04-28 Capitals Capitals
Bangalore Stadium
1 rows × 23 columns
drop(series + dataframe)
In [169]: # Series
temp = pd.Series([10,2,3,16,45,78,10])
temp
Out[169]: 0 10
1 2
2 3
3 16
4 45
5 78
6 10
dtype: int64
In [170]: temp.drop(index=[0,6])
Out[170]: 1 2
2 3
3 16
4 45
5 78
dtype: int64
In [171]: students
Out[171]:
name college branch cgpa package
Out[172]:
name college package
Out[173]:
name college branch cgpa package
In [174]: students.set_index('name')
Out[174]:
college branch cgpa package
name
Out[175]:
college branch cgpa package
name
apply(series + dataframe)
In [176]: # series
temp = pd.Series([10,20,30,40,50])
temp
Out[176]: 0 10
1 20
2 30
3 40
4 50
dtype: int64
In [178]: temp.apply(sigmoid)
Out[178]: 0 1.000045
1 1.000000
2 1.000000
3 1.000000
4 1.000000
dtype: float64
Out[179]:
1st point 2nd point
0 (3, 4) (-3, 4)
1 (-6, 5) (0, 0)
2 (0, 0) (2, 2)
4 (4, 5) (1, 1)
Out[182]: 0 6.000000
1 7.810250
2 2.828427
3 21.931712
4 5.000000
dtype: float64
Out[184]:
1st point 2nd point distance
In [189]: movies.head(1)
Out[189]:
Series_Title Released_Year Runtime Genre IMDB_Rating Director Star1 No_of_Votes Gross Metascore
0 The Shawshank Redemption 1994 142 Drama 9.3 Frank Darabont Tim Robbins 2343110 28341469.0 80.0
In [190]: movies.groupby('Genre')
Out[196]:
Runtime IMDB_Rating No_of_Votes Gross Metascore
Genre
In [197]: generes.mean()
Out[197]:
Runtime IMDB_Rating No_of_Votes Gross Metascore
Genre
In [198]: generes.min()
Out[198]:
Series_Title Released_Year Runtime IMDB_Rating Director Star1 No_of_Votes Gross Metascore
Genre
Abhishek
Action 300 1924 45 7.6 Aamir Khan 25312 3296.0 33.0
Chaubey
Adventure 2001: A Space Odyssey 1925 88 7.6 Akira Kurosawa Aamir Khan 29999 61001.0 41.0
Animation Akira 1940 71 7.6 Adam Elliot Adrian Molina 25229 128985.0 61.0
Biography 12 Years a Slave 1928 93 7.6 Adam McKay Adrien Brody 27254 21877.0 48.0
Alejandro G.
Comedy (500) Days of Summer 1921 68 7.6 Aamir Khan 26337 1305.0 45.0
Iñárritu
Crime 12 Angry Men 1931 80 7.6 Akira Kurosawa Ajay Devgn 27712 6013.0 47.0
Drama 1917 1925 64 7.6 Aamir Khan Abhay Deol 25088 3600.0 28.0
Film-Noir Shadow of a Doubt 1941 100 7.8 Alfred Hitchcock Humphrey Bogart 59556 449191.0 94.0
Alejandro
Horror Alien 1933 71 7.6 Anthony Perkins 27007 89029.0 46.0
Amenábar
Bernard-Pierre
Mystery Dark City 1938 96 7.6 Alex Proyas 33982 1035953.0 52.0
Donnadieu
Thriller Wait Until Dark 1967 108 7.8 Terence Young Audrey Hepburn 27733 17550741.0 81.0
Il buono, il brutto, il
Western 1965 132 7.8 Clint Eastwood Clint Eastwood 65659 5321508.0 69.0
cattivo
Out[206]: Genre
Drama 3.540997e+10
Action 3.263226e+10
Comedy 1.566387e+10
Name: Gross, dtype: float64
Out[212]: Genre
Thriller 1.755074e+07
Western 5.822151e+07
Film-Noir 1.259105e+08
Family 4.391106e+08
Fantasy 7.827267e+08
Horror 1.034649e+09
Mystery 1.256417e+09
Biography 8.276358e+09
Crime 8.452632e+09
Adventure 9.496922e+09
Animation 1.463147e+10
Comedy 1.566387e+10
Action 3.263226e+10
Drama 3.540997e+10
Name: Gross, dtype: float64
Out[214]: Genre
Drama 3.540997e+10
Action 3.263226e+10
Comedy 1.566387e+10
Name: Gross, dtype: float64
Out[217]: Genre
Western 8.35
Name: IMDB_Rating, dtype: float64
Pandas groupby
groupby in pandas is a function that lets you group data in a DataFrame based on specific
criteria, and then apply aggregate functions to each group. It's a powerful tool for data analysis
that allows you to quickly and easily calculate summary statistics for your data.
The
Frank Tim
0 Shawshank 1994 142 Drama 9.3 2343110
Darabont Robbins
Redemption
Francis
The Marlon
1 1972 175 Crime 9.2 Ford 1620367
Godfather Brando
Coppola
In [3]: movies.groupby('Genre')
Genre
Out[6]: Genre
Thriller 1.755074e+07
Western 5.822151e+07
Film-Noir 1.259105e+08
Family 4.391106e+08
Fantasy 7.827267e+08
Horror 1.034649e+09
Mystery 1.256417e+09
Biography 8.276358e+09
Crime 8.452632e+09
Adventure 9.496922e+09
Animation 1.463147e+10
Comedy 1.566387e+10
Action 3.263226e+10
Drama 3.540997e+10
Name: Gross, dtype: float64
Out[7]: Genre
Western 8.35
Name: IMDB_Rating, dtype: float64
Out[8]: Director
Christopher Nolan 11578345
Name: No_of_Votes, dtype: int64
The
Frank Tim
0 Shawshank 1994 142 Drama 9.3 2343110 2
Darabont Robbins
Redemption
Out[10]: Star1
Tom Hanks 12
Robert De Niro 11
Clint Eastwood 10
Al Pacino 10
Leonardo DiCaprio 9
..
Glen Hansard 1
Giuseppe Battiston 1
Giulietta Masina 1
Gerardo Taracena 1
Ömer Faruk Sorak 1
Name: Series_Title, Length: 660, dtype: int64
Out[12]: 14
Out[13]: 14
Out[14]: Genre
Action 172
Adventure 72
Animation 82
Biography 88
Comedy 155
Crime 107
Drama 289
Family 2
Fantasy 2
Film-Noir 3
Horror 11
Mystery 12
Thriller 1
Western 4
dtype: int64
Genre
Star Wars:
Episode V - Irvin Mark
Action 1980 124 8.7 115931
The Empire Kershner Hamill
Strikes Back
Andrew
Animation WALL·E 2008 98 8.4 Ben Burtt 99979
Stanton
Mel
Biography Braveheart 1995 178 8.3 Mel Gibson 95918
Gibson
David Morgan
Crime Se7en 1995 127 8.6 144509
Fincher Freeman
It's a
Frank James
Drama Wonderful 1946 130 8.6 40580
Capra Stewart
Life
Jordan Daniel
Horror Get Out 2017 104 7.7 49285
Peele Kaluuya
Joseph L. Laurence
Mystery Sleuth 1972 138 8.0 4474
Mankiewicz Olivier
Alfred Anthony
49 Psycho 1960 109 Horror 8.5 6042
Hitchcock Perkins
Ridley Sigourney
75 Alien 1979 117 Horror 8.4 7878
Scott Weaver
John Kurt
271 The Thing 1982 109 Horror 8.1 3712
Carpenter Russell
William Ellen
419 The Exorcist 1973 122 Horror 8.0 3623
Friedkin Burstyn
Jordan Daniel
724 Get Out 2017 104 Horror 7.7 4928
Peele Kaluuya
John Donald
844 Halloween 1978 91 Horror 7.7 2331
Carpenter Pleasence
The
James Claude
876 Invisible 1933 71 Horror 7.7 306
Whale Rains
Man
James Cary
932 Saw 2004 103 Horror 7.6 3790
Wan Elwes
Alejandro Nicole
948 The Others 2001 101 Horror 7.6 3376
Amenábar Kidman
In [20]: movies.groupby('Genre').get_group('Fantasy')
Das Cabinet
Robert Werner
321 des Dr. 1920 76 Fantasy 8.1 57428
Wiene Krauss
Caligari
F.W. Max
568 Nosferatu 1922 94 Fantasy 7.9 88794
Murnau Schreck
Das Cabinet
Robert Werner
321 des Dr. 1920 76 Fantasy 8.1 57428
Wiene Krauss
Caligari
F.W. Max
568 Nosferatu 1922 94 Fantasy 7.9 88794
Murnau Schreck
In [24]: # groups
movies.groupby('Genre').groups ---> #Dictionary= gives index postion
Out[24]: {'Action': [2, 5, 8, 10, 13, 14, 16, 29, 30, 31, 39, 42, 44, 55, 57, 59, 60,
63, 68, 72, 106, 109, 129, 130, 134, 140, 142, 144, 152, 155, 160, 161, 166,
168, 171, 172, 177, 181, 194, 201, 202, 216, 217, 223, 224, 236, 241, 262, 27
5, 294, 308, 320, 325, 326, 331, 337, 339, 340, 343, 345, 348, 351, 353, 356,
357, 362, 368, 369, 375, 376, 390, 410, 431, 436, 473, 477, 479, 482, 488, 49
3, 496, 502, 507, 511, 532, 535, 540, 543, 564, 569, 570, 573, 577, 582, 583,
602, 605, 608, 615, 623, ...], 'Adventure': [21, 47, 93, 110, 114, 116, 118,
137, 178, 179, 191, 193, 209, 226, 231, 247, 267, 273, 281, 300, 301, 304, 30
6, 323, 329, 361, 366, 377, 402, 406, 415, 426, 458, 470, 497, 498, 506, 513,
514, 537, 549, 552, 553, 566, 576, 604, 609, 618, 638, 647, 675, 681, 686, 69
2, 711, 713, 739, 755, 781, 797, 798, 851, 873, 884, 912, 919, 947, 957, 964,
966, 984, 991], 'Animation': [23, 43, 46, 56, 58, 61, 66, 70, 101, 135, 146,
151, 158, 170, 197, 205, 211, 213, 219, 229, 230, 242, 245, 246, 270, 330, 33
2, 358, 367, 378, 386, 389, 394, 395, 399, 401, 405, 409, 469, 499, 510, 516,
518, 522, 578, 586, 592, 595, 596, 599, 633, 640, 643, 651, 665, 672, 694, 72
8, 740, 741, 744, 756, 758, 761, 771, 783, 796, 799, 822, 828, 843, 875, 891,
892, 902, 906, 920, 956, 971, 976, 986, 992], 'Biography': [7, 15, 18, 35, 3
8, 54, 102, 107, 131, 139, 147, 157, 159, 173, 176, 212, 215, 218, 228, 235,
243, 263, 276, 282, 290, 298, 317, 328, 338, 342, 346, 359, 360, 365, 372, 37
3, 385, 411, 416, 418, 424, 429, 484, 525, 536, 542, 545, 575, 579, 587, 600,
606, 614, 622, 632, 635, 644, 649, 650, 657, 671, 673, 684, 729, 748, 753, 75
7, 759, 766, 770, 779, 809, 810, 815, 820, 831, 849, 858, 877, 882, 897, 910,
915, 923, 940, 949, 952, 987], 'Comedy': [19, 26, 51, 52, 64, 78, 83, 95, 96,
112, 117, 120, 127, 128, 132, 153, 169, 183, 192, 204, 207, 208, 214, 221, 23
3, 238, 240, 250, 251, 252, 256, 261, 266, 277, 284, 311, 313, 316, 318, 322,
327, 374, 379, 381, 392, 396, 403, 413, 414, 417, 427, 435, 445, 446, 449, 45
5, 459, 460, 463, 464, 466, 471, 472, 475, 481, 490, 494, 500, 503, 509, 526,
528, 530, 531, 533, 538, 539, 541, 547, 557, 558, 562, 563, 565, 574, 591, 59
3, 594, 598, 613, 626, 630, 660, 662, 667, 679, 680, 683, 687, 701, ...], 'Cr
ime': [1, 3, 4, 6, 22, 25, 27, 28, 33, 37, 41, 71, 77, 79, 86, 87, 103, 108,
111, 113, 123, 125, 133, 136, 162, 163, 164, 165, 180, 186, 187, 189, 198, 22
2, 232, 239, 255, 257, 287, 288, 299, 305, 335, 363, 364, 380, 384, 397, 437,
438, 441, 442, 444, 450, 451, 465, 474, 480, 485, 487, 505, 512, 519, 520, 52
3, 527, 546, 556, 560, 584, 597, 603, 607, 611, 621, 639, 653, 664, 669, 676,
695, 708, 723, 762, 763, 767, 775, 791, 795, 802, 811, 823, 827, 833, 885, 89
5, 921, 922, 926, 938, ...], 'Drama': [0, 9, 11, 17, 20, 24, 32, 34, 36, 40,
45, 50, 53, 62, 65, 67, 73, 74, 76, 80, 82, 84, 85, 88, 89, 90, 91, 92, 94, 9
7, 98, 99, 100, 104, 105, 121, 122, 124, 126, 138, 141, 143, 148, 149, 150, 1
54, 156, 167, 174, 175, 182, 184, 185, 188, 190, 195, 196, 199, 200, 203, 20
6, 210, 225, 227, 234, 237, 244, 248, 249, 253, 254, 258, 259, 260, 264, 265,
268, 269, 272, 274, 278, 279, 280, 283, 285, 286, 289, 291, 292, 293, 295, 29
6, 297, 302, 303, 307, 310, 312, 314, 315, ...], 'Family': [688, 698], 'Fanta
sy': [321, 568], 'Film-Noir': [309, 456, 712], 'Horror': [49, 75, 271, 419, 5
44, 707, 724, 844, 876, 932, 948], 'Mystery': [69, 81, 119, 145, 220, 393, 42
0, 714, 829, 899, 959, 961], 'Thriller': [700], 'Western': [12, 48, 115, 69
1]}
In [25]: # describe
movies.groupby('Genre').describe()
count mean std min 25% 50% 75% max count mean .
Genre
Action 172.0 129.046512 28.500706 45.0 110.75 127.5 143.25 321.0 172.0 7.949419 .
Adventure 72.0 134.111111 33.317320 88.0 109.00 127.0 149.00 228.0 72.0 7.937500 .
Animation 82.0 99.585366 14.530471 71.0 90.00 99.5 106.75 137.0 82.0 7.930488 .
Biography 88.0 136.022727 25.514466 93.0 120.00 129.0 146.25 209.0 88.0 7.938636 .
Comedy 155.0 112.129032 22.946213 68.0 96.00 106.0 124.50 188.0 155.0 7.901290 .
Crime 107.0 126.392523 27.689231 80.0 106.50 122.0 141.50 229.0 107.0 8.016822 .
Drama 289.0 124.737024 27.740490 64.0 105.00 121.0 137.00 242.0 289.0 7.957439 .
Family 2.0 107.500000 10.606602 100.0 103.75 107.5 111.25 115.0 2.0 7.800000 .
Fantasy 2.0 85.000000 12.727922 76.0 80.50 85.0 89.50 94.0 2.0 8.000000 .
Film-Noir 3.0 104.000000 4.000000 100.0 102.00 104.0 106.00 108.0 3.0 7.966667 .
Horror 11.0 102.090909 13.604812 71.0 98.00 103.0 109.00 122.0 11.0 7.909091 .
Mystery 12.0 119.083333 14.475423 96.0 110.75 117.5 130.25 138.0 12.0 7.975000 .
Thriller 1.0 108.000000 NaN 108.0 108.00 108.0 108.00 108.0 1.0 7.800000 .
Western 4.0 148.250000 17.153717 132.0 134.25 148.0 162.00 165.0 4.0 8.350000 .
14 rows × 40 columns
In [30]: # sample
movies.groupby('Genre').sample(2,replace=True)
Raoul James
308 White Heat 1949 114 Action 8.1
Walsh Cagney
John
171 Die Hard 1988 132 Action 8.2 Bruce Willis
McTiernan
William Charlton
300 Ben-Hur 1959 212 Adventure 8.1
Wyler Heston
South Park:
Bigger, Trey
799 1999 81 Animation 7.7 Trey Parker
Longer & Parker
Uncut
The World's
Roger Anthony
632 Fastest 2005 127 Biography 7.8
Donaldson Hopkins
Indian
Alejandro Javier
372 Mar adentro 2014 126 Biography 8.0
Amenábar Bardem
David
660 The Sandlot 1993 101 Comedy 7.8 Mickey Tom Guiry
Evans
Brian De
108 Scarface 1983 170 Crime 8.3 Al Pacino
Palma
Nadine Zain Al
53 Capharnaüm 2018 126 Drama 8.4
Labaki Rafeea
Ryan Michael B.
894 Creed 2015 133 Drama 7.6
Coogler Jordan
E.T. the
Steven Henry
688 Extra- 1982 115 Family 7.8
Spielberg Thomas
Terrestrial
E.T. the
Steven Henry
688 Extra- 1982 115 Family 7.8
Spielberg Thomas
Terrestrial
F.W. Max
568 Nosferatu 1922 94 Fantasy 7.9
Murnau Schreck
Das Cabinet
Robert Werner
321 des Dr. 1920 76 Fantasy 8.1
Wiene Krauss
Caligari
Alfred Anthony
49 Psycho 1960 109 Horror 8.5
Hitchcock Perkins
Alejandro Nicole
948 The Others 2001 101 Horror 7.6
Amenábar Kidman
Alex Rufus
959 Dark City 1998 100 Mystery 7.6
Proyas Sewell
Once Upon
Sergio Henry
48 a Time in the 1968 165 Western 8.5
Leone Fonda
West
In [31]: # nunique
movies.groupby('Genre').nunique() # unique --> unique items , nunique--> gives
Genre
Adventure 72 49 58 10 59 59 72
Animation 82 35 41 11 51 77 82
Biography 88 44 56 13 76 72 88
Family 2 2 2 1 2 2 2
Fantasy 2 2 2 2 2 2 2
Film-Noir 3 3 3 3 3 3 3
Horror 11 11 10 8 10 11 11
Mystery 12 11 10 8 10 11 12
Thriller 1 1 1 1 1 1 1
Western 4 4 4 4 2 2 4
aggregate method
In [33]:
# passing dict
movies.groupby('Genre').agg(
{
'Runtime':'mean',
'IMDB_Rating':'mean',
'No_of_Votes':'sum',
'Gross':'sum',
'Metascore':'min'
}
)
Genre
min max mean sum median min max mean sum median ...
Genre
Action 45 321 129.046512 22196 127.5 7.6 9.0 7.949419 1367.3 7.9 ...
Adventure 88 228 134.111111 9656 127.0 7.6 8.6 7.937500 571.5 7.9 ...
Animation 71 137 99.585366 8166 99.5 7.6 8.6 7.930488 650.3 7.9 ...
Biography 93 209 136.022727 11970 129.0 7.6 8.9 7.938636 698.6 7.9 ...
Comedy 68 188 112.129032 17380 106.0 7.6 8.6 7.901290 1224.7 7.9 ...
Crime 80 229 126.392523 13524 122.0 7.6 9.2 8.016822 857.8 8.0 ...
Drama 64 242 124.737024 36049 121.0 7.6 9.3 7.957439 2299.7 8.0 ...
Family 100 115 107.500000 215 107.5 7.8 7.8 7.800000 15.6 7.8 ... 40
Fantasy 76 94 85.000000 170 85.0 7.9 8.1 8.000000 16.0 8.0 ... 3375
Film-Noir 100 108 104.000000 312 104.0 7.8 8.1 7.966667 23.9 8.0 ... 4
Horror 71 122 102.090909 1123 103.0 7.6 8.5 7.909091 87.0 7.8 ...
Mystery 96 138 119.083333 1429 117.5 7.6 8.4 7.975000 95.7 8.0 ... 10
Thriller 108 108 108.000000 108 108.0 7.8 7.8 7.800000 7.8 7.8 ... 175
Western 132 165 148.250000 593 148.0 7.8 8.8 8.350000 33.4 8.4 ... 53
14 rows × 25 columns
Genre
Christopher Matthew
21 Interstellar 2014 169 Adventure 8.6
Nolan McConaughey
Sen to Chihiro
Hayao Daveigh
23 no 2001 125 Animation 8.6
Miyazaki Chase
kamikakushi
Schindler's Steven
7 1993 195 Biography 8.9 Liam Neeson
List Spielberg
Bong Joon
19 Gisaengchung 2019 132 Comedy 8.6 Kang-ho Song
Ho
Roberto Roberto
26 La vita è bella 1997 116 Comedy 8.6
Benigni Benigni
Francis
Marlon
1 The Godfather 1972 175 Crime 9.2 Ford
Brando
Coppola
The
Frank
0 Shawshank 1994 142 Drama 9.3 Tim Robbins
Darabont
Redemption
Das Cabinet
Robert Werner
321 des Dr. 1920 76 Fantasy 8.1
Wiene Krauss
Caligari
The Third
309 1949 104 Film-Noir 8.1 Carol Reed Orson Welles
Man
Alfred Anthony
49 Psycho 1960 109 Horror 8.5
Hitchcock Perkins
Christopher
69 Memento 2000 113 Mystery 8.4 Guy Pearce
Nolan
Alfred James
81 Rear Window 1954 112 Mystery 8.4
Hitchcock Stewart
Il buono, il
Sergio Clint
12 brutto, il 1966 161 Western 8.8
Leone Eastwood
cattivo
Genre
Abhishek Aamir
Action 300 1924 45 Action 7.6
Chaubey Khan
2001: A
Akira Aamir
Adventure Space 1925 88 Adventure 7.6
Kurosawa Khan
Odyssey
Adam Adrian
Animation Akira 1940 71 Animation 7.6
Elliot Molina
Aamir Abhay
Drama 1917 1925 64 Drama 7.6
Khan Deol
E.T. the
Gene
Family Extra- 1971 100 Family 7.8 Mel Stuart
Wilder
Terrestrial
Das Cabinet
F.W. Max
Fantasy des Dr. 1920 76 Fantasy 7.9
Murnau Schreck
Caligari
Alejandro Anthony
Horror Alien 1933 71 Horror 7.6
Amenábar Perkins
Bernard-
Alex
Mystery Dark City 1938 96 Mystery 7.6 Pierre
Proyas
Donnadieu
Il buono, il
Clint Clint
Western brutto, il 1965 132 Western 7.8
Eastwood Eastwood
cattivo
In [44]: movies.groupby('Genre').apply(foo)
In [50]: movies.groupby('Genre').apply(foo)
Out[50]: Genre
Action 10
Adventure 2
Animation 2
Biography 9
Comedy 14
Crime 4
Drama 21
Family 0
Fantasy 0
Film-Noir 0
Horror 1
Mystery 0
Thriller 0
Western 0
dtype: int64
In [51]: # find ranking of each movie in the group according to IMDB score
def rank_movie(group):
group['genre_rank']=group['IMDB_Rating'].rank(ascending=False)
return group
In [52]: movies.groupby('Genre').apply(rank_movie)
The
Frank Tim
0 Shawshank 1994 142 Drama 9.3 234
Darabont Robbins
Redemption
Francis
The Marlon
1 1972 175 Crime 9.2 Ford 162
Godfather Brando
Coppola
The Francis
3 Godfather: 1974 202 Crime 9.0 Ford Al Pacino 112
Part II Coppola
George Elizabeth
996 Giant 1956 201 Drama 7.6 3
Stevens Taylor
Alfred Tallulah
998 Lifeboat 1944 97 Drama 7.6 2
Hitchcock Bankhead
The
Frank Tim
0 Shawshank 1994 142 Drama 9.3 234
Darabont Robbins
Redemption
Francis
The Marlon
1 1972 175 Crime 9.2 Ford 162
Godfather Brando
Coppola
The Francis
3 Godfather: 1974 202 Crime 9.0 Ford Al Pacino 112
Part II Coppola
George Elizabeth
996 Giant 1956 201 Drama 7.6 3
Stevens Taylor
Alfred Tallulah
998 Lifeboat 1944 97 Drama 7.6 2
Hitchcock Bankhead
Director Star1
Aamir Amole
165 165 165.0 8.4 8.4 8.4 168895 168895 168895.0 1223869
Khan Gupte
Aaron Eddie
129 129 129.0 7.8 7.8 7.8 89896 89896 89896.0 853090410
Sorkin Redmayne
Abdellatif Léa
180 180 180.0 7.7 7.7 7.7 138741 138741 138741.0 2199675
Kechiche Seydoux
Abhishek Shahid
148 148 148.0 7.8 7.8 7.8 27175 27175 27175.0 218428303
Chaubey Kapoor
Abhishek
Amit Sadh 130 130 130.0 7.7 7.7 7.7 32628 32628 32628.0 1122527
Kapoor
... ... ... ... ... ... ... ... ... ... ...
Zaza Lembit
87 87 87.0 8.2 8.2 8.2 40382 40382 40382.0 144501
Urushadze Ulfsak
Hrithik
155 155 155.0 8.1 8.1 8.1 67927 67927 67927.0 3108485
Roshan
Zoya
Akhtar Vijay
154 154 154.0 8.0 8.0 8.0 31886 31886 31886.0 5566534
Varma
Çagan Çetin
112 112 112.0 8.3 8.3 8.3 78925 78925 78925.0 461855363
Irmak Tekindor
Ömer
Cem
Faruk 127 127 127.0 8.0 8.0 8.0 56960 56960 56960.0 196206077
Yilmaz
Sorak
Excercise
Out[78]: match_id inning batting_team bowling_team over ball batsman non_striker bowler is_s
Royal
Sunrisers DA TS
0 1 1 Challengers 1 1 S Dhawan
Hyderabad Warner Mills
Bangalore
Royal
Sunrisers DA TS
1 1 1 Challengers 1 2 S Dhawan
Hyderabad Warner Mills
Bangalore
2 rows × 21 columns
0 V Kohli 5434
1 SK Raina 5415
2 RG Sharma 4914
3 DA Warner 4741
4 S Dhawan 4632
5 CH Gayle 4560
6 MS Dhoni 4477
7 RV Uthappa 4446
8 AB de Villiers 4428
9 G Gambhir 4223
Out[94]: batsman
CH Gayle 327
AB de Villiers 214
Name: batsman, dtype: int64
In [105]: # find batsman with most number of 4's and 6's in last 5 overs
temp = ipl[ipl['over']>15]
temp = temp[(temp['batsman_runs'] == 4) | (temp['batsman_runs'] == 6)]
temp.groupby('batsman')['batsman'].count().sort_values(ascending=False).head(1
In [118]: # Create a function that can return the highest score of any batsman
temp = ipl[ipl['batsman'] == 'V Kohli']
temp.groupby('match_id')['batsman_runs'].sum().sort_values(ascending=False).he
Out[118]: 113
Out[123]: 126
In [ ]:
In [3]: courses.head(2)
Out[3]:
course_id course_name price
0 1 python 2499
1 2 sql 3499
In [4]: students.head(2)
Out[4]:
student_id name partner
0 1 Kailash Harjo 23
1 2 Esha Butala 1
In [5]: may.head(2)
Out[5]:
student_id course_id
0 23 1
1 15 5
In [6]: june.head(2)
Out[6]:
student_id course_id
0 3 5
1 16 7
In [7]: matches.head(2)
Out[7]:
id season city date team1 team2 toss_winner toss_decision result dl_applied winner win_by_runs win_by_wickets player_of_ma
Royal Royal
2017- Sunrisers Sunrisers
0 1 2017 Hyderabad Challengers Challengers field normal 0 35 0 Yuvraj Si
04-05 Hyderabad Hyderabad
Bangalore Bangalore
Rising Rising
2017- Mumbai Rising Pune
1 2 2017 Pune Pune field normal 0 Pune 0 7 SPD Sm
04-06 Indians Supergiant
Supergiant Supergiant
Concat
it is a powerful function that allows you to concatenate two or more DataFrames along a particular axis (row-wise or column-wise). You can control how the
data is concatenated by specifying several parameters, such as axis, join, ignore_index, and keys.
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 1/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 2/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[8]:
student_id course_id
0 23 1
1 15 5
2 18 6
3 23 4
4 16 9
5 18 1
6 1 1
7 7 8
8 22 3
9 15 1
10 19 4
11 1 6
12 7 10
13 11 7
14 13 3
15 24 4
16 21 1
17 16 5
18 23 3
19 17 7
20 23 6
21 25 1
22 19 2
23 25 10
24 3 3
25 3 5
26 16 7
27 12 10
28 12 1
29 14 9
30 7 7
31 7 2
32 16 3
33 17 10
34 11 8
35 14 6
36 12 5
37 12 7
38 18 8
39 1 10
40 1 9
41 2 5
42 7 6
43 22 5
44 22 6
45 23 9
46 23 5
47 14 4
48 14 1
49 11 10
50 42 9
51 50 8
52 38 1
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 3/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 4/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[9]:
student_id course_id
may 0 23 1
1 15 5
2 18 6
3 23 4
4 16 9
5 18 1
6 1 1
7 7 8
8 22 3
9 15 1
10 19 4
11 1 6
12 7 10
13 11 7
14 13 3
15 24 4
16 21 1
17 16 5
18 23 3
19 17 7
20 23 6
21 25 1
22 19 2
23 25 10
24 3 3
june 0 3 5
1 16 7
2 12 10
3 12 1
4 14 9
5 7 7
6 7 2
7 16 3
8 17 10
9 11 8
10 14 6
11 12 5
12 12 7
13 18 8
14 1 10
15 1 9
16 2 5
17 7 6
18 22 5
19 22 6
20 23 9
21 23 5
22 14 4
23 14 1
24 11 10
25 42 9
26 50 8
27 38 1
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 5/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
In [10]: multi.loc['may']
Out[10]:
student_id course_id
0 23 1
1 15 5
2 18 6
3 23 4
4 16 9
5 18 1
6 1 1
7 7 8
8 22 3
9 15 1
10 19 4
11 1 6
12 7 10
13 11 7
14 13 3
15 24 4
16 21 1
17 16 5
18 23 3
19 17 7
20 23 6
21 25 1
22 19 2
23 25 10
24 3 3
In [11]: multi.loc[('june',0)]
Out[11]: student_id 3
course_id 5
Name: (june, 0), dtype: int64
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 6/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[12]:
student_id course_id student_id course_id
0 23.0 1.0 3 5
1 15.0 5.0 16 7
2 18.0 6.0 12 10
3 23.0 4.0 12 1
4 16.0 9.0 14 9
5 18.0 1.0 7 7
6 1.0 1.0 7 2
7 7.0 8.0 16 3
8 22.0 3.0 17 10
9 15.0 1.0 11 8
10 19.0 4.0 14 6
11 1.0 6.0 12 5
12 7.0 10.0 12 7
13 11.0 7.0 18 8
14 13.0 3.0 1 10
15 24.0 4.0 1 9
16 21.0 1.0 2 5
17 16.0 5.0 7 6
18 23.0 3.0 22 5
19 17.0 7.0 22 6
20 23.0 6.0 23 9
21 25.0 1.0 23 5
22 19.0 2.0 14 4
23 25.0 10.0 14 1
24 3.0 3.0 11 10
25 NaN NaN 42 9
26 NaN NaN 50 8
27 NaN NaN 38 1
Merge
On Joins
Inner Join
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 7/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
In each set of data, there should to be a "common" column. Students[student_id] and regs[student_id] are listed here. We combine based on the student_id,
however the inner join only displays the data that is "Common" across the two dataframes.
Out[13]:
student_id name partner course_id
45 23 Chhavi Lachman 18 9
46 23 Chhavi Lachman 18 5
47 24 Radhika Suri 17 4
48 25 Shashank D’Alia 2 1
49 25 Shashank D’Alia 2 10
Left Join
Regardless of whether or not the right side data leaves, it prints all of the left side data. so , we can see left data (Numpy , c++) but we cannot see any right
side data which is student_id here, courses reflect = Left and regs reflect = right
In [14]:
courses.merge(regs,how='left',on='course_id').tail(5)
Out[14]:
course_id course_name price student_id
Right join
In [15]:
temp_df = pd.DataFrame({
'student_id':[26,27,28],
'name':['Nitish','Ankit','Rahul'],
'partner':[28,26,17]
})
students = pd.concat([students,temp_df],ignore_index=True)
In [16]: students.tail()
Out[16]:
student_id name partner
23 24 Radhika Suri 17
24 25 Shashank D’Alia 2
25 26 Nitish 28
26 27 Ankit 26
27 28 Rahul 17
Regs data(50,51,52) in the current case does not contain students data, however even this, data is printed since the join was done right.
why.?
because when using a right join, all right side data is printed regardless of whether the left side data exits or not.
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 8/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[17]:
student_id name partner course_id
50 42 NaN NaN 9
51 50 NaN NaN 8
52 38 NaN NaN 1
Since there is no course_id in the student data in the current case, "Nan" data is displayed.
Why was a left join performed using the student_id? Regardless of whether or not the right side data leaves, it prints all of the left side data.
Out[18]:
student_id name partner course_id
57 26 Nitish 28 NaN
58 27 Ankit 26 NaN
59 28 Rahul 17 NaN
Outer join
Initially the left join data is clearly apparent with (Nitish, Ankit, Rahul) data written,
but the right side data (course id) is blank. like which,
Right join shows Nan even though we don't have any data for (42, 50, 38), but we can see the course's id column because it's a right join.
Finally, we may view both data sets, both common and individual, regardless of whether they have ever been. As seen in the outer join
Out[19]:
student_id name partner course_id
Out[20]: 154247
Out[27]: level_0
june 65072
may 89175
Name: price, dtype: int64
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 9/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[32]:
student_id course_id name partner course_name price
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 10/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[33]:
name course_name price
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 11/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[38]: course_name
data analysis 24995
machine learning 39996
ms sxcel 7995
pandas 4396
plotly 3495
power bi 11394
pyspark 14994
python 22491
sql 6998
tableau 17493
Name: price, dtype: int64
Out[41]: <AxesSubplot:xlabel='course_name'>
intersect1d
Find the intersection of two arrays. Return the sorted, unique values that are in both of the input arrays.
In [47]: students[students['student_id'].isin(common_students_id)]
Out[47]:
student_id name partner
0 1 Kailash Harjo 23
2 3 Parveen Bhalla 3
6 7 Tarun Thaker 9
10 11 David Mukhopadhyay 20
15 16 Elias Dodiya 25
16 17 Yasmin Palan 7
17 18 Fardeen Mahabir 13
21 22 Yash Sethi 21
22 23 Chhavi Lachman 18
numpy.setdiff1d()
function find the set difference of two arrays and return the unique values in arr1 that are not in arr2.
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 12/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[52]:
course_id course_name price
10 11 Numpy 699
11 12 C++ 1299
In [53]: # 7. find students who did not enroll into any courses
student_id_list = np.setdiff1d(students['student_id'], regs['student_id'])
students[students['student_id'].isin(student_id_list)]
Out[53]:
student_id name partner
3 4 Marlo Dugal 14
4 5 Kusum Bahri 6
5 6 Lakshmi Contractor 10
7 8 Radheshyam Dey 5
8 9 Nitika Chatterjee 4
9 10 Aayushman Sant 8
19 20 Hanuman Hegde 11
25 26 Nitish 28
26 27 Ankit 26
27 28 Rahul 17
In [55]: students[students['student_id'].isin(student_id_list)].shape[0]
Out[55]: 10
Out[56]: 35.714285714285715
Self Join
A self join is a regular join, but the table is joined with itself.
here, left_on = partner from outside students on left , right_on =student_id from iside students on right .
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 13/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
In [60]: # 8. Print student name -> partner name for all enrolled students
# self join
students.merge(students,how ='inner',left_on = 'partner', right_on= 'student_id')[['name_x','name_y']]
Out[60]:
name_x name_y
26 Nitish Rahul
27 Ankit Nitish
In [81]: # 10. find top 5 students who spent most amount of money on courses
regs.merge(students , on ='student_id').merge(courses, on= 'course_id').groupby(['student_id','name'])['price'].sum().sort_values
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 14/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[82]:
student_id name partner course_id
0 1 Kailash Harjo 23 1
1 1 Kailash Harjo 23 6
2 1 Kailash Harjo 23 10
3 1 Kailash Harjo 23 9
4 2 Esha Butala 1 5
5 3 Parveen Bhalla 3 3
6 3 Parveen Bhalla 3 5
7 7 Tarun Thaker 9 8
8 7 Tarun Thaker 9 10
9 7 Tarun Thaker 9 7
10 7 Tarun Thaker 9 2
11 7 Tarun Thaker 9 6
12 11 David Mukhopadhyay 20 7
13 11 David Mukhopadhyay 20 8
14 11 David Mukhopadhyay 20 10
15 12 Radha Dutt 19 10
16 12 Radha Dutt 19 1
17 12 Radha Dutt 19 5
18 12 Radha Dutt 19 7
19 13 Munni Varghese 24 3
20 14 Pranab Natarajan 22 9
21 14 Pranab Natarajan 22 6
22 14 Pranab Natarajan 22 4
23 14 Pranab Natarajan 22 1
24 15 Preet Sha 16 5
25 15 Preet Sha 16 1
26 16 Elias Dodiya 25 9
27 16 Elias Dodiya 25 5
28 16 Elias Dodiya 25 7
29 16 Elias Dodiya 25 3
30 17 Yasmin Palan 7 7
31 17 Yasmin Palan 7 10
32 18 Fardeen Mahabir 13 6
33 18 Fardeen Mahabir 13 1
34 18 Fardeen Mahabir 13 8
35 19 Qabeel Raman 12 4
36 19 Qabeel Raman 12 2
37 21 Seema Kota 15 1
38 22 Yash Sethi 21 3
39 22 Yash Sethi 21 5
40 22 Yash Sethi 21 6
41 23 Chhavi Lachman 18 1
42 23 Chhavi Lachman 18 4
43 23 Chhavi Lachman 18 3
44 23 Chhavi Lachman 18 6
45 23 Chhavi Lachman 18 9
46 23 Chhavi Lachman 18 5
47 24 Radhika Suri 17 4
48 25 Shashank D’Alia 2 1
49 25 Shashank D’Alia 2 10
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 15/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[87]:
id season city date team1 team2 toss_winner toss_decision result dl_applied winner win_by_runs win_by_wickets player_
Royal Royal
2017- Sunrisers Sunrisers
0 1 2017 Hyderabad Challengers Challengers field normal 0 35 0 Yuv
04-05 Hyderabad Hyderabad
Bangalore Bangalore
Rising Rising
2017- Mumbai Rising Pune
1 2 2017 Pune Pune field normal 0 Pune 0 7 S
04-06 Indians Supergiant
Supergiant Supergiant
Rising
2017- Kings XI Kings XI Kings XI
3 4 2017 Indore Pune field normal 0 0 6 G
04-08 Punjab Punjab Punjab
Supergiant
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Kolkata Kolkata
2016- Sunrisers Sunrisers
633 634 2016 Delhi Knight Knight field normal 0 22 0 MC H
05-25 Hyderabad Hyderabad
Riders Riders
Royal
2016- Sunrisers Sunrisers Sunrisers
635 636 2016 Bangalore Challengers bat normal 0 8 0 BC
05-29 Hyderabad Hyderabad Hyderabad
Bangalore
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 16/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
In [89]: deliveries
Out[89]:
match_id inning batting_team bowling_team over ball batsman non_striker bowler is_super_over ... bye_runs legbye_runs noball_runs penalt
Royal
Sunrisers DA TS
0 1 1 Challengers 1 1 S Dhawan 0 ... 0 0 0
Hyderabad Warner Mills
Bangalore
Royal
Sunrisers DA TS
1 1 1 Challengers 1 2 S Dhawan 0 ... 0 0 0
Hyderabad Warner Mills
Bangalore
Royal
Sunrisers DA TS
2 1 1 Challengers 1 3 S Dhawan 0 ... 0 0 0
Hyderabad Warner Mills
Bangalore
Royal
Sunrisers DA TS
3 1 1 Challengers 1 4 S Dhawan 0 ... 0 0 0
Hyderabad Warner Mills
Bangalore
Royal
Sunrisers DA TS
4 1 1 Challengers 1 5 S Dhawan 0 ... 0 0 0
Hyderabad Warner Mills
Bangalore
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Chennai Mumbai RA SL
179073 11415 2 20 2 SR Watson 0 ... 0 0 0
Super Kings Indians Jadeja Malinga
Chennai Mumbai SR SL
179074 11415 2 20 3 RA Jadeja 0 ... 0 0 0
Super Kings Indians Watson Malinga
Chennai Mumbai SR SL
179075 11415 2 20 4 RA Jadeja 0 ... 0 0 0
Super Kings Indians Watson Malinga
Chennai Mumbai SN SL
179076 11415 2 20 5 RA Jadeja 0 ... 0 0 0
Super Kings Indians Thakur Malinga
Chennai Mumbai SN SL
179077 11415 2 20 6 RA Jadeja 0 ... 0 0 0
Super Kings Indians Thakur Malinga
Out[94]:
match_id inning batting_team bowling_team over ball batsman non_striker bowler is_super_over ... result dl_applied winner win_by_runs win_
Royal
Sunrisers DA TS Sunrisers
0 1 1 Challengers 1 1 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
Royal
Sunrisers DA TS Sunrisers
1 1 1 Challengers 1 2 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
2 rows × 39 columns
In [101]: six_df=temp[temp['batsman_runs']==6]
six_df.head(2)
Out[101]:
match_id inning batting_team bowling_team over ball batsman non_striker bowler is_super_over ... result dl_applied winner win_by_runs
Royal
Sunrisers DA A Sunrisers
10 1 1 Challengers 2 4 S Dhawan 0 ... normal 0 35
Hyderabad Warner Choudhary Hyderabad
Bangalore
Royal
Sunrisers MC Sunrisers
47 1 1 Challengers 8 4 S Dhawan TM Head 0 ... normal 0 35
Hyderabad Henriques Hyderabad
Bangalore
2 rows × 39 columns
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 17/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[105]: venue
Barabati Stadium 68
Brabourne Stadium 114
Buffalo Park 27
De Beers Diamond Oval 34
Dr DY Patil Sports Academy 173
Name: venue, dtype: int64
In [112]: (number_six/number_matches).sort_values(ascending=False).head()
Out[113]:
match_id inning batting_team bowling_team over ball batsman non_striker bowler is_super_over ... result dl_applied winner win_by_runs
Royal
Sunrisers DA TS Sunrisers
0 1 1 Challengers 1 1 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
Royal
Sunrisers DA TS Sunrisers
1 1 1 Challengers 1 2 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
Royal
Sunrisers DA TS Sunrisers
2 1 1 Challengers 1 3 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
Royal
Sunrisers DA TS Sunrisers
3 1 1 Challengers 1 4 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
Royal
Sunrisers DA TS Sunrisers
4 1 1 Challengers 1 5 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Royal
Sunrisers Sachin B Sunrisers
150455 636 2 Challengers 20 2 CJ Jordan 0 ... normal 0 8
Hyderabad Baby Kumar Hyderabad
Bangalore
Royal
Sunrisers Sachin B Sunrisers
150456 636 2 Challengers 20 3 CJ Jordan 0 ... normal 0 8
Hyderabad Baby Kumar Hyderabad
Bangalore
Royal
Sunrisers Iqbal Sachin B Sunrisers
150457 636 2 Challengers 20 4 0 ... normal 0 8
Hyderabad Abdulla Baby Kumar Hyderabad
Bangalore
Royal
Sunrisers Sachin Iqbal B Sunrisers
150458 636 2 Challengers 20 5 0 ... normal 0 8
Hyderabad Baby Abdulla Kumar Hyderabad
Bangalore
Royal
Sunrisers Iqbal Sachin B Sunrisers
150459 636 2 Challengers 20 6 0 ... normal 0 8
Hyderabad Abdulla Baby Kumar Hyderabad
Bangalore
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 18/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
Out[114]:
match_id inning batting_team bowling_team over ball batsman non_striker bowler is_super_over ... result dl_applied winner win_by_runs win_
Royal
Sunrisers DA TS Sunrisers
0 1 1 Challengers 1 1 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
Royal
Sunrisers DA TS Sunrisers
1 1 1 Challengers 1 2 S Dhawan 0 ... normal 0 35
Hyderabad Warner Mills Hyderabad
Bangalore
2 rows × 39 columns
In [117]: df.groupby(['season','batsman'])['batsman_runs'].sum()
In [120]: df.groupby(['season','batsman'])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending=False)
Out[120]:
season batsman batsman_runs
58 2008 L Balaji 0
In [123]: '])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending=False).drop_duplicates(subset='season',keep='first')
Out[123]:
season batsman batsman_runs
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 19/20
5/10/23, 2:14 PM Pandas. Merging, Joining , Concatenating (Prudhvi vardhan Notes) - Jupyter Notebook
In [124]: oupby(['season','batsman'])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending=False).sort_values('season')
Out[124]:
season batsman batsman_runs
58 2008 L Balaji 0
45 2008 I Sharma 11
67 2008 M Ntini 11
In [ ]:
localhost:8888/notebooks/ Pandas. Merging%2C Joining %2C Concatenating (Prudhvi vardhan Notes).ipynb 20/20
5/12/23, 2:19 PM Pandas -Multi-Index Series(Prudhvi Vardhan Notes) - Jupyter Notebook
But why?
And what exactly is index?
In [5]: # 2. pd.MultiIndex.from_product()
pd.MultiIndex.from_product([['cse','ece'],[2019,2020,2021,2022]])
Out[7]: 4
In [8]: sample['cse']
Out[8]: 2019 1
2020 2
2021 3
2022 4
dtype: int64
unstack
reshape the given Pandas DataFrame by transposing specified row level to column level
Out[9]:
2019 2020 2021 2022
cse 1 2 3 4
ece 5 6 7 8
stack
reshapes the given DataFrame by converting the column label to a row index.
In [10]: temp.stack()
Out[11]:
avg_package students
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [12]: branch_df1.loc['cse']
Out[12]:
avg_package students
2019 1 2
2020 3 4
2021 5 6
2022 7 8
In [13]: branch_df1['avg_package']
In [14]: branch_df1['students']
In [15]: branch_df1.loc['ece']
Out[15]:
avg_package students
2019 9 10
2020 11 12
2021 13 14
2022 15 16
Out[16]:
delhi mumbai
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [17]: branch_df2['delhi']
Out[17]:
avg_package students
2019 1 2
2020 3 4
2021 5 6
2022 7 8
In [18]: branch_df2.loc[2019]
In [19]: branch_df2.iloc[1]
Out[20]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [21]: branch_df1
Out[21]:
avg_package students
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
Out[22]:
avg_package students
cse 1 3 5 7 2 4 6 8
ece 9 11 13 15 10 12 14 16
In [23]: branch_df1.unstack().unstack()
Out[24]:
avg_package students
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [26]: # Example : 2
branch_df2
Out[26]:
delhi mumbai
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
The Unstack()
It is method can be used to move the index to the columns. This means that the index will become the rows, and the rows will become the
columns.
The unstack method can be used to move the index to the columns
In [27]: branch_df2.unstack()
In [28]: branch_df2.stack()
Out[28]:
delhi mumbai
2019 avg_package 1 0
students 2 0
2020 avg_package 3 0
students 4 0
2021 avg_package 5 0
students 6 0
2022 avg_package 7 0
students 8 0
In [29]: branch_df2.stack().stack()
Out[30]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [31]: branch_df3.stack()
Out[31]:
delhi mumbai
students 2 0
2020 avg_package 3 0
students 4 0
2021 avg_package 5 0
students 6 0
2022 avg_package 7 0
students 8 0
students 10 0
2020 avg_package 11 0
students 12 0
2021 avg_package 13 0
students 14 0
2022 avg_package 15 0
students 16 0
In [32]: branch_df3.stack().stack()
Out[33]:
delhi mumbai
2019 2020 2021 2022 2019 2020 2021 2022 2019 2020 2021 2022 2019 2020 2021 2022
cse 1 3 5 7 2 4 6 8 0 0 0 0 0 0 0 0
ece 9 11 13 15 10 12 14 16 0 0 0 0 0 0 0 0
In [34]: branch_df3.unstack().unstack()
Out[35]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
Out[36]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
In [37]: # shape
branch_df3.shape
Out[37]: (8, 4)
In [38]: # info
branch_df3.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8 entries, ('cse', 2019) to ('ece', 2022)
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (delhi, avg_package) 8 non-null int64
1 (delhi, students) 8 non-null int64
2 (mumbai, avg_package) 8 non-null int64
3 (mumbai, students) 8 non-null int64
dtypes: int64(4)
memory usage: 932.0+ bytes
In [40]: branch_df3.isnull()
Out[40]:
delhi mumbai
Out[42]:
delhi mumbai
cse 2019 1 2 0 0
2021 5 6 0 0
ece 2019 9 10 0 0
Out[43]:
delhi mumbai
cse 2019 1 2 0 0
2021 5 6 0 0
ece 2019 9 10 0 0
Out[45]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
branch_df3.iloc[:,1:3]
Out[46]:
delhi mumbai
students avg_package
cse 2019 2 0
2020 4 0
2021 6 0
2022 8 0
ece 2019 10 0
2020 12 0
2021 14 0
2022 16 0
Out[47]:
delhi mumbai
students avg_package
cse 2019 2 0
ece 2019 10 0
Out[48]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [49]: branch_df3.sort_index(ascending=False)
Out[49]:
delhi mumbai
ece 2022 15 16 0 0
2021 13 14 0 0
2020 11 12 0 0
2019 9 10 0 0
cse 2022 7 8 0 0
2021 5 6 0 0
2020 3 4 0 0
2019 1 2 0 0
Out[50]:
delhi mumbai
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
Out[51]:
cse ece
delhi avg_package 1 3 5 7 9 11 13 15
students 2 4 6 8 10 12 14 16
mumbai avg_package 0 0 0 0 0 0 0 0
students 0 0 0 0 0 0 0 0
In [52]: # swaplevel
branch_df3
Out[52]:
delhi mumbai
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [53]: # On rows
branch_df3.swaplevel()
Out[53]:
delhi mumbai
2019 cse 1 2 0 0
2020 cse 3 4 0 0
2021 cse 5 6 0 0
2022 cse 7 8 0 0
2019 ece 9 10 0 0
2020 ece 11 12 0 0
2021 ece 13 14 0 0
2022 ece 15 16 0 0
In [54]: # on columns
branch_df3.swaplevel(axis=1)
Out[54]:
avg_package students avg_package students
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
Wide format is where we have a single row for every data point with multiple columns to hold the values of various attributes.
Long format is where, for each data point we have as many rows as the number of attributes and each row contains the value of a particular
attribute for a given data point.
Out[55]:
cse
0 120
In [56]: pd.DataFrame({'cse':[120]}).melt()
Out[56]:
variable value
0 cse 120
Out[57]:
variable value
0 cse 120
1 ece 100
2 mech 50
Out[58]:
branch num_students
0 cse 120
1 ece 100
2 mech 50
In [59]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
)
Out[59]:
branch 2020 2021 2022
2 mech 60 80 70
In [60]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
).melt()
Out[60]:
variable value
0 branch cse
1 branch ece
2 branch mech
3 2020 100
4 2020 150
5 2020 60
6 2021 120
7 2021 130
8 2021 80
9 2022 150
10 2022 140
11 2022 70
Out[61]:
branch variable value
2 mech 2020 60
5 mech 2021 80
8 mech 2022 70
the melt() method is used to reshape a DataFrame from wide to long format. This means that the columns of the DataFrame are converted into
rows, and the values in the columns are converted into columns.
Out[62]:
branch year students
2 mech 2020 60
5 mech 2021 80
8 mech 2022 70
In [64]: deaths.head(2)
Out[64]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/24/22 12/25/22 12/26/22 12/27/22
In [65]: deaths.shape
Out[67]: (311253, 6)
In [75]: deaths.head()
Out[75]:
Province/State Country/Region Lat Long date no. of deaths
In [68]: confirm.head(2)
Out[68]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/24/22 12/25/22 12/26/22 12/27/22
In [69]: confirm.shape
In [76]: confirm.head()
Out[76]:
Province/State Country/Region Lat Long date no. of confirmed
Out[74]: (311253, 6)
Out[77]:
Province/State Country/Region Lat Long date no. of confirmed no. of deaths
311248 NaN West Bank and Gaza 31.952200 35.233200 1/2/23 703228 5708
Out[80]:
Country/Region date no. of confirmed no. of deaths
0 Afghanistan 1/22/20 0 0
1 Albania 1/22/20 0 0
2 Algeria 1/22/20 0 0
3 Andorra 1/22/20 0 0
4 Angola 1/22/20 0 0
the Pivot table takes simple column wise data as input, and groups as the entire Into 2 dimensional table that provides a multi dimensional
summarization of the data.
In [83]: df = sns.load_dataset('tips')
df.head()
Out[83]:
total_bill tip sex smoker day time size
Out[85]: sex
Male 20.744076
Female 18.056897
Name: total_bill, dtype: float64
Out[88]:
smoker Yes No
sex
Out[89]:
smoker Yes No
sex
Out[90]:
smoker Yes No
sex
Out[91]:
smoker Yes No
sex
Male 60 97
Female 33 54
Out[92]:
smoker Yes No
sex
Out[93]:
size tip total_bill
sex
Out[95]:
smoker Yes No
sex
In [96]: df.pivot_table(index='sex',columns='smoker')['size']
Out[96]:
smoker Yes No
sex
Out[98]:
total_bill tip sex smoker day time size
Out[100]:
day Thur Fri Sat Sun
sex smoker
In [102]: df.pivot_table(index=['sex','smoker'],columns=['day','time'])
Out[102]:
tip total_bill
Fri Sat Sun Thur Fri Sat Sun Thur Fri Sat
Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Dinner Dinner
NaN 1.666667 2.4 2.629630 2.600000 3.058000 NaN 1.90 3.246 2.879259 3.521333 19.171000 NaN 11.386667 25.892 21.837778
NaN NaN 2.0 2.656250 2.883721 2.941500 NaN NaN 2.500 3.256563 3.115349 18.486500 NaN NaN 17.475 19.929063
NaN 2.000000 2.0 2.200000 2.500000 2.990000 NaN 2.66 2.700 2.868667 3.500000 19.218571 NaN 13.260000 12.200 20.266667
2.0 3.000000 2.0 2.307692 3.071429 2.437083 3.0 3.00 3.250 2.724615 3.329286 15.899167 18.78 15.980000 22.750 19.003846
In [103]: df.pivot_table(index=['sex','smoker'],columns=['day','time'],aggfunc={'size':'mean','tip':'max','total_bill':'sum'})
Out[103]:
size tip total_bill
day Thur Fri Sat Sun Thur Fri Sat Sun Thur Fri
time Lunch Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Dinner Dinner Dinner Lunch Dinner Lunch Din
sex smoker
Male Yes 2.300000 NaN 1.666667 2.4 2.629630 2.600000 5.00 NaN 2.20 4.73 10.00 6.5 191.71 0.00 34.16 12
No 2.500000 NaN NaN 2.0 2.656250 2.883721 6.70 NaN NaN 3.50 9.00 6.0 369.73 0.00 0.00 3
Female Yes 2.428571 NaN 2.000000 2.0 2.200000 2.500000 5.00 NaN 3.48 4.30 6.50 4.0 134.53 0.00 39.78 4
No 2.500000 2.0 3.000000 2.0 2.307692 3.071429 5.17 3.0 3.00 3.25 4.67 5.2 381.58 18.78 15.98 2
In [106]: # Margins.
df.pivot_table(index='sex',columns= 'smoker',values ='total_bill',aggfunc='sum',margins=True)
Out[106]:
smoker Yes No All
sex
In [110]: expense.head(2)
Out[110]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
0 3/2/2022 10:11 CUB - online payment Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
1 3/2/2022 10:11 CUB - online payment Other NaN To lended people 300.0 Expense NaN 300.0 INR 300.0
In [115]: # Categories
expense['Category'].value_counts()
In [117]: expense.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null object
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: float64(5), object(6)
memory usage: 23.9+ KB
In [121]: expense.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null datetime64[ns]
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: datetime64[ns](1), float64(5), object(5)
memory usage: 23.9+ KB
Out[123]: 0 March
1 March
2 March
3 March
4 March
...
272 November
273 November
274 November
275 November
276 November
Name: Date, Length: 277, dtype: object
In [125]: expense.head(2)
Out[125]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1 month
Out[126]:
Petty Self- Social
Category Allowance Apparel Beauty Education Food Gift Household Other Salary Transportation
cash development Life
month
December 11000.0 2590.0 196.0 NaN 6440.72 NaN 4800.0 1790.0 NaN NaN 400.0 513.72 914.0
February NaN 798.0 NaN NaN 5579.85 NaN 2808.0 20000.0 NaN NaN NaN 1800.00 5078.8
January 1000.0 NaN NaN 1400.0 9112.51 NaN 4580.0 13178.0 NaN 8000.0 NaN 200.00 2850.0
March NaN NaN NaN NaN 195.00 NaN NaN 900.0 NaN NaN NaN NaN 30.0
November 2000.0 NaN NaN NaN 3174.40 115.0 NaN 2000.0 3.0 NaN NaN NaN 331.0
Out[128]:
Petty Self- Social
Category Allowance Apparel Beauty Education Food Gift Household Other Salary Transportation
cash development Life
month
December 11000 2590 196 0 6440.72 0 4800 1790 0 0 400 513.72 914.0
In [131]: # plot
expense.pivot_table(index ='month', columns='Category', values ='INR', aggfunc='sum',fill_value =0).plot()
Out[131]: <AxesSubplot:xlabel='month'>
Out[132]: <AxesSubplot:xlabel='month'>
Out[133]: <AxesSubplot:xlabel='month'>
In [ ]:
In [2]: a = np.array([1,2,3,4])
a * 4
Out[4]: 0 cat
1 mat
2 None
3 rat
dtype: object
Out[5]: 0 True
1 False
2 None
3 False
dtype: object
In [7]: df.head(1)
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Emba
Braund,
Mr. A/5
0 1 0 3 male 22.0 1 0 7.25 NaN
Owen 21171
Harris
In [8]: df['Name']
Common Functions
lower/upper/capitalize/title
In [9]: # Upper
df['Name'].str.upper() # converts into Capital Words
In [10]: # lower
df['Name'].str.lower() # converts into small Words
In [11]: # title
df['Name'].str.title() # converts into starting letter of Word to Capital
Out[12]: 82
Out[14]: 'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Va
llejo)'
strip
Out[16]: 'jack'
split
In [19]: # split
df['Name'].str.split(',')
Out[21]: 0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Name, Length: 891, dtype: object
In [22]: df['Name'].str.split(',').str.get(1)
In [25]: df.head(1)
Out[25]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Emba
Braund,
Mr. A/5
0 1 0 3 male 22.0 1 0 7.25 NaN
Owen 21171
Harris
In [29]: # it is used to split the Name column of the DataFrame df into two columns
# FirstName and LastName.
df['Name'].str.split(',').str.get(1).str.strip().str.split(' ',n=1, expand=Tru
Out[29]:
0 1
2 Miss. Laina
In [31]: df.head(1)
Out[31]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Emba
Braund,
Mr. A/5
0 1 0 3 male 22.0 1 0 7.25 NaN
Owen 21171
Harris
replace
C:\Users\user\AppData\Local\Temp/ipykernel_15952/1805277261.py:1: FutureWarni
ng: The default value of regex will change from True to False in a future ver
sion.
df['title'] = df['title'].str.replace('Ms.','Miss.')
C:\Users\user\AppData\Local\Temp/ipykernel_15952/1805277261.py:2: FutureWarni
ng: The default value of regex will change from True to False in a future ver
sion.
df['title'] = df['title'].str.replace('Mlle.','Miss.')
In [37]: df['title'].value_counts()
Filtering
In [38]: # startswith/endswith
df[df['firstname'].str.startswith('A')]
Out[38]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
Andersson,
13 14 0 3 Mr. Anders male 39.0 1 5 347082 31.2750
Johan
McGowan,
22 23 1 3 Miss. Anna female 15.0 0 0 330923 8.0292
"Annie"
Holverson,
Mr.
35 36 0 1 male 42.0 1 0 113789 52.0000
Alexander
Oskar
Vander
Planke,
38 39 0 3 Miss. female 18.0 2 0 345764 18.0000
Augusta
Maria
Icard, Miss.
61 62 1 1 female 38.0 0 0 113572 80.0000
Amelie
... ... ... ... ... ... ... ... ... ... ...
Serepeca,
842 843 1 1 Miss. female 30.0 0 0 113798 31.0000
Augusta
Abbing, Mr.
845 846 0 3 male 42.0 0 0 C.A. 5547 7.5500
Anthony
Duran y
SC/PARIS
866 867 1 2 More, Miss. female 27.0 1 0 13.8583
2149
Asuncion
Najib, Miss.
Adele
875 876 1 3 female 15.0 0 0 2667 7.2250
Kiamie
"Jane"
Gustafsson,
876 877 0 3 Mr. Alfred male 20.0 0 0 7534 9.8458
Ossian
95 rows × 15 columns
In [40]: # endswith
df[df['firstname'].str.endswith('z')]
Out[40]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabi
Kink, Mr.
69 70 0 3 male 26.0 2 0 315151 8.6625 Na
Vincenz
Cardeza,
Mr. B5
PC
679 680 1 1 Thomas male 36.0 0 1 512.3292 B5
17755
Drake B5
Martinez
Jensen,
Mr.
721 722 0 3 male 17.0 1 0 350048 7.0542 Na
Svend
Lauritz
In [41]: # isdigit/isalpha...
df[df['firstname'].str.isdigit()]
Out[41]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarke
regex
Out[42]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
Cumings,
Mrs. John
1 2 1 1 Bradley female 38.0 1 0 PC 17599 7
(Florence
Briggs Th...
Turpin, Mrs.
William
41 42 0 2 John Robert female 27.0 1 0 11668 2
(Dorothy
Ann ...
Rogers, Mr.
S.C./A.4.
45 46 0 3 William male NaN 0 0
23567
John
Doling, Mrs.
98 99 1 2 John T (Ada female 34.0 0 1 231919 2
Julia Bone)
In [44]: # find lastnames with start and end char vowel ( aeiou)
df[df['last_name'].str.contains('^[^aeiouAEIOU].+[^aeiouAEIOU]$')]
Out[44]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
Braund,
0 1 0 3 Mr. Owen male 22.0 1 0 A/5 21171 7.2500
Harris
Cumings,
Mrs. John
Bradley
1 2 1 1 female 38.0 1 0 PC 17599 71.2833
(Florence
Briggs
Th...
Heikkinen,
STON/O2.
2 3 1 3 Miss. female 26.0 0 0 7.9250
3101282
Laina
Moran,
5 6 0 3 male NaN 0 0 330877 8.4583
Mr. James
McCarthy,
6 7 0 1 Mr. male 54.0 0 0 17463 51.8625
Timothy J
... ... ... ... ... ... ... ... ... ... ...
Sutehall,
SOTON/OQ
884 885 0 3 Mr. Henry male 25.0 0 0 7.0500
392076
Jr
Graham,
Miss.
887 888 1 1 female 19.0 0 0 112053 30.0000
Margaret
Edith
Johnston,
Miss.
888 889 0 3 Catherine female NaN 1 2 W./C. 6607 23.4500
Helen
"Carrie"
Behr, Mr.
889 890 1 1 Karl male 26.0 0 0 111369 30.0000
Howell
Dooley,
890 891 0 3 Mr. male 32.0 0 0 370376 7.7500
Patrick
slicing
Out[45]: 0 Brau
1 Cumi
2 Heik
3 Futr
4 Alle
...
886 Mont
887 Grah
888 John
889 Behr
890 Dool
Name: Name, Length: 891, dtype: object
In [ ]:
Timestamp Object
Time stamps reference particular moments in time (e.g., Oct 24th, 2022 at 7:00pm)
Vectorized date and time operations are a powerful tool for working with date and time data. They can be used to
quickly and easily perform a wide variety of operations on date and time data.
In [3]: # type
type(pd.Timestamp('2023/05/12'))
Out[3]: pandas._libs.tslibs.timestamps.Timestamp
In [4]: # Variations
pd.Timestamp('2023-05-12')
In [5]: pd.Timestamp('2023,05,12')
In [6]: pd.Timestamp('2023.05.12')
Out[12]: 2023
In [15]: x.day
Out[15]: 12
In [17]: x.time()
In [18]: x.month
Out[18]: 5
why separate objects to handle data and time when python already has datetime
functionality?
Because of the uniform type in NumPy datetime64 arrays, this type of operation can be accomplished much more
quickly than if we were working directly with Python's datetime objects, especially as arrays get large
Pandas Timestamp object combines the ease-of-use of python datetime with the efficient storage and vectorized
interface of numpy.datetime64
From a group of these Timestamp objects, Pandas can construct a DatetimeIndex that can be used to index data in
a Series or DataFrame
DatetimeIndex Object
In [23]: pd.DatetimeIndex(['2023/05/12','2023/01/01','2025/01/22'])[0]
In [25]: # type
type(pd.DatetimeIndex(['2023/05/12','2023/01/01','2025/01/22']))
Out[25]: pandas.core.indexes.datetimes.DatetimeIndex
In [28]: dt_index
Out[29]: 2023-01-01 1
2022-01-01 2
2021-01-01 3
dtype: int64
date_range function
In [33]: # generate daily dates in a given range
pd.date_range(start='2023/5/12',end='2023/6/12',freq='D')
to_datetime function
Out[59]: 0 2023
1 2022
2 2021
dtype: int64
In [60]: pd.to_datetime(s).dt.day
Out[60]: 0 12
1 1
2 1
dtype: int64
In [61]: pd.to_datetime(s).dt.day_name()
Out[61]: 0 Friday
1 Saturday
2 Monday
dtype: object
In [62]: pd.to_datetime(s).dt.month_name()
Out[62]: 0 May
1 January
2 February
dtype: object
Out[63]: 0 2023-01-01
1 2022-01-01
2 NaT
dtype: datetime64[ns]
In [64]: pd.to_datetime(s,errors='coerce').dt.year
Out[64]: 0 2023.0
1 2022.0
2 NaN
dtype: float64
In [65]: pd.to_datetime(s,errors='coerce').dt.month_name()
Out[65]: 0 January
1 January
2 NaN
dtype: object
In [66]: df = pd.read_csv("expense_data.csv")
In [69]: df.head()
Out[69]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
CUB -
3/2/2022
0 online Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
10:11
payment
CUB - To
3/2/2022
1 online Other NaN lended 300.0 Expense NaN 300.0 INR 300.0
10:11
payment people
CUB -
3/1/2022
2 online Food NaN Dinner 78.0 Expense NaN 78.0 INR 78.0
19:50
payment
CUB -
3/1/2022
3 online Transportation NaN Metro 30.0 Expense NaN 30.0 INR 30.0
18:56
payment
CUB -
3/1/2022
4 online Food NaN Snacks 67.0 Expense NaN 67.0 INR 67.0
18:22
payment
In [70]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null object
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: float64(5), object(6)
memory usage: 23.9+ KB
In [73]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null datetime64[ns]
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: datetime64[ns](1), float64(5), object(5)
memory usage: 23.9+ KB
dt accessor
In [75]: df['Date'].dt.year
Out[75]: 0 2022
1 2022
2 2022
3 2022
4 2022
...
272 2021
273 2021
274 2021
275 2021
276 2021
Name: Date, Length: 277, dtype: int64
In [76]: df['Date'].dt.month
Out[76]: 0 3
1 3
2 3
3 3
4 3
..
272 11
273 11
274 11
275 11
276 11
Name: Date, Length: 277, dtype: int64
In [77]: df['Date'].dt.month_name()
Out[77]: 0 March
1 March
2 March
3 March
4 March
...
272 November
273 November
274 November
275 November
276 November
Name: Date, Length: 277, dtype: object
In [80]: df['Date'].dt.day_name()
Out[80]: 0 Wednesday
1 Wednesday
2 Tuesday
3 Tuesday
4 Tuesday
...
272 Monday
273 Monday
274 Sunday
275 Sunday
276 Sunday
Name: Date, Length: 277, dtype: object
In [86]: df['Date'].dt.is_month_end
Out[86]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [87]: df['Date'].dt.is_year_end
Out[87]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [90]: df['Date'].dt.is_quarter_end
Out[90]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [91]: df['Date'].dt.is_quarter_start
Out[91]: 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [95]: plt.plot(df['Date'],df['INR'])
In [97]: df.head()
Out[97]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
2022- CUB -
0 03-02 online Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
10:11:00 payment
2022- CUB - To
1 03-02 online Other NaN lended 300.0 Expense NaN 300.0 INR 300.0
10:11:00 payment people
2022- CUB -
2 03-01 online Food NaN Dinner 78.0 Expense NaN 78.0 INR 78.0
19:50:00 payment
2022- CUB -
3 03-01 online Transportation NaN Metro 30.0 Expense NaN 30.0 INR 30.0
18:56:00 payment
2022- CUB -
4 03-01 online Food NaN Snacks 67.0 Expense NaN 67.0 INR 67.0
18:22:00 payment
In [99]: df.groupby('day_name')['INR'].mean().plot(kind='bar')
Out[99]: <AxesSubplot:xlabel='day_name'>
In [101]: df.head()
Out[101]:
Date Account Category Subcategory Note INR Income/Expense Note.1 Amount Currency Account.1
2022- CUB -
0 03-02 online Food NaN Brownie 50.0 Expense NaN 50.0 INR 50.0
10:11:00 payment
2022- CUB - To
1 03-02 online Other NaN lended 300.0 Expense NaN 300.0 INR 300.0
10:11:00 payment people
2022- CUB -
2 03-01 online Food NaN Dinner 78.0 Expense NaN 78.0 INR 78.0
19:50:00 payment
2022- CUB -
3 03-01 online Transportation NaN Metro 30.0 Expense NaN 30.0 INR 30.0
18:56:00 payment
2022- CUB -
4 03-01 online Food NaN Snacks 67.0 Expense NaN 67.0 INR 67.0
18:22:00 payment
Out[102]: <AxesSubplot:ylabel='INR'>
In [109]: # Average
df.groupby('month_name')['INR'].mean().plot(kind ='bar')
Out[109]: <AxesSubplot:xlabel='month_name'>