This notebook contains Python code snippets for reading and manipulating datasets using the pandas library. It demonstrates how to read CSV and Excel files, view data dimensions, select specific columns and rows, and filter data based on conditions, first on a Titanic passenger dataset (PassengerId, Survived status, and other personal details) and later on a BigMart sales dataset.

In [1]: #reading csv file

import pandas as pd
df= pd.read_csv("data.csv")
df.head()
Out[1]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Em
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN
In [2]: df=pd.read_excel("Copy of data.xlsx")
df.head()
Out[2]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Em
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN
In [3]: #import the required library
import pandas as pd

In [4]: #reading the dataset


df=pd.read_csv("data.csv")

In [5]: #seeing the dimension of the df dataframe


df.shape

Out[5]: (891, 12)

In [6]: #top 5 rows


df.head(5)
Out[6]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Em
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN
In [7]: #bottom 5 rows
df.tail()

Out[7]:
     PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch      Ticket   Fare Cabin Emba
886          887         0       2                     Montvila, Rev. Juozas    male  27.0      0      0      211536  13.00   NaN
887          888         1       1              Graham, Miss. Margaret Edith  female  19.0      0      0      112053  30.00   B42
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2  W./C. 6607  23.45   NaN
889          890         1       1                     Behr, Mr. Karl Howell    male  26.0      0      0      111369  30.00  C148
890          891         0       3                       Dooley, Mr. Patrick    male  32.0      0      0      370376   7.75   NaN
In [8]: df.tail(10)
Out[8]:
     PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin
881          882         0       3                        Markun, Mr. Johann    male  33.0      0      0            349257   7.8958   NaN
882          883         0       3              Dahlberg, Miss. Gerda Ulrika  female  22.0      0      0              7552  10.5167   NaN
883          884         0       2             Banfield, Mr. Frederick James    male  28.0      0      0  C.A./SOTON 34068  10.5000   NaN
884          885         0       3                    Sutehall, Mr. Henry Jr    male  25.0      0      0   SOTON/OQ 392076   7.0500   NaN
885          886         0       3      Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652  29.1250   NaN
886          887         0       2                     Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN
887          888         1       1              Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN
889          890         1       1                     Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148
890          891         0       3                       Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN

In [9]: #seeing the names of columns in dataframe


df.columns

Out[9]: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
               'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
              dtype='object')

In [10]: #selecting a single column


df["Pclass"]
Out[10]: 0 3
1 1
2 3
3 1
4 3
..
886 2
887 1
888 3
889 1
890 3
Name: Pclass, Length: 891, dtype: int64

In [11]: #seeing the dimension of dataframe

df.shape

Out[11]: (891, 12)

In [12]: #selecting a single column by name


df["Pclass"]
Out[12]: 0 3
1 1
2 3
3 1
4 3
..
886 2
887 1
888 3
889 1
890 3
Name: Pclass, Length: 891, dtype: int64

In [13]: #selecting multiple columns using column names


df[["PassengerId","Cabin","Age"]]
Out[13]:
     PassengerId Cabin   Age
0              1   NaN  22.0
1              2   C85  38.0
2              3   NaN  26.0
3              4  C123  35.0
4              5   NaN  35.0
..           ...   ...   ...
886          887   NaN  27.0
887          888   B42  19.0
888          889   NaN   NaN
889          890  C148  26.0
890          891   NaN  32.0

891 rows × 3 columns

In [14]: #selecting rows by their positions


df.iloc[:5]
Out[14]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Em
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN
In [15]: #selecting columns by their position
df.iloc[:,:2]

Out[15]:
     PassengerId  Survived
0              1         0
1              2         1
2              3         1
3              4         1
4              5         0
..           ...       ...
886          887         0
887          888         1
888          889         0
889          890         1
890          891         0

891 rows × 2 columns

In [16]: df[df["Age"]==15]
Out[16]:
     PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch  Ticket      Fare Cabin E
22            23         1       3               McGowan, Miss. Anna "Annie"  female  15.0      0      0  330923    8.0292   NaN
352          353         0       3                        Elias, Mr. Tannous    male  15.0      1      1    2695    7.2292   NaN
689          690         1       1         Madill, Miss. Georgette Alexandra  female  15.0      0      1   24160  211.3375    B5
830          831         1       3   Yasbeck, Mrs. Antoni (Selini Alexander)  female  15.0      1      0    2659   14.4542   NaN
875          876         1       3          Najib, Miss. Adele Kiamie "Jane"  female  15.0      0      0    2667    7.2250   NaN

In [17]: Df=pd.read_csv("index.csv")
Df
Out[17]:
     A  B   C   D  E  F      G      H       I     J  K  L
0    C  A   8   0  S  0   COIL  0.700   610.0     0  0  3
1    C  R   0   0  S  0   COIL  3.200   610.0     0  0  3
2    C  R   0   0  S  0  SHEET  0.700  1300.0   762  0  3
3    C  A   0  60  ?  0   COIL  2.801   385.1     0  0  3
4    C  A   0  60  ?  0  SHEET  0.801   255.0   269  0  3
..  .. ..  ..  .. .. ..    ...    ...     ...   ... .. ..
252  C  A   0  60  ?  0   COIL  0.799   609.0     0  0  3
253  C  R   6   0  ?  0  SHEET  1.000   610.0  4880  0  3
254  C  A   0  50  ?  0   COIL  0.600    20.0     0  0  3
255  C  K  45   0  ?  0   COIL  0.600   900.0     0  0  3
256  C  R   6   0  ?  0  SHEET  1.000   610.0   762  0  3

257 rows × 12 columns

In [18]: DF=Df[["C"]] # select the column named "C" (the third column) as a dataframe

In [19]: DF
Out[19]:
      C
0     8
1     0
2     0
3     0
4     0
..   ..
252   0
253   6
254   0
255  45
256   6

257 rows × 1 columns

In [20]: Df.iloc[:,-2:] #access the last two columns of the index dataframe using iloc
Out[20]:
     K  L
0    0  3
1    0  3
2    0  3
3    0  3
4    0  3
..  .. ..
252  0  3
253  0  3
254  0  3
255  0  3
256  0  3

257 rows × 2 columns

In [21]: Df.iloc[-10:,:2] #access the last 10 rows and first 2 columns of the index dataframe
Out[21]:
     A  B
247  C  R
248  C  A
249  C  A
250  C  R
251  C  A
252  C  A
253  C  R
254  C  A
255  C  K
256  C  R

Data Manipulation with pandas


pandas is the most widely used Python library for data science. It is incredibly helpful in manipulating
data so that you can derive better insights and build great machine learning models.

In this notebook, we will have a look at some of the intermediate concepts of working with pandas.
Table of contents
1. Sorting dataframes

2. Merging dataframes

In [22]: import pandas as pd


import numpy as np

# read the dataset

data_Bm=pd.read_csv("bigmart_data.csv")

#drop the null values


data_Bm=data_Bm.dropna(how="any")
#view the top results
data_Bm.head()
Out[22]:
  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP Outlet_Identif
0           FDA15        9.300          Low Fat         0.016047         Dairy  249.8092           OUT0
1           DRC01        5.920          Regular         0.019278   Soft Drinks   48.2692           OUT0
2           FDN15       17.500          Low Fat         0.016760          Meat  141.6180           OUT0
4           NCD19        8.930          Low Fat         0.000000     Household   53.8614           OUT0
5           FDP36       10.395          Regular         0.000000  Baking Goods   51.4008           OUT0

1. Sorting dataframes
pandas dataframes have two useful sorting functions:

. sort_values(): to sort a pandas dataframe by one or more columns

. sort_index(): to sort a pandas dataframe by its row index


Each of these functions comes with numerous options, like sorting the dataframe in a specific
order (ascending or descending), sorting in place, sorting with missing values, sorting with a specific
algorithm, etc.; a short sketch of the two less common options follows.
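A minimal sketch of those two options on the data_Bm frame loaded above (na_position and kind are standard sort_values parameters; note that the nulls were already dropped here, so na_position only matters on the raw file):

# keep rows with a missing Item_Weight at the top instead of the bottom
data_Bm.sort_values(by="Item_Weight", na_position="first")

# request a stable mergesort instead of the default quicksort
data_Bm.sort_values(by="Item_MRP", kind="mergesort")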

Suppose you want to sort the dataframe by Outlet_Establishment_Year; then you will use sort_values().

In [23]: #sort by year


sorted_data=data_Bm.sort_values(by="Outlet_Establishment_Year")
#print sorted data
sorted_data[:5]

Out[23]:
     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP Outlet_Ide
2812           FDR60        14.30          Low Fat         0.130307  Baking Goods   75.7328          O
5938           NCJ06        20.10          Low Fat         0.034624     Household  118.9782          O
3867           FDY38        13.60          Regular         0.119077         Dairy  231.2300          O
1307           FDB37        20.25          Regular         0.022922  Baking Goods  240.7538          O
5930           NCA18        10.10          Low Fat         0.056031     Household  115.1492          O


. Now sort_values() takes multiple options, like:

. ascending : the default sorting order is ascending; when you pass False here,
it sorts in descending order

. inplace : whether to do in-place sorting or not

In [24]: #sort in place and descending order


data_Bm.sort_values(by="Outlet_Establishment_Year", ascending=False, inplace=True)
#print sorted data
data_Bm[:5]
Out[24]:
     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP Outlet_Ide
2825           FDL16        12.85          Low Fat         0.169139        Frozen Foods   46.4060          O
7389           NCD42        16.50          Low Fat         0.012689  Health and Hygiene   39.7506          O
2165           DRJ39        20.25          Low Fat         0.036474               Dairy  218.3482          O
2162           FDR60        14.30          Low Fat         0.130946        Baking Goods   76.7328          O
2158           FDM58        16.85          Regular         0.080015         Snack Foods  111.8544          O

You might want to sort a dataframe based on the values of multiple columns. We can specify the
columns we want to sort by as a list in the argument to sort_values().

In [25]: #read the dataset


data_Bm=pd.read_csv("bigmart_data.csv")
#drop the null values
data_Bm=data_Bm.dropna(how="any")

#sort by multiple column


data_Bm.sort_values(by=["Outlet_Establishment_Year","Item_Outlet_Sales"],ascending=False)[:5]
Out[25]:
     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  Item_Type  Item_MRP Outlet_Ide
43             FDC02        21.35          Low Fat         0.069103     Canned  259.9278          O
2803           FDU51        20.20          Regular         0.096907       Meat  175.5028          O
641            FDY51        12.50          Low Fat         0.081465       Meat  220.7798          O
2282           NCX30        16.70          Low Fat         0.026729  Household  248.4776          O
2887           FDR25        17.00          Regular         0.140090     Canned  265.1884          O

Note that when sorting by multiple columns, pandas sort_values() sorts by the first variable first and the
second variable next.

We can see the difference by switching the order of the column names in the list.

In [26]: data_Bm.sort_values(by=["Item_Outlet_Sales","Outlet_Establishment_Year"],ascending=False,inplace=True)
data_Bm[:5]
Out[26]:
     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP Outlet_Ide
4888           FDF39       14.850          Regular         0.019495               Dairy  261.2910          O
4289           NCM05        6.825          Low Fat         0.059847  Health and Hygiene  262.5226          O
6409           FDA21       13.650          Low Fat         0.035931         Snack Foods  184.4924          O
4991           NCQ53       17.600          Low Fat         0.018905  Health and Hygiene  234.6590          O
5752           FDI15       13.800          Low Fat         0.141326               Dairy  265.0884          O

In [27]: data_Bm.head()
Out[27]:
     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP Outlet_Ide
4888           FDF39       14.850          Regular         0.019495               Dairy  261.2910          O
4289           NCM05        6.825          Low Fat         0.059847  Health and Hygiene  262.5226          O
6409           FDA21       13.650          Low Fat         0.035931         Snack Foods  184.4924          O
4991           NCQ53       17.600          Low Fat         0.018905  Health and Hygiene  234.6590          O
5752           FDI15       13.800          Low Fat         0.141326               Dairy  265.0884          O

In [28]: data_Bm
Out[28]:
     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP Outlet_Ide
4888           FDF39       14.850          Regular         0.019495               Dairy  261.2910          O
4289           NCM05        6.825          Low Fat         0.059847  Health and Hygiene  262.5226          O
6409           FDA21       13.650          Low Fat         0.035931         Snack Foods  184.4924          O
4991           NCQ53       17.600          Low Fat         0.018905  Health and Hygiene  234.6590          O
5752           FDI15       13.800          Low Fat         0.141326               Dairy  265.0884          O
...              ...          ...              ...              ...                 ...       ...
2711           NCO06       19.250          Low Fat         0.108030           Household   32.4558          O
1543           FDA39        6.320          Low Fat         0.012770                Meat   39.9822          O
1657           FDL52        6.635          Regular         0.046278        Frozen Foods   36.7506          O
2370           NCA06       20.500          Low Fat         0.143164           Household   34.9190          O
5573           NCE31        7.670          Low Fat         0.185597           Household   35.7216          O

4650 rows × 12 columns

. We can use sort_index() to sort a pandas dataframe by row index or names.

. In this example, the row indices are numbers, and in the earlier example we sorted the dataframe by
"Item_Outlet_Sales" and "Outlet_Establishment_Year", therefore the row indices are jumbled up.

. We can sort by row index (with the inplace=True option) and retrieve the original dataframe.

In [29]: #sort by index

data_Bm.sort_index(inplace=True)
data_Bm[:5]
Out[29]:
  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP Outlet_Identif
0           FDA15        9.300          Low Fat         0.016047         Dairy  249.8092           OUT0
1           DRC01        5.920          Regular         0.019278   Soft Drinks   48.2692           OUT0
2           FDN15       17.500          Low Fat         0.016760          Meat  141.6180           OUT0
4           NCD19        8.930          Low Fat         0.000000     Household   53.8614           OUT0
5           FDP36       10.395          Regular         0.000000  Baking Goods   51.4008           OUT0

Merging dataframes
. Joining and merging dataframes is the core process to start with data analysis and machine learning
tasks.

. It is one of the toolkits which every data analyst or data scientist should master, because in almost all
cases data comes from multiple sources and files.

. pandas has two useful functions for merging dataframes:


. concat()
. merge()

creating dummy data

In [30]: #create dummy data


df1=pd.DataFrame({'A':["A0","A1","A2","A3"],
'B': ["B0","B1","B2","B3"],
'C':["C0","C1","C2","C3"],
'D': ["D0","D1","D2","D3"]},
index=[0,1,2,3])

In [31]: #create dummy data


df2=pd.DataFrame({'A':["A4","A5","A6","A7"],
'B': ["B4","B5","B6","B7"],
'C':["C4","C5","C6","C7"],
'D': ["D4","D5","D6","D7"]},
index=[4,5,6,7])

In [32]: #create dummy data


df3=pd.DataFrame({'A':["A8","A9","A10","A11"],
'B': ["B8","B9","B10","B11"],
'C':["C8","C9","C10","C11"],
'D': ["D8","D9","D10","D11"]},
index=[8,9,10,11])

In [33]: #combine dataframes


result=pd.concat([df1,df2,df3])
result
Out[33]:
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

. pandas also provides you with an option to label the dataframes, after the concatenation, with a key so
that you know which data came from which dataframe.

. You can achieve this by passing an additional keys argument specifying the label names of the
dataframes in a list.
In [34]: #combine dataframes
result=pd.concat([df1,df2,df3],keys=['x','y','z'])
result

Out[34]:
        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11

. Mentioning the keys also makes it easy to retrieve data corresponding to a particular dataframe.
. You can retrieve the data of dataframe df2, which had the label y, by using the loc method.

In [35]: #get second dataframe


result.loc['y']

Out[35]:
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

. When gluing together multiple dataframes, you have a choice of how to handle the other axes (other
than the one being concatenated). This can be done in the following three ways:

. take the union of them all, join="outer"; this is the default option as it results in zero information loss

. take the intersection, join="inner"

. use a specific index, by reindexing the result (the old join_axes argument was removed in pandas 1.0; see the sketch below)

In [36]: df4=pd.DataFrame({'B':['B2','B3','B6','B7'],
'D':['D2','D3','D6','D7'],
'F':['F2','F3','F6','F7']},
index=[2,3,6,7])
result=pd.concat([df1,df4],axis=1,sort=False)
result

Out[36]: A B C D B D F

0 A0 B0 C0 D0 NaN NaN NaN

1 A1 B1 C1 D1 NaN NaN NaN

2 A2 B2 C2 D2 B2 D2 F2

3 A3 B3 C3 D3 B3 D3 F3

6 NaN NaN NaN NaN B6 D6 F6

7 NaN NaN NaN NaN B7 D7 F7

In [37]: #here is the same thing with join="inner":

In [38]: result=pd.concat([df1,df4],axis=1,join="inner")
result

Out[38]:
    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3

Lastly, suppose we just wanted to reuse the exact index from the original dataframe:
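A minimal sketch, assuming df1 and df4 from the cells above (reindex reproduces what the removed join_axes argument used to do):

# concatenate sideways, then align the result to df1's index
result = pd.concat([df1, df4], axis=1).reindex(df1.index)
result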
b. merge() for combining dataframes using SQL-like joins

. Another ubiquitous operation related to dataframes is the merging operation. Two dataframes might hold
different kinds of information about the same entity, linked by some common feature/column.

We can use merge() to combine such dataframes in pandas.

creating dummy data


In [39]: df_a=pd.DataFrame({
"subject_id":['1','2','3','4','5'],
"first_name": ["alex","amy","allen","alice","ayoung"],
"last_name":["anderson","ackerman","ali","aoni", "atiches"]})

df_b=pd.DataFrame({
"subject_id":["4","5","6","7","8"],
"first_name":["billy","brian","bran","bryce","betty"],
"last_name":["bonder","black","balwner","brice","btisan"]})

df_c=pd.DataFrame({
"subject_id": ['1','2','3','4','5','6','7','8','9','10','11'],
"test_id" :[51,15,15,61,16,14,15,1,61,16,61]})

In [40]: #let's start with the basic join: we want to combine df_a with df_c based on the subject_id column
pd.merge(df_a,df_c,on="subject_id")
Out[40]:
  subject_id first_name last_name  test_id
0          1       alex  anderson       51
1          2        amy  ackerman       15
2          3      allen       ali       15
3          4      alice      aoni       61
4          5     ayoung   atiches       16

merge with outer join


A full outer join produces the set of all records in table A and table B, with matching records from both
sides where available. If there is no match, the missing side will contain null.

In [41]: pd.merge(df_a,df_b,on="subject_id",how="outer")
Out[41]:
  subject_id first_name_x last_name_x first_name_y last_name_y
0          1         alex    anderson          NaN         NaN
1          2          amy    ackerman          NaN         NaN
2          3        allen         ali          NaN         NaN
3          4        alice        aoni        billy      bonder
4          5       ayoung     atiches        brian       black
5          6          NaN         NaN         bran     balwner
6          7          NaN         NaN        bryce       brice
7          8          NaN         NaN        betty      btisan

merge with inner join . The inner join produces only the set of records that match in both table A and
table B.


In [42]: pd.merge(df_a,df_b,on="subject_id",how="inner")
Out[42]:
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        alice        aoni        billy      bonder
1          5       ayoung     atiches        brian       black

merge with right join


A right outer join produces a complete set of records from table B, with matching records (where available)
in table A. If there is no match, the left side will contain null.

In [43]: pd.merge(df_a,df_b,on="subject_id",how="right")

Out[43]:
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        alice        aoni        billy      bonder
1          5       ayoung     atiches        brian       black
2          6          NaN         NaN         bran     balwner
3          7          NaN         NaN        bryce       brice
4          8          NaN         NaN        betty      btisan

merge with left join


. "A left outer join produces a complete set of records from table A, with the matching records (where
available) in table B. If there is no match, the right side will contain null."

In [44]: pd.merge(df_a,df_b,on="subject_id",how="left")

Out[44]:
  subject_id first_name_x last_name_x first_name_y last_name_y
0          1         alex    anderson          NaN         NaN
1          2          amy    ackerman          NaN         NaN
2          3        allen         ali          NaN         NaN
3          4        alice        aoni        billy      bonder
4          5       ayoung     atiches        brian       black


merge or concat : which to use when?

1. After learning both of these functions in detail, chances are that you might be confused about which to use
when.

2. One major difference is that merge() is used to combine dataframes on the basis of values of
common columns, while concat() is used to append dataframes

one below the other (or sideways, depending on whether the axis option is set to 0 or 1); a compact contrast follows this list.

3. Exact usage depends upon the kind of data you have and the analysis you want to perform.
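A minimal contrast, reusing the dummy frames created above:

# concat stacks frames along an axis; no key column is needed
pd.concat([df1, df2])                  # rows of df2 appended below df1

# merge joins frames on the values of a shared key column
pd.merge(df_a, df_b, on="subject_id")  # inner join on subject_id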

3. Apply function
. The apply function can be used to perform pre-processing/data manipulation on your data, both row-wise
and column-wise.

. It is a faster method than simply using a for loop over your dataframe; a small illustration follows this list.

. Almost every time I need to iterate over a dataframe or its rows and columns, I think of using
apply.

. Hence, it is widely used in feature engineering code.
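A rough sketch of the loop-vs-apply point, assuming the data_Bm frame loaded above (both lines compute the length of every Item_Identifier):

# explicit Python loop over the column's values
lengths_loop = [len(name) for name in data_Bm["Item_Identifier"]]

# the same result expressed with apply
lengths_apply = data_Bm["Item_Identifier"].apply(len)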


In [45]: #accessing row wise
data_Bm.apply(lambda x :x)
Out[45]:
     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP Outlet_Ide
0              FDA15        9.300          Low Fat         0.016047               Dairy  249.8092          O
1              DRC01        5.920          Regular         0.019278         Soft Drinks   48.2692          O
2              FDN15       17.500          Low Fat         0.016760                Meat  141.6180          O
4              NCD19        8.930          Low Fat         0.000000           Household   53.8614          O
5              FDP36       10.395          Regular         0.000000        Baking Goods   51.4008          O
...              ...          ...              ...              ...                 ...       ...
8517           FDF53       20.750              reg         0.083607        Frozen Foods  178.8318          O
8518           FDF22        6.865          Low Fat         0.056783         Snack Foods  214.5218          O
8520           NCJ29       10.600          Low Fat         0.035186  Health and Hygiene   85.1224          O
8521           FDN46        7.210          Regular         0.145221         Snack Foods  103.1332          O
8522           DRG01       14.800          Low Fat         0.044878         Soft Drinks   75.4670          O

4650 rows × 12 columns

In [46]: #access first row


data_Bm.apply(lambda x: x[0])

Out[46]: Item_Identifier                          FDA15
         Item_Weight                                9.3
         Item_Fat_Content                       Low Fat
         Item_Visibility                       0.016047
         Item_Type                                Dairy
         Item_MRP                              249.8092
         Outlet_Identifier                       OUT049
         Outlet_Establishment_Year                 1999
         Outlet_Size                             Medium
         Outlet_Location_Type                    Tier 1
         Outlet_Type                  Supermarket Type1
         Item_Outlet_Sales                     3735.138
         dtype: object

In [47]: #access the first element of each row (i.e. the first column)


data_Bm.apply(lambda x: x[0],axis=1)
C:\Users\satya shukla\AppData\Local\Temp\ipykernel_8964\2295982325.py:2: FutureWarning: Serie
s.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will
always be treated as labels (consistent with DataFrame behavior). To access a value by positio
n, use `ser.iloc[pos]`
data_Bm.apply(lambda x: x[0],axis=1)
Out[47]: 0 FDA15
1 DRC01
2 FDN15
4 NCD19
5 FDP36
...
8517 FDF53
8518 FDF22
8520 NCJ29
8521 FDN46
8522 DRG01
Length: 4650, dtype: object

. You can also use apply to implement a condition individually on every row/column of your dataframe.

. Suppose you want to clip Item_MRP to 200 and not consider any value greater than that:

def clip_price(price):
    if price > 200:
        price = 200
    return price
In [48]: #before clipping

data_Bm["Item_MRP"] [:5]

Out[48]: 0 249.8092
1 48.2692
2 141.6180
4 53.8614
5 51.4008
Name: Item_MRP, dtype: float64

In [51]: #clip price if it is greater than 200

def clip_price(price):
if price>200:
price=200
return price

#after clipping
data_Bm["Item_MRP"].apply(lambda x: clip_price(x))[:5]

Out[51]: 0 200.0000
1 48.2692
2 141.6180
4 53.8614
5 51.4008
Name: Item_MRP, dtype: float64
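For this particular transformation, pandas also offers a vectorised built-in, Series.clip, which avoids calling a Python function per row and gives the same first five values:

data_Bm["Item_MRP"].clip(upper=200)[:5]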

In [52]: data_Bm["Outlet_Location_Type"][:5]
Out[52]: 0 Tier 1
1 Tier 3
2 Tier 1
4 Tier 3
5 Tier 3
Name: Outlet_Location_Type, dtype: object

In [53]: #label encode city type


def label_encode(city):
if city=='Tier 1':
label=0
elif city=='Tier 2' :
label=1
else:
label=2
return label

#operate label encode on every row of outlet location type

data_Bm["Outlet_Location_Type"]=data_Bm["Outlet_Location_Type"].apply(label_encode)

In [54]: data_Bm["Outlet_Location_Type"][:5]

Out[54]: 0 0
1 2
2 0
4 2
5 2
Name: Outlet_Location_Type, dtype: int64
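For a simple category-to-integer lookup like this one, Series.map with a dictionary is a common alternative to apply; a sketch, assuming it is run on the original string column before the encoding above (unlike the else branch, unmapped city types would become NaN):

tier_labels = {"Tier 1": 0, "Tier 2": 1, "Tier 3": 2}
data_Bm["Outlet_Location_Type"].map(tier_labels)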
In [55]: #drop the null values (note: dropna returns a new frame, so assign the result)
data_Bm=data_Bm.dropna(how="any")
#reset index after dropping
data_Bm=data_Bm.reset_index(drop=True)
#view the top results
data_Bm.head()

Out[55]:
  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP Outlet_Identif
0           FDA15        9.300          Low Fat         0.016047         Dairy  249.8092           OUT0
1           DRC01        5.920          Regular         0.019278   Soft Drinks   48.2692           OUT0
2           FDN15       17.500          Low Fat         0.016760          Meat  141.6180           OUT0
3           NCD19        8.930          Low Fat         0.000000     Household   53.8614           OUT0
4           FDP36       10.395          Regular         0.000000  Baking Goods   51.4008           OUT0

1. Aggregating data
There are multiple functions that can be used to perform useful aggregations on data in pandas:
. groupby

. crosstab

. pivot_table

A. What is the mean price of each item type? : groupby

. In the given dataset, I want to find out the mean price for each item type. You can use
groupby() to achieve this.

. The first step would be to group the data by the Item_Type column.

In [57]: #group price based on item type

price_by_item=data_Bm.groupby("Item_Type")

#display first few rows

price_by_item.first()
Out[57]:
                       Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  Item_MRP Outlet_Identifier
Item_Type
Baking Goods                     FDP36       10.395          Regular         0.000000   51.4008            OUT018
Breads                           FDW11       12.600          Low Fat         0.048981   61.9194            OUT018
Breakfast                        FDP49        9.000          Regular         0.069089   56.3614            OUT046
Canned                           FDC02       21.350          Low Fat         0.069103  259.9278            OUT018
Dairy                            FDA15        9.300          Low Fat         0.016047  249.8092            OUT049
Frozen Foods                     FDR28       13.850          Regular         0.025896  165.0210            OUT046
Fruits and Vegetables            FDY07       11.800          Low Fat         0.000000   45.5402            OUT049
Hard Drinks                      DRJ59       11.650          low fat         0.019356   39.1164            OUT013
Health and Hygiene               NCB42       11.800          Low Fat         0.008596  115.3492            OUT018
Household                        NCD19        8.930          Low Fat         0.000000   53.8614            OUT013
Meat                             FDN15       17.500          Low Fat         0.016760  141.6180            OUT049
Others                           NCM43       14.500          Low Fat         0.019472  164.8210            OUT035
Seafood                          FDH21       10.395          Low Fat         0.031274  160.0604            OUT049
Snack Foods                      FDO10       13.650          Regular         0.012741   57.6588            OUT013
Soft Drinks                      DRC01        5.920          Regular         0.019278   48.2692            OUT018
Starchy Foods                    FDB11       16.000          Low Fat         0.060837  226.8404            OUT035

. Now that you have grouped by Item_Type, the next step is to calculate the mean of Item_MRP.

In [58]: #mean price by item


price_by_item.Item_MRP.mean()

Out[58]: Item_Type
Baking Goods 125.795653
Breads 141.300639
Breakfast 134.090683
Canned 138.551179
Dairy 149.481471
Frozen Foods 140.095830
Fruits and Vegetables 145.418257
Hard Drinks 140.102908
Health and Hygiene 131.437324
Household 149.884244
Meat 140.279344
Others 137.640870
Seafood 146.595782
Snack Foods 147.569955
Soft Drinks 130.910182
Starchy Foods 151.256747
Name: Item_MRP, dtype: float64
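groupby also supports several aggregations at once via .agg; a minimal sketch on the same grouping:

# mean, minimum and maximum price per item type in one call
price_by_item["Item_MRP"].agg(["mean", "min", "max"])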

You can use groupby with multiple columns of the dataset too.

In this case, if you want to group first by Item_Type and then by Item_Fat_Content, you can simply pass a
list of column names.

In [59]: #group on multiple columns


multiple_groups=data_Bm[:10].groupby(["Item_Type","Item_Fat_Content"])
multiple_groups.first()
Out[59]:
                                       Item_Identifier  Item_Weight  Item_Visibility  Item_MRP Outlet_Identifier
Item_Type             Item_Fat_Content
Baking Goods          Regular                    FDP36       10.395         0.000000   51.4008            OUT018
Dairy                 Low Fat                    FDA15        9.300         0.016047  249.8092            OUT049
                      Regular                    FDA03       18.500         0.045464  144.1102            OUT046
Fruits and Vegetables Low Fat                    FDY07       11.800         0.000000   45.5402            OUT049
                      Regular                    FDX32       15.100         0.100014  145.4786            OUT049
Household             Low Fat                    NCD19        8.930         0.000000   53.8614            OUT013
Meat                  Low Fat                    FDN15       17.500         0.016760  141.6180            OUT049
Snack Foods           Regular                    FDO10       13.650         0.012741   57.6588            OUT013
Soft Drinks           Regular                    DRC01        5.920         0.019278   48.2692            OUT018
B. How are outlet sizes distributed based on the city type? : crosstab

. This function is used to get an initial "feel" (view) of the data; here, we can validate some basic
hypotheses.

. For example, in this case, "Outlet_Location_Type" is expected to affect "Outlet_Size" significantly. This
can be tested using cross-tabulation, as shown below.

In [61]: #generate crosstab of outlet_size and outlet_location_type


pd.crosstab(data_Bm["Outlet_Size"],data_Bm["Outlet_Location_Type"],margins=True)

Out[61]:
Outlet_Location_Type     0    1     2   All
Outlet_Size
High                     0    0   932   932
Medium                 930    0   928  1858
Small                  930  930     0  1860
All                   1860  930  1860  4650

If you notice, the above crosstab holds interesting insights: for example, medium-size outlets are split roughly
50/50 between city types 0 and 2 (Tier 1 and Tier 3).

. Another counter-intuitive thing to notice is that the High outlet size is present only in Tier 3 cities, though the
general assumption would be towards Tier 1 cities having
the larger outlet sizes.
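crosstab can also report proportions instead of raw counts through its normalize parameter, which makes such comparisons easier to read at a glance (a sketch):

# share of each outlet size within every city type
pd.crosstab(data_Bm["Outlet_Size"], data_Bm["Outlet_Location_Type"], normalize="columns")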

C. How are the sales changing per year? : pivot_table


. pandas can be used to create MS Excel-style pivot tables.

. The fun thing about the pandas pivot table is that you can get another point of view on your data with only one
line of code.

. Most of the pivot_table parameters use default values, so the only mandatory parameters you must add
are data and index.

. data is self-explanatory: it's the dataframe you would like to use.

. index is the column, grouper, array (or list of the previous) you would like to group your data by.

. values (optional) is the column you would like to aggregate. If you do not specify this, then the function
will aggregate all numeric columns.

In [63]: #create pivot table


pd.pivot_table(data_Bm,index=["Outlet_Establishment_Year"],values="Item_Outlet_Sales")
Out[63]:
                           Item_Outlet_Sales
Outlet_Establishment_Year
1987                             2298.995256
1997                             2277.844267
1999                             2348.354635
2004                             2438.841866
2009                             1995.498739

. In the above example, the mean sales for each year are shown. . You can also pass multiple columns to
the pivot table; in the next example we try to see mean sales not just by the year but also taking into
account the outlet size and the type of the city.

In [65]: #create pivot table


pd.pivot_table(data_Bm, index=["Outlet_Establishment_Year","Outlet_Location_Type","Outlet_Size"],values="Item_Outlet_Sales")
Out[65]:
                                                              Item_Outlet_Sales
Outlet_Establishment_Year Outlet_Location_Type Outlet_Size
1987                      2                    High                 2298.995256
1997                      0                    Small                2277.844267
1999                      0                    Medium               2348.354635
2004                      1                    Small                2438.841866
2009                      2                    Medium               1995.498739



In [68]: #create pivot table
pd.pivot_table(data_Bm, index=["Outlet_Establishment_Year","Outlet_Location_Type","Outlet_Size"],values="Item_Outlet_Sales", aggfunc=[np.mean,np.median,min,max,np.std])
C:\Users\satya shukla\AppData\Local\Temp\ipykernel_8964\2234684528.py:2: FutureWarning: The provided callable <function mean at 0x000001F20CC9D300> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
(the same FutureWarning is raised for np.median, min, max and np.std)
Out[68]:
                                                                           mean            median  ...
                                                              Item_Outlet_Sales Item_Outlet_Sales  ...
Outlet_Establishment_Year Outlet_Location_Type Outlet_Size
1987                      2                    High                 2298.995256         2050.6640
1997                      0                    Small                2277.844267         1945.8005
1999                      0                    Medium               2348.354635         1966.1074
2004                      1                    Small                2438.841866         2109.2544
2009                      2                    Medium               1995.498739         1655.1788

You can also perform multiple aggregations like mean, median, min, max, etc. in a pivot table by using the
aggfunc parameter; as the FutureWarnings above suggest, newer pandas prefers the string form shown below.
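The same multi-aggregation pivot, rewritten with string names to silence the FutureWarnings (a sketch):

pd.pivot_table(data_Bm,
               index=["Outlet_Establishment_Year","Outlet_Location_Type","Outlet_Size"],
               values="Item_Outlet_Sales",
               aggfunc=["mean", "median", "min", "max", "std"])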

