import pandas as pd
df = pd.read_csv("data.csv")
df.head()
Out[1]:    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin  ...
0                    1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN  ...
1                    2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85  ...
2                    3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN  ...
3                    4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123  ...
4                    5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN  ...
In [2]: df = pd.read_excel("Copy of data.xlsx")
        df.head()
Out[2]:    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin  ...
0                    1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN  ...
1                    2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85  ...
2                    3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN  ...
3                    4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123  ...
4                    5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN  ...
In [3]: #import the required library
import pandas as pd
In [7]: # bottom 5 rows
        df.tail()
Out[7]:    PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch      Ticket   Fare Cabin  ...
886            887         0       2                     Montvila, Rev. Juozas    male  27.0      0      0      211536  13.00   NaN  ...
887            888         1       1              Graham, Miss. Margaret Edith  female  19.0      0      0      112053  30.00   B42  ...
888            889         0       3  Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2  W./C. 6607  23.45   NaN  ...
889            890         1       1                     Behr, Mr. Karl Howell    male  26.0      0      0      111369  30.00  C148  ...
890            891         0       3                       Dooley, Mr. Patrick    male  32.0      0      0      370376   7.75   NaN  ...
In [8]: df.tail(10)
Out[8]:    PassengerId  Survived  Pclass                                      Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin  ...
881            882         0       3                        Markun, Mr. Johann    male  33.0      0      0            349257   7.8958   NaN  ...
882            883         0       3              Dahlberg, Miss. Gerda Ulrika  female  22.0      0      0              7552  10.5167   NaN  ...
883            884         0       2             Banfield, Mr. Frederick James    male  28.0      0      0  C.A./SOTON 34068  10.5000   NaN  ...
884            885         0       3                    Sutehall, Mr. Henry Jr    male  25.0      0      0   SOTON/OQ 392076   7.0500   NaN  ...
885            886         0       3      Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652  29.1250   NaN  ...
886            887         0       2                     Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN  ...
887            888         1       1              Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42  ...
888            889         0       3  Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN  ...
889            890         1       1                     Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148  ...
890            891         0       3                       Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN  ...
df.shape
(891, 12)
In [15]: # selecting columns by their position
         df.iloc[:, :2]
Out[15]:     PassengerId  Survived
0                      1         0
1                      2         1
2                      3         1
3                      4         1
4                      5         0
..                   ...       ...
886                  887         0
887                  888         1
888                  889         0
889                  890         1
890                  891         0
In [16]: df[df["Age"]==15]
Out[16]:    PassengerId  Survived  Pclass                                     Name     Sex   Age  SibSp  Parch  Ticket      Fare Cabin  ...
22              23         1       3              McGowan, Miss. Anna "Annie"  female  15.0      0      0  330923    8.0292   NaN  ...
352            353         0       3                       Elias, Mr. Tannous    male  15.0      1      1    2695    7.2292   NaN  ...
689            690         1       1        Madill, Miss. Georgette Alexandra  female  15.0      0      1   24160  211.3375    B5  ...
830            831         1       3  Yasbeck, Mrs. Antoni (Selini Alexander)  female  15.0      1      0    2659   14.4542   NaN  ...
875            876         1       3         Najib, Miss. Adele Kiamie "Jane"  female  15.0      0      0    2667    7.2250   NaN  ...
In [17]: Df = pd.read_csv("index.csv")
         Df
Out[17]: A B C D E F G H I J K L
... ... ... ... ... ... ... ... ... ... ... ... ...
In [19]: Df[["C"]]
Out[19]: C
0 8
1 0
2 0
3 0
4 0
... ...
252 0
253 6
254 0
255 45
256 6
In [20]: Df.iloc[:, -2:]  # access the last two columns of the index dataframe using iloc
Out[20]:      K  L
0             0  3
1             0  3
2             0  3
3             0  3
4             0  3
..           .. ..
252           0  3
253           0  3
254           0  3
255           0  3
256           0  3
In [21]: Df.iloc[-10:, :2]  # access the last 10 rows and first 2 columns of the index dataframe
Out[21]: A B
247 C R
248 C A
249 C A
250 C R
251 C A
252 C A
253 C R
254 C A
255 C K
256 C R
In this notebook, we will have a look at some of the intermediate concepts of working with pandas.
Table of contents
1. Sorting dataframes
2. Merging dataframes
3. Apply function
4. Aggregating data
data_Bm = pd.read_csv("bigmart_data.csv")

      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP Outlet_Identifier  ...
1               DRC01        5.920          Regular         0.019278   Soft Drinks   48.2692            OUT018  ...
5               FDP36       10.395          Regular         0.000000  Baking Goods   51.4008            OUT018  ...
1. Sorting dataframes
. a pandas dataframe has two useful sorting functions: sort_values() and sort_index().
. suppose you want to sort the dataframe by Outlet_Establishment_Year; then you will use sort_values().
      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP  ...
2812            FDR60        14.30          Low Fat         0.130307  Baking Goods   75.7328  ...
1307            FDB37        20.25          Regular         0.022922  Baking Goods  240.7538  ...
. ascending: the default sorting order is ascending; when you pass False here, it sorts in descending order.
      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP  ...
2825            FDL16        12.85          Low Fat         0.169139        Frozen Foods   46.4060  ...
7389            NCD42        16.50          Low Fat         0.012689  Health and Hygiene   39.7506  ...
2162            FDR60        14.30          Low Fat         0.130946        Baking Goods   76.7328  ...
2158            FDM58        16.85          Regular         0.080015         Snack Foods  111.8544  ...
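The two behaviours described above can be sketched on a toy frame (hypothetical values standing in for the BigMart data):

```python
import pandas as pd

# Toy frame with hypothetical values standing in for data_Bm
toy = pd.DataFrame({
    "Outlet_Establishment_Year": [2004, 1987, 1999],
    "Item_MRP": [75.73, 240.75, 46.41],
})

# Default: ascending sort by the given column
by_year = toy.sort_values(by="Outlet_Establishment_Year")

# ascending=False sorts in descending order
by_year_desc = toy.sort_values(by="Outlet_Establishment_Year", ascending=False)

print(by_year["Outlet_Establishment_Year"].tolist())       # [1987, 1999, 2004]
print(by_year_desc["Outlet_Establishment_Year"].tolist())  # [2004, 1999, 1987]
```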
You might want to sort a dataframe based on the values of multiple columns. We can specify the columns to sort by as a list in the by argument of sort_values().
Note that when sorting by multiple columns, pandas sort_values() sorts by the first variable first and the second variable next.
We can see the difference by switching the order of the column names in the list.
In [26]: data_Bm.sort_values(by=["Item_Outlet_Sales","Outlet_Establishment_Year"], ascending=False, inplace=True)
         data_Bm[:5]
Out[26]:      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP  ...
4289            NCM05        6.825          Low Fat         0.059847  Health and Hygiene  262.5226  ...
6409            FDA21       13.650          Low Fat         0.035931         Snack Foods  184.4924  ...
4991            NCQ53       17.600          Low Fat         0.018905  Health and Hygiene  234.6590  ...
In [27]: data_Bm.head()
Out[27]:      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP  ...
4289            NCM05        6.825          Low Fat         0.059847  Health and Hygiene  262.5226  ...
6409            FDA21       13.650          Low Fat         0.035931         Snack Foods  184.4924  ...
4991            NCQ53       17.600          Low Fat         0.018905  Health and Hygiene  234.6590  ...
In [28]: data_Bm
Out[28]:      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP  ...
4289            NCM05        6.825          Low Fat         0.059847  Health and Hygiene  262.5226  ...
6409            FDA21       13.650          Low Fat         0.035931         Snack Foods  184.4924  ...
4991            NCQ53       17.600          Low Fat         0.018905  Health and Hygiene  234.6590  ...
...               ...          ...              ...              ...                 ...       ...  ...
1657            FDL52        6.635          Regular         0.046278        Frozen Foods   36.7506  ...
. we can use sort_index() to sort a pandas dataframe by its row index or index names.
. in this example the row index is numeric, and in the earlier example we sorted the dataframe by "Item_Outlet_Sales" and "Outlet_Establishment_Year"; therefore the row index is jumbled up.
. we can sort by the row index (with the inplace=True option) and retrieve the original dataframe order.
In [29]: data_Bm.sort_index(inplace=True)
         data_Bm[:5]
Out[29]:      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP  ...
1               DRC01        5.920          Regular         0.019278   Soft Drinks   48.2692  ...
5               FDP36       10.395          Regular         0.000000  Baking Goods   51.4008  ...
2. Merging dataframes
. joining and merging dataframes is the core process to start with for data analysis and machine learning tasks.
. it is one of the toolkits that every data analyst or data scientist should master, because in almost all cases data comes from multiple sources and files.
a. concat() for appending dataframes
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
8  A8  B8  C8  D8
9  A9  B9  C9  D9
. pandas also provides you with an option to label the dataframes, after the concatenation, with a key so that you know which data came from which dataframe.
. you can achieve this by passing an additional keys argument specifying the label names of the dataframes in a list.
In [34]: #combine dataframes
result=pd.concat([df1,df2,df3],keys=['x','y','z'])
result
Out[34]: A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
z 8 A8 B8 C8 D8
9 A9 B9 C9 D9
. mentioning the keys also makes it easy to retrieve data corresponding to a particular dataframe.
. you can retrieve the data of dataframe df2, which had the label y, by using the loc method.
In [35]: result.loc['y']
Out[35]:    A   B   C   D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
. when gluing together multiple dataframes, you have a choice of how to handle the other axes (other than the one being concatenated). this can be done in the following ways:
. take the union of them all, join="outer". this is the default option as it results in zero information loss.
. take the intersection, join="inner".
In [36]: df4 = pd.DataFrame({'B':['B2','B3','B6','B7'],
                             'D':['D2','D3','D6','D7'],
                             'F':['F2','F3','F6','F7']},
                            index=[2,3,6,7])
         result = pd.concat([df1, df4], axis=1, sort=False)
         result
Out[36]:     A    B    C    D    B    D    F
0           A0   B0   C0   D0  NaN  NaN  NaN
1           A1   B1   C1   D1  NaN  NaN  NaN
2           A2   B2   C2   D2   B2   D2   F2
3           A3   B3   C3   D3   B3   D3   F3
6          NaN  NaN  NaN  NaN   B6   D6   F6
7          NaN  NaN  NaN  NaN   B7   D7   F7
In [38]: result=pd.concat([df1,df4],axis=1,join="inner")
result
Out[38]: A B C D B D F
2 A2 B2 C2 D2 B2 D2 F2
3 A3 B3 C3 D3 B3 D3 F3
lastly, suppose we just wanted to reuse the exact index from the original dataframe:
b. merge() for combining dataframes using SQL-like joins
. another ubiquitous operation related to dataframes is merging. two dataframes might hold different kinds of information about the same entity, linked by some common feature/column.
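The frame df_a used below is defined in a cell that did not survive extraction; from the merge output in Out[40], it presumably contains:

```python
import pandas as pd

# Hypothetical reconstruction of df_a, inferred from the merge output below:
# subject ids 1-5 with the matching first and last names
df_a = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4", "5"],
    "first_name": ["alex", "amy", "allen", "alice", "ayoung"],
    "last_name": ["anderson", "ackerman", "ali", "aoni", "atiches"],
})
print(df_a.shape)  # (5, 3)
```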
df_b = pd.DataFrame({
    "subject_id": ["4","5","6","7","8"],
    "first_name": ["billy","brian","bran","bryce","betty"],
    "last_name": ["bonder","black","balwner","brice","btisan"]})
df_c = pd.DataFrame({
    "subject_id": ['1','2','3','4','5','6','7','8','9','10','11'],
    "test_id": [51,15,15,61,16,14,15,1,61,16,61]})
In [40]: # let's start with a basic join: combine df_a with df_c based on the subject_id column
pd.merge(df_a,df_c,on="subject_id")
Out[40]: subject_id first_name last_name test_id
0 1 alex anderson 51
1 2 amy ackerman 15
2 3 allen ali 15
3 4 alice aoni 61
4 5 ayoung atiches 16
In [41]: pd.merge(df_a,df_b,on="subject_id",how="outer")
Out[41]:   subject_id first_name_x last_name_x first_name_y last_name_y
0           1         alex     anderson          NaN         NaN
1           2          amy     ackerman          NaN         NaN
2           3        allen          ali          NaN         NaN
3           4        alice         aoni        billy      bonder
4           5       ayoung      atiches        brian       black
5           6          NaN          NaN         bran     balwner
6           7          NaN          NaN        bryce       brice
7           8          NaN          NaN        betty      btisan
merge with inner join
. the inner join produces only the set of records that match in both table A and table B.
In [42]: pd.merge(df_a,df_b,on="subject_id",how="inner")
Out[42]:   subject_id first_name_x last_name_x first_name_y last_name_y
0           4        alice         aoni        billy      bonder
1           5       ayoung      atiches        brian       black
In [43]: pd.merge(df_a,df_b,on="subject_id",how="right")
In [44]: pd.merge(df_a,df_b,on="subject_id",how="left")
1. after learning both of these functions in detail, chances are you might be confused about which to use when.
2. one major difference is that merge() is used to combine dataframes on the basis of values of common columns, while concat() is used to append dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
3. exact usage depends upon the kind of data you have and the analysis you want to perform.
3. Apply function
. the apply function can be used to perform pre-processing/data manipulation on your data, both row-wise and column-wise.
. it is a faster method than simply using a for loop over your dataframe.
. almost every time I need to iterate over a dataframe or its rows and columns, I will think of using apply.
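A minimal sketch of both directions, using a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({"Item_MRP": [249.8, 48.3], "Item_Weight": [10.0, 5.0]})

# column-wise: apply a function element by element to one column
capped = toy["Item_MRP"].apply(lambda p: min(p, 200.0))

# row-wise: axis=1 passes each row to the function as a Series
price_per_kg = toy.apply(lambda row: row["Item_MRP"] / row["Item_Weight"], axis=1)

print(capped.tolist())  # [200.0, 48.3]
print(price_per_kg.tolist())
```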
      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility           Item_Type  Item_MRP  ...
1               DRC01        5.920          Regular         0.019278         Soft Drinks   48.2692  ...
5               FDP36       10.395          Regular         0.000000        Baking Goods   51.4008  ...
...               ...          ...              ...              ...                 ...       ...  ...
8517            FDF53       20.750              reg         0.083607        Frozen Foods  178.8318  ...
8518            FDF22        6.865          Low Fat         0.056783         Snack Foods  214.5218  ...
8520            NCJ29       10.600          Low Fat         0.035186  Health and Hygiene   85.1224  ...
8521            FDN46        7.210          Regular         0.145221         Snack Foods  103.1332  ...
8522            DRG01       14.800          Low Fat         0.044878         Soft Drinks   75.4670  ...
. you can also use apply to implement a condition individually on every row/column of your dataframe.
. suppose you want to clip Item_MRP at 200 and not consider any value greater than that.
         def clip_price(price):
             if price > 200:
                 price = 200
             return price
In [48]: # before clipping
         data_Bm["Item_MRP"][:5]
Out[48]: 0 249.8092
1 48.2692
2 141.6180
4 53.8614
5 51.4008
Name: Item_MRP, dtype: float64
         def clip_price(price):
             if price > 200:
                 price = 200
             return price

In [51]: # after clipping
         data_Bm["Item_MRP"].apply(lambda x: clip_price(x))[:5]
Out[51]: 0 200.0000
1 48.2692
2 141.6180
4 53.8614
5 51.4008
Name: Item_MRP, dtype: float64
In [52]: data_Bm["Outlet_Location_Type"][:5]
Out[52]: 0 Tier 1
1 Tier 3
2 Tier 1
4 Tier 3
5 Tier 3
Name: Outlet_Location_Type, dtype: object
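The label_encode function used in the next cell is defined in a cell not shown above; a hypothetical reconstruction consistent with Out[52] and Out[54] (Tier 1 → 0, Tier 3 → 2):

```python
# Hypothetical reconstruction of the missing label_encode helper;
# the mapping Tier 1 -> 0, Tier 2 -> 1, Tier 3 -> 2 is inferred from the outputs
def label_encode(tier):
    mapping = {"Tier 1": 0, "Tier 2": 1, "Tier 3": 2}
    return mapping[tier]

print(label_encode("Tier 1"), label_encode("Tier 3"))  # 0 2
```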
In [53]: data_Bm["Outlet_Location_Type"] = data_Bm["Outlet_Location_Type"].apply(label_encode)
In [54]: data_Bm["Outlet_Location_Type"][:5]
Out[54]: 0 0
1 2
2 0
4 2
5 2
Name: Outlet_Location_Type, dtype: int64
In [55]: # drop rows with missing values
         data_Bm = data_Bm.dropna(how="any")
#reset index after dropping
data_Bm=data_Bm.reset_index(drop=True)
#view the top results
data_Bm.head()
      Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility     Item_Type  Item_MRP  ...
1               DRC01        5.920          Regular         0.019278   Soft Drinks   48.2692  ...
4               FDP36       10.395          Regular         0.000000  Baking Goods   51.4008  ...
4. Aggregating data
there are multiple functions that can be used to perform useful aggregations on data in pandas:
. groupby
. crosstab
. pivot_table
A. what is the mean price for each item type ? : groupby
. in the given dataset, I want to find out the mean price for each item type. you can use groupby() to achieve this.
In [57]: price_by_item = data_Bm.groupby("Item_Type")
         price_by_item.first()
Out[57]:                       Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  Item_MRP Outlet_Identifier
Item_Type
Baking Goods                            FDP36       10.395          Regular         0.000000   51.4008            OUT018
Frozen Foods                            FDR28       13.850          Regular         0.025896  165.0210            OUT046
Fruits and Vegetables                   FDY07       11.800          Low Fat         0.000000   45.5402            OUT049
Hard Drinks                             DRJ59       11.650          low fat         0.019356   39.1164            OUT013
Health and Hygiene                      NCB42       11.800          Low Fat         0.008596  115.3492            OUT018
...                                       ...          ...              ...              ...       ...               ...
Snack Foods                             FDO10       13.650          Regular         0.012741   57.6588            OUT013
Soft Drinks                             DRC01        5.920          Regular         0.019278   48.2692            OUT018
Starchy Foods                           FDB11       16.000          Low Fat         0.060837  226.8404            OUT035
. now that you have grouped by Item_Type, the next step is to calculate the mean of Item_MRP for each group.
In [58]: price_by_item["Item_MRP"].mean()
Out[58]: Item_Type
Baking Goods 125.795653
Breads 141.300639
Breakfast 134.090683
Canned 138.551179
Dairy 149.481471
Frozen Foods 140.095830
Fruits and Vegetables 145.418257
Hard Drinks 140.102908
Health and Hygiene 131.437324
Household 149.884244
Meat 140.279344
Others 137.640870
Seafood 146.595782
Snack Foods 147.569955
Soft Drinks 130.910182
Starchy Foods 151.256747
Name: Item_MRP, dtype: float64
you can use groupby with multiple columns of the dataset too.
in this case, if you want to group first by Item_Type and then by Item_Fat_Content, you can simply pass a list of column names.
                               Item_Identifier  Item_Weight  Item_Visibility  Item_MRP Outlet_Identifier
Item_Type    Item_Fat_Content
Baking Goods Regular                     FDP36       10.395         0.000000   51.4008            OUT018
...
Snack Foods  Regular                     FDO10       13.650         0.012741   57.6588            OUT013
Soft Drinks  Regular                     DRC01        5.920         0.019278   48.2692            OUT018
B. how are outlet sizes distributed based on the city type ? : crosstab
. this function is used to get an initial "feel" (view) of the data. here, we can validate some basic hypotheses.
. for example, in this case "Outlet_Location_Type" is expected to affect "Outlet_Size" significantly. this can be tested using cross-tabulation as shown below.
(crosstab of Outlet_Location_Type vs Outlet_Size)
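The crosstab call is not shown in the extracted cells; a minimal sketch on a toy frame (hypothetical values), where the real call would presumably be pd.crosstab(data_Bm["Outlet_Location_Type"], data_Bm["Outlet_Size"]):

```python
import pandas as pd

# Toy stand-in for data_Bm with hypothetical tier/size values
toy = pd.DataFrame({
    "Outlet_Location_Type": ["Tier 1", "Tier 1", "Tier 3", "Tier 3", "Tier 2"],
    "Outlet_Size": ["Small", "Medium", "High", "Medium", "Small"],
})

# crosstab counts how often each (location, size) pair occurs
table = pd.crosstab(toy["Outlet_Location_Type"], toy["Outlet_Size"])
print(table.loc["Tier 1", "Medium"])  # 1
print(table.loc["Tier 3", "High"])    # 1
```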
. if you notice the above crosstab, there are interesting insights: for example, 50% of medium-size outlets are present only in either Tier 1 or Tier 2 cities.
. another counter-intuitive thing to notice is that the largest outlet size is present only in Tier 3 cities, though the general assumption would be towards Tier 1 cities having larger outlet sizes.
. the fun thing about the pandas pivot table is that you can get another point of view on your data with only one line of code.
. most of the pivot_table parameters use default values, so the only mandatory parameters you must add are data and index.
. index is the column, grouper, or array (or list of the previous) you would like to group your data by.
. values (optional) is the column you would like to aggregate. if you do not specify this, the function will aggregate all numeric columns.
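A minimal pivot_table sketch on a toy frame (hypothetical values), mirroring the call that presumably produced the table below:

```python
import pandas as pd

# Toy frame with hypothetical values; the real cell was presumably
# pd.pivot_table(data_Bm, index="Outlet_Establishment_Year", values="Item_Outlet_Sales")
toy = pd.DataFrame({
    "Outlet_Establishment_Year": [1987, 1987, 1999],
    "Item_Outlet_Sales": [100.0, 200.0, 300.0],
})

# only data and index are mandatory; the default aggregation is the mean
pt = pd.pivot_table(toy, index="Outlet_Establishment_Year", values="Item_Outlet_Sales")
print(pt.loc[1987, "Item_Outlet_Sales"])  # 150.0
```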
                           Item_Outlet_Sales
Outlet_Establishment_Year
1987                             2298.995256
1997                             2277.844267
1999                             2348.354635
2004                             2438.841866
2009                             1995.498739
. in the above example, the mean sales for each year are shown.
. you can also pass multiple columns to pivot_table; in the next example we try to see mean sales not just by the year but also taking into account the outlet size and the type of the city.
Out[65]: Item_Outlet_Sales
you can also perform multiple aggregations like mean, median, min, max, etc. in a pivot table by using the aggfunc parameter.
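A sketch of multiple aggregations via aggfunc, again on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Outlet_Establishment_Year": [1987, 1987, 1999],
    "Item_Outlet_Sales": [100.0, 200.0, 300.0],
})

# aggfunc accepts a list, producing one column per aggregation
pt = pd.pivot_table(toy, index="Outlet_Establishment_Year",
                    values="Item_Outlet_Sales", aggfunc=["mean", "min", "max"])
print(pt.loc[1987].tolist())  # [150.0, 100.0, 200.0]
```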