Pandas - NOTES
Lists, the data structure we have already learnt, are one dimensional.
For instance, if I want to save 2 values, say Name and City, using lists I would either need 2 separate lists or a list inside a list.
However, when we have too many attributes, imagine that many lists inside a list: things quickly become complex.
Dataframes are 2 dimensional data structures consisting of rows and columns, which gives them a tabular (table-like) structure.
Attributes are nothing but characteristics. For instance, if I say Student, some of the attributes would be Name, Age, Stage, Gender etc. All of these are attributes and will be represented as columns in a Dataframe.
Similarly, the actual data which we add to the dataframe forms the rows.
Because of the pandas library, Dataframes are more popularly known as Pandas Dataframes.
Dataframes - pandas
In the below line of code, we import the pandas library and give it the alias "pd". This is the common alias, since we prefer a short name instead of typing pandas every time.
It is not mandatory to use "pd". However, it is the general convention followed by pandas users worldwide, giving a sense of uniformity, so it is always recommended to use pd.
In [1]:
import pandas as pd
df1 = pd.DataFrame()
To create a Dataframe, we use the library name, in this case the alias "pd", followed by a period symbol and then a call to the function "DataFrame". Note the exact capitalisation of "DataFrame".
If it is not followed, you will get an error and your dataframe will not be created. Refer to the cells below.
In the above cell we are creating an empty Dataframe which has no columns and no rows, similar to an empty list.
In [3]:
df1 = pd.dataframe()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-72de98764683> in <module>
----> 1 df1 = pd.dataframe()
In [4]:
df1 = pd.Dataframe()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-3d9df181c01f> in <module>
----> 1 df1 = pd.Dataframe()
In [5]:
df1 = pd.dataFrame()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-0f1764ecfef5> in <module>
----> 1 df1 = pd.dataFrame()
In [6]:
df1
Out[6]:
You can see that when you check the dataframe variable "df1", it shows an essentially blank output, since the dataframe is empty.
Similarly, when you print a dataframe variable, it tells you that it is an empty dataframe and shows empty square brackets for the columns and the index. Index here refers to the rows.
In [7]:
print(df1)
Empty DataFrame
Columns: []
Index: []
dataframe_name['Column_Name'] = []
The above line will add an empty column to the dataframe. There are variations as well: you can also use a variable inside the square brackets, as long as that variable contains the name you want for that column.
In [8]:
Age = 'Age'
In [9]:
df1['ID'] = []
df1["Name"] = []
df1[Age] = []
In [10]:
df1
Out[10]:
ID Name Age
When you check the Dataframe you will see the above result, as it still does not have any rows or data.
In [11]:
print(df1)
Empty DataFrame
Columns: [ID, Name, Age]
Index: []
When you print the dataframe, you will see the above result. It still shows as an Empty DataFrame since it does not contain any data; however, it now shows the list of columns.
In [12]:
df1['ID'] = [1,2,3,4]
df1['Name'] = ['Mukund', 'Rishabh', 'Virat', 'Victor']
df1['Age'] = [35,30,36,37]
df1
Out[12]:
ID Name Age
0 1 Mukund 35
1 2 Rishabh 30
2 3 Virat 36
3 4 Victor 37
In [13]:
print(df1)
ID Name Age
0 1 Mukund 35
1 2 Rishabh 30
2 3 Virat 36
3 4 Victor 37
After adding the values, when we use print to check the values of the Dataframe, the output is as shown in the above cell.
Next, we will separately create lists and then add those lists to a dataframe.
In [14]:
list1 = [1,2,3,4]
list2 = ["MI", "RCB", "CSK", "SRH"]
In [15]:
df2 = pd.DataFrame(list2,index = list1, columns = ['Teams'])
In [16]:
df2
Out[16]:
Teams
1 MI
2 RCB
3 CSK
4 SRH
Few Pointers
We have used 2 lists. One contains the actual values or data, i.e., list2.
Observe that in the previous example the indices started from 0. However, in a dataframe we can change that.
In this example I have taken a list which has the indices I want and assigned it to the argument "index" of the function "DataFrame".
The first argument is the name of the list which has the data. In our case it is list2.
The second argument is the index, which is assigned list1, containing the numbers from 1 to 4.
The third and important argument is the list of columns. You can clearly see that it is a list as it is placed in square brackets. The name of the column will be "Teams". (An equivalent call with named arguments is sketched below.)
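The same construction can also be written with every argument named explicitly, which makes the role of each argument clearer. A small equivalent sketch, reusing the pandas import and the lists from the cells above:
# same DataFrame as above, with each argument passed by keyword
df2 = pd.DataFrame(data=list2, index=list1, columns=['Teams'])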
As mentioned in the above cells, we can specify our own indices. So, now let us take another list with index values from 100 to 400 and see whether it works.
In [17]:
list3 = [100,200,300,400]
Now, we will assign the list variable "list3" to the index argument and see what the output is.
In [18]:
df2 = pd.DataFrame(list2,index = list3, columns = ['Teams'])
In [19]:
df2
Out[19]:
Teams
100 MI
200 RCB
300 CSK
400 SRH
It can be clearly seen that the above dataframe has indices starting from 100 and ending at 400.
What if the index list has more values than the actual data itself? What do you think will happen?
Let us examine.
In [20]:
list5 = [1,2,3,4,5]
I have created a new list, list5, which has 5 index values, while the data is the same, i.e., list2, which has only 4 values.
In [21]:
df3 = pd.DataFrame(list2,index = list5, columns = ['Teams'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
   1652
-> 1653     mgr = BlockManager(blocks, axes)
   1654     mgr._consolidate_inplace()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
    113         if do_integrity_check:
--> 114             self._verify_integrity()
    115

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
    310             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 311                 construction_error(tot_items, block.shape[1:], self.axes)
    312         if len(self.items) != tot_items:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
   1690     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691         passed, implied))
   1692

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    449             else:
    450                 mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451                                    copy=copy)
    452         else:
    453             mgr = init_dict({}, index, columns, dtype=dtype)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_ndarray(values, index, columns, dtype, copy)
    165         values = maybe_infer_to_datetimelike(values)
    166
--> 167     return create_block_manager_from_blocks([values], [columns, index])
    168
    169

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
   1658         blocks = [getattr(b, 'values', b) for b in blocks]
   1659         tot_items = sum(b.shape[0] for b in blocks)
-> 1660         construction_error(tot_items, blocks[0].shape[1:], axes, e)
   1661
   1662

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
   1689         raise ValueError("Empty data passed with indices specified.")
   1690     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691         passed, implied))
   1692
   1693
Now, what if we do the opposite? That is, the data has more values and the index list has fewer values.
Let us check!
In [22]:
list6 = [1,2,3]
list7 = ["Harry", "Ben", "James", "Mary", "Sherlock"]
Now we have 2 lists. list6 has 3 values, which we will assign to the index argument, while list7 contains the actual values and its count is more than that of the index list.
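The cell that raised the error below is not reproduced here; it would have been a DataFrame call along the following lines (the variable name df4 and the column name 'Names' are assumptions for illustration):
In [23]:
# data has 5 values but the index list has only 3 (names below are assumed)
df4 = pd.DataFrame(list7, index = list6, columns = ['Names'])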
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
   1652
-> 1653     mgr = BlockManager(blocks, axes)
   1654     mgr._consolidate_inplace()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
    113         if do_integrity_check:
--> 114             self._verify_integrity()
    115

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
    310             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 311                 construction_error(tot_items, block.shape[1:], self.axes)
    312         if len(self.items) != tot_items:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
   1690     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691         passed, implied))
   1692

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    449             else:
    450                 mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451                                    copy=copy)
    452         else:
    453             mgr = init_dict({}, index, columns, dtype=dtype)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_ndarray(values, index, columns, dtype, copy)
    165         values = maybe_infer_to_datetimelike(values)
    166
--> 167     return create_block_manager_from_blocks([values], [columns, index])
    168
    169

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
   1658         blocks = [getattr(b, 'values', b) for b in blocks]
   1659         tot_items = sum(b.shape[0] for b in blocks)
-> 1660         construction_error(tot_items, blocks[0].shape[1:], axes, e)
   1661
   1662

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
   1689         raise ValueError("Empty data passed with indices specified.")
   1690     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691         passed, implied))
   1692
   1693
So, we can conclude that if we pass a list to the index argument, it must have the same number of values as the data list, and vice versa.
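A simple way to avoid this error is to compare the lengths of the two lists before calling DataFrame. A minimal sketch, reusing list1 and list2 from the earlier cells:
# the index list and the data list must contain the same number of elements
if len(list1) == len(list2):
    df_ok = pd.DataFrame(list2, index = list1, columns = ['Teams'])
else:
    print("Index has", len(list1), "values but data has", len(list2), "values")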
Till now, the indices we used were all in order, be it 1,2,3 or 100,200 etc. What if we give the indices in some
random order ?
Let's Check !
In [24]:
list8 = [10,1,9,11,16]
list9 = ["Mukund", "Ajay", "Vijay", "Sudeep", "Krish"]
So, list8 contains the indices and list9 contains some names.
In [25]:
df5 = pd.DataFrame(list9,index = list8, columns = ['Teams'])
In [26]:
df5
Out[26]:
Teams
10 Mukund
1 Ajay
9 Vijay
11 Sudeep
16 Krish
So, the dataframe has been created in the exact same order in which we provided the indices.
What if we want to sort using the indices? Can we do it? If yes, how? And can we do it in descending order? We will take this up in the next class.
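(As a small preview of what the next class will cover: pandas DataFrames have a sort_index method, sketched below.)
# sort the rows by their index values, ascending by default
df5.sort_index()
# sort by the index in descending order
df5.sort_index(ascending=False)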
Till now we manually added the data to the dataframe. However, in real-life data science and machine learning applications, the data is already present in different forms and we need not add it manually like this. So, how do we do that?
We will see one method, i.e., loading the data from an Excel sheet.
You should note that you have to give the complete path of the file where it is located. Once you give the complete path, read_excel will be able to load the data into the dataframe.
In [27]:
student_data = pd.read_excel("D:\\GENERAL\\Datasets\\Student_Data.xlsx")
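The double backslashes above escape the backslash characters in the Windows path. Assuming the same file location, the path can equivalently be written as a raw string or with forward slashes, as sketched here:
# raw string, so single backslashes do not need escaping
student_data = pd.read_excel(r"D:\GENERAL\Datasets\Student_Data.xlsx")
# forward slashes also work on Windows
student_data = pd.read_excel("D:/GENERAL/Datasets/Student_Data.xlsx")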
In [28]:
student_data
Out[28]:
Roll_No Fname Lname Gender School City Stage 2nd Lang Activity
1. head
This function returns only the first few rows of the dataframe (5 by default). As can be seen in the above output, it is difficult to visualize all the data in one go, so it is always good to first look at only the first few rows.
In [29]:
student_data.head()
Out[29]:
Roll_No Fname Lname Gender School City Stage 2nd Lang Activity
Of course, we can change that. If I want to see the first 7 records, then I need to pass the value 7 in the parentheses. Below is the code.
In [30]:
student_data.head(7)
Out[30]:
Roll_No Fname Lname Gender School City Stage 2nd Lang Activity
2. tail
If you need to see the last few records of a dataframe, we use the tail function. Below is the code for the same.
In [31]:
student_data.tail()
Out[31]:
Roll_No Fname Lname Gender School City Stage 2nd Lang Activity
This will show the last 5 rows and again if you need fewer or more rows, pass that value in the parentheses.
In [32]:
student_data.tail(3)
Out[32]:
Roll_No Fname Lname Gender School City Stage 2nd Lang Activity
21 22 Vijay Raj M Ferguson Pune 8 Hindi Theatre
3. info
The info function gives some important information such as how many rows, how many columns, the datatype of each attribute, how many columns of each datatype, and the memory usage.
In [33]:
student_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 9 columns):
Roll_No 24 non-null int64
Fname 24 non-null object
Lname 24 non-null object
Gender 24 non-null object
School 24 non-null object
City 24 non-null object
Stage 24 non-null int64
2nd Lang 24 non-null object
Activity 24 non-null object
dtypes: int64(2), object(7)
memory usage: 1.8+ KB
4. shape
shape gives the number of rows and columns in a dataframe, in the format (rows, columns).
In [34]:
student_data.shape
Out[34]:
(24, 9)
5. dtypes
dtypes gives the datatype of each column of the dataframe.
In [35]:
student_data.dtypes
Out[35]:
Roll_No int64
Fname object
Lname object
Gender object
School object
City object
Stage int64
2nd Lang object
Activity object
dtype: object
6. describe
The describe function gives what are called "Summary Statistics"; by default it covers only the numerical columns of a Dataframe.
In [36]:
student_data.describe()
Out[36]:
Roll_No Stage
In the above output, count is the total number of records, mean is the average, std is the standard deviation, min and max are the minimum and maximum values, and 25%, 50% and 75% are the quartiles.
Suppose I want to include non-numeric data also in the describe output; then we have to pass an argument called "include" and assign it the value 'all'.
In [37]:
student_data.describe(include='all')
Out[37]:
Roll_No Fname Lname Gender School City Stage 2nd Lang Activity
mean 12.500000 NaN NaN NaN NaN NaN 8.458333 NaN NaN
std 7.071068 NaN NaN NaN NaN NaN 1.503016 NaN NaN
min 1.000000 NaN NaN NaN NaN NaN 5.000000 NaN NaN
25% 6.750000 NaN NaN NaN NaN NaN 7.750000 NaN NaN
50% 12.500000 NaN NaN NaN NaN NaN 8.000000 NaN NaN
75% 18.250000 NaN NaN NaN NaN NaN 9.250000 NaN NaN
max 24.000000 NaN NaN NaN NaN NaN 11.000000 NaN NaN
Now, in this output you can see that non-numeric columns are also included.
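Conversely, if you only want the summary of the non-numeric columns, describe can be restricted to object (text) columns; for text data it reports count, unique, top and freq. A small sketch:
# summarise only the text (object-dtype) columns
student_data.describe(include=['object'])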
7. min and max
As the names suggest, the min and max functions are used to get the minimum and maximum values of a column in a dataframe. You can apply them to both numeric and non-numeric attributes; for non-numeric (text) columns the comparison is alphabetical.
In [38]:
student_data["Stage"].min()
Out[38]:
5
In [39]:
student_data["Stage"].max()
Out[39]:
11
In [40]:
student_data["City"].max()
Out[40]:
'Pune'
In [41]:
student_data["City"].min()
Out[41]:
'Bangalore'
8. mean
mean gives the average of a numeric column.
In [42]:
student_data["Stage"].mean()
Out[42]:
8.458333333333334
9. size
size gives the size of the dataframe, which is the product of the number of rows and the number of columns.
In [43]:
student_data.size
Out[43]:
216
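We can confirm this using shape, whose two values multiply to give size. A small sketch:
# size equals the number of rows times the number of columns
rows, cols = student_data.shape
print(rows * cols)         # 24 * 9 = 216
print(student_data.size)   # 216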
10. value_counts
This function gives the count of each unique value in a particular column. It is a very useful function as it helps you understand how many distinct values are present in that column and how many times each value occurs.
In [44]:
student_data.Gender.value_counts()
Out[44]:
M 18
F 6
Name: Gender, dtype: int64
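If you want the share of each value rather than the raw count, value_counts also accepts a normalize argument. A small sketch (the proportions follow from the counts above):
# proportion of each value instead of the raw count
student_data.Gender.value_counts(normalize=True)
# M    0.75
# F    0.25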