Pandas - NOTES

Uploaded by

baabaasheep50
Copyright
© © All Rights Reserved

DATAFRAMES

A list, the data structure we have learnt so far, is one-dimensional.

Suppose I want to store two values for each person, Name and City. With lists, I need either two separate lists or a list inside a list.

That looks simple when only two attributes (characteristics) are involved.

However, with many attributes we would need many lists inside a list, and things quickly become complex.

To avoid this, we have a special data structure called the Dataframe.

Dataframes are two-dimensional data structures consisting of rows and columns, which gives them a tabular structure.

Attributes are nothing but characteristics. For instance, for a Student, some of the attributes would be Name, Age, Stage, Gender, etc. Each attribute is represented as a Column in a Dataframe.

Similarly, the actual data which we add to the dataframe forms the Rows.
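The difference between a list inside a list and a Dataframe can be sketched as follows. The two sample records and the variable names are made up purely for illustration; they do not come from these notes.

```python
import pandas as pd

# Hypothetical sample records: each inner list is [Name, City].
records = [["Asha", "Pune"], ["Ravi", "Delhi"]]

# With nested lists you must remember that position 1 means "City".
city_from_list = records[0][1]

# A Dataframe holds the same data with named columns (attributes) and rows.
students = pd.DataFrame(records, columns=["Name", "City"])
city_from_frame = students.loc[0, "City"]

print(students)
print(city_from_list, city_from_frame)
```

With named columns the code says which attribute it reads, which matters more and more as the number of attributes grows.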

Now let us see how to use Dataframes in Python!


To create Dataframes in Python, we make use of a library called pandas.

Because of the pandas library, Dataframes are more popularly known as Pandas Dataframes.

Dataframes - pandas

In the line of code below, we import the pandas library and give it the alias "pd". This is the common alias, as we prefer a short name instead of typing pandas every time.

It is not mandatory to use "pd". However, it is a general convention followed by pandas users worldwide for the sake of uniformity. Hence it is always recommended to use pd.

In [1]:

import pandas as pd

Create an empty dataframe


In [2]:

df1 = pd.DataFrame()

Some pointers for the above line

To create a Dataframe, we write the library name (here its alias, "pd"), followed by a period, and then call "DataFrame".

Note the capitalization of "D" and "F", which is mandatory because Python names are case-sensitive.

If it is not followed, you will get an AttributeError and your dataframe will not be created. Refer to the cells below.

In the above cell we are creating an empty Dataframe, which has no columns and no rows, similar to an empty list.

In [3]:
df1 = pd.dataframe()

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-72de98764683> in <module>
----> 1 df1 = pd.dataframe()

AttributeError: module 'pandas' has no attribute 'dataframe'

In [4]:
df1 = pd.Dataframe()

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-3d9df181c01f> in <module>
----> 1 df1 = pd.Dataframe()

AttributeError: module 'pandas' has no attribute 'Dataframe'

In [5]:
df1 = pd.dataFrame()

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-0f1764ecfef5> in <module>
----> 1 df1 = pd.dataFrame()

AttributeError: module 'pandas' has no attribute 'dataFrame'
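
The three failed attempts above differ from the working call only in capitalization. As a small sketch (the list and dictionary names are illustrative, not part of the original notes), the built-in `hasattr` can confirm which spelling pandas actually exposes without raising an error:

```python
import pandas as pd

# Python names are case-sensitive, so only one of these spellings exists.
spellings = ["DataFrame", "dataframe", "Dataframe", "dataFrame"]

# hasattr returns False instead of raising AttributeError.
available = {name: hasattr(pd, name) for name in spellings}

for name, exists in available.items():
    print(f"pd.{name}: {'found' if exists else 'missing'}")
```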

In [6]:
df1
Out[6]:

You can see that when you check the dataframe variable "df1", the notebook shows nothing except a dash.

Similarly, when you print a dataframe variable, the output says that it is an empty DataFrame and shows empty square brackets for Columns and Index. Index here refers to the row labels.

In [7]:

print(df1)

Empty DataFrame
Columns: []
Index: []
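
Besides reading the printed output, emptiness can also be checked in code. A minimal sketch, using the `empty` and `shape` attributes and the built-in `len` (the variable name `empty_df` is only for illustration):

```python
import pandas as pd

empty_df = pd.DataFrame()

is_empty = empty_df.empty      # True when the dataframe holds no data
dimensions = empty_df.shape    # (number of rows, number of columns)
row_count = len(empty_df)      # number of rows

print(is_empty, dimensions, row_count)
```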

Adding Columns to the Dataframe


Now, let us add a few columns to a Dataframe. As demonstrated in the cells below, we follow this syntax to add a column.

dataframe_name['Column_Name'] = []

The above line adds an empty column to the dataframe. There are variations as well.

You can use double quotes instead of single quotes.

You can also use a variable inside the square brackets. That variable must contain the name you want for the column.

The cells below show the code for all three.

In [8]:
Age = 'Age'

In [9]:
df1['ID'] = []
df1["Name"] = []
df1[Age] = []

In [10]:
df1
Out[10]:

ID Name Age

When you check the Dataframe you see only the column headers, because it still does not have any rows or data.

In [11]:
print(df1)

Empty DataFrame
Columns: [ID, Name, Age]
Index: []

When you print the dataframe, you see the above result. It is still reported as an Empty DataFrame because it contains no rows, but the list of columns is now shown.
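
As an alternative to adding empty columns one at a time, the column names can be passed when the dataframe is created, through the `columns` argument of `DataFrame`. This is only a sketch of a second route to the same result; the variable names are illustrative.

```python
import pandas as pd

col_name = "Age"  # a variable holding a column name, as in the cells above

# All three columns are declared up front; there are still no rows.
df = pd.DataFrame(columns=["ID", "Name", col_name])

column_names = list(df.columns)
print(column_names)
print(df.empty)
```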

Adding Values to the Dataframe


Now we will add values to our columns, that is, populate our dataframe.

This is how it is done.

dataframe_name['Column_Name'] = [list of values]


Following the above syntax, we now add the values in the cell below.

In [12]:
df1['ID'] = [1,2,3,4]
df1['Name'] = ['Mukund', 'Rishabh', 'Virat', 'Victor']
df1['Age'] = [35,30,36,37]

df1
Out[12]:
   ID  Name     Age
0  1   Mukund   35
1  2   Rishabh  30
2  3   Virat    36
3  4   Victor   37

In [13]:
print(df1)

ID Name Age
0 1 Mukund 35
1 2 Rishabh 30
2 3 Virat 36
3 4 Victor 37

After adding the values, when we use print to check the contents of the Dataframe, we get the output shown in the above cell.
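
The same table can also be built in a single step by passing a dictionary that maps each column name to its list of values. This is a sketch of a commonly used alternative rather than the method shown in these notes; note that every list must have the same length.

```python
import pandas as pd

# Keys become column names; each list becomes that column's values.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Mukund", "Rishabh", "Virat", "Victor"],
    "Age": [35, 30, 36, 37],
})

print(df)
```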

Dataframe using lists


In the above example, when we showed how to add values to a dataframe, we assigned the values directly to a column of the dataframe. The values were given in the form of a list.

Now we will create the lists separately and then build a dataframe from them.

In [14]:
list1 = [1,2,3,4]
list2 = ["MI", "RCB", "CSK", "SRH"]

In [15]:
df2 = pd.DataFrame(list2,index = list1, columns = ['Teams'])

In [16]:

df2
Out[16]:

   Teams
1  MI
2  RCB
3  CSK
4  SRH

A Few Pointers

We have used 2 lists. One, list2, contains the actual values or data.

The other list, list1, holds the index labels.

Observe that in the previous example the indices started from 0. In a dataframe we can change that.

In this example I have taken a list which has the indices I want and passed it to the "index" argument of "DataFrame".

The first argument is the list which has the data. In our case it is list2.

The second argument is index, which is assigned list1, containing the numbers from 1 to 4.

The third and important argument is columns, the list of column names. You can clearly see that it is a list as it is placed in square brackets. The name of the column will be "Teams".
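
One consequence of a custom index is that a row's label and its position are no longer the same number. The sketch below uses `loc` (lookup by label) and `iloc` (lookup by position), two pandas accessors that these notes do not cover, to read the first row both ways; the variable names are illustrative.

```python
import pandas as pd

teams = ["MI", "RCB", "CSK", "SRH"]
labels = [1, 2, 3, 4]

df = pd.DataFrame(teams, index=labels, columns=["Teams"])

# The first row has label 1 but position 0.
first_by_label = df.loc[1, "Teams"]
first_by_position = df.iloc[0, 0]

print(first_by_label, first_by_position)
```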

As mentioned in the above cells, we can specify our own indexes. So, now let us take another list with index
values from 100 to 400 and see whether it works.

In [17]:
list3 = [100,200,300,400]

Now, we will assign the list variable "list3" to the index argument and see what the output is.

In [18]:
df2 = pd.DataFrame(list2,index = list3, columns = ['Teams'])

In [19]:
df2
Out[19]:

     Teams
100  MI
200  RCB
300  CSK
400  SRH

It can be clearly seen that the above dataframe has indices starting from 100 and ending at 400.

What if the index list has more values than the actual data? What do you think will happen?

Let us examine.

In [20]:
list5 = [1,2,3,4,5]

I have created a new list, list5, which has 5 values, while the data is the same list2 with only 4 values.

In [21]:
df3 = pd.DataFrame(list2,index = list5, columns = ['Teams'])

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1652
-> 1653 mgr = BlockManager(blocks, axes)
1654 mgr._consolidate_inplace()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
113 if do_integrity_check:
--> 114 self._verify_integrity()
115
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
310 if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 311 construction_error(tot_items, block.shape[1:], self.axes)
312 if len(self.items) != tot_items:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
1690 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691 passed, implied))
1692

ValueError: Shape of passed values is (4, 1), indices imply (5, 1)

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)


<ipython-input-21-70958f47609f> in <module>
----> 1 df3 = pd.DataFrame(list2,index = list5, columns = ['Teams' ])

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
449 else:
450 mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451 copy=copy)
452 else:
453 mgr = init_dict({}, index, columns, dtype=dtype)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_ndarray(values, index, columns, dtype, copy)
165 values = maybe_infer_to_datetimelike(values)
166
--> 167 return create_block_manager_from_blocks([values], [columns, index])
168
169

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1658 blocks = [getattr(b, 'values', b) for b in blocks]
1659 tot_items = sum(b.shape[0] for b in blocks)
-> 1660 construction_error(tot_items, blocks[0].shape[1:], axes, e)
1661
1662

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
1689 raise ValueError("Empty data passed with indices specified.")
1690 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691 passed, implied))
1692
1693

ValueError: Shape of passed values is (4, 1), indices imply (5, 1)

Now, what if we do the opposite? That is, the data will have more values and the index list will have fewer values.

Let us check!

In [22]:
list6 = [1,2,3]
list7 = ["Harry", "Ben", "James", "Mary", "Sherlock"]

Now we have 2 lists. list6 has 3 values, which we will pass as the index argument. list7 contains the actual values, and it has more values than the index list.

Let us create the Dataframe and see what happens.


In [23]:
df4 = pd.DataFrame(list7,index = list6, columns = ['Names'])

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1652
-> 1653 mgr = BlockManager(blocks, axes)
1654 mgr._consolidate_inplace()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
113 if do_integrity_check:
--> 114 self._verify_integrity()
115

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
310 if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 311 construction_error(tot_items, block.shape[1:], self.axes)
312 if len(self.items) != tot_items:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
1690 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691 passed, implied))
1692

ValueError: Shape of passed values is (5, 1), indices imply (3, 1)

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)


<ipython-input-23-83b64a1f7925> in <module>
----> 1 df4 = pd.DataFrame(list7,index = list6, columns = ['Names' ])

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
449 else:
450 mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451 copy=copy)
452 else:
453 mgr = init_dict({}, index, columns, dtype=dtype)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_ndarray(values, index, columns, dtype, copy)
165 values = maybe_infer_to_datetimelike(values)
166
--> 167 return create_block_manager_from_blocks([values], [columns, index])
168
169

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1658 blocks = [getattr(b, 'values', b) for b in blocks]
1659 tot_items = sum(b.shape[0] for b in blocks)
-> 1660 construction_error(tot_items, blocks[0].shape[1:], axes, e)
1661
1662

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals\managers.py in construction_error(tot_items, block_shape, axes, e)
1689 raise ValueError("Empty data passed with indices specified.")
1690 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691 passed, implied))
1692
1693

ValueError: Shape of passed values is (5, 1), indices imply (3, 1)


This also ended in an error.

So, we can conclude that if we specify a list for the index argument, it must have the same number of values as the list of data, and vice versa.
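
The rule can be checked in code. The sketch below repeats the mismatched call inside `try`/`except` so that the `ValueError` is reported as a short message instead of a long traceback, and then builds an index list of the right length; the variable names are illustrative.

```python
import pandas as pd

values = ["Harry", "Ben", "James", "Mary", "Sherlock"]
labels = [1, 2, 3]  # too short: 3 labels for 5 values

try:
    pd.DataFrame(values, index=labels, columns=["Names"])
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print("Could not build the dataframe:", err)

# Build one label per value so that the lengths always agree.
fixed_labels = list(range(1, len(values) + 1))
df = pd.DataFrame(values, index=fixed_labels, columns=["Names"])
print(df)
```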

Till now, the indices we used were all in order, be it 1, 2, 3 or 100, 200, etc. What if we give the indices in some random order?

Let's check!

So, here is the list with randomly ordered numbers.

In [24]:
list8 = [10,1,9,11,16]
list9 = ["Mukund", "Ajay", "Vijay", "Sudeep", "Krish"]

So, list8 contains the index values and list9 contains some names.

In [25]:
df5 = pd.DataFrame(list9,index = list8, columns = ['Teams'])

So, the dataframe got created. Let us check it once.

In [26]:
df5
Out[26]:

    Teams
10  Mukund
1   Ajay
9   Vijay
11  Sudeep
16  Krish

So, the dataframe has been created with the rows in exactly the order in which we provided the indices.

What if we want to sort using the indices? Can we do it? If yes, how? Can we do it in descending order? Tell me in the next class!
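
For readers who want to check their answer, one possible approach is the `sort_index` method, sketched here on the same data as df5. It returns a new, sorted dataframe and leaves the original untouched.

```python
import pandas as pd

df5 = pd.DataFrame(
    ["Mukund", "Ajay", "Vijay", "Sudeep", "Krish"],
    index=[10, 1, 9, 11, 16],
    columns=["Teams"],
)

ascending = df5.sort_index()                  # smallest index first
descending = df5.sort_index(ascending=False)  # largest index first

print(ascending)
print(descending)
```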

Till now we have added the data to the dataframe manually. However, in real-life data science and machine learning applications, the data is already present in different forms and we need not type it in like this. So, how do we load it?

We will see one method, i.e., loading the data from an Excel sheet.

To get the data from excel in Pandas


We use the read_excel function of pandas to load the data from an Excel file into a dataframe.

Note that you have to give the path of the file. A complete (absolute) path always works, while a bare file name works only when the file is in the current working directory. Once you give the path, read_excel can load the data into the dataframe.
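
Windows paths contain backslashes, which Python treats as escape characters inside ordinary strings. The sketch below shows three ways to write a path of this kind and wraps `read_excel` in a small helper; `load_students` is a hypothetical name introduced here, and reading `.xlsx` files normally also requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd

# Three spellings of one Windows path.
escaped = "D:\\GENERAL\\Datasets\\Student_Data.xlsx"   # doubled backslashes
raw = r"D:\GENERAL\Datasets\Student_Data.xlsx"         # raw string, no doubling
forward = "D:/GENERAL/Datasets/Student_Data.xlsx"      # forward slashes


def load_students(path):
    """Load an Excel sheet, failing early with a clear message if it is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"No Excel file found at {path}")
    return pd.read_excel(path)


print(escaped == raw)  # both strings contain single backslashes
```

The helper is only defined, not called, because the file exists only on the author's machine.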

In [27]:
student_data = pd.read_excel("D:\\GENERAL\\Datasets\\Student_Data.xlsx")

In [28]:
student_data
Out[28]:

    Roll_No  Fname      Lname     Gender  School           City       Stage  2nd Lang  Activity
0   1        Vijay      Chauhan   M       Pleasant Valley  Dehradun   7      Spanish   Cricket
1   2        Ajay       Rao       M       Pleasant Valley  Dehradun   7      French    Badminton
2   3        Shivani    Kumar     F       St. Xavier       Mussorie   8      Hindi     Archery
3   4        Gautam     Pai       M       St.Joseph        Chennai    9      Sanskrit  Soccer
4   5        Raj        S         M       Pleasant Valley  Dehradun   5      Spanish   Cricket
5   6        Ravi       J         M       St. Xavier       Mussorie   8      Hindi     Archery
6   7        Mukund     K         M       Pavan            Dharwad    10     Kannada   Chess
7   8        Fatima     K         F       Pleasant Valley  Dehradun   7      Spanish   Basketball
8   9        Kavya      A         F       Pavan            Dharwad    8      Hindi     Carrom
9   10       Samarth    Kaushik   M       Pleasant Valley  Dehradun   10     Spanish   Basketball
10  11       Om         Mehra     M       OakPine          Delhi      11     French    Basketball
11  12       Siddharth  Mehra     M       Pleasant Valley  Dehradun   9      Spanish   Basketball
12  13       Ranveer    R         M       OakPine          Delhi      11     French    Archery
13  14       Richa      Kapoor    F       Pleasant Valley  Dehradun   7      Spanish   Badminton
14  15       Manish     Pandey    M       Oriental         Bangalore  10     French    Cricket
15  16       Arun       Singh     M       New Era          Mohali     8      French    Cricket
16  17       Ajay       M         M       Blue Valley      Hyderabad  8      French    Badminton
17  18       Nidhi      M         F       Lions School     Chennai    8      Hindi     Basketball
18  19       Praveen    Reddy     M       OakPine          Delhi      9      Sanskrit  Archery
19  20       Manoj      Tripathi  M       Ferguson         Pune       8      Hindi     Soccer
20  21       Irfan      Khan      M       Nile Valley      Mumbai     9      Hindi     Cricket
21  22       Vijay      Raj       M       Ferguson         Pune       8      Hindi     Theatre
22  23       Ronit      Roy       M       Pleasant Valley  Dehradun   7      Spanish   Theatre
23  24       Ila        Kapoor    F       OakPine          Delhi      11     French    Archery

Important and Useful pandas functions

1. head

This function returns only the first few rows of the dataframe. As can be seen in the above output, it is difficult to take in all the data in one go, so it is always good to first look at only the first few rows.

In [29]:
student_data.head()
Out[29]:
    Roll_No  Fname      Lname     Gender  School           City       Stage  2nd Lang  Activity
0   1        Vijay      Chauhan   M       Pleasant Valley  Dehradun   7      Spanish   Cricket
1   2        Ajay       Rao       M       Pleasant Valley  Dehradun   7      French    Badminton
2   3        Shivani    Kumar     F       St. Xavier       Mussorie   8      Hindi     Archery
3   4        Gautam     Pai       M       St.Joseph        Chennai    9      Sanskrit  Soccer
4   5        Raj        S         M       Pleasant Valley  Dehradun   5      Spanish   Cricket

By default, this operation lists the first five records.

Of course, we can change that. If I want to see the first 7 records, then I need to pass the value 7 in the parentheses. Below is the code.

In [30]:
student_data.head(7)
Out[30]:

    Roll_No  Fname      Lname     Gender  School           City       Stage  2nd Lang  Activity
0   1        Vijay      Chauhan   M       Pleasant Valley  Dehradun   7      Spanish   Cricket
1   2        Ajay       Rao       M       Pleasant Valley  Dehradun   7      French    Badminton
2   3        Shivani    Kumar     F       St. Xavier       Mussorie   8      Hindi     Archery
3   4        Gautam     Pai       M       St.Joseph        Chennai    9      Sanskrit  Soccer
4   5        Raj        S         M       Pleasant Valley  Dehradun   5      Spanish   Cricket
5   6        Ravi       J         M       St. Xavier       Mussorie   8      Hindi     Archery
6   7        Mukund     K         M       Pavan            Dharwad    10     Kannada   Chess

2. tail

If you need to see the last few records of a dataframe, use the tail function. Below is the code for the same.

In [31]:
student_data.tail()
Out[31]:

    Roll_No  Fname      Lname     Gender  School           City       Stage  2nd Lang  Activity
19  20       Manoj      Tripathi  M       Ferguson         Pune       8      Hindi     Soccer
20  21       Irfan      Khan      M       Nile Valley      Mumbai     9      Hindi     Cricket
21  22       Vijay      Raj       M       Ferguson         Pune       8      Hindi     Theatre
22  23       Ronit      Roy       M       Pleasant Valley  Dehradun   7      Spanish   Theatre
23  24       Ila        Kapoor    F       OakPine          Delhi      11     French    Archery

This will show the last 5 rows and again if you need fewer or more rows, pass that value in the parentheses.

In [32]:

student_data.tail(3)
Out[32]:
    Roll_No  Fname      Lname     Gender  School           City       Stage  2nd Lang  Activity
21  22       Vijay      Raj       M       Ferguson         Pune       8      Hindi     Theatre
22  23       Ronit      Roy       M       Pleasant Valley  Dehradun   7      Spanish   Theatre
23  24       Ila        Kapoor    F       OakPine          Delhi      11     French    Archery
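
The effect of the number passed to head and tail is easy to verify on a tiny dataframe. The ten-row frame below is made up for illustration so that the expected rows are obvious.

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(1, 11)})  # ten rows holding 1 to 10

first_five = numbers.head()      # default: first 5 rows
first_seven = numbers.head(7)    # first 7 rows
last_three = numbers.tail(3)     # last 3 rows

print(first_seven)
print(last_three)
```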

3. info

The info function gives important information such as the number of rows and columns, the datatype of each column, the count of non-null values in each column, how many columns there are of each datatype, and the memory usage.

In [33]:

student_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 9 columns):
Roll_No 24 non-null int64
Fname 24 non-null object
Lname 24 non-null object
Gender 24 non-null object
School 24 non-null object
City 24 non-null object
Stage 24 non-null int64
2nd Lang 24 non-null object
Activity 24 non-null object
dtypes: int64(2), object(7)
memory usage: 1.8+ KB

4. shape

shape is an attribute, so it is written without parentheses. It gives the number of rows and columns in a dataframe in the format (rows, columns).

In [34]:
student_data.shape

Out[34]:
(24, 9)

5. dtypes

dtypes is also an attribute. It displays each column and its corresponding datatype.

In [35]:
student_data.dtypes

Out[35]:
Roll_No int64
Fname object
Lname object
Gender object
School object
City object
Stage int64
2nd Lang object
Activity object
dtype: object
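
Datatypes can also be tested in code rather than read off the output. The sketch below uses two helpers from `pandas.api.types` on a made-up two-column frame; the exact name pandas prints for a text column can vary between versions, so only the numeric check is relied on here.

```python
import pandas as pd

sample = pd.DataFrame({
    "Roll_No": [1, 2, 3],
    "City": ["Dehradun", "Chennai", "Pune"],
})

print(sample.dtypes)

# Helper functions that answer "what kind of column is this?"
roll_is_integer = pd.api.types.is_integer_dtype(sample["Roll_No"])
city_is_numeric = pd.api.types.is_numeric_dtype(sample["City"])

print(roll_is_integer, city_is_numeric)
```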

6. describe

The describe function gives what are called "Summary Statistics". By default it covers only the numerical columns of a Dataframe.

In [36]:
student_data.describe()

Out[36]:

       Roll_No    Stage
count  24.000000  24.000000
mean   12.500000  8.458333
std    7.071068   1.503016
min    1.000000   5.000000
25%    6.750000   7.750000
50%    12.500000  8.000000
75%    18.250000  9.250000
max    24.000000  11.000000

In the above output, count is the number of non-null records, mean is the average, std is the standard deviation, min and max are the minimum and maximum values, and 25%, 50% and 75% are the quartiles (50% is the median).

Suppose I want to include non-numeric data in the describe output as well. Then we pass an argument called "include" and assign it the value 'all'.

In [37]:
student_data.describe(include='all')

Out[37]:

        Roll_No    Fname  Lname  Gender  School           City      Stage      2nd Lang  Activity
count   24.000000  24     24     24      24               24        24.000000  24        24
unique  NaN        22     21     2       11               10        NaN        5         8
top     NaN        Vijay  K      M       Pleasant Valley  Dehradun  NaN        French    Archery
freq    NaN        2      2      18      8                8         NaN        7         5
mean    12.500000  NaN    NaN    NaN     NaN              NaN       8.458333   NaN       NaN
std     7.071068   NaN    NaN    NaN     NaN              NaN       1.503016   NaN       NaN
min     1.000000   NaN    NaN    NaN     NaN              NaN       5.000000   NaN       NaN
25%     6.750000   NaN    NaN    NaN     NaN              NaN       7.750000   NaN       NaN
50%     12.500000  NaN    NaN    NaN     NaN              NaN       8.000000   NaN       NaN
75%     18.250000  NaN    NaN    NaN     NaN              NaN       9.250000   NaN       NaN
max     24.000000  NaN    NaN    NaN     NaN              NaN       11.000000  NaN       NaN

Now, in this output you can see that non-numeric columns are also included.
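
The rows that describe adds for non-numeric columns (unique, top, freq) are easier to follow on a tiny example. The four-row frame below is made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "Stage": [7, 8, 8, 9],
    "City": ["Pune", "Delhi", "Pune", "Pune"],
})

numeric_only = sample.describe()              # only the Stage column
everything = sample.describe(include="all")   # Stage and City

# For City: 2 distinct values, "Pune" is the most frequent and occurs 3 times.
print(everything.loc[["unique", "top", "freq"], "City"])
```

Statistics that do not apply to a column, such as the mean of City or the top value of Stage, are filled with NaN.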

7. min and max

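As the names suggest, the min and max functions return the minimum and maximum values of a column in a dataframe. You can apply them to both numeric and non-numeric columns. For text columns the comparison is alphabetical, which is why 'Bangalore' is the minimum and 'Pune' is the maximum of City in the cells below.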

In [38]:
student_data["Stage"].min()

Out[38]:
5

In [39]:
student_data["Stage"].max()

Out[39]:
11

In [40]:

student_data["City"].max()
Out[40]:

'Pune'

In [41]:

student_data["City"].min()
Out[41]:

'Bangalore'
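
For text, min and max follow Python's string ordering, which compares character by character and places uppercase letters before lowercase ones. A small sketch with made-up values:

```python
import pandas as pd

cities = pd.Series(["Pune", "Delhi", "Bangalore", "Mumbai"])

smallest_city = cities.min()   # alphabetically first
largest_city = cities.max()    # alphabetically last

# Uppercase letters sort before lowercase ones, so "Zebra" < "apple".
mixed = pd.Series(["apple", "Zebra"])
smallest_mixed = mixed.min()

print(smallest_city, largest_city, smallest_mixed)
```

If the capitalization in a column is inconsistent, the result may therefore not match plain dictionary order.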

8. mean

The mean function calculates the average of a column.

In [42]:

student_data["Stage"].mean()
Out[42]:

8.458333333333334
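
The mean is simply the sum of the values divided by how many there are. The sketch below checks this on four made-up values using the `sum` and `count` methods.

```python
import pandas as pd

stages = pd.Series([7, 8, 9, 10])

average = stages.mean()
by_hand = stages.sum() / stages.count()   # 34 / 4

print(average, by_hand)
```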

9. size

size is an attribute that gives the total number of cells in the dataframe, which is the product of the number of rows and the number of columns (here 24 x 9 = 216).

In [43]:

student_data.size
Out[43]:

216
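
The relation between shape and size can be confirmed on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = df.shape      # shape unpacks into (rows, columns)
total_cells = df.size      # rows * columns

print(rows, cols, total_cells)
```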

10. value_counts

This function gives the number of times each distinct value occurs in a particular column. It is very useful because it shows which values are present in that column and how many times each one occurs.

In [44]:
student_data.Gender.value_counts()

Out[44]:
M 18
F 6
Name: Gender, dtype: int64
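
When proportions are more useful than raw counts, value_counts accepts a `normalize=True` argument. A minimal sketch on a made-up column:

```python
import pandas as pd

gender = pd.Series(["M", "F", "M", "M"])

counts = gender.value_counts()                 # how many times each value occurs
shares = gender.value_counts(normalize=True)   # the same as fractions of the total

print(counts)
print(shares)
```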
