CO3_3_Indexing and Sorting, Loading Data From CSV

Department of CSE

COURSE NAME: DATA ANALYTICS AND VISUALIZATION


COURSE CODE: 22CS2227R
Topic: Loading data in CSV, Indexing & Sorting

Session - 11

AIM OF THE SESSION

To familiarize students with loading CSV files and with indexing and sorting data and lists.

INSTRUCTIONAL OBJECTIVES

This session is designed to help students understand the importance of indexing and its practical applications in sorting lists and DataFrames.

LEARNING OUTCOMES

At the end of this session, you should be able to:

• Sort a pandas DataFrame by the values of one or more columns.
• Sort a DataFrame by its index using .sort_index().
• Organize missing data while sorting values.
• Sort a DataFrame in place using inplace set to True.


Session Content

• Loading data from CSV files
• Sort a pandas DataFrame
• Use the ascending parameter to change the sort order
• Sort a DataFrame by its index using .sort_index()
• Organize missing data while sorting values
• Sort a DataFrame in place using inplace set to True
• Lists in Python
• Inferential statistics

Preparing the Dataset
This session uses fuel economy data compiled by the US Environmental Protection Agency (EPA) on vehicles made between 1984 and 2021.

The EPA fuel economy dataset is great because it has many different types of information
that you can sort on, from textual to numeric data types.

The dataset contains eighty-three columns in total.

For analysis purposes, you’ll be looking at MPG (miles per gallon) data on vehicles by
make, model, year, and other vehicle attributes.

Python Code

By calling .read_csv() with the dataset URL, you can load the data into a DataFrame.
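The slide's original code is not reproduced in this text, so the following is only a minimal sketch of the loading step. It assumes a tiny in-memory CSV with made-up rows in place of the EPA file, so that it runs without a network connection. The column names make, model, city08 and highway08 come from this session; year is assumed for illustration, and the values are invented.

```python
import io

import pandas as pd

# Hypothetical stand-in for the EPA file: a few made-up rows with a handful
# of the columns used in this session (the real dataset has many more).
csv_text = """make,model,year,city08,highway08
Toyota,Corolla,2015,29,37
Ford,F150,2010,14,19
Honda,Civic,2018,31,40
Dodge,Ram,1995,11,15
"""

# pd.read_csv() accepts a URL, a file path, or a file-like object.
# Here a StringIO object plays the role of the dataset URL.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows of the DataFrame
```

To load the real dataset, you would pass its URL or local file path to pd.read_csv() instead of the StringIO object.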

Getting Familiar With .sort_values()
We use .sort_values() to sort values in a DataFrame along either axis (columns or rows).

The figure above shows the results of using .sort_values() to sort the DataFrame’s
rows based on the values in the highway08 column.
Getting Familiar With .sort_index()
You use .sort_index() to sort a DataFrame by its row index or column labels.
The difference from using .sort_values() is that you’re sorting the DataFrame
based on its row index or column names, not by the values in these rows or
columns.
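The difference between the two methods can be sketched on a small DataFrame. The rows below are made up for illustration, and the index is deliberately out of order so that the two sorts give different results.

```python
import pandas as pd

# Made-up rows with a deliberately shuffled row index.
df = pd.DataFrame(
    {"make": ["Toyota", "Ford", "Honda"], "highway08": [37, 19, 40]},
    index=[2, 0, 1],
)

# Sort rows by the values in the highway08 column.
by_values = df.sort_values("highway08")

# Sort rows by the row index, ignoring the column values.
by_row_index = df.sort_index()

# Sort columns by their labels (axis=1), leaving the rows as they are.
by_column_labels = df.sort_index(axis=1)

print(by_values)
print(by_row_index)
print(by_column_labels)
```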

Sorting Your DataFrame on a Single Column
To sort the DataFrame based on the values in a single column, you’ll
use .sort_values(). By default, this will return a new DataFrame sorted in
ascending order.
It does not modify the original DataFrame.
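A short sketch, using made-up values, of what "returns a new DataFrame" means in practice. It also shows the inplace=True option listed in the learning outcomes, which sorts the original DataFrame and returns None.

```python
import pandas as pd

df = pd.DataFrame({"city08": [29, 14, 31]})  # made-up values

# Default behaviour: a new, sorted DataFrame is returned.
sorted_df = df.sort_values("city08")
original_after_sort = df["city08"].tolist()  # still in the original order

# With inplace=True the original is sorted and the call returns None.
result = df.sort_values("city08", inplace=True)

print(original_after_sort)
print(sorted_df["city08"].tolist())
print(result, df["city08"].tolist())
```

Because the in-place call returns None, assigning its result back to df would lose the data.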

Sorting by a Column in Ascending Order
• To use .sort_values(), you pass a single argument to the method containing the name of the column you want to sort by. In this example, you sort the DataFrame by the city08 column, which represents city MPG for fuel-only cars.
• This sorts your DataFrame using the column values from city08, showing the vehicles with the lowest MPG first. By default, .sort_values() sorts your data in ascending order.
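The sort on city08 might look like the sketch below. The rows are made up for illustration; only the column names come from the session.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame(
    {
        "make": ["Toyota", "Ford", "Honda", "Dodge"],
        "city08": [29, 14, 31, 11],
    }
)

# Pass the column name as the single argument (the "by" parameter).
lowest_first = df.sort_values("city08")

print(lowest_first)
```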

Changing the Sort Order
Another parameter of .sort_values() is ascending.
By default .sort_values() has ascending set to True.
If you want the DataFrame sorted in descending order, then you can pass False to
this parameter.
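As a minimal sketch with made-up rows, passing ascending=False puts the highest city MPG first:

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame(
    {
        "make": ["Toyota", "Ford", "Honda", "Dodge"],
        "city08": [29, 14, 31, 11],
    }
)

# ascending=False reverses the default order: largest values come first.
highest_first = df.sort_values("city08", ascending=False)

print(highest_first)
```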

Choosing a Sorting Algorithm
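pandas allows you to choose the sorting algorithm used by both .sort_values() and .sort_index() through the kind parameter.
The available algorithms are quicksort (the default), mergesort, heapsort, and stable. For a DataFrame, kind is applied only when sorting on a single column or label.

Setting kind to mergesort selects the mergesort algorithm, which is a stable sort: rows with equal values keep their original relative order.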
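A small sketch of choosing the algorithm with kind. The rows are made up and contain tied city08 values so that the effect of a stable sort is visible.

```python
import pandas as pd

# Made-up rows with ties in city08.
df = pd.DataFrame(
    {
        "make": ["Ford", "Toyota", "Honda", "Dodge"],
        "city08": [14, 29, 14, 29],
    }
)

# mergesort is stable: tied rows stay in their original relative order,
# so Ford stays ahead of Honda and Toyota stays ahead of Dodge.
stable_sorted = df.sort_values("city08", kind="mergesort")

print(stable_sorted)
```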

Sorting Your DataFrame on Multiple Columns
In data analysis, it’s common to want to sort your data based on the values of
multiple columns.
Imagine you have a dataset with people’s first and last names. It would make
sense to sort by last name and then first name, so that people with the same last
name are arranged alphabetically according to their first names.

Sorting Your DataFrame on Multiple Columns
In addition to the MPG in city conditions, you may also want to look at MPG for highway conditions. To sort by two keys, you can pass a list of column names to by, such as ["city08", "highway08"].
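A minimal sketch of a two-key sort, using made-up rows. The first column in the list is the primary key, and the second one breaks ties.

```python
import pandas as pd

# Made-up rows; model names A to D are placeholders.
df = pd.DataFrame(
    {
        "model": ["A", "B", "C", "D"],
        "city08": [14, 29, 14, 29],
        "highway08": [22, 37, 19, 35],
    }
)

# Sort by city08 first; rows with equal city08 are ordered by highway08.
two_keys = df.sort_values(by=["city08", "highway08"])

print(two_keys)
```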

Sorting by Multiple Columns in Ascending Order
To sort the DataFrame on multiple columns, you must provide a list of column names. For example, to sort by make and model, pass the list ["make", "model"] to .sort_values().

Sorting by Multiple Columns in Descending Order
• You can also sort in descending order based on the make and model columns.
• To sort in descending order, set ascending to False.
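The sketch below sorts made-up rows by make and model, first ascending and then descending. The last call goes slightly beyond the slide: ascending also accepts a list with one flag per column, which is shown here as an optional extra.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame(
    {
        "make": ["Ford", "Toyota", "Ford", "Toyota"],
        "model": ["Mustang", "Corolla", "Focus", "Camry"],
    }
)

# Ascending on both columns (the default).
ascending_both = df.sort_values(by=["make", "model"])

# Descending on both columns.
descending_both = df.sort_values(by=["make", "model"], ascending=False)

# One flag per column: make ascending, model descending.
mixed = df.sort_values(by=["make", "model"], ascending=[True, False])

print(ascending_both)
print(descending_both)
print(mixed)
```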

Sorting Your DataFrame on Its Index
Before sorting on the index, it’s a good idea to know what an index represents.
A DataFrame has an .index property, which by default is a numerical
representation of its rows’ locations.
You can think of the index as the row numbers. It helps in quick row lookup and
identification.
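To make the idea of the default index concrete, here is a short sketch with made-up rows. It prints the .index property and uses an index label to look up a row.

```python
import pandas as pd

df = pd.DataFrame({"make": ["Toyota", "Ford", "Honda"]})  # made-up rows

# With no index given, pandas numbers the rows 0, 1, 2, ...
print(df.index)

# The index label identifies a row for quick lookup with .loc.
second_make = df.loc[1, "make"]
print(second_make)
```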

Sorting Your DataFrame on Its Index
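The code for this slide is not part of this text. One plausible illustration, using made-up values, is that sorting by values leaves the index out of order, and .sort_index() puts the rows back in index order.

```python
import pandas as pd

df = pd.DataFrame({"city08": [29, 14, 31]})  # made-up values

# Sorting by values carries each row's index label along with it.
by_values = df.sort_values("city08")
print(by_values.index.tolist())

# Sorting on the index restores the original row order.
restored = by_values.sort_index()
print(restored)
```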

Sorting by Index in Descending Order

Next, sort your DataFrame by its index in descending order.

Remember from sorting your DataFrame with .sort_values() that you can reverse the sort order by setting ascending to False. This parameter also works with .sort_index(), so you can sort your DataFrame in reverse order by passing ascending=False.
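A minimal sketch of the reverse index sort, using made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"make": ["Toyota", "Ford", "Honda"]})  # made-up rows

# ascending=False sorts the row index from largest to smallest.
reversed_df = df.sort_index(ascending=False)

print(reversed_df)
```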

Sorting by Index in Descending Order

Now your DataFrame is sorted by its index in descending order.


One difference between using .sort_index() and .sort_values() is
that .sort_index() has no by parameter since it sorts a DataFrame on the row
index by default.
Merge/Join Datasets
• Joining and merging DataFrames is a core first step in data analysis and machine learning tasks.
• It is a skill every Data Analyst or Data Scientist should master, because in almost all cases data comes from multiple sources and files.
• You may need to bring all the data into one place using some form of join logic before you start your analysis.
• pandas provides various facilities for easily combining different datasets.

Understanding the different types of merge
• You can merge two DataFrames in pandas by using the merge() function.
• The different arguments to merge() allow you to perform a natural (inner) join, left join, right join, and full outer join.
• Before performing join operations, first load the two CSV files and convert them into DataFrames df1 and df2.
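The two CSV files used in the session are not included in this text. As a sketch, the example below assumes two tiny made-up files that share a key column named id. The file contents and the column names id, name and marks are hypothetical.

```python
import io

import pandas as pd

# Hypothetical contents of the two CSV files, sharing the key column "id".
students_csv = "id,name\n1,Asha\n2,Ravi\n3,Meena\n"
marks_csv = "id,marks\n2,78\n3,91\n4,65\n"

# In practice you would pass file paths to pd.read_csv();
# StringIO keeps this sketch self-contained.
df1 = pd.read_csv(io.StringIO(students_csv))
df2 = pd.read_csv(io.StringIO(marks_csv))

print(df1)
print(df2)
```

Only ids 2 and 3 appear in both DataFrames, which is what makes the join types give different results.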

Natural join
• A natural join keeps only the rows that match in both DataFrames (df1 and df2). Specify the argument how='inner'.
Syntax: pd.merge(df1, df2, on='column', how='inner')
• Returns only the rows in which the left table has matching keys in the right table.
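A minimal sketch of the inner join on two made-up DataFrames. The key column id and the other column names are hypothetical.

```python
import pandas as pd

# Made-up DataFrames sharing the key column "id".
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "marks": [78, 91, 65]})

# Keep only the ids that appear in both df1 and df2.
inner = pd.merge(df1, df2, on="id", how="inner")

print(inner)
```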

Full outer join
• A full outer join keeps all rows from both DataFrames. Specify how='outer'.
Syntax: pd.merge(df1, df2, on='column', how='outer')
• Returns all rows from both tables, joining records from the left that have matching keys in the right table. Where there is no match, the missing values are filled with NaN.
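A sketch of the full outer join on the same kind of made-up data. Rows that exist on only one side are kept, with NaN in the columns that come from the other side.

```python
import pandas as pd

# Made-up DataFrames sharing the key column "id".
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "marks": [78, 91, 65]})

# Keep every id from either DataFrame.
outer = pd.merge(df1, df2, on="id", how="outer")

print(outer)
```

Here id 1 has no marks and id 4 has no name, so those two cells hold NaN.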

Left outer join
• A left outer join includes all the rows of df1 and only those from df2 that match. Specify how='left'.
Syntax: pd.merge(df1, df2, on='column', how='left')
• Returns all rows from the left table, and any rows with matching keys from the right table.

Right outer join
• A right outer join includes all the rows of df2 and only those from df1 that match. Specify how='right'.
Syntax: pd.merge(df1, df2, on='column', how='right')
• Returns all rows from the right table, and any rows with matching keys from the left table.
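Left and right outer joins mirror each other, so the sketch below shows both side by side on made-up DataFrames. The key column id and the other column names are hypothetical.

```python
import pandas as pd

# Made-up DataFrames sharing the key column "id".
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "marks": [78, 91, 65]})

# Left join: every row of df1, with marks filled in where df2 has a match.
left = pd.merge(df1, df2, on="id", how="left")

# Right join: every row of df2, with name filled in where df1 has a match.
right = pd.merge(df1, df2, on="id", how="right")

print(left)
print(right)
```

Note that how takes lowercase strings ('left', 'right'); capitalized values such as 'Left' are not accepted.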

SELF-ASSESSMENT QUESTIONS

1. What is the syntax to create a pandas Series from a Python list? ___________________

2. What is the correct syntax to return the first value of a pandas Series? _________________

3. What is the correct syntax to add the labels "x", "y", and "z" to a pandas Series? __________________

4. What is the correct syntax to create a pandas DataFrame? _______________

5. What is the correct syntax to return the first row of a pandas DataFrame? ____________________


TERMINAL QUESTIONS

1) Critique the pandas groupby function, with examples.

2) Deduce the steps to read CSV files into pandas.

3) Defend your steps for handling a delimited text file that uses a comma to separate the values.

4) Compare and contrast the different types of joins.


THANK YOU

Team – DAV