
Avoid This Costly Mistake When Indexing A DataFrame

The document discusses how selecting columns first when indexing a pandas dataframe is faster than slicing rows first. It explains that dataframes are stored in column-major format, so accessing data by column is faster because the elements are stored contiguously in memory. Slicing rows requires gathering non-contiguous elements, converting them to a series, and is over 15 times slower. The document recommends selecting columns instead of slicing rows for better performance when indexing dataframes.

5/25/23, 10:08 AM Yahoo Mail - Avoid This Costly Mistake When Indexing A DataFrame

Avoid This Costly Mistake When Indexing A DataFrame

From: Daily Dose of Data Science (avichawla@substack.com)

To: chidichekwas@yahoo.com

Date: Saturday, April 22, 2023 at 03:45 AM CDT


Avoid This Costly Mistake When Indexing A DataFrame
Row-then-column is not the same as Column-then-row.
AVI CHAWLA
APR 22


When indexing a dataframe, choosing whether to select a column first or slice
a row first is pretty important from a run-time perspective.

In the benchmark for this post, selecting the column first was over 15 times
faster than slicing the row first. Why?

As I have discussed before, a Pandas DataFrame is a column-major data structure.


Thus, consecutive elements in a column are stored next to each other in
memory.


Column-major structure of a Pandas DataFrame

Because processors work most efficiently with contiguous blocks of memory,
accessing a column is much faster than accessing a row (read more about this
in one of my previous posts).
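The contiguity argument can be illustrated with a plain NumPy array. This is only a sketch: the array `table` is a hypothetical stand-in that is explicitly laid out in column-major order, and it is not a view into how Pandas stores any particular DataFrame internally.

```python
import numpy as np

# Hypothetical 3x4 table of 8-byte integers, stored in column-major (Fortran) order.
table = np.asfortranarray(np.arange(12, dtype=np.int64).reshape(3, 4))

column = table[:, 0]  # all rows of the first column
row = table[0, :]     # all columns of the first row

# A stride is the number of bytes to step over to reach the next element.
column_stride = column.strides[0]  # 8 bytes: neighbours sit side by side
row_stride = row.strides[0]        # 24 bytes: each step skips a whole 3-item column

column_contiguous = column.flags["C_CONTIGUOUS"]
row_contiguous = row.flags["C_CONTIGUOUS"]

print("column stride:", column_stride, "contiguous:", column_contiguous)
print("row stride:   ", row_stride, "contiguous:", row_contiguous)
```

In this layout the column is one unbroken run of memory, while reading the row means jumping across the buffer for every element.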

But when you slice a row first, each row is retrieved by accessing
non-contiguous blocks of memory, which makes it slow.

Also, once all the elements of a row are gathered, Pandas converts them to a
Series, which adds further overhead.

Slicing a row creates a Pandas Series

We can verify this conversion below:


Slicing a row returns a Pandas Series
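The original screenshot is not available here, so the following is a minimal sketch of the same check. The DataFrame `df` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with an integer, a float, and a string column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": ["x", "y", "z"]})

# Slicing a single row gathers one value from every column into a new Series.
row = df.iloc[0]
print(type(row))   # a pandas Series
print(row.index)   # the column labels become the Series index
print(row.dtype)   # object: one dtype has to hold the int, the float, and the string

# With only numeric columns, the values are still upcast to a common dtype.
numeric_row = df[["a", "b"]].iloc[0]
print(numeric_row.dtype)  # float64
```

The change of dtype is a visible sign of the conversion: the row's values have to be repackaged into a single new Series.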

Instead, when you select a column first, elements are retrieved by accessing
contiguous blocks of memory, which is way faster. Also, a column is
inherently a Pandas Series. Thus, there is no conversion overhead involved
like above.


Selecting a column returns a Pandas Series
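As a stand-in for the missing screenshot, here is a short sketch of the column case. Again, `df` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with an integer and a float column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Selecting a column returns a Series directly.
col = df["a"]
print(type(col))  # a pandas Series

# The column keeps the dtype it already has inside the DataFrame.
dtype_preserved = col.dtype == df.dtypes["a"]
print(dtype_preserved)

# Picking an element from the column afterwards is a plain positional lookup.
first_value = col.iloc[0]
print(first_value)
```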

Overall, by accessing the column first, we avoid the non-contiguous memory
access that happens when we access the row first.

This makes selecting the column first faster than slicing a row first in indexing
operations.
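To check this on your own machine, a rough timing sketch along these lines can help. The data, the column names, and the helper functions `row_first` and `column_first` are assumptions made for illustration, not the code from the original benchmark, and the measured ratio will vary with the frame's shape, its dtypes, and the Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows and 20 float columns named c0..c19.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 20)), columns=[f"c{i}" for i in range(20)])


def row_first():
    # Slice the row (builds a Series for the whole row), then pick one column.
    return df.iloc[5]["c3"]


def column_first():
    # Select the column (already a Series), then pick one element.
    return df["c3"].iloc[5]


# Both orders return the same value; only the work done along the way differs.
same_value = row_first() == column_first()

row_time = timeit.timeit(row_first, number=2_000)
col_time = timeit.timeit(column_first, number=2_000)
print(f"row first:    {row_time:.4f} s")
print(f"column first: {col_time:.4f} s")
```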

If you are confused about what selecting, indexing, slicing, and filtering mean,
here’s what you should read next:

Daily Dose of Data Science

Are You Sure You Are Using The Correct Pandas Terminologies?
Many Pandas users use the dataframe subsetting terminologies
incorrectly. So let's spend a minute to get it straight. SUBSETTING
means extracting value(s) from a dataframe. This can be done in four
ways: 1) We call it SELECTING when we extract one or more of its
COLUMNS based on index location or name. The output contains some
co…


Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools.
You can read my articles on Medium. Also, you can connect with me on
LinkedIn and Twitter.


© 2023 Avi Chawla

