Add documentation section on Scaling #28315

TomAugspurger · 2019-09-06T13:40:31Z

From the user survey, the most critical feature request was "Improve scaling to larger datasets".

While we continue to work within pandas to improve scaling (fewer copies, native string dtype, etc.), we can document a few available strategies that may help with scaling.

* Use efficient dtypes: ensure you don't have object dtypes. Possibly use Categorical for strings, if they have low cardinality. Possibly use lower-precision dtypes.
* Avoid unnecessary work. When loading data, use usecols= (CSV) or columns= (Parquet) to select only the columns you need. Probably some other examples.
* Use out-of-core methods, like pd.read_csv(..., chunksize=), to avoid having to rewrite.
* Use other libraries. I would of course recommend Dask. But I'm not opposed to a section highlighting Vaex, and possibly Spark (though installing it in our doc environment may be difficult).
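
As a rough illustration of the pandas-only strategies listed above, here is a minimal sketch. The data, the column names (`id`, `state`, `value`, `unused`), and the chunk size are made up for the example, and an in-memory CSV stands in for a large file on disk. Note that `pd.read_csv` spells the column filter `usecols=`, while `pd.read_parquet` spells it `columns=`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: an in-memory CSV standing in for a large file on disk.
rng = np.random.default_rng(0)
n = 10_000
raw = pd.DataFrame(
    {
        "id": np.arange(n),
        "state": rng.choice(["CA", "NY", "TX"], size=n),  # low-cardinality strings
        "value": rng.random(n),
        "unused": rng.random(n),  # a column the analysis never needs
    }
)
csv_text = raw.to_csv(index=False)

# Avoid unnecessary work: parse only the columns that are needed.
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "state", "value"])
before = df.memory_usage(deep=True).sum()

# Efficient dtypes: Categorical for low-cardinality strings, and
# lower-precision numerics where the reduced precision is acceptable.
df["state"] = df["state"].astype("category")
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["value"] = df["value"].astype("float32")
after = df.memory_usage(deep=True).sum()

# Out-of-core: read the file in chunks, reduce each chunk to a small
# partial result, then combine the partial results.
partials = [
    chunk["state"].value_counts()
    for chunk in pd.read_csv(
        io.StringIO(csv_text), usecols=["state"], chunksize=1_000
    )
]
counts = pd.concat(partials).groupby(level=0).sum()

print(f"memory before: {before} bytes, after: {after} bytes")
print(counts)
```

The chunked pattern only works directly for operations whose partial results can be combined, such as counts and sums. Operations that need all rows at once are where a library like Dask becomes more attractive.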

Do people have thoughts on this? Any objections to highlighting outside projects like Dask?
Are there other strategies we should mention?


WillAyd · 2019-09-06T21:48:22Z

I'm on board with this

jorisvandenbossche · 2019-09-07T13:31:54Z

This could fit in a general "performance" section of the user guide, together with the "Enhancing performance" page (https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html) we already have?

TomAugspurger added the Docs label Sep 6, 2019

TomAugspurger changed the title ~~Add documentation section on "Scaling"~~ Add documentation section on Scaling Sep 6, 2019

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2019

DOC: Add scaling to large datasets section

4123445

Closes pandas-dev#28315

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2019

DOC: Add scaling to large datasets section

7e7d786

Closes pandas-dev#28315

TomAugspurger mentioned this issue Sep 23, 2019

DOC: Add scaling to large datasets section #28577

Merged

TomAugspurger closed this as completed in #28577 Oct 1, 2019

TomAugspurger added a commit that referenced this issue Oct 1, 2019

DOC: Add scaling to large datasets section (#28577)

c13c13b

* DOC: Add scaling to large datasets section Closes #28315

josibake pushed a commit to josibake/pandas that referenced this issue Oct 1, 2019

DOC: Add scaling to large datasets section (pandas-dev#28577)

4661d77

* DOC: Add scaling to large datasets section Closes pandas-dev#28315

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

DOC: Add scaling to large datasets section (pandas-dev#28577)

da82364

* DOC: Add scaling to large datasets section Closes pandas-dev#28315

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

DOC: Add scaling to large datasets section (pandas-dev#28577)

82344e4

* DOC: Add scaling to large datasets section Closes pandas-dev#28315

bongolegend pushed a commit to bongolegend/pandas that referenced this issue Jan 1, 2020

DOC: Add scaling to large datasets section (pandas-dev#28577)

9f57b78

* DOC: Add scaling to large datasets section Closes pandas-dev#28315

jason-ytsao pushed a commit to jason-ytsao/exp_vtb that referenced this issue Aug 10, 2022

DOC: Add scaling to large datasets section (#28577)

a7193f9

* DOC: Add scaling to large datasets section Closes pandas-dev/pandas#28315
