Add documentation section on Scaling #28315

Closed
TomAugspurger opened this issue Sep 6, 2019 · 2 comments · Fixed by #28577

@TomAugspurger
Contributor

From the user survey, the most critical feature request was "Improve scaling to larger datasets".

While we continue to work within pandas to improve scaling (fewer copies, native string dtype, etc.), we can document a few existing strategies that may help with scaling.

  1. Use efficient dtypes and make sure you don't have object dtypes. Possibly use Categorical for strings, if they have low cardinality. Possibly use lower-precision dtypes.
  2. Avoid unnecessary work. When loading data, use usecols= (csv) or columns= (parquet) to select only the columns you need. Probably some other examples.
  3. Use out-of-core methods, like pd.read_csv(..., chunksize=), to avoid having to rewrite your code for another library.
  4. Use other libraries. I would of course recommend Dask. But I'm not opposed to a section highlighting Vaex, and possibly Spark (though installing it in our doc environment may be difficult).
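
To make items 1 and 2 concrete, here is a minimal sketch, not a proposal for the final doc text. The data and the column names (`city`, `count`, `notes`) are made up for illustration; the only pandas calls used are `read_csv(usecols=...)`, `astype("category")`, `pd.to_numeric(downcast=...)`, and `memory_usage(deep=True)`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: a low-cardinality string column, a small-integer
# column, and a free-text column we do not need for the analysis.
n = 10_000
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {
        "city": rng.choice(["Austin", "Boston", "Chicago"], size=n),
        "count": rng.integers(0, 100, size=n),
        "notes": ["free text that is not needed"] * n,
    }
)
csv_text = raw.to_csv(index=False)

# Item 2: avoid unnecessary work by loading only the columns we need.
df = pd.read_csv(io.StringIO(csv_text), usecols=["city", "count"])
before = df.memory_usage(deep=True).sum()

# Item 1: use more efficient dtypes.
# Low-cardinality strings become Categorical; small integers are downcast.
df["city"] = df["city"].astype("category")
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")
after = df.memory_usage(deep=True).sum()

print(f"bytes before: {before}, bytes after: {after}")
print(df.dtypes)
```

How much memory this saves depends heavily on the data; Categorical only helps when the number of distinct values is small relative to the number of rows.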

Do people have thoughts on this? Any objections to highlighting outside projects like Dask?
Are there other strategies we should mention?
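
For item 3, a rough sketch of what a chunked computation could look like. The tiny in-memory CSV and the `key`/`value` column names are assumptions standing in for a file too large to load at once; the pattern is to reduce each chunk and then combine the partial results.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for a file that does not fit in memory.
rows = [f"{'a' if i % 2 == 0 else 'b'},{i}" for i in range(10)]
csv_text = "key,value\n" + "\n".join(rows)

# With chunksize=, read_csv returns an iterator of DataFrames.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Reduce each chunk on its own so only one chunk is in memory at a time.
    partials.append(chunk.groupby("key")["value"].sum())

# Combine the per-chunk partial sums into the final result.
result = pd.concat(partials).groupby(level=0).sum()
print(result)
```

This works for reductions that can be combined across chunks (sums, counts, min/max). Operations that need all rows at once, such as a median or a full sort, do not fit this pattern directly.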

@TomAugspurger changed the title from 'Add documentation section on "Scaling"' to 'Add documentation section on Scaling' on Sep 6, 2019
@WillAyd
Member

WillAyd commented Sep 6, 2019

I'm on board with this

@jorisvandenbossche
Member

This could fit in a general "performance" section of the user guide, together with the "Enhancing performance" page (https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html) we already have?

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2019
TomAugspurger added a commit that referenced this issue Oct 1, 2019
* DOC: Add scaling to large datasets section

Closes #28315