get-started: more plots, dvclive, better dataset #102

Closed
wants to merge 1 commit into from

daavoo
Contributor

@daavoo daavoo commented Jan 31, 2022

EDITs by @shcheklein:

Corresponding repo to play with: https://studio.iterative.ai/user/shcheklein/views/example-get-started-r7898c5e32

(a few lines in the PR will be uncommented / reverted before actual merge, after testing)

This PR changes the get-started example:

  • Uses a clean dataset (and thus 2.5x smaller)
  • Fewer features (3000 -> 100/200). Since the dataset is better, this is now enough, and the memory footprint should be much better
  • More plots - confusion matrix + forest analysis (most useful feature)
  • The CML report includes the confusion matrix
  • Two branches - more data, hyperparameter tuning
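
As a rough illustration of what a confusion matrix plot summarizes, the counts behind it can be sketched with the standard library alone. This is a minimal sketch, not the code from this PR: the function name `confusion_counts` and the sample labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


# Hypothetical binary labels, made up for illustration only.
actual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]

counts = confusion_counts(actual, predicted)
for (a, p), n in sorted(counts.items()):
    print(f"actual={a} predicted={p} count={n}")
```

Each cell of the plotted matrix corresponds to one of these (actual, predicted) counts.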

TODO:

@daavoo daavoo self-assigned this Jan 31, 2022
@shcheklein
Member

Looks promising. Maybe we could keep the old code as a comment, for educational purposes? wdyt?

@iesahin
Contributor

iesahin commented Feb 23, 2022

Hey @daavoo Is this still a draft?

@iesahin
Contributor

iesahin commented Mar 8, 2022

ping! @daavoo :)

@iesahin
Contributor

iesahin commented Apr 5, 2022

ding! :)

@daavoo
Contributor Author

daavoo commented Apr 5, 2022

ding! :)

Hei @iesahin, what do you think about:

Maybe we could keep the old code as a comment, for educational purposes? wdyt?

Should I add this PR's changes as a new branch?

fd,
indent=4,
)
live.log_plot("precision_recall", labels, predictions)
Member

This probably doesn't factor in https://github.com/iterative/example-repos-dev/pull/102/files#diff-c9e81175bafc05fa2504a715d85f4fb99493314d93b2ba5b81f79fcc77cd09a7L41-L45 , right? That would make Studio slow. Unfortunately, sklearn doesn't support dropping intermediate values yet: scikit-learn/scikit-learn#21825
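
One way to address the concern above is to thin the curve points before logging them, so the plot carries at most a fixed number of points. The following is a minimal sketch under stated assumptions, not necessarily what the linked lines do: the helper name `thin_points`, the 1000-point cap, and the synthetic points are all hypothetical.

```python
import math


def thin_points(points, max_points=1000):
    """Keep every n-th point so that at most max_points remain.

    Slicing keeps the first point; the last point may be dropped.
    """
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    if not points:
        return []
    # Smallest step that brings the number of points under the cap.
    step = math.ceil(len(points) / max_points)
    return points[::step]


# Synthetic (recall, precision) pairs, made up for illustration only.
curve = [(i / 2499, 1 - i / 2499) for i in range(2500)]
thinned = thin_points(curve, max_points=1000)
print(len(curve), len(thinned))
```

Taking every n-th point is a rough estimate: it is cheap and keeps the overall shape of the curve, but it does not preserve every threshold.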

@shcheklein shcheklein force-pushed the dvclive-sklearn branch 3 times, most recently from 91976b8 to ef39113 Compare May 9, 2022 02:29
@shcheklein shcheklein changed the title get-started: Use dvclive get-started: more plots, dvclive, better dataset May 9, 2022
@@ -0,0 +1,42 @@
import io
Member

This script is not used in the project; it is used to generate the datasets from the raw 1TB SO dump.

@@ -17,7 +17,9 @@ popd

# Requires AWS CLI and write access to `s3://dvc-public/code/get-started/`.
mv $PACKAGE_DIR/$PACKAGE .
-aws s3 cp --acl public-read $PACKAGE s3://dvc-public/code/get-started/$PACKAGE
+#aws s3 cp --acl public-read $PACKAGE s3://dvc-public/code/get-started/$PACKAGE
Member

it will be reverted after iterative/dataset-registry#28

@@ -60,7 +60,7 @@ git tag -a "1-dvc-init" -m "DVC initialized."

mkdir data
dvc get https://github.com/iterative/dataset-registry \
-get-started/data.xml -o data/data.xml
+get-started/data.xml -o data/data.xml --rev 95d720c467496ea6c15dd2c5d5ad48bbb631d8b1
Member

will be reverted after iterative/dataset-registry#28

@@ -79,15 +79,16 @@ dvc push

rm data/data.xml data/data.xml.dvc
dvc import https://github.com/iterative/dataset-registry \
-get-started/data.xml -o data/data.xml \
+get-started/data.xml -o data/data.xml --rev 95d720c467496ea6c15dd2c5d5ad48bbb631d8b1 \
Member

will be reverted after iterative/dataset-registry#28

@shcheklein
Member

I've force-pushed multiple times in this repo. To make the branch name and description less confusing, I'm closing this in favor of #114.

@shcheklein shcheklein closed this May 9, 2022
@shcheklein shcheklein deleted the dvclive-sklearn branch May 12, 2022 01:22
3 participants