Get started: better dataset, dvclive, more plots #114

Merged: 9 commits merged into master from get-started-better-dataset on May 11, 2022

Conversation

shcheklein
Member

@shcheklein shcheklein commented May 9, 2022

Based on initial work done by @daavoo in this PR - #102

Depends on iterative/dataset-registry#28. A few things will be improved here when that PR lands.

Corresponding repo to play with: https://studio.iterative.ai/team/Iterative/views/example-get-started-zde16i6c4g

(a few lines in the PR will be uncommented / reverted before actual merge, after testing)

This PR changes the example-get-started project:

  • Uses a clean dataset (which is also 2.5x smaller)
  • Fewer features (3000 -> 100/200). Since the dataset is better, this is now enough, and the memory footprint should be much better (the prepared matrix is 4x smaller)
  • More plots: confusion matrix + forest analysis (most useful feature)
  • The CML report includes the confusion matrix
  • Two branches: more data and hyperparameter tuning
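For readers unfamiliar with the confusion matrix plot listed above, here is a minimal, dependency-free sketch of how such a matrix can be derived from true labels and per-class prediction scores. It is not the code in this PR: the helper names `argmax` and `confusion_matrix` and the sample data are hypothetical and serve only as illustration.

```python
from collections import Counter


def argmax(row):
    """Return the index of the largest value in a row of per-class scores."""
    return max(range(len(row)), key=row.__getitem__)


def confusion_matrix(labels, predicted, n_classes):
    """Build a matrix where matrix[i][j] counts samples of true class i
    that were predicted as class j."""
    counts = Counter(zip(labels, predicted))
    return [[counts[(i, j)] for j in range(n_classes)] for i in range(n_classes)]


# Hypothetical data: four samples, two classes.
labels = [0, 1, 1, 0]
predictions_by_class = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]

# Pick the most likely class for each sample, then count (true, predicted) pairs.
predicted = [argmax(row) for row in predictions_by_class]
matrix = confusion_matrix(labels, predicted, n_classes=2)
```

Rows correspond to true classes and columns to predicted classes, so off-diagonal cells are the misclassifications a plot of this matrix makes visible.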

@shcheklein shcheklein changed the title get started: better dataset, dvclive, more plots Get started: better dataset, dvclive, more plots May 9, 2022
@shcheklein
Member Author

Some things will be uncommented and reverted after the data registry update lands.

shcheklein and others added 4 commits May 9, 2022 15:01
Co-authored-by: David de la Iglesia Castro <daviddelaiglesiacastro@gmail.com>
Co-authored-by: David de la Iglesia Castro <daviddelaiglesiacastro@gmail.com>
@shcheklein shcheklein requested a review from daavoo May 10, 2022 05:10
live.log_plot("confusion_matrix", labels.squeeze(), predictions_by_class.argmax(-1))

# ... and finally, we can dump an image, it's also supported:
fig, axes = plt.subplots(dpi=100)
Contributor

@shcheklein I reduced the size here (from 800 to 100), as the image looked very disproportionate in dvc plots show.

Member Author

@daavoo yep, I think we should also handle this gracefully in dvc plots show. Images can be large after all, so we should normalize their sizes.

Member Author

and for VS Code it's better to have them bigger (you can click and zoom)
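To make the size trade-off in this thread concrete, the sketch below computes the pixel dimensions of a rendered figure and shows one possible way a viewer could normalize oversized images. It assumes the 800 and 100 values refer to the figure's dpi and that the figure uses Matplotlib's default size of 6.4 x 4.8 inches; the PR does not show a figsize. The helpers `pixel_size` and `fit_within` and the 1024-pixel limit are hypothetical and are not part of DVC or DVCLive.

```python
def pixel_size(figsize_inches, dpi):
    """Pixel dimensions of a figure: inches multiplied by dots per inch."""
    width_in, height_in = figsize_inches
    return round(width_in * dpi), round(height_in * dpi)


def fit_within(width, height, max_side):
    """Scale an image down so its longest side is at most max_side,
    preserving the aspect ratio. Smaller images are left untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)


# Assumption: Matplotlib's default figure size in inches.
DEFAULT_FIGSIZE = (6.4, 4.8)

small = pixel_size(DEFAULT_FIGSIZE, dpi=100)   # image at the reduced setting
large = pixel_size(DEFAULT_FIGSIZE, dpi=800)   # image at the original setting

# Hypothetical normalization a viewer could apply to oversized images.
normalized = fit_within(*large, max_side=1024)
```

Under these assumptions the original setting produces an image eight times larger in each dimension. Normalizing at display time would let the source image stay large for tools that support click-to-zoom while keeping it proportionate elsewhere.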

@@ -17,7 +17,9 @@ popd

# Requires AWS CLI and write access to `s3://dvc-public/code/get-started/`.
mv $PACKAGE_DIR/$PACKAGE .
aws s3 cp --acl public-read $PACKAGE s3://dvc-public/code/get-started/$PACKAGE
#aws s3 cp --acl public-read $PACKAGE s3://dvc-public/code/get-started/$PACKAGE

Is this intentionally left like this?

I get it for now, just making sure we don't forget to update it.

Member Author

yep, I'll remove it today before the merge

--desc "Imported raw data (tracks source updates)"
git add data/data.xml.dvc
tick
git commit -m "Import raw data (overwrite)"
git tag -a "4-import-data" -m "Data file overwritten with an import."
dvc push

wget https://code.dvc.org/get-started/code.zip
cp $HERE/code.zip .
#wget https://code.dvc.org/get-started/code.zip

Again, just making sure we don't forget to uncomment.

Member Author

👍

@shcheklein shcheklein merged commit d427984 into master May 11, 2022
@shcheklein shcheklein deleted the get-started-better-dataset branch May 11, 2022 22:29