
chore(dataflow): use n1 machine types for GPUs #6473


Merged


davidcavazos
Contributor

Description

Fixes #6436

Explicitly use N1 machine types for the GPU samples. It looks like the Dataflow service changed the default machine type to E2 shared-core machine types, which are not compatible with GPUs.

https://cloud.google.com/dataflow/docs/concepts/gpu-support#machine_types_specifications
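For reference, here is a minimal sketch of the kind of flags a GPU sample could pass so that it no longer relies on the service's default machine type. The helper `gpu_pipeline_args`, the placeholder project/region, and the default `n1-standard-4` / `nvidia-tesla-t4` choices are hypothetical and for illustration only; this is not the code changed in this PR. The flag names `--machine_type` and `--dataflow_service_options` (with the `worker_accelerator=...` value) follow Beam's worker options and the Dataflow GPU docs linked above. Verify them against the SDK version you use.

```python
from typing import List


def gpu_pipeline_args(
    project: str,
    region: str,
    machine_type: str = "n1-standard-4",
    gpu_type: str = "nvidia-tesla-t4",
    gpu_count: int = 1,
) -> List[str]:
    """Hypothetical helper: build Dataflow flags for a GPU job on an N1 worker."""
    # Assumption based on the linked docs at the time of this PR: GPUs attach
    # to N1 workers, so reject other families (e.g. E2 shared-core) up front.
    if not machine_type.startswith("n1-"):
        raise ValueError(f"GPU workers need an N1 machine type, got {machine_type!r}")
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")

    accelerator = (
        f"worker_accelerator=type:{gpu_type};count:{gpu_count};install-nvidia-driver"
    )
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        # Set the machine type explicitly instead of using the service default.
        f"--machine_type={machine_type}",
        f"--dataflow_service_options={accelerator}",
    ]
```

The resulting list can be passed to Beam's `PipelineOptions(...)`, which accepts command-line style flags. Failing fast on a non-N1 machine type surfaces the incompatibility locally, before the job is submitted.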


@davidcavazos davidcavazos requested a review from a team as a code owner July 26, 2021 18:40
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Jul 26, 2021
@product-auto-label product-auto-label bot added api: dataflow Issues related to the Dataflow API. samples Issues that are directly related to samples. labels Jul 26, 2021
@davidcavazos davidcavazos added the blunderbuss: assign Instruct blunderbuss to assign someone label Jul 26, 2021
@blunderbuss-gcf blunderbuss-gcf bot removed the blunderbuss: assign Instruct blunderbuss to assign someone label Jul 26, 2021
@davidcavazos
Contributor Author

One test was terminated by the ~30-minute timeout, but the actual Dataflow job succeeded. The other tests passed in around 27 minutes, so we might be hitting intermittent timeouts because building such large container images takes a long time.

@davidcavazos davidcavazos added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 26, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 26, 2021
@dandhlee dandhlee added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 26, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 26, 2021
@davidcavazos
Contributor Author

FYI: @leahecole this should fix some of the failing tests

Collaborator

@dandhlee dandhlee left a comment


LGTM, thank you!

@parthea parthea changed the title dataflow: use n1 machine types for GPUs chore(dataflow): use n1 machine types for GPUs Jul 30, 2021
@leahecole leahecole added the automerge Merge the pull request once unit tests and other checks pass. label Jul 30, 2021
@gcf-merge-on-green gcf-merge-on-green bot merged commit 1ed926d into GoogleCloudPlatform:master Jul 30, 2021
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Jul 30, 2021
@davidcavazos davidcavazos deleted the dataflow-gpu branch August 2, 2021 17:38
Labels
api: dataflow Issues related to the Dataflow API. cla: yes This human has signed the Contributor License Agreement. samples Issues that are directly related to samples.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dataflow.gpu-examples.tensorflow-minimal.e2e_test: test_tensorflow_minimal failed
5 participants