Skip to content

Data cleaning region tags #4785

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Oct 1, 2020
Merged

Data cleaning region tags #4785

merged 11 commits into from
Oct 1, 2020

Conversation

leahecole
Copy link
Collaborator

@leahecole leahecole commented Oct 1, 2020

Description

Adds region tags for the data cleaning section, removes the gender column from the dataframe. Link to doc in staging available if you ping me :)

Note: It's a good idea to open an issue first for discussion.

Checklist

@leahecole leahecole requested a review from bradmiro October 1, 2020 17:55
@leahecole leahecole requested a review from a team as a code owner October 1, 2020 17:55
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 1, 2020
@product-auto-label product-auto-label bot added the samples Issues that are directly related to samples. label Oct 1, 2020
@leahecole
Copy link
Collaborator Author

fixing lint now ugh 🙈

Comment on lines 154 to 155
temp_df = spark.read.format("bigquery").option("table", TABLE).load()
df = temp_df.drop("gender")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be more explicitly explained in the code why we are removing gender? Either way, I would move it outside of this try-block. It's also ok to do df = df.<operation> and not use the temp_df var name as this ends up happening in multiple other places in the script.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Np for moving it outside of try block and making that df = df.operation change. Reasoning for removal of the column is in the Putting it together section of the doc I sent you - do you think it also needs to be mentioned here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fair, maybe it should have its own region tag then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added region tag, but have to wait a few min before re-staging

@@ -109,7 +106,7 @@ def setup_and_teardown_cluster():
project_id=PROJECT_ID,
region=CLUSTER_REGION,
cluster_name=DATAPROC_CLUSTER,
timeout=300
timeout=300,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't hurt but you shouldn't need to do this. Were there timeout errors deleting the cluster?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover from Tushar - can remove if need be

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing is fine unless it somehow breaks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove next, just testing locally now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local tests passed, just pushed removal

@leahecole
Copy link
Collaborator Author

ugh, ofc these were passing locally. I'll figure out why they don't pass here!

@leahecole leahecole added the automerge Merge the pull request once unit tests and other checks pass. label Oct 1, 2020
@leahecole leahecole merged commit 89e67bc into master Oct 1, 2020
@leahecole leahecole deleted the data_cleaning_region_tags branch October 1, 2020 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge Merge the pull request once unit tests and other checks pass. cla: yes This human has signed the Contributor License Agreement. samples Issues that are directly related to samples.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants