-
Notifications
You must be signed in to change notification settings - Fork 6.5k
Data cleaning region tags #4785
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fixing lint now ugh 🙈 |
temp_df = spark.read.format("bigquery").option("table", TABLE).load() | ||
df = temp_df.drop("gender") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be more explicitly explained in the code why we are removing gender? Either way, I would move it outside of this try-block. It's also ok to do df = df.<operation>
and not use the temp_df
var name as this ends up happening in multiple other places in the script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Np for moving it outside of try block and making that df = df.operation change. Reasoning for removal of the column is in the Putting it together section of the doc I sent you - do you think it also needs to be mentioned here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fair, maybe it should have its own region tag then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added region tag, but have to wait a few min before re-staging
@@ -109,7 +106,7 @@ def setup_and_teardown_cluster(): | |||
project_id=PROJECT_ID, | |||
region=CLUSTER_REGION, | |||
cluster_name=DATAPROC_CLUSTER, | |||
timeout=300 | |||
timeout=300, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't hurt but you shouldn't need to do this. Were there timeout errors deleting the cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover from Tushar - can remove if need be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing is fine unless it somehow breaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove next, just testing locally now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
local tests passed, just pushed removal
ugh, ofc these were passing locally. I'll figure out why they don't pass here! |
Description
Adds region tags for the data cleaning section, removes the gender column from the dataframe. Link to doc in staging available if you ping me :)
Note: It's a good idea to open an issue first for discussion.
Checklist
nox -s py-3.6
(see Test Environment Setup)nox -s lint
(see Test Environment Setup)