refactor: Simplify join node definition #966
Conversation
Branch updated from ca59e4e to 31164aa.
 def relational_join(
     self,
     other: ArrayValue,
-    join_def: join_def.JoinDefinition,
+    conditions: typing.Tuple[typing.Tuple[str, str], ...] = (),
 ) -> ArrayValue:
Reviewer: Wikipedia calls these predicates, or more specifically "join predicates". That said, I do see that Google SQL calls these join conditions.
Note: we will eventually want to support more than just equality, such as geospatial join predicates (https://carto.com/blog/guide-to-spatial-joins-and-predicates-with-sql), so Tuple doesn't seem like the right type.
Author: Wikipedia uses the term "condition" plenty as well - it seems to be an accepted term. As for spatial predicates - can we leave those for later? I'm not sure yet how I would want to represent those. I'm sure we will have one or two more refactors by then as we move towards offset-based indexing.
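(A hedged illustration of the reviewer's point, in Python: a plain tuple of (left_id, right_id) pairs can only encode equality, whereas a predicate type could carry its operator. None of these names exist in bigframes; they are purely illustrative.)

```python
import dataclasses
import typing


# Hypothetical sketch of a richer join predicate; equality is just one operator.
@dataclasses.dataclass(frozen=True)
class JoinPredicate:
    left_id: str
    right_id: str
    op: str = "eq"  # a later version could allow e.g. "st_intersects"


def as_conditions(
    predicates: typing.Sequence[JoinPredicate],
) -> typing.Tuple[typing.Tuple[str, str], ...]:
    """Collapse general predicates to the equality-only (left_id, right_id) tuples."""
    assert all(p.op == "eq" for p in predicates)
    return tuple((p.left_id, p.right_id) for p in predicates)
```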
    )
    return ArrayValue(join_node)
    l_size = len(self.node.schema)
    l_mapping = {
Reviewer: I'm curious what the purpose of these mappings is. Could you give more explanation in a docstring, please?
A guess: is it so we don't actually have to explicitly rename the columns in the SQL compilation step? If so, would it be better to switch to some offset-based logic now instead of mapping strings?
Author: Callers used to provide the input_id -> output_id mapping themselves through the join_def. I'm slowly taking power away from callers to provide the internal ids, so instead of accepting mappings from the caller, this method now provides them to callers. I do want to eventually move to entirely offset-based column addressing, but it's a multi-step process.
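(A minimal sketch of that inversion of responsibility, assuming a simplified signature; the helper name, id format, and return shape below are illustrative, not the actual bigframes API.)

```python
from typing import Dict, Tuple


def relational_join_sketch(
    left_ids: Tuple[str, ...], right_ids: Tuple[str, ...]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    # The join assigns fresh internal ids itself; callers only receive the
    # input_id -> output_id mappings instead of supplying them via a join_def.
    l_mapping = {old: f"join_l_{i}" for i, old in enumerate(left_ids)}
    r_mapping = {old: f"join_r_{i}" for i, old in enumerate(right_ids)}
    return l_mapping, r_mapping


l_map, r_map = relational_join_sketch(("col_a", "col_b"), ("col_a",))
# Callers look up their post-join column ids through l_map / r_map.
```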
bigframes/core/blocks.py
         passthrough_columns=[*self.index_columns, offset_col],
     )
     index_aggregations = [
         (ex.UnaryAggregation(agg_ops.AnyValueOp(), ex.free_var(col_id)), col_id)
-        for col_id in [*self.index_columns]
+        for col_id in passthrough_cols[:-1]
Reviewer: Why every column except the last one? Could you add a comment here explaining, please?
Author: Added a comment. These correspond to the passthrough_columns argument in unpivot.
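(For intuition, a small sketch of how the two snippets above relate; the variable values are made up, and this is not the actual Block code.)

```python
# passthrough_cols mirrors [*self.index_columns, offset_col], so dropping the
# last element leaves exactly the index columns, which are the ones that
# receive an AnyValueOp aggregation; the trailing offset column is excluded.
index_columns = ["idx_0", "idx_1"]
offset_col = "offsets"
passthrough_cols = [*index_columns, offset_col]

index_aggregation_targets = passthrough_cols[:-1]
assert index_aggregation_targets == index_columns
```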
@@ -1604,7 +1596,7 @@ def promote_offsets(self, label: Label = None) -> typing.Tuple[Block, str]:
         Block(
             expr,
             index_columns=self.index_columns,
-            column_labels=self.column_labels.insert(0, label),
+            column_labels=self.column_labels.insert(len(self.column_labels), label),
Reviewer: Is it going to be a problem that the label moves from the start to the end?
Author: It caused a few issues, but they're all resolved now. I want nodes to add variables to the end, as that preserves the offsets of the existing variables - this will make some planned bfet transformations simpler.
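(A tiny illustration of why appending preserves offsets, using plain lists as a stand-in for a node's column layout; not actual bfet code.)

```python
# Appending a new variable keeps every existing column at its old offset,
# while inserting at the front shifts all of them by one.
existing = ["a", "b", "c"]          # offsets 0, 1, 2

appended = [*existing, "label"]     # "a", "b", "c" stay at offsets 0, 1, 2
prepended = ["label", *existing]    # "a", "b", "c" move to offsets 1, 2, 3

assert [appended.index(c) for c in existing] == [0, 1, 2]
assert [prepended.index(c) for c in existing] == [1, 2, 3]
```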
Branch updated from 14f94a1 to 1eb2e09.