refactor: Simplify join node definition #966

TrevorBergeron · 2024-09-05T23:33:38Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

tswast · 2024-09-10T20:31:20Z

bigframes/core/__init__.py

    def relational_join(
        self,
        other: ArrayValue,
-        join_def: join_def.JoinDefinition,
-    ) -> ArrayValue:
+        conditions: typing.Tuple[typing.Tuple[str, str], ...] = (),


Wikipedia calls these predicates, or more specifically "join predicates". That said, I do see Google SQL calls these join conditions.

Note: we will eventually want to support more than just equality, such as geospatial join predicates (https://carto.com/blog/guide-to-spatial-joins-and-predicates-with-sql), so Tuple doesn't seem like the right type.

Wikipedia uses the term "condition" plenty as well - seems to be an accepted term. As for spatial predicates - can we leave those for later? Not sure yet how I would want to represent those. I'm sure we will have one or two more refactors by then as we move towards offset-based indexing.
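
To make the shape of the new `conditions` argument concrete, here is a minimal sketch of equality-only join conditions expressed as pairs of column ids. The `join_rows` helper and the dict-based rows are hypothetical and for illustration only; this is not bigframes code.

```python
import typing

# Each condition pairs a left column id with a right column id (equality only).
Conditions = typing.Tuple[typing.Tuple[str, str], ...]
Row = typing.Dict[str, object]


def join_rows(
    left: typing.Sequence[Row],
    right: typing.Sequence[Row],
    conditions: Conditions = (),
) -> typing.List[typing.Tuple[Row, Row]]:
    """Hypothetical nested-loop inner join over equality conditions.

    In this sketch, an empty `conditions` tuple matches every pair of rows.
    """
    return [
        (l_row, r_row)
        for l_row in left
        for r_row in right
        if all(l_row[l_id] == r_row[r_id] for l_id, r_id in conditions)
    ]
```

A tuple of string pairs can only say "left column equals right column", which is why richer predicates such as spatial ones would need a different representation.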

tswast · 2024-09-10T20:35:53Z

bigframes/core/__init__.py

        )
-        return ArrayValue(join_node)
+        l_size = len(self.node.schema)
+        l_mapping = {


I'm curious what the purpose of these mappings is? Could you give more explanation in a docstring, please?

A guess: is it so we don't actually have to explicitly rename the columns in the SQL compilation step? If so, would it be better to switch to some offset-based logic now instead of mapping strings?

Callers used to provide the input_id->output_id mapping themselves through the join_def. I'm slowly taking power away from callers to provide the internal ids, so instead of accepting mappings from the caller, this method now provides them to callers. I do want to eventually move to entirely offset-based column addressing, but it's a multi-step process.
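
As a rough illustration of the idea described above, the sketch below has a join helper generate the output ids itself and hand the input-to-output mappings back to the caller. Only `l_size` and `l_mapping` appear in the diff; `r_mapping`, the `col_{i}` naming scheme, and the `make_join_mappings` function are assumptions made for this example.

```python
import typing


def make_join_mappings(
    left_ids: typing.Sequence[str],
    right_ids: typing.Sequence[str],
) -> typing.Tuple[typing.Dict[str, str], typing.Dict[str, str]]:
    """Hypothetical: assign output ids by position, left columns first."""
    l_size = len(left_ids)
    # Left columns keep their offsets in the joined output.
    l_mapping = {col_id: f"col_{i}" for i, col_id in enumerate(left_ids)}
    # Right columns follow, shifted by the number of left columns.
    r_mapping = {col_id: f"col_{i + l_size}" for i, col_id in enumerate(right_ids)}
    return l_mapping, r_mapping
```

Because the output ids are derived from positions, two inputs that share a column id (for example `x` on both sides) still receive distinct ids in the result.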

tswast · 2024-09-10T20:38:08Z

bigframes/core/blocks.py

                passthrough_columns=[*self.index_columns, offset_col],
            )
            index_aggregations = [
                (ex.UnaryAggregation(agg_ops.AnyValueOp(), ex.free_var(col_id)), col_id)
-                for col_id in [*self.index_columns]
+                for col_id in passthrough_cols[:-1]


Why every column except the last one? Could you have a comment here explaining, please?

Added comment. These correspond to the passthrough_columns argument in unpivot.
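
For readers following the slice, here is a small sketch of why `[:-1]` selects only the index columns, assuming `passthrough_cols` keeps the same ordering as the `passthrough_columns=[*self.index_columns, offset_col]` argument shown in the diff. The column ids are made up.

```python
index_columns = ["idx_0", "idx_1"]  # made-up index column ids
offset_col = "offsets"              # made-up offsets column id

# Same ordering as the passthrough_columns argument: index columns, then offsets.
passthrough_cols = [*index_columns, offset_col]

# Dropping the last entry leaves only the index columns.
index_only = passthrough_cols[:-1]
```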

tswast · 2024-09-10T20:47:16Z

bigframes/core/blocks.py

@@ -1604,7 +1596,7 @@ def promote_offsets(self, label: Label = None) -> typing.Tuple[Block, str]:
            Block(
                expr,
                index_columns=self.index_columns,
-                column_labels=self.column_labels.insert(0, label),
+                column_labels=self.column_labels.insert(len(self.column_labels), label),


Is it going to be a problem that the label moves from the start to the end?

It caused a few issues, but all are resolved now. I want nodes to add variables to the end, as this preserves the offsets of the existing variables, which will make some planned bfet transformations simpler.
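
The offset argument can be checked with a small sketch. It uses plain Python lists rather than the real `column_labels` object, and the `offsets` helper is hypothetical.

```python
import typing


def offsets(labels: typing.Sequence[str]) -> typing.Dict[str, int]:
    """Map each label to its position in the sequence."""
    return {label: i for i, label in enumerate(labels)}


labels = ["a", "b", "c"]
before = offsets(labels)

# Inserting at position 0 shifts every existing label by one.
prepended = offsets(["new", *labels])

# Appending leaves every existing label at its original offset.
appended = offsets([*labels, "new"])
```

Any code that refers to columns by position stays valid after an append, while a prepend would require every such reference to be rewritten.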

TrevorBergeron requested review from a team as code owners September 5, 2024 23:33

TrevorBergeron requested a review from tswast September 5, 2024 23:33

blunderbuss-gcf bot assigned arwas11 Sep 5, 2024

product-auto-label bot added the size: l and api: bigquery labels Sep 5, 2024

refactor: Simplify join node definition

31164aa

TrevorBergeron force-pushed the simplify_join branch from ca59e4e to 31164aa Compare September 6, 2024 19:07

TrevorBergeron added 4 commits September 6, 2024 23:12

fix inconsistency between PromoteOffsets node def and compiler

430d0ac

Merge remote-tracking branch 'github/main' into simplify_join

ded41b7

fix issue with block promote_offsets labels

3ccd815

fix one more promote offsets issue

2d4e7cf

arwas11 removed their assignment Sep 9, 2024

tswast reviewed Sep 10, 2024

TrevorBergeron added 2 commits September 10, 2024 23:20

Merge remote-tracking branch 'github/main' into simplify_join

94508d5

explain unpivot passthrough cols

1eb2e09

TrevorBergeron force-pushed the simplify_join branch from 14f94a1 to 1eb2e09 Compare September 10, 2024 23:52

TrevorBergeron requested a review from tswast September 10, 2024 23:59

Merge branch 'main' into simplify_join

9f459dc

tswast approved these changes Sep 11, 2024

TrevorBergeron merged commit 3a4a9de into main Sep 11, 2024
22 of 23 checks passed

TrevorBergeron deleted the simplify_join branch September 11, 2024 20:35
