Skip to content

Feat: Update partitioning by DATE, DATETIME, TIMESTAMP, _PARTITIONDATE #1113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Nov 22, 2024

Conversation

chalmerlowe
Copy link
Collaborator

@chalmerlowe chalmerlowe commented Sep 11, 2024

This adds some additional functionality to properly handle partitioning of columns with the following datatypes.

  • DATE
  • TIMESTAMP
  • DATETIME
  • _PARTITIONDATE

Where appropriate, ensures the following functions can be used with/or without the following TimePartitioningTypes (HOUR, DAY, MONTH, YEAR).

  • DATE_TRUNC()
  • TIMESTAMP_TRUNC()
  • DATETIME_TRUNC()
  • DATE()

This is a nearly complete fix for #1072. A revised table from #1072 is included here:
NOTES:

  • This PR does not handle _PARTITIONTIME
  • If a field (column) is not provided to partition on, the system defaults to the pseudocolumn "_PARTITIONDATE" to partition by. (If you attempt to provide a partitioning period anyway BQ does not accept HOUR and even if you include DAY, MONTH, or YEAR, the system will ignore those in deference to the _PARTITIONDATE).
Column Data Type HOUR DAY MONTH YEAR Special case: None
DATE N/A N/A via DATE_TRUNC via DATE_TRUNC N/A
DATETIME via DATETIME_TRUNC via DATETIME_TRUNC via DATETIME_TRUNC via DATETIME_TRUNC via DATETIME_TRUNC (defaults to DAY)
TIMESTAMP via TIMESTAMP_TRUNC via TIMESTAMP_TRUNC via TIMESTAMP_TRUNC via TIMESTAMP_TRUNC N/A
_PARTITIONDATE N/A N/A N/A N/A N/A
_PARTITIONTIME Not currently implemented TODO: USE TIMESTAMP_TRUNC Not currently implemented TODO: USE TIMESTAMP_TRUNC Not currently implemented TODO: USE TIMESTAMP_TRUNC Not currently implemented TODO: USE TIMESTAMP_TRUNC Need to lookup

Copy link

conventional-commit-lint-gcf bot commented Sep 11, 2024

🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use Github's automerge feature with squashing, or use automerge label. Good luck human!

-- conventional-commit-lint bot
https://conventionalcommits.org/

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. labels Sep 11, 2024
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Nov 11, 2024
@chalmerlowe chalmerlowe added kokoro:run Add this label to force Kokoro to re-run the tests. owlbot:run Add this label to trigger the Owlbot post processor. labels Nov 13, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 13, 2024
@@ -132,83 +188,89 @@ def test_table_require_partition_filter_dialect_option(faux_conn):
bigquery_require_partition_filter=True,
)

assert " ".join(faux_conn.test_data["execute"][-1][0].strip().split()) == (
result = " ".join(faux_conn.test_data["execute"][-1][0].strip().split())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

It seems this test is covered in test_table_time_partitioning_date_timestamp_and_datetime_dialect_option?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deduplicated.

# is DAY. This case overwrites that to avoid making a breaking change in
# python-bigquery.
# https://github.com/googleapis/python-bigquery/blob/a4d9534a900f13ae7355904cda05097d781f27e3/google/cloud/bigquery/table.py#L2916
if trunc_fn == "DATE_TRUNC" and partitioning_period == "DAY":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this would be added when you complete the draft, I think we need to set the default value of partitioning_period somewhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Linchin

I am unclear on why we might need to create a default value of partitioning_period. It is set based on the value of time_partitioning.type_ on line 873.

time_partitioning.type_ is set in the python-bigquery library in the TimePartitioning class.

When that class is instantiated, the default for type_ in the __init__ is None but it is then processed and immediately overwritten by DAY as shown at the link above.

So by the time .type_ gets to us, it will either be DAY or will be set to something else that we use to set the variable named partitioning_period.

NOTE: I rename the variable to partitioning_period simply to make this code more readable. type_ seemed less intuitive than partitioning_period.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I missed the part in __init__() that sets the default value.

renovate-bot and others added 4 commits November 21, 2024 16:02
* chore(deps): update all dependencies

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* Update protobuf to 5.28.3

* pin google-crc32c for python 3.7/3.8

* Pin mako for python 3.8

* Pin markupsafe for Python 3.8

* Pin pyparsing for python 3.8

* Pin pyparsing for Python 3.8

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Anthonios Partheniou <partheniou@google.com>
* chore(deps): update all dependencies

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
@yoshi-kokoro yoshi-kokoro removed the kokoro:run Add this label to force Kokoro to re-run the tests. label Nov 21, 2024
@chalmerlowe chalmerlowe marked this pull request as ready for review November 21, 2024 20:19
@chalmerlowe chalmerlowe requested review from a team as code owners November 21, 2024 20:19
@chalmerlowe
Copy link
Collaborator Author

@Linchin Please take a look. I have taken this out of draft. As far as I recall, this should be finished and ready for final review/approval.

@Linchin Linchin added the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 21, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 21, 2024
assert " ".join(faux_conn.test_data["execute"][-1][0].strip().split()) == (
"CREATE TABLE `some_table` ( `id` INT64, `createdAt` TIMESTAMP )"
" PARTITION BY TIMESTAMP_TRUNC(createdAt, DAY)"
result = " ".join(faux_conn.test_data["execute"][-1][0].strip().split())
Copy link
Contributor

@Linchin Linchin Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the second half of this test does the following, as the docstring suggests:

Confirms that if the column datatype is DATETIME but no TimePartitioning.type_
    has been supplied, the system will default to DAY.

Maybe I'm missing something, but I think it's a duplicate of the test case at line 136.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between this test and the parametrized test above is minor.

Essentially, in the parameters on line 136 we explicitly provide a value of None as a type_ argument to the TimePartitioning class.

We know the default is None in python-bigquery but we set the value deliberately.

In line 193, we do not set the value of type_ explicitly, we allow the default value (which should be None) come through and thus confirm that everything works as expected whether we provide a value to type_ or not.

assert " ".join(faux_conn.test_data["execute"][-1][0].strip().split()) == (
"CREATE TABLE `some_table_2` ( `id` INT64, `createdAt` DATE )"
" PARTITION BY createdAt"
result = " ".join(faux_conn.test_data["execute"][-1][0].strip().split())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test, we do NOT provide a field argument to the TimePartitioning class whereas in the tests above, we do.

This is needed to confirm that in the absence of a field argument the SQL statement will not use a function, etc but will instead use _PARTITIONDATE

Both the tests above end up including a truncation function but this on does not (and should not).

@Linchin Linchin requested review from Linchin and suzmue November 21, 2024 23:02
Copy link
Contributor

@Linchin Linchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

Two small suggestions, could we:

  1. rename the tests to reflect what they are intended to test
  2. the table in the first comment seems a little outdated, could we update it?

@chalmerlowe chalmerlowe merged commit 413cd24 into main Nov 22, 2024
16 checks passed
@chalmerlowe chalmerlowe deleted the add-partitioning-support branch November 22, 2024 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants