diff --git a/CHANGELOG.md b/CHANGELOG.md index 23b3e8033d..c36d024ce7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,35 @@ [1]: https://pypi.org/project/bigframes/#history +## [2.15.0](https://github.com/googleapis/python-bigquery-dataframes/compare/v2.14.0...v2.15.0) (2025-08-11) + + +### Features + +* Add `st_buffer`, `st_centroid`, and `st_convexhull` and their corresponding GeoSeries methods ([#1963](https://github.com/googleapis/python-bigquery-dataframes/issues/1963)) ([c4c7fa5](https://github.com/googleapis/python-bigquery-dataframes/commit/c4c7fa578e135e7f0e31ad3063db379514957acc)) +* Add first, last support to GroupBy ([#1969](https://github.com/googleapis/python-bigquery-dataframes/issues/1969)) ([41dda88](https://github.com/googleapis/python-bigquery-dataframes/commit/41dda889860c0ed8ca2eab81b34a9d71372c69f7)) +* Add value_counts to GroupBy classes ([#1974](https://github.com/googleapis/python-bigquery-dataframes/issues/1974)) ([82175a4](https://github.com/googleapis/python-bigquery-dataframes/commit/82175a4d0fa41d8aee11efdf8778a21bb70b1c0f)) +* Allow callable as a conditional or replacement input in DataFrame.where ([#1971](https://github.com/googleapis/python-bigquery-dataframes/issues/1971)) ([a8d57d2](https://github.com/googleapis/python-bigquery-dataframes/commit/a8d57d2f7075158eff69ec65a14c232756ab72a6)) +* Can cast locally in hybrid engine ([#1944](https://github.com/googleapis/python-bigquery-dataframes/issues/1944)) ([d9bc4a5](https://github.com/googleapis/python-bigquery-dataframes/commit/d9bc4a5940e9930d5e3c3bfffdadd2f91f96b53b)) +* Df.join lsuffix and rsuffix support ([#1857](https://github.com/googleapis/python-bigquery-dataframes/issues/1857)) ([26515c3](https://github.com/googleapis/python-bigquery-dataframes/commit/26515c34c4f0a5e4602d2f59bf229d41e0fc9196)) + + +### Bug Fixes + +* Add warnings for duplicated or conflicting type hints in bigfram… ([#1956](https://github.com/googleapis/python-bigquery-dataframes/issues/1956)) ([d38e42c](https://github.com/googleapis/python-bigquery-dataframes/commit/d38e42ce689e65f57223e9a8b14c4262cba08966)) +* Make `remote_function` more robust when there are `create_function` retries ([#1973](https://github.com/googleapis/python-bigquery-dataframes/issues/1973)) ([cd954ac](https://github.com/googleapis/python-bigquery-dataframes/commit/cd954ac07ad5e5820a20b941d3c6cab7cfcc1f29)) +* Make ExecutionMetrics stats tracking more robust to missing stats ([#1977](https://github.com/googleapis/python-bigquery-dataframes/issues/1977)) ([feb3ff4](https://github.com/googleapis/python-bigquery-dataframes/commit/feb3ff4b543eb8acbf6adf335b67a266a1cf4297)) + + +### Performance Improvements + +* Remove an unnecessary extra `dry_run` query from `read_gbq_table` ([#1972](https://github.com/googleapis/python-bigquery-dataframes/issues/1972)) ([d17b711](https://github.com/googleapis/python-bigquery-dataframes/commit/d17b711750d281ef3efd42c160f3784cd60021ae)) + + +### Documentation + +* Divide BQ DataFrames quickstart code cell ([#1975](https://github.com/googleapis/python-bigquery-dataframes/issues/1975)) ([fedb8f2](https://github.com/googleapis/python-bigquery-dataframes/commit/fedb8f23120aa315c7e9dd6f1bf1255ccf1ebc48)) + ## [2.14.0](https://github.com/googleapis/python-bigquery-dataframes/compare/v2.13.0...v2.14.0) (2025-08-05) diff --git a/GEMINI.md b/GEMINI.md new file mode 100644 index 0000000000..d26a51ebfc --- /dev/null +++ b/GEMINI.md @@ -0,0 +1,147 @@ +# Contribution guidelines, tailored for LLM agents + +## Testing + +We use `nox` 
to run our tests.

- To test your changes, run unit tests with `nox`:

  ```bash
  nox -r -s unit
  ```

- To run a single unit test:

  ```bash
  nox -r -s unit-3.13 -- -k <test_name>
  ```

- To run system tests:

  ```bash
  # Run all system tests
  nox -r -s system

  # Run a single system test
  nox -r -s system-3.13 -- -k <test_name>
  ```

- After each change, the codebase must have better test coverage than it had
  previously. You can check coverage via `nox -s unit system cover` (takes a
  long time).

## Code Style

- We use the automatic code formatter `black`, which eliminates many lint
  errors. Run it via the `format` nox session:

  ```bash
  nox -r -s format
  ```

- PEP8 compliance is required, with exceptions defined in the linter
  configuration. If you have `nox` installed, you can test that you have not
  introduced any non-compliant code via:

  ```bash
  nox -r -s lint
  ```

- When writing tests, use the idiomatic "pytest" style.

## Documentation

If a method or property implements the same interface as a third-party
package such as pandas or scikit-learn, place the relevant docstring in the
corresponding `third_party/bigframes_vendored/package_name` directory, not in
the `bigframes` directory. The implementation itself, however, may live in
the `bigframes` directory.

### Testing code samples

Code samples are very important for accurate documentation. We use the
"doctest" framework to ensure the samples work as expected. After adding a
code sample, verify that it is correct by running doctest. To run the
doctests for a single method, refer to the following example:

```bash
pytest --doctest-modules bigframes/pandas/__init__.py::bigframes.pandas.cut
```

## Tips for implementing common BigFrames features

### Adding a scalar operator

For an example, see commit
[c5b7fdae74a22e581f7705bc0cf5390e928f4425](https://github.com/googleapis/python-bigquery-dataframes/commit/c5b7fdae74a22e581f7705bc0cf5390e928f4425).

To add a new scalar operator, follow these steps:

1. **Define the operation dataclass:**
   - In `bigframes/operations/`, find the relevant file (e.g., `geo_ops.py` for geography functions) or create a new one.
   - Create a new dataclass inheriting from `base_ops.UnaryOp` for unary
     operators, `base_ops.BinaryOp` for binary operators, `base_ops.TernaryOp`
     for ternary operators, or `base_ops.NaryOp` for operators with many
     arguments. Note that these categories count the number of column-like
     arguments: a function that takes only a single column but several literal
     values is still a `UnaryOp`.
   - Define the `name` of the operation and any parameters it requires.
   - Implement the `output_type` method to specify the data type of the result.
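
   For reference, here is a minimal sketch of the kind of dataclass this step
   describes, using a hypothetical `st_foo` operator. These names are
   illustrative only, not part of the codebase; compare the real geography ops
   in `bigframes/operations/geo_ops.py`.

   ```python
   import dataclasses
   import typing

   from bigframes import dtypes
   from bigframes.operations import base_ops


   @dataclasses.dataclass(frozen=True)
   class GeoStFooOp(base_ops.UnaryOp):
       """Hypothetical unary geography operator with one literal parameter."""

       name: typing.ClassVar[str] = "geo_st_foo"

       # Literal parameters live on the op object; they do not count toward
       # the op's arity, so this is still a UnaryOp.
       some_parameter: float = 1.0

       def output_type(self, *input_types: dtypes.ExpressionType) -> dtypes.ExpressionType:
           return dtypes.GEO_DTYPE
   ```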
2. **Export the new operation:**
   - In `bigframes/operations/__init__.py`, import your new operation dataclass and add it to the `__all__` list.

3. **Implement the user-facing function (pandas-like):**

   - Identify the canonical function from pandas / geopandas / awkward array /
     another popular Python package that this operator implements.
   - Find the corresponding class in BigFrames. For example, the implementation
     for most geopandas.GeoSeries methods is in
     `bigframes/geopandas/geoseries.py`. Pandas Series methods are implemented
     in `bigframes/series.py` or one of the accessors, such as `StringMethods`
     in `bigframes/operations/strings.py`.
   - Create the user-facing function that will be called by users (e.g., `length`).
   - If the SQL method differs from pandas or geopandas in a way that can't be
     made the same, raise a `NotImplementedError` with an appropriate message and
     a link to the feedback form.
   - Add the docstring to the corresponding file in
     `third_party/bigframes_vendored`, modeled after pandas / geopandas.

4. **Implement the user-facing function (SQL-like):**

   - In `bigframes/bigquery/_operations/`, find the relevant file (e.g., `geo.py`) or create a new one.
   - Create the user-facing function that will be called by users (e.g., `st_length`).
   - This function should take a `Series` for any column-like inputs, plus any other parameters.
   - Inside the function, call `series._apply_unary_op`,
     `series._apply_binary_op`, or similar, passing the operation dataclass you
     created.
   - Add a comprehensive docstring with examples.
   - In `bigframes/bigquery/__init__.py`, import your new user-facing function and add it to the `__all__` list.

5. **Implement the compilation logic:**
   - In `bigframes/core/compile/ibis_compiler/scalar_op_registry.py`:
     - If the BigQuery function has a direct equivalent in Ibis, you can often reuse an existing Ibis method.
     - If not, define a new Ibis UDF using `@ibis_udf.scalar.builtin` to map to the specific BigQuery function signature.
   - Create a new compiler implementation function (e.g., `geo_length_op_impl`).
   - Register this function for your operation dataclass using `@scalar_op_compiler.register_unary_op` or `@scalar_op_compiler.register_binary_op`.
   - This implementation translates the BigQuery DataFrames operation into the
     appropriate Ibis expression. (A condensed sketch of steps 4 and 5 appears
     after step 6.)

6. **Add tests:**
   - Add system tests in the `tests/system/` directory to verify the end-to-end
     functionality of the new operator. Test various inputs, including edge cases
     and `NULL` values.

     Where possible, run the same test code against pandas or GeoPandas and
     verify that the outputs match (except for dtypes where BigFrames differs
     from pandas).
   - If you are overriding a pandas or GeoPandas property, add a unit test to
     ensure the correct behavior (e.g., raising `NotImplementedError` if the
     functionality is not supported).
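
Putting steps 4 and 5 together, here is a condensed sketch for the
hypothetical `st_foo` operator from the step 1 sketch above. It is
illustrative only; the `st_buffer` changes elsewhere in this PR show the
complete, working pattern. It assumes `GeoStFooOp` is exported as
`ops.GeoStFooOp` (step 2) and uses the same module aliases (`ibis_dtypes`,
`ibis_udf`, `scalar_op_compiler`) as the real compiler files.

```python
# Step 4 (bigframes/bigquery/_operations/geo.py): the SQL-like wrapper.
def st_foo(series, some_parameter: float = 1.0):
    """Add a comprehensive docstring with examples here (model it on st_buffer)."""
    series = series._apply_unary_op(ops.GeoStFooOp(some_parameter=some_parameter))
    series.name = None
    return series


# Step 5 (bigframes/core/compile/ibis_compiler/scalar_op_registry.py): an Ibis
# UDF whose function name maps to the hypothetical BigQuery ST_FOO signature...
@ibis_udf.scalar.builtin
def st_foo(geography: ibis_dtypes.geography, some_parameter: ibis_dtypes.Float64) -> ibis_dtypes.geography:  # type: ignore
    """ST_FOO"""


# ...and the compiler implementation, registered on the operation dataclass.
@scalar_op_compiler.register_unary_op(ops.GeoStFooOp, pass_op=True)
def geo_st_foo_op_impl(x, op: ops.GeoStFooOp):
    return st_foo(x, op.some_parameter)
```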
## Constraints

- Only add git commits. Do not change git history.
- Follow the spec file for development.
  - Check off items in the "Acceptance criteria" and "Detailed steps"
    sections with `[x]` as they are completed.
  - Refer back to the spec after each step.
diff --git a/bigframes/_importing.py b/bigframes/_importing.py
index 095a1d9c51..e88bd77fe8 100644
--- a/bigframes/_importing.py
+++ b/bigframes/_importing.py
@@ -14,6 +14,7 @@
 import importlib
 from types import ModuleType
 
+import numpy
 from packaging import version
 
 # Keep this in sync with setup.py
@@ -22,9 +23,13 @@
 def import_polars() -> ModuleType:
     polars_module = importlib.import_module("polars")
 
-    imported_version = version.Version(polars_module.build_info()["version"])
-    if imported_version < POLARS_MIN_VERSION:
+    # Check for necessary methods instead of the version number because we
+    # can't trust the polars version until
+    # https://github.com/pola-rs/polars/issues/23940 is fixed.
+    try:
+        polars_module.lit(numpy.int64(100), dtype=polars_module.Int64())
+    except TypeError:
         raise ImportError(
-            f"Imported polars version: {imported_version} is below the minimum version: {POLARS_MIN_VERSION}"
+            f"Imported polars version is likely below the minimum version: {POLARS_MIN_VERSION}"
         )
     return polars_module
diff --git a/bigframes/bigquery/__init__.py b/bigframes/bigquery/__init__.py
index 7ca7fb693b..dbaea57005 100644
--- a/bigframes/bigquery/__init__.py
+++ b/bigframes/bigquery/__init__.py
@@ -29,6 +29,9 @@
 )
 from bigframes.bigquery._operations.geo import (
     st_area,
+    st_buffer,
+    st_centroid,
+    st_convexhull,
     st_difference,
     st_distance,
     st_intersection,
@@ -54,11 +57,18 @@
     # approximate aggregate ops
     "approx_top_count",
     # array ops
-    "array_length",
     "array_agg",
+    "array_length",
     "array_to_string",
+    # datetime ops
+    "unix_micros",
+    "unix_millis",
+    "unix_seconds",
     # geo ops
     "st_area",
+    "st_buffer",
+    "st_centroid",
+    "st_convexhull",
     "st_difference",
     "st_distance",
     "st_intersection",
@@ -81,8 +91,4 @@
     "sql_scalar",
     # struct ops
     "struct",
-    # datetime ops
-    "unix_micros",
-    "unix_millis",
-    "unix_seconds",
 ]
diff --git a/bigframes/bigquery/_operations/geo.py b/bigframes/bigquery/_operations/geo.py
index bdc85eed9f..9a92a8960d 100644
--- a/bigframes/bigquery/_operations/geo.py
+++ b/bigframes/bigquery/_operations/geo.py
@@ -103,6 +103,187 @@ def st_area(
     return series
 
 
+def st_buffer(
+    series: Union[bigframes.series.Series, bigframes.geopandas.GeoSeries],
+    buffer_radius: float,
+    num_seg_quarter_circle: float = 8.0,
+    use_spheroid: bool = False,
+) -> bigframes.series.Series:
+    """
+    Computes a `GEOGRAPHY` that represents all points whose distance from the
+    input `GEOGRAPHY` is less than or equal to `buffer_radius` meters.
+
+    .. note::
+        BigQuery's Geography functions, like `st_buffer`, interpret the geometry
+        data type as a point set on the Earth's surface. A point set is a set
+        of points, lines, and polygons on the WGS84 reference spheroid, with
+        geodesic edges. See: https://cloud.google.com/bigquery/docs/geospatial-data
+
+    **Examples:**
+
+        >>> import bigframes.geopandas
+        >>> import bigframes.pandas as bpd
+        >>> import bigframes.bigquery as bbq
+        >>> from shapely.geometry import Point
+        >>> bpd.options.display.progress_bar = None
+
+        >>> series = bigframes.geopandas.GeoSeries(
+        ...     [
+        ...         Point(0, 0),
+        ...         Point(1, 1),
+        ...     ]
+        ... )
+        >>> series
+        0    POINT (0 0)
+        1    POINT (1 1)
+        dtype: geometry
+
+        >>> buffer = bbq.st_buffer(series, 100)
+        >>> bbq.st_area(buffer) > 0
+        0    True
+        1    True
+        dtype: boolean
+
+    Args:
+        series (bigframes.pandas.Series | bigframes.geopandas.GeoSeries):
+            A series containing geography objects.
+        buffer_radius (float):
+            The radius of the buffer in meters.
+        num_seg_quarter_circle (float, optional):
+            Specifies the number of segments that are used to approximate a
+            quarter circle. The default value is 8.0.
+        use_spheroid (bool, optional):
+            Determines how this function measures distance. If use_spheroid is
+            FALSE, the function measures distance on the surface of a perfect
+            sphere. The use_spheroid parameter currently only supports the
+            value FALSE. The default value of use_spheroid is FALSE.
+
+    Returns:
+        bigframes.pandas.Series:
+            A series of geography objects representing the buffered geometries.
+ """ + op = ops.GeoStBufferOp( + buffer_radius=buffer_radius, + num_seg_quarter_circle=num_seg_quarter_circle, + use_spheroid=use_spheroid, + ) + series = series._apply_unary_op(op) + series.name = None + return series + + +def st_centroid( + series: Union[bigframes.series.Series, bigframes.geopandas.GeoSeries], +) -> bigframes.series.Series: + """ + Computes the geometric centroid of a `GEOGRAPHY` type. + + For `POINT` and `MULTIPOINT` types, this is the arithmetic mean of the + input coordinates. For `LINESTRING` and `POLYGON` types, this is the + center of mass. For `GEOMETRYCOLLECTION` types, this is the center of + mass of the collection's elements. + + .. note:: + BigQuery's Geography functions, like `st_centroid`, interpret the geometry + data type as a point set on the Earth's surface. A point set is a set + of points, lines, and polygons on the WGS84 reference spheroid, with + geodesic edges. See: https://cloud.google.com/bigquery/docs/geospatial-data + + **Examples:** + + >>> import bigframes.geopandas + >>> import bigframes.pandas as bpd + >>> import bigframes.bigquery as bbq + >>> from shapely.geometry import Polygon, LineString, Point + >>> bpd.options.display.progress_bar = None + + >>> series = bigframes.geopandas.GeoSeries( + ... [ + ... Polygon([(0.0, 0.0), (0.1, 0.1), (0.0, 0.1)]), + ... LineString([(0, 0), (1, 1), (0, 1)]), + ... Point(0, 1), + ... ] + ... ) + >>> series + 0 POLYGON ((0 0, 0.1 0.1, 0 0.1, 0 0)) + 1 LINESTRING (0 0, 1 1, 0 1) + 2 POINT (0 1) + dtype: geometry + + >>> bbq.st_centroid(series) + 0 POINT (0.03333 0.06667) + 1 POINT (0.49998 0.70712) + 2 POINT (0 1) + dtype: geometry + + Args: + series (bigframes.pandas.Series | bigframes.geopandas.GeoSeries): + A series containing geography objects. + + Returns: + bigframes.pandas.Series: + A series of geography objects representing the centroids. + """ + series = series._apply_unary_op(ops.geo_st_centroid_op) + series.name = None + return series + + +def st_convexhull( + series: Union[bigframes.series.Series, bigframes.geopandas.GeoSeries], +) -> bigframes.series.Series: + """ + Computes the convex hull of a `GEOGRAPHY` type. + + The convex hull is the smallest convex set that contains all of the + points in the input `GEOGRAPHY`. + + .. note:: + BigQuery's Geography functions, like `st_convexhull`, interpret the geometry + data type as a point set on the Earth's surface. A point set is a set + of points, lines, and polygons on the WGS84 reference spheroid, with + geodesic edges. See: https://cloud.google.com/bigquery/docs/geospatial-data + + **Examples:** + + >>> import bigframes.geopandas + >>> import bigframes.pandas as bpd + >>> import bigframes.bigquery as bbq + >>> from shapely.geometry import Polygon, LineString, Point + >>> bpd.options.display.progress_bar = None + + >>> series = bigframes.geopandas.GeoSeries( + ... [ + ... Polygon([(0.0, 0.0), (0.1, 0.1), (0.0, 0.1)]), + ... LineString([(0, 0), (1, 1), (0, 1)]), + ... Point(0, 1), + ... ] + ... ) + >>> series + 0 POLYGON ((0 0, 0.1 0.1, 0 0.1, 0 0)) + 1 LINESTRING (0 0, 1 1, 0 1) + 2 POINT (0 1) + dtype: geometry + + >>> bbq.st_convexhull(series) + 0 POLYGON ((0 0, 0.1 0.1, 0 0.1, 0 0)) + 1 POLYGON ((0 0, 1 1, 0 1, 0 0)) + 2 POINT (0 1) + dtype: geometry + + Args: + series (bigframes.pandas.Series | bigframes.geopandas.GeoSeries): + A series containing geography objects. + + Returns: + bigframes.pandas.Series: + A series of geography objects representing the convex hulls. 
+ """ + series = series._apply_unary_op(ops.geo_st_convexhull_op) + series.name = None + return series + + def st_difference( series: Union[bigframes.series.Series, bigframes.geopandas.GeoSeries], other: Union[ diff --git a/bigframes/core/block_transforms.py b/bigframes/core/block_transforms.py index cb7c1923cf..465728b0ef 100644 --- a/bigframes/core/block_transforms.py +++ b/bigframes/core/block_transforms.py @@ -355,24 +355,28 @@ def value_counts( normalize: bool = False, sort: bool = True, ascending: bool = False, - dropna: bool = True, + drop_na: bool = True, + grouping_keys: typing.Sequence[str] = (), ): - block, dummy = block.create_constant(1) + if grouping_keys and drop_na: + # only need this if grouping_keys is involved, otherwise the drop_na in the aggregation will handle it for us + block = dropna(block, columns, how="any") block, agg_ids = block.aggregate( - by_column_ids=columns, - aggregations=[ex.UnaryAggregation(agg_ops.count_op, ex.deref(dummy))], - dropna=dropna, + by_column_ids=(*grouping_keys, *columns), + aggregations=[ex.NullaryAggregation(agg_ops.size_op)], + dropna=drop_na and not grouping_keys, ) count_id = agg_ids[0] if normalize: - unbound_window = windows.unbound() + unbound_window = windows.unbound(grouping_keys=tuple(grouping_keys)) block, total_count_id = block.apply_window_op( count_id, agg_ops.sum_op, unbound_window ) block, count_id = block.apply_binary_op(count_id, total_count_id, ops.div_op) if sort: - block = block.order_by( + order_parts = [ordering.ascending_over(id) for id in grouping_keys] + order_parts.extend( [ ordering.OrderingExpression( ex.deref(count_id), @@ -382,6 +386,7 @@ def value_counts( ) ] ) + block = block.order_by(order_parts) return block.select_column(count_id).with_column_labels( ["proportion" if normalize else "count"] ) diff --git a/bigframes/core/compile/__init__.py b/bigframes/core/compile/__init__.py index e2487306ab..68c36df288 100644 --- a/bigframes/core/compile/__init__.py +++ b/bigframes/core/compile/__init__.py @@ -14,8 +14,8 @@ from __future__ import annotations from bigframes.core.compile.api import test_only_ibis_inferred_schema -from bigframes.core.compile.compiler import compile_sql from bigframes.core.compile.configs import CompileRequest, CompileResult +from bigframes.core.compile.ibis_compiler.ibis_compiler import compile_sql __all__ = [ "test_only_ibis_inferred_schema", diff --git a/bigframes/core/compile/api.py b/bigframes/core/compile/api.py index ddd8622327..3a4695c50d 100644 --- a/bigframes/core/compile/api.py +++ b/bigframes/core/compile/api.py @@ -16,7 +16,7 @@ from typing import TYPE_CHECKING from bigframes.core import rewrite -from bigframes.core.compile import compiler +from bigframes.core.compile.ibis_compiler import ibis_compiler if TYPE_CHECKING: import bigframes.core.nodes @@ -26,9 +26,9 @@ def test_only_ibis_inferred_schema(node: bigframes.core.nodes.BigFrameNode): """Use only for testing paths to ensure ibis inferred schema does not diverge from bigframes inferred schema.""" import bigframes.core.schema - node = compiler._replace_unsupported_ops(node) + node = ibis_compiler._replace_unsupported_ops(node) node = rewrite.bake_order(node) - ir = compiler.compile_node(node) + ir = ibis_compiler.compile_node(node) items = tuple( bigframes.core.schema.SchemaItem(name, ir.get_column_type(ibis_id)) for name, ibis_id in zip(node.schema.names, ir.column_ids) diff --git a/bigframes/core/compile/compiled.py b/bigframes/core/compile/compiled.py index 314b54fc6d..f7de5c051a 100644 --- 
a/bigframes/core/compile/compiled.py +++ b/bigframes/core/compile/compiled.py @@ -30,11 +30,10 @@ import pyarrow as pa from bigframes.core import utils -import bigframes.core.compile.aggregate_compiler as agg_compiler import bigframes.core.compile.googlesql +import bigframes.core.compile.ibis_compiler.aggregate_compiler as agg_compiler +import bigframes.core.compile.ibis_compiler.scalar_op_compiler as op_compilers import bigframes.core.compile.ibis_types -import bigframes.core.compile.scalar_op_compiler as op_compilers -import bigframes.core.compile.scalar_op_compiler as scalar_op_compiler import bigframes.core.expression as ex from bigframes.core.ordering import OrderingExpression import bigframes.core.sql @@ -460,7 +459,7 @@ def project_window_op( for column in inputs: clauses.append((column.isnull(), ibis_types.null())) if window_spec.min_periods and len(inputs) > 0: - if expression.op.skips_nulls: + if not expression.op.nulls_count_for_min_values: # Most operations do not count NULL values towards min_periods per_col_does_count = (column.notnull() for column in inputs) # All inputs must be non-null for observation to count @@ -679,13 +678,15 @@ def _join_condition( def _as_groupable(value: ibis_types.Value): + from bigframes.core.compile.ibis_compiler import scalar_op_registry + # Some types need to be converted to another type to enable groupby if value.type().is_float64(): return value.cast(ibis_dtypes.str) elif value.type().is_geospatial(): return typing.cast(ibis_types.GeoSpatialColumn, value).as_binary() elif value.type().is_json(): - return scalar_op_compiler.to_json_string(value) + return scalar_op_registry.to_json_string(value) else: return value diff --git a/tests/system/small/pandas/io/__init__.py b/bigframes/core/compile/ibis_compiler/__init__.py similarity index 62% rename from tests/system/small/pandas/io/__init__.py rename to bigframes/core/compile/ibis_compiler/__init__.py index 0a2669d7a2..aef0ed9267 100644 --- a/tests/system/small/pandas/io/__init__.py +++ b/bigframes/core/compile/ibis_compiler/__init__.py @@ -11,3 +11,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + +"""Compiler for BigFrames expression to Ibis expression. + +Make sure to import all ibis_compiler implementations here so that they get +registered. 
+""" + +from __future__ import annotations + +import bigframes.core.compile.ibis_compiler.operations.generic_ops # noqa: F401 +import bigframes.core.compile.ibis_compiler.scalar_op_registry # noqa: F401 diff --git a/bigframes/core/compile/aggregate_compiler.py b/bigframes/core/compile/ibis_compiler/aggregate_compiler.py similarity index 99% rename from bigframes/core/compile/aggregate_compiler.py rename to bigframes/core/compile/ibis_compiler/aggregate_compiler.py index 0d31798f25..4e0bf477fc 100644 --- a/bigframes/core/compile/aggregate_compiler.py +++ b/bigframes/core/compile/ibis_compiler/aggregate_compiler.py @@ -27,8 +27,8 @@ import pandas as pd from bigframes.core.compile import constants as compiler_constants +import bigframes.core.compile.ibis_compiler.scalar_op_compiler as scalar_compilers import bigframes.core.compile.ibis_types as compile_ibis_types -import bigframes.core.compile.scalar_op_compiler as scalar_compilers import bigframes.core.expression as ex import bigframes.core.window_spec as window_spec import bigframes.operations.aggregations as agg_ops diff --git a/bigframes/core/compile/compiler.py b/bigframes/core/compile/ibis_compiler/ibis_compiler.py similarity index 98% rename from bigframes/core/compile/compiler.py rename to bigframes/core/compile/ibis_compiler/ibis_compiler.py index 0efbd47ae4..ff0441ea22 100644 --- a/bigframes/core/compile/compiler.py +++ b/bigframes/core/compile/ibis_compiler/ibis_compiler.py @@ -29,7 +29,6 @@ import bigframes.core.compile.concat as concat_impl import bigframes.core.compile.configs as configs import bigframes.core.compile.explode -import bigframes.core.compile.scalar_op_compiler as compile_scalar import bigframes.core.nodes as nodes import bigframes.core.ordering as bf_ordering import bigframes.core.rewrite as rewrites @@ -178,6 +177,8 @@ def compile_readlocal(node: nodes.ReadLocalNode, *args): @_compile_node.register def compile_readtable(node: nodes.ReadTableNode, *args): + from bigframes.core.compile.ibis_compiler import scalar_op_registry + ibis_table = _table_to_ibis( node.source, scan_cols=[col.source_id for col in node.scan_list.items] ) @@ -188,7 +189,7 @@ def compile_readtable(node: nodes.ReadTableNode, *args): scan_item.dtype == dtypes.JSON_DTYPE and ibis_table[scan_item.source_id].type() == ibis_dtypes.string ): - json_column = compile_scalar.parse_json( + json_column = scalar_op_registry.parse_json( ibis_table[scan_item.source_id] ).name(scan_item.source_id) ibis_table = ibis_table.mutate(json_column) diff --git a/tests/system/small/pandas/io/api/__init__.py b/bigframes/core/compile/ibis_compiler/operations/__init__.py similarity index 67% rename from tests/system/small/pandas/io/api/__init__.py rename to bigframes/core/compile/ibis_compiler/operations/__init__.py index 0a2669d7a2..9d9f3849ab 100644 --- a/tests/system/small/pandas/io/api/__init__.py +++ b/bigframes/core/compile/ibis_compiler/operations/__init__.py @@ -11,3 +11,11 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + +"""Operation implementations for the Ibis-based compiler. + +This directory structure should reflect the same layout as the +`bigframes/operations` directory where the operations are defined. + +Prefer a few ops per file to keep file sizes manageable for text editors and LLMs. 
+""" diff --git a/bigframes/core/compile/ibis_compiler/operations/generic_ops.py b/bigframes/core/compile/ibis_compiler/operations/generic_ops.py new file mode 100644 index 0000000000..78f6a0c4de --- /dev/null +++ b/bigframes/core/compile/ibis_compiler/operations/generic_ops.py @@ -0,0 +1,38 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +BigFrames -> Ibis compilation for the operations in bigframes.operations.generic_ops. + +Please keep implementations in sequential order by op name. +""" + +from __future__ import annotations + +from bigframes_vendored.ibis.expr import types as ibis_types + +from bigframes.core.compile.ibis_compiler import scalar_op_compiler +from bigframes.operations import generic_ops + +register_unary_op = scalar_op_compiler.scalar_op_compiler.register_unary_op + + +@register_unary_op(generic_ops.notnull_op) +def notnull_op_impl(x: ibis_types.Value): + return x.notnull() + + +@register_unary_op(generic_ops.isnull_op) +def isnull_op_impl(x: ibis_types.Value): + return x.isnull() diff --git a/bigframes/core/compile/ibis_compiler/scalar_op_compiler.py b/bigframes/core/compile/ibis_compiler/scalar_op_compiler.py new file mode 100644 index 0000000000..d5f3e15d34 --- /dev/null +++ b/bigframes/core/compile/ibis_compiler/scalar_op_compiler.py @@ -0,0 +1,207 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+"""To avoid circular imports, this module should _not_ depend on any ops."""
+
+from __future__ import annotations
+
+import functools
+import typing
+from typing import TYPE_CHECKING
+
+import bigframes_vendored.ibis.expr.types as ibis_types
+
+import bigframes.core.compile.ibis_types
+import bigframes.core.expression as ex
+
+if TYPE_CHECKING:
+    import bigframes.operations as ops
+
+
+class ScalarOpCompiler:
+    # Mapping of operation name to implementations
+    _registry: dict[
+        str,
+        typing.Callable[
+            [typing.Sequence[ibis_types.Value], ops.RowOp], ibis_types.Value
+        ],
+    ] = {}
+
+    @functools.singledispatchmethod
+    def compile_expression(
+        self,
+        expression: ex.Expression,
+        bindings: typing.Dict[str, ibis_types.Value],
+    ) -> ibis_types.Value:
+        raise NotImplementedError(f"Unrecognized expression: {expression}")
+
+    @compile_expression.register
+    def _(
+        self,
+        expression: ex.ScalarConstantExpression,
+        bindings: typing.Dict[str, ibis_types.Value],
+    ) -> ibis_types.Value:
+        return bigframes.core.compile.ibis_types.literal_to_ibis_scalar(
+            expression.value, expression.dtype
+        )
+
+    @compile_expression.register
+    def _(
+        self,
+        expression: ex.DerefOp,
+        bindings: typing.Dict[str, ibis_types.Value],
+    ) -> ibis_types.Value:
+        if expression.id.sql not in bindings:
+            raise ValueError(f"Could not resolve unbound variable {expression.id}")
+        else:
+            return bindings[expression.id.sql]
+
+    @compile_expression.register
+    def _(
+        self,
+        expression: ex.OpExpression,
+        bindings: typing.Dict[str, ibis_types.Value],
+    ) -> ibis_types.Value:
+        inputs = [
+            self.compile_expression(sub_expr, bindings)
+            for sub_expr in expression.inputs
+        ]
+        return self.compile_row_op(expression.op, inputs)
+
+    def compile_row_op(
+        self, op: ops.RowOp, inputs: typing.Sequence[ibis_types.Value]
+    ) -> ibis_types.Value:
+        impl = self._registry[op.name]
+        return impl(inputs, op)
+
+    def register_unary_op(
+        self,
+        op_ref: typing.Union[ops.UnaryOp, type[ops.UnaryOp]],
+        pass_op: bool = False,
+    ):
+        """
+        Decorator to register a unary op implementation.
+
+        Args:
+            op_ref (UnaryOp or UnaryOp type):
+                Class or instance of operator that is implemented by the decorated function.
+            pass_op (bool):
+                Set to True if the implementation takes the operator object as the
+                last argument. This is needed for parameterized ops where
+                parameters are part of the op object.
+        """
+        key = typing.cast(str, op_ref.name)
+
+        def decorator(impl: typing.Callable[..., ibis_types.Value]):
+            def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp):
+                if pass_op:
+                    return impl(args[0], op)
+                else:
+                    return impl(args[0])
+
+            self._register(key, normalized_impl)
+            return impl
+
+        return decorator
+
+    def register_binary_op(
+        self,
+        op_ref: typing.Union[ops.BinaryOp, type[ops.BinaryOp]],
+        pass_op: bool = False,
+    ):
+        """
+        Decorator to register a binary op implementation.
+
+        Args:
+            op_ref (BinaryOp or BinaryOp type):
+                Class or instance of operator that is implemented by the decorated function.
+            pass_op (bool):
+                Set to True if the implementation takes the operator object as the
+                last argument. This is needed for parameterized ops where
+                parameters are part of the op object.
+ """ + key = typing.cast(str, op_ref.name) + + def decorator(impl: typing.Callable[..., ibis_types.Value]): + def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp): + if pass_op: + return impl(args[0], args[1], op) + else: + return impl(args[0], args[1]) + + self._register(key, normalized_impl) + return impl + + return decorator + + def register_ternary_op( + self, op_ref: typing.Union[ops.TernaryOp, type[ops.TernaryOp]] + ): + """ + Decorator to register a ternary op implementation. + + Args: + op_ref (TernaryOp or TernaryOp type): + Class or instance of operator that is implemented by the decorated function. + """ + key = typing.cast(str, op_ref.name) + + def decorator(impl: typing.Callable[..., ibis_types.Value]): + def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp): + return impl(args[0], args[1], args[2]) + + self._register(key, normalized_impl) + return impl + + return decorator + + def register_nary_op( + self, op_ref: typing.Union[ops.NaryOp, type[ops.NaryOp]], pass_op: bool = False + ): + """ + Decorator to register a nary op implementation. + + Args: + op_ref (NaryOp or NaryOp type): + Class or instance of operator that is implemented by the decorated function. + pass_op (bool): + Set to true if implementation takes the operator object as the last argument. + This is needed for parameterized ops where parameters are part of op object. + """ + key = typing.cast(str, op_ref.name) + + def decorator(impl: typing.Callable[..., ibis_types.Value]): + def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp): + if pass_op: + return impl(*args, op=op) + else: + return impl(*args) + + self._register(key, normalized_impl) + return impl + + return decorator + + def _register( + self, + op_name: str, + impl: typing.Callable[ + [typing.Sequence[ibis_types.Value], ops.RowOp], ibis_types.Value + ], + ): + if op_name in self._registry: + raise ValueError(f"Operation name {op_name} already registered") + self._registry[op_name] = impl + + +# Singleton compiler +scalar_op_compiler = ScalarOpCompiler() diff --git a/bigframes/core/compile/scalar_op_compiler.py b/bigframes/core/compile/ibis_compiler/scalar_op_registry.py similarity index 92% rename from bigframes/core/compile/scalar_op_compiler.py rename to bigframes/core/compile/ibis_compiler/scalar_op_registry.py index 95517ead35..bc077c1ce3 100644 --- a/bigframes/core/compile/scalar_op_compiler.py +++ b/bigframes/core/compile/ibis_compiler/scalar_op_registry.py @@ -27,9 +27,10 @@ from bigframes.core.compile.constants import UNIT_TO_US_CONVERSION_FACTORS import bigframes.core.compile.default_ordering +from bigframes.core.compile.ibis_compiler.scalar_op_compiler import ( + scalar_op_compiler, # TODO(tswast): avoid import of variables +) import bigframes.core.compile.ibis_types -import bigframes.core.expression as ex -import bigframes.dtypes import bigframes.operations as ops _ZERO = typing.cast(ibis_types.NumericValue, ibis_types.literal(0)) @@ -51,195 +52,7 @@ _OBJ_REF_IBIS_DTYPE = ibis_dtypes.Struct.from_tuples(_OBJ_REF_STRUCT_SCHEMA) # type: ignore -class ScalarOpCompiler: - # Mapping of operation name to implemenations - _registry: dict[ - str, - typing.Callable[ - [typing.Sequence[ibis_types.Value], ops.RowOp], ibis_types.Value - ], - ] = {} - - @functools.singledispatchmethod - def compile_expression( - self, - expression: ex.Expression, - bindings: typing.Dict[str, ibis_types.Value], - ) -> ibis_types.Value: - raise NotImplementedError(f"Unrecognized expression: {expression}") - 
- @compile_expression.register - def _( - self, - expression: ex.ScalarConstantExpression, - bindings: typing.Dict[str, ibis_types.Value], - ) -> ibis_types.Value: - return bigframes.core.compile.ibis_types.literal_to_ibis_scalar( - expression.value, expression.dtype - ) - - @compile_expression.register - def _( - self, - expression: ex.DerefOp, - bindings: typing.Dict[str, ibis_types.Value], - ) -> ibis_types.Value: - if expression.id.sql not in bindings: - raise ValueError(f"Could not resolve unbound variable {expression.id}") - else: - return bindings[expression.id.sql] - - @compile_expression.register - def _( - self, - expression: ex.OpExpression, - bindings: typing.Dict[str, ibis_types.Value], - ) -> ibis_types.Value: - inputs = [ - self.compile_expression(sub_expr, bindings) - for sub_expr in expression.inputs - ] - return self.compile_row_op(expression.op, inputs) - - def compile_row_op( - self, op: ops.RowOp, inputs: typing.Sequence[ibis_types.Value] - ) -> ibis_types.Value: - impl = self._registry[op.name] - return impl(inputs, op) - - def register_unary_op( - self, - op_ref: typing.Union[ops.UnaryOp, type[ops.UnaryOp]], - pass_op: bool = False, - ): - """ - Decorator to register a unary op implementation. - - Args: - op_ref (UnaryOp or UnaryOp type): - Class or instance of operator that is implemented by the decorated function. - pass_op (bool): - Set to true if implementation takes the operator object as the last argument. - This is needed for parameterized ops where parameters are part of op object. - """ - key = typing.cast(str, op_ref.name) - - def decorator(impl: typing.Callable[..., ibis_types.Value]): - def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp): - if pass_op: - return impl(args[0], op) - else: - return impl(args[0]) - - self._register(key, normalized_impl) - return impl - - return decorator - - def register_binary_op( - self, - op_ref: typing.Union[ops.BinaryOp, type[ops.BinaryOp]], - pass_op: bool = False, - ): - """ - Decorator to register a binary op implementation. - - Args: - op_ref (BinaryOp or BinaryOp type): - Class or instance of operator that is implemented by the decorated function. - pass_op (bool): - Set to true if implementation takes the operator object as the last argument. - This is needed for parameterized ops where parameters are part of op object. - """ - key = typing.cast(str, op_ref.name) - - def decorator(impl: typing.Callable[..., ibis_types.Value]): - def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp): - if pass_op: - return impl(args[0], args[1], op) - else: - return impl(args[0], args[1]) - - self._register(key, normalized_impl) - return impl - - return decorator - - def register_ternary_op( - self, op_ref: typing.Union[ops.TernaryOp, type[ops.TernaryOp]] - ): - """ - Decorator to register a ternary op implementation. - - Args: - op_ref (TernaryOp or TernaryOp type): - Class or instance of operator that is implemented by the decorated function. - """ - key = typing.cast(str, op_ref.name) - - def decorator(impl: typing.Callable[..., ibis_types.Value]): - def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp): - return impl(args[0], args[1], args[2]) - - self._register(key, normalized_impl) - return impl - - return decorator - - def register_nary_op( - self, op_ref: typing.Union[ops.NaryOp, type[ops.NaryOp]], pass_op: bool = False - ): - """ - Decorator to register a nary op implementation. 
- - Args: - op_ref (NaryOp or NaryOp type): - Class or instance of operator that is implemented by the decorated function. - pass_op (bool): - Set to true if implementation takes the operator object as the last argument. - This is needed for parameterized ops where parameters are part of op object. - """ - key = typing.cast(str, op_ref.name) - - def decorator(impl: typing.Callable[..., ibis_types.Value]): - def normalized_impl(args: typing.Sequence[ibis_types.Value], op: ops.RowOp): - if pass_op: - return impl(*args, op=op) - else: - return impl(*args) - - self._register(key, normalized_impl) - return impl - - return decorator - - def _register( - self, - op_name: str, - impl: typing.Callable[ - [typing.Sequence[ibis_types.Value], ops.RowOp], ibis_types.Value - ], - ): - if op_name in self._registry: - raise ValueError(f"Operation name {op_name} already registered") - self._registry[op_name] = impl - - -# Singleton compiler -scalar_op_compiler = ScalarOpCompiler() - - ### Unary Ops -@scalar_op_compiler.register_unary_op(ops.isnull_op) -def isnull_op_impl(x: ibis_types.Value): - return x.isnull() - - -@scalar_op_compiler.register_unary_op(ops.notnull_op) -def notnull_op_impl(x: ibis_types.Value): - return x.notnull() - - @scalar_op_compiler.register_unary_op(ops.hash_op) def hash_op_impl(x: ibis_types.Value): return typing.cast(ibis_types.IntegerValue, x).hash() @@ -1038,6 +851,26 @@ def geo_st_boundary_op_impl(x: ibis_types.Value): return st_boundary(x) +@scalar_op_compiler.register_unary_op(ops.GeoStBufferOp, pass_op=True) +def geo_st_buffer_op_impl(x: ibis_types.Value, op: ops.GeoStBufferOp): + return st_buffer( + x, + op.buffer_radius, + op.num_seg_quarter_circle, + op.use_spheroid, + ) + + +@scalar_op_compiler.register_unary_op(ops.geo_st_centroid_op, pass_op=False) +def geo_st_centroid_op_impl(x: ibis_types.Value): + return typing.cast(ibis_types.GeoSpatialValue, x).centroid() + + +@scalar_op_compiler.register_unary_op(ops.geo_st_convexhull_op, pass_op=False) +def geo_st_convexhull_op_impl(x: ibis_types.Value): + return st_convexhull(x) + + @scalar_op_compiler.register_binary_op(ops.geo_st_difference_op, pass_op=False) def geo_st_difference_op_impl(x: ibis_types.Value, y: ibis_types.Value): return typing.cast(ibis_types.GeoSpatialValue, x).difference( @@ -2116,6 +1949,12 @@ def _ibis_num(number: float): return typing.cast(ibis_types.NumericValue, ibis_types.literal(number)) +@ibis_udf.scalar.builtin +def st_convexhull(x: ibis_dtypes.geography) -> ibis_dtypes.geography: # type: ignore + """ST_CONVEXHULL""" + ... + + @ibis_udf.scalar.builtin def st_geogfromtext(a: str) -> ibis_dtypes.geography: # type: ignore """Convert string to geography.""" @@ -2136,6 +1975,16 @@ def st_boundary(a: ibis_dtypes.geography) -> ibis_dtypes.geography: # type: ign """Find the boundary of a geography.""" +@ibis_udf.scalar.builtin +def st_buffer( + geography: ibis_dtypes.geography, # type: ignore + buffer_radius: ibis_dtypes.Float64, + num_seg_quarter_circle: ibis_dtypes.Float64, + use_spheroid: ibis_dtypes.Boolean, +) -> ibis_dtypes.geography: # type: ignore + ... 
+
+
 @ibis_udf.scalar.builtin
 def st_distance(a: ibis_dtypes.geography, b: ibis_dtypes.geography, use_spheroid: bool) -> ibis_dtypes.float:  # type: ignore
     """Convert string to geography."""
diff --git a/bigframes/core/compile/polars/__init__.py b/bigframes/core/compile/polars/__init__.py
index 8c37e046ab..7ae6fcc755 100644
--- a/bigframes/core/compile/polars/__init__.py
+++ b/bigframes/core/compile/polars/__init__.py
@@ -11,16 +11,30 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+
+"""Compiler for BigFrames expression to Polars LazyFrame expression.
+
+Make sure to import all polars implementations here so that they get registered.
+"""
 from __future__ import annotations
 
 import warnings
 
+# The ops imports appear first so that the implementations can be registered.
+# polars shouldn't be needed at import time, as register is a no-op if polars
+# isn't installed.
+import bigframes.core.compile.polars.operations.generic_ops  # noqa: F401
+
 try:
-    import polars  # noqa
+    import bigframes._importing
+
+    # Use import_polars() instead of importing directly so that we check the
+    # version numbers.
+    bigframes._importing.import_polars()
 
     from bigframes.core.compile.polars.compiler import PolarsCompiler
 
     __all__ = ["PolarsCompiler"]
-except Exception:
-    msg = "Polars compiler not available as polars is not installed."
+except Exception as exc:
+    msg = f"Polars compiler not available because an exception was raised while importing polars. Details: {str(exc)}"
     warnings.warn(msg)
diff --git a/bigframes/core/compile/polars/compiler.py b/bigframes/core/compile/polars/compiler.py
index e1531ee9e5..dd30aec16a 100644
--- a/bigframes/core/compile/polars/compiler.py
+++ b/bigframes/core/compile/polars/compiler.py
@@ -17,7 +17,7 @@
 import functools
 import itertools
 import operator
-from typing import cast, Literal, Optional, Sequence, Tuple, TYPE_CHECKING
+from typing import cast, Literal, Optional, Sequence, Tuple, Type, TYPE_CHECKING
 
 import pandas as pd
 
@@ -33,7 +33,9 @@
 import bigframes.operations.aggregations as agg_ops
 import bigframes.operations.bool_ops as bool_ops
 import bigframes.operations.comparison_ops as comp_ops
+import bigframes.operations.datetime_ops as dt_ops
 import bigframes.operations.generic_ops as gen_ops
+import bigframes.operations.json_ops as json_ops
 import bigframes.operations.numeric_ops as num_ops
 import bigframes.operations.string_ops as string_ops
 
@@ -42,10 +44,35 @@
     import polars as pl
 else:
     try:
-        import polars as pl
+        import bigframes._importing
+
+        # Use import_polars() instead of importing directly so that we check
+        # the version numbers.
+        pl = bigframes._importing.import_polars()
     except Exception:
         polars_installed = False
 
+
+def register_op(op: Type):
+    """Register a compilation from BigFrames to Polars.
+
+    This decorator can be used even if Polars is not installed.
+
+    Args:
+        op: The type of the operator the wrapped function compiles.
+    """
+
+    def decorator(func):
+        if polars_installed:
+            # Ignore the type because compile_op is a generic Callable, so
+            # register isn't available according to mypy.
+ return PolarsExpressionCompiler.compile_op.register(op)(func) # type: ignore + else: + return func + + return decorator + + if polars_installed: _DTYPE_MAPPING = { # Direct mappings @@ -238,14 +265,6 @@ def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr: else: return input.is_in(op.values) or input.is_null() - @compile_op.register(gen_ops.IsNullOp) - def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr: - return input.is_null() - - @compile_op.register(gen_ops.NotNullOp) - def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr: - return input.is_not_null() - @compile_op.register(gen_ops.FillNaOp) @compile_op.register(gen_ops.CoalesceOp) def _(self, op: ops.ScalarOp, l_input: pl.Expr, r_input: pl.Expr) -> pl.Expr: @@ -280,6 +299,30 @@ def _(self, op: ops.ScalarOp, l_input: pl.Expr, r_input: pl.Expr) -> pl.Expr: assert isinstance(op, string_ops.StrConcatOp) return pl.concat_str(l_input, r_input) + @compile_op.register(dt_ops.StrftimeOp) + def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr: + assert isinstance(op, dt_ops.StrftimeOp) + return input.dt.strftime(op.date_format) + + @compile_op.register(dt_ops.ParseDatetimeOp) + def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr: + assert isinstance(op, dt_ops.ParseDatetimeOp) + return input.str.to_datetime( + time_unit="us", time_zone=None, ambiguous="earliest" + ) + + @compile_op.register(dt_ops.ParseTimestampOp) + def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr: + assert isinstance(op, dt_ops.ParseTimestampOp) + return input.str.to_datetime( + time_unit="us", time_zone="UTC", ambiguous="earliest" + ) + + @compile_op.register(json_ops.JSONDecode) + def _(self, op: ops.ScalarOp, input: pl.Expr) -> pl.Expr: + assert isinstance(op, json_ops.JSONDecode) + return input.str.json_decode(_DTYPE_MAPPING[op.to_type]) + @dataclasses.dataclass(frozen=True) class PolarsAggregateCompiler: scalar_compiler = PolarsExpressionCompiler() diff --git a/bigframes/core/compile/polars/lowering.py b/bigframes/core/compile/polars/lowering.py index ee0933b450..013651ff17 100644 --- a/bigframes/core/compile/polars/lowering.py +++ b/bigframes/core/compile/polars/lowering.py @@ -17,7 +17,7 @@ from bigframes import dtypes from bigframes.core import bigframe_node, expression from bigframes.core.rewrite import op_lowering -from bigframes.operations import comparison_ops, numeric_ops +from bigframes.operations import comparison_ops, datetime_ops, json_ops, numeric_ops import bigframes.operations as ops # TODO: Would be more precise to actually have separate op set for polars ops (where they diverge from the original ops) @@ -278,6 +278,16 @@ def lower(self, expr: expression.OpExpression) -> expression.Expression: return wo_bools +class LowerAsTypeRule(op_lowering.OpLoweringRule): + @property + def op(self) -> type[ops.ScalarOp]: + return ops.AsTypeOp + + def lower(self, expr: expression.OpExpression) -> expression.Expression: + assert isinstance(expr.op, ops.AsTypeOp) + return _lower_cast(expr.op, expr.inputs[0]) + + def _coerce_comparables( expr1: expression.Expression, expr2: expression.Expression, @@ -299,12 +309,57 @@ def _coerce_comparables( return expr1, expr2 -# TODO: Need to handle bool->string cast to get capitalization correct def _lower_cast(cast_op: ops.AsTypeOp, arg: expression.Expression): + if arg.output_type == cast_op.to_type: + return arg + + if arg.output_type == dtypes.JSON_DTYPE: + return json_ops.JSONDecode(cast_op.to_type).as_expr(arg) + if ( + arg.output_type == dtypes.STRING_DTYPE + and cast_op.to_type == 
dtypes.DATETIME_DTYPE
+    ):
+        return datetime_ops.ParseDatetimeOp().as_expr(arg)
+    if (
+        arg.output_type == dtypes.STRING_DTYPE
+        and cast_op.to_type == dtypes.TIMESTAMP_DTYPE
+    ):
+        return datetime_ops.ParseTimestampOp().as_expr(arg)
+    # datetime/time/timestamp -> string casting
+    if (
+        arg.output_type == dtypes.DATETIME_DTYPE
+        and cast_op.to_type == dtypes.STRING_DTYPE
+    ):
+        return datetime_ops.StrftimeOp("%Y-%m-%d %H:%M:%S").as_expr(arg)
+    if arg.output_type == dtypes.TIME_DTYPE and cast_op.to_type == dtypes.STRING_DTYPE:
+        return datetime_ops.StrftimeOp("%H:%M:%S.%6f").as_expr(arg)
+    if (
+        arg.output_type == dtypes.TIMESTAMP_DTYPE
+        and cast_op.to_type == dtypes.STRING_DTYPE
+    ):
+        return datetime_ops.StrftimeOp("%Y-%m-%d %H:%M:%S%.6f%:::z").as_expr(arg)
+    if arg.output_type == dtypes.BOOL_DTYPE and cast_op.to_type == dtypes.STRING_DTYPE:
+        # bool -> string needs CASE WHEN to get the Python-style "True"/"False"
+        # capitalization
+        is_true_cond = ops.eq_op.as_expr(arg, expression.const(True))
+        is_false_cond = ops.eq_op.as_expr(arg, expression.const(False))
+        return ops.CaseWhenOp().as_expr(
+            is_true_cond,
+            expression.const("True"),
+            is_false_cond,
+            expression.const("False"),
+        )
     if arg.output_type == dtypes.BOOL_DTYPE and dtypes.is_numeric(cast_op.to_type):
         # bool -> decimal needs two-step cast
         new_arg = ops.AsTypeOp(to_type=dtypes.INT_DTYPE).as_expr(arg)
         return cast_op.as_expr(new_arg)
+    if arg.output_type == dtypes.TIME_DTYPE and dtypes.is_numeric(cast_op.to_type):
+        # polars cast gives nanoseconds, so convert to microseconds
+        return numeric_ops.floordiv_op.as_expr(
+            cast_op.as_expr(arg), expression.const(1000)
+        )
+    if dtypes.is_numeric(arg.output_type) and cast_op.to_type == dtypes.TIME_DTYPE:
+        return cast_op.as_expr(ops.mul_op.as_expr(expression.const(1000), arg))
     return cast_op.as_expr(arg)
@@ -329,6 +384,7 @@ def _lower_cast(cast_op: ops.AsTypeOp, arg: expression.Expression):
     LowerDivRule(),
     LowerFloorDivRule(),
     LowerModRule(),
+    LowerAsTypeRule(),
 )
diff --git a/tests/system/small/pandas/core/methods/__init__.py b/bigframes/core/compile/polars/operations/__init__.py
similarity index 66%
rename from tests/system/small/pandas/core/methods/__init__.py
rename to bigframes/core/compile/polars/operations/__init__.py
index 0a2669d7a2..26444dcb67 100644
--- a/tests/system/small/pandas/core/methods/__init__.py
+++ b/bigframes/core/compile/polars/operations/__init__.py
@@ -11,3 +11,11 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+
+"""Operation implementations for the Polars LazyFrame compiler.
+
+This directory structure should reflect the same layout as the
+`bigframes/operations` directory where the operations are defined.
+
+Prefer small groups of ops per file to keep file sizes manageable for text editors and LLMs.
+"""
diff --git a/bigframes/core/compile/polars/operations/generic_ops.py b/bigframes/core/compile/polars/operations/generic_ops.py
new file mode 100644
index 0000000000..de0e987aa2
--- /dev/null
+++ b/bigframes/core/compile/polars/operations/generic_ops.py
@@ -0,0 +1,47 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +BigFrames -> Polars compilation for the operations in bigframes.operations.generic_ops. + +Please keep implementations in sequential order by op name. +""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +import bigframes.core.compile.polars.compiler as polars_compiler +from bigframes.operations import generic_ops + +if TYPE_CHECKING: + import polars as pl + + +@polars_compiler.register_op(generic_ops.NotNullOp) +def notnull_op_impl( + compiler: polars_compiler.PolarsExpressionCompiler, + op: generic_ops.NotNullOp, # type: ignore + input: pl.Expr, +) -> pl.Expr: + return input.is_not_null() + + +@polars_compiler.register_op(generic_ops.IsNullOp) +def isnull_op_impl( + compiler: polars_compiler.PolarsExpressionCompiler, + op: generic_ops.IsNullOp, # type: ignore + input: pl.Expr, +) -> pl.Expr: + return input.is_null() diff --git a/bigframes/core/compile/sqlglot/expressions/binary_compiler.py b/bigframes/core/compile/sqlglot/expressions/binary_compiler.py index a6eb7182e9..c46019d909 100644 --- a/bigframes/core/compile/sqlglot/expressions/binary_compiler.py +++ b/bigframes/core/compile/sqlglot/expressions/binary_compiler.py @@ -14,6 +14,7 @@ from __future__ import annotations +import bigframes_vendored.constants as constants import sqlglot.expressions as sge from bigframes import dtypes @@ -35,8 +36,83 @@ def _(op, left: TypedExpr, right: TypedExpr) -> sge.Expression: # String addition return sge.Concat(expressions=[left.expr, right.expr]) - # Numerical addition - return sge.Add(this=left.expr, expression=right.expr) + if dtypes.is_numeric(left.dtype) and dtypes.is_numeric(right.dtype): + left_expr = left.expr + if left.dtype == dtypes.BOOL_DTYPE: + left_expr = sge.Cast(this=left_expr, to="INT64") + right_expr = right.expr + if right.dtype == dtypes.BOOL_DTYPE: + right_expr = sge.Cast(this=right_expr, to="INT64") + return sge.Add(this=left_expr, expression=right_expr) + + if ( + dtypes.is_time_or_date_like(left.dtype) + and right.dtype == dtypes.TIMEDELTA_DTYPE + ): + left_expr = left.expr + if left.dtype == dtypes.DATE_DTYPE: + left_expr = sge.Cast(this=left_expr, to="DATETIME") + return sge.TimestampAdd( + this=left_expr, expression=right.expr, unit=sge.Var(this="MICROSECOND") + ) + if ( + dtypes.is_time_or_date_like(right.dtype) + and left.dtype == dtypes.TIMEDELTA_DTYPE + ): + right_expr = right.expr + if right.dtype == dtypes.DATE_DTYPE: + right_expr = sge.Cast(this=right_expr, to="DATETIME") + return sge.TimestampAdd( + this=right_expr, expression=left.expr, unit=sge.Var(this="MICROSECOND") + ) + if left.dtype == dtypes.TIMEDELTA_DTYPE and right.dtype == dtypes.TIMEDELTA_DTYPE: + return sge.Add(this=left.expr, expression=right.expr) + + raise TypeError( + f"Cannot add type {left.dtype} and {right.dtype}. 
{constants.FEEDBACK_LINK}" + ) + + +@BINARY_OP_REGISTRATION.register(ops.sub_op) +def _(op, left: TypedExpr, right: TypedExpr) -> sge.Expression: + if dtypes.is_numeric(left.dtype) and dtypes.is_numeric(right.dtype): + left_expr = left.expr + if left.dtype == dtypes.BOOL_DTYPE: + left_expr = sge.Cast(this=left_expr, to="INT64") + right_expr = right.expr + if right.dtype == dtypes.BOOL_DTYPE: + right_expr = sge.Cast(this=right_expr, to="INT64") + return sge.Sub(this=left_expr, expression=right_expr) + + if ( + dtypes.is_time_or_date_like(left.dtype) + and right.dtype == dtypes.TIMEDELTA_DTYPE + ): + left_expr = left.expr + if left.dtype == dtypes.DATE_DTYPE: + left_expr = sge.Cast(this=left_expr, to="DATETIME") + return sge.TimestampSub( + this=left_expr, expression=right.expr, unit=sge.Var(this="MICROSECOND") + ) + if dtypes.is_time_or_date_like(left.dtype) and dtypes.is_time_or_date_like( + right.dtype + ): + left_expr = left.expr + if left.dtype == dtypes.DATE_DTYPE: + left_expr = sge.Cast(this=left_expr, to="DATETIME") + right_expr = right.expr + if right.dtype == dtypes.DATE_DTYPE: + right_expr = sge.Cast(this=right_expr, to="DATETIME") + return sge.TimestampDiff( + this=left_expr, expression=right_expr, unit=sge.Var(this="MICROSECOND") + ) + + if left.dtype == dtypes.TIMEDELTA_DTYPE and right.dtype == dtypes.TIMEDELTA_DTYPE: + return sge.Sub(this=left.expr, expression=right.expr) + + raise TypeError( + f"Cannot subtract type {left.dtype} and {right.dtype}. {constants.FEEDBACK_LINK}" + ) @BINARY_OP_REGISTRATION.register(ops.ge_op) diff --git a/bigframes/core/groupby/dataframe_group_by.py b/bigframes/core/groupby/dataframe_group_by.py index a2c4cf2867..e4e4b313f9 100644 --- a/bigframes/core/groupby/dataframe_group_by.py +++ b/bigframes/core/groupby/dataframe_group_by.py @@ -16,7 +16,7 @@ import datetime import typing -from typing import Literal, Sequence, Tuple, Union +from typing import Literal, Optional, Sequence, Tuple, Union import bigframes_vendored.constants as constants import bigframes_vendored.pandas.core.groupby as vendored_pandas_groupby @@ -263,6 +263,48 @@ def kurt( kurtosis = kurt + @validations.requires_ordering() + def first(self, numeric_only: bool = False, min_count: int = -1) -> df.DataFrame: + window_spec = window_specs.unbound( + grouping_keys=tuple(self._by_col_ids), + min_periods=min_count if min_count >= 0 else 0, + ) + target_cols, index = self._aggregated_columns(numeric_only) + block, firsts_ids = self._block.multi_apply_window_op( + target_cols, + agg_ops.FirstNonNullOp(), + window_spec=window_spec, + ) + block, _ = block.aggregate( + self._by_col_ids, + tuple( + aggs.agg(firsts_id, agg_ops.AnyValueOp()) for firsts_id in firsts_ids + ), + dropna=self._dropna, + column_labels=index, + ) + return df.DataFrame(block) + + @validations.requires_ordering() + def last(self, numeric_only: bool = False, min_count: int = -1) -> df.DataFrame: + window_spec = window_specs.unbound( + grouping_keys=tuple(self._by_col_ids), + min_periods=min_count if min_count >= 0 else 0, + ) + target_cols, index = self._aggregated_columns(numeric_only) + block, lasts_ids = self._block.multi_apply_window_op( + target_cols, + agg_ops.LastNonNullOp(), + window_spec=window_spec, + ) + block, _ = block.aggregate( + self._by_col_ids, + tuple(aggs.agg(lasts_id, agg_ops.AnyValueOp()) for lasts_id in lasts_ids), + dropna=self._dropna, + column_labels=index, + ) + return df.DataFrame(block) + def all(self) -> df.DataFrame: return self._aggregate_all(agg_ops.all_op) @@ -330,6 +372,39 @@ def 
diff(self, periods=1) -> series.Series: ) return self._apply_window_op(agg_ops.DiffOp(periods), window=window) + def value_counts( + self, + subset: Optional[Sequence[blocks.Label]] = None, + normalize: bool = False, + sort: bool = True, + ascending: bool = False, + dropna: bool = True, + ) -> Union[df.DataFrame, series.Series]: + if subset is None: + columns = self._selected_cols + else: + columns = [ + column + for column in self._block.value_columns + if self._block.col_id_to_label[column] in subset + ] + block = self._block + if self._dropna: # this drops null grouping columns + block = block_ops.dropna(block, self._by_col_ids) + block = block_ops.value_counts( + block, + columns, + normalize=normalize, + sort=sort, + ascending=ascending, + drop_na=dropna, # this drops null value columns + grouping_keys=self._by_col_ids, + ) + if self._as_index: + return series.Series(block) + else: + return series.Series(block).to_frame().reset_index(drop=False) + @validations.requires_ordering() def rolling( self, diff --git a/bigframes/core/groupby/series_group_by.py b/bigframes/core/groupby/series_group_by.py index a29bb45a32..7a8bdcb6cf 100644 --- a/bigframes/core/groupby/series_group_by.py +++ b/bigframes/core/groupby/series_group_by.py @@ -36,6 +36,7 @@ import bigframes.core.window as windows import bigframes.core.window_spec as window_specs import bigframes.dataframe as df +import bigframes.dtypes import bigframes.operations.aggregations as agg_ops import bigframes.series as series @@ -162,6 +163,54 @@ def kurt(self, *args, **kwargs) -> series.Series: kurtosis = kurt + @validations.requires_ordering() + def first(self, numeric_only: bool = False, min_count: int = -1) -> series.Series: + if numeric_only and not bigframes.dtypes.is_numeric( + self._block.expr.get_column_type(self._value_column) + ): + raise TypeError( + f"Cannot use 'numeric_only' with non-numeric column {self._value_name}." + ) + window_spec = window_specs.unbound( + grouping_keys=tuple(self._by_col_ids), + min_periods=min_count if min_count >= 0 else 0, + ) + block, firsts_id = self._block.apply_window_op( + self._value_column, + agg_ops.FirstNonNullOp(), + window_spec=window_spec, + ) + block, _ = block.aggregate( + self._by_col_ids, + (aggs.agg(firsts_id, agg_ops.AnyValueOp()),), + dropna=self._dropna, + ) + return series.Series(block.with_column_labels([self._value_name])) + + @validations.requires_ordering() + def last(self, numeric_only: bool = False, min_count: int = -1) -> series.Series: + if numeric_only and not bigframes.dtypes.is_numeric( + self._block.expr.get_column_type(self._value_column) + ): + raise TypeError( + f"Cannot use 'numeric_only' with non-numeric column {self._value_name}." 
+        )
+        window_spec = window_specs.unbound(
+            grouping_keys=tuple(self._by_col_ids),
+            min_periods=min_count if min_count >= 0 else 0,
+        )
+        block, lasts_id = self._block.apply_window_op(
+            self._value_column,
+            agg_ops.LastNonNullOp(),
+            window_spec=window_spec,
+        )
+        block, _ = block.aggregate(
+            self._by_col_ids,
+            (aggs.agg(lasts_id, agg_ops.AnyValueOp()),),
+            dropna=self._dropna,
+        )
+        return series.Series(block.with_column_labels([self._value_name]))
+
     def prod(self, *args) -> series.Series:
         return self._aggregate(agg_ops.product_op)

@@ -195,6 +244,30 @@ def agg(self, func=None) -> typing.Union[df.DataFrame, series.Series]:

     aggregate = agg

+    def value_counts(
+        self,
+        normalize: bool = False,
+        sort: bool = True,
+        ascending: bool = False,
+        dropna: bool = True,
+    ) -> Union[df.DataFrame, series.Series]:
+        columns = [self._value_column]
+        block = self._block
+        if self._dropna:  # this drops null grouping columns
+            block = block_ops.dropna(block, self._by_col_ids)
+        block = block_ops.value_counts(
+            block,
+            columns,
+            normalize=normalize,
+            sort=sort,
+            ascending=ascending,
+            drop_na=dropna,  # this drops null value columns
+            grouping_keys=self._by_col_ids,
+        )
+        # TODO: once as_index=False is supported, return DataFrame instead by
+        # resetting the index with .to_frame().reset_index(drop=False)
+        return series.Series(block)
+
     @validations.requires_ordering()
     def cumsum(self, *args, **kwargs) -> series.Series:
         return self._apply_window_op(
@@ -314,7 +387,7 @@ def _apply_window_op(
         discard_name=False,
         window: typing.Optional[window_specs.WindowSpec] = None,
         never_skip_nulls: bool = False,
-    ):
+    ) -> series.Series:
         """Apply window op to groupby. Defaults to grouped cumulative window."""
         window_spec = window or window_specs.cumulative_rows(
             grouping_keys=tuple(self._by_col_ids)
diff --git a/bigframes/core/indexes/base.py b/bigframes/core/indexes/base.py
index 9ad201c73d..e022b3f151 100644
--- a/bigframes/core/indexes/base.py
+++ b/bigframes/core/indexes/base.py
@@ -489,7 +489,7 @@ def value_counts(
             self._block.index_columns,
             normalize=normalize,
             ascending=ascending,
-            dropna=dropna,
+            drop_na=dropna,
         )
         import bigframes.series as series
diff --git a/bigframes/dataframe.py b/bigframes/dataframe.py
index 7de4bdbc91..4559d7cbb9 100644
--- a/bigframes/dataframe.py
+++ b/bigframes/dataframe.py
@@ -2475,7 +2475,7 @@ def value_counts(
             normalize=normalize,
             sort=sort,
             ascending=ascending,
-            dropna=dropna,
+            drop_na=dropna,
         )
         return bigframes.series.Series(block)

@@ -2763,6 +2763,12 @@ def where(self, cond, other=None):
                 "The dataframe.where() method does not support multi-column."
             )

+        # Evaluate cond and/or other against this DataFrame when they are callable.
+        if callable(cond):
+            cond = cond(self)
+        if callable(other):
+            other = other(self)
+
         aligned_block, (_, _) = self._block.join(cond._block, how="left")
         # No left join is needed when 'other' is None or constant.
         if isinstance(other, bigframes.dataframe.DataFrame):
@@ -3514,16 +3520,22 @@ def join(
         *,
         on: Optional[str] = None,
         how: str = "left",
+        lsuffix: str = "",
+        rsuffix: str = "",
     ) -> DataFrame:
         if isinstance(other, bigframes.series.Series):
             other = other.to_frame()

         left, right = self, other
-        if not left.columns.intersection(right.columns).empty:
-            raise NotImplementedError(
-                f"Deduping column names is not implemented. {constants.FEEDBACK_LINK}"
-            )
+        col_intersection = left.columns.intersection(right.columns)
+
+        if not col_intersection.empty:
+            if lsuffix == rsuffix == "":
+                raise ValueError(
+                    f"columns overlap but no suffix specified: {col_intersection}"
+                )
+
         if how == "cross":
             if on is not None:
                 raise ValueError("'on' is not supported for cross join.")
@@ -3531,7 +3543,7 @@ def join(
                 right._block,
                 left_join_ids=[],
                 right_join_ids=[],
-                suffixes=("", ""),
+                suffixes=(lsuffix, rsuffix),
                 how="cross",
                 sort=True,
             )
@@ -3539,45 +3551,107 @@ def join(

         # Join left columns with right index
         if on is not None:
+            if left._has_index and (on in left.index.names):
+                if on in left.columns:
+                    raise ValueError(
+                        f"'{on}' is both an index level and a column label, which is ambiguous."
+                    )
+                else:
+                    raise NotImplementedError(
+                        f"Joining on index level '{on}' is not yet supported. {constants.FEEDBACK_LINK}"
+                    )
+            if (left.columns == on).sum() > 1:
+                raise ValueError(f"The column label '{on}' is not unique.")
+
             if other._block.index.nlevels != 1:
                 raise ValueError(
                     "Join on columns must match the index level of the other DataFrame. Joining on a column when the other DataFrame has a multi-index is not yet supported."
                 )
-            # Switch left index with on column
-            left_columns = left.columns
-            left_idx_original_names = left.index.names if left._has_index else ()
-            left_idx_names_in_cols = [
-                f"bigframes_left_idx_name_{i}"
-                for i in range(len(left_idx_original_names))
-            ]
-            if left._has_index:
-                left.index.names = left_idx_names_in_cols
-            left = left.reset_index(drop=False)
-            left = left.set_index(on)
-
-            # Join on index and switch back
-            combined_df = left._perform_join_by_index(right, how=how)
-            combined_df.index.name = on
-            combined_df = combined_df.reset_index(drop=False)
-            combined_df = combined_df.set_index(left_idx_names_in_cols)
-
-            # To be consistent with Pandas
-            if combined_df._has_index:
-                combined_df.index.names = (
-                    left_idx_original_names
-                    if how in ("inner", "left")
-                    else ([None] * len(combined_df.index.names))
-                )
-
-            # Reorder columns
-            combined_df = combined_df[list(left_columns) + list(right.columns)]
-            return combined_df
+            return self._join_on_key(
+                other,
+                on=on,
+                how=how,
+                lsuffix=lsuffix,
+                rsuffix=rsuffix,
+                should_duplicate_on_key=(on in col_intersection),
+            )

         # Join left index with right index
         if left._block.index.nlevels != right._block.index.nlevels:
             raise ValueError("Index to join on must have the same number of levels.")

-        return left._perform_join_by_index(right, how=how)
+        return left._perform_join_by_index(right, how=how)._add_join_suffix(
+            left.columns, right.columns, lsuffix=lsuffix, rsuffix=rsuffix
+        )
+
+    def _join_on_key(
+        self,
+        other: DataFrame,
+        on: str,
+        how: str,
+        lsuffix: str,
+        rsuffix: str,
+        should_duplicate_on_key: bool,
+    ) -> DataFrame:
+        left, right = self.copy(), other
+        # Replace all column names with unique names for reordering.
+ left_col_original_names = left.columns + on_col_name = "bigframes_left_col_on" + dup_on_col_name = "bigframes_left_col_on_dup" + left_col_temp_names = [ + f"bigframes_left_col_name_{i}" if col_name != on else on_col_name + for i, col_name in enumerate(left_col_original_names) + ] + left.columns = pandas.Index(left_col_temp_names) + # if on column is also in right df, we need to duplicate the column + # and set it to be the first column + if should_duplicate_on_key: + left[dup_on_col_name] = left[on_col_name] + on_col_name = dup_on_col_name + left_col_temp_names = [on_col_name] + left_col_temp_names + left = left[left_col_temp_names] + + # Switch left index with on column + left_idx_original_names = left.index.names if left._has_index else () + left_idx_names_in_cols = [ + f"bigframes_left_idx_name_{i}" for i in range(len(left_idx_original_names)) + ] + if left._has_index: + left.index.names = left_idx_names_in_cols + left = left.reset_index(drop=False) + left = left.set_index(on_col_name) + + right_col_original_names = right.columns + right_col_temp_names = [ + f"bigframes_right_col_name_{i}" + for i in range(len(right_col_original_names)) + ] + right.columns = pandas.Index(right_col_temp_names) + + # Join on index and switch back + combined_df = left._perform_join_by_index(right, how=how) + combined_df.index.name = on_col_name + combined_df = combined_df.reset_index(drop=False) + combined_df = combined_df.set_index(left_idx_names_in_cols) + + # To be consistent with Pandas + if combined_df._has_index: + combined_df.index.names = ( + left_idx_original_names + if how in ("inner", "left") + else ([None] * len(combined_df.index.names)) + ) + + # Reorder columns + combined_df = combined_df[left_col_temp_names + right_col_temp_names] + return combined_df._add_join_suffix( + left_col_original_names, + right_col_original_names, + lsuffix=lsuffix, + rsuffix=rsuffix, + extra_col=on if on_col_name == dup_on_col_name else None, + ) def _perform_join_by_index( self, @@ -3591,6 +3665,59 @@ def _perform_join_by_index( ) return DataFrame(block) + def _add_join_suffix( + self, + left_columns, + right_columns, + lsuffix: str = "", + rsuffix: str = "", + extra_col: typing.Optional[str] = None, + ): + """Applies suffixes to overlapping column names to mimic a pandas join. + + This method identifies columns that are common to both a "left" and "right" + set of columns and renames them using the provided suffixes. Columns that + are not in the intersection are kept with their original names. + + Args: + left_columns (pandas.Index): + The column labels from the left DataFrame. + right_columns (pandas.Index): + The column labels from the right DataFrame. + lsuffix (str): + The suffix to apply to overlapping column names from the left side. + rsuffix (str): + The suffix to apply to overlapping column names from the right side. + extra_col (typing.Optional[str]): + An optional column name to prepend to the final list of columns. + This argument is used specifically to match the behavior of a + pandas join. When a join key (i.e., the 'on' column) exists + in both the left and right DataFrames, pandas creates two versions + of that column: one copy keeps its original name and is placed as + the first column, while the other instances receive the normal + suffix. Passing the join key's name here replicates that behavior. + + Returns: + DataFrame: + A new DataFrame with the columns renamed to resolve overlaps. 
+ """ + combined_df = self.copy() + col_intersection = left_columns.intersection(right_columns) + final_col_names = [] if extra_col is None else [extra_col] + for col_name in left_columns: + if col_name in col_intersection: + final_col_names.append(f"{col_name}{lsuffix}") + else: + final_col_names.append(col_name) + + for col_name in right_columns: + if col_name in col_intersection: + final_col_names.append(f"{col_name}{rsuffix}") + else: + final_col_names.append(col_name) + combined_df.columns = pandas.Index(final_col_names) + return combined_df + @validations.requires_ordering() def rolling( self, diff --git a/bigframes/dtypes.py b/bigframes/dtypes.py index a58619dc21..ef1b9e7871 100644 --- a/bigframes/dtypes.py +++ b/bigframes/dtypes.py @@ -289,6 +289,10 @@ def is_time_like(type_: ExpressionType) -> bool: return type_ in (DATETIME_DTYPE, TIMESTAMP_DTYPE, TIME_DTYPE) +def is_time_or_date_like(type_: ExpressionType) -> bool: + return type_ in (DATE_DTYPE, DATETIME_DTYPE, TIME_DTYPE, TIMESTAMP_DTYPE) + + def is_geo_like(type_: ExpressionType) -> bool: return type_ in (GEO_DTYPE,) diff --git a/bigframes/exceptions.py b/bigframes/exceptions.py index 39a847de84..53c8deb082 100644 --- a/bigframes/exceptions.py +++ b/bigframes/exceptions.py @@ -79,6 +79,10 @@ class TimeTravelDisabledWarning(Warning): """A query was reattempted without time travel.""" +class TimeTravelCacheWarning(Warning): + """Reads from the same table twice in the same session pull time travel from cache.""" + + class AmbiguousWindowWarning(Warning): """A query may produce nondeterministic results as the window may be ambiguously ordered.""" @@ -103,6 +107,10 @@ class FunctionAxisOnePreviewWarning(PreviewWarning): """Remote Function and Managed UDF with axis=1 preview.""" +class FunctionConflictTypeHintWarning(UserWarning): + """Conflicting type hints in a BigFrames function.""" + + class FunctionPackageVersionWarning(PreviewWarning): """ Managed UDF package versions for Numpy, Pandas, and Pyarrow may not diff --git a/bigframes/functions/_function_client.py b/bigframes/functions/_function_client.py index ae19dc1480..a8c9f9c301 100644 --- a/bigframes/functions/_function_client.py +++ b/bigframes/functions/_function_client.py @@ -245,7 +245,7 @@ def provision_bq_managed_function( # Augment user package requirements with any internal package # requirements. - packages = _utils._get_updated_package_requirements( + packages = _utils.get_updated_package_requirements( packages, is_row_processor, capture_references, ignore_package_version=True ) if packages: @@ -258,7 +258,7 @@ def provision_bq_managed_function( bq_function_name = name if not bq_function_name: # Compute a unique hash representing the user code. 
-            function_hash = _utils._get_hash(func, packages)
+            function_hash = _utils.get_hash(func, packages)
             bq_function_name = _utils.get_bigframes_function_name(
                 function_hash,
                 session_id,
@@ -366,8 +366,8 @@ def generate_cloud_function_code(
     def create_cloud_function(
         self,
         def_,
-        cf_name,
         *,
+        random_name,
         input_types: Tuple[str],
         output_type: str,
         package_requirements=None,
@@ -428,9 +428,9 @@ def create_cloud_function(
             create_function_request.parent = (
                 self.get_cloud_function_fully_qualified_parent()
             )
-            create_function_request.function_id = cf_name
+            create_function_request.function_id = random_name
             function = functions_v2.Function()
-            function.name = self.get_cloud_function_fully_qualified_name(cf_name)
+            function.name = self.get_cloud_function_fully_qualified_name(random_name)
             function.build_config = functions_v2.BuildConfig()
             function.build_config.runtime = python_version
             function.build_config.entry_point = entry_point
@@ -497,24 +497,25 @@ def create_cloud_function(
             # Cleanup
             os.remove(archive_path)
         except google.api_core.exceptions.AlreadyExists:
-            # If a cloud function with the same name already exists, let's
-            # update it
-            update_function_request = functions_v2.UpdateFunctionRequest()
-            update_function_request.function = function
-            operation = self._cloud_functions_client.update_function(
-                request=update_function_request
-            )
-            operation.result()
+            # b/437124912: The most likely scenario is that
+            # `create_function` had a retry due to a network issue. The
+            # retried request then fails because the first call actually
+            # succeeded, but we didn't get the successful response back.
+            #
+            # Since the function name was randomly chosen to avoid
+            # conflicts, we know the AlreadyExists error can only happen
+            # because we created the function ourselves. This error is
+            # safe to ignore.
+            pass

         # Fetch the endpoint of the just-created function
-        endpoint = self.get_cloud_function_endpoint(cf_name)
+        endpoint = self.get_cloud_function_endpoint(random_name)
         if not endpoint:
             raise bf_formatting.create_exception_with_feedback_link(
                 ValueError, "Couldn't fetch the http endpoint."
) logger.info( - f"Successfully created cloud function {cf_name} with uri ({endpoint})" + f"Successfully created cloud function {random_name} with uri ({endpoint})" ) return endpoint @@ -538,12 +539,12 @@ def provision_bq_remote_function( """Provision a BigQuery remote function.""" # Augment user package requirements with any internal package # requirements - package_requirements = _utils._get_updated_package_requirements( + package_requirements = _utils.get_updated_package_requirements( package_requirements, is_row_processor ) # Compute a unique hash representing the user code - function_hash = _utils._get_hash(def_, package_requirements) + function_hash = _utils.get_hash(def_, package_requirements) # If reuse of any existing function with the same name (indicated by the # same hash of its source code) is not intended, then attach a unique @@ -571,7 +572,7 @@ def provision_bq_remote_function( if not cf_endpoint: cf_endpoint = self.create_cloud_function( def_, - cloud_function_name, + random_name=cloud_function_name, input_types=input_types, output_type=output_type, package_requirements=package_requirements, diff --git a/bigframes/functions/_function_session.py b/bigframes/functions/_function_session.py index 371784332c..29e175d02f 100644 --- a/bigframes/functions/_function_session.py +++ b/bigframes/functions/_function_session.py @@ -536,6 +536,11 @@ def wrapper(func): if input_types is not None: if not isinstance(input_types, collections.abc.Sequence): input_types = [input_types] + if _utils.has_conflict_input_type(py_sig, input_types): + msg = bfe.format_message( + "Conflicting input types detected, using the one from the decorator." + ) + warnings.warn(msg, category=bfe.FunctionConflictTypeHintWarning) py_sig = py_sig.replace( parameters=[ par.replace(annotation=itype) @@ -543,6 +548,11 @@ def wrapper(func): ] ) if output_type: + if _utils.has_conflict_output_type(py_sig, output_type): + msg = bfe.format_message( + "Conflicting return type detected, using the one from the decorator." + ) + warnings.warn(msg, category=bfe.FunctionConflictTypeHintWarning) py_sig = py_sig.replace(return_annotation=output_type) # Try to get input types via type annotations. @@ -587,7 +597,7 @@ def wrapper(func): bqrf_metadata = _utils.get_bigframes_metadata( python_output_type=py_sig.return_annotation ) - post_process_routine = _utils._build_unnest_post_routine( + post_process_routine = _utils.build_unnest_post_routine( py_sig.return_annotation ) py_sig = py_sig.replace(return_annotation=str) @@ -838,6 +848,11 @@ def wrapper(func): if input_types is not None: if not isinstance(input_types, collections.abc.Sequence): input_types = [input_types] + if _utils.has_conflict_input_type(py_sig, input_types): + msg = bfe.format_message( + "Conflicting input types detected, using the one from the decorator." + ) + warnings.warn(msg, category=bfe.FunctionConflictTypeHintWarning) py_sig = py_sig.replace( parameters=[ par.replace(annotation=itype) @@ -845,6 +860,11 @@ def wrapper(func): ] ) if output_type: + if _utils.has_conflict_output_type(py_sig, output_type): + msg = bfe.format_message( + "Conflicting return type detected, using the one from the decorator." 
+                )
+                warnings.warn(msg, category=bfe.FunctionConflictTypeHintWarning)
             py_sig = py_sig.replace(return_annotation=output_type)

         # The function will actually be receiving a pandas Series, but allow
diff --git a/bigframes/functions/_utils.py b/bigframes/functions/_utils.py
index 0b7222db86..37ced54d49 100644
--- a/bigframes/functions/_utils.py
+++ b/bigframes/functions/_utils.py
@@ -14,10 +14,11 @@

 import hashlib
+import inspect
 import json
 import sys
 import typing
-from typing import cast, Optional, Set
+from typing import Any, cast, Optional, Sequence, Set
 import warnings

 import cloudpickle
@@ -62,7 +63,7 @@ def get_remote_function_locations(bq_location):
     return bq_location, cloud_function_region


-def _get_updated_package_requirements(
+def get_updated_package_requirements(
     package_requirements=None,
     is_row_processor=False,
     capture_references=True,
@@ -104,7 +105,7 @@
     return requirements


-def _clean_up_by_session_id(
+def clean_up_by_session_id(
     bqclient: bigquery.Client,
     gcfclient: functions_v2.FunctionServiceClient,
     dataset: bigquery.DatasetReference,
@@ -168,7 +169,7 @@
     pass


-def _get_hash(def_, package_requirements=None):
+def get_hash(def_, package_requirements=None):
     "Get hash (32 digits alphanumeric) of a function."
     # There is a known cell-id sensitivity of the cloudpickle serialization in
     # notebooks https://github.com/cloudpipe/cloudpickle/issues/538. Because of
@@ -278,7 +279,7 @@ def get_python_version(is_compat: bool = False) -> str:
     return f"python{major}{minor}" if is_compat else f"python-{major}.{minor}"


-def _build_unnest_post_routine(py_list_type: type[list]):
+def build_unnest_post_routine(py_list_type: type[list]):
     sdk_type = function_typing.sdk_array_output_type_from_python_type(py_list_type)
     assert sdk_type.array_element_type is not None
     inner_sdk_type = sdk_type.array_element_type
@@ -290,3 +291,36 @@ def post_process(input):
         return bbq.json_extract_string_array(input, value_dtype=result_dtype)

     return post_process
+
+
+def has_conflict_input_type(
+    signature: inspect.Signature,
+    input_types: Sequence[Any],
+) -> bool:
+    """Checks if the parameters have any conflict with the input_types."""
+    params = list(signature.parameters.values())
+
+    if len(params) != len(input_types):
+        return True
+
+    # Check for conflicting type hints.
+    for i, param in enumerate(params):
+        if param.annotation is not inspect.Parameter.empty:
+            if param.annotation != input_types[i]:
+                return True
+
+    # No conflicts were found after checking all parameters.
+    return False
+
+
+def has_conflict_output_type(
+    signature: inspect.Signature,
+    output_type: Any,
+) -> bool:
+    """Checks if the return type annotation conflicts with the output_type."""
+    return_annotation = signature.return_annotation
+
+    if return_annotation is inspect.Parameter.empty:
+        return False
+
+    return return_annotation != output_type
diff --git a/bigframes/functions/function.py b/bigframes/functions/function.py
index b695bcd250..a62da57075 100644
--- a/bigframes/functions/function.py
+++ b/bigframes/functions/function.py
@@ -90,7 +90,7 @@ def _try_import_routine(
         return BigqueryCallableRoutine(
             udf_def,
             session,
-            post_routine=_utils._build_unnest_post_routine(override_type),
+            post_routine=_utils.build_unnest_post_routine(override_type),
         )
     return BigqueryCallableRoutine(udf_def, session, is_managed=not is_remote)

@@ -107,7 +107,7 @@ def _try_import_row_routine(
         return BigqueryCallableRowRoutine(
             udf_def,
             session,
-            post_routine=_utils._build_unnest_post_routine(override_type),
+            post_routine=_utils.build_unnest_post_routine(override_type),
         )
     return BigqueryCallableRowRoutine(udf_def, session, is_managed=not is_remote)
diff --git a/bigframes/geopandas/geoseries.py b/bigframes/geopandas/geoseries.py
index 2999625cda..f3558e4b34 100644
--- a/bigframes/geopandas/geoseries.py
+++ b/bigframes/geopandas/geoseries.py
@@ -13,13 +13,15 @@
 # limitations under the License.
 from __future__ import annotations

+from typing import Optional
+
 import bigframes_vendored.constants as constants
 import bigframes_vendored.geopandas.geoseries as vendored_geoseries
 import geopandas.array  # type: ignore

-import bigframes.geopandas
 import bigframes.operations as ops
 import bigframes.series
+import bigframes.session


 class GeoSeries(vendored_geoseries.GeoSeries, bigframes.series.Series):
@@ -73,8 +75,14 @@ def is_closed(self) -> bigframes.series.Series:
         )

     @classmethod
-    def from_wkt(cls, data, index=None) -> GeoSeries:
-        series = bigframes.series.Series(data, index=index)
+    def from_wkt(
+        cls,
+        data,
+        index=None,
+        *,
+        session: Optional[bigframes.session.Session] = None,
+    ) -> GeoSeries:
+        series = bigframes.series.Series(data, index=index, session=session)
         return cls(series._apply_unary_op(ops.geo_st_geogfromtext_op))

@@ -92,6 +100,19 @@ def to_wkt(self: GeoSeries) -> bigframes.series.Series:
         series.name = None
         return series

+    def buffer(self: GeoSeries, distance: float) -> bigframes.series.Series:  # type: ignore
+        raise NotImplementedError(
+            f"GeoSeries.buffer is not supported. Use bigframes.bigquery.st_buffer(series, distance) instead. 
{constants.FEEDBACK_LINK}" + ) + + @property + def centroid(self: GeoSeries) -> bigframes.series.Series: # type: ignore + return self._apply_unary_op(ops.geo_st_centroid_op) + + @property + def convex_hull(self: GeoSeries) -> bigframes.series.Series: # type: ignore + return self._apply_unary_op(ops.geo_st_convexhull_op) + def difference(self: GeoSeries, other: GeoSeries) -> bigframes.series.Series: # type: ignore return self._apply_binary_op(other, ops.geo_st_difference_op) diff --git a/bigframes/operations/__init__.py b/bigframes/operations/__init__.py index 86098d47cf..e10a972790 100644 --- a/bigframes/operations/__init__.py +++ b/bigframes/operations/__init__.py @@ -94,6 +94,8 @@ geo_area_op, geo_st_astext_op, geo_st_boundary_op, + geo_st_centroid_op, + geo_st_convexhull_op, geo_st_difference_op, geo_st_geogfromtext_op, geo_st_geogpoint_op, @@ -101,6 +103,7 @@ geo_st_isclosed_op, geo_x_op, geo_y_op, + GeoStBufferOp, GeoStDistanceOp, GeoStLengthOp, ) @@ -386,12 +389,15 @@ # Geo ops "geo_area_op", "geo_st_boundary_op", + "geo_st_centroid_op", + "geo_st_convexhull_op", "geo_st_difference_op", "geo_st_astext_op", "geo_st_geogfromtext_op", "geo_st_geogpoint_op", "geo_st_intersection_op", "geo_st_isclosed_op", + "GeoStBufferOp", "GeoStLengthOp", "geo_x_op", "geo_y_op", diff --git a/bigframes/operations/aggregations.py b/bigframes/operations/aggregations.py index 1c321c0bf8..984f7d3798 100644 --- a/bigframes/operations/aggregations.py +++ b/bigframes/operations/aggregations.py @@ -33,6 +33,11 @@ def skips_nulls(self): """Whether the window op skips null rows.""" return True + @property + def nulls_count_for_min_values(self) -> bool: + """Whether null values count for min_values.""" + return not self.skips_nulls + @property def implicitly_inherits_order(self): """ @@ -480,6 +485,10 @@ class FirstNonNullOp(UnaryWindowOp): def skips_nulls(self): return False + @property + def nulls_count_for_min_values(self) -> bool: + return False + @dataclasses.dataclass(frozen=True) class LastOp(UnaryWindowOp): @@ -492,6 +501,10 @@ class LastNonNullOp(UnaryWindowOp): def skips_nulls(self): return False + @property + def nulls_count_for_min_values(self) -> bool: + return False + @dataclasses.dataclass(frozen=True) class ShiftOp(UnaryWindowOp): diff --git a/bigframes/operations/datetime_ops.py b/bigframes/operations/datetime_ops.py index 6f44952488..9988e8ed7b 100644 --- a/bigframes/operations/datetime_ops.py +++ b/bigframes/operations/datetime_ops.py @@ -39,6 +39,28 @@ time_op = TimeOp() +@dataclasses.dataclass(frozen=True) +class ParseDatetimeOp(base_ops.UnaryOp): + # TODO: Support strict format + name: typing.ClassVar[str] = "parse_datetime" + + def output_type(self, *input_types: dtypes.ExpressionType) -> dtypes.ExpressionType: + if input_types[0] != dtypes.STRING_DTYPE: + raise TypeError("expected string input") + return pd.ArrowDtype(pa.timestamp("us", tz=None)) + + +@dataclasses.dataclass(frozen=True) +class ParseTimestampOp(base_ops.UnaryOp): + # TODO: Support strict format + name: typing.ClassVar[str] = "parse_timestamp" + + def output_type(self, *input_types: dtypes.ExpressionType) -> dtypes.ExpressionType: + if input_types[0] != dtypes.STRING_DTYPE: + raise TypeError("expected string input") + return pd.ArrowDtype(pa.timestamp("us", tz="UTC")) + + @dataclasses.dataclass(frozen=True) class ToDatetimeOp(base_ops.UnaryOp): name: typing.ClassVar[str] = "to_datetime" diff --git a/bigframes/operations/generic_ops.py b/bigframes/operations/generic_ops.py index 3c3f9653b4..152de543db 100644 --- 
a/bigframes/operations/generic_ops.py +++ b/bigframes/operations/generic_ops.py @@ -53,6 +53,280 @@ ) hash_op = HashOp() +# source, dest type +_VALID_CASTS = set( + ( + # INT casts + ( + dtypes.BOOL_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.FLOAT_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.NUMERIC_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.BIGNUMERIC_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.TIME_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.DATETIME_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.TIMESTAMP_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.TIMEDELTA_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.STRING_DTYPE, + dtypes.INT_DTYPE, + ), + ( + dtypes.JSON_DTYPE, + dtypes.INT_DTYPE, + ), + # Float casts + ( + dtypes.BOOL_DTYPE, + dtypes.FLOAT_DTYPE, + ), + ( + dtypes.NUMERIC_DTYPE, + dtypes.FLOAT_DTYPE, + ), + ( + dtypes.BIGNUMERIC_DTYPE, + dtypes.FLOAT_DTYPE, + ), + ( + dtypes.INT_DTYPE, + dtypes.FLOAT_DTYPE, + ), + ( + dtypes.STRING_DTYPE, + dtypes.FLOAT_DTYPE, + ), + ( + dtypes.JSON_DTYPE, + dtypes.FLOAT_DTYPE, + ), + # Bool casts + ( + dtypes.INT_DTYPE, + dtypes.BOOL_DTYPE, + ), + ( + dtypes.FLOAT_DTYPE, + dtypes.BOOL_DTYPE, + ), + ( + dtypes.JSON_DTYPE, + dtypes.BOOL_DTYPE, + ), + # String casts + ( + dtypes.BYTES_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.BOOL_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.FLOAT_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.TIME_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.INT_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.DATETIME_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.TIMESTAMP_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.DATE_DTYPE, + dtypes.STRING_DTYPE, + ), + ( + dtypes.JSON_DTYPE, + dtypes.STRING_DTYPE, + ), + # bytes casts + ( + dtypes.STRING_DTYPE, + dtypes.BYTES_DTYPE, + ), + # decimal casts + ( + dtypes.STRING_DTYPE, + dtypes.NUMERIC_DTYPE, + ), + ( + dtypes.INT_DTYPE, + dtypes.NUMERIC_DTYPE, + ), + ( + dtypes.FLOAT_DTYPE, + dtypes.NUMERIC_DTYPE, + ), + ( + dtypes.BIGNUMERIC_DTYPE, + dtypes.NUMERIC_DTYPE, + ), + # big decimal casts + ( + dtypes.STRING_DTYPE, + dtypes.BIGNUMERIC_DTYPE, + ), + ( + dtypes.INT_DTYPE, + dtypes.BIGNUMERIC_DTYPE, + ), + ( + dtypes.FLOAT_DTYPE, + dtypes.BIGNUMERIC_DTYPE, + ), + ( + dtypes.NUMERIC_DTYPE, + dtypes.BIGNUMERIC_DTYPE, + ), + # time casts + ( + dtypes.INT_DTYPE, + dtypes.TIME_DTYPE, + ), + ( + dtypes.DATETIME_DTYPE, + dtypes.TIME_DTYPE, + ), + ( + dtypes.TIMESTAMP_DTYPE, + dtypes.TIME_DTYPE, + ), + # date casts + ( + dtypes.STRING_DTYPE, + dtypes.DATE_DTYPE, + ), + ( + dtypes.DATETIME_DTYPE, + dtypes.DATE_DTYPE, + ), + ( + dtypes.TIMESTAMP_DTYPE, + dtypes.DATE_DTYPE, + ), + # datetime casts + ( + dtypes.DATE_DTYPE, + dtypes.DATETIME_DTYPE, + ), + ( + dtypes.STRING_DTYPE, + dtypes.DATETIME_DTYPE, + ), + ( + dtypes.TIMESTAMP_DTYPE, + dtypes.DATETIME_DTYPE, + ), + ( + dtypes.INT_DTYPE, + dtypes.DATETIME_DTYPE, + ), + # timestamp casts + ( + dtypes.DATE_DTYPE, + dtypes.TIMESTAMP_DTYPE, + ), + ( + dtypes.STRING_DTYPE, + dtypes.TIMESTAMP_DTYPE, + ), + ( + dtypes.DATETIME_DTYPE, + dtypes.TIMESTAMP_DTYPE, + ), + ( + dtypes.INT_DTYPE, + dtypes.TIMESTAMP_DTYPE, + ), + # timedelta casts + ( + dtypes.INT_DTYPE, + dtypes.TIMEDELTA_DTYPE, + ), + # json casts + ( + dtypes.BOOL_DTYPE, + dtypes.JSON_DTYPE, + ), + ( + dtypes.FLOAT_DTYPE, + dtypes.JSON_DTYPE, + ), + ( + dtypes.STRING_DTYPE, + dtypes.JSON_DTYPE, + ), + ( + dtypes.INT_DTYPE, + dtypes.JSON_DTYPE, + ), + ) +) + + +def _valid_scalar_cast(src: dtypes.Dtype, dst: dtypes.Dtype): + if src == dst: + return True + elif (src, dst) in 
_VALID_CASTS: + return True + return False + + +def _valid_cast(src: dtypes.Dtype, dst: dtypes.Dtype): + if src == dst: + return True + # TODO: Might need to be more strict within list/array context + if dtypes.is_array_like(src) and dtypes.is_array_like(dst): + src_inner = dtypes.get_array_inner_type(src) + dst_inner = dtypes.get_array_inner_type(dst) + return _valid_cast(src_inner, dst_inner) + if dtypes.is_struct_like(src) and dtypes.is_struct_like(dst): + src_fields = dtypes.get_struct_fields(src) + dst_fields = dtypes.get_struct_fields(dst) + if len(src_fields) != len(dst_fields): + return False + for (_, src_dtype), (_, dst_dtype) in zip( + src_fields.items(), dst_fields.items() + ): + if not _valid_cast(src_dtype, dst_dtype): + return False + return True + + return _valid_scalar_cast(src, dst) + @dataclasses.dataclass(frozen=True) class AsTypeOp(base_ops.UnaryOp): @@ -62,6 +336,9 @@ class AsTypeOp(base_ops.UnaryOp): safe: bool = False def output_type(self, *input_types): + if not _valid_cast(input_types[0], self.to_type): + raise TypeError(f"Cannot cast {input_types[0]} to {self.to_type}") + return self.to_type diff --git a/bigframes/operations/geo_ops.py b/bigframes/operations/geo_ops.py index 0268c63249..3b7754a47a 100644 --- a/bigframes/operations/geo_ops.py +++ b/bigframes/operations/geo_ops.py @@ -42,6 +42,22 @@ ) geo_st_boundary_op = GeoStBoundaryOp() +GeoStCentroidOp = base_ops.create_unary_op( + name="geo_st_centroid", + type_signature=op_typing.FixedOutputType( + dtypes.is_geo_like, dtypes.GEO_DTYPE, description="geo-like" + ), +) +geo_st_centroid_op = GeoStCentroidOp() + +GeoStConvexhullOp = base_ops.create_unary_op( + name="geo_st_convexhull", + type_signature=op_typing.FixedOutputType( + dtypes.is_geo_like, dtypes.GEO_DTYPE, description="geo-like" + ), +) +geo_st_convexhull_op = GeoStConvexhullOp() + GeoStDifferenceOp = base_ops.create_binary_op( name="geo_st_difference", type_signature=op_typing.BinaryGeo() ) @@ -90,6 +106,17 @@ geo_st_intersection_op = GeoStIntersectionOp() +@dataclasses.dataclass(frozen=True) +class GeoStBufferOp(base_ops.UnaryOp): + name = "st_buffer" + buffer_radius: float + num_seg_quarter_circle: float + use_spheroid: bool + + def output_type(self, *input_types: dtypes.ExpressionType) -> dtypes.ExpressionType: + return dtypes.GEO_DTYPE + + @dataclasses.dataclass(frozen=True) class GeoStDistanceOp(base_ops.BinaryOp): name = "st_distance" diff --git a/bigframes/operations/json_ops.py b/bigframes/operations/json_ops.py index 81f00c39ce..b1f4f2f689 100644 --- a/bigframes/operations/json_ops.py +++ b/bigframes/operations/json_ops.py @@ -183,3 +183,18 @@ def output_type(self, *input_types): + f" Received type: {input_type}" ) return input_type + + +@dataclasses.dataclass(frozen=True) +class JSONDecode(base_ops.UnaryOp): + name: typing.ClassVar[str] = "json_decode" + to_type: dtypes.Dtype + + def output_type(self, *input_types): + input_type = input_types[0] + if not dtypes.is_json_like(input_type): + raise TypeError( + "Input type must be a valid JSON object or JSON-formatted string type." 
+                + f" Received type: {input_type}"
+            )
+        return self.to_type
diff --git a/bigframes/pandas/__init__.py b/bigframes/pandas/__init__.py
index 76e0f8719b..6ffed5b53f 100644
--- a/bigframes/pandas/__init__.py
+++ b/bigframes/pandas/__init__.py
@@ -293,7 +293,7 @@ def clean_up_by_session_id(
         session.bqclient, dataset, session_id
     )

-    bff_utils._clean_up_by_session_id(
+    bff_utils.clean_up_by_session_id(
         session.bqclient, session.cloudfunctionsclient, dataset, session_id
     )
diff --git a/bigframes/pandas/io/api.py b/bigframes/pandas/io/api.py
index a88cc7a011..cf4b4eb19c 100644
--- a/bigframes/pandas/io/api.py
+++ b/bigframes/pandas/io/api.py
@@ -33,6 +33,7 @@
     Tuple,
     Union,
 )
+import warnings

 import bigframes_vendored.constants as constants
 import bigframes_vendored.pandas.io.gbq as vendored_pandas_gbq
@@ -348,7 +349,11 @@ def _read_gbq_colab(
         )
         _set_default_session_location_if_possible_deferred_query(create_query)
         if not config.options.bigquery._session_started:
-            config.options.bigquery.enable_polars_execution = True
+            with warnings.catch_warnings():
+                # Don't warn about Polars in SQL cells.
+                # Related to b/437090788.
+                warnings.simplefilter("ignore", bigframes.exceptions.PreviewWarning)
+                config.options.bigquery.enable_polars_execution = True

     return global_session.with_default_session(
         bigframes.session.Session._read_gbq_colab,
diff --git a/bigframes/series.py b/bigframes/series.py
index 3a1af0bb1d..bfc26adc38 100644
--- a/bigframes/series.py
+++ b/bigframes/series.py
@@ -1631,7 +1631,7 @@ def value_counts(
             [self._value_column],
             normalize=normalize,
             ascending=ascending,
-            dropna=dropna,
+            drop_na=dropna,
         )
         return Series(block)
diff --git a/bigframes/session/_io/bigquery/read_gbq_table.py b/bigframes/session/_io/bigquery/read_gbq_table.py
index 6322040428..30a25762eb 100644
--- a/bigframes/session/_io/bigquery/read_gbq_table.py
+++ b/bigframes/session/_io/bigquery/read_gbq_table.py
@@ -54,26 +54,43 @@ def get_table_metadata(

     cached_table = cache.get(table_ref)
     if use_cache and cached_table is not None:
-        snapshot_timestamp, _ = cached_table
-
-        # Cache hit could be unexpected. See internal issue 329545805.
-        # Raise a warning with more information about how to avoid the
-        # problems with the cache.
-        msg = bfe.format_message(
-            f"Reading cached table from {snapshot_timestamp} to avoid "
-            "incompatibilies with previous reads of this table. To read "
-            "the latest version, set `use_cache=False` or close the "
-            "current session with Session.close() or "
-            "bigframes.pandas.close_session()."
-        )
-        # There are many layers before we get to (possibly) the user's code:
-        # pandas.read_gbq_table
-        # -> with_default_session
-        # -> Session.read_gbq_table
-        # -> _read_gbq_table
-        # -> _get_snapshot_sql_and_primary_key
-        # -> get_snapshot_datetime_and_table_metadata
-        warnings.warn(msg, stacklevel=7)
+        snapshot_timestamp, table = cached_table
+
+        if is_time_travel_eligible(
+            bqclient=bqclient,
+            table=table,
+            columns=None,
+            snapshot_time=snapshot_timestamp,
+            filter_str=None,
+            # Don't warn, because that will already have been taken care of.
+            should_warn=False,
+            should_dry_run=False,
+        ):
+            # This warning should only happen if the cached snapshot_time will
+            # have any effect on bigframes (b/437090788). For example, with
+            # cached query results, such as after re-running a query, time
+            # travel won't be applied and thus this check is irrelevant.
+            #
+            # In other cases, such as an explicit read_gbq_table(), a cache
+            # hit could be unexpected. See internal issue 329545805.
+            # Raise a warning with more information about how to avoid the
+            # problems with the cache.
+            msg = bfe.format_message(
+                f"Reading cached table from {snapshot_timestamp} to avoid "
+                "incompatibilities with previous reads of this table. To read "
+                "the latest version, set `use_cache=False` or close the "
+                "current session with Session.close() or "
+                "bigframes.pandas.close_session()."
+            )
+            # There are many layers before we get to (possibly) the user's code:
+            # pandas.read_gbq_table
+            # -> with_default_session
+            # -> Session.read_gbq_table
+            # -> _read_gbq_table
+            # -> _get_snapshot_sql_and_primary_key
+            # -> get_snapshot_datetime_and_table_metadata
+            warnings.warn(msg, category=bfe.TimeTravelCacheWarning, stacklevel=7)
+
         return cached_table

     table = bqclient.get_table(table_ref)
@@ -88,40 +105,74 @@ def get_table_metadata(
     return cached_table


-def validate_table(
+def is_time_travel_eligible(
     bqclient: bigquery.Client,
     table: bigquery.table.Table,
     columns: Optional[Sequence[str]],
     snapshot_time: datetime.datetime,
     filter_str: Optional[str] = None,
-) -> bool:
-    """Validates that the table can be read, returns True iff snapshot is supported."""
+    *,
+    should_warn: bool,
+    should_dry_run: bool,
+) -> bool:
+    """Check if a table is eligible to use time travel.
+
+    Args:
+        table:
+            BigQuery table to check.
+        should_warn:
+            If True, raises a warning when time travel is disabled and the
+            underlying table is likely mutable.
+        should_dry_run:
+            If True, validate the result with a ``dry_run`` query.
+
+    Returns:
+        bool:
+            True if there is a chance that time travel may be supported on
+            this table. If ``should_dry_run`` is True, then this is validated
+            with a ``dry_run`` query.
+    """
+
+    # user code
+    # -> pandas.read_gbq_table
+    # -> with_default_session
+    # -> session.read_gbq_table
+    # -> session._read_gbq_table
+    # -> loader.read_gbq_table
+    # -> is_time_travel_eligible
+    stacklevel = 7

-    time_travel_not_found = False
     # Anonymous dataset, does not support snapshot ever
     if table.dataset_id.startswith("_"):
-        pass
+        return False

     # Only true tables support time travel
-    elif table.table_id.endswith("*"):
-        msg = bfe.format_message(
-            "Wildcard tables do not support FOR SYSTEM_TIME AS OF queries. "
-            "Attempting query without time travel. Be aware that "
-            "modifications to the underlying data may result in errors or "
-            "unexpected behavior."
-        )
-        warnings.warn(msg, category=bfe.TimeTravelDisabledWarning)
-    elif table.table_type != "TABLE":
-        if table.table_type == "MATERIALIZED_VIEW":
+    if table.table_id.endswith("*"):
+        if should_warn:
             msg = bfe.format_message(
-                "Materialized views do not support FOR SYSTEM_TIME AS OF queries. "
-                "Attempting query without time travel. Be aware that as materialized views "
-                "are updated periodically, modifications to the underlying data in the view may "
-                "result in errors or unexpected behavior."
+                "Wildcard tables do not support FOR SYSTEM_TIME AS OF queries. "
+                "Attempting query without time travel. Be aware that "
+                "modifications to the underlying data may result in errors or "
+                "unexpected behavior."
             )
-            warnings.warn(msg, category=bfe.TimeTravelDisabledWarning)
-        else:
-            # table might support time travel, lets do a dry-run query with time travel
+            warnings.warn(
+                msg, category=bfe.TimeTravelDisabledWarning, stacklevel=stacklevel
+            )
+        return False
+    elif table.table_type != "TABLE":
+        if table.table_type == "MATERIALIZED_VIEW":
+            if should_warn:
+                msg = bfe.format_message(
+                    "Materialized views do not support FOR SYSTEM_TIME AS OF queries. "
+                    "Attempting query without time travel. "
+                    "Be aware that as materialized views "
+                    "are updated periodically, modifications to the underlying data in the view may "
+                    "result in errors or unexpected behavior."
+                )
+                warnings.warn(
+                    msg, category=bfe.TimeTravelDisabledWarning, stacklevel=stacklevel
+                )
+            return False
+
+    # table might support time travel, let's do a dry-run query with time travel
+    if should_dry_run:
         snapshot_sql = bigframes.session._io.bigquery.to_query(
             query_or_table=f"{table.reference.project}.{table.reference.dataset_id}.{table.reference.table_id}",
             columns=columns or (),
@@ -129,36 +180,39 @@
             time_travel_timestamp=snapshot_time,
         )
         try:
-            # If this succeeds, we don't need to query without time travel, that would surely succeed
-            bqclient.query_and_wait(
-                snapshot_sql, job_config=bigquery.QueryJobConfig(dry_run=True)
+            # If this succeeds, we know that time travel will for sure work.
+            bigframes.session._io.bigquery.start_query_with_client(
+                bq_client=bqclient,
+                sql=snapshot_sql,
+                job_config=bigquery.QueryJobConfig(dry_run=True),
+                location=None,
+                project=None,
+                timeout=None,
+                metrics=None,
+                query_with_job=False,
             )
             return True
+
         except google.api_core.exceptions.NotFound:
-            # note that a notfound caused by a simple typo will be
-            # caught above when the metadata is fetched, not here
-            time_travel_not_found = True
-
-    # At this point, time travel is known to fail, but can we query without time travel?
-    snapshot_sql = bigframes.session._io.bigquery.to_query(
-        query_or_table=f"{table.reference.project}.{table.reference.dataset_id}.{table.reference.table_id}",
-        columns=columns or (),
-        sql_predicate=filter_str,
-        time_travel_timestamp=None,
-    )
-    # Any errors here should just be raised to user
-    bqclient.query_and_wait(
-        snapshot_sql, job_config=bigquery.QueryJobConfig(dry_run=True)
-    )
-    if time_travel_not_found:
-        msg = bfe.format_message(
-            "NotFound error when reading table with time travel."
-            " Attempting query without time travel. Warning: Without"
-            " time travel, modifications to the underlying table may"
-            " result in errors or unexpected behavior."
-        )
-        warnings.warn(msg, category=bfe.TimeTravelDisabledWarning)
-    return False
+            # If SYSTEM_TIME isn't supported, the dry run appears to return a
+            # NotFound error. Note that a NotFound caused by a simple typo
+            # will be caught above when the metadata is fetched, not here.
+            if should_warn:
+                msg = bfe.format_message(
+                    "NotFound error when reading table with time travel."
+                    " Attempting query without time travel. Warning: Without"
+                    " time travel, modifications to the underlying table may"
+                    " result in errors or unexpected behavior."
+                )
+                warnings.warn(
+                    msg, category=bfe.TimeTravelDisabledWarning, stacklevel=stacklevel
+                )
+
+            # If we make it to here, we know for sure that time travel won't work.
+            return False
+    else:
+        # We haven't validated it, but there's a chance that time travel could work.
+ return True def infer_unique_columns( diff --git a/bigframes/session/loader.py b/bigframes/session/loader.py index c264abd860..6500701324 100644 --- a/bigframes/session/loader.py +++ b/bigframes/session/loader.py @@ -744,18 +744,15 @@ def read_gbq_table( else (*columns, *[col for col in index_cols if col not in columns]) ) - try: - enable_snapshot = enable_snapshot and bf_read_gbq_table.validate_table( - self._bqclient, - table, - all_columns, - time_travel_timestamp, - filter_str, - ) - except google.api_core.exceptions.Forbidden as ex: - if "Drive credentials" in ex.message: - ex.message += "\nCheck https://cloud.google.com/bigquery/docs/query-drive-data#Google_Drive_permissions." - raise + enable_snapshot = enable_snapshot and bf_read_gbq_table.is_time_travel_eligible( + self._bqclient, + table, + all_columns, + time_travel_timestamp, + filter_str, + should_warn=True, + should_dry_run=True, + ) # ---------------------------- # Create ordering and validate diff --git a/bigframes/session/metrics.py b/bigframes/session/metrics.py index 36e48ee9ec..8ec8d525cc 100644 --- a/bigframes/session/metrics.py +++ b/bigframes/session/metrics.py @@ -42,9 +42,9 @@ def count_job_stats( assert row_iterator is not None # TODO(tswast): Pass None after making benchmark publishing robust to missing data. - bytes_processed = getattr(row_iterator, "total_bytes_processed", 0) - query_char_count = len(getattr(row_iterator, "query", "")) - slot_millis = getattr(row_iterator, "slot_millis", 0) + bytes_processed = getattr(row_iterator, "total_bytes_processed", 0) or 0 + query_char_count = len(getattr(row_iterator, "query", "") or "") + slot_millis = getattr(row_iterator, "slot_millis", 0) or 0 exec_seconds = 0.0 self.execution_count += 1 @@ -63,10 +63,10 @@ def count_job_stats( elif (stats := get_performance_stats(query_job)) is not None: query_char_count, bytes_processed, slot_millis, exec_seconds = stats self.execution_count += 1 - self.query_char_count += query_char_count - self.bytes_processed += bytes_processed - self.slot_millis += slot_millis - self.execution_secs += exec_seconds + self.query_char_count += query_char_count or 0 + self.bytes_processed += bytes_processed or 0 + self.slot_millis += slot_millis or 0 + self.execution_secs += exec_seconds or 0 write_stats_to_disk( query_char_count=query_char_count, bytes_processed=bytes_processed, diff --git a/bigframes/session/polars_executor.py b/bigframes/session/polars_executor.py index 2c04a0016b..ccc577deae 100644 --- a/bigframes/session/polars_executor.py +++ b/bigframes/session/polars_executor.py @@ -21,7 +21,7 @@ from bigframes.core import array_value, bigframe_node, expression, local_data, nodes import bigframes.operations from bigframes.operations import aggregations as agg_ops -from bigframes.operations import comparison_ops, numeric_ops +from bigframes.operations import comparison_ops, generic_ops, numeric_ops from bigframes.session import executor, semi_executor if TYPE_CHECKING: @@ -57,6 +57,7 @@ numeric_ops.DivOp, numeric_ops.FloorDivOp, numeric_ops.ModOp, + generic_ops.AsTypeOp, ) _COMPATIBLE_AGG_OPS = ( agg_ops.SizeOp, diff --git a/bigframes/testing/utils.py b/bigframes/testing/utils.py index c3a8008465..5da24c5b9b 100644 --- a/bigframes/testing/utils.py +++ b/bigframes/testing/utils.py @@ -440,11 +440,11 @@ def get_function_name(func, package_requirements=None, is_row_processor=False): """Get a bigframes function name for testing given a udf.""" # Augment user package requirements with any internal package # requirements. 
- package_requirements = bff_utils._get_updated_package_requirements( + package_requirements = bff_utils.get_updated_package_requirements( package_requirements, is_row_processor ) # Compute a unique hash representing the user code. - function_hash = bff_utils._get_hash(func, package_requirements) + function_hash = bff_utils.get_hash(func, package_requirements) return f"bigframes_{function_hash}" diff --git a/bigframes/version.py b/bigframes/version.py index e85f0b73c8..7aff17a40d 100644 --- a/bigframes/version.py +++ b/bigframes/version.py @@ -12,8 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. -__version__ = "2.14.0" +__version__ = "2.15.0" # {x-release-please-start-date} -__release_date__ = "2025-08-05" +__release_date__ = "2025-08-11" # {x-release-please-end} diff --git a/noxfile.py b/noxfile.py index 2d0edfc1b0..7adf499a08 100644 --- a/noxfile.py +++ b/noxfile.py @@ -78,15 +78,20 @@ ] UNIT_TEST_LOCAL_DEPENDENCIES: List[str] = [] UNIT_TEST_DEPENDENCIES: List[str] = [] -UNIT_TEST_EXTRAS: List[str] = ["tests", "anywidget"] +UNIT_TEST_EXTRAS: List[str] = ["tests"] UNIT_TEST_EXTRAS_BY_PYTHON: Dict[str, List[str]] = { - "3.12": ["tests", "polars", "scikit-learn", "anywidget"], + "3.10": ["tests", "scikit-learn", "anywidget"], + "3.11": ["tests", "polars", "scikit-learn", "anywidget"], + # Make sure we leave some versions without "extras" so we know those + # dependencies are actually optional. + "3.13": ["tests", "polars", "scikit-learn", "anywidget"], } +# 3.11 is used by colab. # 3.10 is needed for Windows tests as it is the only version installed in the # bigframes-windows container image. For more information, search # bigframes/windows-docker, internally. -SYSTEM_TEST_PYTHON_VERSIONS = ["3.9", "3.10", "3.12", "3.13"] +SYSTEM_TEST_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.13"] SYSTEM_TEST_STANDARD_DEPENDENCIES = [ "jinja2", "mock", @@ -105,12 +110,13 @@ ] SYSTEM_TEST_LOCAL_DEPENDENCIES: List[str] = [] SYSTEM_TEST_DEPENDENCIES: List[str] = [] -SYSTEM_TEST_EXTRAS: List[str] = [] +SYSTEM_TEST_EXTRAS: List[str] = ["tests"] SYSTEM_TEST_EXTRAS_BY_PYTHON: Dict[str, List[str]] = { - "3.9": ["tests", "anywidget"], - "3.10": ["tests", "polars"], - "3.12": ["tests", "scikit-learn", "polars", "anywidget"], - "3.13": ["tests", "polars"], + # Make sure we leave some versions without "extras" so we know those + # dependencies are actually optional. + "3.10": ["tests", "scikit-learn", "anywidget"], + "3.11": ["tests", "scikit-learn", "polars", "anywidget"], + "3.13": ["tests", "polars", "anywidget"], } LOGGING_NAME_ENV_VAR = "BIGFRAMES_PERFORMANCE_LOG_NAME" @@ -120,8 +126,8 @@ # Sessions are executed in the order so putting the smaller sessions # ahead to fail fast at presubmit running. nox.options.sessions = [ - "system-3.9", - "system-3.12", + "system-3.9", # No extras. + "system-3.11", "cover", # TODO(b/401609005): remove "cleanup", diff --git a/samples/snippets/quickstart.py b/samples/snippets/quickstart.py index bc05cd2512..08662c1ea7 100644 --- a/samples/snippets/quickstart.py +++ b/samples/snippets/quickstart.py @@ -16,7 +16,7 @@ def run_quickstart(project_id: str) -> None: your_gcp_project_id = project_id - # [START bigquery_bigframes_quickstart] + # [START bigquery_bigframes_quickstart_create_dataframe] import bigframes.pandas as bpd # Set BigQuery DataFrames options @@ -37,12 +37,16 @@ def run_quickstart(project_id: str) -> None: # Efficiently preview the results using the .peek() method. 
     df.peek()
+    # [END bigquery_bigframes_quickstart_create_dataframe]
+
+    # [START bigquery_bigframes_quickstart_calculate_print]
     # Use the DataFrame just as you would a pandas DataFrame, but calculations
     # happen in the BigQuery query engine instead of the local system.
     average_body_mass = df["body_mass_g"].mean()
     print(f"average_body_mass: {average_body_mass}")
+    # [END bigquery_bigframes_quickstart_calculate_print]
+
+    # [START bigquery_bigframes_quickstart_eval_metrics]
     # Create the Linear Regression model
     from bigframes.ml.linear_model import LinearRegression
@@ -70,7 +74,7 @@ def run_quickstart(project_id: str) -> None:
     model = LinearRegression(fit_intercept=False)
     model.fit(X, y)
     model.score(X, y)
-    # [END bigquery_bigframes_quickstart]
+    # [END bigquery_bigframes_quickstart_eval_metrics]

     # close the session and reset options so as not to affect other tests
     bpd.close_session()
diff --git a/setup.py b/setup.py
index bc42cc4281..2aef514749 100644
--- a/setup.py
+++ b/setup.py
@@ -76,8 +76,8 @@
         "google-cloud-bigtable >=2.24.0",
         "google-cloud-pubsub >=2.21.4",
     ],
-    # used for local engine, which is only needed for unit tests at present.
-    "polars": ["polars >= 1.7.0"],
+    # used for local engine
+    "polars": ["polars >= 1.21.0"],
     "scikit-learn": ["scikit-learn>=1.2.2"],
     # Packages required for basic development flow.
     "dev": [
diff --git a/specs/2025-08-04-geoseries-scalars.md b/specs/2025-08-04-geoseries-scalars.md
new file mode 100644
index 0000000000..38dc77c4cf
--- /dev/null
+++ b/specs/2025-08-04-geoseries-scalars.md
@@ -0,0 +1,307 @@
+# Implementing GeoSeries scalar operators
+
+The goal of this project is to implement all GeoSeries scalar properties and
+methods in the `bigframes.geopandas.GeoSeries` class. Likewise, all BigQuery
+GEOGRAPHY functions should be exposed in the `bigframes.bigquery` module.
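+
+A rough sketch of the intended user experience, using names that appear
+elsewhere in this change (`GeoSeries.from_wkt`, the new `centroid` property,
+and `bigframes.bigquery.st_buffer`); treat it as illustrative rather than a
+final API:
+
+```python
+import bigframes.bigquery as bbq
+import bigframes.geopandas
+
+series = bigframes.geopandas.GeoSeries.from_wkt(
+    ["POINT(0 0)", "LINESTRING(0 0, 1 1)"]
+)
+
+# Operators with geopandas-compatible semantics live on GeoSeries.
+centroids = series.centroid  # ST_CENTROID under the hood
+
+# Operators whose semantics differ from geopandas (e.g. meters rather than
+# CRS units) are exposed only in the bigframes.bigquery module.
+buffered = bbq.st_buffer(series, 10.0)  # ST_BUFFER
+```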
+ +## Background + +*Explain the context and why this change is necessary.* +*Include links to relevant issues or documentation.* + +* https://geopandas.org/en/stable/docs/reference/geoseries.html +* https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions + +## Acceptance Criteria + +*Define the specific, measurable outcomes that indicate the task is complete.* +*Use a checklist format for clarity.* + +### GeoSeries methods and properties + +- [x] Constructor +- [x] GeoSeries.area +- [x] GeoSeries.boundary +- [ ] GeoSeries.bounds +- [ ] GeoSeries.total_bounds +- [x] GeoSeries.length +- [ ] GeoSeries.geom_type +- [ ] GeoSeries.offset_curve +- [x] GeoSeries.distance +- [ ] GeoSeries.hausdorff_distance +- [ ] GeoSeries.frechet_distance +- [ ] GeoSeries.representative_point +- [ ] GeoSeries.exterior +- [ ] GeoSeries.interiors +- [ ] GeoSeries.minimum_bounding_radius +- [ ] GeoSeries.minimum_clearance +- [x] GeoSeries.x +- [x] GeoSeries.y +- [ ] GeoSeries.z +- [ ] GeoSeries.m +- [ ] GeoSeries.get_coordinates +- [ ] GeoSeries.count_coordinates +- [ ] GeoSeries.count_geometries +- [ ] GeoSeries.count_interior_rings +- [ ] GeoSeries.set_precision +- [ ] GeoSeries.get_precision +- [ ] GeoSeries.get_geometry +- [x] GeoSeries.is_closed +- [ ] GeoSeries.is_empty +- [ ] GeoSeries.is_ring +- [ ] GeoSeries.is_simple +- [ ] GeoSeries.is_valid +- [ ] GeoSeries.is_valid_reason +- [ ] GeoSeries.is_valid_coverage +- [ ] GeoSeries.invalid_coverage_edges +- [ ] GeoSeries.has_m +- [ ] GeoSeries.has_z +- [ ] GeoSeries.is_ccw +- [ ] GeoSeries.contains +- [ ] GeoSeries.contains_properly +- [ ] GeoSeries.crosses +- [ ] GeoSeries.disjoint +- [ ] GeoSeries.dwithin +- [ ] GeoSeries.geom_equals +- [ ] GeoSeries.geom_equals_exact +- [ ] GeoSeries.geom_equals_identical +- [ ] GeoSeries.intersects +- [ ] GeoSeries.overlaps +- [ ] GeoSeries.touches +- [ ] GeoSeries.within +- [ ] GeoSeries.covers +- [ ] GeoSeries.covered_by +- [ ] GeoSeries.relate +- [ ] GeoSeries.relate_pattern +- [ ] GeoSeries.clip_by_rect +- [x] GeoSeries.difference +- [x] GeoSeries.intersection +- [ ] GeoSeries.symmetric_difference +- [ ] GeoSeries.union +- [x] GeoSeries.boundary +- [x] GeoSeries.buffer +- [x] GeoSeries.centroid +- [ ] GeoSeries.concave_hull +- [x] GeoSeries.convex_hull +- [ ] GeoSeries.envelope +- [ ] GeoSeries.extract_unique_points +- [ ] GeoSeries.force_2d +- [ ] GeoSeries.force_3d +- [ ] GeoSeries.make_valid +- [ ] GeoSeries.minimum_bounding_circle +- [ ] GeoSeries.maximum_inscribed_circle +- [ ] GeoSeries.minimum_clearance +- [ ] GeoSeries.minimum_clearance_line +- [ ] GeoSeries.minimum_rotated_rectangle +- [ ] GeoSeries.normalize +- [ ] GeoSeries.orient_polygons +- [ ] GeoSeries.remove_repeated_points +- [ ] GeoSeries.reverse +- [ ] GeoSeries.sample_points +- [ ] GeoSeries.segmentize +- [ ] GeoSeries.shortest_line +- [ ] GeoSeries.simplify +- [ ] GeoSeries.simplify_coverage +- [ ] GeoSeries.snap +- [ ] GeoSeries.transform +- [ ] GeoSeries.affine_transform +- [ ] GeoSeries.rotate +- [ ] GeoSeries.scale +- [ ] GeoSeries.skew +- [ ] GeoSeries.translate +- [ ] GeoSeries.interpolate +- [ ] GeoSeries.line_merge +- [ ] GeoSeries.project +- [ ] GeoSeries.shared_paths +- [ ] GeoSeries.build_area +- [ ] GeoSeries.constrained_delaunay_triangles +- [ ] GeoSeries.delaunay_triangles +- [ ] GeoSeries.explode +- [ ] GeoSeries.intersection_all +- [ ] GeoSeries.polygonize +- [ ] GeoSeries.union_all +- [ ] GeoSeries.voronoi_polygons +- [ ] GeoSeries.from_arrow +- [ ] GeoSeries.from_file +- [ ] GeoSeries.from_wkb +- [x] 
GeoSeries.from_wkt
+- [x] GeoSeries.from_xy
+- [ ] GeoSeries.to_arrow
+- [ ] GeoSeries.to_file
+- [ ] GeoSeries.to_json
+- [ ] GeoSeries.to_wkb
+- [x] GeoSeries.to_wkt
+- [ ] GeoSeries.crs
+- [ ] GeoSeries.set_crs
+- [ ] GeoSeries.to_crs
+- [ ] GeoSeries.estimate_utm_crs
+- [ ] GeoSeries.fillna
+- [ ] GeoSeries.isna
+- [ ] GeoSeries.notna
+- [ ] GeoSeries.clip
+- [ ] GeoSeries.plot
+- [ ] GeoSeries.explore
+- [ ] GeoSeries.sindex
+- [ ] GeoSeries.has_sindex
+- [ ] GeoSeries.cx
+- [ ] GeoSeries.__geo_interface__
+
+### `bigframes.bigquery` functions
+
+Constructors: Functions that build new geography values from coordinates or
+existing geographies.
+
+- [x] ST_GEOGPOINT
+- [ ] ST_MAKELINE
+- [ ] ST_MAKEPOLYGON
+- [ ] ST_MAKEPOLYGONORIENTED
+
+Parsers: Functions (the `ST_GEOGFROM*` family) that create geographies from
+an external format such as WKT and GeoJSON.
+
+- [ ] ST_GEOGFROMGEOJSON
+- [x] ST_GEOGFROMTEXT
+- [ ] ST_GEOGFROMWKB
+- [ ] ST_GEOGPOINTFROMGEOHASH
+
+Formatters: Functions that export geographies to an external format such as WKT.
+
+- [ ] ST_ASBINARY
+- [ ] ST_ASGEOJSON
+- [x] ST_ASTEXT
+- [ ] ST_GEOHASH
+
+Transformations: Functions that generate a new geography based on input.
+
+- [x] ST_BOUNDARY
+- [x] ST_BUFFER
+- [ ] ST_BUFFERWITHTOLERANCE
+- [x] ST_CENTROID
+- [ ] ST_CENTROID_AGG (Aggregate)
+- [ ] ST_CLOSESTPOINT
+- [x] ST_CONVEXHULL
+- [x] ST_DIFFERENCE
+- [ ] ST_EXTERIORRING
+- [ ] ST_INTERIORRINGS
+- [x] ST_INTERSECTION
+- [ ] ST_LINEINTERPOLATEPOINT
+- [ ] ST_LINESUBSTRING
+- [ ] ST_SIMPLIFY
+- [ ] ST_SNAPTOGRID
+- [ ] ST_UNION
+- [ ] ST_UNION_AGG (Aggregate)
+
+Accessors: Functions that provide access to properties of a geography without
+side-effects.
+
+- [ ] ST_DIMENSION
+- [ ] ST_DUMP
+- [ ] ST_ENDPOINT
+- [ ] ST_GEOMETRYTYPE
+- [x] ST_ISCLOSED
+- [ ] ST_ISCOLLECTION
+- [ ] ST_ISEMPTY
+- [ ] ST_ISRING
+- [ ] ST_NPOINTS
+- [ ] ST_NUMGEOMETRIES
+- [ ] ST_NUMPOINTS
+- [ ] ST_POINTN
+- [ ] ST_STARTPOINT
+- [x] ST_X
+- [x] ST_Y
+
+Predicates: Functions that return TRUE or FALSE for some spatial relationship
+between two geographies or some property of a geography. These functions are
+commonly used in filter clauses.
+
+- [ ] ST_CONTAINS
+- [ ] ST_COVEREDBY
+- [ ] ST_COVERS
+- [ ] ST_DISJOINT
+- [ ] ST_DWITHIN
+- [ ] ST_EQUALS
+- [ ] ST_HAUSDORFFDWITHIN
+- [ ] ST_INTERSECTS
+- [ ] ST_INTERSECTSBOX
+- [ ] ST_TOUCHES
+- [ ] ST_WITHIN
+
+Measures: Functions that compute measurements of one or more geographies.
+
+- [ ] ST_ANGLE
+- [x] ST_AREA
+- [ ] ST_AZIMUTH
+- [ ] ST_BOUNDINGBOX
+- [x] ST_DISTANCE
+- [ ] ST_EXTENT (Aggregate)
+- [ ] ST_HAUSDORFFDISTANCE
+- [ ] ST_LINELOCATEPOINT
+- [x] ST_LENGTH
+- [ ] ST_MAXDISTANCE
+- [ ] ST_PERIMETER
+
+Clustering: Functions that perform clustering on geographies.
+
+- [ ] ST_CLUSTERDBSCAN
+
+S2 functions: Functions for working with S2 cell coverings of GEOGRAPHY.
+
+- [ ] S2_CELLIDFROMPOINT
+- [ ] S2_COVERINGCELLIDS
+
+Raster functions: Functions for analyzing geospatial rasters using geographies.
+
+- [ ] ST_REGIONSTATS
+
+## Detailed Steps
+
+*Break down the implementation into small, actionable steps.*
+*This section will guide the development process.*
+
+### Implementing a new scalar geography operation
+
+(A non-normative code sketch of these steps appears in the appendix at the
+end of this spec.)
+
+- [ ] **Define the operation dataclass:**
+  - [ ] In `bigframes/operations/geo_ops.py`, create a new dataclass inheriting from `base_ops.UnaryOp` or `base_ops.BinaryOp`.
+  - [ ] Define the `name` of the operation and any parameters it requires.
+  - [ ] Implement the `output_type` method to specify the data type of the result.
+- [ ] **Export the new operation:**
+  - [ ] In `bigframes/operations/__init__.py`, import your new operation dataclass and add it to the `__all__` list.
+- [ ] **Implement the compilation logic:**
+  - [ ] In `bigframes/core/compile/scalar_op_compiler.py`:
+    - [ ] If the BigQuery function has a direct equivalent in Ibis, you can often reuse an existing Ibis method.
+    - [ ] If not, define a new Ibis UDF using `@ibis_udf.scalar.builtin` to map to the specific BigQuery function signature.
+    - [ ] Create a new compiler implementation function (e.g., `geo_length_op_impl`).
+    - [ ] Register this function as the implementation of your operation dataclass using `@scalar_op_compiler.register_unary_op` or `@scalar_op_compiler.register_binary_op`.
+- [ ] **Implement the user-facing function or property:**
+  - [ ] For a `bigframes.bigquery` function:
+    - [ ] In `bigframes/bigquery/_operations/geo.py`, create the user-facing function (e.g., `st_length`).
+    - [ ] The function should take a `Series` and any other parameters.
+    - [ ] Inside the function, call `series._apply_unary_op` or `series._apply_binary_op`, passing the operation dataclass you created.
+    - [ ] Add a comprehensive docstring with examples.
+    - [ ] In `bigframes/bigquery/__init__.py`, import your new user-facing function and add it to the `__all__` list.
+  - [ ] For a `GeoSeries` property or method:
+    - [ ] In `bigframes/geopandas/geoseries.py`, create the property or method.
+    - [ ] If the operation cannot be supported, for example because the
+      geopandas method returns values in units of the coordinate system
+      rather than the meters that BigQuery uses, raise a
+      `NotImplementedError` with a helpful message.
+    - [ ] Otherwise, call `series._apply_unary_op` or `series._apply_binary_op`, passing the operation dataclass.
+    - [ ] Add a comprehensive docstring with examples.
+- [ ] **Add Tests:**
+  - [ ] Add system tests in `tests/system/small/bigquery/test_geo.py` or `tests/system/small/geopandas/test_geoseries.py` to verify the end-to-end functionality. Test various inputs, including edge cases and `NULL` values.
+  - [ ] If you are overriding a pandas or GeoPandas property and raising `NotImplementedError`, add a unit test to ensure the correct error is raised.
+
+## Verification
+
+*Specify the commands to run to verify the changes.*
+
+- [ ] The `nox -r -s format lint lint_setup_py` linter should pass.
+- [ ] The `nox -r -s mypy` static type checker should pass.
+- [ ] The `nox -r -s docs docfx` docs should successfully build and include relevant docs in the output.
+- [ ] All new and existing unit tests `pytest tests/unit` should pass.
+- [ ] Identify all related system tests in the `tests/system` directories.
+- [ ] All related system tests `pytest tests/system/small/path_to_relevant_test.py::test_name` should pass.
+
+## Constraints
+
+Follow the guidelines listed in GEMINI.md at the root of the repository.
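+
+## Appendix: Example code sketch
+
+A minimal, non-normative sketch of the steps above, written against a
+hypothetical wrapper for BigQuery's `ST_ISEMPTY` (still unchecked in the
+acceptance criteria). The names `GeoStIsEmptyOp`, `geo_st_isempty_op_impl`,
+and `st_isempty` are invented for illustration; import and annotation
+details should be copied from neighboring operators in each file rather
+than from this sketch. The two `st_isempty` definitions below live in
+different modules.
+
+```python
+# --- bigframes/operations/geo_ops.py: the operation dataclass. ---
+import dataclasses
+import typing
+
+from bigframes import dtypes
+from bigframes.operations import base_ops
+
+
+@dataclasses.dataclass(frozen=True)
+class GeoStIsEmptyOp(base_ops.UnaryOp):
+    name: typing.ClassVar[str] = "geo_st_isempty"
+
+    def output_type(self, *input_types):
+        # BigQuery's ST_ISEMPTY returns BOOL.
+        return dtypes.BOOL_DTYPE
+
+
+# --- bigframes/core/compile/scalar_op_compiler.py: compilation logic. ---
+# `ibis_udf` and `scalar_op_compiler` are the module-level objects already
+# used by the existing operators in that file.
+@ibis_udf.scalar.builtin
+def st_isempty(v) -> bool:  # type: ignore
+    """Stub that compiles to BigQuery's ST_ISEMPTY; copy the geography and
+    boolean annotation style from the neighboring builtin stubs."""
+
+
+@scalar_op_compiler.register_unary_op(ops.GeoStIsEmptyOp)
+def geo_st_isempty_op_impl(x):
+    return st_isempty(x)
+
+
+# --- bigframes/bigquery/_operations/geo.py: the user-facing function. ---
+def st_isempty(series):
+    """Returns True for each empty geography in the series.
+
+    (The real docstring should include runnable doctest examples.)
+    """
+    return series._apply_unary_op(ops.GeoStIsEmptyOp())
+```
+
+Remember to export the new symbols per the checklist above: the dataclass
+from `bigframes/operations/__init__.py` and the user-facing function from
+`bigframes/bigquery/__init__.py`, each via the module's `__all__` list.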
diff --git a/specs/TEMPLATE.md b/specs/TEMPLATE.md new file mode 100644 index 0000000000..0d93035dcc --- /dev/null +++ b/specs/TEMPLATE.md @@ -0,0 +1,47 @@ +# Title of the Specification + +*Provide a brief overview of the feature or bug.* + +## Background + +*Explain the context and why this change is necessary.* +*Include links to relevant issues or documentation.* + +## Acceptance Criteria + +*Define the specific, measurable outcomes that indicate the task is complete.* +*Use a checklist format for clarity.* + +- [ ] Criterion 1 +- [ ] Criterion 2 +- [ ] Criterion 3 + +## Detailed Steps + +*Break down the implementation into small, actionable steps.* +*This section will guide the development process.* + +### 1. Step One + +- [ ] Action 1.1 +- [ ] Action 1.2 + +### 2. Step Two + +- [ ] Action 2.1 +- [ ] Action 2.2 + +## Verification + +*Specify the commands to run to verify the changes.* + +- [ ] The `nox -r -s format lint lint_setup_py` linter should pass. +- [ ] The `nox -r -s mypy` static type checker should pass. +- [ ] The `nox -r -s docs docfx` docs should successfully build and include relevant docs in the output. +- [ ] All new and existing unit tests `pytest tests/unit` should pass. +- [ ] Identify all related system tests in the `tests/system` directories. +- [ ] All related system tests `pytest tests/system/small/path_to_relevant_test.py::test_name` should pass. + +## Constraints + +Follow the guidelines listed in GEMINI.md at the root of the repository. diff --git a/testing/constraints-3.10.txt b/testing/constraints-3.10.txt index 12ad443aab..1695a4806b 100644 --- a/testing/constraints-3.10.txt +++ b/testing/constraints-3.10.txt @@ -1,4 +1,5 @@ -# Keep in sync with colab/containers/requirements.core.in image +# When we drop Python 3.9, +# please keep these in sync with the minimum versions in setup.py google-auth==2.27.0 ipykernel==5.5.6 ipython==7.34.0 @@ -15,4 +16,4 @@ matplotlib==3.7.1 psutil==5.9.5 seaborn==0.13.1 traitlets==5.7.1 -polars==1.7.0 +polars==1.21.0 diff --git a/testing/constraints-3.11.txt b/testing/constraints-3.11.txt index e69de29bb2..8fd20d453b 100644 --- a/testing/constraints-3.11.txt +++ b/testing/constraints-3.11.txt @@ -0,0 +1,621 @@ +# Keep in sync with %pip freeze in colab. +# Note: These are just constraints, so it's ok to have extra packages we +# aren't installing, except in the version that gets used for prerelease +# tests. 
+absl-py==1.4.0 +accelerate==1.9.0 +aiofiles==24.1.0 +aiohappyeyeballs==2.6.1 +aiohttp==3.12.15 +aiosignal==1.4.0 +alabaster==1.0.0 +albucore==0.0.24 +albumentations==2.0.8 +ale-py==0.11.2 +altair==5.5.0 +annotated-types==0.7.0 +antlr4-python3-runtime==4.9.3 +anyio==4.10.0 +anywidget==0.9.18 +argon2-cffi==25.1.0 +argon2-cffi-bindings==25.1.0 +array_record==0.7.2 +arviz==0.22.0 +astropy==7.1.0 +astropy-iers-data==0.2025.8.4.0.42.59 +astunparse==1.6.3 +atpublic==5.1 +attrs==25.3.0 +audioread==3.0.1 +autograd==1.8.0 +babel==2.17.0 +backcall==0.2.0 +backports.tarfile==1.2.0 +beautifulsoup4==4.13.4 +betterproto==2.0.0b6 +bigquery-magics==0.10.2 +bleach==6.2.0 +blinker==1.9.0 +blis==1.3.0 +blobfile==3.0.0 +blosc2==3.6.1 +bokeh==3.7.3 +Bottleneck==1.4.2 +bqplot==0.12.45 +branca==0.8.1 +Brotli==1.1.0 +build==1.3.0 +CacheControl==0.14.3 +cachetools==5.5.2 +catalogue==2.0.10 +certifi==2025.8.3 +cffi==1.17.1 +chardet==5.2.0 +charset-normalizer==3.4.2 +chex==0.1.90 +clarabel==0.11.1 +click==8.2.1 +cloudpathlib==0.21.1 +cloudpickle==3.1.1 +cmake==3.31.6 +cmdstanpy==1.2.5 +colorcet==3.1.0 +colorlover==0.3.0 +colour==0.1.5 +community==1.0.0b1 +confection==0.1.5 +cons==0.4.7 +contourpy==1.3.3 +cramjam==2.11.0 +cryptography==43.0.3 +cuda-python==12.6.2.post1 +cudf-polars-cu12==25.6.0 +cufflinks==0.17.3 +cuml-cu12==25.6.0 +cupy-cuda12x==13.3.0 +curl_cffi==0.12.0 +cuvs-cu12==25.6.1 +cvxopt==1.3.2 +cvxpy==1.6.7 +cycler==0.12.1 +cyipopt==1.5.0 +cymem==2.0.11 +Cython==3.0.12 +dask==2025.5.0 +dask-cuda==25.6.0 +dask-cudf-cu12==25.6.0 +dataproc-spark-connect==0.8.3 +datasets==4.0.0 +db-dtypes==1.4.3 +dbus-python==1.2.18 +debugpy==1.8.15 +decorator==4.4.2 +defusedxml==0.7.1 +diffusers==0.34.0 +dill==0.3.8 +distributed==2025.5.0 +distributed-ucxx-cu12==0.44.0 +distro==1.9.0 +dlib==19.24.6 +dm-tree==0.1.9 +docstring_parser==0.17.0 +docutils==0.21.2 +dopamine_rl==4.1.2 +duckdb==1.3.2 +earthengine-api==1.5.24 +easydict==1.13 +editdistance==0.8.1 +eerepr==0.1.2 +einops==0.8.1 +entrypoints==0.4 +et_xmlfile==2.0.0 +etils==1.13.0 +etuples==0.3.10 +Farama-Notifications==0.0.4 +fastai==2.7.19 +fastapi==0.116.1 +fastcore==1.7.29 +fastdownload==0.0.7 +fastjsonschema==2.21.1 +fastprogress==1.0.3 +fastrlock==0.8.3 +ffmpy==0.6.1 +filelock==3.18.0 +firebase-admin==6.9.0 +Flask==3.1.1 +flatbuffers==25.2.10 +flax==0.10.6 +folium==0.20.0 +fonttools==4.59.0 +frozendict==2.4.6 +frozenlist==1.7.0 +fsspec==2025.3.0 +future==1.0.0 +gast==0.6.0 +gcsfs==2025.3.0 +GDAL==3.8.4 +gdown==5.2.0 +geemap==0.35.3 +geocoder==1.38.1 +geographiclib==2.0 +geopandas==1.1.1 +geopy==2.4.1 +gin-config==0.5.0 +gitdb==4.0.12 +GitPython==3.1.45 +glob2==0.7 +google==2.0.3 +google-ai-generativelanguage==0.6.15 +google-api-core==2.25.1 +google-api-python-client==2.177.0 +google-auth==2.38.0 +google-auth-httplib2==0.2.0 +google-auth-oauthlib==1.2.2 +google-cloud-aiplatform==1.106.0 +google-cloud-bigquery==3.35.1 +google-cloud-bigquery-connection==1.18.3 +google-cloud-bigquery-storage==2.32.0 +google-cloud-core==2.4.3 +google-cloud-dataproc==5.21.0 +google-cloud-datastore==2.21.0 +google-cloud-firestore==2.21.0 +google-cloud-functions==1.20.4 +google-cloud-language==2.17.2 +google-cloud-resource-manager==1.14.2 +google-cloud-spanner==3.56.0 +google-cloud-storage==2.19.0 +google-cloud-translate==3.21.1 +google-crc32c==1.7.1 +google-genai==1.28.0 +google-generativeai==0.8.5 +google-pasta==0.2.0 +google-resumable-media==2.7.2 +googleapis-common-protos==1.70.0 +googledrivedownloader==1.1.0 +gradio==5.39.0 +gradio_client==1.11.0 +graphviz==0.21 +greenlet==3.2.3 
+groovy==0.1.2 +grpc-google-iam-v1==0.14.2 +grpc-interceptor==0.15.4 +grpcio==1.74.0 +grpcio-status==1.71.2 +grpclib==0.4.8 +gspread==6.2.1 +gspread-dataframe==4.0.0 +gym==0.25.2 +gym-notices==0.1.0 +gymnasium==1.2.0 +h11==0.16.0 +h2==4.2.0 +h5netcdf==1.6.3 +h5py==3.14.0 +hdbscan==0.8.40 +hf-xet==1.1.5 +hf_transfer==0.1.9 +highspy==1.11.0 +holidays==0.78 +holoviews==1.21.0 +hpack==4.1.0 +html5lib==1.1 +httpcore==1.0.9 +httpimport==1.4.1 +httplib2==0.22.0 +httpx==0.28.1 +huggingface-hub==0.34.3 +humanize==4.12.3 +hyperframe==6.1.0 +hyperopt==0.2.7 +ibis-framework==9.5.0 +idna==3.10 +imageio==2.37.0 +imageio-ffmpeg==0.6.0 +imagesize==1.4.1 +imbalanced-learn==0.13.0 +immutabledict==4.2.1 +importlib_metadata==8.7.0 +importlib_resources==6.5.2 +imutils==0.5.4 +inflect==7.5.0 +iniconfig==2.1.0 +intel-cmplr-lib-ur==2025.2.0 +intel-openmp==2025.2.0 +ipyevents==2.0.2 +ipyfilechooser==0.6.0 +ipykernel==6.17.1 +ipyleaflet==0.20.0 +ipyparallel==8.8.0 +ipython==7.34.0 +ipython-genutils==0.2.0 +ipython-sql==0.5.0 +ipytree==0.2.2 +ipywidgets==7.7.1 +itsdangerous==2.2.0 +jaraco.classes==3.4.0 +jaraco.context==6.0.1 +jaraco.functools==4.2.1 +jax==0.5.3 +jax-cuda12-pjrt==0.5.3 +jax-cuda12-plugin==0.5.3 +jaxlib==0.5.3 +jeepney==0.9.0 +jieba==0.42.1 +Jinja2==3.1.6 +jiter==0.10.0 +joblib==1.5.1 +jsonpatch==1.33 +jsonpickle==4.1.1 +jsonpointer==3.0.0 +jsonschema==4.25.0 +jsonschema-specifications==2025.4.1 +jupyter-client==6.1.12 +jupyter-console==6.1.0 +jupyter-leaflet==0.20.0 +jupyter-server==1.16.0 +jupyter_core==5.8.1 +jupyterlab_pygments==0.3.0 +jupyterlab_widgets==3.0.15 +jupytext==1.17.2 +kaggle==1.7.4.5 +kagglehub==0.3.12 +keras==3.10.0 +keras-hub==0.21.1 +keras-nlp==0.21.1 +keyring==25.6.0 +keyrings.google-artifactregistry-auth==1.1.2 +kiwisolver==1.4.8 +langchain==0.3.27 +langchain-core==0.3.72 +langchain-text-splitters==0.3.9 +langcodes==3.5.0 +langsmith==0.4.10 +language_data==1.3.0 +launchpadlib==1.10.16 +lazr.restfulclient==0.14.4 +lazr.uri==1.0.6 +lazy_loader==0.4 +libclang==18.1.1 +libcugraph-cu12==25.6.0 +libcuml-cu12==25.6.0 +libcuvs-cu12==25.6.1 +libkvikio-cu12==25.6.0 +libpysal==4.13.0 +libraft-cu12==25.6.0 +librmm-cu12==25.6.0 +librosa==0.11.0 +libucx-cu12==1.18.1 +libucxx-cu12==0.44.0 +linkify-it-py==2.0.3 +llvmlite==0.43.0 +locket==1.0.0 +logical-unification==0.4.6 +lxml==5.4.0 +Mako==1.1.3 +marisa-trie==1.2.1 +Markdown==3.8.2 +markdown-it-py==3.0.0 +MarkupSafe==3.0.2 +matplotlib==3.10.0 +matplotlib-inline==0.1.7 +matplotlib-venn==1.1.2 +mdit-py-plugins==0.4.2 +mdurl==0.1.2 +miniKanren==1.0.5 +missingno==0.5.2 +mistune==3.1.3 +mizani==0.13.5 +mkl==2025.2.0 +ml_dtypes==0.5.3 +mlxtend==0.23.4 +more-itertools==10.7.0 +moviepy==1.0.3 +mpmath==1.3.0 +msgpack==1.1.1 +multidict==6.6.3 +multipledispatch==1.0.0 +multiprocess==0.70.16 +multitasking==0.0.12 +murmurhash==1.0.13 +music21==9.3.0 +namex==0.1.0 +narwhals==2.0.1 +natsort==8.4.0 +nbclassic==1.3.1 +nbclient==0.10.2 +nbconvert==7.16.6 +nbformat==5.10.4 +ndindex==1.10.0 +nest-asyncio==1.6.0 +networkx==3.5 +nibabel==5.3.2 +nltk==3.9.1 +notebook==6.5.7 +notebook_shim==0.2.4 +numba==0.60.0 +numba-cuda==0.11.0 +numexpr==2.11.0 +numpy==2.0.2 +nvidia-cublas-cu12==12.5.3.2 +nvidia-cuda-cupti-cu12==12.5.82 +nvidia-cuda-nvcc-cu12==12.5.82 +nvidia-cuda-nvrtc-cu12==12.5.82 +nvidia-cuda-runtime-cu12==12.5.82 +nvidia-cudnn-cu12==9.3.0.75 +nvidia-cufft-cu12==11.2.3.61 +nvidia-curand-cu12==10.3.6.82 +nvidia-cusolver-cu12==11.6.3.83 +nvidia-cusparse-cu12==12.5.1.3 +nvidia-cusparselt-cu12==0.6.2 +nvidia-ml-py==12.575.51 +nvidia-nccl-cu12==2.23.4 
+nvidia-nvjitlink-cu12==12.5.82 +nvidia-nvtx-cu12==12.4.127 +nvtx==0.2.13 +oauth2client==4.1.3 +oauthlib==3.3.1 +omegaconf==2.3.0 +openai==1.98.0 +opencv-contrib-python==4.12.0.88 +opencv-python==4.12.0.88 +opencv-python-headless==4.12.0.88 +openpyxl==3.1.5 +opt_einsum==3.4.0 +optax==0.2.5 +optree==0.17.0 +orbax-checkpoint==0.11.20 +orjson==3.11.1 +osqp==1.0.4 +packaging==25.0 +pandas==2.2.2 +pandas-datareader==0.10.0 +pandas-gbq==0.29.2 +pandas-stubs==2.2.2.240909 +pandocfilters==1.5.1 +panel==1.7.5 +param==2.2.1 +parso==0.8.4 +parsy==2.1 +partd==1.4.2 +patsy==1.0.1 +peewee==3.18.2 +peft==0.17.0 +pexpect==4.9.0 +pickleshare==0.7.5 +pillow==11.3.0 +platformdirs==4.3.8 +plotly==5.24.1 +plotnine==0.14.5 +pluggy==1.6.0 +ply==3.11 +polars==1.25.2 +pooch==1.8.2 +portpicker==1.5.2 +preshed==3.0.10 +prettytable==3.16.0 +proglog==0.1.12 +progressbar2==4.5.0 +prometheus_client==0.22.1 +promise==2.3 +prompt_toolkit==3.0.51 +propcache==0.3.2 +prophet==1.1.7 +proto-plus==1.26.1 +protobuf==5.29.5 +psutil==5.9.5 +psycopg2==2.9.10 +psygnal==0.14.0 +ptyprocess==0.7.0 +py-cpuinfo==9.0.0 +py4j==0.10.9.7 +pyarrow==18.1.0 +pyasn1==0.6.1 +pyasn1_modules==0.4.2 +pycairo==1.28.0 +pycocotools==2.0.10 +pycparser==2.22 +pycryptodomex==3.23.0 +pydantic==2.11.7 +pydantic_core==2.33.2 +pydata-google-auth==1.9.1 +pydot==3.0.4 +pydotplus==2.0.2 +PyDrive2==1.21.3 +pydub==0.25.1 +pyerfa==2.0.1.5 +pygame==2.6.1 +pygit2==1.18.1 +Pygments==2.19.2 +PyGObject==3.42.0 +PyJWT==2.10.1 +pylibcugraph-cu12==25.6.0 +pylibraft-cu12==25.6.0 +pymc==5.25.1 +pynndescent==0.5.13 +pynvjitlink-cu12==0.7.0 +pynvml==12.0.0 +pyogrio==0.11.1 +pyomo==6.9.2 +PyOpenGL==3.1.9 +pyOpenSSL==24.2.1 +pyparsing==3.2.3 +pyperclip==1.9.0 +pyproj==3.7.1 +pyproject_hooks==1.2.0 +pyshp==2.3.1 +PySocks==1.7.1 +pyspark==3.5.1 +pytensor==2.31.7 +python-apt==0.0.0 +python-box==7.3.2 +python-dateutil==2.9.0.post0 +python-louvain==0.16 +python-multipart==0.0.20 +python-slugify==8.0.4 +python-snappy==0.7.3 +python-utils==3.9.1 +pytz==2025.2 +pyviz_comms==3.0.6 +PyWavelets==1.9.0 +PyYAML==6.0.2 +pyzmq==26.2.1 +raft-dask-cu12==25.6.0 +rapids-dask-dependency==25.6.0 +rapids-logger==0.1.1 +ratelim==0.1.6 +referencing==0.36.2 +regex==2024.11.6 +requests==2.32.3 +requests-oauthlib==2.0.0 +requests-toolbelt==1.0.0 +requirements-parser==0.9.0 +rich==13.9.4 +rmm-cu12==25.6.0 +roman-numerals-py==3.1.0 +rpds-py==0.26.0 +rpy2==3.5.17 +rsa==4.9.1 +ruff==0.12.7 +safehttpx==0.1.6 +safetensors==0.5.3 +scikit-image==0.25.2 +scikit-learn==1.6.1 +scipy==1.16.1 +scooby==0.10.1 +scs==3.2.7.post2 +seaborn==0.13.2 +SecretStorage==3.3.3 +semantic-version==2.10.0 +Send2Trash==1.8.3 +sentence-transformers==4.1.0 +sentencepiece==0.2.0 +sentry-sdk==2.34.1 +shap==0.48.0 +shapely==2.1.1 +shellingham==1.5.4 +simple-parsing==0.1.7 +simplejson==3.20.1 +simsimd==6.5.0 +six==1.17.0 +sklearn-compat==0.1.3 +sklearn-pandas==2.2.0 +slicer==0.0.8 +smart_open==7.3.0.post1 +smmap==5.0.2 +sniffio==1.3.1 +snowballstemmer==3.0.1 +sortedcontainers==2.4.0 +soundfile==0.13.1 +soupsieve==2.7 +soxr==0.5.0.post1 +spacy==3.8.7 +spacy-legacy==3.0.12 +spacy-loggers==1.0.5 +spanner-graph-notebook==1.1.7 +Sphinx==8.2.3 +sphinxcontrib-applehelp==2.0.0 +sphinxcontrib-devhelp==2.0.0 +sphinxcontrib-htmlhelp==2.1.0 +sphinxcontrib-jsmath==1.0.1 +sphinxcontrib-qthelp==2.0.0 +sphinxcontrib-serializinghtml==2.0.0 +SQLAlchemy==2.0.42 +sqlglot==25.20.2 +sqlparse==0.5.3 +srsly==2.5.1 +stanio==0.5.1 +starlette==0.47.2 +statsmodels==0.14.5 +stringzilla==3.12.5 +stumpy==1.13.0 +sympy==1.13.1 +tables==3.10.2 +tabulate==0.9.0 
+tbb==2022.2.0 +tblib==3.1.0 +tcmlib==1.4.0 +tenacity==8.5.0 +tensorboard==2.19.0 +tensorboard-data-server==0.7.2 +tensorflow==2.19.0 +tensorflow-datasets==4.9.9 +tensorflow-hub==0.16.1 +tensorflow-io-gcs-filesystem==0.37.1 +tensorflow-metadata==1.17.2 +tensorflow-probability==0.25.0 +tensorflow-text==2.19.0 +tensorflow_decision_forests==1.12.0 +tensorstore==0.1.76 +termcolor==3.1.0 +terminado==0.18.1 +text-unidecode==1.3 +textblob==0.19.0 +tf-slim==1.1.0 +tf_keras==2.19.0 +thinc==8.3.6 +threadpoolctl==3.6.0 +tifffile==2025.6.11 +tiktoken==0.9.0 +timm==1.0.19 +tinycss2==1.4.0 +tokenizers==0.21.4 +toml==0.10.2 +tomlkit==0.13.3 +toolz==0.12.1 +torchao==0.10.0 +torchdata==0.11.0 +torchsummary==1.5.1 +torchtune==0.6.1 +tornado==6.4.2 +tqdm==4.67.1 +traitlets==5.7.1 +traittypes==0.2.1 +transformers==4.54.1 +treelite==4.4.1 +treescope==0.1.9 +triton==3.2.0 +tsfresh==0.21.0 +tweepy==4.16.0 +typeguard==4.4.4 +typer==0.16.0 +types-pytz==2025.2.0.20250516 +types-setuptools==80.9.0.20250801 +typing-inspection==0.4.1 +typing_extensions==4.14.1 +tzdata==2025.2 +tzlocal==5.3.1 +uc-micro-py==1.0.3 +ucx-py-cu12==0.44.0 +ucxx-cu12==0.44.0 +umap-learn==0.5.9.post2 +umf==0.11.0 +uritemplate==4.2.0 +urllib3==2.5.0 +uvicorn==0.35.0 +vega-datasets==0.9.0 +wadllib==1.3.6 +wandb==0.21.0 +wasabi==1.1.3 +wcwidth==0.2.13 +weasel==0.4.1 +webcolors==24.11.1 +webencodings==0.5.1 +websocket-client==1.8.0 +websockets==15.0.1 +Werkzeug==3.1.3 +widgetsnbextension==3.6.10 +wordcloud==1.9.4 +wrapt==1.17.2 +wurlitzer==3.1.1 +xarray==2025.7.1 +xarray-einstats==0.9.1 +xgboost==3.0.3 +xlrd==2.0.2 +xxhash==3.5.0 +xyzservices==2025.4.0 +yarl==1.20.1 +ydf==0.13.0 +yellowbrick==1.5 +yfinance==0.2.65 +zict==3.0.0 +zipp==3.23.0 diff --git a/tests/system/large/functions/test_managed_function.py b/tests/system/large/functions/test_managed_function.py index 5aa27e1775..5349529f1d 100644 --- a/tests/system/large/functions/test_managed_function.py +++ b/tests/system/large/functions/test_managed_function.py @@ -12,6 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. +import warnings + import google.api_core.exceptions import pandas import pyarrow @@ -31,12 +33,22 @@ def test_managed_function_array_output(session, scalars_dfs, dataset_id): try: - @session.udf( - dataset=dataset_id, - name=prefixer.create_prefix(), + with warnings.catch_warnings(record=True) as record: + + @session.udf( + dataset=dataset_id, + name=prefixer.create_prefix(), + ) + def featurize(x: int) -> list[float]: + return [float(i) for i in [x, x + 1, x + 2]] + + # No following conflict warning when there is no redundant type hints. + input_type_warning = "Conflicting input types detected" + return_type_warning = "Conflicting return type detected" + assert not any(input_type_warning in str(warning.message) for warning in record) + assert not any( + return_type_warning in str(warning.message) for warning in record ) - def featurize(x: int) -> list[float]: - return [float(i) for i in [x, x + 1, x + 2]] scalars_df, scalars_pandas_df = scalars_dfs @@ -222,7 +234,10 @@ def add(x: int, y: int) -> int: def test_managed_function_series_combine_array_output(session, dataset_id, scalars_dfs): try: - def add_list(x: int, y: int) -> list[int]: + # The type hints in this function's signature has conflicts. The + # `input_types` and `output_type` arguments from udf decorator take + # precedence and will be used instead. 
+ def add_list(x, y: bool) -> list[bool]: return [x, y] scalars_df, scalars_pandas_df = scalars_dfs @@ -234,9 +249,18 @@ def add_list(x: int, y: int) -> list[int]: # Make sure there are NA values in the test column. assert any([pandas.isna(val) for val in bf_df[int_col_name_with_nulls]]) - add_list_managed_func = session.udf( - dataset=dataset_id, name=prefixer.create_prefix() - )(add_list) + with warnings.catch_warnings(record=True) as record: + add_list_managed_func = session.udf( + input_types=[int, int], + output_type=list[int], + dataset=dataset_id, + name=prefixer.create_prefix(), + )(add_list) + + input_type_warning = "Conflicting input types detected" + assert any(input_type_warning in str(warning.message) for warning in record) + return_type_warning = "Conflicting return type detected" + assert any(return_type_warning in str(warning.message) for warning in record) # After filtering out nulls the managed function application should work # similar to pandas. diff --git a/tests/system/large/functions/test_remote_function.py b/tests/system/large/functions/test_remote_function.py index f3e97aeb85..a93435d11a 100644 --- a/tests/system/large/functions/test_remote_function.py +++ b/tests/system/large/functions/test_remote_function.py @@ -527,8 +527,8 @@ def add_one(x): add_one_uniq, add_one_uniq_dir = make_uniq_udf(add_one) # Expected cloud function name for the unique udf - package_requirements = bff_utils._get_updated_package_requirements() - add_one_uniq_hash = bff_utils._get_hash(add_one_uniq, package_requirements) + package_requirements = bff_utils.get_updated_package_requirements() + add_one_uniq_hash = bff_utils.get_hash(add_one_uniq, package_requirements) add_one_uniq_cf_name = bff_utils.get_cloud_function_name( add_one_uniq_hash, session.session_id ) @@ -843,22 +843,31 @@ def test_remote_function_with_external_package_dependencies( ): try: - def pd_np_foo(x): + # The return type hint in this function's signature has conflict. The + # `output_type` argument from remote_function decorator takes precedence + # and will be used instead. 
+ def pd_np_foo(x) -> None: import numpy as mynp import pandas as mypd return mypd.Series([x, mynp.sqrt(mynp.abs(x))]).sum() - # Create the remote function with the name provided explicitly - pd_np_foo_remote = session.remote_function( - input_types=[int], - output_type=float, - dataset=dataset_id, - bigquery_connection=bq_cf_connection, - reuse=False, - packages=["numpy", "pandas >= 2.0.0"], - cloud_function_service_account="default", - )(pd_np_foo) + with warnings.catch_warnings(record=True) as record: + # Create the remote function with the name provided explicitly + pd_np_foo_remote = session.remote_function( + input_types=[int], + output_type=float, + dataset=dataset_id, + bigquery_connection=bq_cf_connection, + reuse=False, + packages=["numpy", "pandas >= 2.0.0"], + cloud_function_service_account="default", + )(pd_np_foo) + + input_type_warning = "Conflicting input types detected" + assert not any(input_type_warning in str(warning.message) for warning in record) + return_type_warning = "Conflicting return type detected" + assert any(return_type_warning in str(warning.message) for warning in record) # The behavior of the created remote function should be as expected scalars_df, scalars_pandas_df = scalars_dfs @@ -1999,10 +2008,25 @@ def test_remote_function_unnamed_removed_w_session_cleanup(): # create a clean session session = bigframes.connect() - # create an unnamed remote function in the session - @session.remote_function(reuse=False, cloud_function_service_account="default") - def foo(x: int) -> int: - return x + 1 + with warnings.catch_warnings(record=True) as record: + # create an unnamed remote function in the session. + # The type hints in this function's signature are redundant. The + # `input_types` and `output_type` arguments from remote_function + # decorator take precedence and will be used instead. + @session.remote_function( + input_types=[int], + output_type=int, + reuse=False, + cloud_function_service_account="default", + ) + def foo(x: int) -> int: + return x + 1 + + # No following warning with only redundant type hints (no conflict). + input_type_warning = "Conflicting input types detected" + assert not any(input_type_warning in str(warning.message) for warning in record) + return_type_warning = "Conflicting return type detected" + assert not any(return_type_warning in str(warning.message) for warning in record) # ensure that remote function artifacts are created assert foo.bigframes_remote_function is not None diff --git a/tests/system/small/bigquery/test_geo.py b/tests/system/small/bigquery/test_geo.py index f888fd0364..c89ca59aca 100644 --- a/tests/system/small/bigquery/test_geo.py +++ b/tests/system/small/bigquery/test_geo.py @@ -12,6 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+from __future__ import annotations + import geopandas # type: ignore import pandas as pd import pandas.testing @@ -29,9 +31,10 @@ from bigframes.bigquery import st_length import bigframes.bigquery as bbq import bigframes.geopandas +import bigframes.session -def test_geo_st_area(): +def test_geo_st_area(session: bigframes.session.Session): data = [ Polygon([(0.000, 0.0), (0.001, 0.001), (0.000, 0.001)]), Polygon([(0.0010, 0.004), (0.009, 0.005), (0.0010, 0.005)]), @@ -41,7 +44,7 @@ def test_geo_st_area(): ] geopd_s = geopandas.GeoSeries(data=data, crs="EPSG:4326") - geobf_s = bigframes.geopandas.GeoSeries(data=data) + geobf_s = bigframes.geopandas.GeoSeries(data=data, session=session) # For `geopd_s`, the data was further projected with `geopandas.GeoSeries.to_crs` # to `to_crs(26393)` to get the area in square meter. See: https://geopandas.org/en/stable/docs/user_guide/projections.html @@ -123,7 +126,7 @@ def test_st_length_various_geometries(session): ) # type: ignore -def test_geo_st_difference_with_geometry_objects(): +def test_geo_st_difference_with_geometry_objects(session: bigframes.session.Session): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1), (0, 0)]), @@ -136,8 +139,8 @@ def test_geo_st_difference_with_geometry_objects(): LineString([(2, 0), (0, 2)]), ] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) - geobf_s2 = bigframes.geopandas.GeoSeries(data=data2) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) + geobf_s2 = bigframes.geopandas.GeoSeries(data=data2, session=session) geobf_s_result = bbq.st_difference(geobf_s1, geobf_s2).to_pandas() expected = pd.Series( @@ -158,7 +161,9 @@ def test_geo_st_difference_with_geometry_objects(): ) -def test_geo_st_difference_with_single_geometry_object(): +def test_geo_st_difference_with_single_geometry_object( + session: bigframes.session.Session, +): pytest.importorskip( "shapely", minversion="2.0.0", @@ -171,7 +176,7 @@ def test_geo_st_difference_with_single_geometry_object(): Point(0, 1), ] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) geobf_s_result = bbq.st_difference( geobf_s1, Polygon([(0, 0), (10, 0), (10, 5), (0, 5), (0, 0)]), @@ -195,14 +200,16 @@ def test_geo_st_difference_with_single_geometry_object(): ) -def test_geo_st_difference_with_similar_geometry_objects(): +def test_geo_st_difference_with_similar_geometry_objects( + session: bigframes.session.Session, +): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1)]), Point(0, 1), ] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) geobf_s_result = bbq.st_difference(geobf_s1, geobf_s1).to_pandas() expected = pd.Series( @@ -219,7 +226,7 @@ def test_geo_st_difference_with_similar_geometry_objects(): ) -def test_geo_st_distance_with_geometry_objects(): +def test_geo_st_distance_with_geometry_objects(session: bigframes.session.Session): data1 = [ # 0.00001 is approximately 1 meter. Polygon([(0, 0), (0.00001, 0), (0.00001, 0.00001), (0, 0.00001), (0, 0)]), @@ -252,8 +259,8 @@ def test_geo_st_distance_with_geometry_objects(): ), # No matching row in data1, so this will be NULL after the call to distance. 
] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) - geobf_s2 = bigframes.geopandas.GeoSeries(data=data2) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) + geobf_s2 = bigframes.geopandas.GeoSeries(data=data2, session=session) geobf_s_result = bbq.st_distance(geobf_s1, geobf_s2).to_pandas() expected = pd.Series( @@ -275,7 +282,9 @@ def test_geo_st_distance_with_geometry_objects(): ) -def test_geo_st_distance_with_single_geometry_object(): +def test_geo_st_distance_with_single_geometry_object( + session: bigframes.session.Session, +): pytest.importorskip( "shapely", minversion="2.0.0", @@ -297,7 +306,7 @@ def test_geo_st_distance_with_single_geometry_object(): Point(0, 0.00002), ] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) geobf_s_result = bbq.st_distance( geobf_s1, Point(0, 0), @@ -320,7 +329,7 @@ def test_geo_st_distance_with_single_geometry_object(): ) -def test_geo_st_intersection_with_geometry_objects(): +def test_geo_st_intersection_with_geometry_objects(session: bigframes.session.Session): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1), (0, 0)]), @@ -333,8 +342,8 @@ def test_geo_st_intersection_with_geometry_objects(): LineString([(2, 0), (0, 2)]), ] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) - geobf_s2 = bigframes.geopandas.GeoSeries(data=data2) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) + geobf_s2 = bigframes.geopandas.GeoSeries(data=data2, session=session) geobf_s_result = bbq.st_intersection(geobf_s1, geobf_s2).to_pandas() expected = pd.Series( @@ -355,7 +364,9 @@ def test_geo_st_intersection_with_geometry_objects(): ) -def test_geo_st_intersection_with_single_geometry_object(): +def test_geo_st_intersection_with_single_geometry_object( + session: bigframes.session.Session, +): pytest.importorskip( "shapely", minversion="2.0.0", @@ -368,7 +379,7 @@ def test_geo_st_intersection_with_single_geometry_object(): Point(0, 1), ] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) geobf_s_result = bbq.st_intersection( geobf_s1, Polygon([(0, 0), (10, 0), (10, 5), (0, 5), (0, 0)]), @@ -392,14 +403,16 @@ def test_geo_st_intersection_with_single_geometry_object(): ) -def test_geo_st_intersection_with_similar_geometry_objects(): +def test_geo_st_intersection_with_similar_geometry_objects( + session: bigframes.session.Session, +): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1)]), Point(0, 1), ] - geobf_s1 = bigframes.geopandas.GeoSeries(data=data1) + geobf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) geobf_s_result = bbq.st_intersection(geobf_s1, geobf_s1).to_pandas() expected = pd.Series( @@ -420,7 +433,7 @@ def test_geo_st_intersection_with_similar_geometry_objects(): ) -def test_geo_st_isclosed(): +def test_geo_st_isclosed(session: bigframes.session.Session): bf_gs = bigframes.geopandas.GeoSeries( [ Point(0, 0), # Point @@ -428,12 +441,15 @@ def test_geo_st_isclosed(): LineString([(0, 0), (1, 1), (0, 1), (0, 0)]), # Closed LineString Polygon([(0, 0), (1, 1), (0, 1)]), # Open polygon GeometryCollection(), # Empty GeometryCollection - bigframes.geopandas.GeoSeries.from_wkt(["GEOMETRYCOLLECTION EMPTY"]).iloc[ + bigframes.geopandas.GeoSeries.from_wkt( + ["GEOMETRYCOLLECTION EMPTY"], session=session + ).iloc[ 0 ], # Also empty None, # Should be filtered out by dropna ], 
index=[0, 1, 2, 3, 4, 5, 6], + session=session, ) bf_result = bbq.st_isclosed(bf_gs).to_pandas() @@ -455,3 +471,12 @@ def test_geo_st_isclosed(): # We default to Int64 (nullable) dtype, but pandas defaults to int64 index. check_index_type=False, ) + + +def test_st_buffer(session): + geoseries = bigframes.geopandas.GeoSeries( + [Point(0, 0), LineString([(1, 1), (2, 2)])], session=session + ) + result = bbq.st_buffer(geoseries, 1000).to_pandas() + assert result.iloc[0].geom_type == "Polygon" + assert result.iloc[1].geom_type == "Polygon" diff --git a/tests/system/small/engines/test_generic_ops.py b/tests/system/small/engines/test_generic_ops.py new file mode 100644 index 0000000000..af114991eb --- /dev/null +++ b/tests/system/small/engines/test_generic_ops.py @@ -0,0 +1,268 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re + +import pytest + +from bigframes.core import array_value, expression +import bigframes.dtypes +import bigframes.operations as ops +from bigframes.session import polars_executor +from bigframes.testing.engine_utils import assert_equivalence_execution + +pytest.importorskip("polars") + +# Polars used as reference as its fast and local. Generally though, prefer gbq engine where they disagree. +REFERENCE_ENGINE = polars_executor.PolarsExecutor() + + +def apply_op( + array: array_value.ArrayValue, op: ops.AsTypeOp, excluded_cols=[] +) -> array_value.ArrayValue: + exprs = [] + labels = [] + for arg in array.column_ids: + if arg in excluded_cols: + continue + try: + _ = op.output_type(array.get_column_type(arg)) + expr = op.as_expr(arg) + exprs.append(expr) + type_string = re.sub(r"[^a-zA-Z\d]", "_", str(op.to_type)) + labels.append(f"{arg}_as_{type_string}") + except TypeError: + continue + assert len(exprs) > 0 + new_arr, ids = array.compute_values(exprs) + new_arr = new_arr.rename_columns( + {new_col: label for new_col, label in zip(ids, labels)} + ) + return new_arr + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_int(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.INT_DTYPE), + excluded_cols=["string_col"], + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_string_int(scalars_array_value: array_value.ArrayValue, engine): + vals = ["1", "100", "-3"] + arr, _ = scalars_array_value.compute_values( + [ + ops.AsTypeOp(to_type=bigframes.dtypes.INT_DTYPE).as_expr( + expression.const(val) + ) + for val in vals + ] + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_float(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.FLOAT_DTYPE), + excluded_cols=["string_col"], + ) + 
assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_string_float( + scalars_array_value: array_value.ArrayValue, engine +): + vals = ["1", "1.1", ".1", "1e3", "1.34235e4", "3.33333e-4"] + arr, _ = scalars_array_value.compute_values( + [ + ops.AsTypeOp(to_type=bigframes.dtypes.FLOAT_DTYPE).as_expr( + expression.const(val) + ) + for val in vals + ] + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_bool(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, ops.AsTypeOp(to_type=bigframes.dtypes.BOOL_DTYPE) + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_string(scalars_array_value: array_value.ArrayValue, engine): + # floats work slightly different with trailing zeroes rn + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.STRING_DTYPE), + excluded_cols=["float64_col"], + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_numeric(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.NUMERIC_DTYPE), + excluded_cols=["string_col"], + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_string_numeric( + scalars_array_value: array_value.ArrayValue, engine +): + vals = ["1", "1.1", ".1", "23428975070235903.209", "-23428975070235903.209"] + arr, _ = scalars_array_value.compute_values( + [ + ops.AsTypeOp(to_type=bigframes.dtypes.NUMERIC_DTYPE).as_expr( + expression.const(val) + ) + for val in vals + ] + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_date(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.DATE_DTYPE), + excluded_cols=["string_col"], + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_string_date( + scalars_array_value: array_value.ArrayValue, engine +): + vals = ["2014-08-15", "2215-08-15", "2016-02-29"] + arr, _ = scalars_array_value.compute_values( + [ + ops.AsTypeOp(to_type=bigframes.dtypes.DATE_DTYPE).as_expr( + expression.const(val) + ) + for val in vals + ] + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_datetime(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.DATETIME_DTYPE), + excluded_cols=["string_col"], + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_string_datetime( + scalars_array_value: array_value.ArrayValue, engine +): + vals = ["2014-08-15 08:15:12", "2015-08-15 08:15:12.654754", "2016-02-29 00:00:00"] + arr, _ = 
scalars_array_value.compute_values( + [ + ops.AsTypeOp(to_type=bigframes.dtypes.DATETIME_DTYPE).as_expr( + expression.const(val) + ) + for val in vals + ] + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_timestamp(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.TIMESTAMP_DTYPE), + excluded_cols=["string_col"], + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_string_timestamp( + scalars_array_value: array_value.ArrayValue, engine +): + vals = [ + "2014-08-15 08:15:12+00:00", + "2015-08-15 08:15:12.654754+05:00", + "2016-02-29 00:00:00+08:00", + ] + arr, _ = scalars_array_value.compute_values( + [ + ops.AsTypeOp(to_type=bigframes.dtypes.TIMESTAMP_DTYPE).as_expr( + expression.const(val) + ) + for val in vals + ] + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_time(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.TIME_DTYPE), + excluded_cols=["string_col", "int64_col", "int64_too"], + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_from_json(scalars_array_value: array_value.ArrayValue, engine): + exprs = [ + ops.AsTypeOp(to_type=bigframes.dtypes.INT_DTYPE).as_expr( + expression.const("5", bigframes.dtypes.JSON_DTYPE) + ), + ops.AsTypeOp(to_type=bigframes.dtypes.FLOAT_DTYPE).as_expr( + expression.const("5", bigframes.dtypes.JSON_DTYPE) + ), + ops.AsTypeOp(to_type=bigframes.dtypes.BOOL_DTYPE).as_expr( + expression.const("true", bigframes.dtypes.JSON_DTYPE) + ), + ops.AsTypeOp(to_type=bigframes.dtypes.STRING_DTYPE).as_expr( + expression.const('"hello world"', bigframes.dtypes.JSON_DTYPE) + ), + ] + arr, _ = scalars_array_value.compute_values(exprs) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) + + +@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +def test_engines_astype_timedelta(scalars_array_value: array_value.ArrayValue, engine): + arr = apply_op( + scalars_array_value, + ops.AsTypeOp(to_type=bigframes.dtypes.TIMEDELTA_DTYPE), + ) + assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) diff --git a/tests/system/small/engines/test_numeric_ops.py b/tests/system/small/engines/test_numeric_ops.py index b53da977f5..7e5b85857b 100644 --- a/tests/system/small/engines/test_numeric_ops.py +++ b/tests/system/small/engines/test_numeric_ops.py @@ -53,7 +53,7 @@ def apply_op_pairwise( return new_arr -@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +@pytest.mark.parametrize("engine", ["polars", "bq", "bq-sqlglot"], indirect=True) def test_engines_project_add( scalars_array_value: array_value.ArrayValue, engine, @@ -62,7 +62,7 @@ def test_engines_project_add( assert_equivalence_execution(arr.node, REFERENCE_ENGINE, engine) -@pytest.mark.parametrize("engine", ["polars", "bq"], indirect=True) +@pytest.mark.parametrize("engine", ["polars", "bq", "bq-sqlglot"], indirect=True) def test_engines_project_sub( scalars_array_value: array_value.ArrayValue, engine, diff --git 
a/tests/system/small/geopandas/test_geoseries.py b/tests/system/small/geopandas/test_geoseries.py index 51344edcbd..a2f0759161 100644 --- a/tests/system/small/geopandas/test_geoseries.py +++ b/tests/system/small/geopandas/test_geoseries.py @@ -12,6 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. +from __future__ import annotations + import re import bigframes_vendored.constants as constants @@ -31,6 +33,7 @@ import bigframes.geopandas import bigframes.pandas import bigframes.series +import bigframes.session from bigframes.testing.utils import assert_series_equal @@ -75,7 +78,7 @@ def test_geo_y(urban_areas_dfs): ) -def test_geo_area_not_supported(): +def test_geo_area_not_supported(session: bigframes.session.Session): s = bigframes.pandas.Series( [ Polygon([(0, 0), (1, 1), (0, 1)]), @@ -85,6 +88,7 @@ def test_geo_area_not_supported(): Point(0, 1), ], dtype=GeometryDtype(), + session=session, ) bf_series: bigframes.geopandas.GeoSeries = s.geo with pytest.raises( @@ -107,7 +111,7 @@ def test_geoseries_length_property_not_implemented(session): _ = gs.length -def test_geo_distance_not_supported(): +def test_geo_distance_not_supported(session: bigframes.session.Session): s1 = bigframes.pandas.Series( [ Polygon([(0, 0), (1, 1), (0, 1)]), @@ -117,6 +121,7 @@ def test_geo_distance_not_supported(): Point(0, 1), ], dtype=GeometryDtype(), + session=session, ) s2 = bigframes.geopandas.GeoSeries( [ @@ -125,7 +130,8 @@ def test_geo_distance_not_supported(): Polygon([(0, 0), (2, 2), (2, 0)]), LineString([(0, 0), (1, 1), (0, 1)]), Point(0, 1), - ] + ], + session=session, ) with pytest.raises( NotImplementedError, @@ -134,11 +140,11 @@ def test_geo_distance_not_supported(): s1.geo.distance(s2) -def test_geo_from_xy(): +def test_geo_from_xy(session: bigframes.session.Session): x = [2.5, 5, -3.0] y = [0.5, 1, 1.5] bf_result = ( - bigframes.geopandas.GeoSeries.from_xy(x, y) + bigframes.geopandas.GeoSeries.from_xy(x, y, session=session) .astype(geopandas.array.GeometryDtype()) .to_pandas() ) @@ -154,7 +160,7 @@ def test_geo_from_xy(): ) -def test_geo_from_wkt(): +def test_geo_from_wkt(session: bigframes.session.Session): wkts = [ "Point(0 1)", "Point(2 4)", @@ -162,7 +168,9 @@ def test_geo_from_wkt(): "Point(6 8)", ] - bf_result = bigframes.geopandas.GeoSeries.from_wkt(wkts).to_pandas() + bf_result = bigframes.geopandas.GeoSeries.from_wkt( + wkts, session=session + ).to_pandas() pd_result = geopandas.GeoSeries.from_wkt(wkts) @@ -174,14 +182,15 @@ def test_geo_from_wkt(): ) -def test_geo_to_wkt(): +def test_geo_to_wkt(session: bigframes.session.Session): bf_geo = bigframes.geopandas.GeoSeries( [ Point(0, 1), Point(2, 4), Point(5, 3), Point(6, 8), - ] + ], + session=session, ) pd_geo = geopandas.GeoSeries( @@ -209,8 +218,8 @@ def test_geo_to_wkt(): ) -def test_geo_boundary(): - bf_s = bigframes.pandas.Series( +def test_geo_boundary(session: bigframes.session.Session): + bf_s = bigframes.series.Series( [ Polygon([(0, 0), (1, 1), (0, 1)]), Polygon([(10, 0), (10, 5), (0, 0)]), @@ -218,6 +227,7 @@ def test_geo_boundary(): LineString([(0, 0), (1, 1), (0, 1)]), Point(0, 1), ], + session=session, ) pd_s = geopandas.GeoSeries( @@ -229,6 +239,7 @@ def test_geo_boundary(): Point(0, 1), ], index=pd.Index([0, 1, 2, 3, 4], dtype="Int64"), + crs="WGS84", ) bf_result = bf_s.geo.boundary.to_pandas() @@ -246,7 +257,7 @@ def test_geo_boundary(): # For example, when the difference between two polygons is empty, # GeoPandas returns 'POLYGON EMPTY' while GeoSeries 
returns 'GeometryCollection([])'. # This is why we are hard-coding the expected results. -def test_geo_difference_with_geometry_objects(): +def test_geo_difference_with_geometry_objects(session: bigframes.session.Session): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1), (0, 0)]), @@ -259,8 +270,8 @@ def test_geo_difference_with_geometry_objects(): LineString([(2, 0), (0, 2)]), ] - bf_s1 = bigframes.geopandas.GeoSeries(data=data1) - bf_s2 = bigframes.geopandas.GeoSeries(data=data2) + bf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) + bf_s2 = bigframes.geopandas.GeoSeries(data=data2, session=session) bf_result = bf_s1.difference(bf_s2).to_pandas() @@ -271,6 +282,7 @@ def test_geo_difference_with_geometry_objects(): Point(0, 1), ], index=[0, 1, 2], + session=session, ).to_pandas() assert bf_result.dtype == "geometry" @@ -279,20 +291,21 @@ def test_geo_difference_with_geometry_objects(): assert expected.iloc[2].equals(bf_result.iloc[2]) -def test_geo_difference_with_single_geometry_object(): +def test_geo_difference_with_single_geometry_object(session: bigframes.session.Session): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(4, 2), (6, 2), (8, 6), (4, 2)]), Point(0, 1), ] - bf_s1 = bigframes.geopandas.GeoSeries(data=data1) + bf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) bf_result = bf_s1.difference( bigframes.geopandas.GeoSeries( [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(1, 0), (0, 5), (0, 0), (1, 0)]), - ] + ], + session=session, ), ).to_pandas() @@ -303,6 +316,7 @@ def test_geo_difference_with_single_geometry_object(): None, ], index=[0, 1, 2], + session=session, ).to_pandas() assert bf_result.dtype == "geometry" @@ -311,19 +325,22 @@ def test_geo_difference_with_single_geometry_object(): assert expected.iloc[2] == bf_result.iloc[2] -def test_geo_difference_with_similar_geometry_objects(): +def test_geo_difference_with_similar_geometry_objects( + session: bigframes.session.Session, +): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1)]), Point(0, 1), ] - bf_s1 = bigframes.geopandas.GeoSeries(data=data1) + bf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) bf_result = bf_s1.difference(bf_s1).to_pandas() expected = bigframes.geopandas.GeoSeries( [GeometryCollection([]), GeometryCollection([]), GeometryCollection([])], index=[0, 1, 2], + session=session, ).to_pandas() assert bf_result.dtype == "geometry" @@ -332,9 +349,10 @@ def test_geo_difference_with_similar_geometry_objects(): assert expected.iloc[2].equals(bf_result.iloc[2]) -def test_geo_drop_duplicates(): +def test_geo_drop_duplicates(session: bigframes.session.Session): bf_series = bigframes.geopandas.GeoSeries( - [Point(1, 1), Point(2, 2), Point(3, 3), Point(2, 2)] + [Point(1, 1), Point(2, 2), Point(3, 3), Point(2, 2)], + session=session, ) pd_series = geopandas.GeoSeries( @@ -353,7 +371,7 @@ def test_geo_drop_duplicates(): # For example, when the intersection between two polygons is empty, # GeoPandas returns 'POLYGON EMPTY' while GeoSeries returns 'GeometryCollection([])'. # This is why we are hard-coding the expected results. 
-def test_geo_intersection_with_geometry_objects(): +def test_geo_intersection_with_geometry_objects(session: bigframes.session.Session): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1), (0, 0)]), @@ -366,8 +384,8 @@ def test_geo_intersection_with_geometry_objects(): LineString([(2, 0), (0, 2)]), ] - bf_s1 = bigframes.geopandas.GeoSeries(data=data1) - bf_s2 = bigframes.geopandas.GeoSeries(data=data2) + bf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) + bf_s2 = bigframes.geopandas.GeoSeries(data=data2, session=session) bf_result = bf_s1.intersection(bf_s2).to_pandas() @@ -377,6 +395,7 @@ def test_geo_intersection_with_geometry_objects(): Polygon([(0, 0), (1, 1), (0, 1), (0, 0)]), GeometryCollection([]), ], + session=session, ).to_pandas() assert bf_result.dtype == "geometry" @@ -385,20 +404,23 @@ def test_geo_intersection_with_geometry_objects(): assert expected.iloc[2].equals(bf_result.iloc[2]) -def test_geo_intersection_with_single_geometry_object(): +def test_geo_intersection_with_single_geometry_object( + session: bigframes.session.Session, +): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(4, 2), (6, 2), (8, 6), (4, 2)]), Point(0, 1), ] - bf_s1 = bigframes.geopandas.GeoSeries(data=data1) + bf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) bf_result = bf_s1.intersection( bigframes.geopandas.GeoSeries( [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(1, 0), (0, 5), (0, 0), (1, 0)]), - ] + ], + session=session, ), ).to_pandas() @@ -409,6 +431,7 @@ def test_geo_intersection_with_single_geometry_object(): None, ], index=[0, 1, 2], + session=session, ).to_pandas() assert bf_result.dtype == "geometry" @@ -417,14 +440,16 @@ def test_geo_intersection_with_single_geometry_object(): assert expected.iloc[2] == bf_result.iloc[2] -def test_geo_intersection_with_similar_geometry_objects(): +def test_geo_intersection_with_similar_geometry_objects( + session: bigframes.session.Session, +): data1 = [ Polygon([(0, 0), (10, 0), (10, 10), (0, 0)]), Polygon([(0, 0), (1, 1), (0, 1)]), Point(0, 1), ] - bf_s1 = bigframes.geopandas.GeoSeries(data=data1) + bf_s1 = bigframes.geopandas.GeoSeries(data=data1, session=session) bf_result = bf_s1.intersection(bf_s1).to_pandas() expected = bigframes.geopandas.GeoSeries( @@ -434,9 +459,119 @@ def test_geo_intersection_with_similar_geometry_objects(): Point(0, 1), ], index=[0, 1, 2], + session=session, ).to_pandas() assert bf_result.dtype == "geometry" assert expected.iloc[0].equals(bf_result.iloc[0]) assert expected.iloc[1].equals(bf_result.iloc[1]) assert expected.iloc[2].equals(bf_result.iloc[2]) + + +def test_geo_is_closed_not_supported(session: bigframes.session.Session): + s = bigframes.series.Series( + [ + Polygon([(0, 0), (1, 1), (0, 1)]), + Polygon([(10, 0), (10, 5), (0, 0)]), + Polygon([(0, 0), (2, 2), (2, 0)]), + LineString([(0, 0), (1, 1), (0, 1)]), + Point(0, 1), + ], + dtype=GeometryDtype(), + session=session, + ) + bf_series: bigframes.geopandas.GeoSeries = s.geo + with pytest.raises( + NotImplementedError, + match=re.escape( + f"GeoSeries.is_closed is not supported. Use bigframes.bigquery.st_isclosed(series), instead. {constants.FEEDBACK_LINK}" + ), + ): + bf_series.is_closed + + +def test_geo_buffer_raises_notimplemented(session: bigframes.session.Session): + """GeoPandas takes distance in units of the coordinate system, but BigQuery + uses meters. 
+ """ + s = bigframes.geopandas.GeoSeries( + [ + Point(0, 0), + ], + session=session, + ) + with pytest.raises( + NotImplementedError, match=re.escape("bigframes.bigquery.st_buffer") + ): + s.buffer(1000) + + +def test_geo_centroid(session: bigframes.session.Session): + bf_s = bigframes.series.Series( + [ + Polygon([(0, 0), (0.1, 0.1), (0, 0.1)]), + LineString([(10, 10), (10.0001, 10.0001), (10, 10.0001)]), + Point(-10, -10), + ], + session=session, + ) + + pd_s = geopandas.GeoSeries( + [ + Polygon([(0, 0), (0.1, 0.1), (0, 0.1)]), + LineString([(10, 10), (10.0001, 10.0001), (10, 10.0001)]), + Point(-10, -10), + ], + index=pd.Index([0, 1, 2], dtype="Int64"), + crs="WGS84", + ) + + bf_result = bf_s.geo.centroid.to_pandas() + # Avoid warning that centroid is incorrect for geographic CRS. + # https://gis.stackexchange.com/a/401815/275289 + pd_result = pd_s.to_crs("+proj=cea").centroid.to_crs("WGS84") + + geopandas.testing.assert_geoseries_equal( + bf_result, + pd_result, + check_series_type=False, + check_index_type=False, + # BigQuery geography calculations are on a sphere, so results will be + # slightly different. + check_less_precise=True, + ) + + +def test_geo_convex_hull(session: bigframes.session.Session): + bf_s = bigframes.series.Series( + [ + Polygon([(0, 0), (1, 1), (0, 1)]), + Polygon([(10, 0), (10, 5), (0, 0)]), + Polygon([(0, 0), (2, 2), (2, 0)]), + LineString([(0, 0), (1, 1), (0, 1)]), + Point(0, 1), + ], + session=session, + ) + + pd_s = geopandas.GeoSeries( + [ + Polygon([(0, 0), (1, 1), (0, 1)]), + Polygon([(10, 0), (10, 5), (0, 0)]), + Polygon([(0, 0), (2, 2), (2, 0)]), + LineString([(0, 0), (1, 1), (0, 1)]), + Point(0, 1), + ], + index=pd.Index([0, 1, 2, 3, 4], dtype="Int64"), + crs="WGS84", + ) + + bf_result = bf_s.geo.convex_hull.to_pandas() + pd_result = pd_s.convex_hull + + geopandas.testing.assert_geoseries_equal( + bf_result, + pd_result, + check_series_type=False, + check_index_type=False, + ) diff --git a/tests/system/small/pandas/core/methods/test_describe.py b/tests/system/small/pandas/test_describe.py similarity index 100% rename from tests/system/small/pandas/core/methods/test_describe.py rename to tests/system/small/pandas/test_describe.py diff --git a/tests/system/small/pandas/io/api/test_read_gbq_colab.py b/tests/system/small/pandas/test_read_gbq_colab.py similarity index 100% rename from tests/system/small/pandas/io/api/test_read_gbq_colab.py rename to tests/system/small/pandas/test_read_gbq_colab.py diff --git a/tests/system/small/test_dataframe.py b/tests/system/small/test_dataframe.py index bc773d05b2..50989ae150 100644 --- a/tests/system/small/test_dataframe.py +++ b/tests/system/small/test_dataframe.py @@ -514,6 +514,50 @@ def test_where_dataframe_cond_dataframe_other( pandas.testing.assert_frame_equal(bf_result, pd_result) +def test_where_callable_cond_constant_other(scalars_df_index, scalars_pandas_df_index): + # Condition is callable, other is a constant. + columns = ["int64_col", "float64_col"] + dataframe_bf = scalars_df_index[columns] + dataframe_pd = scalars_pandas_df_index[columns] + + other = 10 + + bf_result = dataframe_bf.where(lambda x: x > 0, other).to_pandas() + pd_result = dataframe_pd.where(lambda x: x > 0, other) + pandas.testing.assert_frame_equal(bf_result, pd_result) + + +def test_where_dataframe_cond_callable_other(scalars_df_index, scalars_pandas_df_index): + # Condition is a dataframe, other is callable. 
+ columns = ["int64_col", "float64_col"] + dataframe_bf = scalars_df_index[columns] + dataframe_pd = scalars_pandas_df_index[columns] + + cond_bf = dataframe_bf > 0 + cond_pd = dataframe_pd > 0 + + def func(x): + return x * 2 + + bf_result = dataframe_bf.where(cond_bf, func).to_pandas() + pd_result = dataframe_pd.where(cond_pd, func) + pandas.testing.assert_frame_equal(bf_result, pd_result) + + +def test_where_callable_cond_callable_other(scalars_df_index, scalars_pandas_df_index): + # Condition is callable, other is callable too. + columns = ["int64_col", "float64_col"] + dataframe_bf = scalars_df_index[columns] + dataframe_pd = scalars_pandas_df_index[columns] + + def func(x): + return x["int64_col"] > 0 + + bf_result = dataframe_bf.where(func, lambda x: x * 2).to_pandas() + pd_result = dataframe_pd.where(func, lambda x: x * 2) + pandas.testing.assert_frame_equal(bf_result, pd_result) + + def test_drop_column(scalars_dfs): scalars_df, scalars_pandas_df = scalars_dfs col_name = "int64_col" @@ -2937,12 +2981,102 @@ def test_join_different_table( assert_pandas_df_equal(bf_result, pd_result, ignore_order=True) -def test_join_duplicate_columns_raises_not_implemented(scalars_dfs): - scalars_df, _ = scalars_dfs - df_a = scalars_df[["string_col", "float64_col"]] - df_b = scalars_df[["float64_col"]] - with pytest.raises(NotImplementedError): - df_a.join(df_b, how="outer").to_pandas() +@all_joins +def test_join_different_table_with_duplicate_column_name( + scalars_df_index, scalars_pandas_df_index, how +): + bf_df_a = scalars_df_index[["string_col", "int64_col", "int64_too"]].rename( + columns={"int64_too": "int64_col"} + ) + bf_df_b = scalars_df_index.dropna()[ + ["string_col", "int64_col", "int64_too"] + ].rename(columns={"int64_too": "int64_col"}) + bf_result = bf_df_a.join(bf_df_b, how=how, lsuffix="_l", rsuffix="_r").to_pandas() + pd_df_a = scalars_pandas_df_index[["string_col", "int64_col", "int64_too"]].rename( + columns={"int64_too": "int64_col"} + ) + pd_df_b = scalars_pandas_df_index.dropna()[ + ["string_col", "int64_col", "int64_too"] + ].rename(columns={"int64_too": "int64_col"}) + pd_result = pd_df_a.join(pd_df_b, how=how, lsuffix="_l", rsuffix="_r") + + # Ensure no inplace changes + pd.testing.assert_index_equal(bf_df_a.columns, pd_df_a.columns) + pd.testing.assert_index_equal(bf_df_b.index.to_pandas(), pd_df_b.index) + pd.testing.assert_frame_equal(bf_result, pd_result, check_index_type=False) + + +@all_joins +def test_join_param_on_with_duplicate_column_name_not_on_col( + scalars_df_index, scalars_pandas_df_index, how +): + # This test is for duplicate column names, but the 'on' column is not duplicated. 
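+    # Cross joins cannot take an 'on' column, so skip that combination.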
+ if how == "cross": + return + bf_df_a = scalars_df_index[ + ["string_col", "datetime_col", "timestamp_col", "int64_too"] + ].rename(columns={"timestamp_col": "datetime_col"}) + bf_df_b = scalars_df_index.dropna()[ + ["string_col", "datetime_col", "timestamp_col"] + ].rename(columns={"timestamp_col": "datetime_col"}) + bf_result = bf_df_a.join( + bf_df_b, on="int64_too", how=how, lsuffix="_l", rsuffix="_r" + ).to_pandas() + pd_df_a = scalars_pandas_df_index[ + ["string_col", "datetime_col", "timestamp_col", "int64_too"] + ].rename(columns={"timestamp_col": "datetime_col"}) + pd_df_b = scalars_pandas_df_index.dropna()[ + ["string_col", "datetime_col", "timestamp_col"] + ].rename(columns={"timestamp_col": "datetime_col"}) + pd_result = pd_df_a.join( + pd_df_b, on="int64_too", how=how, lsuffix="_l", rsuffix="_r" + ) + pd.testing.assert_frame_equal( + bf_result.sort_index(), + pd_result.sort_index(), + check_like=True, + check_index_type=False, + check_names=False, + ) + pd.testing.assert_index_equal(bf_result.columns, pd_result.columns) + + +@pytest.mark.skipif( + pandas.__version__.startswith("1."), reason="bad left join in pandas 1.x" +) +@all_joins +def test_join_param_on_with_duplicate_column_name_on_col( + scalars_df_index, scalars_pandas_df_index, how +): + # This test is for duplicate column names, and the 'on' column is duplicated. + if how == "cross": + return + bf_df_a = scalars_df_index[ + ["string_col", "datetime_col", "timestamp_col", "int64_too"] + ].rename(columns={"timestamp_col": "datetime_col"}) + bf_df_b = scalars_df_index.dropna()[ + ["string_col", "datetime_col", "timestamp_col", "int64_too"] + ].rename(columns={"timestamp_col": "datetime_col"}) + bf_result = bf_df_a.join( + bf_df_b, on="int64_too", how=how, lsuffix="_l", rsuffix="_r" + ).to_pandas() + pd_df_a = scalars_pandas_df_index[ + ["string_col", "datetime_col", "timestamp_col", "int64_too"] + ].rename(columns={"timestamp_col": "datetime_col"}) + pd_df_b = scalars_pandas_df_index.dropna()[ + ["string_col", "datetime_col", "timestamp_col", "int64_too"] + ].rename(columns={"timestamp_col": "datetime_col"}) + pd_result = pd_df_a.join( + pd_df_b, on="int64_too", how=how, lsuffix="_l", rsuffix="_r" + ) + pd.testing.assert_frame_equal( + bf_result.sort_index(), + pd_result.sort_index(), + check_like=True, + check_index_type=False, + check_names=False, + ) + pd.testing.assert_index_equal(bf_result.columns, pd_result.columns) @all_joins diff --git a/tests/system/small/test_groupby.py b/tests/system/small/test_groupby.py index 0af173adc8..5c89363e9b 100644 --- a/tests/system/small/test_groupby.py +++ b/tests/system/small/test_groupby.py @@ -582,6 +582,101 @@ def test_dataframe_groupby_nonnumeric_with_mean(): ) +@pytest.mark.parametrize( + ("subset", "normalize", "ascending", "dropna", "as_index"), + [ + (None, True, True, True, True), + (["int64_too", "int64_col"], False, False, False, False), + ], +) +def test_dataframe_groupby_value_counts( + scalars_df_index, + scalars_pandas_df_index, + subset, + normalize, + ascending, + dropna, + as_index, +): + if pd.__version__.startswith("1."): + pytest.skip("pandas 1.x produces different column labels.") + col_names = ["float64_col", "int64_col", "bool_col", "int64_too"] + bf_result = ( + scalars_df_index[col_names] + .groupby("bool_col", as_index=as_index) + .value_counts( + subset=subset, normalize=normalize, ascending=ascending, dropna=dropna + ) + .to_pandas() + ) + pd_result = ( + scalars_pandas_df_index[col_names] + .groupby("bool_col", as_index=as_index) + .value_counts( + 
subset=subset, normalize=normalize, ascending=ascending, dropna=dropna + ) + ) + + if as_index: + pd.testing.assert_series_equal(pd_result, bf_result, check_dtype=False) + else: + pd_result.index = pd_result.index.astype("Int64") + pd.testing.assert_frame_equal(pd_result, bf_result, check_dtype=False) + + +@pytest.mark.parametrize( + ("numeric_only", "min_count"), + [ + (False, 4), + (True, 0), + ], +) +def test_dataframe_groupby_first( + scalars_df_index, scalars_pandas_df_index, numeric_only, min_count +): + # min_count seems to not work properly on older pandas + pytest.importorskip("pandas", minversion="2.0.0") + # bytes, dates not handling min_count properly in pandas + bf_result = ( + scalars_df_index.drop(columns=["bytes_col", "date_col"]) + .groupby(scalars_df_index.int64_col % 2) + .first(numeric_only=numeric_only, min_count=min_count) + ).to_pandas() + pd_result = ( + scalars_pandas_df_index.drop(columns=["bytes_col", "date_col"]) + .groupby(scalars_pandas_df_index.int64_col % 2) + .first(numeric_only=numeric_only, min_count=min_count) + ) + pd.testing.assert_frame_equal( + pd_result, + bf_result, + ) + + +@pytest.mark.parametrize( + ("numeric_only", "min_count"), + [ + (True, 2), + (False, -1), + ], +) +def test_dataframe_groupby_last( + scalars_df_index, scalars_pandas_df_index, numeric_only, min_count +): + bf_result = ( + scalars_df_index.groupby(scalars_df_index.int64_col % 2).last( + numeric_only=numeric_only, min_count=min_count + ) + ).to_pandas() + pd_result = scalars_pandas_df_index.groupby( + scalars_pandas_df_index.int64_col % 2 + ).last(numeric_only=numeric_only, min_count=min_count) + pd.testing.assert_frame_equal( + pd_result, + bf_result, + ) + + # ============== # Series.groupby # ============== @@ -768,3 +863,83 @@ def test_series_groupby_quantile(scalars_df_index, scalars_pandas_df_index, q): pd.testing.assert_series_equal( pd_result, bf_result, check_dtype=False, check_index_type=False ) + + +@pytest.mark.parametrize( + ("normalize", "ascending", "dropna"), + [ + ( + True, + True, + True, + ), + ( + False, + False, + False, + ), + ], +) +def test_series_groupby_value_counts( + scalars_df_index, + scalars_pandas_df_index, + normalize, + ascending, + dropna, +): + if pd.__version__.startswith("1."): + pytest.skip("pandas 1.x produces different column labels.") + bf_result = ( + scalars_df_index.groupby("bool_col")["string_col"] + .value_counts(normalize=normalize, ascending=ascending, dropna=dropna) + .to_pandas() + ) + pd_result = scalars_pandas_df_index.groupby("bool_col")["string_col"].value_counts( + normalize=normalize, ascending=ascending, dropna=dropna + ) + pd.testing.assert_series_equal(pd_result, bf_result, check_dtype=False) + + +@pytest.mark.parametrize( + ("numeric_only", "min_count"), + [ + (True, 2), + (False, -1), + ], +) +def test_series_groupby_first( + scalars_df_index, scalars_pandas_df_index, numeric_only, min_count +): + bf_result = ( + scalars_df_index.groupby("string_col")["int64_col"].first( + numeric_only=numeric_only, min_count=min_count + ) + ).to_pandas() + pd_result = scalars_pandas_df_index.groupby("string_col")["int64_col"].first( + numeric_only=numeric_only, min_count=min_count + ) + pd.testing.assert_series_equal( + pd_result, + bf_result, + ) + + +@pytest.mark.parametrize( + ("numeric_only", "min_count"), + [ + (False, 4), + (True, 0), + ], +) +def test_series_groupby_last( + scalars_df_index, scalars_pandas_df_index, numeric_only, min_count +): + bf_result = ( + scalars_df_index.groupby("string_col")["int64_col"].last( + 
numeric_only=numeric_only, min_count=min_count + ) + ).to_pandas() + pd_result = scalars_pandas_df_index.groupby("string_col")["int64_col"].last( + numeric_only=numeric_only, min_count=min_count + ) + pd.testing.assert_series_equal(pd_result, bf_result) diff --git a/tests/system/small/test_polars_execution.py b/tests/system/small/test_polars_execution.py index 1568a76ec9..916780b1ce 100644 --- a/tests/system/small/test_polars_execution.py +++ b/tests/system/small/test_polars_execution.py @@ -16,7 +16,7 @@ import bigframes from bigframes.testing.utils import assert_pandas_df_equal -polars = pytest.importorskip("polars", reason="polars is required for this test") +polars = pytest.importorskip("polars") @pytest.fixture(scope="module") diff --git a/tests/system/small/test_series.py b/tests/system/small/test_series.py index 3f64234293..e94250e98f 100644 --- a/tests/system/small/test_series.py +++ b/tests/system/small/test_series.py @@ -3685,8 +3685,12 @@ def test_astype_numeric_to_int(scalars_df_index, scalars_pandas_df_index): column = "numeric_col" to_type = "Int64" bf_result = scalars_df_index[column].astype(to_type).to_pandas() - # Round to the nearest whole number to avoid TypeError - pd_result = scalars_pandas_df_index[column].round(0).astype(to_type) + # Truncate to int to avoid TypeError + pd_result = ( + scalars_pandas_df_index[column] + .apply(lambda x: None if pd.isna(x) else math.trunc(x)) + .astype(to_type) + ) pd.testing.assert_series_equal(bf_result, pd_result) diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_numeric/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_numeric/out.sql index e8dc2edb80..44335805e4 100644 --- a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_numeric/out.sql +++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_numeric/out.sql @@ -1,13 +1,54 @@ WITH `bfcte_0` AS ( SELECT - `int64_col` AS `bfcol_0` + `bool_col` AS `bfcol_0`, + `int64_col` AS `bfcol_1`, + `rowindex` AS `bfcol_2` FROM `bigframes-dev`.`sqlglot_test`.`scalar_types` ), `bfcte_1` AS ( SELECT *, - `bfcol_0` + `bfcol_0` AS `bfcol_1` + `bfcol_2` AS `bfcol_6`, + `bfcol_1` AS `bfcol_7`, + `bfcol_0` AS `bfcol_8`, + `bfcol_1` + `bfcol_1` AS `bfcol_9` FROM `bfcte_0` +), `bfcte_2` AS ( + SELECT + *, + `bfcol_6` AS `bfcol_14`, + `bfcol_7` AS `bfcol_15`, + `bfcol_8` AS `bfcol_16`, + `bfcol_9` AS `bfcol_17`, + `bfcol_7` + 1 AS `bfcol_18` + FROM `bfcte_1` +), `bfcte_3` AS ( + SELECT + *, + `bfcol_14` AS `bfcol_24`, + `bfcol_15` AS `bfcol_25`, + `bfcol_16` AS `bfcol_26`, + `bfcol_17` AS `bfcol_27`, + `bfcol_18` AS `bfcol_28`, + `bfcol_15` + CAST(`bfcol_16` AS INT64) AS `bfcol_29` + FROM `bfcte_2` +), `bfcte_4` AS ( + SELECT + *, + `bfcol_24` AS `bfcol_36`, + `bfcol_25` AS `bfcol_37`, + `bfcol_26` AS `bfcol_38`, + `bfcol_27` AS `bfcol_39`, + `bfcol_28` AS `bfcol_40`, + `bfcol_29` AS `bfcol_41`, + CAST(`bfcol_26` AS INT64) + `bfcol_25` AS `bfcol_42` + FROM `bfcte_3` ) SELECT - `bfcol_1` AS `int64_col` -FROM `bfcte_1` \ No newline at end of file + `bfcol_36` AS `rowindex`, + `bfcol_37` AS `int64_col`, + `bfcol_38` AS `bool_col`, + `bfcol_39` AS `int_add_int`, + `bfcol_40` AS `int_add_1`, + `bfcol_41` AS `int_add_bool`, + `bfcol_42` AS `bool_add_int` +FROM `bfcte_4` \ No newline at end of file diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_numeric_w_scalar/out.sql 
b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_numeric_w_scalar/out.sql deleted file mode 100644 index 7c4cc2c770..0000000000 --- a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_numeric_w_scalar/out.sql +++ /dev/null @@ -1,13 +0,0 @@ -WITH `bfcte_0` AS ( - SELECT - `int64_col` AS `bfcol_0` - FROM `bigframes-dev`.`sqlglot_test`.`scalar_types` -), `bfcte_1` AS ( - SELECT - *, - `bfcol_0` + 1 AS `bfcol_1` - FROM `bfcte_0` -) -SELECT - `bfcol_1` AS `int64_col` -FROM `bfcte_1` \ No newline at end of file diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_timedelta/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_timedelta/out.sql new file mode 100644 index 0000000000..a47531999b --- /dev/null +++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_add_timedelta/out.sql @@ -0,0 +1,60 @@ +WITH `bfcte_0` AS ( + SELECT + `date_col` AS `bfcol_0`, + `rowindex` AS `bfcol_1`, + `timestamp_col` AS `bfcol_2` + FROM `bigframes-dev`.`sqlglot_test`.`scalar_types` +), `bfcte_1` AS ( + SELECT + *, + `bfcol_1` AS `bfcol_6`, + `bfcol_2` AS `bfcol_7`, + `bfcol_0` AS `bfcol_8`, + TIMESTAMP_ADD(CAST(`bfcol_0` AS DATETIME), INTERVAL 86400000000 MICROSECOND) AS `bfcol_9` + FROM `bfcte_0` +), `bfcte_2` AS ( + SELECT + *, + `bfcol_6` AS `bfcol_14`, + `bfcol_7` AS `bfcol_15`, + `bfcol_8` AS `bfcol_16`, + `bfcol_9` AS `bfcol_17`, + TIMESTAMP_ADD(`bfcol_7`, INTERVAL 86400000000 MICROSECOND) AS `bfcol_18` + FROM `bfcte_1` +), `bfcte_3` AS ( + SELECT + *, + `bfcol_14` AS `bfcol_24`, + `bfcol_15` AS `bfcol_25`, + `bfcol_16` AS `bfcol_26`, + `bfcol_17` AS `bfcol_27`, + `bfcol_18` AS `bfcol_28`, + TIMESTAMP_ADD(CAST(`bfcol_16` AS DATETIME), INTERVAL 86400000000 MICROSECOND) AS `bfcol_29` + FROM `bfcte_2` +), `bfcte_4` AS ( + SELECT + *, + `bfcol_24` AS `bfcol_36`, + `bfcol_25` AS `bfcol_37`, + `bfcol_26` AS `bfcol_38`, + `bfcol_27` AS `bfcol_39`, + `bfcol_28` AS `bfcol_40`, + `bfcol_29` AS `bfcol_41`, + TIMESTAMP_ADD(`bfcol_25`, INTERVAL 86400000000 MICROSECOND) AS `bfcol_42` + FROM `bfcte_3` +), `bfcte_5` AS ( + SELECT + *, + 172800000000 AS `bfcol_50` + FROM `bfcte_4` +) +SELECT + `bfcol_36` AS `rowindex`, + `bfcol_37` AS `timestamp_col`, + `bfcol_38` AS `date_col`, + `bfcol_39` AS `date_add_timedelta`, + `bfcol_40` AS `timestamp_add_timedelta`, + `bfcol_41` AS `timedelta_add_date`, + `bfcol_42` AS `timedelta_add_timestamp`, + `bfcol_50` AS `timedelta_add_timedelta` +FROM `bfcte_5` \ No newline at end of file diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_numeric/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_numeric/out.sql new file mode 100644 index 0000000000..a43fa2df67 --- /dev/null +++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_numeric/out.sql @@ -0,0 +1,54 @@ +WITH `bfcte_0` AS ( + SELECT + `bool_col` AS `bfcol_0`, + `int64_col` AS `bfcol_1`, + `rowindex` AS `bfcol_2` + FROM `bigframes-dev`.`sqlglot_test`.`scalar_types` +), `bfcte_1` AS ( + SELECT + *, + `bfcol_2` AS `bfcol_6`, + `bfcol_1` AS `bfcol_7`, + `bfcol_0` AS `bfcol_8`, + `bfcol_1` - `bfcol_1` AS `bfcol_9` + FROM `bfcte_0` +), `bfcte_2` AS ( + SELECT + *, + `bfcol_6` AS `bfcol_14`, + `bfcol_7` AS `bfcol_15`, + `bfcol_8` AS `bfcol_16`, + `bfcol_9` AS `bfcol_17`, + `bfcol_7` - 1 AS `bfcol_18` + FROM `bfcte_1` +), `bfcte_3` AS ( 
+  SELECT
+    *,
+    `bfcol_14` AS `bfcol_24`,
+    `bfcol_15` AS `bfcol_25`,
+    `bfcol_16` AS `bfcol_26`,
+    `bfcol_17` AS `bfcol_27`,
+    `bfcol_18` AS `bfcol_28`,
+    `bfcol_15` - CAST(`bfcol_16` AS INT64) AS `bfcol_29`
+  FROM `bfcte_2`
+), `bfcte_4` AS (
+  SELECT
+    *,
+    `bfcol_24` AS `bfcol_36`,
+    `bfcol_25` AS `bfcol_37`,
+    `bfcol_26` AS `bfcol_38`,
+    `bfcol_27` AS `bfcol_39`,
+    `bfcol_28` AS `bfcol_40`,
+    `bfcol_29` AS `bfcol_41`,
+    CAST(`bfcol_26` AS INT64) - `bfcol_25` AS `bfcol_42`
+  FROM `bfcte_3`
+)
+SELECT
+  `bfcol_36` AS `rowindex`,
+  `bfcol_37` AS `int64_col`,
+  `bfcol_38` AS `bool_col`,
+  `bfcol_39` AS `int_sub_int`,
+  `bfcol_40` AS `int_sub_1`,
+  `bfcol_41` AS `int_sub_bool`,
+  `bfcol_42` AS `bool_sub_int`
+FROM `bfcte_4`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_timedelta/out.sql b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_timedelta/out.sql
new file mode 100644
index 0000000000..41e45d3333
--- /dev/null
+++ b/tests/unit/core/compile/sqlglot/expressions/snapshots/test_binary_compiler/test_sub_timedelta/out.sql
@@ -0,0 +1,60 @@
+WITH `bfcte_0` AS (
+  SELECT
+    `date_col` AS `bfcol_0`,
+    `rowindex` AS `bfcol_1`,
+    `timestamp_col` AS `bfcol_2`
+  FROM `bigframes-dev`.`sqlglot_test`.`scalar_types`
+), `bfcte_1` AS (
+  SELECT
+    *,
+    `bfcol_1` AS `bfcol_6`,
+    `bfcol_2` AS `bfcol_7`,
+    `bfcol_0` AS `bfcol_8`,
+    TIMESTAMP_SUB(CAST(`bfcol_0` AS DATETIME), INTERVAL 86400000000 MICROSECOND) AS `bfcol_9`
+  FROM `bfcte_0`
+), `bfcte_2` AS (
+  SELECT
+    *,
+    `bfcol_6` AS `bfcol_14`,
+    `bfcol_7` AS `bfcol_15`,
+    `bfcol_8` AS `bfcol_16`,
+    `bfcol_9` AS `bfcol_17`,
+    TIMESTAMP_SUB(`bfcol_7`, INTERVAL 86400000000 MICROSECOND) AS `bfcol_18`
+  FROM `bfcte_1`
+), `bfcte_3` AS (
+  SELECT
+    *,
+    `bfcol_14` AS `bfcol_24`,
+    `bfcol_15` AS `bfcol_25`,
+    `bfcol_16` AS `bfcol_26`,
+    `bfcol_17` AS `bfcol_27`,
+    `bfcol_18` AS `bfcol_28`,
+    TIMESTAMP_DIFF(CAST(`bfcol_16` AS DATETIME), CAST(`bfcol_16` AS DATETIME), MICROSECOND) AS `bfcol_29`
+  FROM `bfcte_2`
+), `bfcte_4` AS (
+  SELECT
+    *,
+    `bfcol_24` AS `bfcol_36`,
+    `bfcol_25` AS `bfcol_37`,
+    `bfcol_26` AS `bfcol_38`,
+    `bfcol_27` AS `bfcol_39`,
+    `bfcol_28` AS `bfcol_40`,
+    `bfcol_29` AS `bfcol_41`,
+    TIMESTAMP_DIFF(`bfcol_25`, `bfcol_25`, MICROSECOND) AS `bfcol_42`
+  FROM `bfcte_3`
+), `bfcte_5` AS (
+  SELECT
+    *,
+    0 AS `bfcol_50`
+  FROM `bfcte_4`
+)
+SELECT
+  `bfcol_36` AS `rowindex`,
+  `bfcol_37` AS `timestamp_col`,
+  `bfcol_38` AS `date_col`,
+  `bfcol_39` AS `date_sub_timedelta`,
+  `bfcol_40` AS `timestamp_sub_timedelta`,
+  `bfcol_41` AS `date_sub_date`,
+  `bfcol_42` AS `timestamp_sub_timestamp`,
+  `bfcol_50` AS `timedelta_sub_timedelta`
+FROM `bfcte_5`
\ No newline at end of file
diff --git a/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py b/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py
index a78a41fdbf..05d9c26945 100644
--- a/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py
+++ b/tests/unit/core/compile/sqlglot/expressions/test_binary_compiler.py
@@ -14,6 +14,7 @@
 import typing
 
+import pandas as pd
 import pytest
 
 from bigframes import operations as ops
@@ -42,17 +43,15 @@ def _apply_binary_op(
 
 def test_add_numeric(scalar_types_df: bpd.DataFrame, snapshot):
-    bf_df = scalar_types_df[["int64_col"]]
-    sql = _apply_binary_op(bf_df, ops.add_op, "int64_col", "int64_col")
-
-    snapshot.assert_match(sql, "out.sql")
+    bf_df = scalar_types_df[["int64_col", "bool_col"]]
+
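+    # Cover int+int, int+scalar, and both int/bool operand orders; the bool
+    # cases exercise the implicit CAST to INT64 in the generated SQL.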
bf_df["int_add_int"] = bf_df["int64_col"] + bf_df["int64_col"] + bf_df["int_add_1"] = bf_df["int64_col"] + 1 -def test_add_numeric_w_scalar(scalar_types_df: bpd.DataFrame, snapshot): - bf_df = scalar_types_df[["int64_col"]] - sql = _apply_binary_op(bf_df, ops.add_op, "int64_col", ex.const(1)) + bf_df["int_add_bool"] = bf_df["int64_col"] + bf_df["bool_col"] + bf_df["bool_add_int"] = bf_df["bool_col"] + bf_df["int64_col"] - snapshot.assert_match(sql, "out.sql") + snapshot.assert_match(bf_df.sql, "out.sql") def test_add_string(scalar_types_df: bpd.DataFrame, snapshot): @@ -62,6 +61,27 @@ def test_add_string(scalar_types_df: bpd.DataFrame, snapshot): snapshot.assert_match(sql, "out.sql") +def test_add_timedelta(scalar_types_df: bpd.DataFrame, snapshot): + bf_df = scalar_types_df[["timestamp_col", "date_col"]] + timedelta = pd.Timedelta(1, unit="d") + + bf_df["date_add_timedelta"] = bf_df["date_col"] + timedelta + bf_df["timestamp_add_timedelta"] = bf_df["timestamp_col"] + timedelta + bf_df["timedelta_add_date"] = timedelta + bf_df["date_col"] + bf_df["timedelta_add_timestamp"] = timedelta + bf_df["timestamp_col"] + bf_df["timedelta_add_timedelta"] = timedelta + timedelta + + snapshot.assert_match(bf_df.sql, "out.sql") + + +def test_add_unsupported_raises(scalar_types_df: bpd.DataFrame): + with pytest.raises(TypeError): + _apply_binary_op(scalar_types_df, ops.add_op, "timestamp_col", "date_col") + + with pytest.raises(TypeError): + _apply_binary_op(scalar_types_df, ops.add_op, "int64_col", "string_col") + + def test_json_set(json_types_df: bpd.DataFrame, snapshot): bf_df = json_types_df[["json_col"]] sql = _apply_binary_op( @@ -69,3 +89,36 @@ def test_json_set(json_types_df: bpd.DataFrame, snapshot): ) snapshot.assert_match(sql, "out.sql") + + +def test_sub_numeric(scalar_types_df: bpd.DataFrame, snapshot): + bf_df = scalar_types_df[["int64_col", "bool_col"]] + + bf_df["int_add_int"] = bf_df["int64_col"] - bf_df["int64_col"] + bf_df["int_add_1"] = bf_df["int64_col"] - 1 + + bf_df["int_add_bool"] = bf_df["int64_col"] - bf_df["bool_col"] + bf_df["bool_add_int"] = bf_df["bool_col"] - bf_df["int64_col"] + + snapshot.assert_match(bf_df.sql, "out.sql") + + +def test_sub_timedelta(scalar_types_df: bpd.DataFrame, snapshot): + bf_df = scalar_types_df[["timestamp_col", "date_col"]] + timedelta = pd.Timedelta(1, unit="d") + + bf_df["date_sub_timedelta"] = bf_df["date_col"] - timedelta + bf_df["timestamp_sub_timedelta"] = bf_df["timestamp_col"] - timedelta + bf_df["timestamp_sub_date"] = bf_df["date_col"] - bf_df["date_col"] + bf_df["date_sub_timestamp"] = bf_df["timestamp_col"] - bf_df["timestamp_col"] + bf_df["timedelta_sub_timedelta"] = timedelta - timedelta + + snapshot.assert_match(bf_df.sql, "out.sql") + + +def test_sub_unsupported_raises(scalar_types_df: bpd.DataFrame): + with pytest.raises(TypeError): + _apply_binary_op(scalar_types_df, ops.sub_op, "string_col", "string_col") + + with pytest.raises(TypeError): + _apply_binary_op(scalar_types_df, ops.sub_op, "int64_col", "string_col") diff --git a/tests/unit/session/test_metrics.py b/tests/unit/session/test_metrics.py new file mode 100644 index 0000000000..7c2f01c5b9 --- /dev/null +++ b/tests/unit/session/test_metrics.py @@ -0,0 +1,247 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import datetime +import os +import unittest.mock + +import google.cloud.bigquery as bigquery +import pytest + +import bigframes.session.metrics as metrics + +NOW = datetime.datetime.now(datetime.timezone.utc) + + +def test_count_job_stats_with_row_iterator(): + row_iterator = unittest.mock.create_autospec( + bigquery.table.RowIterator, instance=True + ) + row_iterator.total_bytes_processed = 1024 + row_iterator.query = "SELECT * FROM table" + row_iterator.slot_millis = 1234 + execution_metrics = metrics.ExecutionMetrics() + execution_metrics.count_job_stats(row_iterator=row_iterator) + + assert execution_metrics.execution_count == 1 + assert execution_metrics.bytes_processed == 1024 + assert execution_metrics.query_char_count == 19 + assert execution_metrics.slot_millis == 1234 + + +def test_count_job_stats_with_row_iterator_missing_stats(): + row_iterator = unittest.mock.create_autospec( + bigquery.table.RowIterator, instance=True + ) + # Simulate properties not being present on the object + del row_iterator.total_bytes_processed + del row_iterator.query + del row_iterator.slot_millis + execution_metrics = metrics.ExecutionMetrics() + execution_metrics.count_job_stats(row_iterator=row_iterator) + + assert execution_metrics.execution_count == 1 + assert execution_metrics.bytes_processed == 0 + assert execution_metrics.query_char_count == 0 + assert execution_metrics.slot_millis == 0 + + +def test_count_job_stats_with_row_iterator_none_stats(): + row_iterator = unittest.mock.create_autospec( + bigquery.table.RowIterator, instance=True + ) + row_iterator.total_bytes_processed = None + row_iterator.query = None + row_iterator.slot_millis = None + execution_metrics = metrics.ExecutionMetrics() + execution_metrics.count_job_stats(row_iterator=row_iterator) + + assert execution_metrics.execution_count == 1 + assert execution_metrics.bytes_processed == 0 + assert execution_metrics.query_char_count == 0 + assert execution_metrics.slot_millis == 0 + + +def test_count_job_stats_with_dry_run(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = True + query_job.query = "SELECT * FROM table" + execution_metrics = metrics.ExecutionMetrics() + execution_metrics.count_job_stats(query_job=query_job) + + # Dry run jobs shouldn't count as "executed" + assert execution_metrics.execution_count == 0 + assert execution_metrics.bytes_processed == 0 + assert execution_metrics.query_char_count == 0 + assert execution_metrics.slot_millis == 0 + + +def test_count_job_stats_with_valid_job(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = False + query_job.query = "SELECT * FROM table" + query_job.total_bytes_processed = 2048 + query_job.slot_millis = 5678 + query_job.created = NOW + query_job.ended = NOW + datetime.timedelta(seconds=2) + execution_metrics = metrics.ExecutionMetrics() + execution_metrics.count_job_stats(query_job=query_job) + + assert execution_metrics.execution_count == 1 + assert execution_metrics.bytes_processed == 2048 + assert 
execution_metrics.query_char_count == 19 + assert execution_metrics.slot_millis == 5678 + assert execution_metrics.execution_secs == pytest.approx(2.0) + + +def test_count_job_stats_with_cached_job(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = False + query_job.query = "SELECT * FROM table" + # Cache hit jobs don't have total_bytes_processed or slot_millis + query_job.total_bytes_processed = None + query_job.slot_millis = None + query_job.created = NOW + query_job.ended = NOW + datetime.timedelta(seconds=1) + execution_metrics = metrics.ExecutionMetrics() + execution_metrics.count_job_stats(query_job=query_job) + + assert execution_metrics.execution_count == 1 + assert execution_metrics.bytes_processed == 0 + assert execution_metrics.query_char_count == 19 + assert execution_metrics.slot_millis == 0 + assert execution_metrics.execution_secs == pytest.approx(1.0) + + +def test_count_job_stats_with_unsupported_job(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = False + query_job.query = "SELECT * FROM table" + # Some jobs, such as scripts, don't have these properties. + query_job.total_bytes_processed = None + query_job.slot_millis = None + query_job.created = None + query_job.ended = None + execution_metrics = metrics.ExecutionMetrics() + execution_metrics.count_job_stats(query_job=query_job) + + # Don't count jobs if we can't get performance stats. + assert execution_metrics.execution_count == 0 + assert execution_metrics.bytes_processed == 0 + assert execution_metrics.query_char_count == 0 + assert execution_metrics.slot_millis == 0 + assert execution_metrics.execution_secs == pytest.approx(0.0) + + +def test_get_performance_stats_with_valid_job(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = False + query_job.query = "SELECT * FROM table" + query_job.total_bytes_processed = 2048 + query_job.slot_millis = 5678 + query_job.created = NOW + query_job.ended = NOW + datetime.timedelta(seconds=2) + stats = metrics.get_performance_stats(query_job) + assert stats is not None + query_char_count, bytes_processed, slot_millis, exec_seconds = stats + assert query_char_count == 19 + assert bytes_processed == 2048 + assert slot_millis == 5678 + assert exec_seconds == pytest.approx(2.0) + + +def test_get_performance_stats_with_dry_run(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = True + stats = metrics.get_performance_stats(query_job) + assert stats is None + + +def test_get_performance_stats_with_missing_timestamps(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = False + query_job.created = None + query_job.ended = NOW + stats = metrics.get_performance_stats(query_job) + assert stats is None + + query_job.created = NOW + query_job.ended = None + stats = metrics.get_performance_stats(query_job) + assert stats is None + + +def test_get_performance_stats_with_mocked_types(): + query_job = unittest.mock.create_autospec(bigquery.QueryJob, instance=True) + query_job.configuration.dry_run = False + query_job.created = NOW + query_job.ended = NOW + query_job.total_bytes_processed = unittest.mock.Mock() + query_job.slot_millis = 123 + stats = metrics.get_performance_stats(query_job) + assert stats is None + + query_job.total_bytes_processed = 123 + query_job.slot_millis 
= unittest.mock.Mock()
+    stats = metrics.get_performance_stats(query_job)
+    assert stats is None
+
+
+@pytest.fixture
+def mock_environ(monkeypatch):
+    """Fixture to mock os.environ."""
+    monkeypatch.setenv(metrics.LOGGING_NAME_ENV_VAR, "my_test_case")
+
+
+def test_write_stats_to_disk_writes_files(tmp_path, mock_environ):
+    os.chdir(tmp_path)
+    test_name = os.environ[metrics.LOGGING_NAME_ENV_VAR]
+    metrics.write_stats_to_disk(
+        query_char_count=100,
+        bytes_processed=200,
+        slot_millis=300,
+        exec_seconds=1.23,
+    )
+
+    slot_file = tmp_path / (test_name + ".slotmillis")
+    assert slot_file.exists()
+    with open(slot_file) as f:
+        assert f.read() == "300\n"
+
+    exec_time_file = tmp_path / (test_name + ".bq_exec_time_seconds")
+    assert exec_time_file.exists()
+    with open(exec_time_file) as f:
+        assert f.read() == "1.23\n"
+
+    query_char_count_file = tmp_path / (test_name + ".query_char_count")
+    assert query_char_count_file.exists()
+    with open(query_char_count_file) as f:
+        assert f.read() == "100\n"
+
+    bytes_file = tmp_path / (test_name + ".bytesprocessed")
+    assert bytes_file.exists()
+    with open(bytes_file) as f:
+        assert f.read() == "200\n"
+
+
+def test_write_stats_to_disk_no_env_var(tmp_path, monkeypatch):
+    monkeypatch.delenv(metrics.LOGGING_NAME_ENV_VAR, raising=False)
+    os.chdir(tmp_path)
+    metrics.write_stats_to_disk(
+        query_char_count=100,
+        bytes_processed=200,
+        slot_millis=300,
+        exec_seconds=1.23,
+    )
+    assert len(list(tmp_path.iterdir())) == 0
diff --git a/tests/unit/session/test_session.py b/tests/unit/session/test_session.py
index 26b74a3f8a..63c82eb30f 100644
--- a/tests/unit/session/test_session.py
+++ b/tests/unit/session/test_session.py
@@ -252,12 +252,45 @@ def test_read_gbq_cached_table():
     )
     session.bqclient.get_table.return_value = table
 
-    with pytest.warns(UserWarning, match=re.escape("use_cache=False")):
+    with pytest.warns(
+        bigframes.exceptions.TimeTravelCacheWarning, match=re.escape("use_cache=False")
+    ):
         df = session.read_gbq("my-project.my_dataset.my_table")
 
     assert "1999-01-02T03:04:05.678901" in df.sql
 
 
+def test_read_gbq_cached_table_doesnt_warn_for_anonymous_tables_and_doesnt_include_time_travel():
+    session = mocks.create_bigquery_session()
+    table_ref = google.cloud.bigquery.TableReference(
+        google.cloud.bigquery.DatasetReference("my-project", "_anonymous_dataset"),
+        "my_table",
+    )
+    table = google.cloud.bigquery.Table(
+        table_ref, (google.cloud.bigquery.SchemaField("col", "INTEGER"),)
+    )
+    table._properties["location"] = session._location
+    table._properties["numRows"] = "1000000000"
+    table._properties["type"] = "TABLE"
+    session._loader._df_snapshot[table_ref] = (
+        datetime.datetime(1999, 1, 2, 3, 4, 5, 678901, tzinfo=datetime.timezone.utc),
+        table,
+    )
+
+    session.bqclient.query_and_wait = mock.MagicMock(
+        return_value=({"total_count": 3, "distinct_count": 2},)
+    )
+    session.bqclient.get_table.return_value = table
+
+    with warnings.catch_warnings():
+        warnings.simplefilter(
+            "error", category=bigframes.exceptions.TimeTravelCacheWarning
+        )
+        df = session.read_gbq("my-project._anonymous_dataset.my_table")
+
+    assert "1999-01-02T03:04:05.678901" not in df.sql
+
+
 @pytest.mark.parametrize("table", CLUSTERED_OR_PARTITIONED_TABLES)
 def test_default_index_warning_raised_by_read_gbq(table):
     """Because of the windowing operation to create a default index, row
@@ -474,7 +508,7 @@ def get_table_mock(table_ref):
         google.api_core.exceptions.Forbidden,
         match="Check
https://cloud.google.com/bigquery/docs/query-drive-data#Google_Drive_permissions.", ): - api(query_or_table) + api(query_or_table).to_pandas() @mock.patch.dict(os.environ, {}, clear=True) diff --git a/tests/unit/test_dataframe_polars.py b/tests/unit/test_dataframe_polars.py index 79f2049da8..2070b25d66 100644 --- a/tests/unit/test_dataframe_polars.py +++ b/tests/unit/test_dataframe_polars.py @@ -2445,12 +2445,40 @@ def test_join_different_table( assert_pandas_df_equal(bf_result, pd_result, ignore_order=True) -def test_join_duplicate_columns_raises_not_implemented(scalars_dfs): +@all_joins +def test_join_raise_when_param_on_duplicate_with_column(scalars_df_index, how): + if how == "cross": + return + bf_df_a = scalars_df_index[["string_col", "int64_col"]].rename( + columns={"int64_col": "string_col"} + ) + bf_df_b = scalars_df_index.dropna()["string_col"] + with pytest.raises( + ValueError, match="The column label 'string_col' is not unique." + ): + bf_df_a.join(bf_df_b, on="string_col", how=how, lsuffix="_l", rsuffix="_r") + + +def test_join_duplicate_columns_raises_value_error(scalars_dfs): scalars_df, _ = scalars_dfs df_a = scalars_df[["string_col", "float64_col"]] df_b = scalars_df[["float64_col"]] - with pytest.raises(NotImplementedError): - df_a.join(df_b, how="outer").to_pandas() + with pytest.raises(ValueError, match="columns overlap but no suffix specified"): + df_a.join(df_b, how="outer") + + +@all_joins +def test_join_param_on_duplicate_with_index_raises_value_error(scalars_df_index, how): + if how == "cross": + return + bf_df_a = scalars_df_index[["string_col"]] + bf_df_a.index.name = "string_col" + bf_df_b = scalars_df_index.dropna()["string_col"] + with pytest.raises( + ValueError, + match="'string_col' is both an index level and a column label, which is ambiguous.", + ): + bf_df_a.join(bf_df_b, on="string_col", how=how, lsuffix="_l", rsuffix="_r") @all_joins @@ -2462,7 +2490,7 @@ def test_join_param_on(scalars_dfs, how): bf_df_b = bf_df[["float64_col"]] if how == "cross": - with pytest.raises(ValueError): + with pytest.raises(ValueError, match="'on' is not supported for cross join."): bf_df_a.join(bf_df_b, on="rowindex_2", how=how) else: bf_result = bf_df_a.join(bf_df_b, on="rowindex_2", how=how).to_pandas() diff --git a/third_party/bigframes_vendored/ibis/backends/sql/compilers/base.py b/third_party/bigframes_vendored/ibis/backends/sql/compilers/base.py index acccd7ea6c..cbc51e59d6 100644 --- a/third_party/bigframes_vendored/ibis/backends/sql/compilers/base.py +++ b/third_party/bigframes_vendored/ibis/backends/sql/compilers/base.py @@ -537,7 +537,7 @@ def if_(self, condition, true, false: sge.Expression | None = None) -> sge.If: false=None if false is None else sge.convert(false), ) - def cast(self, arg, to: dt.DataType) -> sge.Cast: + def cast(self, arg, to: dt.DataType, format=None) -> sge.Cast: return sge.Cast( this=sge.convert(arg), to=self.type_mapper.from_ibis(to), copy=False ) diff --git a/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py b/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py index be8f9fc555..08bf0d7650 100644 --- a/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py +++ b/third_party/bigframes_vendored/ibis/backends/sql/compilers/bigquery/__init__.py @@ -544,7 +544,7 @@ def visit_Cast(self, op, *, arg, to): f"BigQuery does not allow extracting date part `{from_.unit}` from intervals" ) return self.f.extract(self.v[to.resolution.upper()], arg) - elif 
from_.is_floating() and to.is_integer():
+        elif (from_.is_floating() or from_.is_decimal()) and to.is_integer():
             return self.cast(self.f.trunc(arg), dt.int64)
         return super().visit_Cast(op, arg=arg, to=to)
diff --git a/third_party/bigframes_vendored/pandas/core/frame.py b/third_party/bigframes_vendored/pandas/core/frame.py
index 731e9a24eb..1f79c428c1 100644
--- a/third_party/bigframes_vendored/pandas/core/frame.py
+++ b/third_party/bigframes_vendored/pandas/core/frame.py
@@ -4574,7 +4574,15 @@ def map(self, func, na_action: Optional[str] = None) -> DataFrame:
 
     # ----------------------------------------------------------------------
     # Merging / joining methods
-    def join(self, other, *, on: Optional[str] = None, how: str) -> DataFrame:
+    def join(
+        self,
+        other,
+        *,
+        on: Optional[str] = None,
+        how: str,
+        lsuffix: str = "",
+        rsuffix: str = "",
+    ) -> DataFrame:
         """Join columns of another DataFrame.
 
         Join columns with `other` DataFrame on index
@@ -4647,6 +4655,19 @@ def join(self, other, *, on: Optional[str] = None, how: str) -> DataFrame:
 
         [2 rows x 4 columns]
 
+        If there are overlapping columns, `lsuffix` and `rsuffix` can be used:
+
+        >>> df1 = bpd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
+        >>> df2 = bpd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['B0', 'B1', 'B2']})
+        >>> df1.set_index('key').join(df2.set_index('key'), lsuffix='_left', rsuffix='_right')
+            A_left A_right
+        key
+        K0      A0      B0
+        K1      A1      B1
+        K2      A2      B2
+
+        [3 rows x 2 columns]
+
         Args:
             other:
                 DataFrame or Series with an Index similar to the Index of this one.
@@ -4663,6 +4684,10 @@ def join(self, other, *, on: Optional[str] = None, how: str) -> DataFrame:
                 index, preserving the order of the calling's one.
                 ``cross``: creates the cartesian product from both frames, preserves the order
                 of the left keys.
+            lsuffix (str, default ''):
+                Suffix to use from left frame's overlapping columns.
+            rsuffix (str, default ''):
+                Suffix to use from right frame's overlapping columns.
 
         Returns:
             bigframes.pandas.DataFrame:
@@ -4677,6 +4702,10 @@ def join(self, other, *, on: Optional[str] = None, how: str) -> DataFrame:
             ValueError:
                 If left index to join on does not have the same number of levels
                 as the right index.
+            ValueError:
+                If columns overlap but no suffix is specified.
+            ValueError:
+                If the `on` column is not unique.
         """
         raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
diff --git a/third_party/bigframes_vendored/pandas/core/groupby/__init__.py b/third_party/bigframes_vendored/pandas/core/groupby/__init__.py
index ebfbfa8830..f0bc6348f8 100644
--- a/third_party/bigframes_vendored/pandas/core/groupby/__init__.py
+++ b/third_party/bigframes_vendored/pandas/core/groupby/__init__.py
@@ -537,6 +537,80 @@ def kurtosis(
         """
         raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
 
+    def first(self, numeric_only: bool = False, min_count: int = -1):
+        """
+        Compute the first entry of each column within each group.
+
+        Defaults to skipping NA elements.
+
+        **Examples:**
+
+        >>> import bigframes.pandas as bpd
+        >>> bpd.options.display.progress_bar = None
+
+        >>> df = bpd.DataFrame(dict(A=[1, 1, 3], B=[None, 5, 6], C=[1, 2, 3]))
+        >>> df.groupby("A").first()
+             B  C
+        A
+        1  5.0  1
+        3  6.0  3
+
+        [2 rows x 2 columns]
+
+        >>> df.groupby("A").first(min_count=2)
+              B     C
+        A
+        1  <NA>     1
+        3  <NA>  <NA>
+
+        [2 rows x 2 columns]
+
+        Args:
+            numeric_only (bool, default False):
+                Include only float, int, boolean columns. If None, will attempt to use
+                everything, then use only numeric data.
+ min_count (int, default -1): + The required number of valid values to perform the operation. If fewer + than ``min_count`` valid values are present the result will be NA. + + Returns: + bigframes.pandas.DataFrame or bigframes.pandas.Series: + First of values within each group. + """ + raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE) + + def last(self, numeric_only: bool = False, min_count: int = -1): + """ + Compute the last entry of each column within each group. + + Defaults to skipping NA elements. + + **Examples:** + >>> import bigframes.pandas as bpd + >>> bpd.options.display.progress_bar = None + + >>> df = bpd.DataFrame(dict(A=[1, 1, 3], B=[5, None, 6], C=[1, 2, 3])) + >>> df.groupby("A").last() + B C + A + 1 5.0 2 + 3 6.0 3 + + [2 rows x 2 columns] + + Args: + numeric_only (bool, default False): + Include only float, int, boolean columns. If None, will attempt to use + everything, then use only numeric data. + min_count (int, default -1): + The required number of valid values to perform the operation. If fewer + than ``min_count`` valid values are present the result will be NA. + + Returns: + bigframes.pandas.DataFrame or bigframes.pandas.Series: + Last of values within each group. + """ + raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE) + def sum( self, numeric_only: bool = False, @@ -1256,6 +1330,32 @@ def nunique(self): """ raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE) + def value_counts( + self, + normalize: bool = False, + sort: bool = True, + ascending: bool = False, + dropna: bool = True, + ): + """ + Return a Series or DataFrame containing counts of unique rows. + + Args: + normalize (bool, default False): + Return proportions rather than frequencies. + sort (bool, default True): + Sort by frequencies. + ascending (bool, default False): + Sort in ascending order. + dropna (bool, default True): + Don't include counts of rows that contain NA values. + + Returns: + Series or DataFrame: + Series if the groupby as_index is True, otherwise DataFrame. + """ + raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE) + class DataFrameGroupBy(GroupBy): def agg(self, func, **kwargs): @@ -1406,3 +1506,102 @@ def nunique(self): Number of unique values within a BigQuery DataFrame. """ raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE) + + def value_counts( + self, + subset=None, + normalize: bool = False, + sort: bool = True, + ascending: bool = False, + dropna: bool = True, + ): + """ + Return a Series or DataFrame containing counts of unique rows. + + **Examples:** + + >>> import bigframes.pandas as bpd + >>> import numpy as np + >>> bpd.options.display.progress_bar = None + + >>> df = bpd.DataFrame({ + ... 'gender': ['male', 'male', 'female', 'male', 'female', 'male'], + ... 'education': ['low', 'medium', 'high', 'low', 'high', 'low'], + ... 'country': ['US', 'FR', 'US', 'FR', 'FR', 'FR'] + ... 
}) + + >>> df + gender education country + 0 male low US + 1 male medium FR + 2 female high US + 3 male low FR + 4 female high FR + 5 male low FR + + [6 rows x 3 columns] + + >>> df.groupby('gender').value_counts() + gender education country + female high FR 1 + US 1 + male low FR 2 + US 1 + medium FR 1 + Name: count, dtype: Int64 + + >>> df.groupby('gender').value_counts(ascending=True) + gender education country + female high FR 1 + US 1 + male low US 1 + medium FR 1 + low FR 2 + Name: count, dtype: Int64 + + >>> df.groupby('gender').value_counts(normalize=True) + gender education country + female high FR 0.5 + US 0.5 + male low FR 0.5 + US 0.25 + medium FR 0.25 + Name: proportion, dtype: Float64 + + >>> df.groupby('gender', as_index=False).value_counts() + gender education country count + 0 female high FR 1 + 1 female high US 1 + 2 male low FR 2 + 3 male low US 1 + 4 male medium FR 1 + + [5 rows x 4 columns] + + >>> df.groupby('gender', as_index=False).value_counts(normalize=True) + gender education country proportion + 0 female high FR 0.5 + 1 female high US 0.5 + 2 male low FR 0.5 + 3 male low US 0.25 + 4 male medium FR 0.25 + + [5 rows x 4 columns] + + Args: + subset (list-like, optional): + Columns to use when counting unique combinations. + normalize (bool, default False): + Return proportions rather than frequencies. + sort (bool, default True): + Sort by frequencies. + ascending (bool, default False): + Sort in ascending order. + dropna (bool, default True): + Don't include counts of rows that contain NA values. + + Returns: + Series or DataFrame: + Series if the groupby as_index is True, otherwise DataFrame. + """ + raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE) diff --git a/third_party/bigframes_vendored/version.py b/third_party/bigframes_vendored/version.py index e85f0b73c8..7aff17a40d 100644 --- a/third_party/bigframes_vendored/version.py +++ b/third_party/bigframes_vendored/version.py @@ -12,8 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. -__version__ = "2.14.0" +__version__ = "2.15.0" # {x-release-please-start-date} -__release_date__ = "2025-08-05" +__release_date__ = "2025-08-11" # {x-release-please-end}