Adding Iceberg REST Catalog Examples for Dataflow Documentation #10149

Open · wants to merge 13 commits into base: main

@tarun-google tarun-google commented Aug 17, 2025

Description

Fixes # b/427973623

I have already added more complex RESTCatalog examples to the apache/beam cookbook. These are simpler versions intended to drive the Dataflow documentation.

  1. Upgraded the apache-beam SDK version. Newer features such as 'create dataset.table if not exists' are helpful.
  2. Added iceberg-gcp so that a GCS bucket can be passed as storage. Using local Hadoop storage is not possible with a REST catalog.
  3. Added integration tests.
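
For readers who have not opened the diff, here is a rough sketch of the kind of configuration items 1 and 2 refer to: a REST catalog whose warehouse is a GCS bucket, with the file IO implementation that iceberg-gcp provides. It is not the code from this PR. The class and method names are hypothetical, and the URI, bucket, catalog name, and token values in `main` are placeholders. The property keys (`type`, `uri`, `warehouse`, `io-impl`, `token`) are standard Iceberg catalog properties, and `table`, `catalog_name`, and `catalog_properties` are the keys Beam's managed Iceberg connector reads, but check both against the SDK version you use.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not code from this PR.
final class IcebergRestCatalogConfigSketch {

  private IcebergRestCatalogConfigSketch() {}

  // Builds Iceberg catalog properties for a REST catalog backed by a GCS bucket.
  static Map<String, String> catalogProperties(
      String catalogUri, String warehouseBucket, String accessToken) {
    Map<String, String> props = new LinkedHashMap<>();
    props.put("type", "rest");              // use the REST catalog implementation
    props.put("uri", catalogUri);           // REST catalog endpoint (placeholder in main)
    props.put("warehouse", warehouseBucket); // for example gs://<bucket>
    // GCS file IO shipped in the iceberg-gcp artifact.
    props.put("io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO");
    // Assumption: the catalog accepts an OAuth bearer token. Tokens expire,
    // which matters for long-running streaming pipelines.
    props.put("token", accessToken);
    return Collections.unmodifiableMap(props);
  }

  // Wraps the catalog properties in the map shape a managed Iceberg transform expects.
  static Map<String, Object> managedIcebergConfig(
      String table, String catalogName, Map<String, String> catalogProps) {
    Map<String, Object> config = new LinkedHashMap<>();
    config.put("table", table);             // for example "db.table"
    config.put("catalog_name", catalogName);
    config.put("catalog_properties", catalogProps);
    return Collections.unmodifiableMap(config);
  }

  public static void main(String[] args) {
    // All values below are placeholders.
    Map<String, String> props =
        catalogProperties("https://example.invalid/iceberg", "gs://example-bucket", "example-token");
    Map<String, Object> config = managedIcebergConfig("db.table", "example_catalog", props);
    config.forEach((key, value) -> System.out.println(key + " = " + value));
  }
}
```

In a Beam pipeline, a map like this would typically be handed to `Managed.write(Managed.ICEBERG).withConfig(config)` or the matching read transform. Keeping it as a plain map makes it easy to check in a unit test before any pipeline runs.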

Note: Before submitting a pull request, please open an issue for discussion if you are not associated with Google.

Checklist

  • I have followed Sample Format Guide
  • pom.xml parent set to latest shared-configuration
  • Appropriate changes to README are included in PR
  • These samples need a new API enabled in testing projects to pass (let us know which ones)
  • These samples need a new/updated env vars in testing projects set to pass (let us know which ones)
  • Tests pass: mvn clean verify required
  • Lint passes: mvn -P lint checkstyle:check required
  • Static Analysis: mvn -P lint clean compile pmd:cpd-check spotbugs:check advisory only
  • This sample adds a new sample directory, and I updated the CODEOWNERS file with the codeowners for this sample
  • This sample adds a new Product API, and I updated the Blunderbuss issue/PR auto-assigner with the codeowners for this sample
  • Please merge this PR for me once it is approved

@product-auto-label bot added the samples and api: dataflow labels Aug 17, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @tarun-google, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new Dataflow documentation examples that illustrate how to interact with Apache Iceberg tables using a REST catalog, specifically leveraging Google Cloud Storage (GCS) for data storage. The changes include examples for both streaming data writes and Change Data Capture (CDC) reads, alongside necessary dependency updates and robust integration tests to validate the new functionalities.

Highlights

  • Enhanced Apache Beam Integration: The Apache Beam SDK has been upgraded to version 2.67.0, which enables access to newer features beneficial for table management, such as the 'create database.table if not exists' functionality.
  • GCS-backed Iceberg Examples: The iceberg-gcp dependency has been added, and two new Java examples (ApacheIcebergRestCatalogStreamingWrite.java and ApacheIcebergCDCRead.java) have been introduced. These examples demonstrate streaming data operations with Iceberg tables that are stored in Google Cloud Storage (GCS) buckets and accessed via a REST catalog.
  • Comprehensive Integration Tests: A new integration test, testApacheIcebergRestCatalog, has been added to ApacheIcebergIT.java. This test verifies the end-to-end functionality of the new streaming write and CDC read examples, ensuring proper data persistence and metadata creation within GCS.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds new examples for using Apache Iceberg with a REST catalog, specifically for CDC reads and streaming writes. The changes include updating the Beam SDK version, adding the iceberg-gcp dependency, and providing two new example pipelines with an integration test. My review focuses on improving the robustness of the new test and enhancing the documentation in the new example code. I've identified a critical issue in the integration test where a variable is initialized with a null value, which would cause the test to fail. I've also suggested adding a comment to one of the new examples to warn users about OAuth token expiration, which is a crucial piece of information for running these streaming pipelines for extended periods.

@tarun-google tarun-google marked this pull request as ready for review August 21, 2025 20:47
@tarun-google tarun-google requested review from yoshi-approver and a team as code owners August 21, 2025 20:47
snippet-bot bot commented Aug 21, 2025

Here is the summary of changes.

You are about to add 2 region tags.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
@tarun-google

I see a failure in com.example.dataflow.KafkaReadIT, which is unrelated to these changes.

@tarun-google

Adding reviewers from the Dataflow team: @chamikaramj @ahmedabu98 @VeronicaWasson
