
Add support for replicated ClickHouse setups #2895


Merged: 59 commits, Aug 7, 2025

Conversation

virajmehta
Member

@virajmehta virajmehta commented Jul 22, 2025

This PR:

  • Adds a Docker Compose file, docker-compose.replicated.yml, that provides a replicated test setup for the E2E tests.
  • Adds support for replicated setups throughout the migration manager. This requires adding ON CLUSTER {cluster_name} to all CREATE TABLE and CREATE FUNCTION queries (but not views). It also requires changing the table engine to Replicated{normal table engine name}({keeper_path}, {replica_name}, ...other table engine args). I added helper functions for each of these, but they still rely on string formatting by the query writer. I didn't want to add too many layers of abstraction to migration writing, but I'm open to suggestions for improvement.
  • Adds a test that runs only in a replicated setup and checks the other replica to confirm that the data lands there.
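For reviewers who want a concrete picture of the string formatting described above, here is a minimal sketch. The function names `on_cluster_clause` and `table_engine`, the example table, and the keeper path are all hypothetical and are not taken from this PR's code; the sketch only illustrates the two rewrites (the ON CLUSTER clause and the Replicated engine prefix with its two extra leading arguments).

```rust
/// Hypothetical helper: returns the ` ON CLUSTER ...` clause, or an empty
/// string when no cluster name is configured (non-replicated setup).
fn on_cluster_clause(cluster_name: Option<&str>) -> String {
    match cluster_name {
        Some(name) => format!(" ON CLUSTER {name}"),
        None => String::new(),
    }
}

/// Hypothetical helper: builds the table engine expression.
/// `replication` is `(keeper_path, replica_name)`; when present, the engine
/// name gets the `Replicated` prefix and both values become the first two
/// (quoted) engine arguments, followed by the normal engine arguments.
fn table_engine(
    base_engine: &str,
    engine_args: &[&str],
    replication: Option<(&str, &str)>,
) -> String {
    match replication {
        Some((keeper_path, replica_name)) => {
            let mut args = vec![format!("'{keeper_path}'"), format!("'{replica_name}'")];
            args.extend(engine_args.iter().map(|arg| arg.to_string()));
            format!("Replicated{base_engine}({})", args.join(", "))
        }
        None => format!("{base_engine}({})", engine_args.join(", ")),
    }
}

/// The query writer still assembles the final statement by string formatting.
/// `ExampleTable` and its columns are made up for illustration.
fn create_example_table_query(
    cluster_name: Option<&str>,
    replication: Option<(&str, &str)>,
) -> String {
    format!(
        "CREATE TABLE IF NOT EXISTS ExampleTable{on_cluster} \
         (id UUID, updated_at DateTime) ENGINE = {engine} ORDER BY id",
        on_cluster = on_cluster_clause(cluster_name),
        engine = table_engine("ReplacingMergeTree", &["updated_at"], replication),
    )
}
```

With `cluster_name = None` and `replication = None`, the same call produces the plain non-replicated statement, so a migration can be written once for both modes.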

Changes to gateway configuration:

  • Adds an environment variable, TENSORZERO_CLICKHOUSE_CLUSTER_NAME, that, if set, makes the gateway use that cluster for replicated operations.
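As a rough illustration of how such a variable could be read, here is a small sketch. Only the variable name comes from this PR; the helper names are hypothetical, and treating a blank value the same as an unset one is an assumption, not confirmed behavior of the gateway.

```rust
use std::env;

const CLUSTER_NAME_ENV_VAR: &str = "TENSORZERO_CLICKHOUSE_CLUSTER_NAME";

/// Hypothetical helper: trims the raw value and treats a missing or blank
/// value as "no cluster configured". The blank-value handling is an assumption.
fn normalize_cluster_name(raw: Option<String>) -> Option<String> {
    raw.map(|value| value.trim().to_string())
        .filter(|value| !value.is_empty())
}

/// Reads the cluster name from the environment; `None` means non-replicated.
fn cluster_name_from_env() -> Option<String> {
    normalize_cluster_name(env::var(CLUSTER_NAME_ENV_VAR).ok())
}
```

Keeping the normalization separate from the environment lookup makes the decision logic easy to unit test without mutating process state.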

We cannot run concurrent migrations with replicated tables, so I introduce a gateway --run-migrations-only CLI command for running the migrations separately rather than concurrently. If the database is replicated but the migrations haven't been run, the gateway will error on startup.
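The startup rule above can be sketched as a small decision function. Everything here is hypothetical (the enum, the function name, the error text). In particular, the branch where a non-replicated gateway applies pending migrations itself before serving is an assumption about existing behavior, not something this description states.

```rust
#[derive(Debug, PartialEq)]
enum StartupAction {
    /// Invoked with --run-migrations-only: apply migrations, do not serve.
    RunMigrationsOnly,
    /// Assumed non-replicated behavior: apply pending migrations, then serve.
    RunMigrationsThenServe,
    /// Nothing to migrate: serve immediately.
    Serve,
}

/// Hypothetical sketch of the startup decision described in this PR.
fn plan_startup(
    run_migrations_only: bool,
    replicated: bool,
    migrations_applied: bool,
) -> Result<StartupAction, String> {
    if run_migrations_only {
        return Ok(StartupAction::RunMigrationsOnly);
    }
    if replicated && !migrations_applied {
        // Replicated tables cannot be migrated concurrently, so a normal
        // gateway start refuses to migrate and reports an error instead.
        return Err(
            "replicated database has pending migrations; run the gateway with --run-migrations-only first"
                .to_string(),
        );
    }
    if migrations_applied {
        Ok(StartupAction::Serve)
    } else {
        Ok(StartupAction::RunMigrationsThenServe)
    }
}
```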

This PR closes #2228


Important

Add support for replicated ClickHouse setups with migration management and testing enhancements.

  • Behavior:
    • Adds support for replicated ClickHouse setups in migration_manager by introducing ON CLUSTER syntax and Replicated table engines.
    • Introduces --run-migrations-only CLI command for manual migration execution in replicated setups.
    • Adds TENSORZERO_CLICKHOUSE_CLUSTER_NAME environment variable for cluster configuration.
  • Docker & Configurations:
    • Adds docker-compose.replicated.yml for setting up a replicated ClickHouse environment.
    • Configures ClickHouse nodes with replication settings in replicated_clickhouse_config.
  • Tests:
    • Updates E2E tests in inference.rs and howdy.rs to validate replicated ClickHouse functionality.
    • Adds tests to ensure data consistency across replicas.

This description was created by Ellipsis for e20739e.

@Aaron1011
Member

From our discussion:

  • Check system.clusters to determine if the ClickHouse server is in cluster mode
    • If ClickHouse is in normal non-cluster mode, then we error if you're trying to start TensorZero in replicated mode
    • If ClickHouse is in cluster mode, then require an extra environment variable to start TensorZero in non-replicated mode.
  • Make the TensorZeroMigration table replicated
  • Add an e2e test that gets all the tables from the e2e database, and verifies that all of them are in replicated/non-replicated mode as expected
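A minimal sketch of the checks in the list above, under stated assumptions: how "cluster mode" is derived from system.clusters is left to the caller (passed in as a boolean), the opt-out environment variable is represented only as a flag because its name is not given here, and all type and function names are hypothetical. The engine check relies on the Replicated prefix described in the PR; the caller is assumed to pass only real tables, since views do not carry a replicated engine.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    Replicated,
    NonReplicated,
}

/// Hypothetical sketch of the proposed startup validation.
/// `server_is_clustered` is whatever the system.clusters check concluded.
/// `allow_non_replicated` stands in for the extra opt-out environment variable.
fn resolve_mode(
    server_is_clustered: bool,
    cluster_name: Option<&str>,
    allow_non_replicated: bool,
) -> Result<Mode, String> {
    match (server_is_clustered, cluster_name) {
        (false, Some(name)) => Err(format!(
            "cluster '{name}' was requested, but ClickHouse is not in cluster mode"
        )),
        (true, None) if !allow_non_replicated => Err(
            "ClickHouse is in cluster mode; explicit opt-in is required to run non-replicated"
                .to_string(),
        ),
        (true, Some(_)) => Ok(Mode::Replicated),
        (_, None) => Ok(Mode::NonReplicated),
    }
}

/// Hypothetical e2e-style check: given `(table_name, engine)` pairs, return
/// the names of tables whose engine does not match the expected mode.
fn tables_in_wrong_mode<'a>(tables: &[(&'a str, &'a str)], expected: &Mode) -> Vec<&'a str> {
    tables
        .iter()
        .filter(|(_, engine)| engine.starts_with("Replicated") != (*expected == Mode::Replicated))
        .map(|(name, _)| *name)
        .collect()
}
```

An e2e test could fetch the name and engine of every table in the test database and assert that `tables_in_wrong_mode` returns an empty list for the mode the suite is running in.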

@virajmehta
Member Author

Note: we need a docker run tensorzero/gateway --apply-migrations or similar.

Aaron1011
Aaron1011 previously approved these changes Aug 6, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 6, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 6, 2025
Aaron1011
Aaron1011 previously approved these changes Aug 6, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 6, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 7, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 7, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 7, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 7, 2025
@virajmehta virajmehta added this pull request to the merge queue Aug 7, 2025
Merged via the queue into main with commit 843af78 Aug 7, 2025
32 checks passed
@virajmehta virajmehta deleted the viraj/handle-replicated-clickhouse branch August 7, 2025 21:00
Successfully merging this pull request may close these issues.

Support automatic migrations for self-hosted replicated ClickHouse cluster