Skip to content

Conversation

mgencur
Copy link
Contributor

@mgencur mgencur commented Aug 20, 2025

This commit introduces a complete HCP (Hosted Control Plane) backup and restore testing framework with support for both newly created and existing HostedCluster environments.

Key Features Added:

New Test Infrastructure

  • Add hcp_external_cluster_backup_restore_suite_test.go: Test suite for full HCP backup/restore scenarios with data plane
  • Support for two operational modes:
    • create: Creates new HostedCluster for testing (existing behavior)
    • external: Uses pre-existing (external) HostedCluster with data plane

Enhanced Configuration Options

  • Add Makefile variables for HCP test configuration:
    • HC_NAME: Specifies HostedCluster name for existing mode

Improved Test Architecture

  • Refactor runHCPBackupAndRestore() function for unified handling of both modes
  • Retrieve Kubeconfig for guest cluster from the management cluster on demand (before backup, and later before restore as it changes)
  • Add guest cluster verification functions (PreBackupVerifyGuest, PostRestoreVerifyGuest)
  • Separate log gathering and DPA resource cleanup into reusable functions
  • Enhanced error handling and validation for both control plane and guest cluster

Documentation Updates

  • Add testing documentation for HCP scenarios
  • Include examples for running tests against existing HostedControlPlane
  • Document environment variable configuration options

Build System Improvements

  • Add conditional must-gather build based on SKIP_MUST_GATHER flag

Technical Implementation:

The implementation supports testing both scenarios where OADP needs to:

  1. Create a new HostedCluster and test backup/restore (existing functionality)
  2. Work with an existing HostedCluster that already has workloads and data plane

This enables comprehensive testing of HCP backup/restore functionality in realistic production-like environments where clusters already exist and contain user workloads.

🤖 PR description Generated with Claude Code and modified.

Why the changes were made

  • The test suite should support validations in data plane for hosted clusters after "restore". This can be further extended to support ROSA (there are some todos as hints in this PR) where some resources already exist in the cluster (like DataProtectionApplication)

How to test the changes made

  • See the new instructions in TESTING.md
  • The new tests only run if -hc_backup_restore_mode=external.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2025
@openshift-ci openshift-ci bot requested review from kaovilai and sseago August 20, 2025 06:50
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2025
@@ -0,0 +1,94 @@
package e2e_test
Copy link
Contributor Author

@mgencur mgencur Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you suggest a better name for this file? Maybe hcp_external_cluster_backup_restore_suite_test.go ?
Or hcp_existing_cluster_backup_restore_suite_test.go

func runHCPBackupAndRestore(brCase HCPBackupRestoreCase, updateLastBRcase func(brCase HCPBackupRestoreCase), h *libhcp.HCHandler) {
const (
HCModeCreate HCBackupRestoreMode = "create" // Create new HostedCluster for test
HCModeExisting HCBackupRestoreMode = "existing" // Get existing HostedCluster
Copy link
Contributor Author

@mgencur mgencur Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be external instead of existing...

@mgencur
Copy link
Contributor Author

mgencur commented Aug 20, 2025

The work can be further extended for testing with ROSA, here's the initial draft: https://github.com/mgencur/oadp-operator/pull/1/files (just showing the idea, I sent this PR against my own branch).

Copy link
Contributor

@jparrill jparrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the PR still on going? I'm missing some verifications at multiple levels. In general looks good apart from the comments mentioned in the review.

Said that, I would put these new tests under a different flag, something like TEST_HCP_NEW to avoid impact the current tests until we have them properly set.

Makefile Outdated
Comment on lines 782 to 784
-hc_backup_restore_mode=$(HC_BACKUP_RESTORE_MODE) \
-hc_name=$(HC_NAME) \
-hc_kubeconfig=$(HC_KUBECONFIG) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would isolate those 3 new flags, and bring those up only when the TEST_HCP == true. This way we would not affect the other tests reqs

}
}

func postBackupVerifyGuest() VerificationFunctionGuest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those ones will be reused, adding more checks? or the verification is just that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is just the beginning. It's basically a smoke test for the guest cluster. We can add checks over time, in separate PRs.

Are the PR still on going? I'm missing some verifications at multiple levels.

What other verifications are you missing? Just curious if it's specific to the guest cluster or something more general.

@weshayutin
Copy link
Contributor

@kaovilai FYI

@mgencur
Copy link
Contributor Author

mgencur commented Aug 21, 2025

@jparrill , tried to address your review comments.

@mgencur
Copy link
Contributor Author

mgencur commented Aug 21, 2025

/hold
After doing restore, I will need to get a new kubeconfig for the guest cluster because the old one doesn't work anymore.
Let me figure this out.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2025
@mgencur
Copy link
Contributor Author

mgencur commented Aug 21, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2025
mgencur and others added 6 commits August 27, 2025 08:50
This commit introduces a complete HCP (Hosted Control Plane) backup and restore testing framework
with support for both newly created and existing HostedCluster environments.

- Add `hcp_full_backup_restore_suite_test.go`: Complete test suite for full HCP backup/restore scenarios
- Support for two operational modes:
  - `create`: Creates new HostedCluster for testing (existing behavior)
  - `existing`: Uses pre-existing HostedCluster with data plane

- Add Makefile variables for HCP test configuration:
  - `HC_BACKUP_RESTORE_MODE`: Controls test execution mode (create/existing)
  - `HC_NAME`: Specifies HostedCluster name for existing mode
  - `HC_KUBECONFIG`: Path to guest cluster kubeconfig for existing mode
- Pass HCP configuration parameters to e2e test execution

- Refactor `runHCPBackupAndRestore()` function for unified handling of both modes
- Add guest cluster verification functions (`PreBackupVerifyGuest`, `PostRestoreVerifyGuest`)
- Separate log gathering and DPA resource cleanup into reusable functions
- Enhanced error handling and validation for both control plane and guest cluster

- Add support for kubeconfig-based guest cluster operations
- Implement pre/post backup verification for guest cluster resources
- Add namespace creation/validation tests for guest cluster functionality

- Add `GetHostedCluster()` method to retrieve existing HostedCluster objects
- Add `ClientGuest` field to `HCHandler` for guest cluster operations
- Improve error message formatting in DPA helpers

- Add comprehensive testing documentation for HCP scenarios
- Include examples for running tests against existing HostedControlPlane
- Document environment variable configuration options

- Add conditional must-gather build based on `SKIP_MUST_GATHER` flag
- Enhanced e2e test parameter passing for HCP configurations

The implementation supports testing both scenarios where OADP needs to:
1. Create a new HostedCluster and test backup/restore (existing functionality)
2. Work with an existing HostedCluster that already has workloads and data plane

This enables comprehensive testing of HCP backup/restore functionality in realistic
production-like environments where clusters already exist and contain user workloads.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Replace HC_BACKUP_RESTORE_MODE with TEST_HCP_EXTERNAL flag
- Rename "existing" mode to "external" for clarity
- Move HCP external test args to separate HCP_EXTERNAL_ARGS variable
- Rename hcp_full_backup_restore_suite_test.go to hcp_external_cluster_backup_restore_suite_test.go
- Update test labels from "hcp" to "hcp_external" for external cluster tests
- Simplify Makefile by removing unused HC mode variables from main test-e2e target
- Update documentation to reflect new external cluster test configuration
…rieval

- Remove HC_KUBECONFIG flag and related global variables from test suite
- Remove hardcoded crClientForHC global client initialization
- Add GetHostedClusterKubeconfig() method to dynamically retrieve kubeconfig from HostedCluster status
- Update pre/post backup verification to create client on-demand using retrieved kubeconfig
- Clean up Makefile to remove HC_KUBECONFIG parameter handling
- Simplify HCHandler by removing ClientGuest field

This change improves test reliability by ensuring the guest cluster client
is always created with the current kubeconfig rather than relying on
potentially stale configuration passed via flags.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Contributor

@jparrill jparrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm, just a couple of nits

@@ -0,0 +1,93 @@
package e2e_test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this file is still under development and this is part of a sequence of PRs (and not being part of the current E2E execution), I will give you the freedom to continue in a follow up PR with the validations. I'm being a bit more strict with the in-use components.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Even though the PR title says "full backup/restore" test suite, the main goal of this PR is to setup the way to add validations for the hosted/guest cluster. It adds one basic validation and the point is that we will be adding more in the future (possibly as we add more feature to the hypershift plugin). If you have ideas now on what to validate in the guest cluster, I can follow-up with those changes in the next PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, Once this PR get's in we can create the validations, I have some ideas in mind

@jparrill
Copy link
Contributor

jparrill commented Sep 2, 2025

/test 4.20-e2e-test-hcp-aws

Copy link
Contributor

@jparrill jparrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@jparrill
Copy link
Contributor

jparrill commented Sep 2, 2025

Hey @kaovilai as a first step to include more tests lgtm, if you find some time plz take a look 🙏

@jparrill
Copy link
Contributor

jparrill commented Sep 2, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 2, 2025
Copy link

openshift-ci bot commented Sep 2, 2025

@mgencur: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.20-e2e-test-hcp-aws b0d77e3 link false /test 4.20-e2e-test-hcp-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 2, 2025
Copy link

openshift-ci bot commented Sep 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill, kaovilai, mgencur, weshayutin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit d9617c6 into openshift:oadp-dev Sep 2, 2025
13 of 14 checks passed
@mgencur
Copy link
Contributor Author

mgencur commented Sep 3, 2025

@kaovilai @jparrill Thank you, gentleman.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants