
CLOUDP-333692: Re-design images building #303


Merged
merged 58 commits into master from julienben/redesign-pipeline
Aug 8, 2025

Conversation

Julien-Ben
Collaborator

@Julien-Ben Julien-Ben commented Jul 29, 2025

Re-design images building

Note for review:

Since atomic_pipeline.py is largely a refactored version of pipeline.py, it’s much clearer to review their side-by-side diff than to wade through GitHub’s “all new lines” view.
Here's the diff:

https://gist.github.com/Julien-Ben/3698d532d17bafb380f2e4f05b5db153

You can also take a look at the related TD Section

Changes

This PR refactors our Docker image build system, most notably by replacing pipeline.py along with other components, as detailed below.

Usage of standalone Dockerfiles

Added in a previous PR, they eliminate the need for templating and make it possible to retire Sonar once the Atomic Releases Epic is completed.

Building with docker buildx, multi-platform builds

In build_images.py we use docker buildx through a Python API. This eliminates the need to build images separately for each platform (ARM/AMD) and then manually bundle them into a manifest.
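
A minimal sketch of what such an invocation can look like, assuming the python_on_whales package (the context path, Dockerfile path, and registry/tag below are illustrative placeholders, not the repo's actual values):

```python
# Sketch of a multi-platform buildx build via python_on_whales.
# Paths, registry, and tag are placeholders.
from python_on_whales import docker

docker.buildx.build(
    context_path="docker/example-image",       # placeholder build context
    file="docker/example-image/Dockerfile",    # standalone Dockerfile, no templating
    platforms=["linux/amd64", "linux/arm64"],  # a single build covers both architectures
    tags=["registry.example.com/example-image:1.0.0"],
    push=True,  # buildx assembles the multi-arch manifest list and pushes it
)
```

Multi-platform builds like this require a buildx builder backed by the docker-container driver, which is what the builder-creation logic discussed further down takes care of.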

Handle build environments explicitly

We’ve introduced a framework that centralizes build configuration by scenario (e.g. local development, staging releases, etc.), so the pipeline automatically picks sensible defaults (registry, target platforms, signing flags, and more) based on where you’re running.

In pipeline_main.py (with support from build_configuration.py and build_context.py) we treat each execution context (local dev, merge to master, release, etc.) as an explicit, top-level environment.
It infers defaults automatically but lets you override any value via CLI flags, ensuring all build parameters live in a single source of truth rather than being scattered through pipeline scripts.
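
A condensed sketch of that idea; the names loosely mirror build_configuration.py / build_context.py, and the default values are illustrative, not the repo's actual ones:

```python
# Illustrative sketch: scenario-driven defaults with CLI overrides.
from dataclasses import dataclass
from enum import Enum


class BuildScenario(str, Enum):
    RELEASE = "release"
    PATCH = "patch"
    STAGING = "staging"
    DEVELOPMENT = "development"


@dataclass
class BuildConfiguration:
    scenario: BuildScenario
    registry: str
    platforms: list[str]
    sign: bool


def resolve_build_configuration(
    scenario: BuildScenario,
    registry_override: str | None = None,
    platform_override: str | None = None,
) -> BuildConfiguration:
    """Scenario supplies the defaults; any CLI flag wins over them."""
    defaults = {  # placeholder registries and flags
        BuildScenario.RELEASE: ("registry.example.com/prod", ["linux/amd64", "linux/arm64"], True),
        BuildScenario.STAGING: ("registry.example.com/staging", ["linux/amd64", "linux/arm64"], False),
        BuildScenario.PATCH: ("registry.example.com/dev", ["linux/amd64"], False),
        BuildScenario.DEVELOPMENT: ("registry.example.com/dev", ["linux/amd64"], False),
    }
    registry, platforms, sign = defaults[scenario]
    return BuildConfiguration(
        scenario=scenario,
        registry=registry_override or registry,
        platforms=platform_override.split(",") if platform_override else platforms,
        sign=sign,
    )
```

The point of this shape: every build parameter is resolved in exactly one place, and a CLI override is just another input to that resolution.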

CLI usage

usage: pipeline_main.py [-h] [--parallel] [--debug] [--sign] [--scenario {BuildScenario.RELEASE,BuildScenario.PATCH,BuildScenario.STAGING,BuildScenario.DEVELOPMENT}]
                        [--platform PLATFORM] [--version VERSION] [--registry REGISTRY] [--parallel-factor PARALLEL_FACTOR]
                        image

Build container images.

positional arguments:
  image                 Image to build.

options:
  -h, --help            show this help message and exit
  --parallel            Build images in parallel.
  --debug               Enable debug logging.
  --sign                Sign images.
  --scenario {BuildScenario.RELEASE,BuildScenario.PATCH,BuildScenario.STAGING,BuildScenario.DEVELOPMENT}
                        Override the build scenario instead of inferring from environment. Options: release, patch, master, development
  --platform PLATFORM   Target platforms for multi-arch builds (comma-separated). Example: linux/amd64,linux/arm64. Defaults to linux/amd64.
  --version VERSION     Override the version/tag instead of resolving from build scenario
  --registry REGISTRY   Override the base registry instead of resolving from build scenario
  --parallel-factor PARALLEL_FACTOR
                        Number of builds to run in parallel, defaults to number of cores
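
For example, a local development build targeting both architectures with a registry override might look like this (the image name is a placeholder):

```
python pipeline_main.py --scenario development --platform linux/amd64,linux/arm64 --registry localhost:5000 example-image
```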

Proof of work

CI is building images with the new pipeline, and tests pass.

Note

For the duration of the Atomic Releases epic, both pipelines will remain in the repository until we are done with the staging and promotion process. This new pipeline will only be used for Evergreen patches.
This PR also heavily depends on changes introduced by the agent matrix removal and the multi-platform support epic.

The existing Evergreen function that uses pipeline.py has been renamed legacy_pipeline and is used for release and periodic build tasks.
A new one has been created that calls the new pipeline.

Once the Atomic Release Epic is complete, we'll be able to remove:

  • Sonar
  • Inventories
  • Periodic builds
  • pipeline.py

Follow up ticket to this PR: https://jira.mongodb.org/browse/CLOUDP-335471

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?


github-actions bot commented Jul 29, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

Other Changes

  • Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm, it is possible to disable permissions for PersistentVolumeClaim resources by setting the operator.enablePVCResize value to false (true by default). Previously, when enabled, these permissions were part of the primary operator role; with this change, they live in a separate role.
  • subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, so the operator roles always have subresource permissions. The setting was introduced as a temporary workaround for an OpenShift issue that has since been resolved, so it is no longer needed.

@Julien-Ben Julien-Ben force-pushed the julienben/redesign-pipeline branch from d7ae339 to 6649987 Compare July 29, 2025 15:30
Fix build scenario

Remove create and push manifests

Continue improvement to main

Simplify main and build_context

missed

Pass Build Configuration object directly

Use legacy and new pipeline

Fix

Remove --include

Rename MCO test image

Multi platform builds, with buildx

TODOs

Implement is_release_step_executed()

Fix init appdb image

Import sort

black formatting

Some cleaning and version adjustments

Adapt main to new build config

Add buildscenario to buildconfig

Handle build env

Renaming, usage of high level config

All images build pass on EVG

Lint

Explicit image type, support custom build_path

Replace old by new pipeline in EVG

Add documentation

Split in multiple files, cleanup

WIP, passing builds on staging temp + multi arch manifests

Replace usage of sonar

Remove namespace

Remove pin_at and build_id

Copied pipeline, removed daily builds and --exclude
@Julien-Ben Julien-Ben self-assigned this Jul 29, 2025
@MaciejKaras
Collaborator

How will this be merged with Maciej's changes (the release_info file)?

@lucian-tosa I already have integration working -> #317. At least for PRs.

@Julien-Ben
Collaborator Author

How will this be merged with Maciej's changes (the release_info file)?

Maciej answered above, but I also plan to sync with him as soon as this is merged, to properly rebase and wire everything together.

def infer_scenario_from_environment(cls) -> "BuildScenario":
    """Infer the build scenario from environment variables."""
    git_tag = os.getenv("triggered_by_git_tag")
    is_patch = os.getenv("is_patch", "false").lower() == "true"
Contributor

@lucian-tosa lucian-tosa Aug 8, 2025


So this is true only when triggering manual patches or PR commits? Will it be false on merges to master?

Collaborator Author


I made it more explicit, and we'll make proper use of the staging registry in a follow-up:
2ec7587

Contributor


I was referring to the is_patch variable from Evergreen. This is what I found in the devprod docs:

${is_patch} is "true" if the running task is in a patch build and undefined if it is not.

Does this mean that is_patch will be false in a merge to master?
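
(For reference: per the quoted docs, the check is_patch = os.getenv("is_patch", "false").lower() == "true" does resolve to false on merges to master, since the variable is undefined there. A hedged sketch of how the full inference could branch; the real precedence in infer_scenario_from_environment may differ, and the version_id check is hypothetical:

```python
# Hypothetical sketch of the scenario inference discussed above.
import os


def infer_scenario() -> str:
    git_tag = os.getenv("triggered_by_git_tag")
    is_patch = os.getenv("is_patch", "false").lower() == "true"
    if git_tag:
        return "release"      # tag-triggered builds are releases
    if is_patch:
        return "patch"        # manual patches and PR commits
    if os.getenv("version_id"):  # hypothetical: any other Evergreen run, e.g. a merge to master
        return "staging"
    return "development"      # local runs
```
)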

@anandsyncs anandsyncs removed their request for review August 8, 2025 08:39
logger.info(f"Created new buildx builder: {builder_name}")
except DockerException as e:
# Check if this is a race condition (another process created the builder)
if hasattr(e, 'stderr') and 'existing instance for' in str(e.stderr):
Collaborator Author


@nammn @lucian-tosa @MaciejKaras

I'd like your opinion on two solutions here.
This function sometimes fails when building multiple agents concurrently.
I tried two alternatives:

  • Catching the exception
  • Using ensure_buildx_builder at the top level, in pipeline_main

Neither of them is ideal: using exceptions is not the right way of handling concurrency, and making the main aware of buildx increases coupling.
My knowledge of Python multiprocessing is limited. I know that for multithreading it is easy to make a lock with a global variable and threading.Lock(). From what I've seen online, for multiprocessing we would need to pass the lock from the agent building method all the way down to here.
I feel like that adds a lot of complexity and optional variables.

Which solution do you prefer?
Do you see an alternative?

Here is the alternative solution with ensuring builder in main:
1badff0
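
For reference, a rough sketch of what the second option looks like, assuming python_on_whales (the builder name and driver choice are illustrative):

```python
# Sketch of option 2: create the buildx builder once in the main process,
# before any worker processes are spawned, so builds never race on creation.
from python_on_whales import docker


def ensure_buildx_builder(builder_name: str = "multiarch") -> str:
    existing = [b.name for b in docker.buildx.list()]
    if builder_name not in existing:
        # docker-container driver is required for multi-platform builds
        docker.buildx.create(name=builder_name, driver="docker-container", use=True)
    return builder_name
```

The trade-off stays as described above: creation can no longer race, but pipeline_main now has to know about buildx.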


logger.info(f"Building {image_name}, dockerfile args: {build_args}")
logger.debug(f"Build args: {build_args}")
logger.debug(f"Building {image_name} for platforms={build_configuration.platforms}")
Collaborator


Since we already have the span set up here, can we save the span attributes as well, given that we already log them anyway?

Collaborator Author


I think it should really be done as a follow-up; it might look like one minute, but I have no idea how to properly set up spans.
The scope of this PR is getting quite large.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's listed in the follow-up ticket:
https://jira.mongodb.org/browse/CLOUDP-335471

Collaborator


You don't have to set up anything; all you need is already there. It's:

span.set_attribute("mck.platforms", build_configuration.platforms)
span.set_attribute("mck.registry", registry)

and the same everywhere...
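
That is, something along these lines (a sketch assuming OpenTelemetry, with the span already active as noted; image_name, build_configuration, and registry come from the surrounding code):

```python
# Sketch: attach the already-logged values to the current span as attributes.
from opentelemetry import trace

span = trace.get_current_span()
span.set_attribute("mck.image_name", image_name)
span.set_attribute("mck.platforms", build_configuration.platforms)
span.set_attribute("mck.registry", registry)
```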

return builder_name


def execute_docker_build(
Collaborator


Or here. I think adding tracing costs one minute and saves us all the trouble later.

Collaborator


Especially since you are logging the exact information we need as metrics anyway.

Collaborator Author


I think it should really be done as a follow-up; it might look like one minute, but I have no idea how to properly set up spans.
The scope of this PR is getting quite large.

@Julien-Ben Julien-Ben merged commit 3e482c3 into master Aug 8, 2025
30 of 33 checks passed
@Julien-Ben Julien-Ben deleted the julienben/redesign-pipeline branch August 8, 2025 14:49
MaciejKaras added a commit that referenced this pull request Aug 11, 2025
…ros propagation (#328)

# Summary

Fixes an issue with conflicting `Dockerfile` files on master. Sonar was
creating Dockerfiles in the directory where the new Dockerfiles already
existed. They were added in #289.

Additionally, another PR (#267) made changes to how the agent images
were built, and especially to the agent Dockerfile ARGs:

https://github.com/mongodb/mongodb-kubernetes/blob/89866b79233745e27ade51dc2d9fd0e73c6124e5/docker/mongodb-agent/Dockerfile#L3-L6

This conflicted with #303, where the new `atomic_pipeline.py` was still
depending on the old Dockerfile structure:


https://github.com/mongodb/mongodb-kubernetes/blob/774bbaca930a1752060cfe5e5bf60f24cd6c999c/docker/mongodb-agent/Dockerfile#L25-L26

## Proof of Work

Passing CI (especially agent build task)

## Checklist

- [ ] Have you linked a jira ticket and/or is the ticket in the title?
- [ ] Have you checked whether your jira ticket required DOCSP changes?
- [x] Have you added changelog file?
    - use `skip-changelog` label if not needed
- refer to [Changelog files and Release
Notes](https://github.com/mongodb/mongodb-kubernetes/blob/master/CONTRIBUTING.md#changelog-files-and-release-notes)
section in CONTRIBUTING.md for more details