ASG doesn't respect max size when updating EKS managed nodes #3347

Closed

guidodobboletta opened this issue Apr 23, 2025 · 3 comments
@guidodobboletta

Description

When doing a user_data update or an AMI update on my node group, which has min_size: 1, max_size: 1 and desired_size: 1, the ASG spins up 5 more instances for no apparent reason.

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.36.0"

  cluster_name    = var.eks_cluster_name
  cluster_version = "1.32"

  cluster_addons = {
    coredns = {
      most_recent = true
    }

    kube-proxy = {
      most_recent = true
    }

    vpc-cni = {
      most_recent = true
    }

    aws-ebs-csi-driver = {
      most_recent = true
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnets

  create_cni_ipv6_iam_policy = false
  cluster_ip_family          = "ipv4"

  cluster_enabled_log_types              = ["audit", "api", "authenticator", "controllerManager", "scheduler"]
  create_cloudwatch_log_group            = true
  cloudwatch_log_group_retention_in_days = 365

  create_node_security_group = true
  cluster_security_group_additional_rules = {
    tailscale = {
      cidr_blocks = [var.vpc_cidr_block]
      description = "Allow all traffic from private VPC"
      from_port   = 443
      to_port     = 443
      protocol    = "all"
      type        = "ingress"
    }
  }

  cluster_endpoint_private_access = true

  enable_cluster_creator_admin_permissions = true

  access_entries = local.access_entries

  eks_managed_node_groups = {

    (var.eks_cluster_name) = {
      min_size     = 1
      max_size     = 1
      desired_size = 1

      use_name_prefix = true
      instance_types  = [var.instance_type]

      update_config = {
        max_unavailable = 1
      }

      pre_bootstrap_user_data = <<-EOT
      #!/bin/bash
      set -ex
      curl -fsSL https://tailscale.com/install.sh | sh
      tailscale up --accept-dns=false --authkey='${data.aws_secretsmanager_secret_version.tailscale_oauth_eks.secret_string}?ephemeral=true&preauthorized=true'
      tailscale set --ssh
      EOT

      ebs_optimized = true

      block_device_mappings = {
        xvda = {
          device_name = "/dev/xvda"
          ebs = {
            volume_size           = 100
            volume_type           = "gp3"
            delete_on_termination = true
          }
        }
      }
    }
  }
}
  • ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: 20.36.0

  • Terraform version: opentofu 1.9.0

  • Provider version(s): aws 5.95.0

Reproduction Code [Required]

Pasted above

Steps to reproduce the behavior:

Apply a change to pre_bootstrap_user_data (or the AMI) on the node group above and watch the ASG activity.
Expected behavior

I'm expecting EKS to create a single new node and then delete the old one, not create 6 nodes and then delete 5.

Actual behavior

This is the ASG activity history from a change to my pre_bootstrap_user_data:

Status | Instance ID | UTC Time | Local Start Time | Local End Time | Log Details
✅ Launch | i-07cd11bde9c978d4f | 2025-04-23T06:50:49Z | 2025 April 23, 01:51:01 PM +07:00 | 2025 April 23, 01:52:10 PM +07:00 | At 2025-04-23T06:50:49Z a user request update of AutoScalingGroup constraints to min: 1, max: 2, desired: 2 changing the desired capacity from 1 to 2. At 2025-04-23T06:50:59Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2.
✅ Launch | i-0879b60cf3b843448 | 2025-04-23T06:52:53Z | 2025 April 23, 01:52:55 PM +07:00 | 2025 April 23, 01:54:07 PM +07:00 | At 2025-04-23T06:52:53Z a user request update of AutoScalingGroup constraints to min: 1, max: 3, desired: 3 changing the desired capacity from 2 to 3. At 2025-04-23T06:52:53Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 2 to 3.
✅ Launch | i-008676abad9cb498d | 2025-04-23T06:54:56Z | 2025 April 23, 01:55:10 PM +07:00 | 2025 April 23, 01:56:20 PM +07:00 | At 2025-04-23T06:54:56Z a user request update of AutoScalingGroup constraints to min: 1, max: 4, desired: 4 changing the desired capacity from 3 to 4. At 2025-04-23T06:55:08Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 3 to 4.
✅ Launch | i-0d91cf39d55ce4819 | 2025-04-23T06:57:00Z | 2025 April 23, 01:57:15 PM +07:00 | 2025 April 23, 01:58:24 PM +07:00 | At 2025-04-23T06:57:00Z a user request update of AutoScalingGroup constraints to min: 1, max: 5, desired: 5 changing the desired capacity from 4 to 5. At 2025-04-23T06:57:13Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 4 to 5.
✅ Launch | i-00522e75bfa472fb5 | 2025-04-23T06:59:03Z | 2025 April 23, 01:59:09 PM +07:00 | 2025 April 23, 02:00:18 PM +07:00 | At 2025-04-23T06:59:03Z a user request update of AutoScalingGroup constraints to min: 1, max: 6, desired: 6 changing the desired capacity from 5 to 6. At 2025-04-23T06:59:07Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 6.
✅ Terminate | i-07ce15c6c067ce5f7 | 2025-04-23T07:04:49Z | 2025 April 23, 02:04:49 PM +07:00 | 2025 April 23, 02:05:33 PM +07:00 | At 2025-04-23T07:04:49Z instance i-07ce15c6c067ce5f7 was taken out of service in response to a user request, shrinking the capacity from 6 to 5.
✅ Terminate | i-0d91cf39d55ce4819 | 2025-04-23T07:05:41Z | 2025 April 23, 02:05:51 PM +07:00 | 2025 April 23, 02:07:36 PM +07:00 | At 2025-04-23T07:05:41Z a user request update of AutoScalingGroup constraints to min: 1, max: 5, desired: 4 changing the desired capacity from 5 to 4. At 2025-04-23T07:05:51Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 5 to 4. At 2025-04-23T07:05:51Z instance i-0d91cf39d55ce4819 was selected for termination.
✅ Terminate | i-008676abad9cb498d | 2025-04-23T07:07:43Z | 2025 April 23, 02:07:45 PM +07:00 | 2025 April 23, 02:09:30 PM +07:00 | At 2025-04-23T07:07:43Z a user request update of AutoScalingGroup constraints to min: 1, max: 4, desired: 3 changing the desired capacity from 4 to 3. At 2025-04-23T07:07:45Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 4 to 3. At 2025-04-23T07:07:45Z instance i-008676abad9cb498d was selected for termination.
✅ Terminate | i-00522e75bfa472fb5 | 2025-04-23T07:09:45Z | 2025 April 23, 02:09:50 PM +07:00 | 2025 April 23, 02:11:35 PM +07:00 | At 2025-04-23T07:09:45Z a user request update of AutoScalingGroup constraints to min: 1, max: 3, desired: 2 changing the desired capacity from 3 to 2. At 2025-04-23T07:09:50Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 3 to 2. At 2025-04-23T07:09:50Z instance i-00522e75bfa472fb5 was selected for termination.
✅ Terminate | i-0879b60cf3b843448 | 2025-04-23T07:11:46Z | 2025 April 23, 02:11:55 PM +07:00 | (not specified) | At 2025-04-23T07:11:46Z a user request update of AutoScalingGroup constraints to min: 1, max: 2, desired: 1 changing the desired capacity from 2 to 1. At 2025-04-23T07:11:55Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 2 to 1. At 2025-04-23T07:11:55Z instance i-0879b60cf3b843448 was selected for termination.
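
The same activity history can also be pulled straight from the Auto Scaling group behind the node group; a minimal sketch with the AWS CLI, where the group name is a placeholder for the EKS-generated ASG name:

# List the most recent scaling activities for the node group's ASG.
# "my-nodegroup-asg" is a placeholder; EKS generates the real ASG name.
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-nodegroup-asg \
  --max-items 20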
@bryantbiggs
Member

Please familiarize yourself with the service https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html
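
Roughly, that page describes a managed node group update as a surge: EKS temporarily raises the backing ASG's maximum and desired capacity, launches replacement nodes, cordons and drains the old ones, then terminates them and scales the group back down, which is what the activity log above shows. The update settings that feed into this can be read back from the node group; a minimal sketch with the AWS CLI, where the cluster and node group names are placeholders:

# Show the node group's update config (maxUnavailable / maxUnavailablePercentage),
# which controls how many nodes are replaced in parallel during an update.
# "my-cluster" and "my-nodegroup" are placeholder names.
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --query 'nodegroup.updateConfig'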

@guidodobboletta
Author

Oh, I see. I didn't know that detail. Is there a way to tune it to make user data deployments faster? If not that's fine, I was just wondering what was going on.

@bryantbiggs
Member

Is there a way to tune it to make user data deployments faster?

Not that I am aware of. However, I haven't come across folks who are making a lot of changes to the user data.
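
The closest knob appears to be the node group's update config (the max_unavailable / max_unavailable_percentage already set in the module block above); it governs how many nodes can be replaced in parallel, so a one-node group has little to tune there. For larger groups, a hedged sketch of raising it via the AWS CLI, with placeholder names:

# Raise the update parallelism on an existing managed node group.
# "my-cluster" and "my-nodegroup" are placeholders; this mainly helps
# groups with more than one node, not a single-node group like the one above.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --update-config maxUnavailablePercentage=50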
