Node Join Failure
Background
Corporation XYZ's e-commerce platform has been steadily growing, and the engineering team has decided to expand the EKS cluster to handle the increased workload. The team plans to create a new subnet in the us-west-2 region and provision a new managed node group under this subnet.
Sam, an experienced DevOps engineer, has been tasked with executing this expansion plan. Sam begins by creating a new VPC subnet in the us-west-2 region, with a new CIDR block. The goal is to have the new managed node group run the application workloads in this new subnet, separate from the existing node groups.
After creating the new subnet, Sam configures the new managed node group new_nodegroup_2 in the EKS cluster. During node group creation, Sam notices that the new nodes are not joining the cluster and never become visible in it.
Step 1: Verify Node Status
Let's first verify whether the new nodes from node group new_nodegroup_2 are visible in the cluster:
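The command for this step does not appear above. A minimal sketch of one way to run the check, assuming the nodes carry the `eks.amazonaws.com/nodegroup` label that EKS applies to managed node group instances. The helper name `nodes_in_nodegroup` is hypothetical.

```shell
# Hypothetical helper: list the nodes that belong to one managed node group.
# Assumes kubectl is already configured to talk to the cluster.
nodes_in_nodegroup() {
  kubectl get nodes --selector="eks.amazonaws.com/nodegroup=$1"
}
```

Calling `nodes_in_nodegroup new_nodegroup_2` at this point would be expected to report that no matching nodes exist.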
Output:
No resources found
Step 2: Check Managed Node Group Status
Let's examine the EKS managed node group to verify its status and configuration:
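A sketch of the kind of AWS CLI call that returns this information, assuming the cluster is named eks-workshop as in the output. The helper name `describe_nodegroup` is hypothetical.

```shell
# Hypothetical helper: print the full description of a managed node group.
# $1 = cluster name, $2 = node group name
describe_nodegroup() {
  aws eks describe-nodegroup --cluster-name "$1" --nodegroup-name "$2"
}
```

For this scenario the call would be `describe_nodegroup eks-workshop new_nodegroup_2`.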
Output:
{
    "nodegroup": {
        "nodegroupName": "new_nodegroup_2",
        "nodegroupArn": "arn:aws:eks:us-west-2:1234567890:nodegroup/eks-workshop/new_nodegroup_2/abcd1234-1234-abcd-1234-1234abcd1234",
        "clusterName": "eks-workshop",
        ...
        "status": "ACTIVE",
        "capacityType": "ON_DEMAND",
        "scalingConfig": {
            "minSize": 0,
            "maxSize": 1,
            "desiredSize": 1
        },
        ...
        "health": {
            "issues": []
        }
    }
}
Alternatively, you can check the same information in the EKS Console, on the cluster's Compute tab.
Key observations from the output:
- Node group status is ACTIVE
- Desired capacity is 1
- No health issues reported
- Scaling configuration is correct
Step 3: Investigate Auto Scaling Group
Let's check the ASG activities to understand the instance launch status:
3.1. Identify Nodegroup's Auto Scaling Group Name
Run the command below to capture the node group's Auto Scaling group name as NEW_NODEGROUP_2_ASG_NAME.
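The command itself does not appear above. One possible form, assuming the Auto Scaling group is read from the node group's `resources.autoScalingGroups` field. The helper name `nodegroup_asg_name` is hypothetical.

```shell
# Hypothetical helper: print the name of the first Auto Scaling group
# that backs a managed node group.
# $1 = cluster name, $2 = node group name
nodegroup_asg_name() {
  aws eks describe-nodegroup \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --query 'nodegroup.resources.autoScalingGroups[0].name' \
    --output text
}
```

The variable could then be set with `NEW_NODEGROUP_2_ASG_NAME="$(nodegroup_asg_name eks-workshop new_nodegroup_2)"`.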
3.2. Check the Auto Scaling Activities
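A sketch of how the activity history could be retrieved with the AWS CLI. The helper name `asg_activities` is hypothetical, and the group name is assumed to be the one captured in NEW_NODEGROUP_2_ASG_NAME.

```shell
# Hypothetical helper: print the recorded scaling activities of an
# Auto Scaling group.
# $1 = Auto Scaling group name
asg_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1"
}
```

For example: `asg_activities "$NEW_NODEGROUP_2_ASG_NAME"`.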
Output:
{
    "Activities": [
        {
            "ActivityId": "1234abcd-1234-abcd-1234-1234abcd1234",
            "AutoScalingGroupName": "eks-new_nodegroup_2-abcd1234-1234-abcd-1234-1234abcd1234",
    --->>>  "Description": "Launching a new EC2 instance: i-1234abcd1234abcd1",
            "Cause": "At 2024-10-09T14:59:26Z a user request update of AutoScalingGroup constraints to min: 0, max: 2, desired: 1 changing the desired capacity from 0 to 1.  At 2024-10-09T14:59:36Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
            ...
    --->>>  "StatusCode": "Successful",
            ...
        }
    ]
}
You can also check this in the EKS Console: on the node group's page, click the Auto Scaling group name to open the ASG console and view the ASG activity.
Key findings:
- Instance launch was successful
- ASG reports normal operation
- Desired capacity changes were processed
Step 4: Examine EC2 Instance Configuration
Let's inspect the launched EC2 instance configuration:
Note: For your convenience, the instance ID is available as the environment variable $NEW_NODEGROUP_2_INSTANCE_ID.
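A sketch of a query that would produce a summary of this shape. The field selection is an assumption based on the keys in the output, and the helper name `instance_summary` is hypothetical.

```shell
# Hypothetical helper: summarise the state, placement, instance profile and
# security groups of one EC2 instance.
# $1 = instance ID
instance_summary() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[*].Instances[*].{InstanceState: State.Name, SubnetId: SubnetId, VpcId: VpcId, InstanceProfile: IamInstanceProfile, SecurityGroups: SecurityGroups}'
}
```

For example: `instance_summary "$NEW_NODEGROUP_2_INSTANCE_ID"`.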
Output:
[
  [
    {
      "InstanceState": "running",
      "SubnetId": "subnet-1234abcd1234abcd1",
      "VpcId": "vpc-1234abcd1234abcd1",
      "InstanceProfile": {
        "Arn": "arn:aws:iam::1234567890:instance-profile/eks-abcd1234-1234-abcd-1234-1234abcd1234",
        "Id": "ABCDEFGHIJK1LMNOP2QRS"
      },
      "SecurityGroups": [
        {
          "GroupName": "eks-cluster-sg-eks-workshop-123456789",
          "GroupId": "sg-1234abcd1234abcd1"
        }
      ]
    }
  ]
]
Important aspects to verify:
- Instance state is "running"
- Instance profile and IAM role assignments
- Security group configurations
Note: You can also inspect the instance in the EC2 Console.
Step 5: Analyze Network Configuration
Let's examine the subnet and routing configuration:
Note: For your convenience, the subnet ID is available as the environment variable $NEW_NODEGROUP_2_SUBNET_ID.
5.1. Check subnet configuration
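One possible form of this check, with the selected fields assumed from the keys in the output. The helper name `subnet_summary` is hypothetical.

```shell
# Hypothetical helper: print the AZ, free IP count, CIDR block and state
# of one subnet.
# $1 = subnet ID
subnet_summary() {
  aws ec2 describe-subnets \
    --subnet-ids "$1" \
    --query 'Subnets[*].{AvailabilityZone: AvailabilityZone, AvailableIpAddressCount: AvailableIpAddressCount, CidrBlock: CidrBlock, State: State}'
}
```

For example: `subnet_summary "$NEW_NODEGROUP_2_SUBNET_ID"`.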
Output:
[
  {
    "AvailabilityZone": "us-west-2a",
    "AvailableIpAddressCount": 8186,
    "CidrBlock": "10.42.192.0/19",
    "State": "available"
  }
]
5.2. Obtain route table ID
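A sketch of how the associated route table could be looked up by subnet ID. It assumes the subnet has an explicit route table association. A subnet that falls back to the VPC's main route table would not match this filter. The helper name `subnet_route_table` is hypothetical.

```shell
# Hypothetical helper: find the route table explicitly associated with a
# subnet, and list the subnets that table is associated with.
# $1 = subnet ID
subnet_route_table() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query 'RouteTables[*].{RouteTableId: RouteTableId, AssociatedSubnets: Associations[*].SubnetId}'
}
```

For example: `subnet_route_table "$NEW_NODEGROUP_2_SUBNET_ID"`.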
Output:
[
  {
    "RouteTableId": "rtb-1234abcd1234abcd1",
    "AssociatedSubnets": ["subnet-1234abcd1234abcd1"]
  }
]
5.3. Examine route table configuration
Note: For your convenience, the route table ID is available as the environment variable $NEW_NODEGROUP_2_ROUTETABLE_ID.
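One possible form of the route listing, assuming a single route table is queried. The helper name `list_routes` is hypothetical.

```shell
# Hypothetical helper: print the routes of one route table as a flat list.
# $1 = route table ID
list_routes() {
  aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query 'RouteTables[0].Routes'
}
```

For example: `list_routes "$NEW_NODEGROUP_2_ROUTETABLE_ID"`.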
Output:
[
  {
    "DestinationCidrBlock": "10.42.0.0/16",
    "GatewayId": "local",
    "Origin": "CreateRouteTable",
    "State": "active"
  }
]
To use the VPC Console instead, check the subnet's Details tab and its Route table tab for the route table routes.
Critical Finding: The route table contains only the local route (10.42.0.0/16), so the subnet has no path to the internet.
Step 6: Implement Solution
The root cause is that the new subnet's route table has no default route, so the worker nodes have no internet access. Let's fix this by adding a route through the NAT gateway:
Note: For your convenience, the NAT gateway ID is available as the environment variable $DEFAULT_NODEGROUP_NATGATEWAY_ID.
6.1. Add NAT Gateway route
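A sketch of the route creation, assuming a default route (0.0.0.0/0) through the existing NAT gateway is what is wanted, as the verification output suggests. The helper name `add_nat_default_route` is hypothetical.

```shell
# Hypothetical helper: add a default route that sends all non-local traffic
# through a NAT gateway.
# $1 = route table ID, $2 = NAT gateway ID
add_nat_default_route() {
  aws ec2 create-route \
    --route-table-id "$1" \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id "$2"
}
```

For example: `add_nat_default_route "$NEW_NODEGROUP_2_ROUTETABLE_ID" "$DEFAULT_NODEGROUP_NATGATEWAY_ID"`.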
Output:
{
  "Return": true
}
6.2. Verify the new route
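One possible form of the verification query, with the selected fields assumed from the keys in the output. The helper name `show_route_table` is hypothetical.

```shell
# Hypothetical helper: print a route table's ID, VPC and routes so the new
# default route can be confirmed.
# $1 = route table ID
show_route_table() {
  aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query 'RouteTables[*].{RouteTableId: RouteTableId, VpcId: VpcId, Routes: Routes}'
}
```

For example: `show_route_table "$NEW_NODEGROUP_2_ROUTETABLE_ID"`.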
Output:
[
    {
        "RouteTableId": "rtb-1234abcd1234abcd1",
        "VpcId": "vpc-1234abcd1234abcd1",
        "Routes": [
            {
                "DestinationCidrBlock": "10.42.0.0/16",
                "GatewayId": "local",
                "Origin": "CreateRouteTable",
                "State": "active"
            },
            {
                "DestinationCidrBlock": "0.0.0.0/0",            <<<---
                "NatGatewayId": "nat-1234abcd1234abcd1",        <<<---
                "Origin": "CreateRoute",
                "State": "active"
            }
        ]
    }
]
You can also verify the route in the VPC Console.
6.3. Recycle the node group to trigger a new instance launch
Scale down and scale up the node group. This can take up to 1 minute.
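A sketch of one way to do this with the AWS CLI, assuming the node group's minimum size is 0 (as in the scaling configuration shown earlier) so that a desired size of 0 is accepted. The helper name `recycle_nodegroup` is hypothetical.

```shell
# Hypothetical helper: scale a managed node group to 0 and back to 1,
# waiting for the node group to return to ACTIVE after each change.
# $1 = cluster name, $2 = node group name
recycle_nodegroup() {
  cluster="$1"
  nodegroup="$2"
  for size in 0 1; do
    aws eks update-nodegroup-config \
      --cluster-name "$cluster" \
      --nodegroup-name "$nodegroup" \
      --scaling-config "desiredSize=${size}" || return 1
    aws eks wait nodegroup-active \
      --cluster-name "$cluster" \
      --nodegroup-name "$nodegroup" || return 1
  done
}
```

For this scenario the call would be `recycle_nodegroup eks-workshop new_nodegroup_2`. The function stops at the first failing step so that a rejected update is not followed by a scale-up.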
Step 7: Verification
Verify the node has successfully joined the cluster:
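Because the node may not appear immediately, a small polling loop can be convenient. This is a sketch only: the helper name `wait_for_nodegroup_node`, the retry count, and the delay are hypothetical choices, and the `eks.amazonaws.com/nodegroup` label is assumed to be present on the node.

```shell
# Hypothetical helper: poll until at least one node from the node group is
# listed, then print the node table. Returns non-zero if none appears.
# $1 = node group name, $2 = attempts (default 12), $3 = delay in seconds (default 10)
wait_for_nodegroup_node() {
  selector="eks.amazonaws.com/nodegroup=$1"
  attempts="${2:-12}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # With --no-headers, standard output is empty while no node matches.
    found="$(kubectl get nodes --selector="$selector" --no-headers 2>/dev/null)"
    if [ -n "$found" ]; then
      kubectl get nodes --selector="$selector"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example: `wait_for_nodegroup_node new_nodegroup_2`. Note that this only checks that the node is listed, so the STATUS column still needs to show Ready.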
NAME                                          STATUS   ROLES    AGE    VERSION
ip-10-42-108-252.us-west-2.compute.internal   Ready    <none>   3m9s   v1.30.0-eks-036c24b
A newly joined node can take up to about 1 minute to appear.
Key Takeaways
Network Requirements
- Worker nodes need outbound connectivity to AWS service endpoints, either through internet access or through VPC endpoints
- NAT Gateway provides secure outbound connectivity
- Route table configuration is critical for node bootstrapping
Troubleshooting Approach
- Verify node group configuration
- Check instance status
- Analyze network configuration
- Examine routing tables
Best Practices
- Implement proper network planning
- Use private subnets with NAT Gateway
- Follow AWS security best practices
- Consider VPC endpoints for enhanced security
Additional Resources
Security and Access Control
- Security Group Requirements - Essential security group rules and configurations required for EKS cluster communication
- AWS User Guide for Private Clusters - Comprehensive guide for setting up and managing private EKS clusters
- Configuring Private Access to AWS Services - Detailed instructions for configuring private access to AWS services using VPC endpoints - eksctl
Best Practices Documentation
- EKS Networking Best Practices - AWS recommended networking practices for EKS cluster design and operation
- VPC Endpoint Services Guide - Complete guide to implementing and managing VPC endpoints for secure service access
For a comprehensive understanding of EKS networking, review the EKS Networking Documentation. For a troubleshooting guide, review the Knowledge Center article.