Tamerlan666 Dec 15 2023 at 20:36

Managing AWS Auto Scaling Group Instance Refresh: The Harmony of Terraform and Ansible

Tutorial

Translation

Original author: Tamerlan666

In DevOps, where automation is central, managing cloud resources and their update processes matters a great deal. Many modern projects, particularly on AWS, rely on Auto Scaling Groups (ASG). The mechanism serves three goals: balancing load, increasing service reliability, and keeping operational costs down.

Imagine working at a company where you deploy applications on AWS. To streamline deployment and manage configuration, you use pre-built AMI images created with tools such as HashiCorp Packer, so your applications launch quickly and reliably. For the infrastructure itself you turn to Terraform, which many large companies treat as the standard tool for managing cloud resources under the IaC (Infrastructure as Code) approach.

As an IT engineer, you sometimes need to update instance versions to a newer AMI image, either for the latest security patches or to introduce new functionalities. The challenge lies in updating an active ASG without causing downtime. It's crucial to ensure the new AMI performs as reliably as the existing one, balancing the need for updates with system stability and uptime.

Instance refresh is the ASG feature that replaces the instances in a group while minimizing downtime, thereby maintaining high availability. Making sure such an update actually succeeds, especially in large, complex systems, can be a challenge. Terraform resources such as aws_autoscaling_group can start the process, but they do not wait for it to finish or report its progress. This limitation becomes apparent when other infrastructure components, such as certificate renewals or DNS updates, depend on the state and version of the instances. Monitoring the update is essential if the infrastructure is to be in a known state once Terraform has finished.

Ansible can close this gap. Best known for configuration management and automation, it can also poll the ASG while the refresh is running and confirm that the update completed successfully.
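For orientation, the progress information that Terraform does not surface is available from the Auto Scaling API itself. The sketch below is not the approach taken in this article (the playbook in section 2 compares launch template versions instead); it only illustrates what "tracking a refresh" can look like, assuming a boto3 Auto Scaling client and its describe_instance_refreshes call. The function names, the set of terminal statuses and the default delay and retry values are assumptions made for the illustration.

```python
import time
from datetime import datetime, timezone

# Statuses after which a refresh no longer changes (assumed from the Auto Scaling API docs).
TERMINAL = {"Successful", "Failed", "Cancelled", "RollbackFailed", "RollbackSuccessful"}

# Fallback sort key for entries that carry no StartTime.
_EPOCH = datetime.min.replace(tzinfo=timezone.utc)


def latest_refresh(client, asg_name):
    """Return the most recently started instance refresh of the group, or None."""
    resp = client.describe_instance_refreshes(AutoScalingGroupName=asg_name)
    refreshes = resp.get("InstanceRefreshes", [])
    if not refreshes:
        return None
    return max(refreshes, key=lambda r: r.get("StartTime", _EPOCH))


def wait_for_refresh(client, asg_name, delay=5, retries=300, sleep=time.sleep):
    """Poll until the latest refresh reaches a terminal status and return that status."""
    for _ in range(retries):
        refresh = latest_refresh(client, asg_name)
        if refresh and refresh["Status"] in TERMINAL:
            return refresh["Status"]
        sleep(delay)
    raise TimeoutError(f"instance refresh for {asg_name} did not finish in time")


# Hypothetical usage ("my-asg" is a placeholder name):
#   import boto3
#   status = wait_for_refresh(boto3.client("autoscaling"), "my-asg")
```

A caller would still need to decide what to do with a status other than "Successful"; the sketch only reports it.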

1. Preparing Terraform

The first step is creating a Terraform configuration that provides the necessary structure and process for updating ASGs.

resource "aws_autoscaling_group" "example" {
  desired_capacity     = 3
  max_size             = 5
  min_size             = 2
  vpc_zone_identifier  = ["subnet-0bb1c79de3EXAMPLE"]

  launch_template {
    id      = aws_launch_template.example.id
    version = aws_launch_template.example.latest_version
  }
  
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 100
      instance_warmup        = 120
    }
    triggers = ["tag"]
  }

  health_check_type          = "EC2"
  force_delete               = true
  wait_for_capacity_timeout  = "0"
}

A detailed breakdown of the instance_refresh block:

This block is critical as it sets parameters for updating instances in the ASG:

strategy = "Rolling": Updates are carried out step by step, replacing instances gradually rather than all at once, which minimizes the impact on service availability.
preferences: This block contains two key settings:
- min_healthy_percentage = 100: The percentage of the group's desired capacity that must remain healthy and in service while the refresh proceeds. A value of 100 is the most conservative choice; according to the AWS documentation, it limits replacement to one instance at a time.
- instance_warmup = 120: The time in seconds a new instance is given to "warm up" before it is considered ready and the refresh moves on.
triggers = ["tag"]: Lists additional ASG properties whose changes start an instance refresh, here the group's tags. Changes to the launch template always start a refresh, so it does not need to be listed.

Pay close attention to the launch_template block. Setting version = "$Latest" might seem convenient, but it is not recommended. With that setting the Auto Scaling group always uses the newest launch template version when it launches new EC2 instances, yet existing instances are not replaced when the template changes: the ASG configuration itself stays the same, so no instance refresh is started.

To start an instance refresh whenever the template changes, use the latest_version attribute of the aws_launch_template resource instead. The version recorded on the ASG then changes with every template change, and each change triggers an instance update.

Next, we'll incorporate an Ansible invocation into Terraform to manage the update process, employing a specialized Terraform resource called null_resource:

resource "null_resource" "ansible_run" {
  # Re-run the playbook whenever the ASG's launch template version changes
  triggers = {
    template_version = aws_autoscaling_group.example.launch_template[0].version
  }

  provisioner "local-exec" {
    # Executed on the machine that runs Terraform
    command = join(" ",
      [
        "ansible-playbook ${path.module}/asg_refresh_waiter.yml -i 'localhost,'",
        "-e asg_name=${aws_autoscaling_group.example.name}"
      ]
    )
  }
}

In Terraform, null_resource is a way of performing actions that are not tied to any actual resource of a cloud provider. This resource is ideal for integrating external tools, such as Ansible.

Triggers: triggers is a Terraform construct that defines the conditions under which a resource is recreated. In our case, each time the launch_template version in aws_autoscaling_group.example changes, the null_resource is replaced and Terraform runs the Ansible playbook again. This ensures that after each ASG update, Ansible is invoked to track the status of the instance refresh.
Provisioner "local-exec": This provisioner tells Terraform to execute a command on the local machine. In this case, we are running an Ansible playbook.
- ansible-playbook ${path.module}/asg_refresh_waiter.yml indicates the path to our playbook.
- -i 'localhost,' instructs Ansible to operate on the local machine. The trailing comma makes Ansible treat the value as an inline host list rather than the path to an inventory file.
- -e asg_name=${aws_autoscaling_group.example.name} passes the name of the autoscaling group to Ansible to work with.

So, whenever Terraform updates the ASG due to changes in the launch_template, it automatically calls Ansible to track the process of instance refresh.
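To see what the join(" ", [...]) expression hands to the shell, here is a small sketch that rebuilds the same string and splits it the way a POSIX shell would. The module path and group name are made-up placeholders, and build_command is a helper invented for this illustration, not part of Terraform or Ansible.

```python
import shlex


def build_command(module_path, asg_name):
    """Mimic the Terraform join(): two fragments glued together with a single space."""
    parts = [
        f"ansible-playbook {module_path}/asg_refresh_waiter.yml -i 'localhost,'",
        f"-e asg_name={asg_name}",
    ]
    return " ".join(parts)


# Placeholder values standing in for ${path.module} and the ASG name.
command = build_command("./modules/asg", "example-asg")
print(command)

# shlex.split applies POSIX shell quoting rules, so the single quotes around
# the inline inventory disappear while the trailing comma is kept.
argv = shlex.split(command)
print(argv)
```

Note that the ASG name is interpolated without quoting, so a name containing spaces or shell metacharacters would need extra care.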

2. Creating Ansible Playbook

Let's now move on to the Ansible playbook, which tracks the update process based on data received from AWS. As the code above shows, we need a file named asg_refresh_waiter.yml, placed in the same directory as the code of our Terraform module.

---
- name: ASG Refresh Handler
  hosts: localhost
  gather_facts: false
  connection: local
  tasks:

    - name: Obtain ASG Information
      amazon.aws.ec2_asg_info:
        name: '{{ asg_name }}'
      register: asg_status

    - name: Display ASG Instances
      debug:
        msg: '{{ asg_status.results[0].instances }}'

    - name: Display ASG Launch Template Info
      debug:
        msg: '{{ asg_status.results[0].launch_template }}'

    - name: Await Instance Refresh Completion
      amazon.aws.ec2_asg_info:
        name: '{{ asg_name }}'
      register: updated_asg_status
      retries: 300
      until:
        - >-
          updated_asg_status.results[0].instances
            | map(attribute='launch_template.version')
            | union([updated_asg_status.results[0].launch_template.version])
            | length == 1
        - >-
          updated_asg_status.results[0].instances
            | map(attribute='launch_template.version')
            | unique
            | length == 1
      when: asg_status.results[0].launch_template.version is defined

    - name: Display Updated Instances
      debug:
        msg: '{{ updated_asg_status.results[0].instances }}'

Let's break this down:

Obtain ASG Information: This task retrieves the current state of the ASG. Its result is used further down, in the when clause, to confirm that the group has a launch template version to compare against before the waiting begins.
Display ASG Instances and Display ASG Launch Template Info: These tasks help in debugging by displaying current information about the status of instances and the launch template.
Await Instance Refresh Completion: This is the heart of our playbook. Here we use the mechanism of retries/until, which allows us to track the update process until its completion:
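- retries: 300 indicates that the task will be repeated up to about 300 times until the until condition is met. No delay is set, so Ansible's default pause of 5 seconds between attempts applies, which gives the refresh roughly 25 minutes to finish.
- until lists two conditions, and both must hold for the update to count as complete.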
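The retries/until mechanism boils down to a bounded polling loop. The sketch below mimics it in plain Python to make the time budget visible; it is not Ansible's implementation, the names retry_until and max_attempts are invented for the example, and Ansible's own counting of the first attempt may differ by one.

```python
import time


def retry_until(task, until, max_attempts=300, delay=5, sleep=time.sleep):
    """Call task() until until(result) is true or the attempts run out.

    Returns the final result together with the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        result = task()
        if until(result):
            return result, attempt
        if attempt < max_attempts:
            sleep(delay)  # pause between polls, like Ansible's `delay`
    raise TimeoutError(f"condition not met after {max_attempts} attempts")


# Simulated polls: the launch template versions reported by three instances.
polls = iter([["1", "1", "2"], ["1", "2", "2"], ["2", "2", "2"]])
waits = []
versions, attempts = retry_until(
    task=lambda: next(polls),
    until=lambda v: set(v) == {"2"},
    sleep=waits.append,  # record the pauses instead of really sleeping
)
print(versions, attempts, waits)  # ['2', '2', '2'] 3 [5, 5]
```

With 300 attempts and a 5 second pause, the loop spends at most about 1500 seconds sleeping, plus the time of the API calls themselves.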

Breaking down the conditions in the until block:

In the Await Instance Refresh Completion task, the until block presents two checks. These checks are needed to ensure that all instances have been updated to the latest version of the Launch Template.

The first check:

updated_asg_status.results[0].instances
  | map(attribute='launch_template.version')
  | union([updated_asg_status.results[0].launch_template.version])
  | length == 1

This check performs the following actions:

Extracts the launch template versions of all instances in the ASG.
Combines the obtained list of versions with the ASG launch template version.
Checks that all versions match, i.e., the list contains only one unique version.

The second check:

updated_asg_status.results[0].instances
  | map(attribute='launch_template.version')
  | unique
  | length == 1

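The second check confirms that all instances report one and the same launch template version. On its own it does not say which version that is; together with the first check it guarantees that every instance runs the version currently set on the ASG. It also fails while the group has no instances at all, a case the first check alone would let through.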
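The two Jinja2 filter chains can be mirrored in plain Python to see how they behave on different group states. This is only a sketch: the dictionaries imitate the shape of results[0] as the playbook accesses it (instances and launch_template.version), the version values are invented, and the function names are not part of Ansible.

```python
def instance_versions(asg):
    """instances | map(attribute='launch_template.version')"""
    return [i["launch_template"]["version"] for i in asg["instances"]]


def first_check(asg):
    """... | union([asg launch template version]) | length == 1"""
    merged = set(instance_versions(asg)) | {asg["launch_template"]["version"]}
    return len(merged) == 1


def second_check(asg):
    """... | unique | length == 1"""
    return len(set(instance_versions(asg))) == 1


def refresh_complete(asg):
    """Both until conditions must hold, as in the playbook."""
    return first_check(asg) and second_check(asg)


def make_asg(asg_version, versions):
    """Build a minimal stand-in for results[0] (shape assumed for this sketch)."""
    return {
        "launch_template": {"version": asg_version},
        "instances": [{"launch_template": {"version": v}} for v in versions],
    }


in_progress = make_asg("4", ["3", "4", "4"])  # one instance still on the old version
finished = make_asg("4", ["4", "4", "4"])     # every instance matches the ASG
not_started = make_asg("4", ["3", "3", "3"])  # uniform, but on the old version
empty = make_asg("4", [])                     # no instances at all

for name, asg in [("in_progress", in_progress), ("finished", finished),
                  ("not_started", not_started), ("empty", empty)]:
    print(name, first_check(asg), second_check(asg), refresh_complete(asg))
```

Only the finished group passes both checks. The not_started group passes the second check alone, and the empty group passes the first check alone, which is why the playbook requires both.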

Conclusion

If everything is set up correctly, then when a new AMI version appears and you run the Terraform code, the launch template version for the autoscaling group is updated and the instance refresh starts automatically. Terraform then runs the Ansible playbook with the parameters we specified, passing it the name of the autoscaling group.

The playbook polls the state of the ASG and the launch template versions of its instances, waiting until every instance reports the same version as the ASG's updated launch_template or until the retries run out.

The playbook shown here is fairly universal: it depends on a single input parameter, the name of the autoscaling group. It can therefore be reused in almost any environment and with almost any Terraform code without changes.

I hope this example of combining Terraform and Ansible will help someone build a more efficient and reliable service update system.
