ReleaseEngineering/How To/Work with Golden AMIs
Golden AMI
Background
Puppet used to be one of the bottlenecks for Releng infra, especially for EC2 instances.
Puppet requires DNS (A and PTR entries) to work properly. Having a static IP assigned to a spot instance required us to precreate network interfaces and specify them as part of a spot request. Even worse, to have an external IP assigned to a network interface we had to run an instance and terminate it, because the API does not allow assigning an external IP to a network interface directly.
This approach didn't scale for us:
- Adding new instances requires adding new network interfaces by creating and terminating temporary instances.
- Once a network interface is created it is bound to a subnet (and therefore to an Availability Zone). Because of this, slave names cannot be reused in different AZs with better spot prices.
- Regular puppet checks add load on the puppet infra. Puppet errors may bring down the whole infra easily.
Puppetless/DNSless Concept
To avoid the issues stated above we came up with the following architecture.
- Regularly create “golden” AMIs:
  - Puppetize a “base” AMI
  - Disable puppet
  - Strip host specific files
  - Generate a “golden” AMI
  - Copy the “golden” AMI to other regions
- Use the “golden” AMIs for spot instance requests
- Make sure an instance checks on boot that it uses a fresh AMI, and terminates itself otherwise.
How it works
Base AMI
Base AMIs are created manually with the create_ami.py script and published in the corresponding config files, e.g. configs/bld-linux64. These AMIs are used as the base image for “golden” AMIs.
Golden AMI
“Golden” AMIs are generated on aws-manager2 by multiple daily cronjobs; see e.g. modules/aws_manager/manifests/cron.pp
Instances used to generate “golden” AMIs use DNS to make puppet work and rely on proper DNS entries. To reduce the possibility of IP collisions they live in a different subnet and require --ignore-subnet-check as a parameter.
If the process is stuck for some reason, you need to terminate the instances and kill the process.
The generated AMIs are published to https://s3.amazonaws.com/mozilla-releng-amis/amis.json by scripts/aws_publish_amis.py, scheduled by modules/aws_manager/manifests/cron.pp. Running spot instances use the published file to determine which AMIs are available and whether they need to terminate themselves because their AMI is out of date.
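The freshness check described above can be sketched as follows. This is only an illustration, not the code the instances actually run: the layout of amis.json is not described here, so the record keys ami_id, moz_type, region and created, and the function name is_ami_fresh, are all hypothetical. The sketch also assumes the JSON has already been downloaded and parsed into a list of records.

```python
def is_ami_fresh(current_ami_id, moz_type, region, published):
    """Return True if current_ami_id is the newest published AMI
    for the given moz_type and region.

    `published` is a list of dicts with the HYPOTHETICAL keys
    "ami_id", "moz_type", "region" and "created" (a sortable timestamp).
    """
    candidates = [
        ami for ami in published
        if ami["moz_type"] == moz_type and ami["region"] == region
    ]
    if not candidates:
        # Nothing published for this type/region: there is no newer AMI
        # to compare against, so do not report the instance as stale.
        return True
    newest = max(candidates, key=lambda ami: ami["created"])
    return newest["ami_id"] == current_ami_id
```

An instance for which this returns False would be expected to terminate itself. Treating an empty candidate list as "fresh" is a design choice of this sketch, made so that a missing or incomplete published file does not take down every instance.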
Once a day scripts/delete_old_spot_amis.py, scheduled at modules/aws_manager/manifests/cron.pp#l71, deletes old AMIs, keeping only the 10 most recent ones.
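The retention rule (keep the latest 10, delete the rest) can be illustrated with a small selection function. This is a sketch under assumptions, not the logic of delete_old_spot_amis.py itself: the record keys ami_id and created and the function name select_amis_to_delete are hypothetical, and whether the real script applies the limit per instance type or per region is not stated here, so the sketch simply works on whatever list it is given.

```python
def select_amis_to_delete(amis, keep=10):
    """Return the AMIs that fall outside the `keep` newest ones.

    `amis` is a list of dicts with the HYPOTHETICAL keys "ami_id"
    and "created" (a sortable timestamp). The result is ordered
    from newest to oldest.
    """
    newest_first = sorted(amis, key=lambda ami: ami["created"], reverse=True)
    # Everything after the first `keep` entries is old enough to delete.
    return newest_first[keep:]
```

With fewer than `keep` AMIs in the list the function returns an empty list, so nothing would be deleted.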
Troubleshooting
Something is wrong with the new AMIs
See https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_spot_AMIs for details on how to determine which AMIs are in use, how to delete them, and how to terminate instances based on the broken AMIs.
How to determine the IP of a spot instance which is not in DNS
See https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_AWS_slaves. Running the script with the “status” sub-command shows the IP used by an instance.
python aws_manage_instances.py status b-2008-spot-006
2017-11-17 06:15:42,273 - INFO - Found b-2008-spot-006 (i-04ad9421626d050de)...
Name: b-2008-spot-006
ID: i-04ad9421626d050de
IP: 10.134.55.105
Enabled: True
State: running
Tags: Name -> b-2008-spot-006, moz-type -> b-2008, FQDN -> b-2008-spot-006.build.releng.use1.mozilla.com, moz-state -> ready
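If the IP is needed in another script, the status output can be parsed rather than read by eye. The following is a minimal sketch that assumes the output prints one "Key: value" field per line, as in the example above; parse_status is a hypothetical helper and not part of aws_manage_instances.py.

```python
import re

# A field line looks like "IP: 10.134.55.105": a single word, a colon,
# whitespace, then the value. Log lines starting with a timestamp do not match.
FIELD_RE = re.compile(r"^(\w+):\s+(.*)$")


def parse_status(output):
    """Parse "Key: value" lines from the status output into a dict."""
    fields = {}
    for line in output.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields
```

Given the captured output as a string, parse_status(output)["IP"] would then yield the instance's IP address.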
Something is happening to the spot instances
- Search for the slave name in the AWS console's Instances section. “State transition reason” describes the latest state change of the instance and may give a clue as to why the instance was terminated.
- Search for the slave name in the AWS console's Spot Requests section. It may contain multiple entries; choose the one that matches your time frame. “Status” and “Status message” contain some information about why the request was terminated.
How to force AMI generation
There are cronjobs on aws-manager2 responsible for this. Running the underlying scripts regenerates the “golden” AMIs and copies them to other regions.
The cronjobs can be listed by running
ls -lt /etc/cron.d/*golden*
The underlying wrapper scripts live in
/builds/aws_manager/bin/*golden*
To regenerate AMIs, run those scripts as the “buildduty” user. Consider using screen/tmux because it may take up to 2 hours to generate some AMIs.
e.g. screen -S "bld-linux64-ec2-golden"