ReleaseEngineering/How To/Manage AWS slaves
To simplify management of AWS slaves, you can use the aws_manage_instances.py script. It can stop, start, restart, and terminate instances, report their status, and enable or disable automatic reboot and automatic shutdown.
Usage
ssh into aws-manager2.srv.releng.scl3.mozilla.com as buildduty
ssh buildduty@aws-manager2.srv.releng.scl3.mozilla.com
set up the AWS environment (note: this now runs automatically when you log in as the buildduty user)
source /builds/aws_manager/bin/activate
cd /builds/aws_manager/cloud-tools/scripts
from /builds/aws_manager/cloud-tools/scripts, you can run aws_manage_instances.py with the following usage options. See below for examples.
usage: aws_manage_instances.py [-h] [-k SECRETS] [-r REGIONS] [-m COMMENTS]
                               [-n] [-q]
                               {stop,start,restart,enable,disable,terminate,status}
                               host [host ...]

positional arguments:
  {stop,start,restart,enable,disable,terminate,status}
                        action to be performed
  host                  hosts to be processed

optional arguments:
  -h, --help            show this help message and exit
  -k SECRETS, --secrets SECRETS
                        optional file where secrets can be found
  -r REGIONS, --region REGIONS
                        optional list of regions
  -m COMMENTS, --comments COMMENTS
                        reason to disable
  -n, --dry-run         Dry run mode
  -q, --quiet           Supress logging messages
Examples
Disable automatic reboots and start bld-linux64-ec2-001 (you may also want to disable it in slavealloc):
python aws_manage_instances.py disable -m "rail: need to debug XXX" bld-linux64-ec2-001
python aws_manage_instances.py start bld-linux64-ec2-001
Reboot it
python aws_manage_instances.py restart bld-linux64-ec2-001
Terminate it
python aws_manage_instances.py terminate dev-linux64-ec2-001
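When the same action has to be applied to several hosts, it can help to assemble the command line programmatically and try it in dry-run mode first. The following is a minimal sketch, not part of cloud-tools: the helper name `build_command` is hypothetical, and it only uses the actions and the `-m` and `-n` flags listed in the usage text above.

```python
import shlex

# Actions accepted by aws_manage_instances.py, copied from its usage text.
VALID_ACTIONS = {"stop", "start", "restart", "enable", "disable",
                 "terminate", "status"}


def build_command(action, hosts, comment=None, dry_run=False):
    """Return the argv list for one aws_manage_instances.py invocation.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    if action not in VALID_ACTIONS:
        raise ValueError("unknown action: %s" % action)
    if not hosts:
        raise ValueError("at least one host is required")
    argv = ["python", "aws_manage_instances.py", action]
    if comment:
        argv += ["-m", comment]   # reason to disable
    if dry_run:
        argv.append("-n")         # dry run mode
    argv.extend(hosts)
    return argv


def to_shell(argv):
    """Render an argv list as a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, `to_shell(build_command("restart", ["bld-linux64-ec2-001"], dry_run=True))` yields a command string that can be reviewed before it is pasted into the shell on aws-manager2.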
Secrets (credentials)
There are two ways to pass AWS credentials to authenticate yourself.
AWS_CREDENTIAL_FILE
The underlying library (boto) reads the AWS_CREDENTIAL_FILE environment variable, which holds the path to a file containing your credentials in the following format:
AWSAccessKeyId=xxx
AWSSecretKey=xxx
To use it, add the following command to your profile:
export AWS_CREDENTIAL_FILE=~/.ec2/aws-credential-file.txt
-k secrets.json
Create a JSON file with your credentials and pass it via the -k parameter. Example file:
{ "aws_access_key_id": "xxx", "aws_secret_access_key": "xxx" }
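If you already have a credential file in the AWS_CREDENTIAL_FILE format, the JSON variant can be derived from it. The sketch below assumes the two file layouts are exactly as shown above. The function name `credential_text_to_secrets` is hypothetical and not part of cloud-tools or boto.

```python
import json

# Maps AWS_CREDENTIAL_FILE keys to the keys used in the JSON secrets file.
KEY_MAP = {
    "AWSAccessKeyId": "aws_access_key_id",
    "AWSSecretKey": "aws_secret_access_key",
}


def credential_text_to_secrets(text):
    """Convert KEY=value credential text into a secrets dictionary."""
    secrets = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        key = key.strip()
        if key in KEY_MAP:
            secrets[KEY_MAP[key]] = value.strip()
    missing = sorted(set(KEY_MAP.values()) - set(secrets))
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return secrets


def secrets_to_json(secrets):
    """Serialize the secrets dictionary in the layout expected by -k."""
    return json.dumps(secrets, indent=2, sort_keys=True)
```

Write the result of `secrets_to_json(...)` to a file readable only by you, then pass that file with `-k`.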
AWS Sanity Check
Long Running Instances
- Dealing With A Long Running Instance
  - Check the machine's current status (is it actually running right now) by either:
    - Logging into the AWS web console, looking up the instance, and seeing if it is still running
      - [Note]: if you don't know the credentials for this, they probably have to be generated for you. Ask :catlee, as he has done this
    - Using releng cloud-tools from aws-manager2.srv.releng.scl3.mozilla.com
      - see usage for 'status' command above
    - ssh into the machine
  - Check when the latest build/activity was
    - If loaners/releng-dev machines:
      - ssh as root into that machine, and run `last`
      - find the bug that is associated with the instance and check the latest comments.
        - the bug number can also be found by looking at the instance tags in the AWS console
    - If it's one of our Buildbot CI machines:
      - use Slave Health or ssh into the machine and tail twistd.log
      - [Note]: these machines should not be running long. A machine is added to the long running list when it has been up for more than 2h. So if it's been idle for a while, further action will be required.
  - For instances that have not had any recent builds/activity and you are sure they are not currently doing a build
    - If loaners/releng-dev machines:
      - Poke the owner of the instance via the associated bug, checking if they still need the machine.
        - use judgement for what's fair, e.g. if it's been up for 24-48hrs, probably not cause for further action.
      - Store owner/usage detail in the moz-used-by instance Tag (if not already updated)
        - this is done by going into the AWS web console and filling in the section called "tags"
      - Decide whether to stop the instance for a period of time or reclaim + terminate the instance
        - 'stop' the instance if the owner wants to use it again soon but won't be working on it for a day or two
          - see usage for 'stop' command above
          - [Note]: this should be made appealing to the owner as turning it back on is *easy* and fast!
        - 'terminate' the instance if the owner has stated they are finished forever or if the bug is resolved
          - see Reclaiming Loaners
          - [Note]: don't forget to revert the vpn access bug and delete A/PTR records.
    - If it's one of our Buildbot CI machines:
      - Decide whether to stop or terminate the instance
        - if this is a spot instance:
          - terminate it (don't delete A/PTR records)
          - [Note]: they need to be terminated because spot instances don't really have a 'stopped' state.
        - if this is not a spot instance:
          - shut it down by one of:
            - see usage for 'stop' command above
            - logging into the AWS web console and choosing 'stop' in the dropdown
            - ssh in to the machine and: $ shutdown -h now
            - [Note]: stopping will allow aws_watch_pending to deal with deciding when it needs to be started up again
  - For repeatedly problematic instances, further action will be required. Ask in #releng and possibly escalate to catlee/rail
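The stop-versus-terminate decision described above, for instances that are confirmed idle, can be summarised as a small function. This is only a sketch of the written procedure: the function name and the "loaner" and "buildbot" labels are hypothetical, and the judgement calls (how long is fair, whether the owner has replied) still need a human.

```python
def recommend_action(machine_type, is_spot=False, owner_finished=False):
    """Suggest 'stop' or 'terminate' for an idle long-running instance.

    machine_type: "loaner" (loaner/releng-dev) or "buildbot" (Buildbot CI).
    is_spot: only relevant for Buildbot CI machines.
    owner_finished: only relevant for loaners; True when the owner is done
    with the machine or the associated bug is resolved.
    """
    if machine_type == "loaner":
        # Stopping is cheap to undo, so prefer it while the owner may return.
        return "terminate" if owner_finished else "stop"
    if machine_type == "buildbot":
        # Spot instances have no real 'stopped' state, so terminate them.
        return "terminate" if is_spot else "stop"
    raise ValueError("unknown machine type: %s" % machine_type)
```

Terminating a loaner additionally requires the cleanup steps noted above (reverting VPN access and deleting A/PTR records), which this sketch does not model.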
Unknown Type Or State Instances
- Most of these are created by us. Track down who made the instance and request that either:
  - the instance be tagged properly via the AWS web console, by filling in the section called "tags"
  - the reporting be fixed
Stopped For A While Instances
- If loaners/releng-dev machines:
  - when it has been more than ~2 weeks (say >300 hrs), poke the owner in the associated bug, asking if it is OK to terminate this instance.
- If it's one of our Buildbot CI machines:
  - when it has been more than ~1 month (say >700 hrs), we should terminate the instance
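The thresholds above can be written down as a simple check. This is a sketch under the assumption that the number of hours an instance has been stopped is already known (for example from the sanity check report). The function name, the machine type labels, and the returned strings are hypothetical.

```python
# Approximate thresholds taken from the guidance above.
LOANER_STOPPED_LIMIT_HOURS = 300    # roughly 2 weeks
BUILDBOT_STOPPED_LIMIT_HOURS = 700  # roughly 1 month


def stopped_instance_action(machine_type, stopped_hours):
    """Suggest a follow-up for an instance that has been stopped for a while."""
    if machine_type == "loaner":
        if stopped_hours > LOANER_STOPPED_LIMIT_HOURS:
            return "ask owner in bug whether it can be terminated"
        return "leave as is"
    if machine_type == "buildbot":
        if stopped_hours > BUILDBOT_STOPPED_LIMIT_HOURS:
            return "terminate"
        return "leave as is"
    raise ValueError("unknown machine type: %s" % machine_type)
```

Since the limits are given as approximations ("say >300 hrs"), treat the output as a prompt to look at the instance rather than as a rule to apply mechanically.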