User:Hwine/Holiday VCS-Sync Troubleshooting
Context
Modern vcs-sync has been running in both releng & devservices AWS accounts for several months without issues. While this gives high confidence in the reliability of the new (dev services) instances, it makes troubleshooting more difficult. Therefore, I'm seizing the pre-holiday lull to disable the older instances to simplify things prior to my PTO.
This guide gives reasonable shortcuts to apply, if anything should go sideways. System details are in mana (start here).
Current System Layout
- Legacy vcs-sync is not affected - see its docs.
- Everything related to modern vcs-sync is now running on a single AWS instance in the dev services account.
- Only folks in dev services have access to the machines
- Machines only accept connections from mozilla hosts (e.g. people).
- The releng instances (2) are powered off and awaiting deletion (bug 1244790).
- The conversion jobs on both instances are configured by committing to the production branch of MozHarness
- System status is reported via email to a google group
Troubleshooting Modern 101
- Check the logs first. Only errors are reported - no news is good news. (Logs can only be accessed via a Mozilla google account.)
- If any branch on either gecko-dev or gecko-projects is updating properly, it's not a system issue. (See below for per repo or branch issues.)
- If no branch in either gecko-dev or gecko-projects is updating, there may be a system issue. Next steps:
- Check dev services instance for anything obvious (dev services team member needed)
- If nothing obvious, ask a releng team member to uncomment the crontab entries on hosts vcssync[12] for user vcs2vcs. The first run will take longer to "catch up".
- If this fixes things, comment out the crontab lines on the dev services box (dev services team member). File a bug and need-info :hwine.
- If still broken at this point, you're in brand new territory. Get a stiff drink and stare at it. (Page hal.)
That's it. Anything else is new territory that hasn't happened in several years of operation. Modern vcs-sync is based on MozHarness, and releng folks should be able to assist. My suggestion would be to cut completely back over to the releng instances, so all access is in one group.
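The first triage decision above (system issue versus single repo or branch issue) can be written down as a small check. This is only a sketch of the rule as stated in this guide, not part of vcs-sync itself: the function name `classify`, the `FIRST_ACTION` table, and the branch names in the usage lines are hypothetical, and the observations are entered by hand rather than fetched from any service.

```python
# First thing to do for each class of problem, paraphrased from the list above.
FIRST_ACTION = {
    "repo-or-branch": "not a system issue; diagnose the single repo or branch",
    "system": "check the dev services instance for anything obvious "
              "(dev services team member needed)",
}


def classify(updating):
    """Classify a problem from per-branch observations.

    `updating` maps a (repo, branch) pair to True when that branch is
    syncing properly. The values come from whoever is doing the triage;
    nothing here talks to a real service.
    """
    if not updating:
        raise ValueError("need at least one branch observation")
    # Any healthy branch on either repo rules out a system-wide problem.
    if any(updating.values()):
        return "repo-or-branch"
    return "system"


# Hypothetical branch names, for illustration only.
observations = {
    ("gecko-dev", "branch-a"): False,
    ("gecko-projects", "branch-b"): True,
}
print(FIRST_ACTION[classify(observations)])
```

Only when every observed branch on both repos is stalled does the sketch report a possible system issue, which matches the "no news is good news" stance of the guide.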
Diagnosing Single Repo Issues
The most common question about vcs-sync is "how come my gaia commit X hasn't started build Y?" Most often, this is not a vcs-sync issue, but an issue of impatience or a problem elsewhere. This section will help to determine the location of "elsewhere" (no, not Boston).
Here's what is currently involved to get from gaia commit to gecko build start:
That is a long pipeline, and a simple change can take ~40 minutes to traverse (worst case if all the polling gates are missed). Longer for large commits, such as merges.
There are a number of publicly visible places that can be checked along the way. Fortunately, commit messages are usually unique enough to avoid dealing with the changing hash values. Pay attention to the branches!
Tracing Progress
- Verify commit on github, and note branch.
- Look for commit on git.mozilla.org and note branch
- If not here after ~40 minutes, it may not have been a "fast forward" commit. Partner rules allow only fast-forward commits. The fix is for the dev to redo the commit appropriately, so that all commits are fast-forward.
- Look for converted commit in appropriate-for-gaia-branch integration repo on hg.mozilla.org.
- If not here within a reasonable time, there may be a vcs-sync issue. Talk to folks in #vcs.
- Look for commit from b2g_bumper in matching gecko hg repository, file "b2g/config/gaia.json"
- If the tree is closed for that repo, that's the reason there is no commit or build. Once the tree is reopened, things should progress. If not, ask for help from a sheriff.
- If not there, there may be a b2g_bumper issue. Talk to folks in #releng.
- If a matching build doesn't start after the commit from bumper, check with a sheriff.
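The tracing steps above amount to walking the pipeline in order and stopping at the first place the commit is missing. A minimal sketch of that bookkeeping follows. The stage labels, the `STAGES` table, and the `first_missing` helper are hypothetical names introduced for illustration, and the set of stages is filled in by hand after checking each site; the sketch does not query github, git.mozilla.org, or hg.mozilla.org.

```python
# (stage, who to talk to if the commit never shows up there), in pipeline order.
STAGES = (
    ("github", "the committer: verify the commit and note the branch"),
    ("git.mozilla.org", "the developer: redo the commit as a fast-forward"),
    ("hg.mozilla.org integration repo", "#vcs"),
    ("b2g_bumper commit to b2g/config/gaia.json",
     "a sheriff if the tree is closed, otherwise #releng"),
    ("build start", "a sheriff"),
)


def first_missing(seen):
    """Return (stage, contact) for the earliest stage lacking the commit.

    `seen` is the set of stage labels where the commit message was found
    by hand. Returns None when the commit was found at every stage.
    """
    for stage, contact in STAGES:
        if stage not in seen:
            return stage, contact
    return None


# Example: the commit reached git.mozilla.org but not the hg integration repo.
print(first_missing({"github", "git.mozilla.org"}))
```

Matching on the commit message rather than the hash, as suggested earlier in this section, is what makes the hand-checked `seen` set practical, since the hash changes during conversion.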