ReleaseEngineering/Meeting Notes/2010-02-08 - Frustrations Deploying Services
What's bothering us right now
- blocked on making things like self serve public because we first want a way to deploy them
- DNS and hosting fixes - no one is taking them on
- have things to deploy and can't
- overall there's a feeling that "things aren't the way we want them to be"
- no one has time to work on things - we all feel this
- It's hard for new hires to learn what's already in place
- What the systems are, how they work, and how they interrelate is hard to track down
- Someone who's new to this doesn't have a central picture to predict conflicts on new work
What would help right now?
- sound description of each system (app store is beginning of that)
- tips on how to start the discussion on where to get new work up and running (what db, host, etc)
- process for how we create new systems - and determine the 'ilities': scalability, maintainability, reliability, etc
Suggestions/Questions/Comments
- would it help to be writing down the questions to create the template for what kind of docs/info we need to gather?
- how much of this is inherent in the system? complicated/moving parts/imperfect software & hardware
- a presumed one-line patch can turn into a week stuck on something, so the 'how long will this take' estimates can be off
- this is endemic to something we've all been picking at
- we can no longer be a loose collection of engineers working on their thing
- because of size of team and scope of problems, we are hitting the communication issue of a larger team
- a single person can no longer just do everything alone
- we need specializations (e.g., Dustin is already specializing away from the master side of buildbot)
- there's a lot of stuff that someone like catlee, for example, knows that is not shared knowledge: how to access the db, how to use talos monitoring
- need technical measures that test/ensure fault tolerance and isolation - e.g., Netflix Chaos Monkey
- someone might send email when a system first comes out, but what kind of things do we need to document and make sure are easy to find over time?
- still missing something to get things done
- example for right now: self serve - is deploying it in its current state a mistake?
- we have neither the time nor a known method for auditing a new tool or system to know whether it is ready for production
Action items
- looking into scalability of self serve (hosting)
- everyone contribute to docs about the systems we use and how they inter-operate with other tools/systems - ReleaseEngineering/Applications