Data in the Open
Rationale
The value of working in the open is at the core of Mozilla. This is clearly outlined in section 7 (“Free and open source software promotes the development of the internet as a public resource”) and section 8 (“Transparent community-based processes promote participation, accountability and trust.”) of the Mozilla Manifesto. In the Data organization, we see the positive effects of these principles in our work:
- Open practices (like the Data Stewardship model) build trust with our users: this provides us social license to explore new areas like personalized search results and other contextual services.
- Our involvement in the Outreachy and Google Summer of Code programs has helped us build out a first-class data collection and analysis platform, improving the user experience of about:telemetry, metadata tools like the Glean Dictionary, and others.
- Having the source to our systems publicly available can, in many cases, make working with partners easier since it obviates the need to sign NDAs and give privileged access to our systems.
That said, Data has some unique characteristics that make living up to these principles challenging: there are privacy, legal, and business reasons why we can’t just release all of the data Mozilla collects (or processes) to the public in unaggregated form. While Mozilla goes out of its way not to collect personally identifiable information about our users when not required, in practice some of the data sets could be correlated to a particular individual given enough external resources and effort. Releasing this information carelessly is both a moral and a legal risk. On the business side, releasing certain types of data (for example, client-level search count information) would compromise Mozilla’s ability to compete in a cut-throat marketplace. Still, this is no reason to give up entirely. There is a balance between our core principles and these realities, and it tilts strongly towards openness.
This document attempts to codify Mozilla Data’s existing practices around working in the open. It is intended to be a resource we can draw on when making a decision on a technology choice, working group setup, or any other decision about what we work on and how we work on it.
Best Practices
Software and Tools
We use open tools for software development wherever possible and reasonable. Examples of such tools include:
- Public GitHub repositories in the Mozilla organization
- Issue tracking using either GitHub issues (see above) or bugzilla.mozilla.org
- Open source software, such as Python and JSONSchema
It is acceptable to use private repositories or issue tracking tools (for example: JIRA) given a compelling business need or strong legal case, but this should be the exception with appropriate justification provided (e.g. inside a proposal for a new project, or inside the GitHub repository itself).
Private repositories and Mozilla-internal-only tools make collaboration with external contributors (for example, Outreachy interns) essentially impossible: take extra care to avoid using them if this is something you want to allow, either now or in the future.
Communication
Likewise, we favour open tools for communicating about our work where possible and convenient. Examples of this include:
- Public working groups documented on wiki.mozilla.org/Data
- Matrix for synchronous chat (public by default: end-to-end encrypted rooms may be used to discuss matters under NDA, which enables collaboration with a wider group of contributors)
- Public mailing lists using Google Groups or Mozilla’s instance of Discourse for asynchronous discussions: these tools are more inclusive of people in different timezones.
- World-readable Google documents for proposals (linked to inside the appropriate working group)
As of this writing, most discussion on Data topics happens on Slack, which is a closed system. Slack’s support for threaded messaging is frequently cited as a strong technical reason to use this platform (the corresponding gap on the Matrix side may be fixed in the future: vector-im/element-web#2349). For pragmatic reasons, this is OK, but bear in mind that conversations conducted on this platform exclude community involvement and make it more difficult to provide context to others about our work. We see Matrix as the future of open communication at Mozilla: the existing rooms around Glean (#glean, #glean-dev, and #glean-dictionary) provide good examples of how this can work well in practice.
Open Datasets
In addition to publishing the source code to our data collection and analysis systems, Mozilla also releases a small set of aggregated datasets of interest to the public, along with a public data report (which has proven useful for answering both internal and external questions about the number and characteristics of Firefox users). Publishing our datasets and the dashboards we use to visualize them is perhaps the strongest way we can signal to the public that we think about this area differently, and it empowers the entire Mozilla community to make data-informed decisions. See our data publishing policy for information on how we do this without compromising user privacy.