CA/Responding To An Incident

From MozillaWiki

Please go to https://www.ccadb.org/cas/incident-report for detailed information about reporting compliance incidents.

(Researchers who report CA incidents such as misissuances are welcome to include a link to that page in their report to the CA, reminding the CA of Mozilla's expectations for incident reporting.)

This page provides supplemental information on Mozilla's expectations regarding the handling of compliance incidents, incident reporting, remediation, and communication. It gives guidance to CAs as to how Mozilla expects them to react to reported incidents such as misissuances, and what the best practices are.

Overview

An incident arises any time a CA fails to comply with an applicable requirement found in the Mozilla Root Store Policy, the CA/Browser Forum's requirements, or the CCADB's requirements. As noted in section 2.4 of the Mozilla Root Store Policy, a compliance incident can arise from certificate misissuance, delayed revocation, procedural or operational issues, or some other cause.

A "misissuance" is defined as any certificate issued in contravention of any applicable standard, process or document - so it could be RFC non-compliant, BR non-compliant, issued contrary to the CA's CP/CPS, or have some other flaw or problem.

Sometimes our guidance is framed in terms of misissuance of certificates; it will need to be adapted as necessary for incidents of a different nature, respecting the spirit of the information requests contained in the standard incident-reporting template.

Other examples of incidents include misconfigured CRLs and OCSP responders, delayed responses, failures to properly communicate information, and any other event affecting trust in the WebPKI which does not involve the actual contents of certificates.

While some forms of incident may be seen as less serious than others, opinions may vary. Mozilla sees all incidents as good opportunities for CA operators to confirm that their incident response processes are working well, and so we expect a similar level of timeliness of response and quality of reporting for all incidents, whatever their adjudged severity.

To be clear, the incident reporting template and incident-reporting process provide a set of best practices. Failure to follow one or more of the recommendations is therefore not, by itself, sanctionable. However, failure to do so without good reason may affect Mozilla's general opinion of the CA. Our confidence in a CA is in part affected by the number and severity of incidents, but it is also significantly affected by the speed and quality of incident response.

Immediate Actions

In misissuance cases, a CA should almost always immediately cease issuance from the affected part of its PKI. In situations not involving misissuance, there also may be processes that need to be stopped until the CA has diagnosed the source of the problem.

Once the problem is diagnosed, if the CA is able to put in place temporary or manual procedures to prevent the problem from re-occurring, it may restart the process even if a full fix is not rolled out. CAs should not restart affected processes until they are confident that the problem will not re-occur.

An initial report should be filed within 72 hours of the CA becoming aware of the incident. See https://www.ccadb.org/cas/incident-report#incident-reports
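
As a minimal sketch of how a CA might track this window internally, the following computes the latest filing time from the moment of awareness. The function names are hypothetical, and the choice of the clock's starting point is an assumption that each CA should check against the CCADB guidance linked above.

```python
from datetime import datetime, timedelta, timezone

# The 72-hour figure comes from the guidance above.
# Assumption: the clock starts when the CA is first made aware of the incident.
INITIAL_REPORT_WINDOW = timedelta(hours=72)


def initial_report_due(aware_at: datetime) -> datetime:
    """Return the latest UTC time at which the initial report should be filed."""
    if aware_at.tzinfo is None:
        # Naive timestamps are rejected to avoid silent time zone mistakes.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at.astimezone(timezone.utc) + INITIAL_REPORT_WINDOW


def is_initial_report_late(aware_at: datetime, now: datetime) -> bool:
    """True once the 72-hour window has passed."""
    return now > initial_report_due(aware_at)
```

Working in UTC throughout avoids ambiguity when the people handling the incident are in different time zones.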

Revocation

It is normal practice for CAs to revoke misissued or otherwise problematic certificates. That leaves the question of when this should be done, particularly if it is not possible to contact the customer immediately, or if the customer is unable to replace their certificate quickly. CAs should ensure that they are complying with Sections 4.9.1 through 4.9.5 of the CA/Browser Forum’s Baseline Requirements.

This means that, in most cases of misissuance, the CA has an obligation under the BRs to revoke the certificates concerned within 24 hours, or 5 days in some cases.
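
The two revocation windows can be sketched as a small deadline helper. This is an illustration only: the names are hypothetical, and which window applies and when its clock starts are determined by the Baseline Requirements, so the caller is assumed to supply both.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class RevocationWindow(Enum):
    """The two windows mentioned above. Which one applies is decided by the BRs."""
    HOURS_24 = timedelta(hours=24)
    DAYS_5 = timedelta(days=5)


def revocation_deadline(trigger_at: datetime, window: RevocationWindow) -> datetime:
    """Latest revocation time. trigger_at is the start time the BRs define
    for the case at hand; it is an input here, not something this sketch decides."""
    return trigger_at + window.value


def hours_remaining(trigger_at: datetime, window: RevocationWindow, now: datetime) -> float:
    """Hours left before the deadline; negative once the deadline has passed."""
    return (revocation_deadline(trigger_at, window) - now).total_seconds() / 3600
```

A negative result from `hours_remaining` is a signal that a delayed-revocation incident needs to be handled.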

Mozilla recognizes that in some exceptional circumstances, revoking the affected certificates within the prescribed deadline may cause significant harm, such as when the certificate is used in critical infrastructure and cannot be safely replaced prior to the revocation deadline, or when the volume of revocations in a short period of time would result in a large cumulative impact to the web. However, Mozilla does not grant exceptions to the BR revocation requirements. It is our position that your CA is ultimately responsible for deciding if the harm caused by following the requirements of the Baseline Requirements outweighs the risks that are passed on to individuals who rely on the web PKI by choosing not to meet this requirement.

If your CA will not be revoking the certificates within the time period required by the BRs, our expectations are that:

  • A separate incident report will be filed in Bugzilla.
  • The decision and rationale for delaying revocation will be disclosed in the form of a preliminary incident report immediately, preferably before the BR-mandated revocation deadline. The rationale must include detailed and substantiated explanations for why the situation is exceptional. Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable. When revocation is delayed at the request of specific Subscribers, the rationale must be provided on a per-Subscriber basis.
  • Any decision to not comply with the timeline specified in the Baseline Requirements must also be accompanied by a clear timeline describing if and when the problematic certificates will be revoked or expire naturally, and supported by the rationale to delay revocation.
  • The issue will need to be listed as a finding in your CA’s next BR audit statement.
  • Your CA will work with your auditor (and supervisory body, as appropriate) and the Root Store(s) that your CA participates in to ensure your analysis of the risk and plan of remediation is acceptable.
  • You will perform an analysis to determine the factors that prevented timely revocation of the certificates, and include a set of remediation actions in the final incident report that aim to prevent future revocation delays.

If your CA will not be revoking the problematic certificates as required by the BRs, then we recommend that you also contact the other root programs that your CA participates in to acknowledge this non-compliance and discuss what expectations their Root Programs have with respect to these certificates.

Follow-Up Actions

  • Work out how the bug or problem was introduced. For a code bug, were the code review processes sufficient? Does your code have automated tests, and if so, why did they not catch this case?
  • Work out why the problem was not detected earlier. Were these certificates missed by your linting processes or self audits? Or is the code or process you use for those checks insufficient?
  • If the problem is non-compliance with an RFC, Baseline Requirement, or Mozilla Policy requirement: were you aware of this requirement? If not, why not? If so, was an attempt made to meet it? If not, why not? If so, why was that attempt flawed? Do any processes need updating to make sure your CA complies with the latest version of the various requirements placed upon it?
  • Scan your corpus of certificates to look for others with the same issue. It does not look good for a CA to claim they have revoked all affected certificates and resolved the issue, and then for a researcher to discover another set of certificates with the same or a similar problem.
  • Examine whether there are potential related problems which you can also remediate at the same time. For example, if the problem was bad data in a particular field, consider improving the validation of all fields in the certificate prior to issuance. You should be proactively looking for ways, such as pre-issuance lint testing, to harden your issuance pipeline against further problems.
  • If, as happens in a regrettably large number of cases, a problem report was sent to your CA but action in accordance with BR section 4.9.5 was not taken within 24 hours, investigate what happened to that report and whether your report handling processes are adequate.
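
The corpus scan and pre-issuance lint steps above can share the same checks. The sketch below assumes certificates have already been parsed into flat dictionaries of field names to string values. That record shape, the field names, and the two example lints are all hypothetical; a real pipeline would parse the certificates themselves and use an established linter.

```python
from typing import Callable, Iterable

# Hypothetical flattened view of a certificate: field name -> string value.
CertRecord = dict[str, str]
# A lint returns human-readable findings; an empty list means "no problem found".
Lint = Callable[[CertRecord], list[str]]


def lint_no_surrounding_whitespace(cert: CertRecord) -> list[str]:
    """Illustrative lint: flag any field with leading or trailing whitespace."""
    return [f"{field}: leading or trailing whitespace"
            for field, value in cert.items() if value != value.strip()]


def lint_no_empty_fields(cert: CertRecord) -> list[str]:
    """Illustrative lint: flag any field whose value is empty."""
    return [f"{field}: empty value" for field, value in cert.items() if value == ""]


LINTS: list[Lint] = [lint_no_surrounding_whitespace, lint_no_empty_fields]


def run_lints(cert: CertRecord, lints: Iterable[Lint]) -> list[str]:
    """Collect the findings of every lint for one certificate."""
    findings: list[str] = []
    for lint in lints:
        findings.extend(lint(cert))
    return findings


def pre_issuance_check(cert: CertRecord, lints: Iterable[Lint]) -> bool:
    """Gate for the issuance pipeline: True only if no lint reports a finding."""
    return not run_lints(cert, lints)


def scan_corpus(corpus: Iterable[CertRecord], lints: list[Lint]) -> dict[str, list[str]]:
    """Map serial number -> findings for every already-issued certificate with a problem."""
    results: dict[str, list[str]] = {}
    for cert in corpus:
        findings = run_lints(cert, lints)
        if findings:
            results[cert["serial"]] = findings
    return results
```

Running the same lint list in both places means that a check added after an incident protects new issuance and also finds other affected certificates in the existing corpus.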

Incident Report

For guidance on incident reporting, first visit https://www.ccadb.org/cas/incident-report.

Your CA must submit an incident report by creating a bug in Bugzilla under the CA Program :: CA Certificate Compliance component. When the incident is reported only on the CCADB public list or on the MDSP mailing list, then a bug will be created to track the incident and its resolution in Bugzilla. CAs are encouraged to announce important incidents on public@ccadb.org when they involve the Baseline Requirements, other root programs, or the CCADB; or on the Mozilla dev-security-policy list, when they only involve violations of the Mozilla Root Store Policy.

The incident report should use the markdown template provided on the CCADB website:

https://www.ccadb.org/cas/incident-report#incident-report-template

Keeping Us Informed

Once the report is posted, you should respond promptly to questions that are asked. In no circumstances should a question linger without a response for more than one week, even if the response only acknowledges the question and gives a later date by which an answer will be delivered. You should also provide updates on your progress at least every week, and confirm when the remediation steps have been completed. The weekly cadence does not apply if a root store representative has agreed to a different schedule by setting a “Next Update” date in the “Whiteboard” field of the bug, or has announced that they are considering closing the bug and no further comments have been posted. Updates to important incidents (see e.g. https://www.ccadb.org/cas/public-group#lessons-learned-from-ca-incident-reports) should be posted to either the CCADB Public list or the MDSP mailing list, as well as to the Bugzilla bug. The bug will be closed when remediation is completed.
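
A CA could track these cadence expectations with a small helper such as the sketch below. The function names and the shape of the inputs are assumptions for illustration. The sketch does not read Bugzilla, so the agreed “Next Update” date is passed in by hand.

```python
from datetime import date, timedelta
from typing import Optional

# One-week limit taken from the expectations described above.
WEEK = timedelta(days=7)


def next_update_due(last_update: date, agreed_next_update: Optional[date] = None) -> date:
    """Date by which the next progress update is expected.

    If a root store representative has set a "Next Update" date, that date
    takes the place of the default weekly cadence.
    """
    if agreed_next_update is not None:
        return agreed_next_update
    return last_update + WEEK


def overdue_questions(asked_on: dict[str, date], today: date) -> list[str]:
    """Identifiers of unanswered questions that have waited more than one week.

    asked_on maps a question identifier (hypothetical, e.g. a comment number)
    to the date on which the question was asked.
    """
    return sorted(q for q, asked in asked_on.items() if today - asked > WEEK)
```

For example, `next_update_due(date(2024, 3, 1))` gives 8 March 2024, while passing an agreed date returns that date unchanged.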

Examples of Good Practice

Here are some examples of good practice.

Let's Encrypt: keyCompromise key blocking deviation from CP/CPS

https://bugzilla.mozilla.org/show_bug.cgi?id=1886876

  • Clear indication of Preliminary and Full Incident Reports.
  • Detailed timeline that identifies all policy, process, and software changes that contributed to the root cause, and an indication of when the incident began and ended.
  • Detailed Root Cause Analysis that offers background on the various conditions that gave rise to the issue.
  • Timely updates in response to questions posed, continued analysis, and changes to Action Items.

Google Trust Services: Failure to properly validate IP address

https://bugzilla.mozilla.org/show_bug.cgi?id=1876593

  • Significant amount of background information that informs the timeline of the incident.
  • Clear identification, in the Root Cause Analysis, of the factors that contributed to the incident, noting how many of them avoided detection.
  • Action Items that prevent, mitigate, and detect what didn’t go well.
  • Timely and detailed updates conveying Action Item status.

HARICA: Anomaly in OCSP services after CA software upgrade

https://bugzilla.mozilla.org/show_bug.cgi?id=1878106

  • Clear Summary that provides just enough context for new readers to understand the rest of the report.
  • Effective use of the “5 Whys” Root Cause Analysis methodology where “why” is asked as many times as necessary to identify the root cause of the incident.
  • Action Items that prevent and detect what didn’t go well.
  • Timely updates in response to questions posed and changes to Action Items.