Data/WorkingGroups/CrashReporting/Status2021
Crash Reporting Status 2021
Every month, the coordinator sends an email asking for status updates from teams and projects working on crash reporting. Details on this process are at Data/WorkingGroups/CrashReporting#Monthly_status_rollup.
Updates are compiled into a newsletter, which is sent to mailing lists and posted here.
Crash Reporting Headlines (August 12th, 2021)
Quick Summary
- Windows Error Reporting crash collection and macOS crash handler improvements
- Denoting more OOM crash reports as OOMs in the crash signature
Details
Completed
- Crash Stats: flagging additional crash reports as OOMs
- bug 1716742: flag ERROR_COMMITMENT_LIMIT as OOMs
- bug 1723474: flag WER windows crashes with a reason set to STATUS_FATAL_MEMORY_EXHAUSTION or STATUS_NO_MEMORY as OOMs
- Crash reporter: WER improvements
- Windows Error Reporting is fully functioning across all processes (bug 1697895 and bug 1682518). It flags OOM crashes correctly (bug 1711418), and the reports carry a special WindowsErrorReporting annotation that lets you tell them apart from the rest (bug 1703761). Capturing hangs has also been disabled (bug 1718226).
- Crash reporter: macOS crash handler improvements
- The macOS crash handler has been modernized and now properly reports 64-bit crashes (bug 1035892). Among other things, this makes use-after-free (UAF) crashes on arm64 macOS builds immediately obvious, because the poison pattern appears as the crashing address.
- macOS crashes now have thread names correctly populated (bug 1658831)
- An infamous main process crash while capturing the minidump of a child process has been fixed on macOS (bug 1723941).
- Crash reporter: native thread names support for minidumps from Linux
- Martin Sirringhaus implemented native thread name support in Linux minidumps for child processes (bug 1714465). Main process crashes still rely on the old machinery.
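The OOM flagging described above (bug 1723474) can be pictured with a minimal sketch. This is not Socorro's actual signature generation code: the function name, its parameters, and the "OOM | " prefix format are assumptions made for illustration. Only the two reason values come from the text.

```python
# Hypothetical sketch, not Socorro's real signature generation rule.
# The two reason strings are the ones named in bug 1723474.
OOM_REASONS = {"STATUS_FATAL_MEMORY_EXHAUSTION", "STATUS_NO_MEMORY"}

# Assumed prefix used to mark a signature as an out-of-memory crash.
OOM_PREFIX = "OOM | "


def flag_oom(signature: str, reason: str, is_wer: bool) -> str:
    """Prefix the signature when a WER Windows crash has an OOM reason."""
    if is_wer and reason in OOM_REASONS and not signature.startswith(OOM_PREFIX):
        return OOM_PREFIX + signature
    return signature


print(flag_oom("mozalloc_abort", "STATUS_NO_MEMORY", is_wer=True))
```

Marking these reports in the signature itself means they group with other OOM crashes in signature reports instead of being scattered across unrelated signatures.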
In progress
- All: Rust rewrite of all things breakpad
- rust minidump-stackwalk:
- https://github.com/luser/rust-minidump/tree/master/minidump-stackwalk
- https://github.com/luser/rust-minidump/issues/153
- You can now install and test rust-minidump minidump-stackwalk
- Same CLI as existing minidump-stackwalk that Socorro uses. Outputs the same JSON schema.
- Work progresses.
- Tecken: new symbolication API microservice
- API url: https://symbolication.stage.mozaws.net/symbolicate/v5
- If you do any symbolication work, I'd love to know how it works for you and whether you encounter any issues.
- Work progresses on hardening the service. Will likely go to production in September 2021.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1636210
- Socorro: Better signature generation for Java crash reports (on hold)
- https://bugzilla.mozilla.org/show_bug.cgi?id=1541120
- This is blocked on impact analysis and design work to figure out the details of how it should work.
- Let us know how better signature generation for Java crash reports helps you in the bug comments.
- This work is on hold for now.
- Symbols: improving process for acquiring symbols for macOS Big Sur
- https://bugzilla.mozilla.org/show_bug.cgi?id=1683758
- This enables profiles collected on beta versions of macOS with the Firefox profiler to have symbolicated system libraries and will improve stacks in crash reports for beta versions of macOS.
- Progress on this is ongoing.
Crash Reporting Headlines (June 4th, 2021)
Quick Summary
- Acquired symbols for openSUSE 15.3
- Improving our support for acquiring macOS symbols for release and beta builds
- Work continues on WER support in crash reporter around hang reports and child processes
- Improved capturing annotations for OOM crashes and re-enabled grabbing memory reports
Details
Completed
- Crash Stats: fixed indexing for fields that don't show up in all crash reports
- https://bugzilla.mozilla.org/show_bug.cgi?id=1706171
- This fixes searching and aggregating with fields like phc_kind, adapter_driver_version, and others that don't show up in all crash reports and so were periodically left out of the index.
- Crash Stats: fixed sort-by-address in signature report tables
- https://bugzilla.mozilla.org/show_bug.cgi?id=1032227
- We changed minidump-stackwalk to left-pad all memory address values. This gives them all the same width, so sorting them as strings now produces the correct order. While doing this, I fixed some other sorting issues in signature report tables.
- Socorro: improved cpu_arch in processing
- https://bugzilla.mozilla.org/show_bug.cgi?id=1710854
- We improved processing so the cpu_arch field has a value for Fenix crash reports. If it can't find a value, then it sets cpu_arch to "unknown" rather than the empty string. This is a better value for searching and aggregating.
- Socorro: minidump-stackwalk prints readable values for NTSTATUS or winerror.h results in Windows minidumps
- Tecken: support CORS preflight in Eliot for symbolication API
- https://bugzilla.mozilla.org/show_bug.cgi?id=1713667
- Added CORS preflight headers so that the new symbolication API on stage can be used by web apps.
- Crash reporter: temporarily changed Gecko to stop grabbing hang reports with WER
- Crash Stats: display and support basic searching/aggregation of mac_crash_info data
- https://bugzilla.mozilla.org/show_bug.cgi?id=1709658
- Thank you, Steven Michaud!
- Crash reporter: re-enabled grabbing memory reports
- Crash reporter: removed BIOS_Manufacturer and MemoryErrorCorrection crash annotations
- https://bugzilla.mozilla.org/show_bug.cgi?id=1710152
- The former was largely unused and we'll reintroduce the latter in a way that doesn't cause external code to be injected into Gecko.
- Crash reporter: improved mechanism for recording allocations that lead to OOM crashes
- https://bugzilla.mozilla.org/show_bug.cgi?id=1683288
- This makes sure almost all of the crash reports have the annotation properly populated.
- Symbols: started scraping debug information for openSUSE 15.3 builds
- pdb-addr2line: crate published
- https://github.com/mstange/pdb-addr2line
- Lets you easily obtain function names, inline callstacks, and file + line information based on addresses from PDB files, similar to the Linux+macOS addr2line tool. I will be using the pdb-addr2line crate in the profiler. Part of the code in this crate was imported from the dump_syms code: The TypeFormatter helper is based on the dump_syms TypeDumper code.
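The sort-by-address fix listed above (bug 1032227) relies on left-padding address values so that string sorting matches numeric order. A minimal sketch of the idea follows. The helper name and the 16-digit width are assumptions for illustration, not the actual minidump-stackwalk change.

```python
def pad_address(address: str, width: int = 16) -> str:
    """Zero-pad a hex address such as '0x9' to a fixed number of digits.

    The default width of 16 hex digits (64 bits) is an assumption.
    """
    digits = address[2:] if address[:2].lower() == "0x" else address
    return "0x" + digits.zfill(width)


addresses = ["0x10", "0x9", "0x100"]

# Plain string sorting puts '0x9' last, which is numerically wrong.
print(sorted(addresses))  # ['0x10', '0x100', '0x9']

# Sorting by the padded form gives the numeric order.
print(sorted(addresses, key=pad_address))  # ['0x9', '0x10', '0x100']
```

Padding at the point where the values are emitted means the table code can keep sorting plain strings without parsing the addresses.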
In progress
- Crash reporter: intercepting child process crashes via WER
- https://bugzilla.mozilla.org/show_bug.cgi?id=1697895
- https://bugzilla.mozilla.org/show_bug.cgi?id=1682518
- Almost done with changes for intercepting child process crashes via Windows Error Reporting. This includes registering the runtime exception module with the child processes and adjusting it to inform the main process of the dumps it grabbed. The code works but hasn't landed yet, so it's still a work in progress.
- All: Rust rewrite of all things breakpad
- rust minidump-stackwalk:
- https://github.com/luser/rust-minidump/tree/master/minidump-stackwalk
- https://github.com/luser/rust-minidump/issues/153
- You can now install and test rust-minidump minidump-stackwalk
- Same CLI as existing minidump-stackwalk that Socorro uses. Outputs the same JSON schema.
- We handle most stuff reasonably well on x86/x64 these days, having full stackwalkers/symbolicators. ARM/ARM64 support is in progress. Some fields like exploitability heuristic are not yet implemented.
- The biggest remaining task is replacing `breakpad-symbols` with `symbolic`, which should significantly improve performance/reliability of all the debuginfo handling.
- Additionally, we've gotten a commitment from Microsoft to help build, maintain, and extend Rust minidump-stackwalk.
- Tecken: new symbolication API microservice
- API url: https://symbolication.stage.mozaws.net/symbolicate/v5
- If you do any symbolication work, I'd love to know how it works for you and whether you encounter any issues.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1636210
- Socorro: Better signature generation for Java crash reports
- Symbols: improving process for acquiring symbols for macOS Big Sur
- https://bugzilla.mozilla.org/show_bug.cgi?id=1683758
- Currently we have symbols for release versions of macOS. Work is being done to acquire symbols for beta versions as well. Additionally, the process and tools for acquiring symbols for macOS Big Sur are being improved.
- This enables profiles collected on beta versions of macOS with the Firefox profiler to have symbolicated system libraries.
- This will improve stacks in crash reports for beta versions of macOS.
- Firefox profiler: fix OOM errors when profiling local builds on Linux
- Firefox profiler: getting inline callstacks
- Making this work for official builds is a bigger lift and requires that our symbolication API stops using dump_syms as part of the symbolication pipeline. The Eliot rewrite is making big strides towards that goal and I am very excited about it. (Eliot currently goes [raw build artifact] -> [.sym file] -> [symbolic symcache] -> [API response]. Once we can go directly from the raw build artifact to the symbolic symcache, the rest should be easy.)
Crash Reporting Headlines (May 7th, 2021)
Quick Summary
- WER support in crash reporter
- Work continues on WER support so that we get crash reports in situations where we currently get none. Main process support should be done. Content process support is in progress.
- Socorro's minidump-stackwalker improvements
- Socorro's minidump-stackwalker was improved to emit additional Windows and macOS information. You can see this in the minidump-stackwalk output in the crash report view of Crash Stats.
- rust-minidump progress is moving along
- Work towards replacing Socorro's minidump-stackwalker with rust-minidump is progressing very nicely.
- Crash Stats lets you search by major_version
- Crash Stats has an improved Extensions tab in the crash report view
Details
Completed
- Crash reporter: WER support
- Windows Error Reporting interception landed last month and can intercept all main process crashes we were previously missing. This includes __fastfail() crashes, catastrophic OOM crashes, weird DLL injections, and very late shutdown crashes. It significantly increased the nightly crash rate, which is good because these crashes were previously invisible to us. Content process support is in progress.
- Socorro: minidump-stackwalker improved Windows information
- minidump-stackwalker was improved to print out richer information for Windows including unloaded modules, authenticode signatures, __fastfail() crash reasons, and NTSTATUS errors.
- Socorro: minidump-stackwalker __crash_info support for macOS
- minidump-stackwalker was improved to find and emit __crash_info information for Apple-specific error messages.
- Thank you, Steven Michaud!
- Crash reporter: fixed OOM crash annotations
- Alexandre modified the way we handle out-of-memory crash annotations so that they will never be missing again.
- rust-minidump: taught rust-minidump to parse MISC_INFO_5 format
- Taught rust-minidump to parse the MISC_INFO_5 format (and wrote tests/printing machinery for all the previous formats)
- https://github.com/luser/rust-minidump/pull/137
- rust-minidump: upgraded minidump-processor unwinder
- Upgraded the minidump-processor unwinder: it can now unwind with frame pointers and scanning on x86 and x64
- https://github.com/luser/rust-minidump/pull/145
- rust-minidump: upgraded CLI to match dump_syms
- Upgraded the minidump-processor CLI frontend to match dump_syms, and taught it to generate a JSON version of its report (format is "whatever the layout of the current types are", to be iterated on over time)
- https://github.com/luser/rust-minidump/pull/151
- dump_syms: better support for Apple's compact unwinding
- Taught symbolic (and therefore dump_syms) to dump Apple's Compact Unwinding (.__unwind_info) format into breakpad's format for x86/x64, and wrote up a very thorough description of the format. Until now, the only existing documentation of the format was LLVM's implementation, which lacks such a description. Ideally, when this lands it will fix bug 1691022 (x64 macOS missing CFI on Socorro).
- https://github.com/getsentry/symbolic/pull/372/
- Crash Stats: added last error value to crash report view.
- Crash Stats: redid process type support
- Redid process type support: "parent" is now the value for parent process crash reports, and we're phasing out "browser".
- This makes it a lot easier to search for parent crashes, and aggregations on process type now work.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1701357
- Crash Stats: improved Extensions tab in report view
- The Extensions tab now shows the extension name, whether it's a system extension, and its signed state.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1629943
- https://bugzilla.mozilla.org/show_bug.cgi?id=974968
- Crash Stats: fixed Bugs API to support POST as well as GET
- https://bugzilla.mozilla.org/show_bug.cgi?id=1282707
- If there are other APIs that would benefit from having POST support, let me know.
- Crash Stats: added search by major_version
- Added a major_version field and the ability to search it. This works for all crash reports submitted after April 25th.
- Now you can do searches like "major_version = 88" and "major_version >= 88 and major_version < 90"
- https://bugzilla.mozilla.org/show_bug.cgi?id=1111612
- https://bugzilla.mozilla.org/show_bug.cgi?id=1401517
- Crash Stats: all Super Search fields now have exists/does-not-exist filter
- Socorro: Added support for multiple processing pipeline rulesets
- The first non-default ruleset I wrote is "regenerate_signature" which just regenerates the crash signature. It takes 1/10 the time regular processing takes. I'll use this going forward to regenerate crash signatures after signature generation changes.
- We can use this infrastructure for additional processing as well. That's been something we've talked about over the years.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1705469
- Siggen: Released socorro-siggen 1.0.6
- This includes signature generation changes made since 1.0.5 as well as some minor fixes.
- Presentation: Socorro Overview: 2021
- Converted Socorro Overview presentation done at Data Club into a blog post.
- https://bluesock.org/~willkg/blog/mozilla/socorro_overview_2021.html
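The major_version range search mentioned above ("major_version >= 88 and major_version < 90") can also be expressed as an API query. The sketch below only builds the URL and assumes the public Super Search endpoint and its convention of putting the comparison operator at the start of the parameter value. The function name is hypothetical, so check the Crash Stats API documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumed base URL of the Crash Stats Super Search API.
SUPERSEARCH_URL = "https://crash-stats.mozilla.org/api/SuperSearch/"


def major_version_query(low: int, high: int) -> str:
    """Build a query URL for low <= major_version < high."""
    params = [
        # Repeating the field combines both conditions on the same field.
        ("major_version", f">={low}"),
        ("major_version", f"<{high}"),
    ]
    return SUPERSEARCH_URL + "?" + urlencode(params)


print(major_version_query(88, 90))
```

Keep in mind that, per the note above, the field is only populated for crash reports submitted after April 25th.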
In progress
- All: Rust rewrite of all things breakpad
- Tecken: new symbolication API microservice
- API url: https://symbolication.stage.mozaws.net/symbolicate/v5
- If you do any symbolication work, I'd love to know how it works for you and whether you encounter any issues.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1636210
- Socorro: Better signature generation for Java crash reports
Crash Reporting Headlines (April 7th, 2021)
Quick Summary
- Started Crash Reporting Working Group
- We started a Crash Reporting Working Group to coordinate crash reporting, ingestion, and analysis work. If you're interested in participating or lurking, we've got a mailing list (crash-reporting-wg) and a Matrix channel (#crashreporting).
- Socorro: Ended collection of Email address data.
- Firefox 89+ no longer sends Email address data in crash reports.
- Email data is dropped at collection for all crash reports.
- Socorro: Ended collection of Fennec crash reports.
- Tecken: We need help testing new symbolication API microservice.
Details
Completed
- Crash Stats: Improved preview in Slack/Matrix for crash report view urls and signature report view urls.
- If there's more I can do with this to make these url previews more helpful in conversations, let me know.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1688203
- Socorro: End collection of Email data in crash reports.
- I changed the collector to delete Email data for all incoming crash reports. I fixed the Firefox main and content crash reporter client code. I still have some changes to make in the webapp, but I'm waiting until May 2021 to do that.
- Many thanks to Emily, Nneka, Gabriele, Mike, and Chris for their help with this!
- https://bugzilla.mozilla.org/show_bug.cgi?id=1688883
- Socorro: End collection of crash reports for Fennec
- When working on ending collection of Email data, it came up that we don't need Fennec crash reports anymore. Thus Socorro now rejects all incoming crash reports for Fennec.
- Many thanks to Emily, Stefan, Vesta, and Agi!
- https://bugzilla.mozilla.org/show_bug.cgi?id=1699239
- Crash Stats: Fixed the webapp to automatically update the PCI device db once a week.
- Crash Stats: Redid "Raw data and minidumps" tab in crash report view.
- The Crash Stats UI is confusing and clunky, and I've been trying to fix bits of it over time. In this pass, I improved the tab that holds links to raw and processed crash data, minidumps, and the output of minidump-stackwalk. It should now be clearer what's protected data and what isn't. The links are at the top of the tab where they're easier to access. The minidump-stackwalk output is much easier to manipulate and use.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1696910
- Tecken: New symbolication API microservice
- The symbolication API that Tecken has is hard to improve and there are a bunch of things we want to do with it. Because of that, I embarked on splitting it out into a separate microservice and rewriting it from the ground up using the Symbolic library. That's taken a while for a variety of reasons, but we've now got a working symbolication API in our staging environment that I think is usable.
- API url: https://symbolication.stage.mozaws.net/symbolicate/v5
- I need to write docs for it, but it uses the same payload as the existing symbolication API as documented here: https://tecken.readthedocs.io/en/latest/symbolication.html#symbolication-symbolicate-v5
- If you do any symbolication work, I'd love to know how it works for you and whether you encounter any issues.
- https://bugzilla.mozilla.org/show_bug.cgi?id=1636210
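For anyone who wants to try the staging symbolication API mentioned above, here is a minimal sketch of building a request body. It assumes the v5 payload shape from the linked Tecken documentation: a "jobs" list in which each job has a "memoryMap" of [debug filename, debug id] pairs and "stacks" of [module index, module offset] frames. The helper names and sample values are illustrative only, and no request is sent.

```python
import json

# Staging URL taken from the text above.
SYMBOLICATE_URL = "https://symbolication.stage.mozaws.net/symbolicate/v5"


def build_job(modules, frames):
    """Build one job (hypothetical helper).

    modules: list of (debug_filename, debug_id) pairs
    frames:  list of (module_index, module_offset) pairs for a single stack
    """
    return {
        "memoryMap": [[name, debug_id] for name, debug_id in modules],
        "stacks": [[[index, offset] for index, offset in frames]],
    }


def build_payload(jobs):
    """Serialize the jobs into the JSON body to POST to SYMBOLICATE_URL."""
    return json.dumps({"jobs": jobs})


# Illustrative values only; use the modules and offsets from a real stack.
job = build_job(
    modules=[("xul.pdb", "44E4EC8C2F41492B9369D6B9A059577C2")],
    frames=[(0, 11723767), (0, 65802)],
)
print(build_payload([job]))
```

The resulting string can be sent as the body of a POST request to the URL above with any HTTP client.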
In progress
- Crash reporter client: integrating Windows Error Reporting into Firefox
- Tecken: Finishing up the new symbolication API microservice: https://bugzilla.mozilla.org/show_bug.cgi?id=1636210
- Socorro: Better signature generation for Java crash reports: https://bugzilla.mozilla.org/show_bug.cgi?id=1541120