Socorro:Overview
DRAFT
The content of this page is a work in progress intended for review.
Socorro 1.x infrastructure history
The previous Socorro infrastructure was designed with several scaling points, but it was never designed to process 100% of the crash reports that are submitted. Further, it had two potential limiting factors:
- NFS data store -- All raw and processed crash report data is stored on an NFS filer. Because of the large volume of files, an elaborate file structure was needed to partition the data. There are also hard limits on the amount of data that can be stored on the filer and on the number of clients that can connect to it simultaneously.
- Reports table(s) in Postgres -- The reports tables are partitioned by date range and contain a record for every processed crash report. They are used for user searches and for generating the materialized views that drive Socorro reports. Keeping this data in a traditional RDBMS limits the amount of data that can be easily managed.
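To illustrate why a large volume of files forces a partitioned file structure, here is a minimal sketch of one common approach: spreading files across nested directories derived from a hash of the crash ID. The function name `partitioned_path`, the root path, and the hash-based layout are all assumptions made for illustration; this is not Socorro's actual on-disk layout.

```python
import hashlib
from pathlib import PurePosixPath


def partitioned_path(root: str, crash_id: str, depth: int = 2, width: int = 2) -> PurePosixPath:
    """Map a crash ID to a nested directory path (hypothetical layout).

    The first `depth` chunks of `width` hex characters from a SHA-1 hash of
    the ID become directory levels, so no single directory grows too large.
    """
    digest = hashlib.sha1(crash_id.encode("utf-8")).hexdigest()
    levels = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return PurePosixPath(root, *levels, f"{crash_id}.json")


# Example: the same ID always maps to the same two-level subdirectory.
print(partitioned_path("/mnt/crashes", "example-crash-id"))
```

With two levels of two hex characters each, files are spread over 65,536 directories. A scheme like this keeps directory sizes manageable, but it does nothing about the filer's total capacity or client connection limits described above.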
Socorro 2.x infrastructure
Socorro:OperationalMetrics
Impetus
The crashkill project became very important around the time we were planning for Firefox 3.6. The decision was made that it would be useful if we were able to stop throttling and better evaluate the long tail of crash reports. Further, there were increasing requests for complex reports, such as correlations, that were difficult to deliver within the existing infrastructure.
Mozilla Metrics involvement
The Mozilla Metrics team stepped in and promised to deliver a new storage infrastructure that would be able to accommodate much larger volumes of data and provide the ability to define and run these more complex analyses.
The key goals of the Metrics team for the Socorro 2.0 design are:
- Provide a scalable backend capable of storing 100% of three years' worth of crash reports
- Provide a powerful analytic platform capable of handling complicated queries without requiring new features in the Socorro platform code
- Reuse existing code/components and keep as much Socorro "business logic" (e.g. crash report signature generation) as possible in the hands of the Socorro dev team to enable them to control their schedule without being blocked by Metrics team deliverables
The Metrics team determined that a key-value or document store database was best suited for this infrastructure. Some of the determining factors were:
- A continuous stream of crash reports flowing in
- Crash reports need to be retrieved by ID with low latency based on user interaction with the Socorro website
- If information is requested for a crash report that has not yet been processed, it needs to be processed as a priority, delivering the results within seconds
- Continuous background processing of crash reports
- Scheduled jobs to aggregate and analyze crash reports and populate tables for the Socorro website
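The access pattern described by these factors (lookup by ID, background processing, and priority processing on demand) can be sketched with an in-memory stand-in. Everything here is hypothetical: the `CrashStore` class, its method names, and the use of a heap as the queue are illustrative assumptions, not Socorro or HBase APIs.

```python
import heapq
import itertools

PRIORITY, BACKGROUND = 0, 1  # lower number is served first


class CrashStore:
    """In-memory stand-in for a key-value store plus a processing queue."""

    def __init__(self, process):
        self._raw = {}        # crash_id -> raw report
        self._processed = {}  # crash_id -> processed report
        self._queue = []      # heap of (priority, sequence, crash_id)
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority
        self._process = process

    def submit(self, crash_id, raw_report):
        """Store an incoming report and queue it for background processing."""
        self._raw[crash_id] = raw_report
        heapq.heappush(self._queue, (BACKGROUND, next(self._seq), crash_id))

    def request(self, crash_id):
        """Return the processed report, or bump it to priority and return None."""
        if crash_id in self._processed:
            return self._processed[crash_id]
        if crash_id in self._raw:
            heapq.heappush(self._queue, (PRIORITY, next(self._seq), crash_id))
        return None

    def process_next(self):
        """Process the most urgent unprocessed report; return its ID or None."""
        while self._queue:
            _, _, crash_id = heapq.heappop(self._queue)
            if crash_id not in self._processed:  # skip stale duplicate entries
                self._processed[crash_id] = self._process(self._raw[crash_id])
                return crash_id
        return None
```

In this sketch, a user request for an unprocessed report pushes a second, higher-priority queue entry, so the next call to `process_next` handles that report ahead of the background backlog. A real deployment would need a distributed store and queue; the point is only that retrieval by key and priority reordering are the operations the data store has to make cheap.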
After evaluating the merits of several back-end technologies, including MongoDB, CouchDB, Cassandra, Hadoop, and HBase, the Metrics team opted for HBase as the primary data store. Several of the other options also fit the above criteria; one of the biggest reasons for choosing HBase was that it is built on Hadoop, which provides a general computing cluster that can be used for a variety of future projects such as large-scale log processing.
Components
Breakpad clients
Submit crashes to crash-reports.mozilla.org
Collector
Python app hosted by Apache/mod_python that handles the initial manipulation and storage of the incoming crash report
HBase
Main data store for crash report related records
Hadoop
Underlying data store and MapReduce computation platform
Cron Jobs
Scripts for inserting data from other systems (e.g. Bugzilla)
Postgres DB
Database containing materialized views useful for displaying reports and analysis in the UI
PyServe
A middleware service encapsulating component integration and business logic
UI
crash-stats.mozilla.org, the PHP web application that gives end users and developers the ability to interact with the Socorro system