Socorro:HBase
Contents
- 1 General HBase information
- 2 Socorro HBase Schema
- 2.1 Table crash_reports
- 2.2 Index Tables
- 2.2.1 Index crash_reports_index_hang_id
- 2.2.2 Index crash_reports_index_hang_id_submitted_time
- 2.2.3 Index crash_reports_index_legacy_submitted_time
- 2.2.4 Index crash_reports_index_legacy_unprocessed_flag
- 2.2.5 Index crash_reports_index_signature_ooid
- 2.2.6 Index crash_reports_index_submitted_time
- 2.2.7 Index crash_reports_index_unprocessed_flag
- 2.3 Table metrics
- 2.4 Table crash_report_signatures
General HBase information
Required Reading
Useful Links
- Apache HBase Wiki
- irc.freenode.net #hbase
- Very friendly channel with lots of knowledgeable people
- HBase mailing list
- HBase: The Definitive Guide (Expected May 2011)
Notes
- Socorro uses the Thrift API to allow the Python layer to interact with HBase. Every node in the cluster runs a Thrift server and they are all part of a VIP that only production servers can access.
- Column families
- Example: A common column family Socorro uses is "ids:", and a common column qualifier in that family is "ids:ooid". Another column in the same family is "ids:hang".
- The table schema enumerates the column families that are part of it. The column family contains metadata about compression, number of value versions retained, and caching.
- A column family can store tens of thousands of values with different column qualifier names.
- Retrieving data from multiple column families requires at least one block access (disk or memory) per column family. Accessing multiple columns in the same family requires only one block access.
- If you specify just the column family name when retrieving data, the values for all columns in that column family will be returned.
- If a record does not contain a value for a particular column in a set of columns you query for, there is no "null", there just isn't an entry for that column in the returned row.
- Manipulating a row
- All manipulations are performed using a rowkey.
- Setting a column to a value will create the row if it doesn't exist or update the column if it already existed.
- Deleting a non-existent row or column is a no-op.
- Counter column increments are atomic and very fast. StumbleUpon has some counters that they increment hundreds of times per second.
- Tables are always ordered by their rowkeys
- Scanning a range of a table based on a rowkey prefix or a start and end range is fast.
- Retrieving a row by its key is fast.
- Searching for a row requires a rowkey structure that you can easily do a range scan on, or a reverse index table.
- A full scan of a table that contains billions of rows is slow (although, unlike in an RDBMS, it isn't likely to cause performance problems for other queries).
- If you are continually inserting rows that have similar rowkey prefixes, you are concentrating the write load on a single RegionServer. In excess, this creates a hot spot and hurts write throughput.
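The last two points above can be sketched in Python. The scheme below is a hypothetical salted rowkey: prefixing the key with one character taken from the crash's OOID spreads sequential inserts across many regions, while a day's data can still be read back with one prefix scan per salt value. Neither `make_rowkey` nor this exact salt scheme is taken from Socorro's code; both are illustrative assumptions.

```python
def make_rowkey(ooid, submitted_date):
    """Build a hypothetical salted rowkey for a crash report.

    Prefixing with the first hex character of the OOID spreads
    writes across up to 16 regions instead of hammering the single
    region that owns the current date prefix.
    """
    salt = ooid[0]  # one of '0'-'f' for a hex OOID
    return "%s:%s:%s" % (salt, submitted_date, ooid)

def scan_prefixes(submitted_date):
    """To read one day's crashes, issue 16 prefix scans, one per salt."""
    return ["%s:%s" % (salt, submitted_date) for salt in "0123456789abcdef"]
```

The trade-off is typical for HBase: a salted key avoids the hot-spotting described above, at the cost of turning one range scan into sixteen.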
Socorro HBase Schema
DRAFT
The content of this page is a work in progress intended for review.
Please help improve the draft!
Ask questions or make suggestions in the discussion
or add your suggestions directly to this page.
Table crash_reports
{NAME => 'crash_reports', FAMILIES => [
  {NAME => 'flags', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'ids', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'meta_data', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'processed_data', VERSIONS => '1', COMPRESSION => 'LZO', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'raw_data', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'timestamps', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Index Tables
Index crash_reports_index_hang_id
{NAME => 'crash_reports_index_hang_id', FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Index crash_reports_index_hang_id_submitted_time
{NAME => 'crash_reports_index_hang_id_submitted_time', FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Index crash_reports_index_legacy_submitted_time
{NAME => 'crash_reports_index_legacy_submitted_time', FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Index crash_reports_index_legacy_unprocessed_flag
{NAME => 'crash_reports_index_legacy_unprocessed_flag', FAMILIES => [{NAME => 'ids', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'processor_state', VERSIONS => '5', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Index crash_reports_index_signature_ooid
{NAME => 'crash_reports_index_signature_ooid', FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Index crash_reports_index_submitted_time
{NAME => 'crash_reports_index_submitted_time', FAMILIES => [{NAME => 'ids', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Index crash_reports_index_unprocessed_flag
{NAME => 'crash_reports_index_unprocessed_flag', FAMILIES => [{NAME => 'ids', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'processor_state', COMPRESSION => 'NONE', VERSIONS => '5', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Table metrics
records containing aggregate metrics for varying time intervals
- yyyy
- yearly
- yyyy-mm
- monthly
- yyyy-mm-dd
- daily
- yyyy-mm-ddThh
- hourly
- yyyy-mm-ddThh:mm
- per minute
special records
- crash_report_queues
- contains metrics about the current state of the processing queues
Over time, we will expire finer-grained records and roll them up into the next higher level. We should have plenty of room to grow, and we can decide later what expiration policy to set on these records. The only argument for a strict expiration is that certain metrics will only ever exist at the higher levels (for instance, we wouldn't be able to generate hourly or per-minute ADU numbers, so those metrics would exist only in the daily records).
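The interval rowkey formats listed above map directly onto strftime patterns: a single event touches one metrics row per aggregation level. A small sketch (the function name is illustrative, not from Socorro's code):

```python
from datetime import datetime

def metrics_rowkeys(ts):
    """Return the metrics-table rowkeys touched by an event at time ts,
    one per aggregation level, using the formats listed above."""
    return [
        ts.strftime("%Y"),              # yearly
        ts.strftime("%Y-%m"),           # monthly
        ts.strftime("%Y-%m-%d"),        # daily
        ts.strftime("%Y-%m-%dT%H"),     # hourly
        ts.strftime("%Y-%m-%dT%H:%M"),  # per minute
    ]

# metrics_rowkeys(datetime(2011, 3, 1, 14, 30))
# → ['2011', '2011-03', '2011-03-01', '2011-03-01T14', '2011-03-01T14:30']
```

Because HBase counter increments are atomic and cheap, incrementing all five rows on every event is practical, and the coarser rows survive after the finer-grained ones expire.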
There are two column families in this table, 'counters:' and 'timestamps:'. Below is a list of currently planned metrics. If a counter is specified at a more precise time level, expect it to be aggregated up into the next higher level.
- yyyy-mm-ddThh:mm
- counters:submitted_crash_reports
- counters:submitted_crash_reports_legacy_throttle_0 -- ACCEPT
- counters:submitted_crash_reports_legacy_throttle_1 -- DEFER
- counters:submitted_crash_reports_legacy_throttle_2 -- DISCARD
- counters:submitted_crash_report_hang_pairs
- counters:submitted_oop_plugin_crash_reports (similar columns for future oop crash types)
- counters:crash_report_processing_errors
- unprocessed_crash_report_queue_size (a metric for the oldest item in the queue would also be handy, but it isn't a "counter" per se, so it should live in its own column family)
- yyyy-mm-ddThh
- unique_crash_signatures (NOTE: This number would likely be recalculated via a MapReduce job for higher time levels to avoid having to store every level of unique count during processing time)
- yyyy-mm-dd
- firefox_active_installations (similar columns for other supported products?)
- crash_report_queues
- counters:inserts_unprocessed
- counters:deletes_unprocessed
- counters:inserts_unprocessed_priority
- counters:deletes_unprocessed_priority
- counters:inserts_unprocessed_legacy
- counters:deletes_unprocessed_legacy
- counters:inserts_processed_priority
- counters:deletes_processed_priority
- counters:inserts_processed_legacy
- counters:deletes_processed_legacy
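One use of the paired insert/delete counters above is deriving a queue's current depth without scanning the queue itself: depth = inserts - deletes. A hedged sketch, assuming the counter values have already been fetched from the crash_report_queues row (the function name is an assumption):

```python
def queue_depth(counters, queue_name):
    """Derive a queue's current depth from its paired counters.

    `counters` is a dict of column name -> integer, as read from the
    crash_report_queues row of the metrics table; missing counters
    are treated as zero.
    """
    inserts = counters.get("counters:inserts_%s" % queue_name, 0)
    deletes = counters.get("counters:deletes_%s" % queue_name, 0)
    return inserts - deletes
```

This works because HBase counter increments are atomic, so the two counters never under- or over-count, even with many concurrent processors.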
Table crash_report_signatures
The rowkeys of this table will consist solely of the signature calculated by the Python crash report processor. There are two special values:
- ##empty##
- The generated signature was an empty string.
- ##null##
- The processor failed and there was no generated signature.
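The two sentinel rowkeys above suggest a small normalization step before any increment, sketched here (the function name is illustrative):

```python
def signature_rowkey(signature):
    """Map a processor-generated signature to its rowkey in
    crash_report_signatures, substituting the two sentinel values
    described above."""
    if signature is None:
        return "##null##"   # processor failed; no signature generated
    if signature == "":
        return "##empty##"  # signature was generated but empty
    return signature
```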
There is currently a single column family in this table, 'counters:'. Below is a list of currently planned metrics. Counters for all time levels will be incremented at the same time, since that is the most efficient implementation. Additionally, logic will reach out to the metrics table and increment the unique_crash_signatures record for each time interval if this is the first crash report with this signature seen in that interval.
- hourly_yyyy-mm-ddThh - Columns for current + 48 previous
- daily_yyyy-mm-dd - Columns for current + 30 previous
- monthly_yyyy-mm - Columns for current + 2 previous (NOTE: My thoughts are calendar months are less useful for trend comparison due to # of days differences and signature patterns are likely not very relevant over longer time periods due to new application versions. Disagree?)
- yearly_yyyy - Columns for current + 2 previous years
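Incrementing every time level at once amounts to computing one column name per retention level from the crash's timestamp; since HBase happily stores thousands of columns in one family, each signature row simply accumulates them. A sketch using the column-name formats above (the function name is an assumption):

```python
from datetime import datetime

def signature_counter_columns(ts):
    """Column names in the 'counters:' family to increment for a
    crash report with this signature seen at time ts, one per
    retention level listed above."""
    return [
        "counters:hourly_%s" % ts.strftime("%Y-%m-%dT%H"),
        "counters:daily_%s" % ts.strftime("%Y-%m-%d"),
        "counters:monthly_%s" % ts.strftime("%Y-%m"),
        "counters:yearly_%s" % ts.strftime("%Y"),
    ]
```

Expiring "current + N previous" columns would then be a periodic job that deletes columns whose embedded timestamp falls outside the retention window.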
Open Questions
- Are the suggested expirations of the signature metrics sufficient?
- Are there any time levels we should drop entirely due to lack of potential uses?
- We should have no problems storing between 1 and 1,000 columns in one of these column families. As such, we could also plan for metrics on the number of crash reports per product, product+version, OS, etc. We need to be reasonable, but we shouldn't leave anything important out.