Socorro:OperationalMetrics
From MozillaWiki
This is a list of the major components of Socorro 2.x. Each component has two sections:
- Status holds point-in-time metrics: the current value or the most recent event. These are the metrics that would help when diagnosing a Nagios alert or investigating a possible problem. Ideally, the page would offer some interactivity, such as letting a user log a comment about an error, as a mechanism for collaborative maintenance.
- Trend holds periodic snapshots of metrics (e.g. per minute) that give a longer-term view of the health and performance of the system. Markers on the trendlines for events such as config changes would make it possible to quickly correlate health changes in the system with code pushes or config changes.
Please add ideas for new metrics, or add comments about potential problems or changes for existing metrics.
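As a rough illustration of the Status/Trend split described above, the sketch below models a trendline as a series of periodic snapshots plus event markers. All class names, field names, the metric name, and the sample timestamps are hypothetical and are not part of Socorro.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    """One periodic Trend sample for a single metric."""
    taken_at: datetime
    value: float


@dataclass
class EventMarker:
    """A config change or code push to overlay on a trendline."""
    occurred_at: datetime
    kind: str  # e.g. "config" or "code" (illustrative labels)
    note: str


@dataclass
class Trendline:
    metric: str
    snapshots: list = field(default_factory=list)
    markers: list = field(default_factory=list)

    def status(self):
        """Status view: the most recent snapshot, or None if there are none."""
        return max(self.snapshots, key=lambda s: s.taken_at, default=None)

    def markers_before(self, when, window):
        """Markers in the `window` leading up to `when`, for correlating a health change."""
        return [m for m in self.markers if when - window <= m.occurred_at <= when]


# Made-up sample data: per-minute snapshots with a sudden drop after a config change.
t0 = datetime(2011, 1, 1, 12, 0)
line = Trendline("collector.reports_collected")
for i, v in enumerate([100.0, 104.0, 12.0]):
    line.snapshots.append(Snapshot(t0 + timedelta(minutes=i), v))
line.markers.append(
    EventMarker(t0 + timedelta(minutes=1, seconds=30), "config", "throttle rule changed")
)

latest = line.status()
suspects = line.markers_before(latest.taken_at, timedelta(minutes=5))
```

Here `latest` plays the role of a Status value, while `suspects` lists the recent events that might explain a change seen in the Trend data.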
Existing metrics sources
- [1] HBase Master Status UI
- [2] Production Hadoop Cluster Ganglia UI
- [3] Production Hadoop Cluster Ganglia UI, Socorro Stats (scroll to the very bottom)
- [4] Metrics Dashboard - raw crash submission
- [5] Crash Stats Status
- [6] Hadoop DFS Health
Components
Collector
Status
- Number of nodes
- Build/Release label
- Config info
- Total reports collected
- Total throttled reports collected
- List of node info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
- Pct reports collected
- Pct throttled reports collected
Trend
- Number of nodes
- Config change events
- Code change events
- Errors
- Reports collected [4]
- Throttled reports collected
Processor
Status
- Number of nodes [5]
- Build/Release label
- Config info
- Total reports processed
- Total throttled reports processed [5]
- List of node info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
- Pct reports processed
- Pct throttled reports processed
Trend
- Number of nodes [5]
- Config change events
- Code change events
- Errors
- Reports processed
- Throttled reports processed [5]
- Reports processed with warnings
- Report processing failures
DBFeeder
Status
- Number of nodes
- Build/Release label
- Config info
- Total reports processed
- Total priority reports processed
- Total throttled reports processed
- List of node info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
- Pct reports processed
- Pct throttled reports processed
Trend
- Number of nodes [5]
- Config change events
- Code change events
- Errors
- Reports processed
- Throttled reports processed [5]
- Reports processed with warnings
- Report processing failures
Stackwalk Symbol Server
Status
- Build/Release label
- Config info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
- Number of symbols loaded
- Time since oldest symbol was used
Trend
- Config change events
- Code change events
- Errors
- Symbol loaded
- Symbol dropped
- Symbol cache hit
- Symbol cache miss
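The cache hit and miss counts above are usually easier to read on a trendline as a single ratio. A minimal sketch, assuming both counters cover the same snapshot interval (the function name is hypothetical):

```python
def cache_hit_ratio(hits, misses):
    """Fraction of symbol lookups served from cache; None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


ratio = cache_hit_ratio(90, 10)
```

Returning `None` for an idle interval keeps an empty sample from being plotted as a 0% hit rate.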
HBase Cluster
Status
- Cluster uptime
- Number of nodes [1]
- Number of regions [1]
- Avg regions per node [1]
- RegionServer with Min regions [1]
- RegionServer with Max regions [1]
- Youngest RegionServer uptime
- Oldest RegionServer uptime
- Build/Release label
- Config info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
Trend
- Number of nodes [2]
- Number of regions [2]
- Config change events
- Code change events
- Errors
- RegionServer down event [2]
- RegionServer up event [2]
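The region counts in the Status list above (total, average per node, and the RegionServers with the fewest and most regions) can all be derived from one per-server mapping. The sketch below assumes such a mapping has already been collected, for example by reading the HBase Master Status UI [1]; the function name and the server names are hypothetical.

```python
def region_balance(regions_by_server):
    """Summarize region distribution from a {server_name: region_count} mapping."""
    if not regions_by_server:
        raise ValueError("no RegionServers reported")
    total = sum(regions_by_server.values())
    min_server = min(regions_by_server, key=regions_by_server.get)
    max_server = max(regions_by_server, key=regions_by_server.get)
    return {
        "nodes": len(regions_by_server),
        "regions": total,
        "avg_regions_per_node": total / len(regions_by_server),
        "min": (min_server, regions_by_server[min_server]),
        "max": (max_server, regions_by_server[max_server]),
    }


summary = region_balance({"rs1": 10, "rs2": 30, "rs3": 20})
```

A large gap between the `min` and `max` entries is a quick hint that regions are unevenly balanced across the cluster.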
Hadoop Cluster
Status
- Cluster uptime
- Number of nodes [6]
  - Live [6]
  - Dead [6]
  - Decommissioning [6]
- Number of files [6]
- Number of blocks [6]
- Under-replicated blocks [6]
- Heap size [6]
- Capacity [6]
- DFS Used [6]
- Non-DFS Used [6]
- DFS Remaining [6]
- Build/Release label [6]
- Config info
- Uptime [6]
- Last failure
  - time
  - stacktrace
  - comments
Trend
- Number of nodes [6]
  - Live [6]
  - Dead [6]
  - Decommissioning [6]
- Number of files [6]
- Number of blocks [6]
- Under-replicated blocks [6]
- Heap size [6]
- Capacity [6]
- DFS Used [6]
- Non-DFS Used [6]
- DFS Remaining [6]
- Config change events
- Code change events
- Errors
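For trending, the raw DFS byte counts listed above are often more useful as percentages of capacity. A minimal sketch, assuming the four values come from the same Hadoop DFS Health [6] reading and share the same unit (the function and key names are hypothetical):

```python
def dfs_usage_percent(capacity, dfs_used, non_dfs_used, dfs_remaining):
    """Express each DFS figure as a percentage of total capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    figures = {
        "dfs_used": dfs_used,
        "non_dfs_used": non_dfs_used,
        "dfs_remaining": dfs_remaining,
    }
    return {name: 100.0 * value / capacity for name, value in figures.items()}


usage = dfs_usage_percent(capacity=1000, dfs_used=250, non_dfs_used=50, dfs_remaining=700)
```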
Zookeeper Cluster
Status
- Cluster uptime
- Number of members
- Number of nodes
- Build/Release label
- Config info
- Uptime
- Last failure
  - time
  - stacktrace
  - comments
Trend
- Number of members
- Number of nodes
- Number of regions
- Config change events
- Code change events
- Errors
Postgres DB
Status
- PostgreSQL master up
- PostgreSQL master accepting connections
- PostgreSQL standby up
- PostgreSQL standby accepting connections
- pgBouncer up
- pgBouncer accepting connections
- Replication running
Resource low-point warnings:
- 90% of connections used
- 90% of disk space used
- FS cache space below 40GB
- Idle-in-transaction (IIT) connections > 30
- swapping
- too many postgresql log files
- too many archive log files
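The numeric thresholds above can be expressed as a simple check over one point-in-time sample. This is only a sketch: the `PgSample` fields are hypothetical, "40GB" is read as 40 GiB, the 90% limits are treated as "at or above", and the two "too many log files" warnings are left out because the list gives no threshold for them.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # assumption: the "40GB" threshold is read as 40 GiB


@dataclass
class PgSample:
    """Hypothetical point-in-time sample; field names are illustrative."""
    connections_used: int
    max_connections: int
    disk_used_bytes: int
    disk_total_bytes: int
    fs_cache_bytes: int
    iit_connections: int
    swapping: bool


def resource_warnings(s):
    """Return the names of the low-point warnings that a sample triggers."""
    warnings = []
    # Integer arithmetic avoids float rounding at the 90% boundary.
    if s.connections_used * 10 >= s.max_connections * 9:
        warnings.append("connections")
    if s.disk_used_bytes * 10 >= s.disk_total_bytes * 9:
        warnings.append("disk")
    if s.fs_cache_bytes < 40 * GIB:
        warnings.append("fs_cache")
    if s.iit_connections > 30:
        warnings.append("iit")
    if s.swapping:
        warnings.append("swap")
    return warnings


# Made-up sample: connections and IIT are over their limits, the rest are healthy.
sample = PgSample(95, 100, 500 * GIB, 1000 * GIB, 64 * GIB, 31, False)
triggered = resource_warnings(sample)
```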
Trend
Slow query logging (pgfouine) -- not part of Ganglia
- Database size
- TCBS size
- Reports partition size
- Replication delay
- # of pooled connections
- # of DB connections
- length and number of IIT connections
- Memory usage for: postgres processes, fs cache
- I/O metrics
- CPU metrics
- query spill-to-disk
- response time for a preset query or set of queries
- database bloat
Middleware Layer
Status
Trend
UI
Status
- Number of nodes (?)
Trend
- Number of nodes (?)
Jobs
Status
- List of jobs scheduled
  - name
  - time
  - description
  - owner
  - link to results
- Recent failures
  - name
  - time
  - reason
  - logs
  - blame (e.g. cvs/svn blame?)
  - comments
Trend
- Executions
- Execution durations
- Failure times
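The job Trend metrics above could be rolled up per job from a log of individual runs. A minimal sketch, assuming each run is recorded as a `(job_name, duration_seconds, succeeded)` tuple; that record shape, the function name, and the job names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_runs(runs):
    """Roll up (job_name, duration_seconds, succeeded) tuples into per-job trend figures."""
    by_job = defaultdict(list)
    for name, duration, succeeded in runs:
        by_job[name].append((duration, succeeded))
    summary = {}
    for name, records in by_job.items():
        durations = [d for d, _ in records]
        summary[name] = {
            "executions": len(records),
            "failures": sum(1 for _, ok in records if not ok),
            "mean_duration": mean(durations),
            "max_duration": max(durations),
        }
    return summary


# Made-up run log with two jobs.
report = summarize_runs([
    ("nightly_rollup", 10.0, True),
    ("nightly_rollup", 20.0, False),
    ("cleanup", 3.0, True),
])
```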