Auto-tools/Projects/Pulse/PulseGuardian
Status
PulseGuardian is available at https://pulseguardian.mozilla.org.
Code can be found at https://github.com/mozilla-services/pulseguardian.
Team
- Owner: jwhitlock
- Contributors: akachkach, mccricardo, Sherry Shi
Problem
Pulse uses RabbitMQ as a pub/sub service, which formerly allowed anyone to subscribe to any exchange via a common user account. Some client applications use durable queues so that messages are retained if the application crashes; however, sometimes these queues are created by accident, and sometimes apps crash without admins noticing. In these cases, the queues continue to grow without bound, which can eventually result in the RabbitMQ host running out of memory. Our previous solution was to have Nagios monitor the queues and send alerts when any queue exceeded a certain number of unread or unacknowledged messages, at which point a RabbitMQ admin attempted to find the person responsible and/or delete the offending queue.
Goals & Considerations
First, a few definitions:
- A PulseGuardian user is a human user, identified by an email address.
- A Pulse user or RabbitMQ user is a user account in Pulse's RabbitMQ cluster. It is identified by a unique user ID.
- The max_queue_length of any queue is the maximum permitted number of unread and/or unacknowledged messages in that queue. This value is defined as either a single, static value for all queues, or is determined dynamically by some algorithm, possibly including system state (e.g. when a queue is deleted may depend on how many messages are currently in other queues). Currently, only a single, static value is supported.
- The warn_queue_length is a number between 0 and max_queue_length for a given queue at which point a queue-length warning is issued. It may be a single, static value, or determined dynamically by some algorithm, as with max_queue_length.
The primary goal is management of:
- Pulse users. A Pulse user is owned by one or more PulseGuardian users (currently, only one is supported). A PulseGuardian user can own multiple Pulse users, create new Pulse users, and delete any Pulse users they own.
- Queues. A queue should be associated with a Pulse user (the queue's creator). A PulseGuardian user can see the length of, and delete, any queue associated with a Pulse user they own. If a queue's length ever crosses warn_queue_length, that is, moves from a value less than warn_queue_length to a value equal to or exceeding it, PulseGuardian will email a warning, with details on the offending queue, to the PulseGuardian user that owns the Pulse user associated with the queue. Similarly, if a queue's length moves from a value equal to or exceeding warn_queue_length to a value below it, PulseGuardian will email a notification. There may be some additional algorithm to prevent a large number of emails from being sent if a queue hovers around warn_queue_length or continually spikes above it. If the queue's length exceeds max_queue_length, the queue is deleted and an email is sent to the PulseGuardian user owning the Pulse user associated with the queue.
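The threshold rules for queues can be sketched as a small function that compares two consecutive length readings. This is only an illustration of the edge-triggered behaviour described above, not PulseGuardian's actual code: the names Event and classify_transition are hypothetical, and the sketch follows the wording "exceeds max_queue_length" by using a strict comparison for deletion.

```python
from enum import Enum


class Event(Enum):
    NONE = "none"            # no threshold was crossed
    WARN = "warn"            # crossed warn_queue_length going up
    RECOVERED = "recovered"  # dropped back below warn_queue_length
    DELETE = "delete"        # exceeded max_queue_length


def classify_transition(previous_length, current_length,
                        warn_queue_length, max_queue_length):
    """Return the event implied by two consecutive queue-length readings.

    Hypothetical helper: emails are triggered only on a crossing, so a
    queue that stays above warn_queue_length produces no further warnings.
    """
    if current_length > max_queue_length:
        return Event.DELETE
    was_over = previous_length >= warn_queue_length
    is_over = current_length >= warn_queue_length
    if is_over and not was_over:
        return Event.WARN
    if was_over and not is_over:
        return Event.RECOVERED
    return Event.NONE
```

Because only crossings produce events, a queue that hovers right at warn_queue_length would still alternate between warnings and recovery notices; suppressing that would need an extra mechanism, such as a lower recovery threshold or a minimum interval between emails.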
There are other RabbitMQ-management functions we can put into PulseGuardian as well, depending on their benefit to users, including extra notification email addresses and exchange management.
Design and Approach
Although the data currently in Pulse is not confidential, for accountability and to prevent possible abuse, PulseGuardian should be restricted to vouched Mozillians. Logging in should be performed via Persona, authenticating with mozillians.org, to obtain an email address. A new PulseGuardian user is created if none is associated with the given email address. After logging in, users can then create a RabbitMQ user account (see the Pulse security model for default permissions), which will be linked to the associated PulseGuardian user. A password will need to be entered, but it should not be saved in PulseGuardian. We may want to offer, or even require, a randomly generated password (a sort of API key).
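As a rough illustration of the randomly generated password idea, the following sketch uses Python's standard secrets module. The function name generate_pulse_password and the default size are assumptions made for this example; the page does not specify how such a password would be produced.

```python
import secrets


def generate_pulse_password(num_bytes=32):
    """Return a URL-safe random string to use as a Pulse user's password.

    Hypothetical helper: the value would be passed on to RabbitMQ and
    shown to the user once, but not stored by PulseGuardian.
    """
    return secrets.token_urlsafe(num_bytes)
```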
The second part is a process that polls RabbitMQ, looking for queues that have grown above warn_queue_length. If the queue belongs to a Pulse user associated with a PulseGuardian user account (ideally all should, but it is not absolutely required), a warning email is sent containing the queue name and current queue length. If max_queue_length is reached, the queue is deleted, and another email is sent. If the Pulse user is not associated with a PulseGuardian user, that is, it was created directly in RabbitMQ, or if the queue is not associated with a Pulse user, the queue is deleted without a user notification when max_queue_length is reached (no action is performed at warn_queue_length).
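The decision made for each queue on every poll could look roughly like the sketch below. It is a sketch under stated assumptions rather than the real monitor: QueueState, plan_actions, and the action labels are hypothetical, the warned flag is an assumed way to avoid repeating a warning on every poll, and deletion uses a strict "greater than" comparison.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class QueueState:
    name: str
    length: int
    # Email of the owning PulseGuardian user, or None when the queue has
    # no Pulse user or the Pulse user has no PulseGuardian owner.
    owner_email: Optional[str] = None
    # Assumed flag: True once a warning has already been sent.
    warned: bool = False


def plan_actions(queue: QueueState, warn_queue_length: int,
                 max_queue_length: int) -> List[Tuple[str, str]]:
    """Return the (action, target) pairs the poller should carry out."""
    actions = []
    if queue.length > max_queue_length:
        # Over-limit queues are always deleted; only owned ones get an email.
        actions.append(("delete", queue.name))
        if queue.owner_email:
            actions.append(("email_deleted", queue.owner_email))
    elif queue.length >= warn_queue_length:
        # Unowned queues get no action at the warning level.
        if queue.owner_email and not queue.warned:
            actions.append(("email_warning", queue.owner_email))
    return actions
```

Returning a list of planned actions instead of performing them keeps the decision logic separate from the RabbitMQ and email calls, which makes it easy to check in isolation.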
Optionally, we can have admin email addresses that are also sent all notifications, including when there is no owner.
Notes
The Pulse user associated with a queue can be determined from the queue's name, since queue names follow a set format enforced by RabbitMQ user permissions. However, given the coarse granularity of RabbitMQ permissions, a user can technically create a queue in the exchange namespace and vice versa. We could have PulseGuardian immediately delete these.
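A minimal sketch of recovering the owner from a queue name follows. The page does not spell out the naming format, so the pattern queue/<pulse_user>/<suffix> used here is an assumption for illustration only, as is the function name owner_from_queue_name.

```python
def owner_from_queue_name(queue_name):
    """Return the Pulse user encoded in a queue name, or None.

    Assumes names of the form "queue/<pulse_user>/<suffix>"; this format
    is an assumption for the example, not taken from the text above.
    """
    parts = queue_name.split("/", 2)
    if len(parts) == 3 and parts[0] == "queue" and parts[1]:
        return parts[1]
    # Anything else, including queues created in the exchange namespace,
    # has no recognizable owner.
    return None
```

Under this assumption, a queue whose name begins with an exchange prefix yields no owner, so it would be treated like any other unowned queue.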
Implementation
PulseGuardian uses Flask for the user management app and SQLAlchemy + PostgreSQL to store user data.
Communication with RabbitMQ is done via the RabbitMQ management plugin's REST API.
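To give a feel for the management API, the sketch below builds the URL for a single queue and extracts queue lengths from the JSON returned by a GET on /api/queues. The helper names queue_url and queue_lengths are hypothetical and no network call is made; the sketch relies on the API's convention that the vhost and queue name are percent-encoded path segments, and on the "messages" field, which counts ready plus unacknowledged messages.

```python
import json
from urllib.parse import quote


def queue_url(base_url, vhost, name):
    """Build the management-API URL for one queue.

    The same URL serves GET (inspect) and DELETE (remove). The vhost and
    queue name are percent-encoded, so the default vhost "/" becomes "%2F".
    """
    return "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"), quote(vhost, safe=""), quote(name, safe="")
    )


def queue_lengths(queues_json):
    """Map queue name to total message count from a /api/queues response.

    Defaults to 0 when a queue object carries no "messages" field.
    """
    return {q["name"]: q.get("messages", 0) for q in json.loads(queues_json)}
```

An actual client would send these requests with HTTP basic authentication as a RabbitMQ user permitted to use the management plugin.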
The production app is deployed via Heroku.