Marketplace/FailureModes
Overall Summary
Most of the Marketplace behaves like a conventional website and should have uptime expectations commensurate with that.
Additionally, there are several services associated with the marketplace. For those, short-duration failures (< 1 hour) are likely to go unnoticed, and have no impact on the user.
However, there are a few points that have implications beyond a user being unable to access the website and will impact the user directly. These are:
- App receipt generation
- App receipt verification
- Payment (deferred)
Below, we look at the technical components of the Marketplace, then break it down into sections and examine how failure of each component affects the user experience. At this stage, the Marketplace is treated as a monolithic project; as things progress, the goal is to move to a more Services-Oriented Architecture, at which time the pieces can be moved into their own pages for failure analysis.
At present, Marketplace exists in one colo. Since the goal is to have it installed in multiple colos, places where replication lag may affect the user experience are also noted below.
Component Details
Network
There is an implicit network dependency that can fail: if we can't get to the server, we can't get content. The result in all situations is basically the same as for the rest of the system failing, though it is not necessarily a failure on our end (and many users will not know whether it is, thanks to phone flakiness).
Zeus
Zeus is used as a load balancer and for caching a few very-high-traffic pages. Our ops team has a love/hate relationship with it and is looking at other solutions.
Memcache
Memcache does a lot of DB object caching throughout the system. If memcache goes down, it's likely that the DBs would be rapidly overwhelmed due to the sudden influx of queries.
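One way to soften this failure mode is to cap how many cache-miss queries may reach the database at once, so that a memcache outage degrades into some failed page loads rather than an overwhelmed database. The following is a minimal sketch under stated assumptions: `cache`, `load_from_db`, `CacheUnavailable` and `GuardedLoader` are hypothetical names for illustration, not Marketplace code.

```python
import threading


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when memcache is down."""


class Overloaded(Exception):
    """Raised when too many requests are already querying the database."""


class GuardedLoader:
    """Cache-aside lookup with a hard cap on concurrent database queries."""

    def __init__(self, cache, load_from_db, max_db_queries=10):
        self.cache = cache                # assumed to offer get(key) / set(key, value)
        self.load_from_db = load_from_db  # assumed callable: key -> value
        self._slots = threading.BoundedSemaphore(max_db_queries)

    def get(self, key):
        cache_ok = True
        try:
            value = self.cache.get(key)
            if value is not None:
                return value
        except CacheUnavailable:
            cache_ok = False  # Cache is down: skip the write-back below.

        # Refuse immediately instead of queueing behind a struggling database.
        if not self._slots.acquire(blocking=False):
            raise Overloaded(key)
        try:
            value = self.load_from_db(key)
        finally:
            self._slots.release()

        if cache_ok:
            self.cache.set(key, value)
        return value
```

A caller would catch `Overloaded` and serve an error or a stale page. The right value for `max_db_queries` depends on what the databases can actually absorb, which this page does not state.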
Database
Heavily normalized MySQL. The DBs are at the core of all the data, including translations for the various pages. They do not currently replicate to other colos.
Elastic Search
Does searches across the apps, which are then filled in through the object cache.
Persona
Persona is the '3rd party' login solution for identifying yourself to the Marketplace. While it is a Mozilla product, it will be used in areas well beyond the Marketplace and should be treated separately.
If Persona is unavailable, the site should continue to work for browsing. However, anything that requires user identity - notably app receipt verification - will fail.
Section Summary and Failure Modes
Discovery Pages
Failure Points
- Zeus cache fails
- Object cache fails
User Visibility
High. This is a high-traffic page because it is the entry point for browsers. A failure shows up as soon as a user visits the Marketplace in the browser, and the user will likely be unable to continue.
Notes
Content is basically static, so it's cached by Zeus.
Homepage/Category Pages
Failure Points
- Object cache fails
- Elastic Search unavailable
User Visibility
High. Problems here mean that the user will probably not be able to progress through the site. However, it will not break any app functionality.
Notes
These pages are pretty static and built from the object cache plus some Elastic Search.
App Pages
Failure Points
- App DB fails
- Review DB fails
- Ratings DB fails
- Object cache fails
- Reviews/ratings replication falls behind
User Visibility
High for popular apps. Users won't be able to install an app if they can't get to it. Users will usually not care if ratings and reviews are temporarily unavailable. The potential exception is when a user posts a review and it doesn't show up immediately.
Notes
Receipt Verification
Failure Points
- Purchase Database unavailable or corrupted
- Purchase Database behind on replication
- Signing keys unavailable
- Signing service unavailable (for receipt updates)
User Visibility
Apps will fail to work. High user visibility and inconvenience.
Notes
Does not apply to free apps, as they do no receipt checks.
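The replication-lag failure point matters here because a purchase recorded moments ago may not yet be visible on a replica. One hedged way to handle that is to treat a replica miss as inconclusive and check the primary before rejecting a receipt. In the sketch below, `replica` and `primary` are hypothetical lookup callables and the three return values are illustrative; none of this is an existing Marketplace API.

```python
def verify_purchase(receipt_id, replica, primary):
    """Return 'ok', 'invalid', or 'retry' for a receipt lookup.

    replica and primary are assumed callables that return a purchase
    record (any truthy value), None when the receipt is unknown, or raise
    ConnectionError when the database cannot be reached.
    """
    try:
        if replica(receipt_id):
            return "ok"
    except ConnectionError:
        pass  # Replica unreachable: fall through to the primary.

    try:
        # A miss on the replica may only mean replication is behind,
        # so the primary has the final say.
        return "ok" if primary(receipt_id) else "invalid"
    except ConnectionError:
        # Nothing authoritative answered; do not declare the receipt invalid.
        return "retry"
```

Returning "retry" rather than "invalid" when no database answers keeps an outage on our side from being reported to the app as a bad receipt.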
Receipt Signing
Failure Points
- Expired key from HSM server
User Visibility
Users will be unable to purchase an app.
Notes
A receipt should be signed before the user is charged, so that a signing failure cannot leave the user billed for an app they never received.
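The ordering can be sketched as follows. This is an illustration only: `sign_receipt` and `charge` are hypothetical callables standing in for the signing service and the payment provider, and `PurchaseFailed` is an invented exception type.

```python
class PurchaseFailed(Exception):
    """Raised when a purchase cannot be completed."""


def purchase(user_id, app_id, sign_receipt, charge):
    """Sign first, charge second; return the signed receipt."""
    try:
        # If signing fails (for example, an expired key), we stop here
        # and the user has not been charged.
        receipt = sign_receipt(user_id, app_id)
    except Exception as exc:
        raise PurchaseFailed("signing failed; user not charged") from exc

    try:
        charge(user_id, app_id)
    except Exception as exc:
        # The receipt was signed but is never handed to the user.
        raise PurchaseFailed("charge failed; receipt withheld") from exc

    return receipt
```

The trade-off is that a failed charge leaves behind a signed receipt that must never be delivered, which is easier to guarantee than refunding a charge that has no receipt.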
Version Check
Failure Points
- Version Database corrupted or unavailable
- Version Database replication behind
- Stale feed into database (not currently, but several models will have this)
User Visibility
Minimal. Failure should produce "no update" to the user, and impact is minimal. They'll pick it up on the next check, and the delay is not important.
Notes
Operational costs here make this a good area for serious examination. There are probably relatively easy wins here, and eventually we might look to improve the FF API itself.
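The "no update" behaviour described above amounts to failing closed on any lookup problem. A minimal sketch, assuming a hypothetical `lookup_latest` callable and dotted-integer version strings (the real version format is not specified on this page):

```python
import logging

log = logging.getLogger("version_check")


def _parse(version):
    """Turn '1.10.2' into (1, 10, 2); assumes dotted integers."""
    return tuple(int(part) for part in version.split("."))


def check_for_update(app_id, installed, lookup_latest):
    """Report an update only when a strictly newer version is known."""
    try:
        latest = lookup_latest(app_id)
        if latest is not None and _parse(latest) > _parse(installed):
            return {"update_available": True, "version": latest}
    except Exception:
        # Database down, corrupted row, unparseable version: log it and
        # let the client pick the update up on its next check.
        log.exception("version lookup failed for app %s", app_id)
    return {"update_available": False}
```

Comparing versions rather than testing for inequality also covers a lagging replica: a stale, older version is reported as "no update" instead of being offered as a downgrade.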
Payments
Not applicable in the current version. The process is handled entirely by BlueVia.
Blocklist
Failure Points
- Generation of static file fails
User Visibility
No user visibility, as a failure will just cause clients to not update the blocklist. However, because of the nature of items on the blocklist, delays or erroneous content can have security consequences.
Notes
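Because clients simply keep the previous blocklist when an update is missing, a failed generation run is safest if it leaves the last good file untouched. A common way to get that property is to write to a temporary file and rename it into place. The sketch below assumes a JSON payload and a hypothetical `publish_blocklist` helper; the real blocklist format is not described on this page.

```python
import json
import os
import tempfile


def publish_blocklist(entries, path):
    """Write the blocklist to path, replacing the old file atomically."""
    # Serialise first, so bad input fails before any file is touched.
    payload = json.dumps({"blocked": sorted(entries)})

    # The temp file must live in the target directory so the final
    # rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This covers failed generation only. It does nothing about erroneous content, which would need separate validation before publishing.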
App Install Process
Failure Points
- Payment processing failure
- Receipt Signing failure
- App download inaccessible
- Fail to write purchase into DB
User Visibility
Attempted purchases of apps will fail. Whether that is more than a minor inconvenience for the user depends on whether we are past the payment step. Failing to write a purchase after it has been made is very bad and needs a lot of logging.
Notes
In general, we'll want a ton of logging throughout this process and a good interface to it so that we can track down reported issues.
If we have a record of a user paying for an app, can they redownload it at will?
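As an illustration of the logging called for above, each install attempt could carry a single transaction id through every step, so that a reported issue can be traced from payment to download. This is a sketch under assumptions: `run_install`, the step names and the log format are all hypothetical.

```python
import logging
import uuid

log = logging.getLogger("install")


def run_install(user_id, app_id, steps):
    """Run named steps in order, logging each under one transaction id.

    steps is a list of (name, callable) pairs. Returns the transaction id;
    re-raises the first failure after logging it.
    """
    txn = uuid.uuid4().hex
    for name, step in steps:
        log.info("txn=%s user=%s app=%s step=%s start", txn, user_id, app_id, name)
        try:
            step()
        except Exception:
            # log.exception records the traceback alongside the message.
            log.exception("txn=%s user=%s app=%s step=%s FAILED",
                          txn, user_id, app_id, name)
            raise
        log.info("txn=%s user=%s app=%s step=%s done", txn, user_id, app_id, name)
    return txn
```

With every line keyed by `txn`, a log search for one id shows exactly which step an install reached, including whether payment completed before a later step failed.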
App Search
Failure Points
- Elastic Search unavailable
- Feed for Elastic Search behind
- Memcache layer unavailable
User Visibility
Search results will be unavailable for a period of time.
Notes
Developer Workflow
Failure Points
- Login failure (see above)
- Application DB failure
- Metrics compilation failure
User Visibility
Very low. Downtime here is unlikely to affect site use; it is a mild inconvenience for the app developer. Accuracy of usage statistics is important, however, because they will be correlated back to the total number of receipts. If the numbers don't match expected values, people will notice!