Marketplace/HAResults
Results
tl;dr
- RabbitMQ is a single point of failure when the process is down on one of the nodes
- MySQL is a single point of failure when the master(s) is down.
- The Marketplace app starts to get lock errors in the DB as soon as we add a bit of content-addition load (>100 RPS) - bug 823054 - so it does not scale unless we remove this part of our load scenarios
- on some webheads Marketplace complains that the "GeoIP server" is not installed - bug 823697
The following table summarizes the availability of Marketplace depending on the state of its back ends. We give an HA grade for each back end based on how Marketplace behaves. When applicable, follow-up bugs are linked in the table for each back end.
Backend | HA Grade | Notes | Related Bugs |
Elastic Search | B | Notes | |
Membase | B | Notes | #819876 |
Redis | B | Notes | |
MySQL | E | Notes | |
RabbitMQ & Celery | C | Notes | #823510 |
SMTPD | B | Notes | |
Statsd & Graphite | B | Notes | |
Logstash & Metlog | B | Notes | |
HA Grades:
- A: No interruption of service at all
- B: Partial interruption of service when the whole cluster is taken down
- C: Partial interruption of service when one part of the cluster is down
- D: Full interruption of service when the whole cluster is taken down
- E: Full interruption of service when one part of the cluster is taken down
ElasticSearch
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Slave Down | OK | OK | OK | OK | OK | OK |
Master Down | OK | OK | OK | OK | OK | OK |
Everything Down | KO [2] | OK | OK | KO [1] | OK | OK |
Everything Hung/Slowed | KO [2] | OK | OK | KO [1] | OK | OK |
notes
- [1] the new apps are not indexed - the celeryd task fails
- [2] the website hangs for 30 s.
recommendations
- on indexing errors (cron or celeryd), we should try to keep the job somewhere so it can be replayed if possible. see apps/addons/tasks.py:index_addons (a retry sketch follows this list)
- shorter timeouts in the views and in the cron/tasks before they fail. Maybe 5 seconds for the UI and 10 seconds for the cron/tasks?
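A minimal sketch of the replay idea, assuming a celery task shaped like apps/addons/tasks.py:index_addons. The helper index_objects() is a hypothetical stand-in for the real indexing code, and the retry/timeout values are only examples:

    # Sketch only: let celery keep and replay failed indexing jobs instead of
    # losing them. index_objects() is a hypothetical stand-in for the real
    # indexing helper; retry counts and timeouts are example values.
    from celery.task import task

    @task(max_retries=3, default_retry_delay=60)
    def index_addons(ids, **kw):
        try:
            # Use a shorter ES timeout for background indexing (e.g. 10s).
            index_objects(ids, timeout=10)
        except Exception as exc:
            # On any indexing error, re-queue the job so it is replayed later.
            index_addons.retry(exc=exc, args=[ids], kwargs=kw)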
Membase
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Slave Down | OK | OK | OK | OK | OK | OK |
Master Down | OK | OK | OK | OK | OK | OK |
Everything Down | OK | OK | KO | KO | OK | OK |
Everything Hung/Slowed | OK | OK | KO | KO | OK | OK |
notes
- I have seen huge chunks of data being cached (templates) - like > 1 MB iirc. We should avoid this.
- XXX (to check) should we protect every call to memcache and make sure the app survives when it fails? (see the sketch after the recommendations below)
- why is Membase mandatory for app submissions etc.?
recommendations
- Is Membase the best place to cache templates? What about a disk cache?
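A minimal sketch of what "protecting every call to memcache" could look like, using the regular Django cache API. The wrapper name and logger are illustrative assumptions, not existing Marketplace code:

    # Sketch only: guard cache reads so a Membase outage degrades to a cache
    # miss instead of a 500. safe_cache_get() is a hypothetical helper name.
    import logging
    from django.core.cache import cache

    log = logging.getLogger('z.cache')

    def safe_cache_get(key, default=None):
        try:
            return cache.get(key, default)
        except Exception:
            # Connection refused, timeout, etc.: log it and recompute the value.
            log.exception('cache.get(%r) failed' % key)
            return default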
Redis
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Slave Down | OK | OK | OK | OK | OK | OK |
Master Down | OK | OK | OK | OK | OK | OK |
Everything Down | OK | OK | OK | OK | OK | OK |
Everything Hung/Slowed | OK | OK | OK | OK | OK | OK |
notes
- Redis is going to be deprecated, so this is less relevant now
- Django Cache Machine absorbs all errors in safe_redis()
MySQL
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Master Down | KO [0] | KO [0] | KO [0] | KO [0] | KO [0] | OK |
Slave Down | OK | OK | OK | OK | OK | OK |
Everything Down | KO [2] | KO [2] | KO [2] | KO [2] | KO [2] | OK |
Everything Hung/Slowed | OK | OK | KO [1] | KO [1][3] | KO [1] | OK |
notes
- Single Point Of Failure when the master is down
- [0] raw "Internal Server Error" on the web app
- [1] no timeouts in the webapp when mysql hangs
- [2] nginx 504 and 502 on the front page
- [3] nginx gateway timeout on /developers/submissions
recommendations
- Is there a way to avoid the raw 504/502 responses? A templatized error screen on Zeus or Nginx?
- we need a timeout in the marketplace app, so we can display a cleaner error before nginx itself times out. Maybe a shorter timeout on reads. (see the settings sketch after this list)
- can't the app work in a degraded mode when the master is down?
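A minimal sketch of app-side MySQL timeouts through the Django DATABASES options. The database name and values are placeholders, and read_timeout requires a MySQLdb/mysqlclient build that supports it:

    # Sketch only: give the MySQL client its own timeouts so the app can fail
    # (and render a friendly error page) before nginx returns a raw 502/504.
    # Values are illustrative; read_timeout support depends on the driver.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'zamboni',
            'OPTIONS': {
                'connect_timeout': 5,   # seconds to establish the connection
                'read_timeout': 10,     # seconds to wait for query results
            },
        },
    }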
RabbitMQ & Celery
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
One RabbitMQ node down | OK | OK | KO | KO | KO | KO |
One celeryd process down | OK | OK | OK | OK | OK | OK |
Everything Down | TBD | TBD | TBD | TBD | TBD | TBD |
Everything Hung/Slowed | TBD | TBD | TBD | TBD | TBD | TBD |
notes
- Shutting down one RabbitMQ node breaks celery - for instance an IOError: Socket closed on upload_manifest; the webhead does not properly fall back to another node
- kombu raises errors and the task is lost (XXX verify persistence/replay)
- When the node gets back online we're still facing issues
- shutting down one celeryd has no impact; the other celeryd instance picks up the work
recommendations
- we should fail over to another RabbitMQ node if the node associated with the webhead is down; kombu seems to be able to do this (see the sketch after this list)
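A minimal sketch of broker failover through celery settings, assuming a celery/kombu version that supports multiple broker URLs and publish retries. Hostnames and retry values are placeholders:

    # Sketch only: list several broker URLs so kombu can fail over to another
    # RabbitMQ node, and retry publishing instead of losing the task on an
    # IOError. Hostnames and retry values are placeholders.
    BROKER_URL = [
        'amqp://guest@rabbitmq1.example.com//',
        'amqp://guest@rabbitmq2.example.com//',
    ]

    CELERY_TASK_PUBLISH_RETRY = True
    CELERY_TASK_PUBLISH_RETRY_POLICY = {
        'max_retries': 3,
        'interval_start': 0,
        'interval_step': 2,
        'interval_max': 10,
    }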
SMTPD
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
All down | OK | OK | OK | KO | OK | OK |
Hangs | OK | OK | OK | KO | OK | OK |
One SMTP Down | Cannot Test | Cannot Test | Cannot Test | Cannot Test | Cannot Test | Cannot Test |
notes
- we could not shut down individual SMTPD nodes because they are used in production, so we only did a local Vaurien test
- When a reviewer accepts an application, a mail is sent out. If smtpd is down at that point, we get a TransactionManagementError from django.db.backends (raised in leave_transaction_management).
- When SMTPD hangs we get a raw nginx 504 when accepting apps
- other emails are sent by crons
recommendations
- while it's unlikely that both SMTPD servers would be down at once, we could maybe catch the error and just warn the reviewer that the mail was not sent (see the sketch after this list)
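A minimal sketch of that idea; the function name, sender address and wording are illustrative assumptions, not the current review code:

    # Sketch only: catch the SMTP failure when a reviewer approves an app and
    # warn instead of erroring out. Names and addresses are placeholders.
    from smtplib import SMTPException
    from django.contrib import messages
    from django.core.mail import send_mail

    def notify_developer(request, app, recipients):
        try:
            send_mail('App approved',
                      'Your app %s was approved.' % app.name,
                      'marketplace@example.com', recipients)
        except (SMTPException, IOError):
            # Do not let a dead SMTP server break (or roll back) the review.
            messages.warning(request, 'The review was saved, but the '
                             'notification email could not be sent.')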
Statsd & Graphite
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
statsd down | OK | OK | OK | OK | OK | OK |
Graphite Down | TBD | TBD | TBD | TBD | TBD | TBD |
notes
- If statsd is down, the UDP packets are just silently dropped and we don't get stats
LogStash & Metlog
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
logstash down | OK | OK | OK | OK | OK | OK |
Graphite Down | TBD | TBD | TBD | TBD | TBD | TBD |
notes
- If logstash is down the UDP packets are just silently dropped, and we don't get logs
- If logstash is up but one of its backend servers (syslog, sentry, etc.) is down, the UDP packets are just silently dropped and we don't get those logs
- the messages are sent to both logstash servers
recommendations
- we should check the impact of this duplication of messages