Marketplace/HAResults
Results
tl;dr
- RabbitMQ is a single point of failure when the process is down on one of the nodes
- MySQL is a single point of failure when the master(s) is down.
- The Marketplace app starts to get lock errors in the DB as soon as we add a bit of content-addition load (>100 RPS) - bug 823054 - so it does not scale unless we remove this part of our load scenarios
- on some webheads Marketplace complains that the "GeoIP server" is not installed - bug 823697
The following table summarizes the availability of Marketplace depending on the state of its back ends. We give an HA grade for each back end based on how Marketplace behaves. When applicable, follow-up bugs are linked in the table for each back end.
Backend | HA Grade | Notes | Related Bugs |
Elastic Search | B | Notes | |
Membase | B | Notes | #819876 |
Redis | B | Notes | |
MySQL | E | Notes | |
RabbitMQ & Celery | C | Notes | #823510 |
SMTPD | B | Notes | |
Statsd & Graphite | B | Notes | |
Logstash & Metlog | B | Notes | |
HA Grades:
- A: No interruption of service at all
- B: Partial interruption of service when the whole cluster is taken down
- C: Partial interruption of service when one part of the cluster is down
- D: Full interruption of service when the whole cluster is taken down
- E: Full interruption of service when one part of the cluster is taken down
ElasticSearch
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Slave Down | OK | OK | OK | OK | OK | OK |
Master Down | OK | OK | OK | OK | OK | OK |
Everything Down | KO [2] | OK | OK | KO [1] | OK | OK |
Everything Hung/Slowed | KO [2] | OK | OK | KO [1] | OK | OK |
notes
- [1] the new apps are not indexed - the celeryd task fails
- [2] the website hangs for 30 s.
recommendations
- on indexing errors (cron or celeryd), we should try to keep the job somewhere so it can be replayed if possible. see apps/addons/tasks.py:index_addons (a retry sketch follows this list)
- shorter timeouts in the views and in the cron/tasks before they fail. Maybe 5 seconds for the UI and 10 seconds for the cron/tasks?
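A minimal sketch of the replay idea, assuming a celery task shaped like apps/addons/tasks.py:index_addons. The helper index_objects() is a hypothetical stand-in for the real indexing code, and the retry/timeout values are only examples:

    # Sketch only: let celery keep and replay failed indexing jobs instead of
    # losing them. index_objects() is a hypothetical stand-in for the real
    # indexing helper; retry counts and timeouts are example values.
    from celery.task import task

    @task(max_retries=3, default_retry_delay=60)
    def index_addons(ids, **kw):
        try:
            # Use a shorter ES timeout for background indexing (e.g. 10s).
            index_objects(ids, timeout=10)
        except Exception as exc:
            # On any indexing error, re-queue the job so it is replayed later.
            index_addons.retry(exc=exc, args=[ids], kwargs=kw)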
Membase
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Slave Down | OK | OK | OK | OK | OK | OK |
Master Down | OK | OK | OK | OK | OK | OK |
Everything Down | OK | OK | KO | KO | OK | OK |
Everything Hung/Slowed | OK | OK | KO | KO | OK | OK |
notes
- I have seen huge chunks of data being cached (templates) - like > 1 MB iirc. We should avoid this.
- XXX (to check) should we protect every call to memcache and make sure the app survives when it fails? (see the sketch after the recommendations below)
- why is Membase mandatory for app submissions etc.?
recommendations
- Is Membase the best place to cache templates? What about a disk cache?
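A minimal sketch of what "protecting every call to memcache" could look like, using the regular Django cache API. The wrapper name and logger are illustrative assumptions, not existing Marketplace code:

    # Sketch only: guard cache reads so a Membase outage degrades to a cache
    # miss instead of a 500. safe_cache_get() is a hypothetical helper name.
    import logging
    from django.core.cache import cache

    log = logging.getLogger('z.cache')

    def safe_cache_get(key, default=None):
        try:
            return cache.get(key, default)
        except Exception:
            # Connection refused, timeout, etc.: log it and recompute the value.
            log.exception('cache.get(%r) failed' % key)
            return default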
Redis
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Slave Down | OK | OK | OK | OK | OK | OK |
Master Down | OK | OK | OK | OK | OK | OK |
Everything Down | OK | OK | OK | OK | OK | OK |
Everything Hung/Slowed | OK | OK | OK | OK | OK | OK |
notes
- Redis is going to be deprecated, so this is less relevant now
- Django Cache Machine absorbs all errors in safe_redis()
MySQL
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
Master Down | KO [0] | KO [0] | KO [0] | KO [0] | KO [0] | OK |
Slave Down | OK | OK | OK | OK | OK | OK |
Everything Down | KO [2] | KO [2] | KO [2] | KO [2] | KO [2] | OK |
Everything Hung/Slowed | OK | OK | KO [1] | KO [1][3] | KO [1] | OK |
notes
- Single Point Of Failure when the master is down
- [0] raw "Internal Server Error" on the web app
- [1] no timeouts in the webapp when mysql hangs
- [2] nginx 504 and 502 on the front page
- [3] nginx gateway timeout on /developers/submissions
recommendations
- Is there a way to avoid the raw 504/502 responses? A templatized error screen on Zeus or Nginx?
- we need a timeout in the marketplace app, so we can display a cleaner error before nginx itself times out. Maybe a shorter timeout on reads. (see the settings sketch after this list)
- can't the app work in a degraded mode when the master is down?
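A minimal sketch of app-side MySQL timeouts through the Django DATABASES options. The database name and values are placeholders, and read_timeout requires a MySQLdb/mysqlclient build that supports it:

    # Sketch only: give the MySQL client its own timeouts so the app can fail
    # (and render a friendly error page) before nginx returns a raw 502/504.
    # Values are illustrative; read_timeout support depends on the driver.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'zamboni',
            'OPTIONS': {
                'connect_timeout': 5,   # seconds to establish the connection
                'read_timeout': 10,     # seconds to wait for query results
            },
        },
    }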
RabbitMQ & Celery
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
One RabbitMQ node down | OK | OK | KO | KO | KO | KO |
One celeryd process down | OK | OK | OK | OK | OK | OK |
Everything Down | TBD | TBD | TBD | TBD | TBD | TBD |
Everything Hung/Slowed | TBD | TBD | TBD | TBD | TBD | TBD |
notes
- Shutting down one RabbitMQ node breaks celery - for instance an IOError: Socket closed on upload_manifest; the webhead does not properly fall back to another node
- kombu raises errors and the task is lost (XXX verify persistence/replay)
- When the node gets back online we're still facing issues
- shutting down one celeryd has no impact; the other celeryd instance picks up the work
recommendations
- we should fail over to another RabbitMQ node if the node associated with the webhead is down; kombu seems to be able to do this (see the sketch after this list)
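A minimal sketch of broker failover through celery settings, assuming a celery/kombu version that supports multiple broker URLs and publish retries. Hostnames and retry values are placeholders:

    # Sketch only: list several broker URLs so kombu can fail over to another
    # RabbitMQ node, and retry publishing instead of losing the task on an
    # IOError. Hostnames and retry values are placeholders.
    BROKER_URL = [
        'amqp://guest@rabbitmq1.example.com//',
        'amqp://guest@rabbitmq2.example.com//',
    ]

    CELERY_TASK_PUBLISH_RETRY = True
    CELERY_TASK_PUBLISH_RETRY_POLICY = {
        'max_retries': 3,
        'interval_start': 0,
        'interval_step': 2,
        'interval_max': 10,
    }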
SMTPD
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
All down | OK | OK | OK | KO | OK | OK |
Hangs | OK | OK | OK | KO | OK | OK |
One SMTP Down | Cannot Test | Cannot Test | Cannot Test | Cannot Test | Cannot Test | Cannot Test |
notes
- we could not shut down individual SMTPD nodes because they are used in production, so we only did a local Vaurien test
- When a reviewer accepts an application, a mail is sent out. If smtpd is down at that point, we get a TransactionManagementError from django.db.backends (raised in leave_transaction_management).
- When SMTPD hangs we get a raw nginx 504 when accepting apps
- other emails are sent by crons
recommendations
- while it's unlikely that both SMTPD servers would be down at once, we could maybe catch the error and just warn the reviewer that the mail was not sent (see the sketch after this list)
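A minimal sketch of that idea; the function name, sender address and wording are illustrative assumptions, not the current review code:

    # Sketch only: catch the SMTP failure when a reviewer approves an app and
    # warn instead of erroring out. Names and addresses are placeholders.
    from smtplib import SMTPException
    from django.contrib import messages
    from django.core.mail import send_mail

    def notify_developer(request, app, recipients):
        try:
            send_mail('App approved',
                      'Your app %s was approved.' % app.name,
                      'marketplace@example.com', recipients)
        except (SMTPException, IOError):
            # Do not let a dead SMTP server break (or roll back) the review.
            messages.warning(request, 'The review was saved, but the '
                             'notification email could not be sent.')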
Statsd & Graphite
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
statsd down | OK | OK | OK | OK | OK | OK |
Graphite Down | TBD | TBD | TBD | TBD | TBD | TBD |
notes
- If statsd is down, the UDP packets are just silently dropped and we don't get stats
LogStash & Metlog
results
Failure | Searching | Browsing | Adding content | Review content | Indexing | Self-Healing |
logstash down | OK | OK | OK | OK | OK | OK |
Graphite Down | TBD | TBD | TBD | TBD | TBD | TBD |
notes
- If logstash is down the UDP packets are just silently dropped, and we don't get logs
- If logstash is up but one of its backend servers (syslog, sentry, etc.) is down, the UDP packets are just silently dropped and we don't get those logs
- the messages are sent to both logstash servers
recommendations
- we should check the impact of this duplication of messages