Marketplace/HAtesting
Introduction
As part of HA testing, we need to identify single points of failure within the system and figure out what happens when those pieces break, with the long-term goal of making sure that these failures do not bring the rest of the system down. Each failure should be simulated while the system is under load unless otherwise noted. Simulations should reflect the likely kinds of degradation: unavailability, slowness, and desynchronization.
As we move to a more SOA-style marketplace, many of these will become easier to test, because the points of contact will be more controlled and easier to identify. Some components, such as the MySQL master, will always be central to the system, though.
Below is a list of tests that should be run to test HA readiness, including their results. A failed test is not necessarily a problem (after all, if the master load balancer goes down, you're not going to have a good time), but it helps us identify areas to prioritize in becoming HA.
Identified components/problems that could occur
Webserver
Webserver dies
Simulation: While running under load, shut down the webserver on one of the frontend machines
Result:
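One way to fill in a result like this is to poll the site while the fault is injected and tally what a client actually sees. The sketch below is a minimal sequential probe, not a load generator; `probe` and its parameters are hypothetical names and not part of any existing Marketplace tooling.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def probe(url, requests=20, timeout=5.0, pause=0.1):
    """Request `url` repeatedly; tally outcomes and record latencies in seconds."""
    outcomes = Counter()
    latencies = []
    for _ in range(requests):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                outcomes[str(resp.status)] += 1
        except urllib.error.HTTPError as exc:
            # The server answered, but with a 4xx/5xx status.
            outcomes[str(exc.code)] += 1
        except OSError as exc:
            # Connection refused, reset, or timed out (URLError is an OSError).
            outcomes[type(exc).__name__] += 1
        latencies.append(time.monotonic() - start)
        time.sleep(pause)
    return outcomes, latencies
```

Running the probe before, during, and after the shutdown gives three tallies to compare, which is usually enough to tell whether the load balancer routed around the dead webserver.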
MySQL
MySQL master dies
Simulation: While running under load, perform a MySQL shutdown on the master DB
Result:
MySQL switches masters
(See https://bugzilla.mozilla.org/show_bug.cgi?id=804255)
Simulation: While running under load, trigger a failover to a new master DB
Result:
MySQL slave dies
Simulation: While running under load, perform a MySQL shutdown on one of the non-master DBs
Result:
MySQL load balancer dies
Note that this is the equivalent of turning off all of MySQL. We're not expecting to survive this, just to see how gracefully the frontend handles it.
Simulation: While running under load, tell the load balancer to stop serving traffic.
Result:
Slow MySQL replication
Simulation: While running under load, turn off replication to one of the MySQL slaves.
Result:
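Before recording a result for this test, it helps to confirm that the slave really is stopped or lagging. The sketch below classifies one row of `SHOW SLAVE STATUS` output that has already been fetched into a dict; the function name and the 30-second threshold are assumptions chosen for illustration.

```python
def replication_state(status_row, max_lag_seconds=30):
    """Classify one row of SHOW SLAVE STATUS output, given as a dict."""
    io_ok = status_row.get("Slave_IO_Running") == "Yes"
    sql_ok = status_row.get("Slave_SQL_Running") == "Yes"
    # Seconds_Behind_Master is NULL (None) when replication is not running.
    lag = status_row.get("Seconds_Behind_Master")
    if not (io_ok and sql_ok) or lag is None:
        return "stopped"
    if int(lag) > max_lag_seconds:
        return "lagging"
    return "ok"
```

Fetching the row is left out on purpose, since it depends on which MySQL client library the test harness uses.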
Slow MySQL processing
Simulation: While running under load, delay all MySQL query responses by 10, 20, and 30 seconds. (We can obviously delay the connection; can we write a proxy that adds a sleep in front of MySQL?)
Result:
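The proxy question in the slow-processing simulation above could be answered with a small TCP pass-through that sleeps before relaying each chunk the backend sends. This is a rough sketch under stated assumptions: `start_delay_proxy` is a hypothetical helper, it knows nothing about the MySQL protocol, and because it delays every chunk from the server it also slows the initial handshake and may delay large results more than once.

```python
import asyncio


async def _pump(reader, writer, delay):
    """Copy bytes from reader to writer, sleeping `delay` seconds before each chunk."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            if delay:
                await asyncio.sleep(delay)
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def start_delay_proxy(listen_port, target_host, target_port, delay):
    """Listen on localhost and forward to the target, delaying backend replies."""

    async def handle(client_reader, client_writer):
        try:
            backend_reader, backend_writer = await asyncio.open_connection(
                target_host, target_port)
        except OSError:
            client_writer.close()
            return
        await asyncio.gather(
            _pump(client_reader, backend_writer, 0.0),    # requests: no delay
            _pump(backend_reader, client_writer, delay),  # replies: delayed
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

Pointing the application's database host at the proxy port, instead of at MySQL directly, is then enough to run the test. The same pass-through should work for the slow Redis, Memcache, and Elastic Search cases, since it only relays bytes.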
Elastic Search
Elastic Search dies
Simulation: While running under load, bring down all Elastic Search nodes.
Result:
Elastic Search is slow
Simulation: While running under load, make Elastic Search responses take an additional 30 seconds.
Result:
Elastic Search node dies
Simulation: While running under load, bring down one of the Elastic Search nodes. Are there visible changes to the site?
Result:
Elastic Search load balancer dies
Processing Queues and Automated Tasks
Celery dies
Simulation:
Result:
RabbitMQ dies
Simulation:
Result:
Cron jobs stop running or die
Simulation: While the site is running normally, turn off all marketplace cron jobs for 48 hours. Examine the site for notable deviations from expected values.
(Note: we need to identify all the cron jobs, and may need to call some out individually.)
Result:
Redis
Redis node dies
Simulation: While running under load, bring down one of the Redis nodes.
Result:
Redis responds slowly
Simulation: Add a 1s delay to all Redis calls.
Result:
Memcache
Memcache node dies
Simulation: While running under load, bring down a single memcache node.
Result:
Memcache dies
Simulation: While running under load, turn memcache off on all memcache nodes
Result:
Memcache responds slowly
Simulation: While running under load, add a 1s delay to all results coming from memcache
Result:
Signing Services
Receipt signing service dies
Simulation: While running purchasing and receipt verification, shut off receipt signing.
Result:
Receipt signing service slow
Simulation: While running purchasing and receipt verification, add a 20s or 30s delay to the signing request. (Question: how long will the client hold the connection open? We should test delays on both sides of that limit.)
Result:
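The open question about how long the client holds the connection can be explored by timing requests against a deliberately slow endpoint with different client timeouts. The sketch below is one way to do that with the standard library; `timed_fetch` is a hypothetical helper, and the real client's timeout behaviour may differ from `urllib`'s.

```python
import socket
import time
import urllib.request


def timed_fetch(url, timeout):
    """Fetch `url`; return (outcome, seconds waited) for the given client timeout."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            outcome = "ok"
    except OSError as exc:
        # A timeout may surface directly or wrapped in URLError (as .reason).
        reason = getattr(exc, "reason", exc)
        outcome = "timeout" if isinstance(reason, socket.timeout) else "error"
    return outcome, time.monotonic() - start
```

Calling it once with a timeout shorter than the injected delay and once with a longer one covers both sides of the limit and shows how long the caller was actually kept waiting.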
JAR signing service dies
Simulation: While testing the approval process, turn off the JAR signing service.
Result:
Payments
Single payment gateway server dies (Webpay)
Simulation: While users attempt to make a purchase, kill a gateway server.
Result:
Payment gateway load balancer dies (Webpay)
Simulation: While users attempt to make a purchase, kill the load balancer. This effectively removes the payment service.
Result:
Payment processing server dies (Solitude)
Simulation: While attempting to make purchases, bring down one of the Solitude servers
Result:
Payment processing load balancer dies (Solitude)
Simulation: While attempting to make purchases, bring down all of Solitude.
Result:
Payment service (Bango or PayPal) dies
Simulation: While making purchases, sever the connection between the payment servers and PayPal (blackhole the address in the configuration?)
Result:
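If the provider address is changed in configuration, one option is to point it at a local listener that accepts connections and then never answers. This is a tarpit, not a true blackhole: with a real blackhole the packets are dropped and the connect itself hangs, while the sketch below exercises read timeouts only. `start_tarpit` is a hypothetical helper name.

```python
import socket
import threading


def start_tarpit(host="127.0.0.1", port=0):
    """Accept TCP connections and never reply. Returns (listening socket, port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # accept failed, e.g. the listener was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener, listener.getsockname()[1]
```

Testing both variants is worthwhile, because a client can have a sensible read timeout and still hang forever on connect, or the reverse.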
Monitoring
Webtrends/analytics goes down
Simulation:
Result:
Statsd/graphite/sentry goes down
Simulation:
Result:
Syslog, CEF or Metlog goes down
Simulation:
Result:
Miscellaneous
Backend storage for images and applications dies
Simulation: While running under load, turn off the NFS hosting our images and applications
Result:
reCAPTCHA is unreachable
Simulation: Blackhole the IP for reCAPTCHA and attempt to register an account.
Result:
BrowserID is unreachable
Simulation: Turn off BrowserID (or blackhole it), then attempt to use the site. The attempt should include some registration.
Result:
Simulation:
Result:
Outgoing.mozilla.org not responsive
Simulation:
Result:
REALLY BIG STUFF
DNS resolution dies
Simulation: (I have no idea how to test this one)
Result:
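A full DNS outage is hard to stage, but a partial approximation is possible inside a single Python process by making name resolution fail. The sketch below patches `socket.getaddrinfo` for the duration of a `with` block; `dns_outage` is a hypothetical helper, and it only affects Python code in the same process that resolves names through `socket.getaddrinfo`, so it says nothing about how browsers, other services, or the load balancers behave.

```python
import contextlib
import socket
from unittest import mock


@contextlib.contextmanager
def dns_outage(hostnames=None):
    """Make name resolution fail in this process, for all hosts or just `hostnames`."""
    real_getaddrinfo = socket.getaddrinfo

    def failing_getaddrinfo(host, *args, **kwargs):
        if hostnames is None or host in hostnames:
            raise socket.gaierror(socket.EAI_NONAME, "simulated DNS failure")
        return real_getaddrinfo(host, *args, **kwargs)

    # mock.patch restores the real resolver when the block exits.
    with mock.patch("socket.getaddrinfo", failing_getaddrinfo):
        yield
```

Passing a set of hostnames limits the failure to selected dependencies, which makes it possible to check how the application code reacts when only one upstream name stops resolving.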
CDN dies
Simulation: Is this realistically testable? It's out of our hands entirely.
Result:
Webserver front end load balancer dies
Simulation: While running under load, perform a shutdown on the web load balancer
Result: