Services/Sync/Server/DesignNotes

From MozillaWiki

Overview of used tools

  • Apache
  • mod_wsgi 3.2
  • Python 2.6.x
  • Paste 1.7.4
  • PasteScript 1.7.3
  • PasteDeploy 1.3.3
  • SQLAlchemy 0.6.2
  • MySQL-Python 1.2.3c1
  • python-ldap 2.3.11
  • WebOb 0.9.8
  • Routes 1.12.3

Development tools (optional)

  • distribute 0.6.13
  • virtualenv 1.4.9

Framework

Since we don't need any specific web framework feature, let's KISS and use a pure Python implementation that uses the minimal set of tools to interact with incoming requests.

Web: WebOb + Routes

We want to use WSGI capable tools of course, since it's the Python standard for web applications.

The suggested tools are:

  • WebOb: Request / Response wrapper
  • Routes: Request dispatcher

Both are used in numerous web frameworks and are rock-solid. WebOb is maintained by Ian and provides simple objects for reading requests and writing responses.

Routes is a non-intrusive request dispatcher inspired by Ruby on Rails and used by Pylons. Unlike some other request dispatchers, Routes doesn't force developers to organize the application as a tree of objects. It is somewhat similar to Django's URL dispatcher, but without the regular expressions.
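
To make the "flat route table instead of a tree of objects" idea concrete, here is a minimal sketch that uses only the Python standard library. It deliberately does not use WebOb or Routes: in the proposed stack, Routes would do the matching and WebOb would wrap the WSGI environ. The URL templates, handler names and the `call` helper are assumptions made up for illustration, and the code targets a current Python 3 rather than the 2.6 listed above.

```python
import re
from wsgiref.util import setup_testing_defaults

# Hypothetical route table: (HTTP method, path template, handler name).
ROUTES = [
    ("GET", "/{username}/info/collections", "get_collections"),
    ("GET", "/{username}/storage/{collection}", "get_collection"),
]


def compile_template(template):
    """Turn '/{a}/x/{b}' into a regex with one named group per placeholder.

    The literal parts are not regex-escaped, so this sketch only suits
    templates made of plain path characters.
    """
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    return re.compile("^" + pattern + "$")


COMPILED = [(method, compile_template(tpl), name) for method, tpl, name in ROUTES]


def get_collections(environ, username):
    return "collections for %s" % username


def get_collection(environ, username, collection):
    return "%s items for %s" % (collection, username)


HANDLERS = {"get_collections": get_collections, "get_collection": get_collection}


def application(environ, start_response):
    """WSGI entry point: find the first matching route and call its handler."""
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "")
    for route_method, regex, name in COMPILED:
        match = regex.match(path)
        if match and route_method == method:
            body = HANDLERS[name](environ, **match.groupdict()).encode("utf-8")
            start_response("200 OK", [("Content-Type", "text/plain"),
                                      ("Content-Length", str(len(body)))])
            return [body]
    body = b"Not Found"
    start_response("404 Not Found", [("Content-Type", "text/plain"),
                                     ("Content-Length", str(len(body)))])
    return [body]


def call(method, path):
    """Invoke the app in-process, the way a WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)
    environ["REQUEST_METHOD"] = method
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(application(environ, start_response))
    return captured["status"], body.decode("utf-8")
```

Handlers here are plain functions looked up by name, so adding a URL means adding one line to the table rather than a new object in a hierarchy.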

Data: python-ldap + SQLAlchemy

Sync currently uses a mix of MySQL and LDAP for storage.

For LDAP, Python has python-ldap, which is a robust connector for OpenLDAP and Active Directory.

For SQL, we want to use SQLAlchemy. It is the most widely used ORM in the Python community and has very advanced features. Besides its ORM features, SQLAlchemy provides useful low-level APIs that allow us to bypass the ORM logic when we need more speed.

SQLAlchemy has drivers for all major DB systems, meaning that we can run the server on SQLite, MySQL or Postgres just through configuration. This is useful for the tests, and for running the server on any box with no hassle.

In any case, a DB abstraction layer like the one we have in PHP will let us implement any kind of backend for experiments (dummy in-memory, GAE's specific ORM, S3, etc.).
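
As a rough illustration of such an abstraction layer, the sketch below defines a storage interface and the "dummy in-memory" backend mentioned above. The class names, method names and the `get_storage` registry are all hypothetical; they are not taken from the PHP server or from any existing Sync code.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical storage interface; the method names are illustrative only."""

    @abstractmethod
    def set_item(self, user_id, collection, item_id, payload):
        """Store one item for a user."""

    @abstractmethod
    def get_item(self, user_id, collection, item_id):
        """Return the stored payload, or None if it does not exist."""

    @abstractmethod
    def get_collection_names(self, user_id):
        """Return the sorted names of the user's collections."""


class MemoryStorage(StorageBackend):
    """Dummy backend that keeps everything in nested dicts (useful for tests)."""

    def __init__(self):
        self._data = {}  # user_id -> collection -> item_id -> payload

    def set_item(self, user_id, collection, item_id, payload):
        items = self._data.setdefault(user_id, {}).setdefault(collection, {})
        items[item_id] = payload

    def get_item(self, user_id, collection, item_id):
        return self._data.get(user_id, {}).get(collection, {}).get(item_id)

    def get_collection_names(self, user_id):
        return sorted(self._data.get(user_id, {}))


# Hypothetical registry: a configuration value selects the backend by name.
BACKENDS = {"memory": MemoryStorage}


def get_storage(name, **options):
    """Instantiate the backend registered under `name`."""
    return BACKENDS[name](**options)
```

A SQL-backed class would implement the same three methods, so the rest of the application only ever talks to `StorageBackend`.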

[Given the simplicity of the data model, we'll probably want to use SQLAlchemy without any of its ORM features, just as a SQL/connection abstraction layer -- Ian]

[I ran a few benchmarks this morning to quickly check the overhead SQLAlchemy adds on pure reads. Each test creates a connection and runs one select, or several selects, against the users table (5000 rows).

As expected, the overhead is quite small when SQLAlchemy is used without binding objects, and it is sometimes even a bit faster when we don't retrieve many rows.


 $ bin/python bench.py
 sqlalchemy with ORM -- fetching one line: 0.1636 seconds
 sqlalchemy low-level -- fetching one line: 0.1580 seconds
 pure mysqldb -- fetching one line: 0.1689 seconds
 sqlalchemy with ORM -- fetching all: 0.2108 seconds
 sqlalchemy low-level -- fetching all: 0.2047 seconds
 pure mysqldb -- fetching all: 0.1549 seconds
 sqlalchemy with ORM -- getting one user: 0.0386 seconds
 sqlalchemy low-level -- getting one user: 0.0310 seconds
 pure mysqldb -- getting one user: 0.0306 seconds
 sqlalchemy with ORM -- ALL: 0.3116 seconds
 sqlalchemy low-level -- ALL: 0.3109 seconds
 pure mysqldb -- ALL: 0.2739 seconds


I'll finish the SQLAlchemy version and also write the MySQL one, since it is quick to do, and I propose that we come back and run a real benchmark once we have a first version of the storage APIs in Python.

I suspect that there will be no difference at all in the end (it's slower only when we retrieve lots of rows, since SQLAlchemy does some wrapping). So, if SQLAlchemy turns out to be as fast as the MySQL backend alone, there are only benefits to using it.

The bench module is here; feel free to play with it and change it: http://hg.mozilla.org/users/tziade_mozilla.com/sandbox/file/tip/bench.py ]
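
For readers who cannot reach the linked module, the following is a sketch of how a read micro-benchmark of this shape could be structured. It is not the bench.py referenced above: it uses the standard-library `sqlite3` and `timeit` modules instead of SQLAlchemy and MySQLdb, reuses a single in-memory connection rather than reconnecting on each run, and all function names are assumptions.

```python
import sqlite3
import timeit

ROWS = 5000  # same table size as the run quoted above


def make_db():
    """Create an in-memory users table filled with ROWS rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (id, username) VALUES (?, ?)",
        [(i, "user%d" % i) for i in range(ROWS)],
    )
    conn.commit()
    return conn


def fetch_one(conn):
    """Fetch the first row only."""
    return conn.execute(
        "SELECT id, username FROM users ORDER BY id LIMIT 1"
    ).fetchone()


def fetch_all(conn):
    """Fetch every row of the table."""
    return conn.execute("SELECT id, username FROM users").fetchall()


def get_user(conn, user_id=42):
    """Look up a single user by primary key."""
    return conn.execute(
        "SELECT id, username FROM users WHERE id = ?", (user_id,)
    ).fetchone()


def bench(label, func, conn, number=20):
    """Time `number` calls of func(conn), print the total and return it."""
    elapsed = timeit.timeit(lambda: func(conn), number=number)
    print("%s: %.4f seconds" % (label, elapsed))
    return elapsed


def run_all():
    """Run the three read cases against a fresh database."""
    conn = make_db()
    return {
        "fetching one line": bench("sqlite3 -- fetching one line", fetch_one, conn),
        "fetching all": bench("sqlite3 -- fetching all", fetch_all, conn),
        "getting one user": bench("sqlite3 -- getting one user", get_user, conn),
    }
```

Calling `run_all()` prints one timing line per case. The figures depend entirely on the machine and database, so they say nothing about the SQLAlchemy versus MySQLdb comparison above.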


Server: Apache + mod_wsgi

WSGI applications are quite simple to set up in a web server. There are specialized tools like uWSGI or Gunicorn, mod_wsgi for Apache, a module for nginx, lighttpd setups using fcgi, etc.

Since the current stack uses Apache, it would probably be simpler and faster to use mod_wsgi. But once we have a first working Python server, investigating alternative setups with some benchmarks could be useful to see if we can lower the CPU and memory usage. ("yes, we can!")

Interesting stacks to bench are:

  • lighttpd/nginx + uWSGI
  • nginx + Gunicorn
  • lighttpd + scgi + flup

[From what I've seen most configurations perform similarly once the application does anything. Once everything is in place we could do some performance tests, but we wouldn't want to be premature about it, and mod_wsgi is a reasonable default choice. We'd want to test our full stack with things like Slowloris as well. -- Ian]

Development environment

The application should come with a Paster configuration so it can be launched locally using Paster's integrated web server, and a SQLite DB.

Also, for general testing purposes, a CentOS virtual image with the same environment as production could be useful for experimentation.

(XXX I don't have a solution yet for avoiding standalone OpenLDAP installations. I am investigating the idea of having a micro Python LDAP server. For the tests we can always mock it XXX)

[I think we can make an abstract interface similar to the data store, something that fetches/stores user information, with ldap as one implementation of that abstraction -- Ian]
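
A minimal sketch of the abstraction suggested in the comment above: a user-information interface with an in-memory implementation that tests and local development could use instead of an OpenLDAP install. All names are hypothetical, and the LDAP implementation (which would wrap python-ldap) is intentionally left out.

```python
import hashlib
import hmac
import os
from abc import ABC, abstractmethod


class UserBackend(ABC):
    """Hypothetical user-information interface; LDAP would be one implementation."""

    @abstractmethod
    def create_user(self, username, password, email):
        """Create the user; return False if the name is already taken."""

    @abstractmethod
    def authenticate(self, username, password):
        """Return True only if the user exists and the password matches."""

    @abstractmethod
    def get_email(self, username):
        """Return the user's email, or None for an unknown user."""


class MemoryUserBackend(UserBackend):
    """In-memory stand-in for development and tests, not for production use."""

    def __init__(self):
        self._users = {}  # username -> (salt, password_hash, email)

    @staticmethod
    def _hash(password, salt):
        # Low iteration count chosen only to keep tests fast.
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 1000)

    def create_user(self, username, password, email):
        if username in self._users:
            return False
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt), email)
        return True

    def authenticate(self, username, password):
        record = self._users.get(username)
        if record is None:
            return False
        salt, expected, _email = record
        # Constant-time comparison of the stored and computed hashes.
        return hmac.compare_digest(expected, self._hash(password, salt))

    def get_email(self, username):
        record = self._users.get(username)
        return record[2] if record else None
```

Because the application would depend only on `UserBackend`, the same test suite could run against this class locally and against an LDAP-backed class in an environment that has a directory server.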