User:Ffledgling/Senbonzakura


This service, which I'm calling Senbonzakura (or 'SBZ' for those who prefer TL;DRing everything), will generate partial MAR (Mozilla ARchive) files for updates from version A to version B of Firefox on demand.

Demo Available -- Prototype 0.1 Demo [Warning: large, ~105 MB file]


Etherpad: https://etherpad.mozilla.org/58uQVvAYbA

Immediate Tasks:

  1. Code Cleanup, logging, exceptions
  2. Code review/feedback
  3. Tooling, what to do
  4. File level caching (see make_partial_update.[py,sh])
  5. Proper API, what inputs do we want to give it, what format? (Form request POST? JSON?)
  6. Need for DB/Queues
  7. Worker/App duration db/cache connections vs. request level db/app connections
  8. RelengAPI vs. Individual app
  9. Celery on RelengAPI
  10. Space optimization: do all the diffs beforehand, then package the MAR on the fly
  11. Any others?

Benefits

  • Generate updates on the fly.
  • Generate updates on a need-only basis.
  • Separate the update MAR generation process from the build process (speed up ze builds!)
  • Update generation as a service rather than a step that simply happens during the build process. This makes the updates available to a wider audience, although the consequences of doing so are a little unclear at the moment.
  • Helps transition older/no-longer-supported Firefox versions to newer Firefox versions without adding delays and/or compute time during the build process.
  • Greater flexibility in what update paths we need/want.

Open Issues

This is a list of 'issues' that have no definite solution at the moment but are important in some way or another, and thus need to be noted.

  • Figure out tool versioning.
  • Integration with Releng API (need to talk to dustin after we have a concrete prototype)
  • Parallelizing the MAR build process further by using separate celery workers or subprocess calls to fetch the MARs and do diffs on larger files (ref: Level 2 caching)
  • Do we need end-to-end testing? Mozmill has a suite of tests called update tests that apply a MAR and check whether the update applied correctly; can we/do we want to use this to test our prototype? How is QA affected when we change the way we generate our updates? Can they still test whether Firefox updates correctly? We might want to talk to Henrik (:whimboo) or Clint (:ctalbert) eventually. (See the conversation snippet at the end.)
  • What do we want to use for our Caching layer? Why is X better/preferred over Y?
  • There seems to be some confusion about whether all the required tooling will be available somewhere (even in-tree) for some of the older Firefox versions (talk to bhearsum & catlee)
  • Use SHA-512 or another hash function instead of MD5
  • Other open issues?

Pertinent Questions

A subset of Open Issues; using this section as a scratchpad to note down issues, polish them later, and move them up to the Open Issues section.

  • Does the client require the request to be synchronous or asynchronous?
  • Does the client require any progress information?
  • Will any client need to ask if the partial MAR already exists?
  • How will cache maintenance/invalidation be handled? (Same API, admin API, CLI, scripts, docs?)
  • What types of docs are planned?

Structure

Service Signature

Input : URL for CompleteMAR1, URL for CompleteMAR2, CompleteMAR1 Hash, CompleteMAR2 Hash

Output : PartialMAR1-2 (Available to the user/client in some form)

Web API implementation

  1. GET

Sent to /partial/<identifier>, where identifier is a valid identifier returned by a POST request previously sent to /partial/.

  2. POST

Sent to /partial/. The POST request is used to request the generation of the partial MAR file to update from a source MAR to a destination MAR.

The parameters that need to be passed in as part of the POST request are:

  mar_from        : HTTP URL to the complete source MAR file
  mar_to          : HTTP URL to the complete destination MAR file
  mar_from_hash   : MD5 hash of the source MAR file
  mar_to_hash     : MD5 hash of the destination MAR file
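
As a rough illustration of the intended flow, a client interaction might look like the sketch below. The host name, URLs, parameter encoding (form vs. JSON is still an open question), and response shape are all assumptions, not the final API:

  import requests

  BASE_URL = 'http://senbonzakura.example.org'  # hypothetical deployment

  params = {
      'mar_from': 'http://ftp.mozilla.org/.../firefox-28.0.complete.mar',
      'mar_to': 'http://ftp.mozilla.org/.../firefox-29.0.complete.mar',
      'mar_from_hash': '<MD5 of the source MAR>',
      'mar_to_hash': '<MD5 of the destination MAR>',
  }

  # POST kicks off (or looks up) the partial MAR build and hands back an
  # identifier; the response field name is a guess.
  resp = requests.post(BASE_URL + '/partial/', data=params)
  identifier = resp.json()['identifier']

  # GET polls for / fetches the generated partial MAR.
  partial = requests.get(BASE_URL + '/partial/' + identifier)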

Although the API has been concretized to some extent, it is still subject to change based on one or more of the following factors:

  • The kinds of requests that will be sent to the API
  • The client and/or the target audience using the API. Possible clients include:
    • Balrog
    • Developers
    • QA
  • Requirements for usage of the other HTTP verbs
    • PUT : What would we want to 'PUT' on our servers?
    • PATCH : What would we want to PATCH on our servers? (Monkey patching code? Probably not a good idea.)
    • DELETE : Delete/stop serving a partial MAR from Senbonzakura
  • The need for a separate Admin API. The admin API would theoretically allow an authenticated entity known as the 'Admin' to perform administrative tasks on the service, such as (but not limited to):
    • Restarting partial MAR builds that have aborted, stalled, or failed for some reason
    • Control over the cache, such as:
      • Flushing/invalidating the cache
      • Removing selected files
    • Resetting the entire database
  • Extra information that the API might need to expose

Resources regarding REST API design:

Caching

This service will be dealing with and generating a lot of files. It therefore makes sense to have an underlying caching layer that stores the generated and downloaded files/tools.

The caching layer can be implemented in a number of ways, some of the initial ideas being:

  • As storage on Amazon S3
  • As a shared NFS file-system
  • Local storage on the nodes (probably not the best way)

There are certain requirements imposed on the caching layer, and more might be added as the requirements clear up. Some of these requirements are as follows:

  • Must be agnostic to the file type being stored in the cache.
  • Accessing the cache must be much faster than fetching the files via a direct download.
  • The caching layer should provide an identifier that can be used to uniquely identify and reference the files in the cache.
  • The caching layer should ideally have fast read, write, and lookup, but in a trade-off among the three, lookup and read need to be the faster operations (they will ideally be used much more than anything else).
  • OPTIONAL: A method to access files via the identifier over the network, so that clients/users can directly access the files in the cache without Senbonzakura acting as a middle man.
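
As a sketch of what such an abstraction might look like (cache.py is still a stub, so all of these names are placeholders, not the actual interface):

  class Cache(object):
      """File-type-agnostic cache interface."""

      def save(self, local_path, identifier):
          """Store the file at local_path under the given identifier."""
          raise NotImplementedError

      def retrieve(self, identifier, dest_path):
          """Copy the cached file for identifier to dest_path."""
          raise NotImplementedError

      def exists(self, identifier):
          """Fast lookup: do we already have a file for this identifier?"""
          raise NotImplementedError

An S3, NFS, or local-disk backend would subclass this and implement the three methods against its own storage, keeping the rest of the service agnostic to where files actually live.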

There are two levels of caching that are planned for this service, detailed as follows:

Level 0

This level simply keeps track of the downloaded files and their hashes on the worker's local file system. This cache is not persistent and is not meant to be; it exists simply for convenience.

This Cache level has not been stubbed out yet and may or may not make it into the service.

Requires Discussion

Level 1 Caching

This level does caching at the MAR level. Downloaded complete MARs are cached to save bandwidth and improve speed during the Partial MAR generation phase.

Partial MARs are stored in the cache after generation and are returned after a cache lookup when requested by the client.

Each file is identified by a unique identifier, which at the moment is the MD5 hash of the file, for lack of a better function.
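
For reference, computing that identifier is a thin wrapper over hashlib, roughly as follows (switching to SHA-512, per Open Issues, would be a one-line change):

  import hashlib

  def file_identifier(path, chunk_size=1024 * 1024):
      """Compute the cache identifier for a file: currently its MD5 hash."""
      digest = hashlib.md5()
      with open(path, 'rb') as f:
          # Hash in chunks so large complete MARs don't have to fit in memory.
          for chunk in iter(lambda: f.read(chunk_size), b''):
              digest.update(chunk)
      return digest.hexdigest()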

Level 2 Caching

A lot of the bigger content between releases, such as the XUL libraries on every platform, remains the same across locales; this locale-independent content should probably be cached and re-used. This level caches the individual files inside the different MAR versions.

The idea is to avoid re-doing work that has already been done, by re-using existing diffs and by recognizing files that don't need to be diffed at all.

Take the example of the XUL binary: it is an extremely large binary that accounts for a very large chunk of the total time needed to generate a partial MAR. If we can recognize that the XUL binary has not changed, we can skip the binary-diffing step, which should theoretically save a lot of compute time and resources. The binary diff between two different XUL binaries is also worth caching, because it is likely to be common across all Firefox version updates regardless of locale; as long as we can recognize the duplicated effort, having the diffed binary at hand should speed up partial MAR generation.

The actual recognition logic will be separate from the caching layer and ideally a part of the partial MAR generation/diffing service.
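
As a rough sketch of what that recognition logic might do, assuming both complete MARs have already been extracted into directories (the function name and approach are illustrative only):

  import hashlib
  import os

  def unchanged_files(src_dir, dst_dir):
      """Return relative paths present in both extracted MARs with identical
      contents, so the diff step can be skipped for them."""
      def hash_tree(d):
          result = {}
          for root, _, files in os.walk(d):
              for name in files:
                  path = os.path.join(root, name)
                  rel = os.path.relpath(path, d)
                  with open(path, 'rb') as f:
                      result[rel] = hashlib.md5(f.read()).hexdigest()
          return result
      src, dst = hash_tree(src_dir), hash_tree(dst_dir)
      return [rel for rel in sorted(src) if dst.get(rel) == src[rel]]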

Implementation details

Dependencies

Nearly everything the application uses is pip-installable, but the host machine must provide a few things that might not be. The known ones are:

  • RabbitMQ (or any kind of message queue to be used by Celery)
  • Virtualenv
  • Python 2.7

File Structure

  1. api.py
    This file contains all the Flask-related code for routing and handling the API call parameters.
  2. cache.py
    This is currently a stub file that contains function prototypes for the caching layer.
  3. core.py
    This file contains all the core logic for building and generating MARs.
  4. csum.py
    This file contains checksum calculation and verification functions, mostly just a convenient wrapper over Python's built-in hashlib.
  5. db.py
    This file contains the database utilities, in essence wrapper functions that make Insert, Update, and Search operations on the database more convenient to use.
  6. db_classes.py
    Defines the database schema and provides other convenient exceptions and Enum-style dicts for status codes. Used directly only by db.py.
  7. fetch.py
    This file contains methods to download/fetch files given a URI.
  8. flasktask.py
    This file contains a class for Flask and Celery integration, but it isn't actually used at the moment. Might be removed.
  9. tasks.py
    This file contains wrappers that call the core functions from core.py for Celery (see the sketch below).
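
To make the api.py/core.py/tasks.py split concrete, a tasks.py wrapper might look roughly like the sketch below; the core function name and broker URL are guesses, not the actual code:

  from celery import Celery

  import core  # Senbonzakura's core.py, with the actual build logic

  app = Celery('senbonzakura', broker='amqp://localhost//')  # RabbitMQ

  @app.task
  def build_partial_mar(mar_from, mar_to, mar_from_hash, mar_to_hash):
      """Hand the long-running download/diff/build off to a Celery worker,
      so the Flask request can return an identifier immediately."""
      return core.generate_partial_mar(mar_from, mar_to,
                                       mar_from_hash, mar_to_hash)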

Known Issues:

  • DB errors are handled very poorly at the moment.
  • Parameter validation in the Flask API as well as in the DB wrapper functions is poor; invalid parameters or empty strings slip through.
  • unwrap_full_update.pl and make_incremental_update.sh are known to require chmod +x or the equivalent, otherwise the subprocess calls fail with a PermissionDenied error (see the workaround sketch after this list).
  • The DB doesn't seem to handle repeat triggers very well; something needs to be improved in that portion of the code.
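
A possible workaround for the chmod +x issue, applied before the subprocess calls are made:

  import os
  import stat

  def ensure_executable(path):
      """Make the helper scripts executable so subprocess can run them."""
      mode = os.stat(path).st_mode
      os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

  ensure_executable('unwrap_full_update.pl')
  ensure_executable('make_incremental_update.sh')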

Unit Tests

api.py              : N/A

cache.py            : N/A

core.py             : N/A

csum.py             : N/A

db.py               : N/A

db_classes.py       : N/A

fetch.py            : 
                     test_correct_download
                     test_incorrect_download
                     test_file_save
                     test_existing_file_save

flasktask.py        : N/A

tasks.py            : N/A

Things to take care of:

Use a resilient retry library while fetching (bhearsum's redo is a good one to look at)
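
For instance, a download wrapped in redo's retry() might look roughly like this (download_file is a hypothetical name for whatever fetch.py ends up exposing):

  from redo import retry  # pip-installable as 'redo'

  import fetch  # Senbonzakura's fetch.py

  # Re-run the download with backoff if it raises, instead of failing the
  # whole build on one flaky network hiccup.
  retry(fetch.download_file,
        attempts=5, sleeptime=10,
        args=('http://ftp.mozilla.org/.../firefox-29.0.complete.mar',
              '/tmp/firefox-29.0.complete.mar'))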

Catch exceptions and raise the correct exceptions at different parts of the code. Currently a lot of places have a commented-out raise; these need actual custom exceptions, and they need to be raised. These and other exceptions need to be caught and handled properly so that the build does not fail partway through, and if it does, there's enough traceback or logging to debug it.

Replace all the print statements with logging statements and LOG ALL THE THINGS ~!
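
A minimal sketch of the print-to-logging switch:

  import logging

  log = logging.getLogger('senbonzakura.core')

  def diff_files(src, dst):
      # Before: print 'Diffing %s -> %s' % (src, dst)
      log.info('Diffing %s -> %s', src, dst)  # lazy %-style formatting
      try:
          pass  # the actual mbsdiff subprocess call would go here
      except Exception:
          # log.exception() records the full traceback, which is exactly
          # what we need when a build dies halfway through.
          log.exception('Diff of %s -> %s failed', src, dst)
          raise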

Unit-test ALL of the things!

  • Determine which versions of the mar and mbsdiff tools to use, and use them.
    These tools probably need to be cached as well, maybe based on their own version, maybe based on the Gecko version. Simply keep a function that decides which one to use and points you to the right one; use the one given by that function and assume the abstraction holds.
    We might have to cache these as well, based on the update paths we're given.
  • Cache the generated partial MAR file based on the update path, or on a combination of the hashes of the input MAR files (see the sketch after this list).
    Where and how the partial MARs are actually cached again depends on our caching strategy; we simply use our abstraction functions.
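
A minimal sketch of the hash-combination scheme (the function name is made up, and the real keying strategy is still open):

  import hashlib

  def partial_mar_identifier(mar_from_hash, mar_to_hash):
      """Key the cached partial MAR off the hashes of the two complete MARs
      it was generated from."""
      combined = (mar_from_hash + '-' + mar_to_hash).encode('ascii')
      return hashlib.md5(combined).hexdigest()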

Tooling

We need to figure out which tools to use with any given combination of CompleteMAR files. There are at least three different versions of these tools, and there is no central location for them.

Tools also fall into two categories:

  1. The partial MAR generation scripts.
  2. The mar and mbsdiff binaries.

These live in separate locations and it might be in our best interest to consolidate them.

To be able to decide which tools to use with the targeted version of Firefox, we need to figure out a tool version --> FF version mapping. To the best of my knowledge, and based on feedback from Ben and Catlee, such a mapping does not exist at the moment and will need to be built as part of the project going forward.
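
Once such a mapping exists, the lookup could be as simple as the sketch below; every version range and tool name here is invented for illustration, since the real table still has to be researched:

  # Hypothetical tool version --> FF version mapping.
  TOOL_VERSIONS = {
      (4, 12): 'mar-tools-v1',     # (lowest FF major, highest FF major)
      (13, 27): 'mar-tools-v2',
      (28, None): 'mar-tools-v3',  # None = no upper bound yet
  }

  def tools_for_version(ff_major):
      """Return the tool bundle to use for a given Firefox major version."""
      for (low, high), tools in TOOL_VERSIONS.items():
          if ff_major >= low and (high is None or ff_major <= high):
              return tools
      raise ValueError('No known MAR tools for Firefox %s' % ff_major)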

How do we handle fetching/building/using the tools? Issues:

  • Tools like mar and mbsdiff are built as part of a Firefox build. Their source code exists in Mozilla Central, but the compiled binaries are produced during the build and available on FTP.m.o after a build has completed. Do we pull the source in and compile them? Do we keep pre-compiled versions at hand?
  • To move to a central repo or not to move to a central repo, that is the question.
  • As ranted about above, versioning.

Note on Scaling, Resilience and Caching

It is probably best to design for scalability, resilience, and caching from the ground up, so things to keep in mind are:

  • Retry, retry, retry
  • Log more than enough to debug (see Things to take care of above)
  • Have our application/service start up from a config file (see the sketch after this list)
  • Do not trust your machine to store state; keep it on disk or in a database.
    We now use an SQL database to do this.
  • Abstraction, abstraction, abstraction?
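
A sketch of the config-file idea, assuming an INI-style file; the section and option names are placeholders:

  try:
      import configparser                    # Python 3
  except ImportError:
      import ConfigParser as configparser    # Python 2.7, per Dependencies

  config = configparser.ConfigParser()
  config.read('senbonzakura.ini')

  BROKER_URL = config.get('celery', 'broker_url')
  CACHE_URI = config.get('cache', 'uri')
  DATABASE_URI = config.get('db', 'uri')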

How do we optimize our caching? That will depend on the caching strategy and the underlying caching layer in use.

Signing and Certs

Still very hazy on how this plugs into the rest of the system, where it's needed, and how (if at all) it changes things. Feedback needed from catlee, nthomas, bhearsum.

Issues

  1. Catlee's partials-on-demand vs. nthomas's something else
  2. Signing explanation
  3. What do we do about the tool versioning?

Implementation questions

  • Will we have mar installed?
  • How do we handle multiple mar versions?
  • Where do we put them?
  • Does it make sense to modify the script? Probably not, because we have no control over the older scripts
  • How do we fetch the tools? Just the ones we need, without cloning all of MC.

Deliverables

I do not have a concrete idea of the deliverables so everything below is subject to possibly radical change, but for now, this is what makes sense to me:

Prototype 0.1

The initial prototype will simply be a bunch of Python that takes the input MAR URLs, diffs them, and spits out the partial MAR.

Prototype 0.2

The second prototype starts to add the caching functions, resilience logic, and mar/mbsdiff tool-versioning logic, and generally attempts to map out the entire structure/flow of the code.
We should probably have some ideas about the certs as well at this point in time.

Deliverable 1.0

Have all the basic services up and running with our partial MAR (Level 1) caching in place; ideally, try deployment on a machine in the cloud and let it run for a bit to see how things go.

Deliverable 1.x

Change things around based on feedback from various team members, fine-tune the system, add requested features, and most importantly iron out glitches and swat those bugs.

Unit Tests

Unit-Test as much code as possible

Docs

Keep documenting work as it is done.
Use this wiki for general documentation.
Use Sphinx for API-level documentation.

Repository

We have a GitHub repo -- Senbonzakura

People to contact

In no particular order:

  1. bhearsum
  2. catlee
  3. nthomas
  4. hwine
  5. dustin

Related Bug #s

Relevant Links

IRC Conversation snippets

Conversation with Henrik re: Browser update testing

22:15 < ffledgling> I was wondering if it's possible to use mozmill to test browser updates
22:15 < ffledgling> but with custom MAR files and make sure they applied correctly?
22:16 <@whimboo> browser updates? thats something we are doing for a long time
22:16 < ffledgling> I think I found some tests that do what I want with the actual updation from offical servers -- http://hg.mozilla.org/qa/mozmill-tests/file/tip/firefox/tests/update/testDirectUpdate/
22:16 <@whimboo> the only thing you would have to do is to set the right update url
22:16 < ffledgling> whimboo: yes, but I want to use a custom MAR
22:16 <@whimboo> right
22:17 < ffledgling> ah, can you point me to how I can configure that?
22:17 <@whimboo> as said you would have to modify the update server url
22:17 <@whimboo> app.update.url
22:18 <@whimboo> just change that pref and ensure to send correct update snippets