SummerOfCode/SampleApplications/3
This is one of three real applications from Summer of Code 2009 which we thought were good enough to put up as examples. Having said that, none of them are perfect, so please treat them as inspirational material, not as templates. :-)
Transferring web pages with delta encoding algorithms
Abstract
Modern web sites often serve dynamic resources that cause obstacles to caching them. HTTP delta encoding could greatly alleviate this problem by transferring only deltas between different versions of resources. In this proposal, I suggest an extension to HTTP/1.1 to support this idea. As a proof of concept, I also propose to implement a Firefox extension as well as a matching server-side program.
I hope this project will be a solid move towards the standardization of HTTP delta encoding.
Content
Personal Details
- Name: Yin QIU
- Email: hiddenforprivacy@gmail.com (my GTalk account too)
- MSN: hiddenforprivacy@msn.com
Project Proposal
I propose to implement a series of programs as a PoC for the idea of transferring web pages with delta encoding algorithms.
Principles
Dynamic resources are often employed by modern web sites, espcially dynamic web pages. They are hard to cache because even a minor change to the content will result in a cache miss. HTTP delta encoding [6] could alleviate this problem. Instead of sending the entire resource, a web server in could send the difference (delta) between the new version and the old version of the resource to the client (could be a browser or a proxy). If the size of the delta is smaller than the resource, which happens when two versions resemble each other a lot, we'll benefit from this scheme.
To compute the delta, the server could maintain older versions of resources, and the client then includes extra information specifying the resource version in its request. An alternative is to use encoding algorithms such as rsync [3] and crcsync [7]. Roughly speaking, they all partition the data into chunks and compute the checksums of chunks. A client sends a request with the checksums. The delta is computed by the server based on the resource in hand and these checksums, and is then transferred to the client. Finally, the client is able to compute the up-to-date content of the resource according to its local old copy of resource and the received delta.
Project scope
Ideally, HTTP delta encoding can be useful if it is well supported by all the three parties commonly involved in the Web, namely web servers, browsers, and proxies. Because of the time limit, this project will only implement initial support for two of them. More specifically, I will develop a Firefox extension that supports sending a variant form of HTTP requests, and a server program (to be defined below) that can handle such kind of requests.
Technical details
In this section, I will describe the technical details of my project.
Protocol
Extension must be made to existing HTTP so as to enable delta encoding between hosts. I intended to use HTTP/1.1 as the starting point. Here are some my preliminary thoughts:
Request message: First, we need a way to let a client claim itself to support delta encoding. This is achieved by adding a "Accept-Delta" header whose value specifies the accepted encoding algorithms. Unlike in [8], I will allow request messages to optionally include checksums in their message-body field. This implies that the server does not need to keep multiple versions of a resource.
Response message: Corresponding to the request message, we also need a way to allow a server to specify its ability to do delta encoding. There are two mechanisms. First, an added status code, say 220, is used to indicate that a response is encoded. Second, a "Delta" header is used to designate the algorithm used in encoding the delta. The "Delta" header could be present even if the status code is not 220, in which case it solely serves as an delta-encoding ability indicator.
The protocol design is still incomplete. For example, I didn't consider much for proxies. Disorder may arise when "traditional" hosts and delta-encoding-enabled hosts work together. But I expect to discuss with my mentor to work it out. The final solution will be well documented and be part of assessing deliverables.
Firefox extension
To convenience purpose, I will refer to this extension as Tri throughout this section. Tri comes from the word "triangle", which is the symbol for uppercase Greek letter delta.
Tri will probably involves an XPCOM component to support sending HTTP requests defined in the previous section. Alternatively, it may change requests generated by Firefox just like the ModifyHeaders [9] extension. However, according to my description above, we need to change not only the headers but also the message-body. So something more have to be done here.
To better integrate with Firefox, Tri also ought to have access to the cache facilities of Firefox. Since we already got something like CacheViewer [10], this wouldn't be a problem.
Tri also has to intercept the data received by data. This is an area I haven't explored yet. Once some data is intercepted, Tri should examine if it is a response from a delta-encoding-enabled server. And if so, it should subsequently decode the data.
Tri maintains a whitelist of those servers that support delta encoding. A request whose destination is on this list will be enhanced by adding the checksum payload, provided that the requested resource has been cached locally. For those resources whose hosts are not on the whitelist, Tri will send an ordinary request to the server, except that it uses an Accept-Delta header to tell the server that it supports delta encoding. If the contacted server fortunately returns a response with header "Delta" present, Tri will add this server to its "whitelist".
Server-side program
There are two candidates for a server-side program. One is relatively simple, while the other is not. I'll choose one depending on the project progress and future discussion with my mentor.
The server-side program could simply be a basic HTTP server, probably developed in Python. A better candidate is to have a handy HTTPD module. No matter which one we choose, the server-side program has to understand the extended HTTP request (sent by Tri). It should be able to identify a client that supports delta encoding by reading the Accept-Delta header. To handle a request, it would extract the checksums contained in the request's message-body, compute the delta based on the resource it has, and finally send the response to the client. The response would have its status code set to 220 and contain the encoding algorithm used.
Schedule of Deliverables
First of all, I'm planning to devote 4 to 5 hours each day in the GSoC period to this project on average. I'm certainly willing to make more efforts if the project is behind schedule. It is supposed that I will have no other commitments during this summer.
The final deliverables would consist of:
- A written specification of the protocol that is to be followed throughout the development.
- A Firefox extension that supports at least two encoding schemes (e.g., XOR and rsync).
- A server side program that supports returning only the encoded difference between a page and its older version. This program could be an httpd module, or simply a primitive HTTP server that can serve some predefined data and can generate deltas.
- And perhaps some test cases.
Staring on April 20, when Google announces the accepted students, the whole available time of GSoC is about 16 weeks. Accordingly, my planned schedule is:
- Week 1 - Week 3: Get to know the community; discuss the protocol details with my mentor.
- Week 3 - Week 5: Document the protocol; start to develop Tri.
- Week 5 - Week 7: Implement the function that sends custom HTTP requests.
- Week 7 - Week 11: Make Tri be able to read the cached resources and send XOR-encoded checksums to the server. (milestone for mid-term evaluation)
- Week 11 - Week 14: Incorporate rsync into Tri; start to code for the server-side program
- Week 14 - Week 16: server-side program done; gather some statistics by doing preliminary experiments (goal for final evaluation)
Note these are approximated schedule. Please allow minor adjustments (within 3 to 4 days) according to the actual project status.
Open Source Development Experience
Personally, I started two open-source projects on SourceForge, though I am no longer actively maintaining them because of lack of time. One is called iChatLE [1], a peer-to-peer chat utility written in Java; the other is Clairv [2], which is a distributed search engine framework also written in Java.
Besides, I pay close attention to the open source world. Conforming to their licenses, I have been actively utilizing open-source tools and libraries in my development. And yes, I remember I reported a bug to Eclipse when I interned in IBM, but unfortunately I cannot find it in its Bugzilla now.
Work/Internship Experience
I have experience of two internships:
- Microsoft Research Asia (January 2008 - May 2008): involved some data processing tasks at the Data Intelligence and Tools Group.
- IBM China Development (July 2007 - January 2008): did federated test against various IBM products and wrote a simple survey web-app for the group, at the department of Federated Integration Test.
In the summer of 2008, I also participated in Haiku OS's Code Drive and tried to finish the ICMP implementation of its network protocol stack. I read a lot materials at that time, including RFC 792, 1122, and 1191, and discussed issues with my mentor. Unfortunately, I failed to accomplish that task because of lack of time. But I still gained a lot from it, for example, network programming and communication with a community. (FYI, I'm planning to continue with this failed project in the near future, as long as it is not taken.)
Academic Experience
I graduated with a B.S. degree in software engineering from Nanjing University, China, in 2008, ranking 20/200. Now I am pursuing a master degree in the same university. I am particularly interested in distributed systems. My B.S. thesis was based on the previously-mentioned Clairv project. I, together with another two classmates, managed to extend Clairv by adding prototype peer-to-peer search functions to it based on Pastry [5].
Why Mozilla
First, the idea itself intrigues me a lot. I believe this project would be useful towards the standardization of HTTP delta encoding. Second, as a long-time Firefox and Thunderbird user, I admire the open spirit of the Mozilla project and would like to be part of its community.
[1] iChatLE. http://sourceforge.net/projects/ichatle/.
[2] Clairv. http://clairv.sourceforge.net/.
[3] Andrew Tridgell and Paul Mackerras. The rsync algorithm. Technical Report TR-CS-96-05, Dept. of Computer Science, Australian National University, June, 1996.
[4] R. Fielding et al. RFC 2616 - Hypertext Transfer Protocol -- HTTP/1.1. http://tools.ietf.org/html/rfc2616.
[5] Antony Rowstron and Peter Druschel. Pastry: Scalable, Decentralized Object Location, and Routing for Large-scale Peer-to-Peer Systems. In Middleware '01: Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg, pp 329-350, 2001.
[6] What is HTTP Delta Encoding? http://www.webreference.com/internet/software/servers/http/deltaencoding/intro/.
[7] crcsync. http://ccan.ozlabs.org/repo/?cmd=inventory;path=ccan/crcsync.
[8] J. Mogul et al. RFC 3229 - Delta encoding in HTTP (proposed standard). http://tools.ietf.org/html/rfc3229.
[9] The ModifyHeaders extension. http://modifyheaders.mozdev.org/.
[10] The CacheViewer extension. https://addons.mozilla.org/en-US/firefox/addon/2489/.