Phishing Protection: Server Spec
Phishing Protection Server Specification
Niels Provos
This document contains the information necessary to implement a server for the Phishing Protection protocol initially developed by Google for their Safe Browsing extension. The protocol does not use binary data; it uses human-readable text only. An implementation is relatively straightforward.
The protocol consists of four different HTTP requests:
- /getkey: This request requires SSL support. It provides the client with a shared secret key for confidential communication with the server.
- /lookup: Asks the server to render a verdict on the provided target URL. This request should be encrypted with the secret key from above.
- /update: The client provides a list of tables that it wants the server to update. The server either provides the full content of the current tables or incremental updates to bring the client's tables up to the current version.
- /report: Used to report when users visit phishing pages and whether they heeded the warning or ignored it.
Data Encoding Standards
Encrypted data or key material needs to be encoded so that it can be transmitted as part of an HTTP GET request. We use Websafe Base64 encoding with the following table:
/**
 * Our websafe alphabet. Value 64 (=) is special; it means "nothing."
 */
ENCODED_VALS_WEBSAFE =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
    "abcdefghijklmnopqrstuvwxyz" +
    "0123456789-_=";
The encoding works just like regular Base64 encoding. Each 6-bit group of data gets mapped into a websafe encoded character and vice versa.
The = character is used for padding.
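The alphabet above matches the URL-safe alphabet in Python's standard base64 module, so the encoding can be sketched with it. The function names are illustrative, and tolerating missing padding on decode is an assumption rather than something this specification requires.

```python
import base64


def websafe_b64encode(data: bytes) -> str:
    """Encode bytes with the websafe alphabet ('-' and '_'), keeping '=' padding."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def websafe_b64decode(text: str) -> bytes:
    """Decode websafe Base64 text; missing '=' padding is added back first."""
    padded = text + "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(padded)
```

For instance, websafe_b64decode applied to the 24-character sample client key shown in the GetKey section yields 16 bytes.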
Key Value Pairs
All server responses are in the form of key value pairs. They need to follow this format:
key:<value length>:value
If a key is unknown to the client, it must ignore it. Most clients should simply ignore the length; it's a historical artifact and not necessary in most circumstances. A client should skip lines whose names it does not understand, and should be tolerant of blank lines. However, a server should still generate a correct length field even if most clients are just going to ignore it.
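A minimal sketch of writing and reading this format follows. The function names are hypothetical, and the length is counted in characters on the assumption that values are plain ASCII.

```python
def format_pairs(pairs):
    """Serialize (key, value) pairs as 'key:<value length>:value' lines."""
    return "".join(f"{key}:{len(value)}:{value}\n" for key, value in pairs)


def parse_pairs(body, known_keys):
    """Parse a response body, skipping blank lines and unknown keys.

    The length field is ignored, as the text above recommends for clients.
    """
    result = {}
    for line in body.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split(":", 2)  # the value itself may contain ':'
        if len(fields) != 3:
            continue  # not a key value pair; skip it
        key, _length, value = fields
        if key in known_keys:
            result[key] = value
    return result
```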
GetKey Request
The GetKey request is used when the browser starts up to create a shared secret key between the client and the server. The secret key is used to encrypt lookup requests and also for authenticating table updates. The protocol does not enforce the use of a secret key, but it is strongly encouraged. To be secure, the client makes the GetKey request over SSL.
A server needs to respond with the following data:
clientkey:24:pOAblTUiZFkLSv3xRiXKKQ==
wrappedkey:24:MTqdJvrixHRGAyfebvaQWYda
The client key is a 16-byte long random nonce generated by the server when receiving the GetKey request. The wrapped key is the random nonce encrypted by a server key. The wrappedkey is opaque to the client and a server may implement any encryption algorithm it sees fit. The wrappedkey allows the server to reconstruct the client key without requiring per-client state. It is up to the server to include verification information into the wrapped key that might allow it to determine if decrypting it was successful.
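Because the wrapped key is opaque to the client, the wrapping scheme is a server-side choice. The sketch below shows one hypothetical stateless scheme built only from the Python standard library: an HMAC-derived pad hides the client key and a truncated HMAC tag serves as the verification information. SERVER_KEY, the function names, and the layout of the wrapped key are all assumptions; this is not the scheme used by any particular deployment, and its output is longer than the 24-character sample above.

```python
import base64
import hashlib
import hmac
import os

SERVER_KEY = os.urandom(32)  # hypothetical long-lived server secret


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _pad(server_key, iv):
    return hmac.new(server_key, b"enc" + iv, hashlib.sha256).digest()[:16]


def _tag(server_key, body):
    return hmac.new(server_key, b"mac" + body, hashlib.sha256).digest()[:8]


def wrap_key(client_key, server_key=SERVER_KEY):
    """Return a websafe Base64 wrapped key: iv (8) | masked key (16) | tag (8)."""
    iv = os.urandom(8)
    body = iv + _xor(client_key, _pad(server_key, iv))
    return base64.urlsafe_b64encode(body + _tag(server_key, body)).decode("ascii")


def unwrap_key(wrapped, server_key=SERVER_KEY):
    """Recover the client key, or return None if verification fails."""
    try:
        raw = base64.urlsafe_b64decode(wrapped + "=" * (-len(wrapped) % 4))
    except ValueError:
        return None
    if len(raw) != 32:
        return None
    body, tag = raw[:24], raw[24:]
    if not hmac.compare_digest(tag, _tag(server_key, body)):
        return None  # wrong server key or tampered data
    return _xor(body[8:], _pad(server_key, body[:8]))


def getkey_response(server_key=SERVER_KEY):
    """Build the GetKey response body with a fresh 16-byte client key."""
    client_key = os.urandom(16)
    encoded = base64.urlsafe_b64encode(client_key).decode("ascii")
    wrapped = wrap_key(client_key, server_key)
    return f"clientkey:{len(encoded)}:{encoded}\nwrappedkey:{len(wrapped)}:{wrapped}\n"
```

When unwrap_key returns None, the server would typically answer with pleaserekey:1:1 so that the client requests a new key.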
If the server key changes, the server can prepend
pleaserekey:1:1
to responses for lookup or update requests to tell the client to request a new client key.
Lookup Requests
Sample format of an encrypted lookup request:
GET /safebrowsing/lookup?sourceid=firefox-antiphish&features=TrustRank&client=navclient-auto-tbff&encver=1&nonce=-151363793&wrkey=MTpuN9U8dJcjmXboTmcm8edD&encparams=8R_ixk0eg40SLt0sN0rPcx4ahPttnQUkFjrn_cq53I3d&
Sample format of an unencrypted lookup request:
GET /safebrowsing/lookup?sourceid=firefox-antiphish&features=TrustRank&client=navclient-auto-tbff&q=http%3A//www.honeyd.org
The output from the server is either empty or contains:
phishy:1:1
If the request was unencrypted or contained a bad wrapped key, a server may also output
pleaserekey:1:1
to instruct the client to issue a new GetKey request.
To decrypt an encrypted lookup request, a server needs to inspect the following query parameters:
- encver: The algorithm version used by the client to construct the encrypted URL. The only valid version is currently 1.
- nonce: A random nonce selected by the client to seed the encryption algorithm. The server needs to convert it into an unsigned 32-bit integer.
- wrkey: The wrappedkey provided by the server in response to a GetKey request.
- encparams: The encrypted content of the lookup request.
The description of the current encryption algorithm can be found below. Decryption results in an unencrypted lookup request; see above.
If the URL reported in the q query parameter belongs to a phishing site, return:
phishy:1:1
If you want the client to update their wrapped key, instruct it to do so by including pleaserekey:1:1 in the output.
The client query parameter tells you which version of the Anti-Phishing extension is talking to your servers.
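As a rough sketch, handling an unencrypted lookup amounts to reading the q parameter and emitting the matching key value pairs. The blacklist set, the rekey flag, and the function name are hypothetical; a real server would consult its tables instead of an exact-match set.

```python
from urllib.parse import parse_qs, urlsplit


def handle_lookup(request_target, blacklist, rekey=False):
    """Return the response body for an unencrypted lookup request."""
    query = parse_qs(urlsplit(request_target).query)
    url = query.get("q", [""])[0]
    lines = []
    if rekey:
        lines.append("pleaserekey:1:1")  # prepended, asks for a new GetKey
    if url in blacklist:
        lines.append("phishy:1:1")
    return "".join(line + "\n" for line in lines)  # empty body if nothing to say
```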
Decryption Algorithm Version 1:
The wrappedkey and the encparams are Websafe Base64 encoded and need to be decoded into binary first.
The server derives a decryption key from the client key and nonce in the following fashion:
- Create MD5 context ctx.
- Update the MD5 context with the 16-byte client key.
- Convert the nonce into a 32-bit unsigned integer; represent the integer in network order (most-significant byte first) and update the MD5 context with it.
- Use the 128-bit MD5 digest from this context as decryption key.
The algorithm for encrypting lookups is RC4, and the same algorithm is used to decrypt the encparams. Remove all the parameters relating to encryption from the URL and add the decrypted content as new query parameters. As a result, you should have a request that looks like an unencrypted lookup.
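The key derivation and decryption steps above can be sketched as follows. The client key is assumed to have been recovered from wrkey already, which is server-specific. Mapping a negative nonce to its unsigned 32-bit value by masking is also an assumption, as are the function names.

```python
import base64
import hashlib
import struct


def rc4(key, data):
    """Plain RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # keystream generation
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def derive_key(client_key, nonce):
    """MD5 over the 16-byte client key followed by the nonce in network order."""
    ctx = hashlib.md5()
    ctx.update(client_key)
    ctx.update(struct.pack(">I", int(nonce) & 0xFFFFFFFF))
    return ctx.digest()


def decrypt_encparams(client_key, nonce, encparams):
    """Decode the websafe Base64 encparams and decrypt them with RC4."""
    ciphertext = base64.urlsafe_b64decode(encparams + "=" * (-len(encparams) % 4))
    return rc4(derive_key(client_key, nonce), ciphertext)
```

The returned bytes would then be appended to the remaining query string in place of the encver, nonce, wrkey and encparams parameters.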
Update Requests
The client can download and update various kinds of tables (lists) via the update request. Each table has a name with three components: provider-type-format. The provider is just a name used to identify where the list comes from. The type indicates whether the list is a whitelist or a blacklist. The format indicates how URLs should be looked up in the list; the list might contain domains, hosts, or URLs. For example:
- goog-black-url // A blacklist from Google; lookups should be by URL
- acme-white-domain // A whitelist of domains from Acme, Inc.; lookups by domain
Tables are versioned with a major and a minor number. The major version is currently 1 and describes the wire format (see below), that is, how the table is serialized. The minor number is the version of the list. When providers add new items to a list or take items out of it, they increment the minor version number.
The client keeps a list of tables it knows about, as well as the version it has of each. To request an update from a provider, the client issues an update request and expresses its tables and versions as a query parameter like: version=type:major:minor[,type:major:minor]*. For example:
http://www.example.com/phishing/update?client=foo&version=goog-black-url:1:432,acme-white-domain:1:32
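On the server side, the version parameter can be taken apart roughly as shown here. The function names are illustrative, and splitting from the right is a choice made so that hyphens in the table name do not matter.

```python
from urllib.parse import parse_qs, urlsplit


def parse_version_param(url):
    """Map each requested table name to its (major, minor) version."""
    query = parse_qs(urlsplit(url).query)
    tables = {}
    for entry in query.get("version", [""])[0].split(","):
        if not entry:
            continue
        name, major, minor = entry.rsplit(":", 2)
        tables[name] = (int(major), int(minor))
    return tables


def split_table_name(name):
    """Split 'provider-type-format' into its three components."""
    provider, list_type, list_format = name.split("-", 2)
    return provider, list_type, list_format
```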
The server responds with updates to all tables in the wire format. For each table, the response includes either a completely new table or a diff between the client's version of the table and the most current version, whichever is smaller. If the client provided a wrapped key, the server also needs to compute a Message Authentication Code for the response data, which the client uses to verify the integrity of the tables.
The Firefox client is currently aware of three different table formats:
- enchash: An encrypted hash table. The host name is hashed and used as encryption key. If a match can be found in a table, the value is decrypted into regular expressions that must match the URL for there to be a hit.
- url: The URL is looked up directly in the table.
- domain: The host name or domain is derived from the URL and used as key for a table lookup.
The different formats are discussed in more detail further below.
Wire Format
The serialized form of the tables is called the wire format. It's the format of an update response.
The wire format is a simple line-oriented protocol. It consists of a sequence of sections consisting of a header line like
[type major.minor [update]][[mac=<digest>]]
followed by lines of data comprising the table described by the header. If the "update" token appears in the header line, the data following constitute an update to the client's existing table. Else the data specify a full, new table. If the client provided a wrappedkey, the response must include the message authentication code. Here are a few possible first-line responses:
[goog-black-url 1.372 update]
[goog-black-url 1.372]
[goog-white-domain 1.10][mac=iA5vLUidpXAPwfcAH9+8OQ==]
[goog-white-domain 1.10 update][mac=iA5vLUidpXAPwfcAH9+8OQ==]
Data lines start with a + or -. A plus indicates an addition to the table and is followed by a tab-separated key/value pair. A minus means to remove a key from the table and is followed by the key itself.
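For testing a server, it can help to have a small reader for this format. The sketch below applies a response to an in-memory set of tables; the HEADER pattern and function name are assumptions derived from the header description above, and lines it does not recognize are skipped.

```python
import re

# [name major.minor], optional " update" token, optional [mac=<digest>]
HEADER = re.compile(r"^\[(\S+) (\d+)\.(\d+)( update)?\](?:\[mac=([^\]]+)\])?$")


def apply_response(body, tables):
    """Apply an update response to `tables` (name -> dict) in place.

    Returns a dict mapping each table name seen to its (major, minor) version.
    """
    versions = {}
    current = None
    for line in body.splitlines():
        match = HEADER.match(line)
        if match:
            name = match.group(1)
            is_update = match.group(4) is not None
            if not is_update or name not in tables:
                tables[name] = {}  # a full table replaces the old one
            current = tables[name]
            versions[name] = (int(match.group(2)), int(match.group(3)))
        elif current is not None and line.startswith("+"):
            key, _, value = line[1:].partition("\t")
            current[key] = value
        elif current is not None and line.startswith("-"):
            current.pop(line[1:], None)
    return versions
```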
An example update response is:
[goog-black-url 1.372 update]
+http://payments.g00gle.com/	1
+http://www.ovrture.com/givemeallyourmoney.htm	1
+http://www.microfist.com/foo?bar=x	1
-http://www.gewgul.com/index.html
-http://yah0o.com/login.shtml
...
[acme-white-domain 1.13]
+google.com	1
+slashdot.org	1
+amazon.co.uk	1
...
In this example, the client has some version of goog-black-url prior to 372 and the server is telling the client to bring itself up to version 372 by applying the adds and deletes that follow. The client has some version of acme-white-domain earlier than version 13, but the diff would be longer than the entire version 13 table, so it is sending a complete replacement.
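One way a server might choose between a diff and a full table is to serialize both and send the shorter one. The sketch below assumes the server still has the client's version of the table at hand (old); how old versions are stored is not covered by this specification, and the function names are hypothetical.

```python
def full_lines(table):
    """Data lines for a complete table."""
    return [f"+{key}\t{value}" for key, value in sorted(table.items())]


def diff_lines(old, new):
    """Adds for new or changed keys, then removals for dropped keys."""
    lines = [f"+{k}\t{v}" for k, v in sorted(new.items()) if old.get(k) != v]
    lines += [f"-{k}" for k in sorted(old) if k not in new]
    return lines


def table_section(name, major, minor, new, old=None):
    """Serialize one table section, as a diff only when that is smaller."""
    full = full_lines(new)
    if old is not None:
        diff = diff_lines(old, new)
        if len("\n".join(diff)) < len("\n".join(full)):
            return "\n".join([f"[{name} {major}.{minor} update]"] + diff) + "\n"
    return "\n".join([f"[{name} {major}.{minor}]"] + full) + "\n"
```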
The data lines are opaque to the wire format. They come in some format that the extension knows how to use, based on the table type. More complicated types of tables than just domain-, host-, and URL-lookup are possible. For example, a table could map hosts to regular expressions matching phishy pages on the host in question.
If the client provided a wrapped key via the wrkey query parameter, the server must compute a MAC over the update contents. See below for a description of the message authentication code.
Message Authentication Code for Table Updates
The client provides a wrapped key to the server, which the server needs to convert into the shared client key; see above. The MAC is computed from an MD5 digest over the following information: client_key|separator|table data|separator|client_key. The separator is the string :coolgoog: (a colon followed by "coolgoog" followed by a colon). The resulting 128-bit MD5 digest is websafe Base64 encoded and provided via [mac=<encoded digest>] on the first line of a table update; see above.
Here is an example that you can use to verify the MAC algorithm in your server:
client key: "dtmbEN1kgN/LmuEoYifaFw=="
A sample query to get table data protected by the client key looks like this:
www.example.com/phishing/update?version=test-white-domain:1:-1&wrkey=MTpPH3pnLDKihecOci%2B0W5dk
Below is a sample response including a correct MAC based on the keys provided to the server above.
[test-white-domain 1.1][mac=iA5vLUidpXAPwfcAH9+8OQ==]
+white1.com	1
+white2.com	1
+white3.com	1
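The MAC construction can be sketched as below. Two details are assumptions because the text does not pin them down: whether client_key means the raw 16 bytes or the Base64 string, and exactly which bytes count as table data (for instance, whether the trailing newline is included). The sketch has not been checked against the sample above, whose MAC also contains a '+' character that lies outside the websafe alphabet.

```python
import base64
import hashlib

SEPARATOR = b":coolgoog:"


def table_mac(client_key, table_data):
    """MD5 over client_key|separator|table data|separator|client_key.

    Both arguments are bytes; the digest is returned websafe Base64 encoded.
    """
    ctx = hashlib.md5()
    for part in (client_key, SEPARATOR, table_data, SEPARATOR, client_key):
        ctx.update(part)
    return base64.urlsafe_b64encode(ctx.digest()).decode("ascii")


def header_with_mac(name, major, minor, client_key, table_data):
    """Build a full-table header line carrying the MAC."""
    return f"[{name} {major}.{minor}][mac={table_mac(client_key, table_data)}]"
```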
Table Formats
The server needs to be aware of the way in which the Firefox client is going to use the supplied tables. Both the 'url' and the 'domain' formats are pretty straightforward. The 'enchash' table format is more complicated, and for that reason we describe it in detail below:
Encrypted Hash Format
The keys in the table are hashes of host names. The current database salt is defined as 'oU3q.72p'. To determine if a URL is in an enchash-formatted table, the following steps are necessary:
- Extract the hostname from the URL.
- Create a canonical hostname.
- Split the canonical hostname into an array of hostname components by splitting on the 'dot', e.g. "www.sub.acme.com" gets turned into the array [www sub acme com]
- For each sub-hostname, e.g. "www.sub.acme.com", "sub.acme.com", "acme.com", do the following:
  - Compute the MD5 hash of the database salt concatenated with the sub-hostname string and uppercase the result, for example, "www.sub.acme.com" results in "F64932D7EE9CEAEBA3DED3689C5A77CA" and "sub.acme.com" results in "819C6791744F94B7C425FA81B09B0751".
  - If the resulting key cannot be found in the table, continue with the next sub-hostname; otherwise continue with the data stored under that key.
  - Decode the data, which uses regular Base64 encoding.
  - The first 8 characters of the decoded data are used as random salt for the encryption key. The encryption/decryption key is constructed by computing the MD5 hash over 'database salt|random salt|hostname'. This results in a 128-bit key.
  - Use RC4 to decrypt the remaining data with the decryption key from above.
  - Split on '\t' (tab stop) to create an array of regular expressions.
  - Run each regular expression over the whole URL. If one of the regular expressions matches, the whole lookup evaluates as true.
Use the reverse of these steps to construct an encrypted hash table yourself.
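The lookup steps and their reverse can be sketched as follows. Several points are assumptions: the 'hostname' in the entry key is taken to be the matching sub-hostname, the parts of 'database salt|random salt|hostname' are simply concatenated, sub-hostnames stop at two components as in the example, and Python's re module stands in for the client's regular expression engine, which may differ in syntax. The host passed in is expected to be canonical already, and the function names are hypothetical.

```python
import base64
import hashlib
import os
import re

DATABASE_SALT = b"oU3q.72p"


def rc4(key, data):
    s, j = list(range(256)), 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out, i, j = bytearray(), 0, 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def sub_hostnames(host):
    """'www.sub.acme.com' -> itself, 'sub.acme.com', 'acme.com'."""
    parts = host.split(".")
    return [".".join(parts[i:]) for i in range(max(len(parts) - 1, 1))]


def table_key(sub_host):
    return hashlib.md5(DATABASE_SALT + sub_host.encode("ascii")).hexdigest().upper()


def entry_key(random_salt, sub_host):
    return hashlib.md5(DATABASE_SALT + random_salt + sub_host.encode("ascii")).digest()


def encode_entry(sub_host, patterns):
    """Reverse direction: build the Base64 value stored under table_key(sub_host)."""
    random_salt = os.urandom(8)
    plaintext = "\t".join(patterns).encode("utf-8")
    return base64.b64encode(random_salt + rc4(entry_key(random_salt, sub_host), plaintext)).decode("ascii")


def enchash_lookup(url, host, table):
    """Return True if any decrypted regular expression matches the URL."""
    for sub_host in sub_hostnames(host):
        value = table.get(table_key(sub_host))
        if value is None:
            continue
        raw = base64.b64decode(value)
        random_salt, ciphertext = raw[:8], raw[8:]
        patterns = rc4(entry_key(random_salt, sub_host), ciphertext).decode("utf-8").split("\t")
        if any(re.search(pattern, url) for pattern in patterns):
            return True
    return False
```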
Canonical Hostname Creation
Extract the hostname from the URL (if it's an international domain, we use the ASCII punycode representation) and then follow these steps:
- Remove all characters that match the following regular expressions:
- "[\x00-\x1f\x7f-\xff]+"
- "^\\.+|\\.+$"
- Replace consecutive dots with a single dot.
- If the hostname can be parsed as an IP address, it should be normalized to 4 dot-separated decimal values. The client should handle any legal IP address encoding, including octal, hex, and fewer than 4 components.
- Escape all characters that are not alphanumeric or '.' or '-'.
- Lowercase the whole string.
Then to get the hostname for the encrypted hash lookup, we also apply this rule:
- Strip all leading components so that the resulting hostname has at most 5 dots.
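A sketch of the hostname rules follows. It assumes the input is already ASCII (punycode applied), that "escape" means percent-encoding, and that IP parsing follows the classic inet_aton conventions (a leading 0x means hex, a leading 0 means octal, and the last component fills the remaining bytes). The function names are illustrative.

```python
import re

_NUMBER = re.compile(r"(0[xX][0-9a-fA-F]+)|(0[0-7]*)|([1-9][0-9]*)")


def _parse_number(text):
    match = _NUMBER.fullmatch(text)
    if match is None:
        return None
    if match.group(1):
        return int(text, 16)
    if match.group(2):
        return int(text, 8)
    return int(text, 10)


def normalize_ip(host):
    """Return dotted-decimal form if host parses as an IP address, else None."""
    parts = host.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    values = [_parse_number(part) for part in parts]
    if any(value is None for value in values):
        return None
    *head, tail = values
    if any(value > 255 for value in head) or tail >= 256 ** (4 - len(head)):
        return None
    number = tail
    for index, value in enumerate(head):
        number += value << (8 * (3 - index))
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def canonical_hostname(host):
    host = re.sub(r"[\x00-\x1f\x7f-\xff]+", "", host)  # drop control and high bytes
    host = re.sub(r"^\.+|\.+$", "", host)              # drop leading/trailing dots
    host = re.sub(r"\.{2,}", ".", host)                # collapse consecutive dots
    ip = normalize_ip(host)
    if ip is not None:
        host = ip
    host = re.sub(r"[^A-Za-z0-9.\-]",
                  lambda m: "%{:02X}".format(ord(m.group())), host)
    return host.lower()


def enchash_hostname(host):
    """Canonical hostname limited to its last six components (five dots)."""
    return ".".join(canonical_hostname(host).split(".")[-6:])
```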
To canonicalize the remainder of the URL:
- The sequences "/../" and "/./" in the path should be resolved, by replacing "/./" with "/", and removing "/../" along with the preceding path component.
- The fragment identifier ("#") and everything after it should be removed.
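The two rules for the remainder of the URL can be sketched with the standard urllib.parse module. The function name is hypothetical. The sketch only resolves "." and ".." segments that are followed by a slash, which is a literal reading of the rules above; a "/../" with no preceding component is simply dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url_remainder(url):
    """Resolve '/./' and '/../' in the path and drop the fragment."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    last = len(segments) - 1
    resolved = []
    for index, segment in enumerate(segments):
        if index < last and segment == ".":
            continue  # '/./' becomes '/'
        if index < last and segment == "..":
            if len(resolved) > 1:  # keep the empty root segment
                resolved.pop()     # remove the preceding path component
            continue
        resolved.append(segment)
    # An empty fragment makes urlunsplit omit the '#' part entirely.
    return urlunsplit((parts.scheme, parts.netloc, "/".join(resolved), parts.query, ""))
```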
Report Requests
In enhanced mode, a client informs the server about phishing pages it encounters. For instance, it might report that the user declined the warning on a blacklisted page, a cue to the provider that the page might be a false positive (or that the warning is ineffective). A sample report request:
http://www.example.com/phishing/report?client=foo&evts=phishdecline&evtd=http://somephishydomain.com/login.html
The client does not expect an answer to a report request.
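On the receiving side, a report can be reduced to its three query parameters before logging it. This is only a sketch; the function name and the returned dictionary layout are hypothetical, and only the evts value shown in the sample above is known from this document.

```python
from urllib.parse import parse_qs, urlsplit


def parse_report(request_target):
    """Extract client, event type (evts) and event data (evtd) from a report."""
    query = parse_qs(urlsplit(request_target).query)
    return {
        "client": query.get("client", [""])[0],
        "event": query.get("evts", [""])[0],
        "data": query.get("evtd", [""])[0],
    }
```

Since the client does not expect an answer, the handler can return an empty response body after recording the event.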