Raindrop/RaindropDocumentModel

From MozillaWiki

Introduction

Raindrop works with a number of conceptual entities, such as a 'message', 'identity', 'extension', 'account', etc. The set of entities is informal and can be extended at any time. These items are known as 'raindrop content'. For example, a single conceptual 'message', or a single 'identity' is a single item of 'raindrop content', regardless of how that content is expressed. Every item of raindrop content has a 'raindrop key' - this key is an object that uniquely identifies the content item. How this key is chosen depends on the 'provider' of the content item, but once created the raindrop key can be treated as an opaque blob - see below for more details.

A 'schema' is a definition of a set of fields and their meanings for a particular item of raindrop content. A schema could almost be seen as an 'interface for data' - it is a 'contract' between the producer and consumer of data about what that data means. A 'schema instance' is a particular set of fields conforming to a schema which describe a particular item of raindrop content. Schema instances are generally provided by raindrop extensions. For example, we may define a schema rd.msg.body which defines the body and envelope data of an abstract message, and a schema called rd.tags which defines tags for any item. Raindrop extensions will be able to create instances of these schemas for different items of content (eg, to define the tags for a specific message or identity). A schema instance must be thought of as a 'set of fields' rather than as any particular storage mechanism - in particular, there isn't a 1:1 relationship between schemas and couchdb documents. Conceptually, each item has at most one instance of a particular schema. If multiple extensions provide the same schema for the same item, these schema instances are "aggregated" such that a single conceptual schema instance exists. This mechanism allows multiple extensions to contribute to the final result (eg, by overriding a single field).

A single item of content is defined as the set of all schemas provided for a single 'raindrop key'. As a result, an item of raindrop content does not have the concept of a 'type', only the concept of what schemas it makes available. In other words, an item of raindrop content is not a 'message' or an 'identity' - it is just that some raindrop content items provide 'message' related schemas while others provide 'identity' related ones. There is nothing in the model to prevent a single item of raindrop content providing both identity and message related schemas, although that probably doesn't make much sense. Some items of raindrop content, such as account definitions or extensions, will provide neither message nor identity related schemas.

The intention is that our document model and API abstract away the storage of schemas - conceptually, all parts of raindrop deal only with reading and writing particular schema instances for particular 'raindrop keys' rather than with couch documents. While this abstraction makes sense for a number of reasons, the primary reason is that we want to reserve the right to reconsider our storage model based on our experiences as the size of the data grows (and indeed we already have - the storage and aggregation of specific schema instances has changed since the first iteration).

Multiple schema instances, confidences and aggregation

A key requirement for raindrop is extensibility, and schemas provide the key mechanism to achieve this. One canonical example used by the raindrop team is the ability to override the 'from' field for an item. For example, when an email message arrives, its 'from' is recorded as the email address which sent the message. We would like to enable an extension to notice that this address is actually a special address (eg, an auto-generated mail on behalf of a real person) and change the 'from' value which is shown by the front-end application. In our functional model, this extension is *not* free to simply modify the 'from' field which was initially written - instead some other mechanism to override that value must be provided.

We have chosen the model summarized below:

  • Multiple extensions can write the same schema instance for an item. For example, the "core" extension can write a complete rd.msg.body, while another extension can write a new rd.msg.body schema instance for the item, but provide only the 'from' field.
  • Raindrop will detect the case of multiple schema instances for an item and "aggregate" them together. As many extensions may want to "compete" for a field, each extension itself has a "confidence level". All new extensions will have a default confidence, and the user will be able to adjust the confidence values to reflect how that user would prefer the items to be handled. For example, if 3 extensions all attempt to override the 'from' field, the user has the ability to adjust the relative priority of them.
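
The two points above can be illustrated with a minimal sketch. It assumes that aggregation is a simple field-by-field override in which the extension with the higher confidence wins; the function name, the extension IDs and the numeric confidence values are all made up for the example, and the real back-end may well behave differently.

```python
from typing import Any, Dict

# Hypothetical default: the text only says new extensions get "a default confidence".
DEFAULT_CONFIDENCE = 0


def aggregate(instances: Dict[str, Dict[str, Any]],
              confidences: Dict[str, int]) -> Dict[str, Any]:
    """Merge per-extension schema instances into one effective set of fields.

    Instances are applied in ascending confidence order, so a field written
    by a higher-confidence extension overrides the same field written by a
    lower-confidence one. Fields written by only one extension pass through.
    """
    ordered = sorted(
        instances,
        key=lambda ext_id: confidences.get(ext_id, DEFAULT_CONFIDENCE),
    )
    effective: Dict[str, Any] = {}
    for ext_id in ordered:
        effective.update(instances[ext_id])
    return effective


# Two instances of rd.msg.body for the same item (extension IDs are invented).
instances = {
    "ext.core": {"from": "noreply@example.com", "subject": "Hi"},
    "ext.from-fixer": {"from": "alice@example.com"},
}
confidences = {"ext.core": 1, "ext.from-fixer": 5}
effective = aggregate(instances, confidences)
```

Here the 'from' written by the higher-confidence extension replaces the original value, while 'subject' is kept from the complete instance. Swapping the two confidence values would restore the original 'from', which is the kind of adjustment the user is expected to be able to make.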

It is worth noting that competing extensions should be rare. While there may be many extensions which want to override the "from" field (eg, a facebook detector, a twitter detector), it should be rare that more than one wants to handle a single item of content.

It is expected that in the future a schema definition may need to declare its "aggregation policy", so different result sets can be formed from the inputs. An example here might be a "tags" schema - this schema may have a single field also called tags, but each extension wants to *contribute* to the final set of tags rather than replace previous ones. A further complication with this aggregation model is that the model needs to handle 'delete' as well as 'add' operations. For example, consider a twitter message with a #tag. An extension may emit a schema with this tag; however, if the user then wants to *remove* that tag from the message, our model must allow for that removal without modifying the original twitter message's schema. In other words, the schema emitted by the front-end in response to a user's tag operation must allow for tags both to be contributed to the final set and to be removed from it.
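
As a rough illustration of what such a "contribute and remove" policy could look like, the sketch below unions the tags from every instance and then subtracts any explicit removals. No such policy is defined yet, so everything here is an assumption: the 'removed_tags' field name is invented, as is the rule that a removal always wins regardless of which extension added the tag.

```python
from typing import Dict, Iterable, List, Set


def aggregate_tags(contributions: Iterable[Dict[str, List[str]]]) -> Set[str]:
    """Combine tag contributions from several schema instances.

    Each contribution may add tags (the 'tags' field) and may ask for tags
    to be removed (the hypothetical 'removed_tags' field). Removals are
    applied after all additions, so the original instances stay untouched.
    """
    added: Set[str] = set()
    removed: Set[str] = set()
    for contribution in contributions:
        added.update(contribution.get("tags", []))
        removed.update(contribution.get("removed_tags", []))
    return added - removed


# The twitter extension found '#raindrop'; the user later removed that tag
# and added 'todo' through the front-end.
from_twitter = {"tags": ["raindrop"]}
from_front_end = {"tags": ["todo"], "removed_tags": ["raindrop"]}
final_tags = aggregate_tags([from_twitter, from_front_end])
```

The instance emitted for the twitter message is never modified; the removal lives entirely in the instance written by the front-end.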

Storage of schemas

All schemas of the same type for the same item of raindrop content are stored in a single document. If multiple instances of the same schema for the same item are written, they are managed in a single document - however, the common case is that only a single schema instance for each item exists - so most documents store exactly 1 schema instance.

The end result is that many couch documents combine to describe a single item of content, with each document holding a single, possibly aggregated, schema instance. Each couch document holds the following fields:

  • rd_key - the 'raindrop key' for the item. This is a normal json object.
  • rd_schema_id - a string holding the ID of the schema.
  • rd_schema_instances - An "object" (ie, a dictionary) holding information about the specific schemas emitted by individual extensions. The key into this object is the extension ID. If there is more than one schema provided, the schema instances are stored in this object, but if only 1 exists (the common case), the fields are stored in the top-level document instead.

The final set of "aggregated" fields for the schema are all stored directly in the couch document; this is the primary reason all special raindrop attributes begin with 'rd_'. If only one extension provided a schema for this object, these top-level fields correspond exactly to what that extension provided (ie, no aggregation is necessary), but if multiple extensions provided schemas then this top-level set of fields corresponds to the final aggregated values (the individual values are in rd_schema_instances). The end result of this means that the top-level fields in the document always hold the single "effective" schema resulting from the aggregation (or the single schema itself in the case when no aggregation is necessary.) The megaview also works in terms of these top-level fields, so querying the megaview will also result in seeing the final aggregated result set.

The _id for the couch document is built from the rd_key and rd_schema_id fields, while the extension ID must be unique in the rd_schema_instances object. As a result each extension can only emit a single schema instance for an item of content.
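
The layout described above can be sketched as follows. This is only an illustration of the two cases (one instance versus several); the helper names are made up, and in particular the way the _id is encoded here is a placeholder, since the text says only that it is built from rd_key and rd_schema_id.

```python
import json
from typing import Any, Dict, List, Optional


def make_doc_id(rd_key: List[Any], schema_id: str) -> str:
    # Placeholder encoding, NOT the real raindrop format.
    return json.dumps(rd_key, separators=(",", ":")) + "!" + schema_id


def build_doc(rd_key: List[Any], schema_id: str,
              instances: Dict[str, Dict[str, Any]],
              effective: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Lay out one couch document for one schema of one content item.

    'instances' maps extension ID -> fields emitted by that extension.
    'effective' is the already-aggregated field set; it is required only
    when more than one extension provided an instance.
    """
    doc: Dict[str, Any] = {
        "_id": make_doc_id(rd_key, schema_id),
        "rd_key": rd_key,
        "rd_schema_id": schema_id,
    }
    if len(instances) == 1:
        # Common case: the fields live only at the top level.
        (ext_id, fields), = instances.items()
        doc["rd_schema_instances"] = {ext_id: {"schema": None}}
        doc.update(fields)
    else:
        if effective is None:
            raise ValueError("aggregated fields are required for multiple instances")
        doc["rd_schema_instances"] = {
            ext_id: {"schema": fields} for ext_id, fields in instances.items()
        }
        doc.update(effective)
    return doc


key = ["email", "abc@example.com"]
single = build_doc(key, "rd.msg.body", {"ext.core": {"from": "a@example.com"}})
multi = build_doc(
    key, "rd.msg.body",
    {"ext.core": {"from": "a@example.com"}, "ext.fixer": {"from": "b@example.com"}},
    effective={"from": "b@example.com"},
)
```

Both calls produce the same _id, which is why all instances of one schema for one item end up in the same document.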

Some other fields are maintained semi-automatically by the raindrop back-end. These almost never need to be provided by extensions, but are listed for completeness.

  • rd_megaview_expandable - a list of field names that are able to be 'expanded' by the megaview. In general this will list fields which are arrays of simple values, such as 'tags' and 'contacts'. The value of this field is part of the schema definition and will automatically be set by the framework. Currently the field and schema names are hard-coded, so Python code needs to change when a new schema is introduced with this requirement, but ultimately we expect the schema definitions themselves, including this type of metadata about the schema, validation functions etc, to live in couch docs and thus be truly dynamic. See Raindrop/Megaview for more information.
  • rd_megaview_no_aggr - a special case which indicates the schema ID doesn't want any aggregation; the megaview always emits each extension's instance rather than the top-level aggregated fields.
  • rd_schema_provider - names the extension which was the "provider" for this schema. A "provider" is expected to provide the entire schema, whereas a schema "extender" is expected to override only a select few fields. A schema isn't considered 'complete' until a 'provider' has been run (ie, until this field has a value); only then can we expect a full complement of fields and begin processing the schema.

The rd_schema_instances object is a dict keyed by the extension ID, and the value contains a number of fields which relate directly to that extension:

  • rd_source - the "source document" and revision that caused this schema instance to be created. Following the rd_source attribute backwards through items will allow you to create a DAG of extension points that processed this message.
  • schema - the actual schema items emitted by this extension. If this field is null, it implies that there is only a single extension's fields (ie, rd_schema_instances.length should be 1), so the field values are taken from the document itself.
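
A reader of such a document has to handle the null 'schema' case described above. The helper below is a minimal sketch of that lookup; the function name is made up, and it assumes that every top-level field not starting with 'rd_' (raindrop attributes) or '_' (couch's own fields such as _id and _rev) belongs to the schema.

```python
from typing import Any, Dict


def fields_from_extension(doc: Dict[str, Any], ext_id: str) -> Dict[str, Any]:
    """Return the fields a specific extension emitted for this document."""
    info = doc["rd_schema_instances"][ext_id]
    schema = info.get("schema")
    if schema is not None:
        # Several extensions contributed: each one's fields are stored
        # under its own entry.
        return dict(schema)
    # Only one extension contributed: its fields are the top-level fields.
    return {
        name: value
        for name, value in doc.items()
        if not name.startswith("rd_") and not name.startswith("_")
    }


doc = {
    "_id": "placeholder-id",
    "rd_key": ["email", "abc@example.com"],
    "rd_schema_id": "rd.msg.body",
    "rd_schema_instances": {"ext.core": {"rd_source": None, "schema": None}},
    "from": "a@example.com",
    "subject": "Hi",
}
core_fields = fields_from_extension(doc, "ext.core")
```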

Backend extensions

Extensions themselves are also managed via this schema mechanism. A special rd.core.ext schema exists to describe extensions and includes a field for their source-code. Thus, new extensions can be created simply by creating a new schema item in the database with all the relevant fields. This schema also includes the 'source schema IDs' they are interested in (ie, the ID of the schemas they take as input). The back-end will create a separate 'work queue' for each extension, and as documents are modified in the database, the back-end will call all the extensions which have declared an interest in the schema.
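
The dispatch described above can be pictured with a small in-memory sketch. It is not the real back-end: the Extension class, its field names and the pairing of the 'conversation' extension with rd.msg.body as its input are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, Dict, List

Doc = Dict[str, Any]


@dataclass
class Extension:
    ext_id: str
    source_schemas: List[str]      # schema IDs this extension takes as input
    handler: Callable[[Doc], Any]  # stands in for the extension's source code


def make_queues(extensions: List[Extension]) -> Dict[str, Deque[Doc]]:
    """Create one work queue per extension."""
    return {ext.ext_id: deque() for ext in extensions}


def on_document_changed(doc: Doc, extensions: List[Extension],
                        queues: Dict[str, Deque[Doc]]) -> None:
    """Queue a modified document for every extension interested in its schema."""
    for ext in extensions:
        if doc["rd_schema_id"] in ext.source_schemas:
            queues[ext.ext_id].append(doc)


def drain(ext: Extension, queue: Deque[Doc]) -> int:
    """Run the extension over its queue; return how many documents it saw."""
    processed = 0
    while queue:
        ext.handler(queue.popleft())
        processed += 1
    return processed


seen: List[Doc] = []
extensions = [
    Extension("conversation", ["rd.msg.body"], seen.append),
    Extension("tagger", ["rd.tags"], lambda doc: None),
]
queues = make_queues(extensions)
changed = {"rd_key": ["email", "abc@example.com"], "rd_schema_id": "rd.msg.body"}
on_document_changed(changed, extensions, queues)
processed = drain(extensions[0], queues["conversation"])
```

Only the extension that declared an interest in rd.msg.body receives the document; the other queue stays empty.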

The extensions themselves will generally create new schema instances. In most cases they will emit new schema instances for the same item of raindrop content - for example, the 'conversation' extension point will generally emit a new rd.msg.conversation schema for the same message (ie, a new schema instance is written for an existing item of raindrop content). However, these extensions are free to create anything they like - the conversation extension may also emit rd.msg.conversation schemas for new raindrop content items (eg, for referenced emails which don't exist on the mail server), thereby springing an entirely new item of raindrop content into life.

As a result of this model there is no formal DAG for items of raindrop content - extensions declare the schema IDs they depend on, but not the schema IDs they create. However, for any single item of raindrop content a DAG could be deduced. So while a DAG must exist (or the queues would never finish), it can't be predicted ahead of time. The 'viz-raindrop.py' script is capable of analyzing a database and creating a pretty-picture of the DAG it found.

Raindrop keys

While the 'raindrop key' for an item is generally opaque, it must remain somewhat transparent for raw content providers. These content providers will need to informally agree on the key format for certain items to help identify the same item of content from different providers - for example, an email message will have a key based on the message-id header, and if all providers of emails use the same key format, we will be able to detect and handle the same email message coming from different providers.

For another example of where keys are not quite opaque, consider the 'conversations' extension point - it has the requirement of stitching a conversation thread together from email messages, but some referenced messages may not exist on the server. This extension point does know that the 'raindrop key' for an email message is of the form ['email', message-id-header], so it is capable of generating rd.msg.conversation schema records for these missing messages. The extension is therefore springing new items of raindrop content into life - these are items without a rd.msg.body schema, but in all other ways are now bona-fide 'messages'. Should this message later appear on the IMAP server a rd.msg.body schema would be created for the content item at that point.
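
A sketch of how an extension might use this knowledge of the key format is shown below. The key shape ['email', message-id-header] comes from the text; the function names and the 'conversation_id' field are assumptions, since the fields of rd.msg.conversation are not described here.

```python
from typing import Any, Dict, Iterable, List


def email_rd_key(message_id: str) -> List[str]:
    """Build the raindrop key for an email from its message-id header."""
    return ["email", message_id]


def records_for_missing_messages(referenced_ids: Iterable[str],
                                 existing_keys: Iterable[List[str]],
                                 conversation_id: str) -> List[Dict[str, Any]]:
    """Create rd.msg.conversation records for referenced but unseen emails."""
    # JSON-style list keys are not hashable, so compare them as tuples.
    existing = {tuple(key) for key in existing_keys}
    records = []
    for message_id in referenced_ids:
        key = email_rd_key(message_id)
        if tuple(key) not in existing:
            records.append({
                "rd_key": key,
                "rd_schema_id": "rd.msg.conversation",
                "conversation_id": conversation_id,  # hypothetical field
            })
    return records


known = [email_rd_key("reply@example.com")]
new_records = records_for_missing_messages(
    ["original@example.com", "reply@example.com"], known, "conv-1"
)
```

Only the referenced message that is not yet known gets a record, and because every email provider would build the same key, a later rd.msg.body for that message attaches to the same content item.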

Views

The model described above allows for a single well-crafted view to slice and dice this data in various forms.

The megaview emits the schema id, field name and field value as the key for all content items. Thus, you can determine all content items that hold a specific value for a specific field in a specific schema. This view has a simple reduce function meaning that by using an appropriate group_level, you can determine which items define the field regardless of value, etc. Note that fields are always emitted from the top-level document, so the megaview always reflects the final aggregated schema instances should multiple exist.
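
To make the key structure concrete, the sketch below computes the keys that a megaview-style map function might emit for one document. It is written in Python for illustration only and is not the megaview source: the function name is made up, the emitted values are omitted because they are not described here, and the handling of rd_megaview_expandable is an assumption based on its description in the storage section.

```python
from typing import Any, Dict, List


def megaview_keys(doc: Dict[str, Any]) -> List[List[Any]]:
    """Return [schema_id, field_name, field_value] keys for one document."""
    schema_id = doc["rd_schema_id"]
    expandable = set(doc.get("rd_megaview_expandable", []))
    keys: List[List[Any]] = []
    for name, value in doc.items():
        # Skip raindrop attributes and couch's own fields (_id, _rev).
        if name.startswith("rd_") or name.startswith("_"):
            continue
        if name in expandable and isinstance(value, list):
            # Expandable fields get one key per element, so each tag
            # (for example) can be looked up individually.
            for element in value:
                keys.append([schema_id, name, element])
        else:
            keys.append([schema_id, name, value])
    return keys


doc = {
    "_id": "placeholder-id",
    "rd_key": ["email", "abc@example.com"],
    "rd_schema_id": "rd.tags",
    "rd_megaview_expandable": ["tags"],
    "tags": ["work", "todo"],
}
keys = megaview_keys(doc)
```

With keys of this shape, grouping at level 2 collapses all values of a field, which is how a query can find the items that define a field regardless of its value.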

As an added bonus, the megaview also emits a pseudo-schema called 'rd/core/content'. This schema emits the values of the various 'rd_*' fields, meaning this view is also suitable to use as a meta-view. For example, to find all documents which emit a particular schema, you can query for the key ['rd/core/content', 'schema_id', schema_name], or to query all items with a specific rd_key value (ie, all known schemas for a single raindrop content item), you could query for ['rd/core/content', 'key', rd_key_value]. Some of these values are concatenated, allowing you to query, for example, for all items of a given schema with a specific key - see the megaview source for details.
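
The two example keys above could be turned into couch view query strings along the following lines. Couch expects the 'key' parameter of a view query to be JSON-encoded, and 'reduce=false' asks for the individual rows rather than the reduced count; the helper names are made up, and the database, design document and view names are deliberately left out because they are not given here.

```python
import json
from typing import Any, List
from urllib.parse import parse_qs, urlencode


def content_key(field: str, value: Any) -> List[Any]:
    """Build a megaview key for the 'rd/core/content' pseudo-schema."""
    return ["rd/core/content", field, value]


def view_query(key: List[Any], reduce: bool = False) -> str:
    """Encode a view key as a couch query string."""
    return urlencode({"key": json.dumps(key), "reduce": json.dumps(reduce)})


# All documents which emit the rd.msg.body schema.
by_schema = view_query(content_key("schema_id", "rd.msg.body"))

# All known schemas for a single item of raindrop content.
by_item = view_query(content_key("key", ["email", "abc@example.com"]))

# Decoding the query string recovers the original key.
decoded = json.loads(parse_qs(by_item)["key"][0])
```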

The key advantage of this model is that there is no requirement to add custom views for each extension point, which is important given couchdb's view architecture; rebuilding a view index from scratch over a large database can take a very significant amount of time and resources.

The key downsides to this model are the complexity involved in learning to use the megaview, the huge amount of disk space the view consumes, and the lack of flexibility for queries which cannot be expressed by the megaview.