User:Andrew Sutherland/MailNews/GlobalDatabase

From MozillaWiki
Jump to: navigation, search

Review Tracking Review Points that are believed resolved.

Draft-template-image.png THIS PAGE IS A WORKING DRAFT Pencil-emoji U270F-gray.png
The page may be difficult to navigate, and some information on its subject might be incomplete and/or evolving rapidly.
If you have any questions or ideas, please add them as a new topic on the discussion page.

(Desired) Features

  • Given a Contact, easily find ((Bryan Clark)):
    • Previous messages from the person
    • Previous attachments from the person
    • (A count of) messages from this person (to me)
    • (A count of) messages to this person (from me)
    • People connected to this person (people they have been involved in threads with)
    • Previous conversations with this person
    • URL's involved in the messages
  • When reading a message, be able to:
    • Know what lightning tasks/events were spawned from this message
    • Take/associate (long) notes ((Ovidiu))

Schema

Givens (Assumptions)

We are using the existing SQLite as the database engine already supported by the code-base. CouchDB is getting close to usable (0.8 will have compaction and incremental reduction), but is too risky for the desired Thunderbord 3.0 release date.

The global database is not a message storage mechanism. It is a layer on top of the existing folder-centric message storage mechanism, and will rely on that fact. There are separate plans to make the message storage mechanism modular (for e-mail accounts), but that has nothing to do with the global database.

The global database is not a full-text search mechanism, at least not for now. Each of the major platforms has their own platform-wide search mechanism and we are already planning/working to integrate with those mechanisms.

A Normalized Start

Because we are building a layer on top of the folder-based storage, each message in the global database only needs to 1) store enough information to locate the message in folder-centric storage plus 2) whatever information we need about it to act on/filter/group by/etc. in the global database without touching the message store.

For the first case, we store the folder a message is in (using the URI, although it may turn out we should break this down further), and its message key. The folder URI is normalized into its own table for storage efficiency and less work on folder changes (renames, etc.). Because we expect the number of folders to be reasonable, we may in fact keep the contents of the table cached in memory at all times, avoiding the need for joins.

For the second case, we break the query-able aspects of the message up into attributes which are associated with their message (and the conversation it belongs to).

All messages belong to a conversation, which stores the originating subject of the thread. Messages do not store their subjects, although it might be appropriate to try and concisely indicate when a subject change occurs through use of a flag for a hash of the subject content (no 're:').

  • folderLocations TABLE
    • id INTEGER PRIMARY KEY
    • folderURI TEXT
  • no index on folderURI for now; the number of folders should be sane enough that a table scan won't be too bad.
  • conversations TABLE
    • id INTEGER PRIMARY KEY
    • subject TEXT
    • oldestMessageDate INTEGER although generally less useful than the newest message, track/cache the date of the oldest/earliest message seen for the conversation
    • newestMessageDate INTEGER for performance reasons, track/cache the date of the newest message seen for this conversation
  • subject index: (subject) indexed for threading by subject
  • newestMessageDate index: (newestMessageDate) indexed so we can order conversations from most-recent to oldest
  • oldestMessageDate index: (oldestMessageDate) indexed mainly for symmetry...
  • messages TABLE
    • id INTEGER PRIMARY KEY
    • folderID INTEGER NOT NULL REFERENCES folderLocations(id) together with messageKey identifies the message in the traditional folder-storage model
    • messageKey INTEGER NOT NULL
    • conversationID INTEGER NOT NULL REFERENCES conversations(id)
    • parentID INTEGER REFERENCES messages(id)
    • headerMessageID TEXT the Message-ID header from the message
    • bodySnippet TEXT clarkbw's little snippet of text to display
  • folderID index: (folderID)
  • headerMessageID index: (headerMessageID)
  • conversationID index: (conversationID)

Both because people/e-mail addresses are first class to us, and because it would be horrible for performance if we didn't, people/e-mail get normalized into their own table. However, we then have an interesting modeling difficulty. People, who we will call 'contacts' frequently has more than one e-mail address, which we will call 'identities'. The difficulty is that some times we want to deal with just the contact, ignoring the fact that they have multiple e-mail addresses, and other times we may want to only deal with mail from one specific e-mail address.

There are a few obvious alternatives. We could track both concepts throughout the database; anytime there is a to/from/cc, we mention both the contact and the identity. We can also only use one of the two. In the case of using the contact, we would have to post-process based on the textual e-mail address if we only care about one (or more) specific identities. In the case of using the identity, if we want to talk about the contact, we need to query for all of the identities associated with the contact.

The choice we have made is to only cite the identity, requiring us to issue queries for all of a contact's identities when we want to interact on a contact level. The rationale of not using the contact+identity option, based on previous prototyping experience, is that the extra permutations lead to ugliness.

The identity approach also potentially allows for tree-like hierarchies in the future. For example, it is possible that I have a contact who has multiple work identities (2 work e-mail addresses, 1 work IM account, 1 work IRC account), multiple home identities (1 home e-mail for spam, 1 home e-mail for good friends, 1 home IM account, 1 text message address (phone number)), etc. The hierarchy approach where we're just building a list of identities arguably scales better permutation-wise than potentially having multiple contacts to deal with too (for each non-leaf node). The down-side is, of course, that the number of low-level attribute queries scales linearly with the number of identities. This can be dealt with by structuring the identities so that identities are consecutively numbers so that by masking off some number of low-order bits (and potentially right-shifting) we get the contact id, but that requires some cleverness.

Suffice it to say, input is greatly appreciated on this area.

(Note: there are some notes on correspondence to the address book in the API setion, but in a nutshell, the idea is that we do want to integrate with the address book as much as possible).

  • contacts TABLE
    • id INTEGER PRIMARY KEY
    • directoryUUID TEXT
    • contactUUID TEXT
    • name TEXT
  • identities TABLE
    • contactID INTEGER NOT NULL REFERENCES contacts(id)
    • kind TEXT email address/instant mesenger (and type)/IRC/phone/etc.
    • value TEXT email address, instant messenger account, etc.
    • description TEXT
  • identityAttributes TABLE
    • (see messageAttributes)

See the examples section for more information on attributes and aggregate data.

  • attributeDefinitions TABLE
    • id INTEGER PRIMARY KEY
    • attributeSubject INTEGER message? identity? etc. (from enumeration)
    • attributeType TEXT will change into an integer...
    • pluginName TEXT
    • name TEXT
    • parameter BLOB
  • messageAttributes TABLE
    • conversationID INTEGER NOT NULL REFERENCES conversations(id)
    • messageID INTEGER NOT NULL REFERENCES messages(id)
    • attributeID INTEGER NOT NULL REFERENCES attributeDefinitions(id)
    • value NUMERIC
    •  ?? maybe date ??
  • attrib-query (covering) index: (attributeID, value, ...)
  • message-attrib-fetch (covering) index: (messageID, ...)
  • conversation-attrib-fetch (covering) index: (conversationID, ...)


  • aggregateDataDefinitions TABLE
    • id INTEGER PRIMARY KEY
    • pluginName TEXT
    • name TEXT
    • subjectType TEXT
    • parameter BLOB
  • aggregateDataCache TABLE
    • aggregateID INTEGER NOT NULL REFERENCES aggregateDataDefinitions(id)
    • subject INTEGER
    • value NUMERIC
    • data BLOB
  • lookup index: (aggregateID, subject, value)

Query Strategy

We optimize our attribute queries by generating indices for each of the potential usage patterns. We also cause the indices to be covering indices; that is, the index contains all of the information in the row, so that any query can be answered using only the data in the index. This avoids the need for the database engine to perform any lookup for the actual row or its data.

When performing attribute queries, we perform complex queries by INTERSECTing the multiple simple queries involved. This does result in SQLite having to create an 'ephemeral' intermediary table for each of the results, but it is unavoidable, and far better for compiled code to be doing that than our javascript code.

Examples

Fundamental Attributes

Fundamental attributes are inherent and defining properties of an e-mail message: who sent the message, who they sent it to, when they sent it, etc. More explicitly, they are the pieces of the header that Thunderbird would natively display and expose as column headers, excepting the subject (which is explicitly part of the message).

  • From/To/cc
    • Definitions
      • "fundamental", "built-in", "FROM", NULL
      • "fundamental", "built-in", "TO", NULL
      • "fundamental", "built-in", "CC", NULL
    • Usage
      • conv(1), msg(1), attr(FROM), identity(my-email@my-domain.tld)
      • conv(1), msg(1), attr(TO), identity(cheese@hat.com)
      • conv(1), msg(1), attr(TO), identity(row@bow.org)
      • conv(1), msg(1), attr(CC), identity(foo@bar.com)
  • Other
    • Definition
      • "built-in", "DATE", NULL


Optimization Attributes

Optimization attributes are attributes that can be derived from multiple fundamental attributes, but, owing to presumed frequency of use and execution cost, get to be their own special attributes.

The definitive example are messages sent to/from the user. At least in the case of e-mail accounts, we can reasonably expect that nearly all of the (non mailing-list) messages will involve the user in some way. This means that the number of "FROM":user/"TO":user attributes will be quite high. Additionally, many of our queries will involve searching for this attribute, because most of our functionality centers around the user and his/her relationship to others.

Because of the way our database schema (and query strategy) works, any query for messages possessing attribute A:1 and attribute B:2 effectively results in intersecting the set of all messages with attribute A:1 and the set of all messages with attribute B:2. Since the set of FROM:user/TO:user attributes will be relatively large, we can expect a nice performance benefit by adding a bounded amount of redundant data.

  • From-Me-*, *-Me-From
    • Definitions
      • "optimized", "built-in", "FROM-ME-TO", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "FROM-ME-CC", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "TO-ME-FROM", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "CC-ME-FROM", identity(my-email@my-domain.tld)
      • "optimized", "built-in", "FROM-ME-TO", identity(my-other-email@my-other-account.tld)
    • Usage
        • conv(1), msg(1), attr(FROM-ME-TO), identity(cheese@hat.com)
        • conv(1), msg(1), attr(FROM-ME-TO), identity(row@bow.org)
        • conv(1), msg(1), attr(FROM-ME-CC), identity(foo@bar.com)

Derived Attributes

Derived attributes are attributes populated by "data extractors" plugins/extensions from the contents of the message.

Simple Ones:

  • Bugzilla/bug-tracker helper plugin
    • Definitions
      • "derived", "buglamp", "BUG", "bugzilla.mozilla.org"
    • Usage
      • conv(99), msg(99), attr(BUG), 123456
  • A geolocation plugin that attempts to figure out where a message was sent from based on the IP address of the sender, explicit headers, or black magic.
    • Definitions
      • "derived", "geoplug", "LATITUDE", NULL
      • "derived", "geoplug", "LONGITUDE", NULL
    • Usage
      • conv(1), msg(1), attr(LATITUDE), 12.34
      • conv(1), msg(1), attr(LONGITUDE), 23.45
  • A purchase detector plugin
    • Definitions
      • "derived", "amazon-spree-tector", "I-SPENT", NULL
    • Usage
      • conv(99), msg(99), attr(I-SPENT), 40.20
  • Topic Analysis / Message Classifier plugin - Use analysis of the message content to assign the message to a topic.

Complex Ones (need their own table unless we change things up):

  • URL Detector
    • note: So, one thing we could do is use a (piece of an MD5) hash as the attribute value. This would technically be a win in that we could perform the message query without consulting the URL database. In theory we could discard/prune the URL database without losing the ability to attribute query an exact URL, but we'd lose so much other functionality it wouldn't be worth it.
    • note: A more feature-ful implementation might try and overload the attribute value so that the high-order bits actually encode the protocol and domain-name, leaving the low-order bits to refer to the full/specific URL. This would allow attribute queries to use ranges to locate links to a given domain.
    • note: We do not generate an attribute definition per URL (or domain) because the rate of growth of URLs (cardinality) is too high and additionally not under the control of the user.
    • Definitions
      • "derived", "built-in", "URL", NULL
    • Helper Table: builtin_urls
      • id INTEGER PRIMARY KEY
      •  ?? hash INTEGER -- 8 bytes 'o' MD5(url)
      • url TEXT
    • Helper Index: (url)
    • Usage

Explicit Attributes

Explicit attributes are attributes that are the result of a user action, creating data that cannot be reproduced, except by the user. For example, when the user applies a 'tag' to a message, this is an explicit attribute. Tags and the set of flags natively supported by MailNews are (in the case of e-mail), in theory, reliably preserved through their storage in the message (or on its keywords for IMAP servers). But any new explicit attributes created by add-ons/plug-ins/extensions (or users, should we expose such a capability, which we probably should) will not have that luxury.

This means we should probably go out of our way to find a way to ensure that explicit attributes (that are not already preserved else-where) can be backed-up and restored and/or journaled as they are set. At least, that's a reasonable lesson from the sadness of the Mork index files, though the situations are arguably different. (In theory, we can rely on SQLite to not corrupt itself, and perhaps we might assume that we back-up the global database at start-up like is under discussion/has been implemented for places.)

The simplest solution, if somewhat fill-up-your-disk-drive-y, is to log the message id of a message, the attribute's attribute definition, the attribute value, and any corresponding BLOB that the extension claims is associated with the attribute.

  • Tags!
    • note: We generate a separate attribute definition for each tag. This allows us to use the 'value' field of the attribute to provide a time-date stamp as to when the user assigned the tag. Awesome? Probably.
    • Definitions
      • "explicit", "tags", "TAG", "To Do"
      • "explicit", "tags", "TAG", "Personal"
    • Usages
      • conv(2), msg(2), attr("TAG", "To Do"), date(2001/01/01)
  • Calendar Tasks/Events
    • note: You can see where I'm going here. I still need to figure out if we can get an internal unique identifier that we can trust to fit in 8 bytes...

Implicit Attributes

Like explicit but instead of the user explicitly creating an attribute, we created it by observing the user's behaviour and deciding that we should make a note of it. The most important difference being that it would probably be infeasible/a bad idea to back this information up using a journaling strategy unless the journal aggressively compacted.

Also, let's not go crazy with implicit attributes. Given the number of indices currently suggested on our attributes, it is not a trivial operation to add/modify an attribute.

  • User Attention! Track messages the user opens a lot / looks at a lot!
    • note: The idea is to not have this as an attribute on every message. If we do that, it would probably be a better idea to just add a column to the message table. The idea here is that we would only first tag a message once it crossed a certain threshold of usage. This would probably be as a result of either:
      • Obvious hints from the UI in a single message viewing experience.
      • Use of some form of LRU-discarding cache of messages observed that would track enough information to decide when a message could graduate to explicitly having a tag (and at which point it would be evicted from the cache).
    • Definitions
      • "implicit", "achtung!", "ATTENTION", NULL
    • Usages
      • conv(5), msg(6), attr(ATTENTION), 42 (some arbitrary scale-factor)

Aggregate Data

'Aggregate Data' providers extract computationally expensive information from messages. We do not expect that users will need or want all of this information for all of their messages. Even when they do want it, they may not want it forever. For this reason, the global database provides a mechanism for extensions to define 'aggregate data providers'. The responsibility for determining when these providers should be invoked, when their results should be cached, when they should be updated, and when they should be deleted is left up to the global database layer (although informed by requests to the layer). By centralizing the responsibility, we simplify the extensions and hopefully avoid over-zealous extensions from consuming too many system resources.

The aggregate data mechanism does not simply cache prior results. Each entry in the table is considered a 'live' representation that is updated as new messages are processed by the global database. (Note: for performance reasons, the updates may not happen immediately, although they will happen before the aggregate data is return to a caller requesting it.)

Examples of possible aggregate data providers:

  • Message Body Analysis:
    • Topic-Analysis; categorization of message into distinct topics.
    • Word-Histograms (term frequencies) per contact per time period, enabling themail-like analysis.
  • Aggregation of values per time period over a period of time:
    • Messages sent by a contact, per week, (since forever). Possibly restricted by tag, recipient, etc. These can be visualized using sparklines, etc.
    • (Assuming a derived-attribute provider for money you spent on books from invoice/receipt e-mails), the amount of money you spent on books per month (since forever). Suitable for graphing.
    • (Assuming a very fancy bugzilla derived-attribute provider), the number of bugs you fixed per week. Suitable for graphing.
  • Complex/expensive/frequently used queries:
    • Geolocation-ish queries. What messages were sent within some geographic boundary (in the state of New York, in New York city, within 20-miles of this lat/lon, etc.)

Potential Denormalizations

Use Cases

Conversation Centric

Conversations ordered by their most recent message
By 'caching' the newest/most recent message on the rows in the conversations table, we are able to use a fairly cheap query to directly retrieve the recent conversations with enough information to perform a basic display. We can then fill in the rest with the actual underlying message information. (The information could also be retrieved from the messageAttributes

Bugzilla Workflow

As raised by Kent James, "If the mailnews team figured out how to effectively use TB or SM *in their own work* to combine information flow, it would be an amazing leap forward."

From the perspective of the global database, this is actually fairly simple. Each message you exchange with other people (e-mail/newsgroup), chat you are involved with (IRC/instant messaging), automated e-mails from bugzilla (e-mail), and check-in feed you subscribe to (RSS/Atom) simply needs to be tagged with a (derived) attribute that uniquely identifies the bug. For example, a BUGZILLA attribute (parameterized/defined by the URL of the bugzilla server which serves as the namespace) with the value being the bug identifier.

The difficult part is exposing those data sources that MailNews does not yet support as messages to the global database. Writing the derived attribute processor may also be non-trivial in some cases, at least in terms of reliably determining when a bug is being mentioned if there's just a plain number, and in the case where multiple bugzilla systems may potentially be discussed. Given the significance of bugzilla in the mozilla world, it might even be oddly practical to have the extension directly query the bugzilla server as a source of information.

Additional things a derived attribute 'data extractor' could provide would be:

  • Indicating the relative importance/significance of messages. For example, an automated e-mail about someone being added to the CC list is not very interesting or significant. An automated e-mail where a patch is added or reviewed could be very significant.
  • Indicating the state of the bug or significant events in the life of the bug for time-line visualization purposes. See when the bug was first opened, when it was actually confirmed, when patches showed up, patches were reviewed, things were committed, etc. The goal in this case is not to duplicate (or create) bugzilla features, but to provide context to the user's other communication. For example, on a time-line visualization, if the user can see when a patch was first posted and then rejected, it could help them to know that other communication around that time might be of specific interest to them. (And could inform automated assistance, fancy user queries, etc.)

Extension Interaction

Extensions are able to register attribute and aggregate data providers.

API

I have some doodles on this, but basically, I think it would be swell to:

  • Structure things so that results are provided asynchronously, as we are eventually going to get our results asynchronously.
  • Have queries resembling jQuery or Django; successive calls conceptually successively filter the context (be it the set of all conversations/messages, a single conversation, or a contact). In actuality, of course, the calls are just setting constraints and the query is not kicked off until explicitly requested/needed.

In terms of STEEL, the plan is that, for now, we will be a consumer of STEEL, if possible. To the extent possible we will mimic STEEL's API, but since we are trying to build a layer on top of the folder-centric interface, those things will be different.

Our relationship to/interaction with the address book is going to be non-existent at first (we will have our own concept of contacts/identities), but will subsequently need to be integrated. Because our contacts/identities are first and foremost normalizations, we do not need or want a 1:1 mapping with the address book. Our goal is to make sure that we can get from our contacts/identities to address book cards and back again.

Implementation Plan

Milestones

The game plan for now is to produce two extensions:

* gloda, the global database extension.
* expmess, the experimental message view extension.

The hope is to decouple the experimental message view from the global database development so that they can be worked on by different parties. For the first several milestones, the global database will be growing its fundamental features, limiting the functionality of the experimental message view (and coupling them), but still allowing for the experimental message view's interaction model/integration with the system to be fleshed out at least.

Below are proposed a series of milestones. If the proposed limitations at each stage sound silly/ridiculous, it's only because the goal is to try and get a steady stream of (small) milestones, and because eliminating the silliness requires an additional piece of functionality and adding that functionality would be equivalent to dropping a milestone (because it would be in a subsequent milestone).

  • M0, the pathologically simple base case:
    • The very primitive global database will:
      • Perform simple/free conversation threading (in-reply-to/references). No subject threading. (This is not because threading by subject is difficult, but because naive subject threading, especially across many folders, can result in erroneously associated conversations, potentially to the tune of thousands of messages. This is not desirable for testing or usage.)
      • Be able to process only e-mail accounts/folders.
      • Process e-mail accounts statically; if you move a message around, it will get confused.
      • Probably be far too synchronous and conceivably enjoy making the UI unusuable while it is thinking.
      • Do nothing with contacts/identities.
      • Do nothing with message attributes.
    • The very basic experimental message view will:
      • Show all the messages in a conversation (using the global database, so cross-folder).
      • Be keyed off the currently selected message (easy to hook into...)
      • Be its own little panel off to the right side of the screen.
      • Be linear by time; no tree stuff.
      • Not respond to message selection.
      • Derive its display data for display from loading the underlying message header
  • M1: An actual (simple) proof-of-concept:
    • Global Database
      • Process/create contacts/identities. (No interaction with address book!)
      • Populate fundamental attributes (To/Cc/From/Date)
      • Populate 'body snippet', with at least some attempt to ignore quoted data.
      • Expose fundamental attributes as data, not requiring loading the underlying message
      • Provide query support for fundamental attributes.
    • Experimental Message View
      • Grow a 'data mine' sidebar in our frankenstein overlay pane.
      • Message list gets its data only from the global database, without using the underlying message header.
      • Clicking on a message in the list displays information on the sender in the 'data mine'. Namely, some conversations they have been involved in and some people they have e-mailed.
      • The data-mine will not have anything you can click on.

Okay, everything after here is very nebulous, and the experimental message view should be thought of as a separate feature track once the global database is sufficiently feature-complete...

  • M2: Broad strokes all there, maybe? (As proposed, this may be too many changes and should be split into two).
    • Global Database
      • Process attachments.
      • Process user tags.
      • Update the global database based on listening to changes that happen.
    • Experimental Message View
      • Don't be a frankenstein overlay, but an actual view.
      • Clicking on things in the data mine does stuff.
      • Hook message list up to quick-search-alike.
  • M3: Be reasonably usable
    • Global Database
      • Do as much as possible asynchronously, minimizing blocking of the user interaction thread.
      • Have some form of support for subject threading.
  • M4: Support derived attributes by extensions.
    • Example Bugzilla extension!
    • Global Database
      • Support derived attributes by extensions.
    • Experimental Message View
      • Whatever it takes to make the bugzilla case happy.
  • M5: Support aggregate data by extensions.
    • Extension that draws pretty sparklines!
    • Global Database
      • Yeah, support that.
    • Experimental Message View
      • Show sparklines/graphs!
  • Mx: Support newsgroups.
  • My: Support RSS feeds.

Future Work

'Correct' Subject Threading

When References/In-Reply-To headers are not available (or Thread-Index/Thread-Topic headers for some Microsoft cases), MailNews has the ability to fail-over to threading based on the subject. (See MailNews:Message_Threading for background information on threading and MailNews' current implementation and the preferences that affect it.) This is workable, but is not without its issues.

A (more expensive) solution is to leverage the fact that many replies will include quotes from, or the entire body of, the message they are in response to. One could even make the argument that people who actually care about top-posting versus bottom-posting and cutting down the quoted content to only the relevant snippets are likely to use a mail client that produces References/In-Reply-To headers. One could then continue to argue that people who don't use well-behaved mail clients probably don't care about these things and probably leave the entire quoted message around (if quoting is enabled).

This suggests it is possible to more reliably thread messages by subject by also processing their content for quoted content from the previous message. Not only that, but that we probably do not need to be exceedingly thorough about the quote analysis; there should be some very juicy low-hanging fruit.

This work can also be used to enhance the user experience by linking quoted text (even reduced quotes) back to their original message.