Thunderbird:SummerOfCode2006:SPAM
Proposal Overview
Objectives
The primary objectives of this project are to:
- Develop a set of performance tests to measure the effectiveness of Thunderbird's Bayesian spam filter implementation.
- Develop and implement several improvements to the spam filters.
- Regression test the performance of these changes.
- Investigate training improvements including a default training set.
Background
Thunderbird contains a Bayesian spam filter based on the Spam Bayes Project, which uses a chi-squared distribution model suggested by Gary Robinson.
Our chi-squared implementation was originally developed in May of 2004. The effectiveness of the changes was measured using a simple random sample from the Spam Assassin Public Corpus.
Over the last two years, there have been many suggested improvements for our Bayesian spam filter. We need to develop a formalized performance test for the filter. With tests in hand, we could then start experimenting with these improvements.
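For readers unfamiliar with the chi-squared combining step, the following Python sketch shows the general idea as it is commonly described for Spam Bayes. It is not Thunderbird's actual code; the function names `chi2q` and `combine` are made up for illustration, and the per-token probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def chi2q(x2, dof):
    """Probability that a chi-squared variable with `dof` degrees of
    freedom exceeds x2. Only valid for even `dof`."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities (each in (0, 1)) into one
    score in [0, 1]. Values near 1 suggest spam, near 0 suggest ham,
    and near 0.5 mean the evidence is weak or conflicting."""
    n = len(probs)
    if n == 0:
        return 0.5
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0
```

A message whose tokens are all strongly spammy scores close to 1, while a message with evenly conflicting evidence scores 0.5, which is what makes an "unsure" band possible.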
Our existing spam filter implementation can be found here: [1].
Regression Tests
There are a lot of interesting ideas for how we can improve our spam filter. However, before we can embark on any of them, we need an effective set of performance tests to determine if these improvements really do make the filter better.
Gary Robinson talks about an interesting approach, five-fold cross-validation, which could be applied to the Spam Assassin public corpus. This could be a good starting point. For more information, read the Testing section of this paper on handling token redundancy.
The Spam Bayes Wiki may have other ideas as well.
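As a rough illustration of the five-fold cross-validation idea mentioned above, here is a minimal Python sketch. The names `k_fold_splits`, `cross_validate`, `train_fn`, and `classify_fn` are hypothetical, and the corpus is assumed to be a list of `(text, is_spam)` pairs; a real harness would load messages from the corpus on disk and drive the actual filter.

```python
import random

def k_fold_splits(messages, k=5, seed=0):
    """Yield (train, test) pairs; each message is in exactly one test fold."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, folds[i]

def cross_validate(messages, train_fn, classify_fn, k=5, seed=0):
    """Return one (false_positive_rate, false_negative_rate) pair per fold.

    train_fn(train_messages) builds a model; classify_fn(model, text)
    returns True when the text is judged to be spam."""
    results = []
    for train, test in k_fold_splits(messages, k, seed):
        model = train_fn(train)
        ham = sum(1 for _, is_spam in test if not is_spam)
        spam = len(test) - ham
        fp = sum(1 for text, is_spam in test
                 if not is_spam and classify_fn(model, text))
        fn = sum(1 for text, is_spam in test
                 if is_spam and not classify_fn(model, text))
        results.append((fp / ham if ham else 0.0,
                        fn / spam if spam else 0.0))
    return results
```

Running the same splits before and after a filter change gives directly comparable false positive and false negative rates, which is the property a regression test needs.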
Improving the Filter
Improved Chi
Gary Robinson wrote a paper in April of 2004 about Handling redundancy in Email Token Probabilities. I think it would be interesting to implement the suggested changes in this paper. This is being tracked in Bug 243430.
Token Pruning
Currently the set of training tokens grows without bound as the user keeps training the filter. Pruning the token base according to the age of each token could be a good way to keep the training set small while giving greater precedence to more recent junk mail messages. This is being tracked in Bug 228675. There is also some discussion about token pruning on Spam Bayes Adaptive Training.
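One possible shape for age-based pruning is sketched below in Python. It assumes each token record carries a `last_seen` timestamp, which the existing training data may not store; the `prune_tokens` function and the record layout are purely illustrative.

```python
import time

def prune_tokens(token_store, max_age_seconds, now=None):
    """Remove tokens not seen within max_age_seconds.

    token_store maps token -> {"spam": int, "ham": int, "last_seen": float}
    (a hypothetical layout). Returns the number of tokens removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    # Collect first, then delete, so the dict is not mutated while iterating.
    stale = [tok for tok, rec in token_store.items() if rec["last_seen"] < cutoff]
    for tok in stale:
        del token_store[tok]
    return len(stale)
```

Refreshing `last_seen` whenever a token is seen during training or classification would keep tokens that are still in active use from being pruned.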
Token Ratios
A user can overtrain the filter on spam or, less likely, on ham. We don't maintain a ratio of ham to spam tokens in the training set. This can lead to performance issues for the filter. It could be useful to enforce a ratio.
It might also be interesting to think about automatically tokenizing new e-mail you send as ham when the training ratios get out of whack.
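A ratio check along these lines could be as simple as the Python sketch below. The function name `overtrained_class` and the default limit of 3:1 are assumptions chosen for illustration, not values taken from Thunderbird or from any measurement.

```python
def overtrained_class(spam_count, ham_count, max_ratio=3.0):
    """Return "spam" or "ham" if that class outweighs the other by more
    than max_ratio, or None if the training set is within bounds."""
    if spam_count > max_ratio * ham_count:
        return "spam"
    if ham_count > max_ratio * spam_count:
        return "ham"
    return None
```

When the result is "spam", the filter could, for example, start treating newly sent mail as ham training data until the balance recovers.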
Base64
We currently don't decode base64-encoded message bodies. As a result, these message bodies never get tokenized and fed into the Bayesian spam filter. To Do: Add a bug link.
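To show what decoding before tokenizing involves, here is a small Python sketch using the standard library `email` package. It is only an illustration of the idea, not a proposal for the Thunderbird code itself; `extract_text_bodies` is a hypothetical helper name.

```python
from email import message_from_string

def extract_text_bodies(raw_message):
    """Return the decoded text of every text/* part in a raw message."""
    msg = message_from_string(raw_message)
    bodies = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes the Content-Transfer-Encoding
        # (base64 or quoted-printable) and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            bodies.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name: fall back rather than drop the body.
            bodies.append(payload.decode("utf-8", errors="replace"))
    return bodies

# Usage: "aGVsbG8gd29ybGQ=" is the base64 encoding of "hello world".
sample = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8gd29ybGQ=\n"
)
print(extract_text_bodies(sample))
```

The decoded text can then be passed to the tokenizer in the same way as any unencoded body.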
Consider a New Approach
There are many other open source spam filters in the wild, such as The DSPAM Project. Another option is the Chung-Kwei approach, which is discussed in Bug 256563.
Additional Reading
- Paul Graham's A Plan For Spam
- Gary Robinson's Why Chi
- Gary Robinson's Improved Chi
- Spam Bayes Wiki