This article is an excerpt from Ending Spam: Bayesian Content Filtering and
the Art of Statistical Language Classification. Printed with permission from
No Starch Press. Copyright 2005.
Unlike older spam filters, in which the author programs the characteristics
of spam, statistical filtering automatically chooses the characteristics (or
"features") of spam and nonspam directly from each e-mail. Two years from
now, when spam has evolved in content, statistical filters will have learned
enough to continue doing their job, because they keep identifying the
damning features of spam from message content instead of relying on rules
fixed in advance.
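The idea of learning features from the messages themselves, with no
hand-written rules, can be illustrated with a minimal Python sketch. It is
not the filter described in the book: the whitespace-based features()
helper, the train() and spam_leaning() names, and the simple ratio used in
place of a real statistical calculation are all assumptions made for
illustration.

```python
from collections import Counter


def features(message):
    # Assumption: lowercased, whitespace-separated words stand in for
    # features. Each feature is counted at most once per message.
    return set(message.lower().split())


def train(spam_messages, nonspam_messages):
    """Count how many messages of each class contain each feature."""
    spam_counts = Counter()
    nonspam_counts = Counter()
    for message in spam_messages:
        spam_counts.update(features(message))
    for message in nonspam_messages:
        nonspam_counts.update(features(message))
    return spam_counts, nonspam_counts


def spam_leaning(feature, spam_counts, nonspam_counts):
    """Fraction of the messages containing this feature that were spam.

    Returns None for a feature never seen in training. This plain ratio is
    a hypothetical stand-in for a real statistical score.
    """
    s = spam_counts[feature]
    n = nonspam_counts[feature]
    if s + n == 0:
        return None
    return s / (s + n)
```

Because the counts come only from the training messages, retraining on
newer mail changes which features look damning without anyone editing a
rule.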
Tokenization is the process of reducing a message to its coll...