[math-fun] spam blocking, not mathematically
hihi, all -

it turns out that the single most effective way to block the kind of spam that i get is to look for words or just letter sequences with embedded html comments, things of the form (this from one i got this morning)

    REVOLU<!--Am-->TIONARY

detecting and removing these messages would take care of almost all the spam i get, except for the nigeria scam and its brethren, and the messages with no content except a link (and since an embedded comment is needed only as a way to hide a word from a text search, that automatically makes the containing message spam as far as i'm concerned)

i suspect that removing these messages first would make any statistical approach far more effective, since many include extra (or only) nonsense words to circumvent language-based discrimination (i got this example also this morning)

    wpj<...html stuff...>a osnkni tolzbo mxwzys a

also, apparently you can encode ordinary ascii letters to disguise them, such as the example below for the letter i

    i

finally, i have not yet seen even one legitimate message from any of the new domains (.biz, .info, etc.), though i imagine that that will eventually happen

if anyone wants a good hard test of their spam filter rules or tools, please let me help, since i get enough mail that i do want to make discriminating it from spam interestingly difficult (i run my mail on a unix machine, so windows tools won't help me)

more later,
cal

Dr. Christopher Landauer
Aerospace Integration Science Center
cal@aero.org
+1 (310) 336-1361
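The embedded-comment test described in the message above can be sketched in a few lines of Python. This is only an illustration, not the rule set cal actually runs: the function names are made up for the example, and the encoded-letter check assumes the disguise in question is an HTML numeric character reference (such as `&#105;` for the letter i), which the message does not confirm.

```python
import re

# An HTML comment with a letter directly on each side,
# as in REVOLU<!--Am-->TIONARY.
EMBEDDED_COMMENT = re.compile(r"[A-Za-z]<!--.*?-->[A-Za-z]", re.DOTALL)
ANY_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Assumption: "encoded ascii letters" means numeric character
# references, decimal (&#105;) or hexadecimal (&#x69;).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def has_embedded_comment(body):
    """True if a comment splits a run of letters (hypothetical helper)."""
    return EMBEDDED_COMMENT.search(body) is not None


def strip_comments(body):
    """Remove all HTML comments, rejoining the words they split."""
    return ANY_COMMENT.sub("", body)


def encoded_ascii_letters(body):
    """List the plain ASCII letters written as numeric references."""
    found = []
    for match in NUMERIC_REF.finditer(body):
        if match.group(1):
            code = int(match.group(1), 16)
        else:
            code = int(match.group(2))
        if code < 128 and chr(code).isalpha():
            found.append(chr(code))
    return found
```

A comment that stands between whole words, with whitespace on either side, does not trigger `has_embedded_comment`, so ordinary commented HTML mail is not flagged by this rule alone.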
Of course, this kind of subterfuge is a spammer's dodge to avoid being caught by simple spam filters. The problem with trying to analyze the text to reduce it to the "real message" is that the real message is innocuous except to an intelligent reader. For example, I've received messages containing this kind of gibberish, for which the intelligible message is "gentleman's johnson enhancer". The beauty of the Bayesian approach is that the gibberish and html noise are themselves a filterable criterion, and I don't have to decide what parts are safe to use as the filter.
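As a rough illustration of that last point, here is a minimal naive Bayes sketch in Python in which HTML comments and tags are kept as tokens rather than stripped, so the noise itself becomes evidence. It is not the filter described in this thread: the tokenizer, the `*html-comment*` and `*html-tag*` marker names, the class, and the add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Comments are tried first so "<!--...-->" is not read as a generic tag.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|[A-Za-z]+", re.DOTALL)


def tokenize(body):
    """Split a message into word tokens plus markers for html noise."""
    tokens = []
    for match in TOKEN.finditer(body):
        text = match.group(0)
        if text.startswith("<!--"):
            tokens.append("*html-comment*")   # hypothetical marker token
        elif text.startswith("<"):
            tokens.append("*html-tag*")       # hypothetical marker token
        else:
            tokens.append(text.lower())
    return tokens


class NaiveBayes:
    """Toy two-class model with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, body, label):
        self.counts[label].update(tokenize(body))
        self.docs[label] += 1

    def spam_log_odds(self, body):
        """Positive means the message looks more like spam than ham."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = {k: sum(c.values()) for k, c in self.counts.items()}
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokenize(body):
            p_spam = (self.counts["spam"][tok] + 1) / (total["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][tok] + 1) / (total["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score
```

Because the split word fragments and the comment marker are all learned from the training mail, nobody has to decide by hand which parts of a message are safe to filter on.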
participants (2): Chris Landauer, Dave Dyer