Filter hebrew, russian, chinese… spams with SpamAssassin

Hi!

According to several reports from my users, we have been getting more and more spam written in foreign languages.
Despite my otherwise good amavis/SpamAssassin filtering setup, Bayesian filters are useless against it, and since this spam usually comes from valid Yahoo/Gmail/other accounts, it isn’t reported to Pyzor or DCC.
A real pain…

The good news is that nobody around here speaks Hebrew, so I can safely tag these mails as junk. Here is a quick howto to enable this on Debian:

Edit /etc/spamassassin/v310.pre and uncomment the following line:

loadplugin Mail::SpamAssassin::Plugin::TextCat
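
If you would rather script this step, here is a minimal sketch. It assumes the line is already present in the file but commented out with a leading “#”; the function name enable_textcat is my own and is not something SpamAssassin ships.

```shell
# Uncomment the TextCat loadplugin line in a SpamAssassin .pre file.
# Assumption: the line exists and is commented out with a leading "#".
# A backup of the original file is kept next to it with a .bak suffix.
enable_textcat() {
    # $1: path to the .pre file, e.g. /etc/spamassassin/v310.pre
    sed -i.bak -E \
        's/^[[:space:]]*#[[:space:]]*(loadplugin[[:space:]]+Mail::SpamAssassin::Plugin::TextCat)[[:space:]]*$/\1/' \
        "$1"
}
```

Run it as root against /etc/spamassassin/v310.pre; other commented lines in the file are left untouched.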

Configure this new plugin from /etc/spamassassin/local.cf:

# SpamAssassin TextCat (Language Guesser Plugin)
# http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Plugin_TextCat.html
ok_languages en fr # I can't understand anything other than French or English
inactive_languages '' # Enable all languages
score UNWANTED_LANGUAGE_BODY 5 # Increase score
add_header all Languages _LANGUAGES_ # Write the detected languages in X-Spam-Languages

“ok_languages” lists only the languages I actually understand (French and English). You can add yours (see the URL in the config comment for the language codes).
The second line enables all supported languages. By default TextCat disables a couple of rare languages to save server resources, but honestly, who cares about CPU usage on servers nowadays…
Then I raise the score to 5 (the default is 2.8), and the last line adds an X-Spam-Languages header so I can check my spam/ham to see which languages were detected.
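
To actually make use of that header, something like the following sketch can tally the detected languages over a mailbox. It assumes a Maildir-style directory (one message per file) and that the header lists language codes separated by whitespace; count_languages is a name I made up. It greps whole files, so a body line starting with the same text would be counted too.

```shell
# Count how often each language code appears in X-Spam-Languages headers.
# Assumptions: $1 is a directory with one message per file, and the header
# value is a whitespace-separated list of language codes.
count_languages() {
    grep -rih '^X-Spam-Languages:' "$1" |
        sed -E 's/^[^:]*:[[:space:]]*//' |  # keep only the header value
        tr -s '[:space:]' '\n' |            # one language code per line
        grep -v '^$' |                      # drop empty values
        sort | uniq -c | sort -rn           # most frequent first
}
```

For example, count_languages ~/Maildir/.Junk/cur prints one line per language with its count in front.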

However, amavis rewrites all headers on its own and drops X-Spam-Languages.
So edit “/etc/amavis/conf.d/50-user” and add the following lines before “1;”:

# Print X-Spam-Languages header from TextCat SpamAssassin plugin
$allowed_added_header_fields{lc('X-Spam-Languages')} = 1;

This asks amavis to keep this header from SpamAssassin. Please note that it won’t work unless you’re running amavis >= 2.7!
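
A quick way to test that version requirement from a script could look like the sketch below. The helper names upstream_version and version_ge are mine, and the comparison relies on sort -V, a GNU coreutils extension, so treat it as an assumption that your system has it.

```shell
# Strip a Debian epoch ("1:") and revision ("-2") from a package version,
# leaving only the upstream part, e.g. "1:2.7.1-2" becomes "2.7.1".
upstream_version() {
    v=${1#*:}
    printf '%s\n' "${v%%-*}"
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), assumed to be available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
```

On Debian, assuming the package is named amavisd-new, you could feed it the installed version with: version_ge "$(upstream_version "$(dpkg-query -W -f='${Version}' amavisd-new)")" 2.7 && echo "amavis is recent enough"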

You may want to check that SpamAssassin loads the module fine:

user@server:~$ sudo spamassassin --lint -D 2>&1 | grep -i textcat
Oct 22 21:34:25.772 [17852] dbg: plugin: loading Mail::SpamAssassin::Plugin::TextCat from @INC
Oct 22 21:34:25.778 [17852] dbg: textcat: loading languages file...
Oct 22 21:34:25.885 [17852] dbg: textcat: loaded 73 language models
Oct 22 21:34:26.541 [17852] dbg: config: fixed relative path: /var/lib/spamassassin/3.003002/updates_spamassassin_org/25_textcat.cf
Oct 22 21:34:26.542 [17852] dbg: config: using "/var/lib/spamassassin/3.003002/updates_spamassassin_org/25_textcat.cf" for included file
Oct 22 21:34:26.543 [17852] dbg: config: read file /var/lib/spamassassin/3.003002/updates_spamassassin_org/25_textcat.cf
Oct 22 21:34:28.913 [17852] dbg: plugin: Mail::SpamAssassin::Plugin::TextCat=HASH(0xa0af534) implements 'extract_metadata', priority 0
Oct 22 21:34:28.915 [17852] dbg: textcat: classifying, skipping: ''
Oct 22 21:34:28.936 [17852] dbg: textcat: language possibly: en
Oct 22 21:34:28.937 [17852] dbg: textcat: X-Languages: "en", X-Languages-Length: 1342

Don’t worry about the “language possibly: en” line, it doesn’t mean anything: when run with --lint, SpamAssassin behaves as if it were processing a real mail.
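
If you want this check in a script instead of reading the debug output by eye, a small sketch follows. It only looks for the “loaded N language models” line seen in the output above; textcat_loaded is a hypothetical helper name, and the exact debug wording may differ between SpamAssassin versions.

```shell
# Read SpamAssassin debug output on stdin and succeed only if TextCat
# reports that it loaded at least one language model.
# Assumption: the debug line has the same wording as in the sample output.
textcat_loaded() {
    grep -Eq 'textcat: loaded [1-9][0-9]* language models'
}
```

Usage: sudo spamassassin --lint -D 2>&1 | textcat_loaded && echo "TextCat OK"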

Restart amavis and enjoy!

Here is what you should soon find in the headers of a mail from your Junk mailbox:

X-Spam-Flag: YES
X-Spam-Score: 14.858
X-Spam-Level: **************
X-Spam-Status: Yes, score=14.858 tagged_above=-999 required=6.31
	tests=[AWL=-2.517, BAYES_99=6, DCC_CHECK=2.5, HTML_MESSAGE=0.001,
	MPART_ALT_DIFF=0.79, NORMAL_HTTP_TO_IP=0.001, RCVD_IN_PSBL=2.7,
	RP_MATCHES_RCVD=-0.735, SPF_PASS=-0.5, T_KHOP_FOREIGN_CLICK=0.01,
	UNWANTED_LANGUAGE_BODY=5, URIBL_WS_SURBL=1.608] autolearn=no
X-Spam-Languages: pt
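
To check a saved message for that rule without reading the folded header by hand, here is a rough sketch. It unfolds the header section first, because X-Spam-Status spans several lines as shown above; both function names are my own invention.

```shell
# Print the header section of a message read from stdin, with folded
# (continuation) lines joined back onto the header they belong to.
unfold_headers() {
    awk '
        /^$/     { exit }                                   # blank line: end of headers
        /^[ \t]/ { sub(/^[ \t]+/, " "); buf = buf $0; next } # continuation line
                 { if (buf != "") print buf; buf = $0 }     # start of a new header
        END      { if (buf != "") print buf }
    '
}

# Succeed if the X-Spam-Status header lists UNWANTED_LANGUAGE_BODY.
flagged_for_language() {
    unfold_headers | grep -i '^X-Spam-Status:' | grep -q 'UNWANTED_LANGUAGE_BODY'
}
```

Usage: flagged_for_language < some-message && echo "hit by the language rule". Only the headers are inspected, so the rule name appearing in the body does not count.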

5 thoughts on “Filter hebrew, russian, chinese… spams with SpamAssassin”

Benja on 2013/10/09 at 22:09 said:

Hi!
This saved my day 🙂
May I still ask how spamassassin scores a mail too short to detect language?
Or more generally a mail where it cannot determine the language.
Thanks!

Le_Vert on 2013/10/09 at 22:45 said:

Hi,

I’m not sure I understand your question. If SpamAssassin can’t detect any language, this feature is simply a no-op!

Thomas on 2017/02/11 at 13:11 said:

Great post, thank you! Works like a charm.

Discount on 2018/02/03 at 03:41 said:

Amazing things here. I am very happy to look your post.
Thank you so much and I am looking forward to touch you.
Will you kindly drop me a mail?

- Le_Vert on 2018/02/03 at 04:03 said:
  
  Why is that ? You can reach me at gandalf@le-vert.net
  

blog.le-vert.net

No bullshit, only Linux stuff
