I have a load of database entries that have been saved that are full of spam. I would like to be able to pipe the text output of each into a spamassassin or similar tool to be able to get a score on how likely it is to be spam, but without the whole machine-learning thing from mailboxes, or even running on a mail server. It seems that everything I've found is incredibly biased towards emails rather than just a simple stdin > process > stdout type thing.

If there's one written in a scripting language, that's fine, but I'd rather something that can work with an out-of-the-box centos machine. Any help appreciated.

MadHatter

1 Answer

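It's interesting you mention spamassassin, because it has a mode that seems to be exactly what you want: it reads a message on stdin and writes the scored result to stdout. In the run below, /tmp/spammy is my input file and contains a single candidate email; everything after the command line is spamassassin's output: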

[me@lory tmp]$ spamassassin < /tmp/spammy 
Oct 20 11:54:47.097 [19986] warn: netset: cannot include 127.0.0.1/32 as it has already been included
From: "REDACTED" <redacted>
To: REDACTED
Subject: Pharmacy
Date: 20 Oct 2014 02:22:04 +0100
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on lory.teaparty.net
X-Spam-Flag: YES
X-Spam-Level: *********
X-Spam-Status: Yes, score=9.2 required=3.9 tests=BAYES_20,MISSING_MID,
        NO_RECEIVED,NO_RELAYS,TVD_SPACE_RATIO,URIBL_BLACK,URIBL_DBL_SPAM,
        URIBL_JP_SURBL,URIBL_SBL,URIBL_WS_SURBL autolearn=no version=3.3.1
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_5444E9FB.89EA3D9F"

This is a multi-part message in MIME format.

------------=_5444E9FB.89EA3D9F
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Spam detection software, running on the system "lory.teaparty.net", has
identified this incoming email as possible spam.  The original message
has been attached to this so you can view it (if it isn't spam) or label
similar future email.  If you have any questions, see
the administrator of that system for details.

Content preview:  Good medicines special http://canadiantabletstore.com/ [...]


Content analysis details:   (9.2 points, 3.9 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 2.5 URIBL_DBL_SPAM         Contains a spam URL listed in the DBL blocklist
                            [URIs: canadiantabletstore.com]
 1.7 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: canadiantabletstore.com]
 1.6 URIBL_WS_SURBL         Contains an URL listed in the WS SURBL blocklist
                            [URIs: canadiantabletstore.com]
 1.2 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: canadiantabletstore.com]
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
 1.6 URIBL_SBL              Contains an URL's NS IP listed in the SBL blocklist
                            [URIs: canadiantabletstore.com]
-0.0 BAYES_20               BODY: Bayes spam probability is 5 to 20%
                            [score: 0.1750]
 0.5 MISSING_MID            Missing Message-Id: header
-0.0 NO_RECEIVED            Informational: message has no Received headers
 0.0 TVD_SPACE_RATIO        TVD_SPACE_RATIO



------------=_5444E9FB.89EA3D9F
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Date: 20 Oct 2014 02:22:04 +0100
From: "REDACTED" <REDACTED>
To: REDACTED
Subject: Pharmacy

Good medicines special
http://canadiantabletstore.com/


------------=_5444E9FB.89EA3D9F--
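If you only want the number rather than the whole report, the X-Spam-Status header in output like the above already carries it. Here is a minimal sketch, not anything built into SpamAssassin: the function names sa_score and sa_is_spam are my own, and it assumes the header keeps the "score=... required=..." layout shown in the transcript.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of SpamAssassin).

# Read a SpamAssassin-processed message on stdin and print
# "<score> <required>" taken from the first X-Spam-Status header line.
# Prints nothing if no such header is found.
sa_score() {
    sed -n 's/^X-Spam-Status: .*score=\(-\{0,1\}[0-9.]*\) required=\(-\{0,1\}[0-9.]*\).*/\1 \2/p' |
        head -n 1
}

# Exit status 0 when score >= required, 1 otherwise
# (a missing header also counts as "not spam" here).
sa_is_spam() {
    sa_score |
        awk 'BEGIN { rc = 1 } NF == 2 && $1 + 0 >= $2 + 0 { rc = 0 } END { exit rc }'
}
```

With these defined, something like `spamassassin < /tmp/spammy 2>/dev/null | sa_score` should print the two numbers for the message above. The spamassassin man page also documents an -e (--exit-code) option that returns a non-zero exit status for spam; check that your installed version has it before relying on it.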
MadHatter
  • Oh yeah... I couldn't find anything documenting that, it was all going on about mail servers :/ – Matt Fletcher Oct 20 '14 at 11:32
  • Do you know if there's a porcelain mode or am I going to have to grep it? I'll just grep it... :P – Matt Fletcher Oct 20 '14 at 11:44
  • I found that cat /tmp/spammy | spamassassin |& grep ^X-Spam-Flag | grep NO > /dev/null did what I wanted; exit status 0 for ham, 1 for spam. YMMV! – MadHatter Oct 20 '14 at 11:45
  • I would recommend maybe rephrasing your answer though- it isn't instantly apparent that /tmp/spammy is something you're testing. Maybe mention that spamassassin does indeed take stdin – Matt Fletcher Oct 20 '14 at 11:45
  • Thanks for that, Matt, but the very first line mentions that the file contains a single candidate email - does that not make it clear that that's what's being tested? – MadHatter Oct 20 '14 at 11:46
  • Perhaps- it's probably just me being thick. At first glance I thought that what you'd pasted was the contents of spammy.txt - I also wasn't sure whether the file was generated by you or by SA (as I have no prior experience with it). That kind of thing. Probably just stupidity as previously mentioned :P Cheers again – Matt Fletcher Oct 20 '14 at 11:53
  • Also, it seems to be checking for things based around emails, so it's still failing on EMPTY_MESSAGE, MISSING_DATE, MISSING_SUBJECT etc etc – Matt Fletcher Oct 20 '14 at 12:09
  • Yes, that's not surprising. Spam is an email concept; it generally refers to unsolicited commercial email. It had other meanings in the USENET days, but those are generally gone, and most people think of email when you say spam. If what you've got isn't a DB of email, what is it, that you think it might have spam in it? Are you perhaps looking for a tool to detect advertisements? – MadHatter Oct 20 '14 at 12:45
  • It's more like spammy blog content- bots have basically managed to create false user accounts and then generate content (testosterone, viagra, even selling pushchairs and toys). The site is a student revision website (I won't mention which one) and so a hell of a lot of content is written by children, making it hard to separate the two. But yes, what you said in your final sentence :) – Matt Fletcher Oct 20 '14 at 13:24
  • You could consider putting in a set of SA-neutral, RFC-compliant headers, onto each message before running it through SA; that would allow the content to be fully processed without the grumbling about it not being an email message. – MadHatter Oct 20 '14 at 13:56
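
A minimal sketch of the header-wrapping idea from the last comment, for anyone wanting to try it. Everything in it is an assumption for illustration: the function name wrap_as_email, the example.invalid addresses, and the choice of headers are mine, and I have not checked which SpamAssassin rules these particular values still trigger, so treat them as a starting point to tune.

```shell
#!/bin/sh
# Hypothetical helper (not part of SpamAssassin): wrap plain text from
# stdin in a minimal set of mail headers so it parses as a message.
# $1 is an optional record identifier used in Subject and Message-ID.
wrap_as_email() {
    id=${1:-entry}
    # Placeholder addresses; example.invalid is a reserved, non-routable name.
    printf 'From: Content Check <check@example.invalid>\n'
    printf 'To: Content Check <check@example.invalid>\n'
    printf 'Subject: entry %s\n' "$id"
    # date -R prints an RFC 2822 style date (GNU coreutils, as on CentOS).
    printf 'Date: %s\n' "$(LC_ALL=C date -R)"
    printf 'Message-ID: <%s.%s@example.invalid>\n' "$id" "$(date +%s)"
    printf 'MIME-Version: 1.0\n'
    printf 'Content-Type: text/plain; charset=UTF-8\n'
    # A blank line ends the headers; the body is passed through unchanged.
    printf '\n'
    cat
}
```

Usage would look like `wrap_as_email 42 < entry.txt | spamassassin`, where entry.txt is a hypothetical file holding one database entry. This should stop complaints such as MISSING_DATE and MISSING_SUBJECT, though informational rules like NO_RECEIVED will most likely still appear since no Received header is added.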