From tlyons at ivenue.com  Mon Feb 22 14:53:42 2010
From: tlyons at ivenue.com (Todd Lyons)
Date: Mon, 22 Feb 2010 11:53:42 -0800
Subject: [Pymilter] general gossip questions
Message-ID: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>

Hi, just joined the list.  I am going to implement some kind of
reputation system on our mail servers.  I've studied and experimented
with both gossip and pygossip.  pygossip was not too difficult to get
running on my laptop (Ubuntu, servers are CentOS 4.x and 5.x).  gossip
compiled, but its implementation of SSL pulls in a few things that I
don't have a need for, so now I'll see about getting pygossip up and
running on a test mail system.

I need to implement the client on two different types of mail server,
sendmail and exim.  The off-the-cuff plan for sendmail is to write a
quick milter in perl, and for exim I'll embed the perl into exim and
write a few macros.

I would like to use pygossip's server capabilities, however I have
some questions:
1. Is there any more recent version/code in git or any other VCS?  Bug
fixes?  Feature additions?
2. I don't know much about python, but Shelves just seems to be a
version of berkeleyDB.  Is this right?
3. What about manual adjustments to the data?  For example, completely
purge one sender's reputation?
4. Any kind of summary or data mining scripts?  Bosses like pretty graphs.

I don't quite have a grasp of the UMIS yet, but I'm studying to see
how it differs in function from the unique identifier that is the
IP:domain tuple.

So far, I find the documentation within pygossip scripts to be quite
readable and illuminating.  You've written a fantastic project, I just
need my understanding to catch up with it.

-- 
Regards...      Todd
I seek the truth...it is only persistence in self-delusion and
ignorance that does harm.  -- Marcus Aurelius


From stuart at bmsi.com  Tue Feb 23 00:00:00 2010
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Tue, 23 Feb 2010 00:00:00 -0500 (EST)
Subject: [Pymilter] general gossip questions
In-Reply-To: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.1002222340470.20465@bmsred.bmsi.com>

On Mon, 22 Feb 2010, Todd Lyons wrote:

> don't have a need for, so now I'll see about getting pygossip up and
> running on a test mail system.

Beware the unsolved problem of REUSE_ADDR not working for TCPServer.

> 1. Is there any more recent version/code in git or any other VCS?  Bug
> fixes?  Feature additions?

I check in work to CVS on sourceforge.

> 2. I don't know much about python, but Shelves just seems to be a
> version of berkeleyDB.  Is this right?

Correct.  Easy, and sufficient for my servers.  My server with 
heaviest usage (100000 msgs/day) uses 630M for the shelve database
with no problem.  I think there is a way to plug in another database.

> 3. What about manual adjustments to the data?  For example, completely
> purge one sender's reputation?

The 'R' command resets the reputation for a domain:qual.  The tc.py
script (inexplicably not installed by package - should at least go in
/usr/lib/pymilter) provides a simple command line interface to query, feedback,
and reset reputations.

> 4. Any kind of summary or data mining scripts?  Bosses like pretty graphs.

No.  pygossip_purge.py deletes old unused records.  It could collect
stats while doing its thing.   (It could run daily automatically
when the REUSE_ADDR bug is fixed.)

> I don't quite have a grasp of the UMIS yet, but I'm studying to see
> how it differs in function from the unique identifier that is the
> IP:domain tuple.

The UMIS identifies a message - as opposed to a domain qual.  The
UMIS is stored in an email header, and can be used to update how that
message "votes" for the reputation of the domain.  The UMIS is used
so that each message gets at most one vote.  pygossip doesn't currently worry
too much about losing votes.  :-)

Note that IP has been generalized to qualifier.  In the milter package,
domains that pass SPF have the SPF qualifier.  Softfail gets the
SOFTFAIL qualifier, and so on for NEUTRAL, FAIL, PERMERROR, GUESS (gets
a best guess pass).  (Yes, some braindead domains routinely send out mail with
SPF fail, and need a special policy to accept their braindamage - but the
reputation system will still block them if enough spam comes in with their
domain.)  IP is used when SPF result is NONE.

Thus, the NEUTRAL reputation of a domain may be different from (and likely
much worse than) its SPF reputation.
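
To make the qualifier rule above concrete, here is a minimal Python
sketch.  It is not taken from the milter package: the function names,
the lower-case result strings, and the "domain:qual" key format are
assumptions made for illustration only.

```python
# Hypothetical helper, not the milter package's code: chooses the
# qualifier that is paired with the domain to form a reputation key.
SPF_QUALIFIERS = {
    "pass": "SPF",
    "softfail": "SOFTFAIL",
    "neutral": "NEUTRAL",
    "fail": "FAIL",
    "permerror": "PERMERROR",
    "guess": "GUESS",
}

def qualifier_for(spf_result, connect_ip):
    """Return the qualifier for a message given its SPF result."""
    result = spf_result.lower()
    if result == "none":
        # No SPF result at all: fall back to the connecting IP.
        return connect_ip
    # Results not listed above raise KeyError in this sketch.
    return SPF_QUALIFIERS[result]

def reputation_key(domain, spf_result, connect_ip):
    """Build an assumed 'domain:qual' key string."""
    return "%s:%s" % (domain.lower(), qualifier_for(spf_result, connect_ip))

print(reputation_key("example.com", "pass", "192.0.2.10"))
print(reputation_key("example.com", "neutral", "192.0.2.10"))
print(reputation_key("example.org", "none", "192.0.2.10"))
```

With keys built this way, example.com:SPF and example.com:NEUTRAL are
separate records, which is why the two reputations can diverge.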

-- 
	      Stuart D. Gathman <stuart at bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


From stuart at bmsi.com  Tue Feb 23 13:31:52 2010
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Tue, 23 Feb 2010 13:31:52 -0500 (EST)
Subject: [Pymilter] general gossip questions
In-Reply-To: <Pine.LNX.4.64.1002222340470.20465@bmsred.bmsi.com>
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
	<Pine.LNX.4.64.1002222340470.20465@bmsred.bmsi.com>
Message-ID: <Pine.LNX.4.64.1002231326420.4066@bmsred.bmsi.com>

On Tue, 23 Feb 2010, Stuart D. Gathman wrote:

> > 2. I don't know much about python, but Shelves just seems to be a
> > version of berkeleyDB.  Is this right?

More accurately, shelve is a wrapper around a bsd style database.  
The database just has to act like a map (lookup records given keys)
with variable size binary records.

Shelve uses strings for keys, and serializes a python object for 
the record.  By using a single BLOB for the record, it avoids the
complication of extracting fields.  For this type of application, where
all data in a record is needed every time, extracting fields would
be a waste of time.
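
A small sketch of that usage pattern, written for Python 3.  The
record layout and the key string are made up for illustration and are
not pygossip's actual record format.

```python
import os
import shelve
import tempfile

# Hypothetical record; pygossip's real record object is not shown here.
record = {"spam": 3, "ham": 40, "last_seen": 1266900000}

# Use a throwaway directory so the demo leaves nothing behind in cwd.
path = os.path.join(tempfile.mkdtemp(), "reputation-demo")

with shelve.open(path) as db:
    # String key; the whole object is pickled into a single value.
    db["example.com:SPF"] = record

with shelve.open(path) as db:
    # One lookup returns the complete object, no field extraction.
    loaded = db["example.com:SPF"]
    missing = db.get("unknown.example:SPF")

print(loaded)
print(missing)
```

Note that a shelf only writes a record back when it is assigned to a
key, so a changed record has to be stored again (db[key] = rec) unless
the shelf is opened with writeback=True.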

-- 
	      Stuart D. Gathman <stuart at bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


From stuart at bmsi.com  Wed Feb 24 18:28:36 2010
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Wed, 24 Feb 2010 18:28:36 -0500 (EST)
Subject: [Pymilter] Re: [Patch] PyDNS Documention
In-Reply-To: <87wryu3g07.fsf@SSpaeth.de>
References: <87pr4nwvob.fsf@SSpaeth.de>
	<Pine.LNX.4.64.1002021246080.4976@bmsred.bmsi.com>
	<87wryu3g07.fsf@SSpaeth.de>
Message-ID: <Pine.LNX.4.64.1002241824500.19696@bmsred.bmsi.com>

On Wed, 3 Feb 2010, Sebastian Spaeth wrote:

> On Tue, 2 Feb 2010 13:06:45 -0500 (EST), "Stuart D. Gathman" <stuart at bmsi.com> wrote:
> 
> > Thank you!  I ended up using doxygen for python docs - you can see 
> > the result for pymilter at http://spidey2.bmsi.com/pymilter/
> 
> Doxygen is quite nice too. But I wanted to learn sphinx, so this was a
> nice side project :-). I have updated the documentation since the last
> mail, please find the link to patches below:

I hope you don't mind, I added a link to your temporary documentation
to http://bmsi.com/python/milter.html

I actually can't do much with the pydns sourceforge project, being
a "developer" and not an admin.  You might want to mention the
http://bmsi.com/mailman/listinfo/pymilter mailing list.

I may end up translating your docs to Doxygen.  I really don't want to
learn another markup at the moment ...

-- 
	      Stuart D. Gathman <stuart at bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


From tlyons at ivenue.com  Thu Feb 25 18:16:35 2010
From: tlyons at ivenue.com (Todd Lyons)
Date: Thu, 25 Feb 2010 15:16:35 -0800
Subject: [Pymilter] general gossip questions
In-Reply-To: <Pine.LNX.4.64.1002222340470.20465@bmsred.bmsi.com>
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
	<Pine.LNX.4.64.1002222340470.20465@bmsred.bmsi.com>
Message-ID: <48b1344b1002251516t67d53f97p1199b7593d4480c6@mail.gmail.com>

On Mon, Feb 22, 2010 at 9:00 PM, Stuart D. Gathman <stuart at bmsi.com> wrote:
>
>> don't have a need for, so now I'll see about getting pygossip up and
>> running on a test mail system.
> Beware the unsolved problem of REUSE_ADDR not working for TCPServer.

I googled to try to understand a little bit about this.  What prevents
you from just adding:
SocketServer.TCPServer.allow_reuse_address = True
right after the import SocketServer?  Or does that not fix the issue?
I was able to telnet multiple times to the daemon in my testing and
never had problems with concurrency, so I'm not sure that I even
comprehend what the REUSE_ADDR issue actually is.

> Correct.  Easy, and sufficient for my servers.  My server with
> heaviest usage (100000 msgs/day) uses 630M for the shelve database
> with no problem.  I think there is a way to plug in another database.

I plan to put pygossip on two servers, configure them as peers, and
set my TTL to 2.  Then I've got 8 servers that I will point at those
two instances.  I figured I'd have 4 point to one pygossip server and
the other 4 point to the other pygossip server.  Since the TTL is 2,
they should talk to each other and trade reputation info back and
forth.  This should also result in the data being evenly split between
the two machines.

Or am I misunderstanding how the peer system works, and is all data
stored on both nodes?  I apologize for the lack of understanding this
question reveals.

> The 'R' command resets the reputation for a domain:qual.  The tc.py
> script (inexplicably not installed by package - should at least go in
> /usr/lib/pymilter) provides a simple command line interface to query, feedback,
> and reset reputations.

That's awesome, in all my reading I had never seen the R command.  I
also added tc.py into the rpm I build.

Part of the reason I think I didn't know about the R command is that
the pygossip webpage download link goes directly to pygossip-0.3.  It
wasn't until after I sent my first email to the list that I realized
there was a pygossip-0.4.  Since I had been looking at gossip and
pygossip off and on during the previous week, you would think I would
have stumbled across that R command, but I never did (because I was
reading 0.3 release code).

> The UMIS identifies a message - as opposed to a domain qual.  The
> UMIS is stored in an email header, and can be used to update how that
> message "votes" for the reputation of the domain.  The UMIS is used
> so that each message gets at most one vote.  pygossip doesn't currently worry
> too much about losing votes.  :-)

That seems pretty clear, thanks.

> Note that IP has been generalized to qualifier.  In the milter package,
> domains that pass SPF have the SPF qualifier.  Softfail gets the
> SOFTFAIL qualifier, and so on for NEUTRAL, FAIL, PERMERROR, GUESS (gets
> a best guess pass).  (Yes, some braindead domains routinely send out mail with
> SPF fail, and need a special policy to accept their braindamage - but the
> reputation system will still block them if enough spam comes in with their
> domain.)  IP is used when SPF result is NONE.
>
> Thus, the NEUTRAL reputation of a domain may be different from (and likely
> much worse than) its SPF reputation.

Since I'm going to be implementing my own feedback mechanism in exim
and writing my own milter for sendmail, I have a blank slate WRT how I
want to treat all of these variables.  I will definitely be studying
your code and drawing from its concepts and design.  At this point the
variables I want to use for generating a feedback score are:
  spamassassin score as the base
  spf result
  dkim result
  helo name tests

There could easily be more, this list is off the top of my head.

Thanks for the feedback Stuart, much appreciated.

-- 
Regards...      Todd
I seek the truth...it is only persistence in self-delusion and
ignorance that does harm.  -- Marcus Aurelius


From stuart at bmsi.com  Fri Feb 26 15:21:47 2010
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Fri, 26 Feb 2010 15:21:47 -0500 (EST)
Subject: [Pymilter] general gossip questions
In-Reply-To: <48b1344b1002251516t67d53f97p1199b7593d4480c6@mail.gmail.com>
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
	<Pine.LNX.4.64.1002222340470.20465@bmsred.bmsi.com>
	<48b1344b1002251516t67d53f97p1199b7593d4480c6@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.1002261503020.1939@bmsred.bmsi.com>

On Thu, 25 Feb 2010, Todd Lyons wrote:

> On Mon, Feb 22, 2010 at 9:00 PM, Stuart D. Gathman <stuart at bmsi.com> wrote:
> >
> >> don't have a need for, so now I'll see about getting pygossip up and
> >> running on a test mail system.
> > Beware the unsolved problem of REUSE_ADDR not working for TCPServer.
> 
> I googled to try to understand a little bit about this.  What prevents
> you from just adding:
> SocketServer.TCPServer.allow_reuse_address = True
> right after the import SocketServer?  Or does that not fix the issue?

Doesn't fix the issue.

> I was able to telnet multiple times to the daemon in my testing and
> never had problems with concurrency, so I'm not sure that I even
> comprehend what the REUSE_ADDR issue actually is.

Telnet to daemon.  Restart daemon with session still active (something
that will almost always be the case in production).

Daemon will shut down for restart, but won't start again for 5 mins.
Error in log is "socket in use".  The SO_REUSEADDR socket option is
supposed to let you immediately reuse the socket without trying to
shut down active connections.

Somehow, the allow_reuse_address flag doesn't actually result in the
socket option getting set.  I have spent a little bit of time debugging the
TCPServer python code, and can't see where it goes wrong.
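
One way to narrow this down is to ask the listening socket directly
whether the option ended up set.  The sketch below is a diagnostic
only, written for Python 3 (where the module is named socketserver);
the class and function names are hypothetical, and it does not
reproduce or explain the failure described above.

```python
import socket
import socketserver  # named SocketServer in Python 2

class ReusableTCPServer(socketserver.TCPServer):
    # server_bind() reads this flag while the constructor runs, so it
    # has to be in place before the server object is created.
    allow_reuse_address = True

class NullHandler(socketserver.BaseRequestHandler):
    def handle(self):
        pass  # the demo never serves a request

def reuseaddr_is_set(server):
    """Ask the kernel whether SO_REUSEADDR is on for the listen socket."""
    value = server.socket.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
    return value != 0

# Port 0 lets the OS pick a free port for the demonstration.
server = ReusableTCPServer(("127.0.0.1", 0), NullHandler)
try:
    flag_set = reuseaddr_is_set(server)
    print("SO_REUSEADDR set:", flag_set)
finally:
    server.server_close()
```

Logging the result of such a check at daemon startup would show
whether the flag reaches the socket the daemon actually listens on.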

> I plan to put pygossip on two servers, configure them as peers, and
> set my TTL to 2.  Then I've got 8 servers that I will point at those
> two instances.  I figured I'd have 4 point to one pygossip server and
> the other 4 point to the other pygossip server.  Since the TTL is 2,
> they should talk to each other and trade reputation info back and
> forth.  This should also result in the data being evenly split between
> the two machines.
> 
> Or am I misunderstanding how the peer system works, and is all data
> stored on both nodes?  I apologize for the lack of understanding this
> question reveals.

I think you have a slight misunderstanding.  But your setup should be
reasonable.  For each query, the pygossip server looks up the reputation
in its own database, then queries peers for their "opinion".  The
peer opinions are weighted by how often the peer agrees with the
local server (one man's spam is another man's daily entertainment),
and combined with the local reputation for the final score.
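
As a rough illustration of that weighting idea: the exact formula is
not given in this thread, so the sketch below assumes a plain weighted
average.  The function name, the agreement values in [0, 1], and the
score scale are all hypothetical.

```python
def combine_opinions(local_score, peer_opinions, local_weight=1.0):
    """Weighted average of the local score and peer opinions.

    peer_opinions is a list of (score, agreement) pairs, where
    agreement in [0, 1] is an assumed measure of how often that peer
    has agreed with the local server.
    """
    total = local_score * local_weight
    weight_sum = local_weight
    for score, agreement in peer_opinions:
        # A peer that rarely agrees with us barely moves the result.
        total += score * agreement
        weight_sum += agreement
    return total / weight_sum

# Local score 80, one half-trusted peer saying -40, and one peer that
# never agrees with us (weight 0) saying 60.
combined = combine_opinions(80.0, [(-40.0, 0.5), (60.0, 0.0)])
print(combined)
```

With no peers configured the function simply returns the local score.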

So while the databases will be different, there will be a lot of overlap.

However, adding a "load sharing peer" variation should be pretty
straightforward to design.

Re your scoring plans.  Gossip tracks just a spam/notspam vote for each
UMIS.  (A bitmap tracks the last N UMISs - where N defaults to 1024.)
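
A minimal sketch of such a window, to show why one bit per message
gives each UMIS at most one vote.  This is not Gossip's actual data
structure; the class, its methods, and the per-slot UMIS list are
assumptions for illustration.

```python
class VoteWindow:
    """Spam/not-spam bit for each of the most recent n messages."""

    def __init__(self, n=1024):
        self.n = n
        self.bits = 0             # bit i set => slot i voted spam
        self.slots = [None] * n   # UMIS held in each slot (assumed)
        self.next = 0             # slot the next message will take

    def observe(self, umis):
        """Register a new message; it starts out as not-spam."""
        slot = self.next
        self.slots[slot] = umis
        self.bits &= ~(1 << slot)  # drop the evicted message's vote
        self.next = (slot + 1) % self.n

    def feedback(self, umis, is_spam):
        """Set the vote for umis; False if it has left the window."""
        try:
            slot = self.slots.index(umis)
        except ValueError:
            return False           # vote is lost in this sketch
        if is_spam:
            self.bits |= 1 << slot
        else:
            self.bits &= ~(1 << slot)
        return True

    def spam_count(self):
        return bin(self.bits).count("1")

w = VoteWindow(n=2)
w.observe("umis-1")
w.observe("umis-2")
w.feedback("umis-1", True)
w.feedback("umis-1", True)   # repeated feedback rewrites the same bit
print(w.spam_count())
```

Repeated feedback for one UMIS only rewrites its own bit, and feedback
for a UMIS that has aged out of the window is simply dropped.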

So your feedback to gossip has to be yea/nay.  You could combine
the gossip score with the other scores in your main filter.
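
For instance, a client-side filter could collapse its own signals into
a single boolean before sending feedback.  Everything in this sketch
is hypothetical: the function name, the penalty weights, and the 5.0
threshold are placeholders to be tuned, not values from Gossip or
pygossip.

```python
def is_spam_vote(sa_score, spf_result, dkim_result, helo_ok,
                 threshold=5.0):
    """Reduce several filter signals to one spam/not-spam vote."""
    score = sa_score               # SpamAssassin score as the base
    if spf_result == "fail":
        score += 2.0               # assumed penalty
    if dkim_result == "fail":
        score += 1.0               # assumed penalty
    if not helo_ok:
        score += 1.0               # assumed penalty
    return score >= threshold

print(is_spam_vote(6.2, "pass", "pass", True))
print(is_spam_vote(1.0, "pass", "pass", True))
print(is_spam_vote(3.5, "fail", "none", False))
```

The boolean is what gets fed back for the message's UMIS; the finer
grained score stays in the main filter.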

Note that Gossip is compatible with "AOL style" user feedback as well
(where user complaints about an email form a reputation that will eventually
get a sender kicked off AOL.  AOL tracks IPs rather than domains).

-- 
	      Stuart D. Gathman <stuart at bmsi.com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.