From tlyons at ivenue.com Mon Feb 22 14:53:42 2010
From: tlyons at ivenue.com (Todd Lyons)
Date: Mon, 22 Feb 2010 11:53:42 -0800
Subject: [Pymilter] general gossip questions
Message-ID: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>

Hi, just joined the list. I am going to implement some kind of
reputation system on our mail servers. I've studied and experimented
with both gossip and pygossip. pygossip was not too difficult to get
running on my laptop (Ubuntu, servers are CentOS 4.x and 5.x). gossip
compiled, but its implementation of SSL pulls in a few things that I
don't have a need for, so now I'll see about getting pygossip up and
running on a test mail system.

I need to implement the client on two different types of mail server,
sendmail and exim. The off-the-cuff plan for sendmail is to write a
quick milter in perl, and for exim I'll embed the perl into exim and
write a few macros.

I would like to use pygossip's server capabilities, however I have some
questions:

1. Is there any more recent version/code in git or any other VCS? Bug
fixes? Feature additions?
2. I don't know much about python, but Shelves just seems to be a
version of berkeleyDB. Is this right?
3. What about manual adjustments to the data? For example, completely
purge one sender's reputation?
4. Any kind of summary or data mining scripts? Bosses like pretty graphs.

I don't quite have a grasp of the UMIS yet, but I'm studying to see
how it differs in function from the unique identifier that is the
IP:domain tuple.

So far, I find the documentation within pygossip scripts to be quite
readable and illuminating. You've written a fantastic project, I just
need my understanding to catch up with it.

--
Regards... Todd
I seek the truth...it is only persistence in self-delusion and
ignorance that does harm. -- Marcus Aurelius

From stuart at bmsi.com Tue Feb 23 00:00:00 2010
From: stuart at bmsi.com (Stuart D.
Gathman)
Date: Tue, 23 Feb 2010 00:00:00 -0500 (EST)
Subject: [Pymilter] general gossip questions
In-Reply-To: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
Message-ID:

On Mon, 22 Feb 2010, Todd Lyons wrote:

> don't have a need for, so now I'll see about getting pygossip up and
> running on a test mail system.

Beware the unsolved problem of REUSE_ADDR not working for TCPServer.

> 1. Is there any more recent version/code in git or any other VCS? Bug
> fixes? Feature additions?

I check in work to CVS on sourceforge.

> 2. I don't know much about python, but Shelves just seems to be a
> version of berkeleyDB. Is this right?

Correct. Easy, and sufficient for my servers. My server with
heaviest usage (100000 msgs/day) uses 630M for the shelve database
with no problem. I think there is a way to plug in another database.

> 3. What about manual adjustments to the data? For example, completely
> purge one sender's reputation?

The 'R' command resets the reputation for a domain:qual. The tc.py
script (inexplicably not installed by package - should at least go in
/usr/lib/pymilter) provides a simple command line interface to query,
feedback, and reset reputations.

> 4. Any kind of summary or data mining scripts? Bosses like pretty graphs.

No. pygossip_purge.py deletes old unused records. It could collect
stats while doing its thing. (It could run daily automatically when the
REUSE_ADDR bug is fixed.)

> I don't quite have a grasp of the UMIS yet, but I'm studying to see
> how it differs in function from the unique identifier that is the
> IP:domain tuple.

The UMIS identifies a message - as opposed to a domain qual. The
UMIS is stored in an email header, and can be used to update how that
message "votes" for the reputation of the domain. The UMIS is used
so that each message gets at most one vote. pygossip doesn't currently
worry too much about losing votes.
:-)

Note that IP has been generalized to qualifier. In the milter package,
domains that pass SPF have the SPF qualifier. Softfail gets the
SOFTFAIL qualifier, and so on for NEUTRAL, FAIL, PERMERROR, GUESS (gets
a best guess pass). (Yes, some braindead domains routinely send out
mail with SPF fail, and need a special policy to accept their
braindamage - but the reputation system will still block them if enough
spam comes in with their domain.) IP is used when SPF result is NONE.

Thus, the NEUTRAL reputation of a domain may have a different (and
likely much worse) reputation than the SPF reputation.

--
Stuart D. Gathman
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

From stuart at bmsi.com Tue Feb 23 13:31:52 2010
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Tue, 23 Feb 2010 13:31:52 -0500 (EST)
Subject: [Pymilter] general gossip questions
In-Reply-To:
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
Message-ID:

On Tue, 23 Feb 2010, Stuart D. Gathman wrote:

> > 2. I don't know much about python, but Shelves just seems to be a
> > version of berkeleyDB. Is this right?

More accurately, shelve is a wrapper around a bsd style database. The
database just has to act like a map (lookup records given keys) with
variable size binary records. Shelve uses strings for keys, and
serializes a python object for the record. By using a single BLOB for
the record, it avoids the complication of extracting fields. For this
type of application, where all data in a record is needed every time,
extracting fields would be a waste of time.

--
Stuart D. Gathman
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
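As an illustration of the shelve description above, here is a minimal sketch of storing one serialized object per string key. The record layout (a dict with "spam" and "ham" counters), the helper name record_vote, and the "domain:qualifier" key strings are assumptions made for this example; they are not taken from the pygossip source. It is written for Python 3.

```python
import os
import shelve
import tempfile


def record_vote(db, key, is_spam):
    """Load the whole record, update it, and store the whole record back.

    The record is a plain dict here (an assumed layout for illustration).
    shelve pickles it into a single blob, so there are no fields to
    extract on the database side.
    """
    rec = db.get(key, {"spam": 0, "ham": 0})
    rec["spam" if is_spam else "ham"] += 1
    db[key] = rec  # reassign so the shelf re-serializes the updated object
    return rec


# Use a throwaway directory so the example leaves no files behind elsewhere.
path = os.path.join(tempfile.mkdtemp(), "reputation")

with shelve.open(path) as db:
    # Keys are strings; the "domain:qualifier" form is an assumption.
    record_vote(db, "example.com:SPF", is_spam=False)
    record_vote(db, "example.com:SPF", is_spam=True)
    record_vote(db, "example.com:NEUTRAL", is_spam=True)

# Reopen the shelf to show that the records persisted on disk.
with shelve.open(path) as db:
    rec = db["example.com:SPF"]
    counts = (rec["spam"], rec["ham"])
    keys = sorted(db.keys())

print(counts, keys)
```

Because every lookup returns the complete record, this layout suits the access pattern described in the message, where all data in a record is needed each time.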
From stuart at bmsi.com Wed Feb 24 18:28:36 2010
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Wed, 24 Feb 2010 18:28:36 -0500 (EST)
Subject: [Pymilter] Re: [Patch] PyDNS Documention
In-Reply-To: <87wryu3g07.fsf@SSpaeth.de>
References: <87pr4nwvob.fsf@SSpaeth.de> <87wryu3g07.fsf@SSpaeth.de>
Message-ID:

On Wed, 3 Feb 2010, Sebastian Spaeth wrote:

> On Tue, 2 Feb 2010 13:06:45 -0500 (EST), "Stuart D. Gathman" wrote:
>
> > Thank you! I ended up using doxygen for python docs - you can see
> > the result for pymilter at http://spidey2.bmsi.com/pymilter/
>
> Doxygen is quite nice too. But I wanted to learn sphinx, so this was a
> nice side project :-). I have updated the documentation since the last
> mail, please find the link to patches below:

I hope you don't mind, I added a link to your temporary documentation to
http://bmsi.com/python/milter.html

I actually can't do much with the pydns sourceforge project, being a
"developer" and not an admin. You might want to mention the
http://bmsi.com/mailman/listinfo/pymilter mailing list.

I may end up translating your docs to Doxygen. I really don't want
to learn another markup at the moment ...

--
Stuart D. Gathman
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

From tlyons at ivenue.com Thu Feb 25 18:16:35 2010
From: tlyons at ivenue.com (Todd Lyons)
Date: Thu, 25 Feb 2010 15:16:35 -0800
Subject: [Pymilter] general gossip questions
In-Reply-To:
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
Message-ID: <48b1344b1002251516t67d53f97p1199b7593d4480c6@mail.gmail.com>

On Mon, Feb 22, 2010 at 9:00 PM, Stuart D. Gathman wrote:
>
>> don't have a need for, so now I'll see about getting pygossip up and
>> running on a test mail system.
> Beware the unsolved problem of REUSE_ADDR not working for TCPServer.
I googled to try to understand a little bit about this. What prevents
you from just adding:
    SocketServer.TCPServer.allow_reuse_address = True
right after the import SocketServer? Or does that not fix the issue?

I was able to telnet multiple times to the daemon in my testing and
never had problems with concurrency, so I'm not sure that I even
comprehend what the REUSE_ADDR issue actually is.

> Correct. Easy, and sufficient for my servers. My server with
> heaviest usage (100000 msgs/day) uses 630M for the shelve database
> with no problem. I think there is a way to plug in another database.

I plan to put pygossip on two servers, configure them as peers, and
set my TTL to 2. Then I've got 8 servers that I will point at those
two instances. I figured I'd have 4 point to one pygossip server and
the other 4 point to the other pygossip server. Since the TTL is 2,
they should talk to each other and trade reputation info back and
forth. This should also result in the data being evenly split between
the two machines.

Or am I misunderstanding how the peer system works, and is all data
stored on both nodes? I apologize for the lack of understanding this
question reveals.

> The 'R' command resets the reputation for a domain:qual. The tc.py
> script (inexplicably not installed by package - should at least go in
> /usr/lib/pymilter) provides a simple command line interface to query, feedback,
> and reset reputations.

That's awesome, in all my reading I had never seen the R command. I
also added tc.py into the rpm I build.

Part of the reason I think I didn't know about the R command is that
the pygossip webpage download link goes directly to pygossip-0.3. It
wasn't until after I sent my first email to the list that I realized
there was a pygossip-0.4. Since I had been looking at gossip and
pygossip off and on during the previous week, you would think I would
have stumbled across that R command, but I never did (because I was
reading 0.3 release code).
> The UMIS identifies a message - as opposed to a domain qual. The
> UMIS is stored in an email header, and can be used to update how that
> message "votes" for the reputation of the domain. The UMIS is used
> so that each message gets at most one vote. pygossip doesn't currently worry
> too much about losing votes. :-)

That seems pretty clear, thanks.

> Note that IP has been generalized to qualifier. In the milter package,
> domains that pass SPF have the SPF qualifier. Softfail gets the
> SOFTFAIL qualifier, and so on for NEUTRAL, FAIL, PERMERROR, GUESS (gets
> a best guess pass). (Yes, some braindead domains routinely send out mail with
> SPF fail, and need a special policy to accept their braindamage - but the
> reputation system will still block them if enough spam comes in with their
> domain.) IP is used when SPF result is NONE.
>
> Thus, the NEUTRAL reputation of a domain may have a different (and likely
> much worse) reputation than the SPF reputation.

Since I'm going to be implementing my own feedback mechanism in exim
and writing my own milter for sendmail, I have a blank slate WRT how I
want to treat all of these variables. I will definitely be studying
your code and drawing from its concepts and design. At this point the
variables I want to use for generating a feedback score are:
  spamassassin score as the base
  spf result
  dkim result
  helo name tests
There could easily be more, this list is off the top of my head.

Thanks for the feedback Stuart, much appreciated.

--
Regards... Todd
I seek the truth...it is only persistence in self-delusion and
ignorance that does harm. -- Marcus Aurelius

From stuart at bmsi.com Fri Feb 26 15:21:47 2010
From: stuart at bmsi.com (Stuart D.
Gathman)
Date: Fri, 26 Feb 2010 15:21:47 -0500 (EST)
Subject: [Pymilter] general gossip questions
In-Reply-To: <48b1344b1002251516t67d53f97p1199b7593d4480c6@mail.gmail.com>
References: <48b1344b1002221153ub89ad40jdb6ee34b973b00ba@mail.gmail.com>
 <48b1344b1002251516t67d53f97p1199b7593d4480c6@mail.gmail.com>
Message-ID:

On Thu, 25 Feb 2010, Todd Lyons wrote:

> On Mon, Feb 22, 2010 at 9:00 PM, Stuart D. Gathman wrote:
> >
> >> don't have a need for, so now I'll see about getting pygossip up and
> >> running on a test mail system.
> > Beware the unsolved problem of REUSE_ADDR not working for TCPServer.
>
> I googled to try to understand a little bit about this. What prevents
> you from just adding:
> SocketServer.TCPServer.allow_reuse_address = True
> right after the import SocketServer? Or does that not fix the issue?

Doesn't fix the issue.

> I was able to telnet multiple times to the daemon in my testing and
> never had problems with concurrency, so I'm not sure that I even
> comprehend what the REUSE_ADDR issue actually is.

Telnet to daemon. Restart daemon with session still active (something
that will almost always be the case in production). Daemon will shut
down for restart, but won't start again for 5 mins. Error in log is
"socket in use". The SO_REUSEADDR socket option is supposed to let you
immediately reuse the socket without trying to shut down active
connections. Somehow, the allow_reuse_address flag doesn't actually
result in the socket option getting set. I have spent a little bit of
time debugging the TCPServer python code, and can't see where it goes
wrong.

> I plan to put pygossip on two servers, configure them as peers, and
> set my TTL to 2. Then I've got 8 servers that I will point at those
> two instances. I figured I'd have 4 point to one pygossip server and
> the other 4 point to the other pygossip server. Since the TTL is 2,
> they should talk to each other and trade reputation info back and
> forth.
> This should also result in the data being evenly split between
> the two machines.
>
> Or am I misunderstanding how the peer system works, and is all data
> stored on both nodes? I apologize for the lack of understanding this
> question reveals.

I think you have a slight misunderstanding. But your setup should be
reasonable.

For each query, the pygossip server looks up the reputation in its own
database, then queries peers for their "opinion". The peer opinions
are weighted by how often the peer agrees with the local server (one
man's spam is another man's daily entertainment), and combined with the
local reputation for the final score. So while the databases will be
different, there will be a lot of overlap. However, adding a "load
sharing peer" variation should be pretty straightforward to design.

Re your scoring plans. Gossip tracks just a spam/notspam vote for each
UMIS. (A bitmap tracks the last N UMISs - where N defaults to 1024.)
So your feedback to gossip has to be yea/nay. You could combine the
gossip score with the other scores in your main filter.

Note that Gossip is compatible with "AOL style" user feedback as well
(where user complaints about an email form a reputation that will
eventually get a sender kicked off AOL. AOL tracks IPs rather than
domains).

--
Stuart D. Gathman
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
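For the restart problem discussed in this thread, one way to narrow things down is to ask the listening socket directly whether the reuse option was applied. The following is a minimal diagnostic sketch, not a fix and not pygossip code: the names ReusableTCPServer, NullHandler, and reuse_addr_is_set are made up for the example, and it uses the Python 3 module name socketserver (the thread refers to the Python 2 module SocketServer).

```python
import socket
import socketserver


class ReusableTCPServer(socketserver.TCPServer):
    # Set on the class so it is already in effect when the constructor
    # binds the socket; the option is applied at bind time.
    allow_reuse_address = True


class NullHandler(socketserver.BaseRequestHandler):
    """Placeholder handler; this sketch never serves a request."""

    def handle(self):
        pass


def reuse_addr_is_set(server):
    """Return True if SO_REUSEADDR is enabled on the listening socket."""
    value = server.socket.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
    return value != 0


# Port 0 asks the OS for any free port, so the check runs anywhere.
server = ReusableTCPServer(("127.0.0.1", 0), NullHandler)
try:
    reuse_enabled = reuse_addr_is_set(server)
    print("SO_REUSEADDR set on listening socket:", reuse_enabled)
finally:
    server.server_close()
```

If a check like this reports the option as set inside the real daemon while the restart still fails, the cause probably lies elsewhere than in the flag itself; the sketch only separates those two cases.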