From esj at harvee.org Tue May 11 09:10:50 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Tue, 11 May 2004 09:10:50 -0400
Subject: [Pymilter] really long processing
In-Reply-To: <408E96AF.2050704@harvee.org>
References: <408E96AF.2050704@harvee.org>
Message-ID: <40A0D0DA.6080101@harvee.org>

FYI, I have the receiver side milter working reasonably well. And I
will be sending you the code as soon as I'm convinced it is not
embarrassingly ugly.

I am moving to the stamp generation side and I have two options as far
as I can tell. The first is to replicate the milter in some way, shape,
or form for the stamping filter. I'm coming to the opinion that it may
be the preferable form. The alternative is to switch to the filter or
stamper inside of the milter.

One of the main challenges is queuing. When I am generating a stamp, it
takes approximately 15 to 40 seconds depending on CPU load and stamp
size. While there is no problem with generating stamps in parallel,
there is no advantage and in fact there are some significant
disadvantages. Specifically, that will delay the delivery of all mail
significantly instead of delivering messages one at a time.
Additionally, it's effectively using the process table as a queuing
structure, which is never a good thing.

The filter structure is:

  If the other party is known, send the message through
  if the other party is unknown, generate a stamp

real complicated. stamp generation is handled by calling a command.
This allows me to easily remain compatible with the hashcash project
without the overhead of calling a library that doesn't exist...

How will the milter behave if there are multiple requests for its
services at the same time as it is waiting for a sub process to finish?

that also raises the question.
Is pymilter single threaded? It shouldn't be a problem because I do use
file locking to control access to various important files but I'm
wondering if that is sufficient in the milter environment

---eric

From stuart at bmsi.com Tue May 11 14:20:07 2004
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Tue, 11 May 2004 14:20:07 -0400 (EDT)
Subject: [Pymilter] really long processing
In-Reply-To: <40A0D0DA.6080101@harvee.org>
Message-ID:

On Tue, 11 May 2004, Eric S. Johansson wrote:

> that also raises the question. Is pymilter single threaded? It
> shouldn't be a problem because I do use file locking to control access
> to various important files but I'm wondering if that is sufficient in
> the milter environment

pymilter is multi-threaded. There can be thousands of mail connections
in progress - depending on memory and sendmail configuration. I
routinely see hundreds. By using the "stackless" python VM (which
uses linked lists for stack frames instead of allocating a fixed stack
area per thread), you can support hundreds of thousands of simultaneous
connections.

If all access to a resource is from threads within the Python VM, then
it is much more efficient to use a semaphore:

    import thread
    _lock = thread.allocate_lock()  # _lock in some global dictionary (e.g. module)
    ...
    _lock.acquire()  # acquire outside the try, so release() only runs if we hold the lock
    try:
      ...
    finally:
      _lock.release()

Even when using a file lock for external processes, it is much more
efficient to first acquire a semaphore for internal threads before
acquiring the external file lock.

--
Stuart D. Gathman
Business Management Systems Inc.
Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

From esj at harvee.org Tue May 11 14:51:03 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Tue, 11 May 2004 14:51:03 -0400
Subject: [Pymilter] really long processing
In-Reply-To:
References:
Message-ID: <40A12097.9080608@harvee.org>

Stuart D. Gathman wrote:

> pymilter is multi-threaded. There can be thousands of mail connections
> in progress - depending on memory and sendmail configuration. I
> routinely see hundreds. By using the "stackless" python VM (which
> uses linked lists for stack frames instead of allocating a fixed stack
> area per thread), you can support hundreds of thousands of simultaneous
> connections.

interesting. Which version of Python is stackless? I'm using 2.3 now
on one machine and 2.2.1 on another. I've been heavily encouraged by
some pythonistas to just stick with 2.3.

> If all access to a resource is from threads within the Python VM, then
> it is much more efficient to use a semaphore:

if it is multithreaded then yes, that would make more sense. Which says
I should probably make my locking code pay attention to whether it is
multithreaded or not and do the appropriate type of locking,

which raises the question of how do you detect if multithreading is active?

also, if I am calling an external program, which I do in both filter and
Stamper, will that kill threading (using popen and family)?

another question is if I am busy processing something externally, can
milter throw up a "stop" flag and ask the mail server to queue the
message and try again later (i.e. in 60 seconds) rather than doing the
queuing in threads?

---eric

From stuart at bmsi.com Tue May 11 15:23:58 2004
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Tue, 11 May 2004 15:23:58 -0400 (EDT)
Subject: [Pymilter] really long processing
In-Reply-To: <40A12097.9080608@harvee.org>
Message-ID:

On Tue, 11 May 2004, Eric S. Johansson wrote:

> Stuart D. Gathman wrote:
>
> > pymilter is multi-threaded. There can be thousands of mail connections
> > in progress - depending on memory and sendmail configuration. I
> > routinely see hundreds.
> > By using the "stackless" python VM (which
> > uses linked lists for stack frames instead of allocating a fixed stack
> > area per thread), you can support hundreds of thousands of simultaneous
> > connections.
>
> interesting. Which version of Python is stackless? I'm using 2.3 now
> on one machine and 2.2.1 on another. I've been heavily encouraged by
> some pythonistas to just stick with 2.3.

Stackless is another implementation derived from the CPython reference
implementation: http://www.stackless.com/

> which raises the question of how do you detect if multithreading is active?

libmilter requires threading. If you are coding a milter, then
threading is active.

> also, if I am calling an external program, which I do in both filter and
> Stamper, will that kill threading (using popen and family)?

No. Unless the standard library is broken - which it wasn't the last
I checked.

> another question is if I am busy processing something externally, can
> milter throw up a "stop" flag and ask the mail server to queue the
> message and try again later (i.e. in 60 seconds) rather than doing the
> queuing in threads?

If you compile sendmail and pymilter with _FFR_SMFI_PROGRESS, then
you can call the progress() method to tell sendmail to reset its
timeout and keep waiting. This still ties up a thread (not a
problem with stackless - and usually not a problem in general since
sendmail currently runs a separate sendmail process per milter thread).

Without SMFI_PROGRESS, you must configure the milter timeouts for the
worst case.

If you return TEMPFAIL, then the sender will retry later.

If you compile sendmail and pymilter with _FFR_QUARANTINE, then
you can use the quarantine(reason) method to tell sendmail to put a
message on hold. I have not played with QUARANTINE, and do not know how
such a message gets put back into play.

--
Stuart D. Gathman
Business Management Systems Inc.
Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

From esj at harvee.org Tue May 11 16:42:44 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Tue, 11 May 2004 16:42:44 -0400
Subject: [Pymilter] really long processing
In-Reply-To:
References:
Message-ID: <40A13AC4.3080400@harvee.org>

Stuart D. Gathman wrote:

>>which raises the question of how do you detect if multithreading is active?
>
>
> libmilter requires threading. If you are coding a milter, then
> threading is active.

I apologize for being totally blond here but I do need to understand.
Then what I take from the past few messages is that if I'm writing a
pymilter, it will use threads automatically. I don't need to do
anything. It's also implied that if I want to control them, I should
use the python thread module. There are certain advantages to using a
stackless environment with regards to threads. Unfortunately, I am
pretty much constrained to counting on ordinary cpython.

the reason I was asking about testing for threading was so that I would
know whether or not to use process resource locks within my code in
conjunction with the file based locks. It's probably harmless if I just
add the resource locks.

>>also, if I am calling an external program, which I do in both filter and
>>Stamper, will that kill threading (using popen and family)?
>
>
> No. Unless the standard library is broken - which it wasn't the last
> I checked.

fair enough. I will need to verify this however. I've been burned by
too many assumptions of my own and others.

>>another question is if I am busy processing something externally, can
>>milter throw up a "stop" flag and ask the mail server to queue the
>>message and try again later (i.e. in 60 seconds) rather than doing the
>>queuing in threads?
>
>
> If you compile sendmail and pymilter with _FFR_SMFI_PROGRESS, then
> you can call the progress() method to tell sendmail to reset its
> timeout and keep waiting. This still ties up a thread (not a
> problem with stackless - and usually not a problem in general since
> sendmail currently runs a separate sendmail process per milter thread).
>
> Without SMFI_PROGRESS, you must configure the milter timeouts for the
> worst case.

since I can't control what users have for sendmail compiles (especially
with emerges and rpms), I will need to just use the worst-case time
figure which is Infinity. I assume there is some way to not time out.

> If you return TEMPFAIL, then the sender will retry later.

that's possibly another option. If I'm busy generating stamps and I
need to generate another, just force a retry. The retry would happen in
conjunction with an ordinary queue scan? is there a way to trigger a
retry when done? Is it possible to put messages on a separate queue for
the milter?

obviously, I need to do more reading. I wish they would come out with a
new bat book documenting the stuff.

> If you compile sendmail and pymilter with _FFR_QUARANTINE, then
> you can use the quarantine(reason) method to tell sendmail to put a message on
> hold. I have not played with QUARANTINE, and do not know how such
> a message gets put back into play.

that's probably tied in with the (relatively) new multiple queue
structures. if I figure out anything, I will let you know.

thanks again for answering all these questions

---eric

From stuart at bmsi.com Tue May 11 17:09:01 2004
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Tue, 11 May 2004 17:09:01 -0400 (EDT)
Subject: [Pymilter] really long processing
In-Reply-To: <40A13AC4.3080400@harvee.org>
Message-ID:

On Tue, 11 May 2004, Eric S. Johansson wrote:

> the reason I was asking about testing for threading was so that I would
> know whether or not to use process resource locks within my code in
> conjunction with the file based locks.
> It's probably harmless if I just
> add the resource locks.

If multiple threads are contending for a file lock, you definitely need
the resource lock. It will improve performance, but more importantly,
most file locking schemes use the process id - which will be the same
for all the threads, causing surprising and unpleasant results.

> > If you return TEMPFAIL, then the sender will retry later.
>
> that's possibly another option. If I'm busy generating stamps and I
> need to generate another, just force a retry. The retry would happen in
> conjunction with an ordinary queue scan? is there a way to trigger a
> retry when done? Is it possible to put messages on a separate queue for
> the milter?

The retry is up to the sending MTA. If that MTA is sendmail, it happens
on the next queue scan. Using TEMPFAIL for this purpose has the additional
benefit that most spammers give up after a TEMPFAIL, since they aren't
sending from a real message queue, but generating the message on the fly.
(A simple field proven anti-spam strategy is to simply generate TEMPFAIL
the first time you get a message that is not otherwise authenticated,
save the Message-ID in a database, and accept it the second time.)

A risk to using TEMPFAIL is that after a certain amount of time (typically
4 hours), the MTA will send a warning bounce back to the user saying
"unable to deliver mail after 4 hours, will keep trying for 5 days" or
something to that effect. End users are always confused by warning bounces,
and never read them, but just assume that the message failed. They
then send another message - which typically compounds the problem or delay.

--
Stuart D. Gathman
Business Management Systems Inc.
Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

From esj at harvee.org Tue May 11 17:29:03 2004
From: esj at harvee.org (Eric S. Johansson)
Date: Tue, 11 May 2004 17:29:03 -0400
Subject: [Pymilter] really long processing
In-Reply-To:
References:
Message-ID: <40A1459F.8080305@harvee.org>

Stuart D. Gathman wrote:

> The retry is up to the sending MTA. If that MTA is sendmail, it happens
> on the next queue scan. Using TEMPFAIL for this purpose has the additional
> benefit that most spammers give up after a TEMPFAIL, since they aren't
> sending from a real message queue, but generating the message on the fly.
> (A simple field proven anti-spam strategy is to simply generate TEMPFAIL
> the first time you get a message that is not otherwise authenticated,
> save the Message-ID in a database, and accept it the second time.)
>
> A risk to using TEMPFAIL is that after a certain amount of time (typically
> 4 hours), the MTA will send a warning bounce back to the user saying
> "unable to deliver mail after 4 hours, will keep trying for 5 days" or
> something to that effect. End users are always confused by warning bounces,
> and never read them, but just assume that the message failed. They
> then send another message - which typically compounds the problem or delay.

interesting. Again, faulty assumption on my part. I thought the
temporary failure was to tell the local sendmail (i.e. the one calling
milter) that it was to queue up the traffic on behalf of the milter. it
is starting to sound more and more like I need to build a system that
takes the output from milter and stuffs it in a queue before processing
and after processing, reinject into the e-mail stream.

it's beginning to look like I'm going to be far better off (and more MTA
independent) using three "big" MTAs. Two of them will accept e-mail from
different directions and different conditions, the third will deliver.
In the middle, I have two simple little proxies that just do all the
filtering/stamping as appropriate. yes, performance can be a problem
but it's solvable.

damn that's painful to discover at this point.
I appreciate all the help you've given me and I will certainly give back
the code that I have generated if you find it useful. Unfortunately, I
can see how the milter interface really isn't the right tool for
techniques like hybrid sender pays.

---eric

From stuart at bmsi.com Tue May 11 18:02:10 2004
From: stuart at bmsi.com (Stuart D. Gathman)
Date: Tue, 11 May 2004 18:02:10 -0400 (EDT)
Subject: [Pymilter] really long processing
In-Reply-To: <40A1459F.8080305@harvee.org>
Message-ID:

On Tue, 11 May 2004, Eric S. Johansson wrote:

> is starting to sound more and more like I need to build a system that
> takes the output from milter and stuffs it in a queue before processing
> and after processing, reinject into the e-mail stream.

If you can't rely on having the new queue facilities of sendmail
available (I'm in the same boat), then roll your own queue in Python.
After saving the headers and body and metadata in your queue file(s),
return DISCARD to sendmail, and sendmail will forget about the
message. When you are ready to reinject it, run /usr/lib/sendmail
(to avoid using SMTP and being reprocessed through milter).

As long as writing the message to your queue is complete and your
files are fsync()ed and closed before returning DISCARD, no mail will
get lost. (I doubt that sendmail goes as far as fsync).

--
Stuart D. Gathman
Business Management Systems Inc.
Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
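
The "roll your own queue" advice in the last message could be sketched
roughly as below. This is only an illustration under stated assumptions,
not code from the thread: the name spool_message, the "tmp."/"msg." file
naming and the on-disk layout are all hypothetical, and it is written for
current Python 3 rather than the 2.2/2.3 interpreters discussed above.
It does not import pymilter; the idea is that a milter callback would
call something like this and return DISCARD only after it has returned.

```python
import os
import uuid


def spool_message(spool_dir, headers, body):
    """Write one message durably into spool_dir and return its final path.

    headers is a list of (name, value) string pairs, body is bytes.
    Hypothetical helper: the naming scheme is an assumption for this sketch.
    """
    name = uuid.uuid4().hex  # unique per message, so threads never share a file
    tmp_path = os.path.join(spool_dir, "tmp." + name)
    final_path = os.path.join(spool_dir, "msg." + name)

    # "xb" fails if the file already exists instead of overwriting it.
    with open(tmp_path, "xb") as f:
        for key, value in headers:
            f.write(("%s: %s\n" % (key, value)).encode("utf-8"))
        f.write(b"\n")
        f.write(body)
        f.flush()
        os.fsync(f.fileno())  # force the data to disk before we claim success

    # The file is closed here. Renaming makes it visible to the queue
    # runner only once it is complete.
    os.rename(tmp_path, final_path)
    return final_path
```

Writing under a temporary name and renaming afterwards means a separate
queue runner scanning for "msg.*" files should never pick up a
half-written message. Each message gets its own file, so this sketch
needs no lock between milter threads; a shared file would need the
thread lock plus file lock combination described earlier in the thread.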