[Dspam-user] large email structure

joao
Hi,

I'm using DSPAM in a 100,000-user email infrastructure. I run 6 mail
servers with dspam using the hash driver. The database is kept on an NFS
share and it seems to work fine.

I'm using TOE training mode, since I have amavisd-new in this setup
doing black/whitelisting and common blocks. My users can train dspam on
ham and spam messages automatically.

My questions:

Is TOE the training mode that uses the least disk space?
What hash driver configuration should I use? My database is over 100GB
right now and growing fast.
What is the best practice for database maintenance?

These are my settings:

HashRecMax              98317
HashAutoExtend          on
HashMaxExtents          0
HashExtentSize          49157
HashPctIncrease         10
HashMaxSeek             10
HashConnectionCache     10
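
For a rough sense of how these hash settings translate into disk usage, here is a minimal sizing sketch. It assumes one hash file per user and about 16 bytes per record (the figure suggested by the sample dspam.conf comment that 100,000 records yield about 1.6MB). Both are assumptions to verify against your own build, and the sketch ignores file headers and HashPctIncrease.

```python
# Values taken from the settings above.
HASH_REC_MAX = 98317        # records in the initial segment
HASH_EXTENT_SIZE = 49157    # base number of records per extent
USERS = 100_000

# ASSUMPTION: roughly 16 bytes per record; verify against your DSPAM build.
BYTES_PER_RECORD = 16


def hash_file_bytes(extents=0):
    """Approximate size of one user's hash file after `extents` auto-extends.

    Ignores file headers and HashPctIncrease, so real files will differ.
    """
    records = HASH_REC_MAX + extents * HASH_EXTENT_SIZE
    return records * BYTES_PER_RECORD


def fleet_gb(extents=0, users=USERS):
    """Approximate total size in decimal GB, assuming one file per user."""
    return hash_file_bytes(extents) * users / 1e9


if __name__ == "__main__":
    for n in range(3):
        print(f"{n} extents: {hash_file_bytes(n) / 1e6:.2f} MB/user, "
              f"{fleet_gb(n):.1f} GB total")
```

Under these assumptions the initial segment alone is already on the order of 1.5MB per user, before any extent is added.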


PurgeSignatures 14          # Stale signatures
PurgeNeutral    90          # Tokens with neutralish probabilities
PurgeUnused     90          # Unused tokens
PurgeHapaxes    30          # Tokens with less than 5 hits (hapaxes)
PurgeHits1S     15          # Tokens with only 1 spam hit
PurgeHits1I     15          # Tokens with only 1 innocent hit
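
As a rough illustration of how these age thresholds could be read, here is a simplified sketch that follows only the comments next to each setting. The `Token` fields, the 0.35-0.65 "neutral" band, the rule ordering, and the function name are assumptions made for illustration; this is not dspam_clean's actual logic.

```python
from dataclasses import dataclass

# Purge ages in days, copied from the settings above.
PURGE_NEUTRAL = 90
PURGE_UNUSED = 90
PURGE_HAPAXES = 30
PURGE_HITS_1S = 15
PURGE_HITS_1I = 15


@dataclass
class Token:
    """Hypothetical token record; field names are illustrative only."""
    spam_hits: int
    innocent_hits: int
    days_since_last_hit: int


def purge_reason(tok, neutral_band=(0.35, 0.65)):
    """Return the name of the first rule that would purge `tok`, or None.

    Rules are checked from most specific to least specific. The neutral
    band is an assumed range, not a value taken from the settings above.
    """
    total = tok.spam_hits + tok.innocent_hits
    age = tok.days_since_last_hit

    if tok.spam_hits == 1 and tok.innocent_hits == 0 and age > PURGE_HITS_1S:
        return "PurgeHits1S"
    if tok.innocent_hits == 1 and tok.spam_hits == 0 and age > PURGE_HITS_1I:
        return "PurgeHits1I"
    if total < 5 and age > PURGE_HAPAXES:
        return "PurgeHapaxes"
    if total > 0 and age > PURGE_NEUTRAL:
        p_spam = tok.spam_hits / total
        if neutral_band[0] <= p_spam <= neutral_band[1]:
            return "PurgeNeutral"
    if age > PURGE_UNUSED:
        return "PurgeUnused"
    return None
```

For example, `purge_reason(Token(1, 0, 20))` selects the PurgeHits1S rule, while a heavily used token seen 5 days ago is kept.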


I disabled dspam_clean and dspam_logrotate on the dspam servers,
and run them directly on the file server.

I tried the PostgreSQL driver, but it used a lot of resources.

Can you guys give me some suggestions? The database is getting bigger
and I don't know if I'm doing the best maintenance routine.

Thanks!



------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Dspam-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/dspam-user

Re: [Dspam-user] large email structure

ktm@rice.edu
On Tue, Mar 24, 2015 at 03:57:01PM -0300, [hidden email] wrote:

> Hi,
>
> I'm using DSPAM in a 100,000-user email infrastructure. I run 6 mail
> servers with dspam using the hash driver. The database is kept on an NFS
> share and it seems to work fine.
>
> My questions:
>
> Is TOE the training mode that uses the least disk space?
> What hash driver configuration should I use? My database is over 100GB
> right now and growing fast.
> What is the best practice for database maintenance?

Hi,

I would be leery of using the hash backend for a system with that many
users using individual training. You are only using ~1MB/user. What tokenizer
are you using? I would expect you to need much more room per user as the
training progresses, 10-100MB each. I think your disk usage is going to
continue to increase to the point that using a PostgreSQL backend would make
sense. How are you planning to handle a hash file becoming corrupt?
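
The figures in this reply can be checked with a little arithmetic. The sketch below uses only the numbers quoted in the thread (100,000 users, roughly 100GB today, an expected 10-100MB per user) and decimal units; the function names are illustrative.

```python
USERS = 100_000
CURRENT_TOTAL_GB = 100               # "+100GB" from the original post
PROJECTED_MB_PER_USER = (10, 100)    # expected range once training progresses


def per_user_mb(total_gb, users):
    """Average storage per user in MB (decimal units: 1 GB = 1000 MB)."""
    return total_gb * 1000 / users


def projected_total_tb(mb_per_user, users):
    """Total storage in TB if every user reaches `mb_per_user`."""
    return mb_per_user * users / 1_000_000


if __name__ == "__main__":
    print(f"today: {per_user_mb(CURRENT_TOTAL_GB, USERS):.1f} MB/user")
    for mb in PROJECTED_MB_PER_USER:
        print(f"{mb} MB/user -> {projected_total_tb(mb, USERS):.1f} TB total")
```

In other words, if every user reached the expected range, the total would land somewhere between about 1TB and 10TB.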

Regards,
Ken

Re: [Dspam-user] large email structure

joao
On 2015-03-24 17:04, [hidden email] wrote:

> Hi,
>
> I would be leery of using the hash backend for a system with that many
> users using individual training. You are only using ~1MB/user. What tokenizer
> are you using? I would expect you to need much more room per user as the
> training progresses, 10-100MB each. I think your disk usage is going to
> continue to increase to the point that using a PostgreSQL backend would make
> sense. How are you planning to handle a hash file becoming corrupt?
>
> Regards,
> Ken
>
I'm using the OSB tokenizer. The database is "new"; that's why it is so small
today.

I'm planning to move it to an SQL backend. Which database does dspam work
best with? I saw some PostgreSQL schema optimizations, but maybe MySQL
uses fewer resources? What are your experiences?

Thanks!



Re: [Dspam-user] large email structure

ktm@rice.edu
On Wed, Mar 25, 2015 at 06:39:43AM -0300, [hidden email] wrote:

> I'm using the OSB tokenizer. The database is "new"; that's why it is so small
> today.
>
> I'm planning to move it to an SQL backend. Which database does dspam work
> best with? I saw some PostgreSQL schema optimizations, but maybe MySQL
> uses fewer resources? What are your experiences?
>
> Thanks!
>
Hi,

We currently use MySQL with a MyISAM backend and an old release of DSPAM,
version 3.6.x. We are working on an upgrade to the latest release of DSPAM
and a change to a PostgreSQL backend, which will let us partition the backend
tables and perform maintenance more easily without impacting concurrent
usage: use CLUSTER to keep each user's tokens adjacent, use a <100%
fillfactor to allow for HOT updates, and remove old mail signatures with
TRUNCATE rather than DELETE. Note that with as many users as you have, you
may not want to keep the signatures at all and simply retrain on the message
if you have it available. As far as I know, the resource usage of MySQL and
PostgreSQL is similar once you move to InnoDB/XtraDB.
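
The three PostgreSQL maintenance steps mentioned above could look roughly like the statements generated below. This is only a sketch: the table names are assumed to follow the stock DSPAM PostgreSQL schema, while the index name and the per-month signature partition are hypothetical and would need to match whatever your schema actually defines.

```python
def maintenance_sql(
    token_table="dspam_token_data",                # assumed stock table name
    token_index="dspam_token_data_uid_idx",        # HYPOTHETICAL index name
    fillfactor=90,                                 # anything below 100 leaves room for HOT updates
    expired_partition="dspam_signature_data_2015_02",  # HYPOTHETICAL partition name
):
    """Build the SQL statements for the maintenance steps described above.

    Nothing is executed here; the statements are returned as strings so
    they can be reviewed before being run with psql or a database driver.
    """
    return [
        # Leave free space in each page so updated rows can stay on the same page.
        f"ALTER TABLE {token_table} SET (fillfactor = {fillfactor});",
        # Physically reorder the table by the index so a user's tokens sit together.
        # CLUSTER rewrites the table and holds an exclusive lock while it runs.
        f"CLUSTER {token_table} USING {token_index};",
        # Drop an entire partition of old signatures at once instead of DELETE.
        f"TRUNCATE TABLE {expired_partition};",
    ]


if __name__ == "__main__":
    for statement in maintenance_sql():
        print(statement)
```

Because CLUSTER locks the table it rewrites, running it one partition at a time is one way partitioning can keep maintenance from blocking concurrent use.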

Regards,
Ken

Re: [Dspam-user] large email structure

Phil Stracchino
On 03/25/15 08:48, [hidden email] wrote:
> On Wed, Mar 25, 2015 at 06:39:43AM -0300, [hidden email] wrote:
>> I'm planning to move it to an SQL backend. Which database does dspam work
>> best with? I saw some PostgreSQL schema optimizations, but maybe MySQL
>> uses fewer resources? What are your experiences?

> We currently use MySQL with a MyISAM backend and an old release of DSPAM,
> version 3.6.x. We are working on an upgrade to the latest release of DSPAM
> and a change to a PostgreSQL backend, which will let us partition the backend
> tables and perform maintenance more easily without impacting concurrent
> usage: use CLUSTER to keep each user's tokens adjacent, use a <100%
> fillfactor to allow for HOT updates, and remove old mail signatures with
> TRUNCATE rather than DELETE. Note that with as many users as you have, you
> may not want to keep the signatures at all and simply retrain on the message
> if you have it available. As far as I know, the resource usage of MySQL and
> PostgreSQL is similar once you move to InnoDB/XtraDB.

It's probably worth pointing out that MyISAM is a legacy storage engine
which, realistically, should not be used in production any more except
when strictly unavoidable.  "Conventional wisdom" has it that MyISAM is
faster than InnoDB for mostly-read loads; benchmarking performed at my
employer indicates that this is not actually the case, and that InnoDB
substantially outperforms MyISAM even for a 100% read workload, the best
possible performance case for MyISAM.  MyISAM is not robust, not ACID
compliant, and does not perform well; the only reason to use it is if
minimum resource utilization is an overriding priority, or if you are
using one of the few remaining MyISAM features not yet supported (or not
supported well) by InnoDB.  (Currently this pretty much means SPATIAL
indexes, which aren't supported yet in InnoDB, and FULLTEXT indexes,
which don't perform well yet.)


--
  Phil Stracchino
  Babylon Communications
  [hidden email]
  [hidden email]
  Landline: 603.293.8485

Re: [Dspam-user] large email structure

ktm@rice.edu
On Wed, Mar 25, 2015 at 09:03:54AM -0400, Phil Stracchino wrote:

> It's probably worth pointing out that MyISAM is a legacy storage engine
> which, realistically, should not be used in production any more except
> when strictly unavoidable.  "Conventional wisdom" has it that MyISAM is
> faster than InnoDB for mostly-read loads; benchmarking performed at my
> employer indicates that this is not actually the case, and that InnoDB
> substantially outperforms MyISAM even for a 100% read workload, the best
> possible performance case for MyISAM.  MyISAM is not robust, not ACID
> compliant, and does not perform well; the only reason to use it is if
> minimum resource utilization is an overriding priority, or if you are
> using one of the few remaining MyISAM features not yet supported (or not
> supported well) by InnoDB.  (Currently this pretty much means SPATIAL
> indexes, which aren't supported yet in InnoDB, and FULLTEXT indexes,
> which don't perform well yet.)
>

Exactly why we are moving to PostgreSQL in the update. When we first set
up the old version, InnoDB performance was very poor and PostgreSQL had
other problems. Since then, both products have improved greatly! It should
also be noted that MyISAM performs poorly for mixed read-write
activity due to the locking needed to coordinate access. In fact, we
had to use multiple DBs to keep locking to a manageable level in each
DB. SPATIAL and FULLTEXT indexes are not really applicable to DSPAM.

Regards,
Ken

Re: [Dspam-user] large email structure

Phil Stracchino
On 03/25/15 09:16, [hidden email] wrote:
> It should
> also be noted that MyISAM performs poorly for mixed read-write
> activity due to the locking needed to coordinate access.

Oh dear gods yes.  In the benchmark sets I alluded to, we determined
that, both properly configured with the same workload on the same
hardware, in a 100% read OLTP workload InnoDB outperformed MyISAM by
60%; by the time the workload reached 75% read/25% write, InnoDB was
outperforming MyISAM by 400%.


--
  Phil Stracchino
  Babylon Communications
  [hidden email]
  [hidden email]
  Landline: 603.293.8485
