Brief lorien Downtime

Apr 04, 2012 17:04

[Original: 04 April 2012, 17:04 PDT] If you have a host or an e-mail address through me, you may have occasionally noticed a slight hiccup in mail delivery. Soon after I post this, I will be bringing the server down to give it a memory upgrade that will, hopefully, address this issue--I'll update this post when the server is back online.

In my mail server setup, several layers of servers pass a mail message back and forth (a sketch of the glue configuration follows the list):

  1. Postfix, the mail transport agent (what most people think of as "the mail server").
  2. Amavisd-new, a server that scans the contents of each message. In turn, it calls:
    1. ClamAV, which takes e-mails from Amavis and checks them for viruses and other malware.
    2. SpamAssassin, which takes e-mails from Amavis and checks to see if they are spam.
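
For the curious, here's roughly what the glue between those layers looks like. This is only a sketch of the common amavisd-new arrangement--the localhost ports 10024/10025 are the usual defaults, not values pulled from lorien's actual config:

    # Show how Postfix is told to hand mail off to a content filter;
    # in the stock amavisd-new setup, amavis listens on port 10024
    # and reinjects clean mail back into Postfix on port 10025.
    postconf content_filter
    # typical output:
    #   content_filter = smtp-amavis:[127.0.0.1]:10024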

A few weeks ago, I noticed a very occasional problem on the server: processes failing with what appeared to be spurious write errors, and every once in a while amavis (the content filter) would fail with a message of the form:

    Apr 3 21:13:40 lorien amavis[20273]: (20273-03) (!!)file(1) utility (/usr/bin/file) FAILED: run_command: can't fork: Cannot allocate memory at /usr/sbin/amavisd-new line 3081, line 807.
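
If you want to check whether your own box is throwing the same error, something like this works (the log path is the Debian default; yours may differ):

    # Count fork failures logged by the content filter; a nonzero
    # count means amavis has been running out of memory.
    grep -c "can't fork" /var/log/mail.log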

Now, you don't have to be a 31337 h4xx0r to know that that's just not good.

We tried reseating the RAM, which didn't solve the problem--I don't see the random "halp cannot write!" errors anymore, but Amavis still blows its cookies, and when it blows too many cookies, it crashes out entirely. That means Postfix can't pass mail to it, with the user-apparent effect of "halp mail is down!" but the admin-apparent effect of "halp mail is stuck". The content filtering servers are called on all mail, inbound and outbound, by paranoid design, so while we received all e-mail fine, we wouldn't deliver any of it until my husband or I restarted the content filtering server.
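
The recovery, at least, is simple. Here's a sketch of what "restarted the content filtering server" amounts to--the service name is the Debian/Ubuntu one, so adjust for your distro:

    mailq | tail -n 1                # queue summary: how much mail is waiting
    sudo /etc/init.d/amavis restart  # kick the wedged content filter
    sudo postqueue -f                # tell Postfix to retry the queued mail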

Anyway, assuming it is bad RAM (and we have reason to believe it is), we could either take the server down all night for burn-in RAM testing or replace/upgrade all the memory in the box. We're choosing the latter--it's cheaper to spend $75 and bring the boy up to 8 GB (plenty for a server of this sort) than to have the server potentially down overnight, and it's an upgrade we wanted to make anyway (it has less just now, so why not more?).
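
Once the new sticks are in, verifying that the box actually sees them is quick (nothing here is specific to lorien):

    free -m                                  # total/used/free RAM, in MB
    sudo dmidecode -t memory | grep -i size  # what each DIMM slot reports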

Should this not fix it, there were some hints in a forum that this might be a BIOS issue, so updating that may be next. Failing both of those, it might be a problem with the motherboard, which, should replacing it be necessary (gods forbid), will involve several hours' downtime.

[Update: 04 Apr 2012, 17:22 PDT] lorien is back online and now has all the RAM his motherboard can handle. You can, of course, expect me to keep my eye on the situation, and all the other things one expects from trained IT personnel.

In the meantime, my best to you all...