Behind the scenes of the new GNU mailing list server
If this delicate pipeline is disrupted for any reason, incoming mail quickly starts to queue up, and the backlog multiplies explosively when the service comes back online. Needless to say, this workload makes a good scalability test for the entire software stack: Mailman, Exim, MHonArc, and a very fragmented XFS filesystem containing millions of small files. Last year, a double disk crash forced us to restore the archives from the nightly backup and merge them with newer data recovered from the dead array. The extra I/O load delayed email spooling for several days. At times, the outgoing queue would grow too large, exposing quadratic behavior in an ancient version of Exim. The machine would then thrash to death while mail continued to arrive faster than it could be delivered. Our monitoring systems would promptly page us and we'd babysit the spooling process back to safety.
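The tipping point described above can be illustrated with a toy model. This is only a sketch under assumed numbers: the function name, the per-tick work budget, and the cost formula are all hypothetical and were not measured from Exim. It assumes the cost of handling one message grows linearly with the queue length, so working through the whole queue costs quadratically more as it grows.

```python
def simulate_queue(backlog, arrivals_per_tick, work_budget, scan_cost, ticks):
    """Toy model of a mail queue whose per-message cost grows with its length.

    All parameters are hypothetical illustration values, not Exim settings.
    Returns the queue length at the end of each tick.
    """
    queue = backlog
    history = []
    for _ in range(ticks):
        # New mail arrives at a constant rate.
        queue += arrivals_per_tick
        # Assumption: each delivery costs 1 unit plus a share proportional
        # to the current queue length (the "quadratic" total cost).
        cost_per_message = 1 + scan_cost * queue
        # The machine can only spend work_budget units per tick.
        delivered = min(queue, int(work_budget / cost_per_message))
        queue -= delivered
        history.append(queue)
    return history


# Same machine, same arrival rate, different starting backlog.
small = simulate_queue(500, 100, 1000, 0.01, 20)
large = simulate_queue(2000, 100, 1000, 0.01, 20)
```

With these assumed numbers, a backlog of 500 messages drains completely, while a backlog of 2000 keeps growing because each tick delivers fewer messages than arrive. Below the tipping point the system recovers on its own; above it, it needs outside help.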
Upgrading a critical system like this isn't exactly easy: its aging components have to be replaced all at once, ideally without causing prolonged downtime or any externally visible change. The HTML archives are especially problematic because they're an enormous mass of fast-changing data that takes several hours to sync and several days to reconstruct from the raw mbox files.
We built the new list server as a Xen virtual machine hosted on one of the largest servers in our colocation facility, with well-tuned ext4 file systems running on a RAID-1 array of solid-state drives and a second array of fast hard drives. To ensure that the increased workload wouldn't disrupt the other guests sharing the Xen host, we ran CPU and I/O torture tests for a few days. Copying the list archives required a trip to the colo to preload a snapshot of the data from a 130 GB tarball.
Testing the new system before transitioning it into production was also quite a challenge: the test environment must not be allowed to start spooling out millions of bogus messages, but you still want to get some messages out of it to verify that everything works. The list server is also tightly integrated with Savannah and the FSF mail system through a number of scripts and cron jobs, many of which have been around and running quietly for years.
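One common way to let a few messages out of a test environment while holding back the rest is a recipient allowlist. The sketch below is an assumption for illustration, not a description of how the FSF test setup worked: the function name and the example.org address are hypothetical.

```python
from email.utils import parseaddr

# Hypothetical allowlist: only these addresses may receive mail from the
# test environment. Entries are stored in lowercase.
TEST_ALLOWLIST = frozenset({"sysadmin@example.org"})


def split_recipients(recipients, allowlist=TEST_ALLOWLIST):
    """Split recipients into (deliver, hold) lists for a test environment."""
    deliver, hold = [], []
    for rcpt in recipients:
        # parseaddr strips any display name: "Name <addr>" -> addr.
        _, addr = parseaddr(rcpt)
        if addr.lower() in allowlist:
            deliver.append(addr)
        else:
            # Unparseable or unlisted recipients are held, never delivered.
            hold.append(rcpt)
    return deliver, hold
```

Defaulting to "hold" means a mistake in the allowlist results in too little test mail rather than a flood of bogus messages.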
Eventually, the day to switch over to the new server came. We announced it four days in advance; longer would have been better, but since we were concerned about possible failure in the existing setup, we did not want to wait longer than absolutely necessary.
The day before, we had prepared a long checklist of operations to be executed in sequence to gracefully stop the old service, perform a final rsync of all data, and cut over to the new machine.
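A checklist like this can also be encoded as a small script that runs each step in order and stops at the first failure. The following is a minimal sketch, not the procedure we actually ran: the function name and the step format are hypothetical, and the commands are supplied by the caller.

```python
import subprocess


def run_checklist(steps, runner=subprocess.run):
    """Run (description, command) steps in order, stopping at the first failure.

    Each command is an argument list passed to the runner. Returns a tuple of
    (descriptions of completed steps, description of the failed step or None).
    """
    completed = []
    for description, command in steps:
        result = runner(command)
        if result.returncode != 0:
            # Stop here so later steps never run against a half-migrated state.
            return completed, description
        completed.append(description)
    return completed, None
```

The steps would be passed as a list of pairs, each holding a description such as "stop the old service" and the argument list of the command that performs it. Because the runner is a parameter, the sequence can be rehearsed with a stub before it touches the real machines.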
Not everything went smoothly, but there were no major incidents and the downtime of the lists was limited to a few hours (the HTML archives took a few days to regenerate). A side effect that we hadn't anticipated was that the sudden surge of email from the new list server immediately triggered the anti-spam defenses of Gmail and Yahoo, which blocked our new IP until we got it whitelisted. All mail was delivered in the end.
This central component of the GNU Project has finally been reworked, but there are still many legacy services waiting to be replaced or upgraded. The systems administration team of the Free Software Foundation is continually working to improve the development and collaboration infrastructure which supports thousands of free software projects.
Thank you to all of our members -- your support makes these community services possible.