I’d love to know how large motherboard manufacturers manage to produce buggy boards. Just as we were putting two new servers into production at work, I discovered an annoying problem: after about 10 minutes of sustained data transfer over the network, the systems reboot. No warning, no error messages, just a reboot.
Having poked around, I found that changing the kernel’s timer interrupt frequency (the HZ setting, which controls how often the kernel’s timer tick fires) changes how long the system survives during a transfer. At 100Hz, the system reboots as soon as the transfer starts. At 250Hz, you get a few seconds before the reboot. At 1000Hz (the default), you get 5-10 minutes. Since the problem was clearly related to interrupts, I suspected the APIC on the motherboard was at fault, as it seems relatively common for motherboard manufacturers to stuff it up.
So I went into the BIOS and disabled the APIC (after first having to disable hyperthreading), rebooted, and voilà, all is well: no more reboots.
My theory about the exact cause of the problem is that, for some reason, interrupts were being generated faster than the OS could handle them, possibly because the APIC was generating spurious interrupts. As I understand it, interrupt controllers receive interrupts from components within the system and essentially queue them. Then at a pre-defined interval (between 100 and 1000Hz on Linux, 100Hz on Windows) the OS checks for interrupts, acknowledges them (causing the interrupt controller to remove them from the queue), then goes away to handle them. I believe that so many interrupts were being generated that the interrupt controller’s queue was filling up. When this happens the motherboard logic says ‘bloody hell, this shouldn’t happen and I can’t recover from it’ and reboots the machine.
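If you want to check whether something really is flooding a Linux box with interrupts, one option is to sample /proc/interrupts twice and compare the counts. Here is a minimal Python sketch of that idea. The function names (parse_interrupts, interrupt_rates, sample_rates) are made up for illustration, and it assumes the usual layout of a CPU header row followed by one "label: count count ... description" line per interrupt source.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: total count across all CPUs} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row looks like: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt line
        counts = []
        # At most one count per CPU; stop at the first non-numeric field.
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each label, given two snapshots."""
    return {irq: (after[irq] - before.get(irq, 0)) / seconds for irq in after}


def sample_rates(interval=1.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_rates(before, after, interval)
```

Calling sample_rates() while a transfer is running and sorting the result by rate should show which line is firing most often. It will not prove the APIC is at fault, but an implausibly high rate on one line would at least point in that direction.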
So if you are the unlucky owner of a Supermicro P4SCT+ motherboard, beware!




