Posts Tagged ‘kernel’

Kernel hacking for fun but not profit, part I: Tasklets

Thursday, December 14th, 2006

I’ve been doing quite a bit of kernel hacking recently and am learning a lot, but for some reason it only just occurred to me that I should be documenting some of the things I’m learning. Learning to hack the kernel is not easy and often there isn’t much documentation (I have Linux Kernel Development by Robert Love, which is really useful – it’s published by Novell Press, so please buy it second-hand rather than give Novell your money).

I should really start at the beginning, but I’m not going to, since I can’t remember that far back and I’m not sure that there is a sensible place to start in any case. I’m currently working with tasklets, so that’s where I’m going to start.

When writing an interrupt handler in the kernel, there’s one key rule: your interrupt handler function must return as quickly as possible. To make that possible, we split the interrupt handler into two halves: a top-half and a bottom-half. The top half is our main interrupt handler function, which does the bare minimum: work out what we need to do in response to the interrupt. The bottom half does the actual work that we need to do in response to the interrupt we received – this could be manipulating data structures or reading/writing to hardware.
The most-used way to implement a bottom-half is a tasklet. A tasklet is a function much like any other in C, except that you don’t call it, you schedule it. A scheduled tasklet does not run immediately; the kernel runs it at some point in the near future. This allows your interrupt handler top-half to return quickly without having to wait for the bottom-half to finish.

Here’s a simplified example of an interrupt handler and a tasklet bottom-half based roughly on the code I’m currently working on:

// tasklets are in interrupt.h
#include <linux/interrupt.h>

// declare interrupt handler function
static irqreturn_t interrupt_handler(int irq, void *dev_id);
// declare tasklet function
static void handle_disk_removal(unsigned long data);
// declare tasklet
// prototype:
// DECLARE_TASKLET(tasklet_name, function_name, unsigned long data)
DECLARE_TASKLET(disk_removed, handle_disk_removal, 0);

static irqreturn_t interrupt_handler(int irq, void *dev_id)
{
  // read interrupt source
  u8 interruptregister = i2c_read_8574(CTRL_ADDR);

  // acknowledge the interrupts to the interrupt controller
  i2c_write_8574(CTRL_ADDR, 0xFF);

  // determine the source of the interrupt
  // NOTE: this is not the right way to determine the source; this is a simplified example
  switch (interruptregister) {
  case 0xFE:
    // we know that when the interrupt register is 0xFE it means that a hard
    // disk has been hot-swap removed

    // schedule our disk_removed tasklet to run.
    tasklet_schedule(&disk_removed);
    break;
  }
   
   …
   
}

static void handle_disk_removal(unsigned long data)
{
  // manipulate our disks data structure
  // print a kernel message
  // whatever else we need to do
  …
}

You don’t need to worry about most of the code in the interrupt handler (since every one is different); it is the tasklet_schedule call that is important.

When we declare our tasklet, we give it a name (disk_removed) and give it a function to call (handle_disk_removal). We also have the option to pass it some data in the form of an unsigned long, but we don’t need to, so we just pass 0. Incidentally, a plain number is rarely what you want to pass – most often you’ll need to access something that isn’t an unsigned long, so you’ll either cast a pointer to your structure to unsigned long and pass that, or use a (properly locked) global variable or structure instead.
Now we’ve declared the tasklet, getting it to run is a simple case of calling tasklet_schedule and passing a pointer to the tasklet. This will cause the tasklet to run in the future – we don’t know (or care) exactly when, but we can be sure that it will be run. If it gets scheduled more than once before it gets run, it will only be run once.
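
To make the "scheduled twice, run once" behaviour concrete, here is a rough userspace model of it. This is not kernel code and won't build as a module: model_tasklet, model_tasklet_schedule, model_run_pending and disk_state are all names I've made up for illustration, and the real kernel tracks the pending state with an atomic bit rather than a plain bool. It also shows the trick of smuggling a pointer through the unsigned long argument, which assumes unsigned long is wide enough to hold a pointer (true on Linux).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a tasklet; NOT the kernel structure. */
struct model_tasklet {
    void (*func)(unsigned long data); /* bottom-half function to run later */
    unsigned long data;               /* opaque value handed to func */
    bool scheduled;                   /* true while a run is pending */
};

/* Mark the tasklet as pending. Scheduling one that is already pending
 * changes nothing, which is why repeated schedules collapse into one run. */
static void model_tasklet_schedule(struct model_tasklet *t)
{
    assert(t != NULL);
    t->scheduled = true;
}

/* Stand-in for "the kernel gets round to running pending tasklets".
 * Returns how many times the function was run (0 or 1). */
static int model_run_pending(struct model_tasklet *t)
{
    assert(t != NULL);
    if (!t->scheduled)
        return 0;
    t->scheduled = false; /* clear first so func could reschedule itself */
    t->func(t->data);
    return 1;
}

/* Hypothetical state the bottom-half works on. */
struct disk_state {
    int removals_handled;
};

/* Same signature as the tasklet function in the example above; here the
 * unsigned long carries a pointer to our state structure. */
static void handle_disk_removal(unsigned long data)
{
    struct disk_state *disks = (struct disk_state *)data;
    disks->removals_handled++;
}
```

If you set up a model_tasklet with handle_disk_removal and (unsigned long)&disks, call model_tasklet_schedule twice and then model_run_pending once, removals_handled goes up by one, and a second model_run_pending does nothing.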

So tasklets are actually very simple to use. The hard part comes when you need to share data between regular functions, tasklets and your interrupt handler. You have to use proper locking to make sure nothing nasty happens, but locking deserves a post of its own, I think.

Buggy motherboards

Monday, September 25th, 2006

I’d love to know how large motherboard manufacturers manage to produce buggy boards. Just as we’re putting two new servers into production at work, I discover an annoying problem: if there is a sustained data transfer over the network for about 10 minutes, the systems reboot. No warning, no error messages, just a reboot. Having poked around I discovered that changing the kernel’s timer interrupt frequency (the HZ setting, i.e. how often the timer interrupt fires) changes the amount of time it takes for the system to reboot during data transfer. Set to 100Hz, the system reboots as soon as the transfer is started. At 250Hz, you get a few seconds before the reboot. At 1000Hz (the default), you get 5-10 minutes. So knowing that the problem was related to interrupts, I suspected the APIC on the motherboard might be at fault, as it seems relatively common for motherboard manufacturers to stuff it up.
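
For a sense of scale, the three timer frequencies translate into tick intervals as in the tiny sketch below. tick_interval_us is a helper I've made up for illustration; it is not a kernel function.

```c
#include <assert.h>

/* Length of one timer tick in microseconds for a timer frequency in Hz.
 * Hypothetical helper, for illustration only. */
static unsigned long tick_interval_us(unsigned long hz)
{
    assert(hz > 0);
    return 1000000UL / hz;
}
```

So 100Hz means a tick every 10000 microseconds (10ms), 250Hz every 4000 (4ms) and 1000Hz every 1000 (1ms).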

So I went into the BIOS and disabled the APIC (after first having to disable hyperthreading), rebooted and, lo and behold, all is well: no more reboots.
My theory about the exact cause of the problem is that for some reason interrupts were being generated faster than the OS could handle them, possibly due to spurious interrupts being generated at the APIC. Interrupt controllers receive interrupts from components within the system and hold them pending until the OS acknowledges them and goes away to handle them. I believe that so many interrupts were being generated that the interrupt controller could no longer cope. When this happens the motherboard logic says ‘bloody hell this shouldn’t happen and I can’t recover from it’ and reboots the machine.

So if you are the unlucky owner of a Supermicro P4SCT+ motherboard, beware!

Update on Sun E250/E450 environmental monitoring

Wednesday, June 28th, 2006

As regular readers will know, I’ve been helping out for a while with a driver to support the environmental monitoring hardware found in Sun’s E250 and E450 servers – mostly reverse engineering and testing of the work done by Eric Brower. However, as Eric is a little short on time at the moment, I’ve started to get my hands dirty for the first time and, thanks to people doing almost instant testing of the patches I’m releasing, things are progressing really well.

I’m now hosting a Subversion repository for the driver and have also put up a page giving a few details.

Over the last week I’ve ported the driver to the 2.6.17 kernel series, which was certainly an experience given that I’ve never done any real kernel (or C!) development before. I can now understand why people moan about not having a stable kernel API: laziness. All the changes I had to accommodate going from the 2.6.11 to the 2.6.17 series were designed to provide more functionality and/or make life easier – I think it would be insane to slow the progression of Linux simply to keep a few developers happy (most of whom work on proprietary drivers and would like to be able to just write a driver and leave it unmaintained forevermore).

In addition to the porting, I’ve fixed a stack of bugs reported by the guy currently testing things for me (who I know only as ‘Eki’) so compilation with gcc 4.x now works, as does static compilation and static usage – the driver is essentially in two parts where one has to be loaded before the other, which wasn’t happening when it was statically linked. I had to ask on LKML to find the solution to this and was given a lengthy explanation of how to do it by Arjan van de Ven.

It’s great to see the community working so well – without it, this driver certainly wouldn’t exist.

As far as the driver is concerned, the next steps are to implement interrupt handling then clean up the code a bit. That may take some time, but once that’s done the driver can start its journey towards being merged into Linus’ tree and becoming part of the kernel proper.

While I’m on the topic of Linux kernel development, it’s great to see a lot of work going on at the moment to support Sun’s new sun4v architecture (for the UltraSPARC T1 processor), mostly being done by David Miller. Naturally I will almost certainly never own such an exciting (both in terms of technology and philosophy – the CPU is GPL’d as regular readers here will know) piece of hardware since it’s rather out of my price range, but the side-effect of this work is that improvements are being made to SPARC64 support in general which benefit people like me with older Sun hardware. I don’t know whether Sun are employing David or just providing hardware, but in either case, thanks Sun!

Me, the kernel developer

Wednesday, September 21st, 2005

Yep, I’ve just had not one but two patches accepted into the Linux kernel.
From the 2.6.14-rc1 changelog:

commit 05ade5a5cd32f8393c22fc454b0546df2ed497c5
Author: David Johnson <dj@xxx>
Date: Fri Sep 9 13:02:55 2005 -0700

[PATCH] dvb: bt8xx: Nebula DigiTV mt352 support

Add support for Nebula DigiTV PCI cards with the MT352 frontend.

Signed-off-by: David Johnson <dj@xxx>
Signed-off-by: Johannes Stezenbach <js@xxx>
Signed-off-by: Andrew Morton <akpm@xxx>
Signed-off-by: Linus Torvalds <torvalds@xxx>

commit 1f15ddd0b79d1722049952b7359533a18a72f106
Author: David Johnson <dj@xxx>
Date: Fri Sep 9 13:02:54 2005 -0700

[PATCH] dvb: bt8xx: cleanup

Indentation fixes and remove unnecessary braces.

Signed-off-by: David Johnson <dj@xxx>
Signed-off-by: Johannes Stezenbach <js@xxx>
Signed-off-by: Andrew Morton <akpm@xxx>
Signed-off-by: Linus Torvalds <torvalds@xxx>

This all started when I bought a Nebula DigiTV-PCI DVB card which didn’t work under Linux. Looking back through the linux-dvb list archives I found an old patch someone had submitted to make it work. This patch didn’t get accepted because it needed fixing and cleaning up. So I took the patch, re-made it to apply to the current dvb-kernel CVS and fixed the problems. So no, I didn’t actually add the support for the card myself, but I got it into the kernel. I still wouldn’t suggest anyone goes out and buys one of these though: not only are they overpriced, but they have some hardware issues which cause lots of people problems.
The other patch cleans up the particular file I was working on to fix the coding style, formatting and so on.

OK so I won’t be winning any "Best Linux kernel contributor" awards, but it’s nice to see my name in the changelog and know that I’ve actually contributed something useful.

Update on E450 environmental monitoring

Friday, August 26th, 2005

Those who’ve read my blog before might remember that I’m working with a Linux kernel developer to help him write a driver for the environmental monitoring features of the Sun Ultra Enterprise 450 server.

We’ve come a huge distance and Eric said yesterday that we’re approaching being able to release the code for further testing once it’s been cleaned up a bit. At the moment we can dynamically set fan speeds depending on the temperature – this was the most important and time-consuming feature to implement. Without this driver the fans run at full speed, which is really noisy; now the machine probably isn’t any louder than the average desktop machine. We can also tell when fans have failed, determine how many PSUs are fitted and hopefully soon determine their status as well. Eventually we’ll also be able to determine when disks have failed and set the disk LEDs and front panel LEDs appropriately as well as respond to the various keyswitch positions.
This may not sound like much, but when you consider that there was no documentation available from Sun and everything had to be reverse-engineered, it’s a huge task. This has been a work in progress for nearly 3 months so far.
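
To give a flavour of what "setting fan speeds depending on the temperature" can mean, here is a minimal sketch of one common approach: a linear ramp between two temperature thresholds. To be clear, this is not the driver’s actual algorithm, and every threshold and name in it (FAN_TEMP_MIN_C, FAN_DUTY_MIN, fan_duty_for_temp and so on) is an assumption made up for illustration.

```c
#include <assert.h>

/* Hypothetical thresholds; NOT taken from the E450 driver. */
#define FAN_TEMP_MIN_C  30   /* at or below this, run fans at minimum */
#define FAN_TEMP_MAX_C  60   /* at or above this, run fans flat out */
#define FAN_DUTY_MIN    30   /* minimum fan duty, in percent */
#define FAN_DUTY_MAX   100   /* maximum fan duty, in percent */

/* Map a temperature in degrees C to a fan duty in percent, ramping
 * linearly between the two thresholds and clamping outside them. */
static int fan_duty_for_temp(int temp_c)
{
    assert(FAN_TEMP_MAX_C > FAN_TEMP_MIN_C);

    if (temp_c <= FAN_TEMP_MIN_C)
        return FAN_DUTY_MIN;
    if (temp_c >= FAN_TEMP_MAX_C)
        return FAN_DUTY_MAX;

    return FAN_DUTY_MIN +
           (temp_c - FAN_TEMP_MIN_C) * (FAN_DUTY_MAX - FAN_DUTY_MIN) /
           (FAN_TEMP_MAX_C - FAN_TEMP_MIN_C);
}
```

With these made-up numbers, 45°C sits halfway up the ramp and gives a duty of 65%. The important property is the clamp at the top: whatever else goes wrong, a hot machine gets full fan speed.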

There was however nearly a casualty… my machine. At one point Eric had been poking at it via SSH and had inadvertently stopped all the fans. About an hour later I discovered that the machine was very, very hot and creating that all-too-familiar smell of electrical burning…
Amazingly the machine survived and was up and running a few minutes later – I had to turn it on again to get the fans going. As it was working and had passed a full diagnostic, I wasn’t too concerned about it. However the other day I took the case off and saw how the plastic air-flow guides fitted around each CPU had melted onto their respective heatsinks. Bear in mind that these are designed to withstand temperatures of 70°C and above. I took some photos of the mess today, as I removed the air-flow guides completely. Unfortunately this now means that the cooling isn’t as efficient as it was designed to be and the CPUs are running hotter than they otherwise would, but hopefully I can get some replacements.

I’m very happy with how things are progressing and with any luck I’ll soon be able to put my server into production use using the new driver.