It's been nine days since my last post about this. The prom console thing worked, but it exposed another problem: smp_boot_one_cpu() was hanging trying to start the other processors. Well, to be precise, it was calling prom_startcpu(), and that was returning just fine, but the next printk() after that would output one character and then hang.
I immediately suspected locking issues. The PROM console acquires and releases prom_entry_lock once per character, to avoid holding it too long. My theory was that it grabbed the lock right away, got one character out (which takes about a millisecond on a 9600 bps serial terminal), released the lock, and then had trouble reacquiring it for the next character and ended up spinning forever.
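For a rough sense of how long the lock is held per character, here is a small back-of-the-envelope helper. It is only a sketch: frame_time_us is a made-up name, and the 8N1 framing (one start bit, eight data bits, one stop bit, so 10 bits per character) is an assumption, since only the line speed is known.

```c
#include <stdint.h>

/*
 * Microseconds one character occupies on the serial line.
 * bits_per_frame is an assumption about the framing: 10 for 8N1,
 * 8 if you count data bits only. Integer division truncates.
 */
uint64_t frame_time_us(uint64_t baud, uint64_t bits_per_frame)
{
    return bits_per_frame * 1000000u / baud;
}
```

At 9600 bps this gives 1041 microseconds for a 10-bit frame and 833 microseconds for the 8 data bits alone, so the lock gets dropped and retaken roughly a thousand times a second while a message is printing.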
It turned out I was right, but I chased quite a few false leads before I found the problem. I really needed to be able to trace what the second processor startup code in trampoline.S was up to, and that wasn't as easy as just throwing in some calls to printk(). The code holds prom_entry_lock for much of its execution (kernel spinlocks hang if you try to acquire them recursively), and in some of the places I needed to trace it didn't have little things like a stack, or even a TLB, set up.
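The recursive-acquire problem is easy to see in a toy model. The sketch below is not the kernel's spinlock code; sketch_lock_t and the function names are hypothetical, and it uses C11 atomics in place of the real implementation. The point is only that a plain test-and-set lock has no idea who owns it, so a second acquire from the same caller looks exactly like contention.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel spinlock: one flag, no owner field. */
typedef struct { atomic_flag held; } sketch_lock_t;

/* Try once to take the lock; true means we got it. */
bool sketch_trylock(sketch_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void sketch_unlock(sketch_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Take the lock, then try to take it again without releasing.
 * Returns true if the second attempt fails, which is the case where a
 * blocking lock would spin forever waiting on itself.
 */
bool second_acquire_would_hang(void)
{
    sketch_lock_t l;
    atomic_flag_clear(&l.held);         /* start unlocked */

    bool first = sketch_trylock(&l);    /* succeeds */
    bool second = sketch_trylock(&l);   /* fails: the lock is already set */
    sketch_unlock(&l);
    return first && !second;
}
```

A printk() issued while prom_entry_lock is already held would be in the position of that second attempt, except that it would keep retrying instead of giving up.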
Eventually, I devised a mechanism that lets trampoline.S running on cpu1 signal a waiting loop on cpu0, and made sure cpu1 released prom_entry_lock so that cpu0 could use printk() to give me debug output. Earlier this evening I used this mechanism to trace the hang down to the exact point where cpu1 calls SUNW,set-trap-table. Notice that the stock code does not hold prom_entry_lock when it makes that call. Yes, after nine days of debugging, the problem turned out to be as simple as something not being locked when it should have been. Both processors were probably trying to use the same buffer simultaneously to call the PROM, stomping all over each other, and probably both getting thrown off into hyperspace. With the appropriate locking and unlocking code added to trampoline.S, the kernel successfully starts all 11 additional processors and gets all the way up to trying to mount the root filesystem. I'll be working on getting my initramfs image ready tomorrow evening.
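The cpu1-to-cpu0 signalling idea can be sketched in C roughly as follows. This is not the debugging code itself (that lived in SPARC assembly and isn't reproduced in this post); trace_word, trace_post, trace_take and trace_wait are hypothetical names, and C11 atomics stand in for whatever stores and barriers the real thing used. The attraction of a scheme like this is that the posting side needs only a single store, so it works even before a stack exists.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical shared word; 0 means "nothing posted". */
static atomic_uint trace_word;

/* cpu1 side: publish a nonzero checkpoint id with a single store. */
void trace_post(unsigned id)
{
    atomic_store_explicit(&trace_word, id, memory_order_release);
}

/* cpu0 side: fetch the pending checkpoint and clear it; 0 if none. */
unsigned trace_take(void)
{
    return atomic_exchange_explicit(&trace_word, 0, memory_order_acquire);
}

/*
 * cpu0 side: poll up to max_spins times. On success, print the id
 * (standing in for printk) and return it; return 0 on timeout.
 */
unsigned trace_wait(unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        unsigned id = trace_take();
        if (id != 0) {
            printf("cpu1 reached checkpoint %u\n", id);
            return id;
        }
    }
    return 0;
}
```

Numbering the checkpoints in order means the last id printed before the hang brackets the instruction that never returns. In this sketch a second post before cpu0 polls would overwrite the first, so the poster would have to wait for the word to clear if every checkpoint matters.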
For the record, all three patches against stock 2.6.23 that I needed to get this kernel booting are reproduced here.
First, for the interrupts on the PCI I/O board, psycho.diff:
--- linux-2.6.23/arch/sparc64/kernel/prom.c 2007-10-09 13:31:38.000000000 -0700
+++ linux-2.6.23.patched/arch/sparc64/kernel/prom.c 2008-01-12 11:58:52.000000000 -0800
@@ -196,9 +196,17 @@
/*0x32*/ PSYCHO_IMAP_PMGMT,
/*0x33*/ PSYCHO_IMAP_GFX,
/*0x34*/ PSYCHO_IMAP_EUPA,
+/*0x35*/ PSYCHO_IMAP_EUPA,
+/*0x36*/ PSYCHO_IMAP_EUPA,
+/*0x37*/ PSYCHO_IMAP_EUPA,
+/*0x38*/ PSYCHO_IMAP_EUPA,
+/*0x39*/ PSYCHO_IMAP_EUPA,
+/*0x3a*/ PSYCHO_IMAP_EUPA,
+/*0x3b*/ PSYCHO_IMAP_EUPA,
+/*0x3c*/ PSYCHO_IMAP_EUPA,
};
#define PSYCHO_ONBOARD_IRQ_BASE 0x20
-#define PSYCHO_ONBOARD_IRQ_LAST 0x34
+#define PSYCHO_ONBOARD_IRQ_LAST 0x3c
#define psycho_onboard_imap_offset(__ino) \
__psycho_onboard_imap_off[(__ino) - PSYCHO_ONBOARD_IRQ_BASE]
Second, for the PROM console, promcon.diff:
diff -Naur linux-2.6.23/drivers/video/console/Kconfig linux-2.6.23.patched/drivers/video/console/Kconfig
--- linux-2.6.23/drivers/video/console/Kconfig 2007-10-09 13:31:38.000000000 -0700
+++ linux-2.6.23.patched/drivers/video/console/Kconfig 2008-01-14 18:56:29.000000000 -0800
@@ -92,7 +92,7 @@
config DUMMY_CONSOLE
bool
- depends on PROM_CONSOLE!=y || VGA_CONSOLE!=y || SGI_NEWPORT_CONSOLE!=y
+ depends on PROM_CONSOLE!=y && VGA_CONSOLE!=y && SGI_NEWPORT_CONSOLE!=y
default y
config DUMMY_CONSOLE_COLUMNS
Finally, for the SMP startup, trampoline.diff:
diff -Naur linux-2.6.23/arch/sparc64/kernel/trampoline.S linux-2.6.23.patched/arch/sparc64/kernel/trampoline.S
--- linux-2.6.23/arch/sparc64/kernel/trampoline.S 2007-10-09 13:31:38.000000000 -0700
+++ linux-2.6.23.patched/arch/sparc64/kernel/trampoline.S 2008-01-23 03:01:42.000000000 -0800
@@ -386,6 +386,12 @@
wrpr %g0, 0, %wstate
+ sethi %hi(prom_entry_lock), %g2
+1: ldstub [%g2 + %lo(prom_entry_lock)], %g1
+ membar #StoreLoad | #StoreStore
+ brnz,pn %g1, 1b
+ nop
+
/* As a hack, put &init_thread_union into %g6.
* prom_world() loads from here to restore the %asi
* register.
@@ -395,9 +401,9 @@
sethi %hi(is_sun4v), %o0
lduw [%o0 + %lo(is_sun4v)], %o0
- brz,pt %o0, 1f
+ brz,pt %o0, 2f
nop
-
+
TRAP_LOAD_TRAP_BLOCK(%g2, %g3)
add %g2, TRAP_PER_CPU_FAULT_INFO, %g2
stxa %g2, [%g0] ASI_SCRATCHPAD
@@ -427,10 +433,10 @@
call %o1
add %sp, (2047 + 128), %o0
- ba,pt %xcc, 2f
+ ba,pt %xcc, 3f
nop
-1: sethi %hi(sparc64_ttable_tl0), %o0
+2: sethi %hi(sparc64_ttable_tl0), %o0
set prom_set_trap_table_name, %g2
stx %g2, [%sp + 2047 + 128 + 0x00]
mov 1, %g2
@@ -444,9 +450,13 @@
call %o1
add %sp, (2047 + 128), %o0
-2: ldx [%l0], %g6
+3: sethi %hi(prom_entry_lock), %g2
+ stb %g0, [%g2 + %lo(prom_entry_lock)]
+ membar #StoreStore | #StoreLoad
+
+ ldx [%l0], %g6
ldx [%g6 + TI_TASK], %g4
-
+
mov 1, %g5
sllx %g5, THREAD_SHIFT, %g5
sub %g5, (STACKFRAME_SZ + STACK_BIAS), %g5
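For anyone who doesn't read SPARC assembly, the lock and unlock sequences added by trampoline.diff correspond roughly to the C below. Treat it as an approximation under stated assumptions: entry_lock_sketch stands in for prom_entry_lock, fake_prom_call is a made-up placeholder for the PROM entry point, and the default sequentially consistent C11 atomics are used instead of trying to match the membar instructions exactly.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the prom_entry_lock byte; 0 means free. */
static _Atomic unsigned char entry_lock_sketch;

/*
 * Like the ldstub loop: atomically read the byte and set it to 0xff,
 * retrying while the old value was nonzero. Returns the attempt count.
 */
unsigned long entry_lock_acquire(void)
{
    unsigned long attempts = 1;
    while (atomic_exchange(&entry_lock_sketch, 0xff) != 0)
        attempts++;
    return attempts;
}

/* Like "stb %g0": store zero to mark the lock free again. */
void entry_lock_release(void)
{
    atomic_store(&entry_lock_sketch, 0);
}

/* Placeholder PROM call: just reports the lock byte it observes. */
int fake_prom_call(void *args)
{
    (void)args;
    return atomic_load(&entry_lock_sketch);
}

/* The shape of the fix: hold the lock across the whole PROM call. */
int call_prom_locked(int (*prom_call)(void *), void *args)
{
    entry_lock_acquire();
    int ret = prom_call(args);
    entry_lock_release();
    return ret;
}
```

With the placeholder above, call_prom_locked(fake_prom_call, NULL) returns 0xff because the call runs with the lock held, and the byte is back to zero afterwards.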
And, with that, it's 3:41 AM and I'm going to bed.