Linux Insides - Interrupts - Linux Interrupts 6

The document discusses the handling of non-maskable interrupts (NMI) in the Linux kernel. When an NMI occurs, the processor immediately calls the NMI handler. The NMI handler saves registers and stack pointer to the stack. It then checks if the NMI is nested by checking a value previously pushed to the stack. If not nested, it pushes a 1 to mark that an NMI is currently executing, to prevent issues from nested NMIs corrupting the stack.


Interrupts and Interrupt Handling. Part 6.

Non-maskable interrupt handler

This is the sixth part of the Interrupts and Interrupt Handling in the Linux kernel chapter. In the previous part we saw the implementation of some exception handlers: the General Protection Fault exception, the divide exception, the invalid opcode exception, etc. As promised, in this part we will look at the implementations of the remaining handlers:

* Non-Maskable interrupt;
* BOUND Range Exceeded Exception;
* Coprocessor exception;
* SIMD coprocessor exception.

So, let's start.

Non-Maskable interrupt handling

A Non-Maskable interrupt is a hardware interrupt that cannot be ignored by standard masking techniques. In general, a non-maskable interrupt can be generated in either of two ways:

* External hardware asserts the non-maskable interrupt pin on the CPU.
* The processor receives a message on the system bus or the APIC serial bus with a delivery mode of NMI.

When the processor receives an NMI from one of these sources, it handles it immediately by calling the NMI handler pointed to by interrupt vector number 2 (see the table in the first part). We already filled the Interrupt Descriptor Table with the vector number, the address of the nmi interrupt handler and the NMI_STACK Interrupt Stack Table entry:

```C
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
```

in the trap_init function which is defined in the arch/x86/kernel/traps.c source code file. In the previous parts we saw that the entry points of all interrupt handlers are defined with the:

```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm
```

macro from the arch/x86/entry/entry_64.S assembly source code file. The handler of the Non-Maskable interrupt, however, is not defined with this macro. It has its own entry point:

```assembly
ENTRY(nmi)
...
...
...
END(nmi)
```

in the same arch/x86/entry/entry_64.S assembly file. Let's dive into it and try to understand how the Non-Maskable interrupt handler works. The nmi handler starts with a call of the:

```assembly
PARAVIRT_ADJUST_EXCEPTION_FRAME
```

macro, but we will not dive into its details in this part, because it is related to the Paravirtualization stuff which we will see in another chapter. After this we save the content of the rdx register on the stack:

```assembly
pushq %rdx
```

and check whether cs was the kernel segment when the non-maskable interrupt occurred:

```assembly
cmpl $__KERNEL_CS, 16(%rsp)
jne first_nmi
```

The __KERNEL_CS macro is defined in arch/x86/include/asm/segment.h and represents the second descriptor in the Global Descriptor Table:

```C
#define GDT_ENTRY_KERNEL_CS 2
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
```

More about the GDT you can read in the second part of the Linux kernel booting process chapter. If cs is not the kernel segment, it means that it is not a nested NMI and we jump to the first_nmi label. Let's consider this case. First of all we put the address of the current stack pointer into rdx and push 1 onto the stack in the first_nmi label:

```assembly
first_nmi:
	movq	(%rsp), %rdx
	pushq	$1
```

Why do we push 1 on the stack? As the comment says: We allow breakpoints in NMIs. On x86_64, as on other architectures, the CPU will not execute another NMI until the first NMI is complete. An NMI finishes with the iret instruction, like other interrupts and exceptions do. The NMI handler, however, may trigger a page fault, a breakpoint or another exception which also uses the iret instruction. If this happens while in NMI context, the CPU leaves NMI context and a new NMI may come in: the iret used to return from those exceptions re-enables NMIs, and we get nested non-maskable interrupts. The problem is that the NMI handler will not return to the state it was in when the exception triggered, but instead to a state that allows new NMIs to preempt the running NMI handler. If another NMI comes in before the first NMI handler is complete, the new NMI will write all over the preempted NMI's stack: we can have nested NMIs where the next NMI uses the top of the stack of the previous NMI, and a nested non-maskable interrupt would corrupt the stack of the previous one. That's why we have allocated space on the stack for a temporary variable: we will check whether it was set by a previously executing NMI and clear it if this is not a nested NMI. We push 1 into this previously allocated space on the stack to denote that a non-maskable interrupt is currently executing. Remember that when an NMI or another exception occurs we have the following stack frame:

```
+------------------------+
|          SS            |
|          RSP           |
|         RFLAGS         |
|          CS            |
|          RIP           |
+------------------------+
```

and also an error code if an exception has it. So, after all of these manipulations our stack
frame will look like this:

```
+------------------------+
|          SS            |
|          RSP           |
|         RFLAGS         |
|          CS            |
|          RIP           |
|          RDX           |
|           1            |
+------------------------+
```

In the next step we allocate yet another 40 bytes on the stack:

```assembly
subq	$(5*8), %rsp
```

and push a copy of the original stack frame after the allocated space:

```assembly
.rept 5
pushq	11*8(%rsp)
.endr
```

with the .rept assembly directive. Why do we need a copy of the original stack frame? Generally we need two copies of the interrupt stack: the saved stack frame and the copied stack frame. Here we push the original stack frame into the saved stack frame, which is located after the just allocated 40 bytes (the copied stack frame). The saved stack frame is used to fix up the copied stack frame that a nested NMI may change. The second, the copied stack frame, is modified by any nested NMI to let the first NMI know that a second NMI was triggered and that the first NMI handler should be repeated. Ok, we have made the first copy of the original stack frame, now it is time to make the second copy:

```assembly
addq	$(10*8), %rsp
.rept 5
pushq	-6*8(%rsp)
.endr
subq	$(5*8), %rsp
```

After all of these manipulations our stack frame will look like this:

```
+-------------------------+
| original SS             |
| original Return RSP     |
| original RFLAGS         |
| original CS             |
| original RIP            |
+-------------------------+
| temp storage for rdx    |
+-------------------------+
| NMI executing variable  |
+-------------------------+
| copied SS               |
| copied Return RSP       |
| copied RFLAGS           |
| copied CS               |
| copied RIP              |
+-------------------------+
| Saved SS                |
| Saved Return RSP        |
| Saved RFLAGS            |
| Saved CS                |
| Saved RIP               |
+-------------------------+
```

After this we push a dummy error code on the stack, as we already did in the previous exception handlers, and allocate space for the general purpose registers on the stack:

```assembly
pushq	$-1
ALLOC_PT_GPREGS_ON_STACK
```

We already saw the implementation of the ALLOC_PT_GPREGS_ON_STACK macro in the third part of the interrupts chapter. This macro is defined in arch/x86/entry/calling.h and allocates another 120 bytes on the stack for the general purpose registers, from rdi to r15:

```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
addq	$-(15*8+\addskip), %rsp
.endm
```

After the space allocation for the general registers we can see the call of paranoid_entry:

```assembly
call	paranoid_entry
```

We remember this label from the previous parts. It pushes the general purpose registers on the stack, reads the MSR_GS_BASE Model Specific Register and checks its value. If the value of MSR_GS_BASE is negative, we came from kernel mode and just return from paranoid_entry; otherwise we came from usermode and need to execute the swapgs instruction, which exchanges the user gs with the kernel gs:

```assembly
ENTRY(paranoid_entry)
	cld
	SAVE_C_REGS 8
	SAVE_EXTRA_REGS 8
	movl	$1, %ebx
	movl	$MSR_GS_BASE, %ecx
	rdmsr
	testl	%edx, %edx
	js	1f
	SWAPGS
	xorl	%ebx, %ebx
1:	ret
END(paranoid_entry)
```

Note that after the swapgs instruction we zero the ebx register. The next time we check the content of this register: if we executed swapgs, ebx must contain 0, and 1 otherwise. In the next step we store the value of the cr2 control register in the r12 register, because the NMI handler can cause a page fault and corrupt the value of this control register:

```assembly
movq	%cr2, %r12
```

Now it is time to call the actual NMI handler. We put the address of the pt_regs into rdi, the error code into rsi and call the do_nmi handler:

```assembly
movq	%rsp, %rdi
movq	$-1, %rsi
call	do_nmi
```

We will return to do_nmi a little later in this part, but first let's look at what occurs after do_nmi finishes its execution. After the do_nmi handler has finished we check the cr2 register, because we can get a page fault while do_nmi is performed; if we got one we restore the original cr2, otherwise we jump to label 1. After this we test the content of the ebx register (remember, it must contain 0 if we used the swapgs instruction and 1 if we didn't) and execute SWAPGS_UNSAFE_STACK if it contains 1, or jump to the nmi_restore label. The SWAPGS_UNSAFE_STACK macro just expands to the swapgs instruction. In the nmi_restore label we restore the general purpose registers, free the space allocated on the stack for these registers, clear our temporary variable and exit from the interrupt handler with the INTERRUPT_RETURN macro:

```assembly
	movq	%cr2, %rcx
	cmpq	%rcx, %r12
	je	1f
	movq	%r12, %cr2
1:
	testl	%ebx, %ebx
	jnz	nmi_restore
nmi_swapgs:
	SWAPGS_UNSAFE_STACK
nmi_restore:
	RESTORE_EXTRA_REGS
	RESTORE_C_REGS
	/* Pop the extra iret frame at once */
	REMOVE_PT_GPREGS_FROM_STACK 6*8
	/* Clear the NMI executing stack variable */
	movq	$0, 5*8(%rsp)
	INTERRUPT_RETURN
```

where INTERRUPT_RETURN is defined in arch/x86/include/asm/irqflags.h and just expands to the iret instruction. That's all.

Now let's consider the case when another NMI occurs while a previous NMI has not yet finished its execution. You may remember from the beginning of this part that we made a check on whether we came from userspace and jump to first_nmi in that case:

```assembly
cmpl	$__KERNEL_CS, 16(%rsp)
jne	first_nmi
```

Note that in this case it is a first NMI every time, because if the first NMI caught a page fault, a breakpoint or another exception, it will be executing in kernel mode. If we didn't come from userspace, first of all we test our temporary variable:

```assembly
cmpl	$1, -8(%rsp)
je	nested_nmi
```

and if it is set to 1 we jump to the nested_nmi label. If it is not 1, we test the IST stack. In the case of nested NMIs we check whether we are above the repeat_nmi; in this case we ignore it, otherwise we check whether we are above end_repeat_nmi and jump to the nested_nmi_out label.

Now let's look at the do_nmi exception handler. This function is defined in the arch/x86/kernel/nmi.c source code file and, like all exception handlers, takes two parameters:

* address of the pt_regs;
* error code.

The do_nmi starts with a call of the nmi_nesting_preprocess function and ends with a call of nmi_nesting_postprocess. The nmi_nesting_preprocess function checks whether we are (unlikely) working with the debug stack; if we are on the debug stack, it sets the update_debug_stack per-cpu variable to 1 and calls the debug_stack_set_zero function from arch/x86/kernel/cpu/common.c. This function increases the debug_stack_use_ctr per-cpu variable and loads a new Interrupt Descriptor Table:

```C
static inline void nmi_nesting_preprocess(struct pt_regs *regs)
{
	if (unlikely(is_debug_stack(regs->sp))) {
		debug_stack_set_zero();
		this_cpu_write(update_debug_stack, 1);
	}
}
```

The nmi_nesting_postprocess function checks the update_debug_stack per-cpu variable which we set in nmi_nesting_preprocess and resets the debug stack, or in other words it loads the original Interrupt Descriptor Table. After the call of the nmi_nesting_preprocess function, we can see the call of nmi_enter in do_nmi. nmi_enter increases the lockdep_recursion field of the interrupted process, updates the preempt counter and informs the RCU subsystem about the NMI. There is also an nmi_exit function that does the same stuff as nmi_enter, but vice-versa. After nmi_enter we increase __nmi_count in the irq_stat structure and call the default_do_nmi function. First of all, in default_do_nmi we check the address of the previous nmi and update the address of the last nmi to the actual one:

```C
if (regs->ip == __this_cpu_read(last_nmi_rip))
	b2b = true;
else
	__this_cpu_write(swallow_nmi, false);

__this_cpu_write(last_nmi_rip, regs->ip);
```

After this, first of all we need to handle CPU-specific NMIs:

```C
handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
```

And then non-specific NMIs, depending on their reason:

```C
reason = x86_platform.get_nmi_reason();

if (reason & NMI_REASON_MASK) {
	if (reason & NMI_REASON_SERR)
		pci_serr_error(reason, regs);
	else if (reason & NMI_REASON_IOCHK)
		io_check_error(reason, regs);

	__this_cpu_add(nmi_stats.external, 1);
	...
}
```

That's all.

Range Exceeded Exception

The next exception is the BOUND range exceeded exception. The BOUND instruction determines whether the first operand (array index) is within the bounds of an array specified by the second operand (bounds operand). If the index is not within bounds, a BOUND range exceeded exception, or #BR, occurs. The handler of the #BR exception is the do_bounds function defined in arch/x86/kernel/traps.c. The do_bounds handler starts with a call of the exception_enter function and ends with a call of exception_exit:

```C
prev_state = exception_enter();

if (notify_die(DIE_TRAP, "bounds", regs, error_code,
		X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
	goto exit;
...
...
...
exception_exit(prev_state);
return;
```

After we have got the state of the previous context, we add the exception to the notify_die chain, and if it returns NOTIFY_STOP we return from the exception. More about notify chains and the context tracking functions you can read in the previous part. In the next step we enable interrupts if they were disabled, with the conditional_sti function that checks the IF flag and calls local_irq_enable depending on its value:

```C
conditional_sti(regs);

if (!user_mode(regs))
	die("bounds", regs, error_code);
```

and if we didn't come from user mode we send the SIGSEGV signal with the die function. After this we check whether MPX is enabled or not; if this feature is disabled we jump to the exit_trap label:

```C
if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
	goto exit_trap;
}
```

where we execute the do_trap function (more about it you can find in the previous part):

```C
exit_trap:
	do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
	exception_exit(prev_state);
```

If the MPX feature is enabled, we check the BNDSTATUS with the get_xsave_field_ptr function; if it is zero, it means that MPX was not responsible for this exception:

```C
bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
if (!bndcsr)
	goto exit_trap;
```

After all of this, there is only one remaining way for MPX to be responsible for this exception. We will not dive into the details of the Intel Memory Protection Extensions in this part, but will see them in another chapter.
Coprocessor exception and SIMD exception

The next two exceptions are the x87 FPU Floating-Point Error exception, or #MF, and the SIMD Floating-Point Exception, or #XF. The first exception occurs when the x87 FPU has detected a floating point error, for example divide by zero, numeric overflow, etc. The second exception occurs when the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception; it can be the same kind of error as for the x87 FPU. The handlers for these exceptions, do_coprocessor_error and do_simd_coprocessor_error, are defined in arch/x86/kernel/traps.c and are very similar to each other. They both call the math_error function from the same source code file, but pass it different vector numbers. do_coprocessor_error passes the X86_TRAP_MF vector number to math_error:

```C
dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
	enum ctx_state prev_state;

	prev_state = exception_enter();
	math_error(regs, error_code, X86_TRAP_MF);
	exception_exit(prev_state);
}
```

and do_simd_coprocessor_error passes X86_TRAP_XF to the math_error function:

```C
dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
	enum ctx_state prev_state;

	prev_state = exception_enter();
	math_error(regs, error_code, X86_TRAP_XF);
	exception_exit(prev_state);
}
```

First of all, the math_error function takes the current interrupted task, the address of its FPU and a string which describes the exception, adds the exception to the notify_die chain and returns from the exception handler if the chain returns NOTIFY_STOP:

```C
struct task_struct *task = current;
struct fpu *fpu = &task->thread.fpu;
siginfo_t info;
char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
						"simd exception";

if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
	return;
```

After this we check whether we are in kernel mode, and if so we try to fix the exception with the fixup_exception function. If we cannot, we fill the task with the exception's error code and vector number and die:

```C
if (!user_mode(regs)) {
	if (!fixup_exception(regs)) {
		task->thread.error_code = error_code;
		task->thread.trap_nr = trapnr;
		die(str, regs, error_code);
	}
	return;
}
```

If we came from user mode, we save the fpu state, fill the task structure with the vector number of the exception, and fill the siginfo_t with the signal number, errno, the address where the exception occurred and the signal code:

```C
fpu__save(fpu);

task->thread.trap_nr	= trapnr;
task->thread.error_code = error_code;
info.si_signo		= SIGFPE;
info.si_errno		= 0;
info.si_addr		= (void __user *)uprobe_get_trap_addr(regs);
info.si_code		= fpu__exception_code(fpu, trapnr);
```

After this we check the signal code, and if it is zero we return:

```C
if (!info.si_code)
	return;
```

Otherwise we send the SIGFPE signal in the end:

```C
force_sig_info(SIGFPE, &info, task);
```

That's all.
Conclusion

This is the end of the sixth part of the Interrupts and Interrupt Handling chapter. In this part we saw the implementation of some exception handlers: the non-maskable interrupt and the SIMD and x87 FPU floating point exceptions. Finally, we have finished with the trap_init function in this part and will go ahead in the next part. Our next point is the external interrupts and the early_irq_init function from init/main.c.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

* NMI
* x86_64
* iret
* page fault
* breakpoint
* Global Descriptor Table
* stack frame
* Model Specific register
* percpu
* RCU
* MPX
* x87 FPU
* opcode
* Non-Maskable
* BOUND instruction
* CPU socket
* Interrupt Descriptor Table
* Interrupt Stack Table
* Paravirtualization
* .rept
* SIMD
* Coprocessor
* General Protection Fault
* Previous part
