Edocti
Advanced technical training for the modern software engineer

System calls - part two

Some argue that on x86 a syscall implies a context switch. In fact, kernel code runs in one of two possible contexts:

  • in kernel mode, in process context (on behalf of a specific process)
  • in kernel mode, in interrupt context (not bound to any process)

The CPU executes either userland code in user space, or kernel code in one of the two above-mentioned contexts.

By context switch we usually mean changing the current process: it is what happens when the current PID changes (for example, when the scheduler preempts a process). In the following lines I'd like to stress that a syscall happens without this context switch.
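The claim can be checked from user space with a minimal sketch like the one below. It assumes a Linux system where syscall(), SYS_getpid and SYS_getppid are available; the function name pid_stable_across_syscall is made up for this illustration. It only shows that the kernel keeps working on behalf of the same process across several mode transitions; it says nothing about what the scheduler does in between.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the PID reported by the kernel is the
 * same before and after an unrelated system call, 0 otherwise.
 */
int pid_stable_across_syscall(void)
{
    /* Each syscall() enters the kernel and comes back: a mode transition. */
    pid_t before = (pid_t)syscall(SYS_getpid);
    (void)syscall(SYS_getppid);
    pid_t after = (pid_t)syscall(SYS_getpid);

    assert(before > 0);
    return before == after && after == getpid();
}
```

Calling pid_stable_across_syscall() from main() and printing the result is enough to try it out.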

__kernel_vsyscall() looks like this:

ffffe400 <__kernel_vsyscall>:
ffffe400:           51                      push   %ecx
ffffe401:           52                      push   %edx
ffffe402:           55                      push   %ebp
ffffe403:           89 e5                   mov    %esp,%ebp
ffffe405:           0f 34                   sysenter
ffffe407:           90                      nop
ffffe408:           90                      nop
ffffe409:           90                      nop
ffffe40a:           90                      nop
ffffe40b:           90                      nop
ffffe40c:           90                      nop
ffffe40d:           90                      nop
ffffe40e:           eb f3                   jmp    ffffe403 <__kernel_vsyscall+0x3>
ffffe410:           5d                      pop    %ebp
ffffe411:           5a                      pop    %edx
ffffe412:           59                      pop    %ecx
ffffe413:           c3                      ret

esp = stack pointer

eip = instruction pointer

Explanation:

- on entry to this address, registers %ecx, %edx and %ebp are saved on the user stack and %esp is copied to %ebp before executing sysenter (this %ebp later helps the kernel restore the userland stack)

- jmp __kernel_vsyscall+0x3 is a trick that lets sysenter work with 6 arguments, the standard maximum for a syscall: %ebp, which would normally carry the sixth argument, is busy holding the user stack pointer

- when a syscall is restarted, execution resumes at this jmp, so sysenter is executed twice (the second sysenter has no other impact: the syscall is just "restarted"); see https://lkml.org/lkml/2002/12/18/218 (Linus is a "disgusting pig") :)

- sysenter is executed; this brings the CPU into Ring 0 (a.k.a. CPL=0)

sysenter (fast system call facility on x86) does the following:

  • CS register set to the value of (SYSENTER_CS_MSR)
  • EIP register set to the value of (SYSENTER_EIP_MSR)
  • SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)
  • ESP register set to the value of (SYSENTER_ESP_MSR)
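The four register loads listed above can be modelled in a few lines of C. This is only a sketch of the effect on the registers, not of the hardware: the struct and function names (sysenter_msrs, cpu_regs, model_sysenter) are invented for the illustration, and the numeric values you feed it are sample values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the three SYSENTER MSRs. */
struct sysenter_msrs {
    uint32_t cs;   /* SYSENTER_CS_MSR  */
    uint32_t esp;  /* SYSENTER_ESP_MSR */
    uint32_t eip;  /* SYSENTER_EIP_MSR */
};

/* Hypothetical subset of the CPU state touched by sysenter. */
struct cpu_regs {
    uint16_t cs;
    uint16_t ss;
    uint32_t esp;
    uint32_t eip;
    unsigned cpl;  /* current privilege level */
};

/* Apply the register loads performed by sysenter to the modelled CPU. */
void model_sysenter(const struct sysenter_msrs *msr, struct cpu_regs *regs)
{
    assert(msr != 0 && regs != 0);

    regs->cs  = (uint16_t)msr->cs;        /* CS  <- SYSENTER_CS_MSR      */
    regs->eip = msr->eip;                 /* EIP <- SYSENTER_EIP_MSR     */
    regs->ss  = (uint16_t)(msr->cs + 8);  /* SS  <- SYSENTER_CS_MSR + 8  */
    regs->esp = msr->esp;                 /* ESP <- SYSENTER_ESP_MSR     */
    regs->cpl = 0;                        /* now running in Ring 0       */

    /* Note: the old user EIP and ESP are not saved anywhere by sysenter,
     * which is why __kernel_vsyscall copies %esp into %ebp beforehand. */
}
```

Because SS is derived from CS, the kernel's stack segment descriptor has to sit right after its code segment descriptor in the GDT.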

Intel defines these MSRs (model-specific registers) at the following addresses:

SYSENTER_CS_MSR  = 0x174
SYSENTER_ESP_MSR = 0x175
SYSENTER_EIP_MSR = 0x176

In Linux, these addresses are defined in /usr/src/linux/include/asm/msr.h:

#define MSR_IA32_SYSENTER_CS   0x174
#define MSR_IA32_SYSENTER_ESP  0x175
#define MSR_IA32_SYSENTER_EIP  0x176

At boot, Linux programs these MSRs (/usr/src/linux/arch/i386/kernel/sysenter.c):

wrmsr(MSR_IA32_SYSENTER_CS,  __KERNEL_CS,                    0);
wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1,                      0);
wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry,  0);

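- wrmsr writes the contents of registers EDX:EAX into the 64-bit model-specific register (MSR) specified in the ECX register; in the macro calls above, the second argument is the low half (EAX) and the third argument, 0, is the high half (EDX)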

- 'tss' refers to the Task State Segment (TSS); tss->esp1 is the value sysenter loads into ESP, and from it the kernel reaches the kernel-mode stack
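To make the EDX:EAX pairing of the wrmsr calls above concrete, here is a small user-space model. It does not touch any real MSR (that needs Ring 0); model_wrmsr, model_rdmsr and the fake_msr array are hypothetical names for this sketch, and only the three SYSENTER MSR addresses come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_SYSENTER_CS   0x174
#define MSR_IA32_SYSENTER_ESP  0x175
#define MSR_IA32_SYSENTER_EIP  0x176

/* Hypothetical backing store for the three modelled 64-bit MSRs. */
static uint64_t fake_msr[3];

/* Model of wrmsr: 'lo' plays the role of EAX, 'hi' the role of EDX. */
void model_wrmsr(uint32_t msr, uint32_t lo, uint32_t hi)
{
    assert(msr >= MSR_IA32_SYSENTER_CS && msr <= MSR_IA32_SYSENTER_EIP);
    fake_msr[msr - MSR_IA32_SYSENTER_CS] = ((uint64_t)hi << 32) | lo;
}

/* Read back the full 64-bit value of a modelled MSR. */
uint64_t model_rdmsr(uint32_t msr)
{
    assert(msr >= MSR_IA32_SYSENTER_CS && msr <= MSR_IA32_SYSENTER_EIP);
    return fake_msr[msr - MSR_IA32_SYSENTER_CS];
}
```

On 32-bit x86 all three values fit in the low half, which is why the kernel passes 0 as the high half in every call.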

Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:

  • When an x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS
  • When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.
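The second point can be illustrated with a simplified model of the I/O Permission Bitmap lookup: one bit per port, where a cleared bit means the port may be accessed and a set bit means the access is refused. The sketch assumes a full 65536-bit bitmap and ignores the IOPL comparison that decides whether the CPU consults the bitmap at all; the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IO_BITMAP_BYTES (65536 / 8)  /* one bit per I/O port */

/* Start from the restrictive state: every bit set, every port denied. */
void deny_all_ports(uint8_t *bitmap)
{
    assert(bitmap != 0);
    memset(bitmap, 0xff, IO_BITMAP_BYTES);
}

/* Clear the bit for 'port', which grants access to it. */
void allow_port(uint8_t *bitmap, uint16_t port)
{
    assert(bitmap != 0);
    bitmap[port / 8] &= (uint8_t)~(1u << (port % 8));
}

/* Returns 1 if an in/out on 'port' would pass the bitmap check. */
int port_allowed(const uint8_t *bitmap, uint16_t port)
{
    assert(bitmap != 0);
    return (bitmap[port / 8] & (1u << (port % 8))) == 0;
}
```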

So during initialization the kernel sets up these registers such that, after the sysenter instruction, ESP points to the kernel-mode stack and EIP is set to sysenter_entry.

Now the kernel executes the code shown further below. A few things to keep in mind:

- we are in Ring 0 and we are executing kernel code, but the current PID is still the old PID
- the calling user thread is still at the sysenter line
- a context switch, if one is needed, is done before returning to user space
- in Linux, context switching is done in software, not in hardware
- Linux nonetheless needs the TSS: it keeps one per CPU and updates its kernel stack pointer field at each process switch

When a transition between user mode and kernel mode is required in an operating system, a context switch is not necessary; a mode transition is not by itself a context switch.

However, depending on the operating system, a context switch may also take place at this time. (http://en.wikipedia.org/wiki/Context_switch#User_and_kernel_mode_switching)

ENTRY(sysenter_entry)
        movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
        sti
        pushl $(__USER_DS)
        pushl %ebp                 /* %ebp contains userland %esp */
        pushfl
        pushl $(__USER_CS)
        pushl $SYSENTER_RETURN     /* userland return address */
...
        pushl %eax
        SAVE_ALL                   /* pushes registers onto the stack */
        GET_THREAD_INFO(%ebp)

        /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
        testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
        call *sys_call_table(,%eax,4)
        movl %eax,EAX(%esp)

SAVE_ALL is defined as:

#define SAVE_ALL \
        cld; \
        pushl %es; \
        pushl %ds; \
        pushl %eax; \
        pushl %ebp; \
        pushl %edi; \
        pushl %esi; \
        pushl %edx; \
        pushl %ecx; \
        pushl %ebx; \
        movl $(__USER_DS), %edx; \
        movl %edx, %ds; \
        movl %edx, %es;
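The bounds check against nr_syscalls followed by the indirect call through sys_call_table is, in essence, a lookup in an array of function pointers. A rough C equivalent is sketched below. The table contents, the three-argument signature and all demo_* names are made up for the illustration; the real table and handlers live in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler signature; real handlers take up to six arguments. */
typedef long (*demo_syscall_fn)(long, long, long);

static long demo_getpid(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return 1234;  /* sample value, not a real PID */
}

static long demo_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* Stand-in for sys_call_table: the syscall number is the index. */
static const demo_syscall_fn demo_call_table[] = { demo_getpid, demo_add };

#define DEMO_NR_SYSCALLS (sizeof demo_call_table / sizeof demo_call_table[0])

long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    /* cmpl $(nr_syscalls),%eax ; jae syscall_badsys  (unsigned compare) */
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;

    /* call *sys_call_table(,%eax,4): index scaled by the pointer size */
    return demo_call_table[nr](a, b, c);
}
```

The comparison is unsigned on purpose, as with jae: a negative syscall number becomes a huge index and is rejected by the same check.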

In conclusion, what happens is not a context switch but a mode transition: the current PID stays the same. Note also that user preemption can only happen in one of two situations:

  • when returning to user space from a system call, if the need_resched flag is set
  • when returning to user space from an interrupt handler, if the need_resched flag is set

The need_resched flag is set by scheduler_tick() when a thread needs to be preempted, or by try_to_wake_up() when a higher-priority process is awakened.
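The interplay between the flag and the return path can be sketched as a toy model. Nothing here is kernel code: demo_task, demo_scheduler_tick and demo_return_to_user are hypothetical names, and the "scheduler" merely clears the flag and counts how often it ran.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced task descriptor. */
struct demo_task {
    int  pid;
    bool need_resched;
};

static int demo_schedule_calls;  /* how many times the toy scheduler ran */

/* Toy scheduler: a real one would pick another task and switch to it. */
static void demo_schedule(struct demo_task *t)
{
    t->need_resched = false;
    demo_schedule_calls++;
}

/* Timer tick: only marks the task, it never switches by itself. */
void demo_scheduler_tick(struct demo_task *t, bool timeslice_expired)
{
    assert(t != 0);
    if (timeslice_expired)
        t->need_resched = true;
}

/*
 * Common exit path of a syscall or interrupt: the flag is examined just
 * before going back to user mode. Returns true if the scheduler ran.
 */
bool demo_return_to_user(struct demo_task *t)
{
    assert(t != 0);
    if (!t->need_resched)
        return false;
    demo_schedule(t);
    return true;
}
```

The point of the split is that setting the flag is cheap and can happen in interrupt context, while the actual switch is deferred to a place where it is safe.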