System calls – part one

Definition: a system call is a controlled entry point into the kernel, through which user programs can request operations that require a higher privilege level than their own.

A few words about privileges… In general, each CPU / microcontroller has a set of operating modes. Some of these modes exist for protection; without going too deep into details, the current mode determines:

What memory areas can be read/written (depending on the memory map, some ranges of the address space may point to I/O devices, as with memory-mapped I/O)
What instructions can be executed (privileged machine instructions are refused in the less privileged modes)

The current IA-32 architecture has four so-called privilege/protection rings. Ring 0 has the most privileges, rings 1 and 2 come next, and ring 3 has the fewest. [Terminology: the ring in which the CPU is currently executing is its current privilege level (CPL), so we sometimes see ring 0..3 and sometimes CPL-0..CPL-3.] Software runs in one of these rings, and the operating system manages the switch from one ring to another through CPU instructions that perform the transition. On Linux the kernel runs in ring 0 and user applications run in ring 3; the architecture intended rings 1 and 2 for code such as device drivers, but Linux does not use them, and its drivers run in ring 0 with the rest of the kernel. Ring 3 restricts access to operations (such as changing memory mappings) that could affect the correct behavior of other applications. Since only kernel code runs in ring 0, the kernel decides what runs in which ring. An application running in a low-privilege ring cannot move itself to a more privileged ring at will: it is not allowed to execute the instructions that change this CPU state, and it can only enter the kernel through the entry points the kernel has set up.

What happens on a Linux system call? Each OS exposes an API to user processes (I/O, process management, IPC, etc.). When a user application makes a Linux system call:

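The application calls a wrapper function in the C library (glibc), e.g. open, which prepares the call. (A higher-level function such as fopen is not itself a system call wrapper; it ends up calling open.)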
The wrapper marshals the parameters for the kernel by placing them in registers: the system call number goes in %eax, and argument values originally on the user stack are moved to registers (e.g. %ecx, %edx). The current %esp is saved and typically copied into %ebp so it can be restored after the transition.
The wrapper executes the fast system call instruction (sysenter on x86), transferring control to the kernel entry point (e.g. sysenter_entry in /usr/src/linux/arch/i386/kernel/entry.S).
On some setups the wrapper calls __kernel_vsyscall instead of executing the instruction itself; the kernel provides its address through the AT_SYSINFO entry of the ELF auxiliary vector.

Of course, user programs can issue the system-call instruction directly, but wrapper functions provide a more convenient, portable interface.