Inside The Linux Kernel: Unixforum Chicago - March 8, 2001
[Figure: processes A, B, and C running in User Mode on top of the kernel]
+ different because events causing kernel invocations are not (usually) related to the running program
the kernel must be able to identify the reason that caused the exception
two cases of deferred allocation of resources in Linux:
+ page frames (demand paging, Copy On Write)
+ floating point registers
HW CONCURRENCY (1/2)
[Figure: I/O device, IRQ line, INT pin, CPU]
+ the I/O APIC polls the devices and issues interrupts
+ no new interrupt can be issued until the CPU acknowledges the previous one
+ good kernels run with interrupts enabled most of the time
HW CONCURRENCY (2/2)
+ try to distribute kernel functions into smaller programs that can be linked separately
+ two approaches: microkernels and modules
+ Linux prefers modules for reasons of efficiency
MICROKERNELS
+ only a few functions, such as process scheduling and interprocess communication, are included in the microkernel
+ other kernel functions, such as memory allocation, file system handling, and device drivers, are implemented as system processes running in User Mode
MODULES (1/2)
+ modules are object files containing kernel functions that are linked dynamically to the kernel
+ Linux offers excellent support for implementing and handling modules
MODULES (2/2)
[Figure: an object module with external references to kernel symbols]
thanks to the kernel symbol table, it is possible to defer linking of an object module
+ modern computer architectures based on PCI buses support autoprobing of installed I/O devices while booting the system
+ recent Linux distributions put all noncritical I/O drivers into modules
+ at boot time, only the I/O modules of identified I/O devices are dynamically linked to the kernel
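The deferred linking made possible by the kernel symbol table can be illustrated with a small user-space sketch. Everything here is hypothetical and simplified for illustration: the names `ksym`, `ksym_lookup()`, `module_ref`, and `module_link()` are not kernel APIs, and the two exported functions merely stand in for kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature shared by all exported "kernel" functions. */
typedef int (*kfunc_t)(int, int);

/* Stand-ins for functions exported by the kernel. */
int k_add(int a, int b) { return a + b; }
int k_mul(int a, int b) { return a * b; }

/* Simplified symbol table entry: a name and the address it maps to. */
struct ksym {
    const char *name;
    kfunc_t addr;
};

static const struct ksym ksym_table[] = {
    { "k_add", k_add },
    { "k_mul", k_mul },
};

/* Look up an exported symbol by name; returns NULL if it is unknown. */
kfunc_t ksym_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(ksym_table) / sizeof(ksym_table[0]); i++)
        if (strcmp(ksym_table[i].name, name) == 0)
            return ksym_table[i].addr;
    return NULL;
}

/* An unresolved external reference carried by a module. */
struct module_ref {
    const char *name;   /* symbol the module needs */
    kfunc_t resolved;   /* filled in when the module is linked */
};

/* Resolve every reference at load time: 0 on success, -1 if one is missing. */
int module_link(struct module_ref *refs, size_t n)
{
    size_t i;

    assert(refs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        refs[i].resolved = ksym_lookup(refs[i].name);
        if (refs[i].resolved == NULL)
            return -1;
    }
    return 0;
}
```

A module listing `k_mul` and `k_add` as external references would link successfully, while one referring to a name missing from the table would be rejected.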
+ scenario: many tasks executing concurrently on a common address space (for instance, a web server handling thousands of requests per second)
+ problem: implementing each client request as a new process causes a lot of overhead, since process creation and elimination are time-consuming kernel functions
+ solution: introduce groups of lightweight processes called clones that share a common address space, opened files, signals, etc.
+ CPU scheduling is done at the process level in a standard way
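From User Mode, a lightweight process of this kind can be created with the Linux clone() library call. The following is a minimal sketch, assuming a Linux system with glibc; the names `worker()`, `run_clone()`, `shared_counter`, and the 64 KB stack size are illustrative choices, not taken from the slides.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CLONE_STACK_SIZE (64 * 1024)   /* arbitrary size for the clone's stack */

/* Lives in the address space shared by parent and clone. */
static volatile int shared_counter = 0;

/* Entry point of the lightweight process. */
static int worker(void *arg)
{
    shared_counter += *(int *)arg;
    return 0;
}

/*
 * Run worker() in a clone that shares address space, filesystem
 * information, opened files, and signal handlers with the caller.
 * Returns the value of the shared counter afterwards, or -1 on error.
 */
int run_clone(int increment)
{
    char *stack = malloc(CLONE_STACK_SIZE);
    pid_t pid;
    int status;

    if (stack == NULL)
        return -1;

    /* The stack grows downward on i386 and x86-64, so pass its top. */
    pid = clone(worker, stack + CLONE_STACK_SIZE,
                CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                &increment);
    if (pid == -1 || waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);
    return shared_counter;
}
```

Because of CLONE_VM, the increment performed by the clone is visible to the parent once waitpid() returns; with a plain fork() the parent's copy of the counter would stay unchanged.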
LINUX PEARLS
we selected, in a rather arbitrary way, a few pearls related to two distinct kernel design areas:
+ clever design choices
+ efficient coding
Linux source code includes two architecture-dependent directories: /usr/src/linux/arch and /usr/src/linux/include
[Directory tree: arch/ contains i386 ... s390; include/ contains asm, asm-i386 ... asm-s390]
the schedule() function invokes the switch_to() macro, written in Assembly language, to perform process switching
the code for switch_to() is stored in the include/asm/system.h file
depending on the target system, the asm symbolic link is set to asm-i386, asm-s390, etc.
VFS is an abstraction for representing several kinds of information containers (ICs) in a common way
standard operations on ICs: open(), close(), seek(), ioctl(), read(), write()
VFS associates a logical inode with each opened IC
EXAMPLES OF ICs
+ files stored in a disk-based filesystem
+ files stored in a network filesystem
+ disk partitions
+ kernel data structures (/proc filesystem)
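The idea of handling different ICs through one set of standard operations can be sketched in C with a table of function pointers. This is a simplified illustration, not the kernel's actual VFS data structures: `struct ic`, `struct ic_operations`, `vfs_read()`, and the two sample IC types are hypothetical names, and only read() is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ic;   /* an opened information container */

/* Operations that every kind of IC must provide (only read() here). */
struct ic_operations {
    long (*read)(struct ic *ic, char *buf, size_t len);
};

struct ic {
    const struct ic_operations *ops;   /* selects the IC-specific code */
    const char *data;                  /* backing data, if any */
    size_t pos;                        /* current read position */
};

/* IC type 1: a file whose contents are a NUL-terminated string in memory. */
static long mem_read(struct ic *ic, char *buf, size_t len)
{
    size_t left = strlen(ic->data) - ic->pos;

    if (len > left)
        len = left;
    memcpy(buf, ic->data + ic->pos, len);
    ic->pos += len;
    return (long)len;
}

/* IC type 2: a device-like container that always returns zero bytes. */
static long zero_read(struct ic *ic, char *buf, size_t len)
{
    (void)ic;
    memset(buf, 0, len);
    return (long)len;
}

const struct ic_operations file_ops = { mem_read };
const struct ic_operations zero_ops = { zero_read };

/* Generic entry point: callers never need to know the IC type. */
long vfs_read(struct ic *ic, char *buf, size_t len)
{
    assert(ic != NULL && ic->ops != NULL && ic->ops->read != NULL);
    return ic->ops->read(ic, buf, len);
}
```

Supporting a new kind of IC then amounts to supplying another operations table, while the generic entry point stays unchanged.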
AVOID OVER-DESIGNING
A GENERAL-PURPOSE SCHEDULER
+ the scheduler of System V Release 4 provides a set of class-independent routines that implement common services
+ object-oriented approach based on scheduling classes: the scheduler represents an abstract base class, and each scheduling class acts as a subclass
A HEATED DISCUSSION
+ If the Linux development community is not responsive to the end user community, refusing to incorporate necessary functionality on the basis of aesthetics, then that community will abandon Linux in favor of something else. Is that really what you want?
+ Yes - If it turns into a pile of shit they'll abandon it even faster. I'd rather have a decent OS that works and does the right thing for most people than a single OS that tries to do everything and does nothing right (Alan Cox)
+ classic solution: introduce an array current[NCPU] whose components point to the process descriptors of the processes running on the CPUs
+ clever solution: store the process Kernel Mode stack and the process descriptor at contiguous addresses, so that the value of the CPU stack pointer register (esp register) is linked to the address of the process descriptor
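Kernel Mode stack + process descriptor are stored in 2 contiguous page frames (8 KB)
[Figure: the 8 KB area holding the process descriptor and the variable-length Kernel Mode stack, with the esp register pointing into the stack]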
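The link between the stack pointer and the process descriptor can be simulated in user space. The sketch below assumes that the 8 KB area is aligned on an 8 KB boundary and that the descriptor sits at its lowest address, so that masking out the low 13 bits of any address inside the stack yields the descriptor. The names `struct task`, `task_from_sp()`, and `alloc_task()` are hypothetical, and a plain pointer stands in for the esp register.

```c
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192u   /* 2 page frames of 4 KB each */

/* Drastically simplified process descriptor. */
struct task {
    int pid;
};

/* Descriptor and Kernel Mode stack share one 8 KB area. */
union task_union {
    struct task task;                   /* at the lowest address */
    unsigned char stack[THREAD_SIZE];   /* stack grows down from the top */
};

/* Recover the descriptor from any address inside the 8 KB area. */
struct task *task_from_sp(const void *sp)
{
    return (struct task *)((uintptr_t)sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Allocate an 8 KB area aligned on an 8 KB boundary (C11 aligned_alloc). */
union task_union *alloc_task(int pid)
{
    union task_union *tu = aligned_alloc(THREAD_SIZE, sizeof(*tu));

    if (tu != NULL)
        tu->task.pid = pid;
    return tu;
}
```

No per-CPU array has to be maintained: a single mask operation on the stack pointer replaces the lookup in current[NCPU].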
+ trivial solution: maintain a list of timers ordered by increasing expiration time and start checking from the first element of the list
+ clever solution (timing wheel): use percolation and maintain strict ordering only for the next 256 ticks (in Linux-i386, one tick = 10 ms)
+ use several lists of timers (tv1, tv2, and so forth)
+ when tv1 becomes empty, it is replenished by emptying one slot of tv2, and so forth
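The percolation scheme can be sketched with only two levels. This is a simplified illustration rather than the kernel's code: tv1 has the 256 slots mentioned above, while the 64-slot size of tv2 and the names `struct wheel`, `wheel_add()`, and `wheel_tick()` are assumptions made for the example. Timers must expire in the future and within 256 * 64 ticks.

```c
#include <assert.h>
#include <stddef.h>

#define TV1_SIZE 256   /* one slot per tick for the next 256 ticks */
#define TV2_SIZE 64    /* assumed size: one slot per group of 256 ticks */

struct timer {
    unsigned long expires;   /* absolute expiration tick */
    int fired;               /* set to 1 when the timer expires */
    struct timer *next;
};

struct wheel {
    unsigned long now;             /* current tick */
    struct timer *tv1[TV1_SIZE];   /* strictly ordered by tick */
    struct timer *tv2[TV2_SIZE];   /* coarse buckets, unordered inside */
};

/* Put a timer into the slot matching its distance from the current tick. */
static void wheel_insert(struct wheel *w, struct timer *t)
{
    unsigned long delta = t->expires - w->now;
    struct timer **slot;

    if (delta < TV1_SIZE)
        slot = &w->tv1[t->expires % TV1_SIZE];
    else
        slot = &w->tv2[(t->expires / TV1_SIZE) % TV2_SIZE];
    t->next = *slot;
    *slot = t;
}

/* Public entry point: the timer must expire in the future, within range. */
void wheel_add(struct wheel *w, struct timer *t)
{
    assert(t->expires > w->now);
    assert(t->expires - w->now < (unsigned long)TV1_SIZE * TV2_SIZE);
    t->fired = 0;
    wheel_insert(w, t);
}

/* Advance time by one tick; returns the number of timers that expired. */
int wheel_tick(struct wheel *w)
{
    struct timer *t, *next;
    int fired = 0;

    w->now++;
    if (w->now % TV1_SIZE == 0) {
        /* tv1 has been fully consumed: percolate one slot of tv2 into it. */
        struct timer **src = &w->tv2[(w->now / TV1_SIZE) % TV2_SIZE];

        t = *src;
        *src = NULL;
        for (; t != NULL; t = next) {
            next = t->next;
            wheel_insert(w, t);
        }
    }
    /* Only the slot for the current tick has to be checked. */
    t = w->tv1[w->now % TV1_SIZE];
    w->tv1[w->now % TV1_SIZE] = NULL;
    for (; t != NULL; t = next) {
        next = t->next;
        t->next = NULL;
        t->fired = 1;
        fired++;
    }
    return fired;
}
```

Each tick inspects a single tv1 slot, and the cost of sorting far-away timers is paid only once every 256 ticks, when one tv2 slot is redistributed.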
+ deferred checking is more efficient, since system calls are issued most of the time with correct parameters
+ if an addressing error occurs in Kernel Mode, the kernel must be able to distinguish whether it is caused by a faulty process or by a kernel bug
+ in the first case, the kernel sends a SIGSEGV signal to the faulty process
+ the kernel knows from the address of the faulty instruction that it belongs to one of the functions used to access data in the process address space
+ it can then execute some kind of fixup code: as a result, the system call returns an error code
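One way to recognize the address of a faulty instruction is a sorted table that pairs each instruction allowed to fault with the address of its fixup code. The sketch below illustrates that lookup under stated assumptions: the entry layout follows the spirit of the kernel's exception table, but `find_fixup()` is a hypothetical name and the addresses used with it are made-up values.

```c
#include <stddef.h>

/* One entry per instruction that may legitimately fault in Kernel Mode. */
struct exception_table_entry {
    unsigned long insn;    /* address of the instruction allowed to fault */
    unsigned long fixup;   /* address of the fixup code to resume at */
};

/*
 * Binary search in a table sorted by increasing insn address.
 * Returns the fixup address, or 0 when the faulting instruction is not
 * listed, which means the fault was caused by a kernel bug.
 */
unsigned long find_fixup(const struct exception_table_entry *tbl,
                         size_t n, unsigned long fault_addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == fault_addr)
            return tbl[mid].fixup;
        if (tbl[mid].insn < fault_addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

When a fixup address is found, the fault handler can resume execution there, and the fixup code makes the system call return an error code; when nothing is found, the fault is treated as a kernel bug.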