Multitasking functionality
multitasking |
---|
processes |
threads or tasks |
synchronization |
Scheduler |
interrupts core |
CPU specific |
Linux kernel is a preemptive multitasking operating system. As a multitasking OS, it allows multiple processes to share processors (CPUs) and other system resources. Each CPU executes a single task at a time. However, multitasking allows each processor to switch between tasks that are being executed without having to wait for each task to finish. For that, the kernel can, at any time, temporarily interrupt a task being carried out by the processor, and replace it by another task that can be new or a previously suspended one. The operation involving the swapping of the running task is called context switch.
Processes
[edit | edit source]Process is a running user space program. Kernel starts the first process /sbin/init in function run_init_process idusing kernel_execve id. Processes occupy system resources, like memory, CPU time. System calls sys_fork id and sys_execve id are used to create new processes from user space. The process exit with an sys_exit id system call.
Linux inherits from Unix its basic process management system calls (⚲ API ↪ ⚙️ implementations):
man 2 fork ↪ kernel_clone id creates a new process by duplicating the process invoking it.
man 2 _exit ↪ do_exit id terminates the calling process "immediately". Any open file descriptors belonging to the process are closed.
man 2 wait ↪ kernel_waitid id suspends the execution of the calling process until one of its children processes terminates.
man 2 execve ↪ do_execve id runs an executable file in the context of current process, replacing the previous executable. This system call is used by family of functions of libc man 3 exec
Linux enhances the traditional Unix process API with its own system calls man 2 clone. Clone creates a child process that may share parts of its execution context with the parent. It is often used to implement threads (though programmers will typically use a higher-level interface such as man 7 pthreads, implemented on top of clone).
PID - Process identifier defined as pid_t id is unique sequential number.
man 1 ps -A lists current processes.
Syscall man 2 getpid ↪ task_tgid_vnr id return PID of the current process which internally is called TGID - thread group id.
A process can contain many threads. man 2 gettid ↪ task_pid_vnr id returns thread id.
Which internal historically is called PID.
⚠️ Warning: confusion. User space PID ≠ kernel space PID. man 1 ps -AF lists current processes and thread as LWP.
For a single thread process all these IDs are equal.
⚲ API
⚙️ Internals
- task_struct id
- pid_type id
- kernel/fork.c src
- syscalls:
- man 2 set_tid_address – set pointer to thread ID
- man 2 fork – create a child process
- man 2 vfork – create a child process and block parent
- man 2 clone – create a child process
- man 2 unshare – disassociate parts of the process execution context
- kernel/sys.c src
- syscalls:
- man 2 prctl – operations on a process or thread
- kernel/pid.c src
- syscalls:
- man 2 pidfd_open – obtain a file descriptor that refers to a process
- man 2 pidfd_getfd – obtain a duplicate of another process's file descriptor
- syscalls:
- man 2 pidfd_open – obtain a file descriptor that refers to a process
- man 2 pidfd_getfd – obtain a duplicate of another process's file descriptor
- kernel/exit.c src
- syscalls:
- man 2 exit – terminate the calling process
- man 2 exit_group – exit all threads in a process
- man 2 waitid – wait for process to change state
- man 2 waitpid – wait for process to change state
📖 References
Inter-process communication
[edit | edit source]Inter-process communication (IPC) refers specifically to the mechanisms an operating system provides to allow processes it manages to share data. Methods for achieving IPC are divided into categories which vary based on software requirements, such as performance and modularity requirements, and system circumstances. Linux inherited from Unix the following IPC mechanisms:
Signals (⚲ API ↪ ⚙️ implementations):
- man 2 kill sends signal to a process
- man 2 tgkill ↪ do_tkill id sends a signal to a thread
- man 2 process_vm_readv ↪ process_vm_rw id - zero-copy data transfer between process address spaces
🔧 TODO: man 2 sigaction man 2 signal man 2 sigaltstack man 2 sigpending man 2 sigprocmask man 2 sigsuspend man 2 sigwaitinfo man 2 sigtimedwait
- Anonymous pipes and named pipes (FIFOs) man 2 mknod ↪ do_mknodat id S_IFIFO id
- Express Data Path PF_XDP id
- Unix domain socket PF_UNIX id
- Memory-mapped files man 2 mmap ⤑ ksys_mmap_pgoff id
- Sys V IPC:
- Message queues
- Semaphores
- Shared memory: man 2 shmget, man 2 shmctl, man 2 shmat, man 2 shmdt
📖 References
Threads or tasks
[edit | edit source]In Linux kernel "thread" and "task" are almost synonyms.
💾 History: Till 2.6.39, kernel mode has only one thread protected by big kernel lock.
⚲ API
- linux/sched.h inc - the main scheduler API
- arch/x86/include/asm/current.h src
- current id and get_current id () return current task_struct id
- uapi/linux/taskstats.h inc per-task statistics
- linux/thread_info.h inc
- function current_thread_info id() returns thread_info id
- linux/sched/task.h inc - interface between the scheduler and various task lifetime (fork()/exit()) functionality
- linux/kthread.h inc - simple interface for creating and stopping kernel threads without mess.
- kthread_run id creates and wake a thread
- kthread_create id
⚙️ Internals
Scheduler
[edit | edit source]The scheduler is the part of the operating system that decides which process runs at a certain point in time. It usually has the ability to pause a running process, move it to the back of the running queue and start a new process.
Active processes are placed in an array called a run queue, or runqueue - rq id. The run queue may contain priority values for each process, which will be used by the scheduler to determine which process to run next. To ensure each program has a fair share of resources, each one is run for some time period (quantum) before it is paused and placed back into the run queue. When a program is stopped to let another run, the program with the highest priority in the run queue is then allowed to execute. Processes are also removed from the run queue when they ask to sleep, are waiting on a resource to become available, or have been terminated.
Linux uses the Completely Fair Scheduler (CFS), the first implementation of a fair queuing process scheduler widely used in a general-purpose operating system. CFS uses a well-studied, classic scheduling algorithm called "fair queuing" originally invented for packet networks. The CFS scheduler has a scheduling complexity of O(log N), where N is the number of tasks in the runqueue. Choosing a task can be done in constant time, but reinserting a task after it has run requires O(log N) operations, because the run queue is implemented as a red–black tree.
In contrast to the previous O(1) scheduler, the CFS scheduler implementation is not based on run queues. Instead, a red-black tree implements a "timeline" of future task execution. Additionally, the scheduler uses nanosecond granularity accounting, the atomic units by which an individual process' share of the CPU was allocated (thus making redundant the previous notion of timeslices). This precise knowledge also means that no specific heuristics are required to determine the interactivity of a process, for example.
Like the old O(1) scheduler, CFS uses a concept called "sleeper fairness", which considers sleeping or waiting tasks equivalent to those on the runqueue. This means that interactive tasks which spend most of their time waiting for user input or other events get a comparable share of CPU time when they need it.
The data structure used for the scheduling algorithm is a red-black tree in which the nodes are scheduler specific structures, entitled sched_entity id. These are derived from the general task_struct process descriptor, with added scheduler elements. These nodes are indexed by processor execution time in nanoseconds. A maximum execution time is also calculated for each process. This time is based upon the idea that an "ideal processor" would equally share processing power amongst all processes. Thus, the maximum execution time is the time the process has been waiting to run, divided by the total number of processes, or in other words, the maximum execution time is the time the process would have expected to run on an "ideal processor".
When the scheduler is invoked to run a new processes, the operation of the scheduler is as follows:
- The left most node of the scheduling tree is chosen (as it will have the lowest spent execution time), and sent for execution.
- If the process simply completes execution, it is removed from the system and scheduling tree.
- If the process reaches its maximum execution time or is otherwise stopped (voluntarily or via interrupt) it is reinserted into the scheduling tree based on its new spent execution time.
- The new left-most node will then be selected from the tree, repeating the iteration.
If the process spends a lot of its time sleeping, then its spent time value is low and it automatically gets the priority boost when it finally needs it. Hence such tasks do not get less processor time than the tasks that are constantly running.
An alternative to CFS is the Brain Fuck Scheduler (BFS) created by Con Kolivas. The objective of BFS, compared to other schedulers, is to provide a scheduler with a simpler algorithm, that does not require adjustment of heuristics or tuning parameters to tailor performance to a specific type of computation workload.
Con Kolivas also maintains another alternative to CFS, the MuQSS scheduler.[1]
The Linux kernel contains different scheduler classes (or policies). The Completely Fair Scheduler used nowadays by default is SCHED_NORMAL id scheduler class aka SCHED_OTHER. The kernel also contains two additional classes SCHED_BATCH id and SCHED_IDLE id, and another two real-time scheduling classes named SCHED_FIFO id (realtime first-in-first-out) and SCHED_RR id (realtime round-robin), with a third realtime scheduling policy known as SCHED_DEADLINE id that implements the earliest deadline first algorithm (EDF) added later. Any realtime scheduler class takes precedence over any of the "normal" —i.e. non realtime— classes. The scheduler class is selected and configured through the man 2 sched_setscheduler ↪ do_sched_setscheduler id system call.
Properly balancing latency, throughput, and fairness in schedulers is an open problem.[1]
⚲ API
- man 1 renice – priority of running processes
- man 1 nice – run a program with modified scheduling priority
- man 1 chrt – manipulate the real-time attributes of a process
- man 2 sched_getattr ↪ sys_sched_getattr id – get scheduling policy and attributes
- linux/sched.h inc – the main scheduler API
- man 2 getpriority, man 2 setpriority
- man 2 sched_setscheduler, man 2 sched_getscheduler
⚙️ Internals
- sched_init id is called from start_kernel id
- __schedule id is the main scheduler function.
- runqueues id, this_rq id
- kernel/sched src
- kernel/sched/core.c src
- kernel/sched/fair.c src implements SCHED_NORMAL id, SCHED_BATCH id, SCHED_IDLE id
- sched_setscheduler id, sched_getscheduler id
- task_struct id::rt_priority id and other members with less unique identifiers
🛠️ Utilities
📖 References
- man 7 sched
- Scheduling doc
- CFS
- Completely fair scheduler LWN
- Deadline Task Scheduler doc
- sched ltp
- sched_setparam ltp
- sched_getscheduler ltp
- sched_setscheduler ltp
📚 Further reading about the scheduler
Preemption
[edit | edit source]Preemption refers to the ability of the system to interrupt a running task to switch to another task. This is essential for ensuring that high-priority tasks receive the necessary CPU time and for improving the system's responsiveness. In Linux, preemption models define how and when the kernel can preempt tasks. Different models offer varying trade-offs between system responsiveness and throughput.
📖 References
- kernel/Kconfig.preempt src
- CONFIG_PREEMPT_NONE id – no forced preemption for servers
- CONFIG_PREEMPT_VOLUNTARY id – voluntary preemption for desktops
- CONFIG_PREEMPT id – preemptible except for critical sections for low-latency desktops
- CONFIG_PREEMPT_RT id – real-time preemption for highly responsive applications
- CONFIG_PREEMPT_DYNAMIC id, see /sys/kernel/debug/sched/preempt
Wait queues
[edit | edit source]A wait queue in the kernel is a data structure that allows one or more processes to wait (sleep) until something of interest happens. They are used throughout the kernel to wait for available memory, I/O completion, message arrival, and many other things. In the early days of Linux, a wait queue was a simple list of waiting processes, but various scalability problems (including the thundering herd problem) have led to the addition of a fair amount of complexity since then.
⚲ API
wait_queue_head id consists of double linked list of wait_queue_entry id and a spinlock.
Waiting for simple events:
- Use one of two methods for wait_queue_head id initialization:
- init_waitqueue_head id initializes wait_queue_head id in function context
- DECLARE_WAIT_QUEUE_HEAD id - actually defines wait_queue_head id in global context
- Wait alternatives:
- wait_event_interruptible id - preferable wait
- wait_event_interruptible_timeout id
- wait_event id - uninterruptible wait. Can cause deadlock ⚠
- wake_up id etc
👁 For example usage see references to unique suspend_queue id.
Explicit use of add_wait_queue instead of simple wait_event for complex cases:
- DECLARE_WAITQUEUE id actually defines wait_queue_entry with default_wake_function id
- add_wait_queue id inserts process in the first position of a wait queue
- remove_wait_queue id
⚙️ Internals
📖 References
Synchronization
[edit | edit source]Thread synchronization is defined as a mechanism which ensures that two or more concurrent processes or threads do not simultaneously execute some particular program segment known as mutual exclusion (mutex). When one thread starts executing the critical section (serialized segment of the program) the other thread should wait until the first thread finishes. If proper synchronization techniques are not applied, it may cause a race condition where, the values of variables may be unpredictable and vary depending on the timings of context switches of the processes or threads.
User space synchronization
[edit | edit source]Futex
[edit | edit source]A man 2 futex ↪ do_futex id (short for "fast userspace mutex") is a kernel system call that programmers can use to implement basic locking, or as a building block for higher-level locking abstractions such as semaphores and POSIX mutexes or condition variables.
A futex consists of a kernelspace wait queue that is attached to an aligned integer in userspace. Multiple processes or threads operate on the integer entirely in userspace (using atomic operations to avoid interfering with one another), and only resort to relatively expensive system calls to request operations on the wait queue (for example to wake up waiting processes, or to put the current process on the wait queue). A properly programmed futex-based lock will not use system calls except when the lock is contended; since most operations do not require arbitration between processes, this will not happen in most cases.
The basic operations of futexes are based on only two central operations futex_wait id and futex_wake id though implementation has a more operations for more specialized cases.
- WAIT (addr, val) checks if the value stored at the address addr is val, and if it is puts the current thread to sleep.
- WAKE (addr, val) wakes up val number of threads waiting on the address addr.
⚲ API
⚙️ Internals: kernel/futex.c src
📖 References
File locking
[edit | edit source]⚲ API: man 2 flock
Semaphore
[edit | edit source]💾 History: Semaphore is part of System V IPC man 7 sysvipc
⚲ API
⚙️ Internals: ipc/sem.c src
Kernel space synchronization
[edit | edit source]For kernel mode synchronization Linux provides three categories of locking primitives: sleeping, per CPU local locks and spinning locks.
Read-Copy-Update
[edit | edit source]Common mechanism to solve the readers–writers problem is the read-copy-update (RCU) algorithm. Read-copy-update implements a kind of mutual exclusion that is wait-free (non-blocking) for readers, allowing extremely low overhead. However, RCU updates can be expensive, as they must leave the old versions of the data structure in place to accommodate pre-existing readers.
💾 History: RCU was added to Linux in October 2002. Since then, there are thousandths uses of the RCU API within the kernel including the networking protocol stacks and the memory-management system. The implementation of RCU in version 2.6 of the Linux kernel is among the better-known RCU implementations.
⚲ The core API in linux/rcupdate.h inc is quite small:
- rcu_read_lock id marks an RCU-protected data structure so that it won't be reclaimed for the full duration of that critical section.
- rcu_read_unlock id is used by a reader to inform the reclaimer that the reader is exiting an RCU read-side critical section. Note that RCU read-side critical sections may be nested and/or overlapping.
- synchronize_rcu id blocks until all pre-existing RCU read-side critical sections on all CPUs have completed. Note that
synchronize_rcu
will not necessarily wait for any subsequent RCU read-side critical sections to complete.
👁 For example, consider the following sequence of events:
CPU 0 CPU 1 CPU 2 ----------------- ------------------------- --------------- 1. rcu_read_lock() 2. enters synchronize_rcu() 3. rcu_read_lock() 4. rcu_read_unlock() 5. exits synchronize_rcu() 6. rcu_read_unlock()
- Since
synchronize_rcu
is the API that must figure out when readers are done, its implementation is key to RCU. For RCU to be useful in all but the most read-intensive situations,synchronize_rcu
's overhead must also be quite small.
- Alternatively, instead of blocking, synchronize_rcu may register a callback to be invoked after all ongoing RCU read-side critical sections have completed. This callback variant is called call_rcu id in the Linux kernel.
- rcu_assign_pointer id - The updater uses this function to assign a new value to an RCU-protected pointer, in order to safely communicate the change in value from the updater to the reader. This function returns the new value, and also executes any memory barrier instructions required for a given CPU architecture. Perhaps more importantly, it serves to document which pointers are protected by RCU.
- rcu_dereference id - The reader uses this function to fetch an RCU-protected pointer, which returns a value that may then be safely dereferenced. It also executes any directives required by the compiler or the CPU, for example, a volatile cast for gcc, a memory_order_consume load for C/C++11 or the memory-barrier instruction required by the old DEC Alpha CPU. The value returned by
rcu_dereference
is valid only within the enclosing RCU read-side critical section. As withrcu_assign_pointer
, an important function ofrcu_dereference
is to document which pointers are protected by RCU.
The RCU infrastructure observes the time sequence of rcu_read_lock
, rcu_read_unlock
, synchronize_rcu
, and call_rcu
invocations in order to determine when (1) synchronize_rcu
invocations may return to their callers and (2) call_rcu
callbacks may be invoked. Efficient implementations of the RCU infrastructure make heavy use of batching in order to amortize their overhead over many uses of the corresponding APIs.
⚙️ Internals
📖 References
Sleeping locks
[edit | edit source]Mutexes
[edit | edit source]⚲ API
- linux/mutex.h inc
- linux/completion.h inc
- mutex id has owner and usage constrains, more easy to debug then semaphore
- rt_mutex id blocking mutual exclusion locks with priority inheritance (PI) support
- ww_mutex id Wound/Wait mutexes: blocking mutual exclusion locks with deadlock avoidance
- rw_semaphore id readers–writer semaphores
- percpu_rw_semaphore id
- completion id - use completion for synchronization task with ISR and task or two tasks.
💾 Historical
- semaphore id - use mutex instead semaphore if possible
- linux/semaphore.h inc
- linux/rwsem.h inc
📖 References
per CPU local lock
[edit | edit source]
On normal preemptible kernel local_lock calls preempt_disable id. On RT preemptible kernel local_lock calls migrate_disable id and spin_lock id. Any changes applied to spinlock_t also apply to local_lock.
⚲ API
📖 References
- local_lock doc
- PREEMPT_RT caveats: spinlock_t, rwlock_t, migrate_disable and local_lock doc
- Proper locking under a preemptive kernel doc
- Local locks in the kernel
💾 History: Prior to kernel version 2.6, Linux disabled interrupt to implement short critical sections. Since version 2.6 and later, Linux is fully preemptive.
Spinning locks
[edit | edit source]a spinlock is a lock which causes a thread trying to acquire it to simply wait in a loop ("spin") while repeatedly checking if the lock is available. Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread being waited on (that which holds the lock) blocks, or "goes to sleep".
Spinlocks are commonly used inside kernels because they are efficient if threads are likely to be blocked for only short periods. However, spinlocks become wasteful if held for longer durations, as they may prevent other threads from running and require rescheduling. 👁 For example kobj_kset_join id uses spinlock to protect assess to the linked list.
Enabling and disabling of kernel preemption replaced spinlocks on uniprocessor systems (disabled CONFIG_SMP id). Most spinning locks becoming sleeping locks in the CONFIG_PREEMPT_RT id kernels.
📖 References
A seqlock (short for "sequential lock") is a special locking mechanism used in Linux for supporting fast writes of shared variables between two parallel operating system routines. It is a special solution to the readers–writers problem when the number of writers is small.
It is a reader-writer consistent mechanism which avoids the problem of writer starvation. A seqlock_t id consists of storage for saving a sequence counter seqcount_t id/seqcount_spinlock_t in addition to a lock. The lock is to support synchronization between two writers and the counter is for indicating consistency in readers. In addition to updating the shared data, the writer increments the sequence counter, both after acquiring the lock and before releasing the lock. Readers read the sequence counter before and after reading the shared data. If the sequence counter is odd on either occasion, a writer had taken the lock while the data was being read and it may have changed. If the sequence counters are different, a writer has changed the data while it was being read. In either case readers simply retry (using a loop) until they read the same even sequence counter before and after.
💾 History: The semantics stabilized as of version 2.5.59, and they are present in the 2.6.x stable kernel series. The seqlocks were developed by Stephen Hemminger and originally called frlocks, based on earlier work by Andrea Arcangeli. The first implementation was in the x86-64 time code where it was needed to synchronize with user space where it was not possible to use a real lock.
⚲ API
👁 Example: mount_lock id, defined in fs/namespace.c src
📖 References
Spinning or sleeping locks
[edit | edit source]normal on preempt RT spinlock_t, raw_spinlock_t rt_mutex_base, rt_spin_lock, sleeping rwlock_t spinning sleeping local_lock preempt_disable migrate_disable, rt_spin_lock, sleeping
Low level
[edit | edit source]The compiler might optimize away or reorder writes to variables leading to unexpected behavior when variables are accessed concurrently by multiple threads.
⚲ API
- asm-generic/rwonce.h inc – prevent the compiler from merging or refetching reads or writes.
- linux/compiler.h inc
- barrier id – prevents the compiler from reordering instructions around the barrier
- asm-generic/barrier.h inc – generic barrier definitions
- arch/x86/include/asm/barrier.h src – force strict CPU ordering
- mb id – ensures that all memory operations before the barrier are completed before any memory operations after the barrier are started
⚙️ Internals
📚 Further reading
- volatile – prevents the compiler from optimizations
- Memory barrier – enforces an ordering constraint on memory operations
...
[edit | edit source]⚙️ Locking internals
- linux/lockdep.h inc – runtime locking correctness validator
- linux/debug_locks.h inc
- lib/locking-selftest.c src
- kernel/locking src
- timer_list id wait_queue_head_t id
- kernel/locking/locktorture.c src – module-based torture test facility for locking
📖 Locking references
- locking doc
- Lock types and their rules doc
- 😴 sleeping locks doc
- mutex id, rt_mutex id, semaphore id, rw_semaphore id, ww_mutex id, percpu_rw_semaphore id
- on preempt RT: local_lock, spinlock_t, rwlock_t
- 😵💫 spinning locks doc:
- raw_spinlock_t, bit spinlocks
- on non preempt RT: spinlock_t, rwlock_t
- 😴 sleeping locks doc
- Lock types and their rules doc
- Unreliable Guide To Locking doc
- Synchronization primitives
- Tickless (Full dynticks), CONFIG_NO_HZ_FULL id
Time
[edit | edit source]⚲ UAPI
- uapi/linux/time.h inc
- timespec id – nanosecond resolution
- timeval id – microsecond resolution
- timezone id
- ...
- uapi/linux/time_types.h inc
- __kernel_timespec id – nanosecond resolution, used in syscalls
- ...
⚲ API
- linux/time.h inc
- linux/ktime.h inc
- ktime_t id – nanosecond scalar representation for kernel time values
- ktime_sub id
- ...
- linux/timekeeping.h inc
- linux/time64.h inc
- uapi/linux/rtc.h inc
- linux/jiffies.h inc
⚙️ Internals
📖 References
- ktime accessors doc
- Clock sources, Clock events, sched_clock() and delay timers doc
- Time and timer routines doc
- Year 2038 problem
Interrupts
[edit | edit source]An interrupt is a signal to the processor emitted by hardware or software indicating an event that needs immediate attention. An interrupt alerts the processor to a high-priority condition requiring the interruption of the current code the processor is executing. The processor responds by suspending its current activities, saving its state, and executing a function called an interrupt handler (or an interrupt service routine, ISR) to deal with the event. This interruption is temporary, and, after the interrupt handler finishes, the processor resumes normal activities.
There are two types of interrupts: hardware interrupts and software interrupts. Hardware interrupts are used by devices to communicate that they require attention from the operating system. For example, pressing a key on the keyboard or moving the mouse triggers hardware interrupts that cause the processor to read the keystroke or mouse position. Unlike the software type, hardware interrupts are asynchronous and can occur in the middle of instruction execution, requiring additional care in programming. The act of initiating a hardware interrupt is referred to as an interrupt request - IRQ (⚙️ do_IRQ id).
A software interrupt is caused either by an exceptional condition in the processor itself, or a special instruction in the instruction set which causes an interrupt when it is executed. The former is often called a trap (⚙️ do_trap id) or exception and is used for errors or events occurring during program execution that are exceptional enough that they cannot be handled within the program itself. For example, if the processor's arithmetic logic unit is commanded to divide a number by zero, this impossible demand will cause a divide-by-zero exception (⚙️ X86_TRAP_DE id), perhaps causing the computer to abandon the calculation or display an error message. Software interrupt instructions function similarly to subroutine calls and are used for a variety of purposes, such as to request services from low-level system software such as device drivers. For example, computers often use software interrupt instructions to communicate with the disk controller to request data be read or written to the disk.
Each interrupt has its own interrupt handler. The number of hardware interrupts is limited by the number of interrupt request (IRQ) lines to the processor, but there may be hundreds of different software interrupts.
⚲ API
- /proc/interrupts
- man 1 irqtop – utility to display kernel interrupt information
- irqbalance – distribute hardware interrupts across processors on a multiprocessor system
- There are many ways to request ISR, two of them
- devm_request_threaded_irq id – preferable function to allocate an interrupt line for a managed device with a threaded ISR
- request_irq id, free_irq id – old and common functions to add and remove a handler for an interrupt line
- linux/interrupt.h inc – main interrupt support header
- irqaction id – contains handler functions
- linux/irq.h inc
- include/linux/irqflags.h inc
- linux/irqdesc.h inc
- linux/irqdomain.h inc
- irq_domain id – hardware interrupt number translation object
- irq_domain_get_irq_data id
- linux/msi.h inc – Message Signaled Interrupts
- Structure of structures:
- irq_desc id is container of
⚙️ Internals
- kernel/irq/settings.h src
- kernel/irq src
- ls /sys/kernel/debug/irq/domains/
- irq_chip id
📖 References
- IRQs doc
- Linux generic IRQ handling doc
- Message Signaled Interrupts: The MSI Driver Guide doc
- Lock types and their rules doc
- Hard IRQ Context doc
- Interrupts
👁 Examples
- dummy_irq_chip id – dummy interrupt chip implementation
- lib/locking-selftest.c src
IRQ affinity
[edit | edit source]⚲ API
- /proc/irq/default_smp_affinity
- /proc/irq/*/smp_affinity and /proc/irq/*/smp_affinity_list
Common types and functions:
- struct irq_affinity id – description for automatic irq affinity assignments, see devm_platform_get_irqs_affinity id
- struct irq_affinity_desc id – interrupt affinity descriptor, see irq_update_affinity_desc id, irq_create_affinity_masks id
- irq_set_affinity id
- irq_get_affinity_mask id
- irq_can_set_affinity id
- irq_set_affinity_hint id
- irqd_affinity_is_managed id
- irq_data_get_affinity_mask id
- irq_data_get_effective_affinity_mask id
- irq_data_update_effective_affinity id
- irq_set_affinity_notifier id
- irq_affinity_notify id
- irq_chip_set_affinity_parent id
- irq_set_vcpu_affinity id
🛠️ Utilities
- irqbalance – distributes hardware interrupts across CPUs
...
[edit | edit source]📖 References
- SMP IRQ affinity doc
- IRQ affinity, LF
- managed_irq kernel parameter, @LKML
- irqaffinity kernel parameter, @LKML
📚 Further reading
Deferred works
[edit | edit source]Scheduler context
[edit | edit source]Threaded IRQ
[edit | edit source]⚲ API
devm_request_threaded_irq id, request_threaded_irq id
ISR should return IRQ_WAKE_THREAD to run thread function
⚙️ Internals
📖 References
Work
[edit | edit source]work is a workqueue wrapper
⚲ API
- linux/workqueue.h inc
- work_struct id, INIT_WORK id, schedule_work id,
- delayed_work id, INIT_DELAYED_WORK id, schedule_delayed_work id, cancel_delayed_work_sync id
👁 Example usage samples/ftrace/sample-trace-array.c src
⚙️ Internals: system_wq id
Workqueue
[edit | edit source]⚲ API
⚙️ Internals
📖 References
Interrupt context
[edit | edit source]- linux/irq_work.h inc – framework for enqueueing and running callbacks from hardirq context
Timers
[edit | edit source]softirq timer
[edit | edit source]This timer is a softirq for periodical tasks with jiffies resolution
⚲ API
- linux/timer.h inc
- timer_list id, DEFINE_TIMER id, timer_setup id
- mod_timer id — sets expiration time in jiffies.
- del_timer id
⚙️ Internals
👁 Examples
High-resolution timer
[edit | edit source]⚲ API
- linux/hrtimer.h inc
- hrtimer id, hrtimer.function — callback
- hrtimer_init id, hrtimer_cancel id
- hrtimer_start id starts a timer with nanosecond resolution
👁 Example watchdog_enable id
⚙️ Internals
📚 HR timers references
- High-resolution timers doc
- hrtimers - subsystem for high-resolution kernel timers doc
- high resolution timers and dynamic ticks design notes doc
...
[edit | edit source]📚 Timers references
Tasklet
[edit | edit source]tasklet is a softirq, for time critical operations
⚲ API is deprecated in favor of threaded IRQs: devm_request_threaded_irq id
⚙️ Internals: tasklet_action_common id HI_SOFTIRQ, TASKLET_SOFTIRQ
Softirq
[edit | edit source]softirq is internal system facility and should not be used directly. Use tasklet or threaded IRQs
⚲ API
- cat /proc/softirqs
- open_softirq id registers softirq_action id
⚙️ Internals
⚲ API
📖 References
- Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)
- Softirq, Tasklets and Workqueues
- Timers and time management
- Deferred work, linux-kernel-labs
- Chapter 7. Time, Delays, and Deferred Work
CPU specific
[edit | edit source]🖱️ GUI
- tuna – program for tuning running processes
⚲ API
- cat /proc/cpuinfo
- /sys/devices/system/cpu/
- /sys/cpu/
- /sys/fs/cgroup/cpu/
- grep -i cpu /proc/self/status
- rdmsr – tool for reading CPU machine specific registers (MSR)
- man 1 lscpu – display information about the CPU architecture
- linux/arch_topology.h inc – arch specific cpu topology information
- linux/cpu.h inc – generic cpu definition
- linux/cpu_cooling.h inc
- linux/cpu_pm.h inc
- linux/cpufeature.h inc
- linux/cpufreq.h inc
- linux/cpuidle.h inc – a generic framework for CPU idle power management
- linux/peci-cpu.h inc
- linux/sched/cpufreq.h inc – Interface between cpufreq drivers and the scheduler
- linux/sched/cputime.h inc – cputime accounting APIs
⚙️ Internals
Cache
[edit | edit source]
⚙️ Internals
📖 References
- Working-State Power Management doc
- https://www.thinkwiki.org/wiki/How_to_use_cpufrequtils
- cpufreq ltp
SMP
[edit | edit source]This chapter is about multiprocessing and muti-core aspects of Linux kernel.
Key concepts and features of Linux SMP include:
- Symmetry: In an SMP system, all processors are considered the same without hardware hierarchy in contradiction to use of coprocessors.
- Load balancing: The Linux kernel employs load balancing mechanisms to distribute tasks evenly among available CPU cores. This prevents any one core from becoming overwhelmed while others remain underutilized.
- Parallelism: SMP enables parallel processing, where multiple threads or processes can execute simultaneously on different CPU cores. This can significantly improve the execution speed of applications that are designed to take advantage of multiple threads.
- Thread scheduling: The Linux kernel scheduler is responsible for determining which threads or processes run on which CPU cores and for how long. It aims to optimize performance by minimizing contention and maximizing CPU utilization.
- Shared memory: In an SMP system, all CPU cores typically share the same physical memory space. This allows processes and threads running on different cores to communicate and share data more efficiently.
- NUMA – Non-Uniform Memory Access: In larger SMP systems, memory access times might not be uniform due to the physical arrangement of memory banks and processors. Linux has mechanisms to handle NUMA architectures efficiently, allowing processes to be scheduled on CPUs closer to their associated memory.
- Cache coherency: SMP systems require mechanisms to ensure that all CPU cores have consistent views of memory. Cache coherency protocols ensure that changes made to shared memory locations are correctly propagated to all cores.
- Scalability: SMP systems can be scaled up to include more CPU cores, enhancing the overall computing power of the system. However, as the number of cores increases, challenges related to memory access, contention, and communication between cores may arise.
- Kernel and user space: Linux applications running in user space can take advantage of SMP without needing to be aware of the underlying hardware details. The kernel handles the management of CPU cores and resource allocation.
🗝️ Key terms
- Affinity refers to assigning a process or thread to specific CPU cores. This helps control which CPUs execute tasks, potentially improving performance by reducing data movement between cores. It can be managed using system calls or commands. Affinity can be represented as CPU bitmask: cpumask_t id or CPU affinity list: cpulist_parse id.
⚲ API
ps -PLe
– lists threads with processor that the thread last executed on (the third column PSR).- man 1 taskset – set or retrieve a process's CPU affinity
- man 2 getcpu – determine CPU and NUMA node on which the calling thread is running
- man 7 cpuset – confine processes to processor and memory node subsets
- man 8 chcpu – configure CPUs
- man 3 CPU_SET – macros for manipulating CPU sets
- grep Cpus_allowed /proc/self/status
- man 2 sched_setaffinity man 2 sched_getaffinity – set and get a thread's CPU affinity mask
- set_cpus_allowed_ptr id – common kernel function to change a task's affinity mask
- linux/smp.h inc
- linux/cpu.h inc
- linux/group_cpus.h inc: group_cpus_evenly id – groups all CPUs evenly per NUMA/CPU locality
- linux/cpuset.h inc – cpuset interface
- linux/cpu_rmap.h inc – CPU affinity reverse-map support
- linux/cpumask_types.h inc
- struct cpumask, cpumask_t id – CPUs bitmap, can be very big
- cpumask_var_t id – type for local cpumask variable, see alloc_cpumask_var id, free_cpumask_var id.
- linux/cpumask.h inc – Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number
- asm-generic/percpu.h inc
- linux/percpu-defs.h inc – basic definitions for percpu areas
- linux/percpu.h inc
- linux/percpu-refcount.h inc
- linux/percpu-rwsem.h inc
- linux/preempt.h inc
- /sys/bus/cpu
- per CPU local_lock
⚙️ Internals
- boot_cpu_init id activates the first CPU
- smp_prepare_cpus id initializes rest CPUs during boot
- cpuset_init id
- cpu_number id
- cpus_mask id – affinity of task_struct id
- CONFIG_SMP id
- trace/events/percpu.h inc
- IPI – Inter-processor interrupt
- trace/events/ipi.h inc
- kernel/irq/ipi.c src
- ipi_send_single id, ipi_send_mask id ...
- drivers/base/cpu.c src – CPU driver model subsystem support
- kernel/cpu.c src
- smpboot
- lib/group_cpus.c src
🛠️ Utilities
- irqbalance – distributes hardware interrupts across CPUs
- man 8 numactl – controls NUMA policy for processes or shared memory
📖 References
- CPUSETS of cgroup v1 doc
- CPU lists in command-line parameters doc
- nohz_full clears housekeeping.cpumasks id for tick, wq, timer, rcu, misc, and kthread in housekeeping_nohz_full_setup id
- isolcpus clears housekeeping.cpumasks id for tick, domain, and managed_irq in housekeeping_isolcpus_setup id
📚 Further reading
- CPU Isolation state of the art, LPC'23
- CPU Partitioning
- Scheduler Domains doc – the Scheduler balances CPUs (scheduling groups) within a sched domain
- CPU Isolation
- isolcpus @LKML
- nohz_full @LKML
- Functionalities of the scheduler TuneD plugin
CPU hotplug
[edit | edit source]CPU hotplugging in Linux refers to the ability to dynamically add or remove CPUs from the system without needing a reboot. This feature is crucial in environments requiring high availability and resource flexibility, such as data centers, virtualized systems, and systems that use power management aggressively.
⚲ API
- /sys/devices/system/cpu/cpu*/online
- /sys/devices/system/cpu/cpu*/hotplug/
- include/linux/cpu.h inc
- add_cpu id ...
- linux/cpuhotplug.h inc
- cpuhp_state id – CPU hotplug states
- cpuhp_setup_state id ... – setups hotplug state callbacks
- linux/cpuhplock.h inc – CPU hotplug locking
- cpus_read_lock id ...
- remove_cpu id ...
⚙️ Internals
- kernel/cpu.c src
- kernel/irq/cpuhotplug.c src
- drivers/base/cpu.c src – CPU subsystem support
📖 References
📚 Further reading
- CONFIG_CPU_HOTPLUG_STATE_CONTROL id – enables the ability to write incremental steps between "offline" and "online" states to the CPU's sysfs target file, allowing for more granular control of state transitions.
- cpuhotplug @LKML
Memory barriers (MB) are synchronization mechanisms used to ensure proper ordering of memory operations in a SMP environment. They play a crucial role in maintaining the consistency and correctness of data shared among different CPU cores or processors. MBs prevent unexpected and potentially harmful reordering of memory access instructions by the compiler or CPU, which can lead to data corruption and race conditions in a concurrent software system.
⚲ API
⚙️ Internals
📖 References
Architectures
[edit | edit source]Linux CPU architectures refer to the different types of central processing units (CPUs) that are compatible with the Linux operating system. Linux is designed to run on a wide range of CPU architectures, which allows it to be utilized on various devices, from smartphones to servers and supercomputers. Each architecture has its own unique features, advantages, and design considerations.
Architectures are classified by family (e.g. x86, ARM), word or long int size (e.g. CONFIG_32BIT id, CONFIG_64BIT id).
Some functions with different implementations for different CPU architectures:
- do_boot_cpu id > start_secondary id > cpu_init id
- setup_arch id, start_thread id, get_current id, current id
⚲ API
⚙️ Arch internals
📖 References
📚 Further reading about multitasking, scheduling and CPU
- ↑ a b Malte Skarupke. "Measuring Mutexes, Spinlocks and how Bad the Linux Scheduler Really is".