Threads & Lightweight Processes

Understanding user threads vs kernel threads, thread pools, and thread models for concurrent programming in modern operating systems.

published: reading time: 27 min read author: GeekWorkBench

Threads & Lightweight Processes

Picture a factory floor. Each process is like a separate building with its own entrance, utilities, and security clearance. Inside each building, workers (threads) share the same electricity, water, and access cards — they can communicate instantly without going outside.

In computing terms: threads are the unit of execution within a process. They share the process’s address space, open files, and global variables, but each thread has its own stack, program counter, and registers. This shared everything makes threads lightweight compared to processes, and it’s why modern servers can handle millions of concurrent connections with a handful of processes.

Introduction

When you run a web server, each incoming connection doesn’t need a separate process. A single process can handle thousands of connections using threads — each thread processes a different client, but all threads share the same memory and file descriptors. This dramatically reduces overhead compared to forking a new process per connection.

Threads enable concurrency (one CPU doing multiple things by interleaving) and parallelism (multiple CPUs doing multiple things simultaneously). Understanding when to use each — and which threading model to use — is crucial for building scalable systems.

User Threads vs Kernel Threads

The fundamental distinction in threading is where threads are managed:

AspectUser ThreadsKernel Threads
Managed byUser-level library (pthread, greenlet)Operating system kernel
SchedulingUser-space library decidesKernel scheduler decides
Context switchFast (no kernel involvement)Slower (kernel mode required)
Blocking I/OCan block entire processCan block single thread
Multi-coreCannot run in parallelCan run in parallel
ExampleGNU Pth, Green threadsPOSIX pthread, Windows threads

User Threads

User threads are managed entirely in user space without kernel involvement. The thread library (e.g., GNU Pth, goroutine runtime) handles thread creation, scheduling, and synchronization. The kernel sees only the single process — it knows nothing about the user-level threads within it.

Advantages:

  • Fast creation and context switching (no system calls)
  • Custom scheduling policies
  • Portable across operating systems

Disadvantages:

  • If any user thread blocks on I/O, the entire process blocks (kernel sees only one runnable task)
  • Cannot take advantage of multi-core parallelism
  • If the user thread library has bugs, the whole process crashes
// User-level threading example (GNU Pth)
#include <pth.h>

void *worker(void *arg) {
    pth_sleep(1);  // Cooperative yield to other user threads
    return NULL;
}

int main() {
    pth_init();
    pth_t t1 = pth_spawn(PTH_ATTR_DEFAULT, worker, NULL);
    pth_join(t1, NULL);
    pth_exit(0);
}

Kernel Threads

Kernel threads are managed by the OS scheduler — each thread appears as a schedulable entity. The kernel handles context switching between threads, and threads from the same process can run on different CPU cores.

Advantages:

  • True parallelism on multi-core systems
  • Individual threads can block on I/O without affecting others
  • Kernel handles all scheduling, no custom library needed

Disadvantages:

  • Slower context switches (requires kernel mode transition)
  • Higher memory overhead (kernel data structures per thread)
  • More expensive creation
// Kernel threading example (POSIX pthread)
#include <pthread.h>

void *worker(void *arg) {
    sleep(1);  // Kernel preempts, other threads run
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

When to Use

  • I/O-bound concurrent workloads: Web servers, API handlers, database connection pools
  • CPU-bound parallel computation: Scientific computing, image processing, ML inference
  • Real-time systems: Hard deadlines require kernel-level thread control
  • High-throughput network services: Millions of connections need efficient threading

When Not to Use

  • Simple scripts: Using threads adds complexity without benefit for sequential tasks
  • CPU-bound single-task: A single-threaded process often outperforms threaded code for compute-heavy single tasks due to avoiding synchronization overhead
  • Debugging complexity: Threading bugs (deadlocks, race conditions) are notoriously hard to reproduce — only use when necessary

Thread Models

Modern systems implement threading in several models, each with tradeoffs:

graph BT
    subgraph "Threading Models"
        M1[1:1 Model<br/>One kernel thread<br/>per user thread]
        M2[N:1 Model<br/>Many user threads<br/>to one kernel thread]
        M3[N:M Model<br/>Many user threads<br/>to many kernel threads]
    end

    K[Kernel Space]
    U[User Space]

    U --> M1
    U --> M2
    U --> M3
    M1 --> K
    M3 --> K

    style M1 stroke:#1a3a1a
    style M2 stroke:#3a1a1a
    style M3 stroke:#1a1a3a

1:1 Model (One-to-One)

Each user-level thread maps to one kernel thread.

  • Used by: Linux (NPTL), Windows (Win32 Threads), Solaris
  • Pros: True parallelism, individual thread blocking works
  • Cons: Higher overhead (kernel thread creation), limited scalability (thread count limited)

N:1 Model (Many-to-One)

Many user threads map to a single kernel thread.

  • Used by: GNU Pth, early green thread implementations
  • Pros: Fast, scalable to millions of fibers
  • Cons: Cannot use multi-core, one blocked user thread blocks all

N:M Model (Many-to-Many)

Many user threads are distributed across a pool of kernel threads.

  • Used by: Solaris (LWPs), some Go runtimes (goroutines on GOMAXPROCS > 1)
  • Pros: Best of both worlds — parallelism without unlimited kernel threads
  • Cons: Complex implementation, requires coordination between user and kernel scheduling

Architecture Diagram

graph TB
    subgraph Process_Address_Space
        subgraph Code_Region[Code (.text)]
            C1[Function A]
            C2[Function B]
        end

        subgraph Shared_Data[Shared Data]
            G1[Global Variables]
            G2[Heap]
        end

        subgraph Thread_Stacks[Per-Thread Stacks]
            S1[Thread 1 Stack<br/>SP, BP, Return Addr]
            S2[Thread 2 Stack<br/>SP, BP, Return Addr]
            S3[Thread 3 Stack<br/>SP, BP, Return Addr]
        end

        subgraph Per_Thread_Registers[Per-Thread Registers]
            R1[Thread 1: PC=0x401234]
            R2[Thread 2: PC=0x402345]
            R3[Thread 3: PC=0x403456]
        end
    end

    S1 --> G1
    S2 --> G1
    S3 --> G1

    S1 --> C1
    S2 --> C2
    S3 --> C1

    style Shared_Data stroke:#1a3a1a
    style Thread_Stacks stroke:#3a1a1a
    style Per_Thread_Registers stroke:#1a1a3a

Each thread has its own stack and registers, but all threads share the code and global data regions. This shared access is what makes threads efficient for communication but also what creates synchronization challenges.


Core Concepts

Thread-Specific Data

Each thread needs its own stack for local variables, but often needs “global” data that’s actually per-thread (thread-local storage):

// Thread-local storage example
#include <pthread.h>

__thread int my_thread_id = 0;  // GCC extension

void *worker(void *arg) {
    my_thread_id = *(int *)arg;
    printf("Thread %d sees my_thread_id = %d\n", pthread_self(), my_thread_id);
    return NULL;
}

Thread Local Storage (TLS)

TLS provides thread-safe access to global-like data without explicit parameter passing. The OS allocates a separate slot for each thread’s copy:

# Python threading with local context
import threading

local_data = threading.local()

def worker(n):
    local_data.value = n
    import time
    time.sleep(0.1)
    print(f"Thread {n} sees value = {local_data.value}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()

Thread States

Threads have states similar to processes:

  • New: Thread being initialized
  • Ready: Thread waiting to run
  • Running: Thread executing on a CPU
  • Blocked: Thread waiting for I/O or synchronization
  • Terminated: Thread finished, waiting to be joined

Thread Cancellation

Threads can be cancelled, but careful handling is required:

#include <pthread.h>

void *worker(void *arg) {
    while (1) {
        // Check for cancellation point
        pthread_testcancel();

        // Do work
        printf("Working...\n");
        sleep(1);
    }
    return NULL;
}

int main() {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    sleep(5);
    pthread_cancel(t);  // Async cancellation
    pthread_join(t, NULL);
    return 0;
}

Thread Pools

Creating a new thread per task is expensive. Thread pools solve this by pre-creating threads and reusing them:

#!/usr/bin/env python3
"""Thread pool implementation in Python."""

from queue import Queue, Empty
from threading import Thread, Condition
import time

class ThreadPool:
    def __init__(self, num_threads):
        self.tasks = Queue()
        self.workers = []
        self.shutdown = False
        self.lock = Condition()

        for _ in range(num_threads):
            t = Thread(target=self._worker_loop)
            t.daemon = True
            t.start()
            self.workers.append(t)

    def _worker_loop(self):
        while True:
            try:
                task = self.tasks.get(timeout=0.1)
                if task is None:  # Poison pill
                    break
                func, args, kwargs = task
                try:
                    func(*args, **kwargs)
                except Exception as e:
                    print(f"Task failed: {e}")
                finally:
                    self.tasks.task_done()
            except Empty:
                if self.shutdown:
                    break

    def submit(self, func, *args, **kwargs):
        if self.shutdown:
            raise RuntimeError("Pool is shutting down")
        self.tasks.put((func, args, kwargs))

    def wait_completion(self):
        self.tasks.join()

    def shutdown_pool(self):
        self.shutdown = True
        for _ in self.workers:
            self.tasks.put(None)  # Poison pills
        for t in self.workers:
            t.join()


# Usage
def cpu_task(n):
    result = sum(i*i for i in range(1000000))
    print(f"Task {n} completed, result starts with {result}")

pool = ThreadPool(4)
for i in range(8):
    pool.submit(cpu_task, i)
pool.wait_completion()
pool.shutdown_pool()

Production Failure Scenarios

Deadlocks

Problem: Two threads each hold a lock the other needs, and neither will release until it acquires the other.

Symptoms: Application hangs with 100% CPU utilization (spinning on locks) or no CPU utilization (deadlocked with sleeping threads).

Mitigation:

  • Always acquire locks in a consistent global order
  • Use lock timeout with pthread_mutex_timedlock()
  • Use lock-free data structures when possible
  • Detect potential deadlocks with thread sanitizers
// Lock ordering to prevent deadlock
pthread_mutex_lock(&lock_a);
pthread_mutex_lock(&lock_b);
// Do work
pthread_mutex_unlock(&lock_b);
pthread_mutex_unlock(&lock_a);

// NEVER do this in another thread:
// pthread_mutex_lock(&lock_b);
// pthread_mutex_lock(&lock_a);

Race Conditions

Problem: Multiple threads access shared data without synchronization, leading to unpredictable results.

Symptoms: Intermittent crashes, data corruption, bizarre behavior only in production.

Mitigation:

  • Use proper synchronization (mutexes, atomic operations)
  • Prefer lock-free data structures
  • Use thread-safe primitives from day one
  • Run with TSAN_OPTIONS=halt_on_error=1 (ThreadSanitizer)
# Compile with ThreadSanitizer
gcc -fsanitize=thread -g program.c -o program

# Run to detect race conditions
./program

Thread Starvation

Problem: Some threads never get CPU time because other threads monopolize resources.

Symptoms: Certain requests timeout while others complete normally.

Mitigation:

  • Set thread priorities appropriately
  • Use fair scheduling policies (SCHED_FIFO, SCHED_OTHER)
  • Implement work-stealing queues

Premature Thread Termination

Problem: Process exits before spawned threads complete.

Mitigation: Always call pthread_join() or use pthread_detach().


Trade-off Table

Thread ModelParallelismScalabilityComplexity
1:1 (Kernel)FullLimited by thread countSimple
N:1 (User)NoneHighSimple
N:M (Hybrid)FullHighComplex
Thread CreationTime (µs)Memory
pthread_create~10-50~8MB stack
goroutine~0.2~2KB stack
fiber (fthread)~0.1~4KB stack
Use CaseBest ModelReason
Web server (blocking I/O)1:1True parallelism needed
Event-driven serverN:1 (green threads)Millions of connections
CPU-bound parallelN:MBalance parallelism and overhead

Implementation Snippets

Basic Pthread Creation and Join

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

#define NUM_THREADS 4

void *compute(void *arg) {
    int n = *(int *)arg;
    long result = 0;
    for (int i = 0; i < 1000000; i++) {
        result += i * i;
    }
    printf("Thread %d computed %ld\n", n, result);
    return NULL;  // Return value passed to pthread_join()
}

int main() {
    pthread_t threads[NUM_THREADS];
    int args[NUM_THREADS];

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        args[i] = i;
        if (pthread_create(&threads[i], NULL, compute, &args[i]) != 0) {
            perror("pthread_create failed");
            return 1;
        }
    }

    // Wait for all threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);  // NULL = don't capture return
    }

    printf("All threads completed\n");
    return 0;
}

Mutex for Shared Data

#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mutex;
    int counter;
} shared_data_t;

void *increment(void *arg) {
    shared_data_t *data = (shared_data_t *)arg;

    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&data->mutex);
        data->counter++;
        pthread_mutex_unlock(&data->mutex);
    }
    return NULL;
}

int main() {
    shared_data_t data = {.counter = 0};
    pthread_mutex_init(&data.mutex, NULL);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, &data);
    pthread_create(&t2, NULL, increment, &data);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("Final counter: %d (expected: 2000)\n", data.counter);
    pthread_mutex_destroy(&data.mutex);
    return 0;
}

Condition Variables for Thread Coordination

#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int ready;
} sync_data_t;

void *producer(void *arg) {
    sync_data_t *data = (sync_data_t *)arg;
    printf("Producer: Doing work...\n");
    sleep(1);

    pthread_mutex_lock(&data->mutex);
    data->ready = 1;
    pthread_cond_signal(&data->cond);  // Wake up consumer
    pthread_mutex_unlock(&data->mutex);
    return NULL;
}

void *consumer(void *arg) {
    sync_data_t *data = (sync_data_t *)arg;

    pthread_mutex_lock(&data->mutex);
    while (!data->ready) {
        pthread_cond_wait(&data->cond, &data->mutex);  // Atomically unlock and wait
    }
    pthread_mutex_unlock(&data->mutex);

    printf("Consumer: Data is ready!\n");
    return NULL;
}

int main() {
    sync_data_t data = {.ready = 0};
    pthread_mutex_init(&data.mutex, NULL);
    pthread_cond_init(&data.cond, NULL);

    pthread_t prod, cons;
    pthread_create(&cons, NULL, consumer, &data);
    pthread_create(&prod, NULL, producer, &data);

    pthread_join(prod, NULL);
    pthread_join(cons, NULL);

    pthread_mutex_destroy(&data.mutex);
    pthread_cond_destroy(&data->cond);
    return 0;
}

Observability Checklist

Key Metrics

# Show all threads (LWPs) for a process
ps -eLf -p <pid>

# Show thread-specific CPU usage
top -H -p <pid>

# Count threads
cat /proc/<pid>/status | grep Threads

# Check thread IDs
cat /proc/<pid>/task/*/tid

Trace Commands

# Trace thread creation
sudo perf trace -e clone

# Trace mutex operations
perf record -e 'mutex:*' -a -- sleep 10
perf report

# Show lock contention
perf sched record -- sleep 5
perf sched latency

Debugging Commands

# Attach to process and view threads
gdb -p <pid>
(gdb) info threads

# View thread stack traces
pstack <pid>

# Show where threads are blocked
cat /proc/<pid>/stack

Common Pitfalls / Anti-Patterns

Thread Safety in Security-Critical Code

Security-sensitive data (keys, tokens, credentials) must be protected with proper synchronization. Use thread-local storage for sensitive per-thread data to avoid accidentally leaking between threads.

Resource Limits

Thread counts are limited by system resources:

# Check thread limits
cat /proc/sys/kernel/threads-max
cat /proc/sys/vm/max_map_count

# Set soft limit
ulimit -u

Race Conditions as Security Vulnerabilities

Race conditions in privilege operations can lead to TOCTOU (Time-of-check to time-of-use) attacks. Use atomic operations for security-critical code paths.


Common Pitfalls / Anti-patterns

Creating too many threads

Each thread has a stack (default 8MB on Linux). Creating 1000 threads consumes 8GB just for stacks. Use thread pools instead.

Not handling thread cancellation safely

If you call pthread_cancel(), ensure cancellation points exist. Async cancellation can leave data structures corrupted.

Ignoring false sharing

Threads modifying data in the same cache line (even with different variables) cause cache line invalidation storms. Pad data structures to cache line boundaries (64 bytes).

// False sharing example
struct {
    pthread_mutex_t mutex;
    long counter;  // These two might share a cache line
} data[4];  // Each thread modifies data[i].counter

// Better: pad to prevent cache line sharing
struct {
    char pad[64];  // Cache line padding
    long counter;
};

Mixing locked and lock-free access

Don’t use mutexes for some accesses to shared data and atomics for others — this creates undefined behavior.


Quick Recap Checklist

  • Threads share process address space but have own stack and registers
  • User threads: fast but can’t use multi-core; kernel threads: slower but truly parallel
  • 1:1 model (Linux, Windows) is most common; N:M (Solaris, some Go) balances tradeoffs
  • Thread pools avoid creation overhead and limit thread count
  • Deadlocks occur when locks are acquired in inconsistent orders
  • Race conditions cause intermittent failures; use ThreadSanitizer to detect
  • Each thread needs ~8MB stack on Linux; too many threads = memory exhaustion
  • Always join or detach threads to prevent resource leaks
  • Use thread-local storage for per-thread globals without explicit parameter passing

Interview Questions

1. What is the difference between a process and a thread?

A process is an independent execution context with its own address space, while a thread is the unit of execution within a process.

Key differences:

  • Memory: Processes have separate address spaces; threads share the same address space
  • Communication: Processes require IPC (pipes, sockets, shared memory); threads can communicate directly via shared memory
  • Creation overhead: Process creation duplicates memory tables; thread creation only allocates a stack and registers
  • Fault isolation: A crash in one process doesn't affect others; a crash in one thread can crash the entire process

Think of processes as separate houses (each with own utilities) and threads as roommates in the same house (sharing electricity, water, internet).

2. Explain the three main thread models: 1:1, N:1, and N:M.

1:1 Model (One-to-One): Each user thread maps to one kernel thread. Used by Linux (NPTL), Windows. Allows true parallelism and individual thread blocking, but kernel thread count is limited.

N:1 Model (Many-to-One): Many user threads map to a single kernel thread. Used by GNU Pth and green thread implementations. Very scalable (millions of fibers) but cannot use multi-core, and one blocking user thread blocks the entire process.

N:M Model (Many-to-Many): User threads are multiplexed over a pool of kernel threads. Used by Solaris LWPs. Best of both worlds — parallelism without unbounded kernel thread count. Most complex to implement.

Modern server operating systems predominantly use the 1:1 model due to its simplicity and the fact that kernel thread limits (thousands) are sufficient for most workloads.

3. What is a deadlock and how can you prevent it?

A deadlock occurs when two or more threads are permanently blocked, each holding a lock the others need. Classic example: Thread A holds Lock 1 and waits for Lock 2, while Thread B holds Lock 2 and waits for Lock 1.

Prevention strategies:

  • Lock ordering: Always acquire multiple locks in a consistent global order. If all threads follow the same order, circular wait is impossible.
  • Lock timeout: Use pthread_mutex_timedlock() to detect potential deadlocks before they occur.
  • Fine-grained locking: Reduce the number of locks held simultaneously.
  • Lock-free data structures: Use atomic operations and lock-free algorithms when possible.
  • Avoid nested locks: If possible, design code to need only one lock at a time.

Detection tools like ThreadSanitizer can identify potential deadlock patterns before they occur in production.

4. What is a thread pool and when should you use one?

A thread pool is a pattern where a fixed number of threads are pre-created and reused to execute multiple tasks. Instead of creating a new thread per task (expensive), tasks are placed in a queue and picked up by idle threads.

Benefits:

  • Avoids thread creation overhead (10-50 µs per pthread_create)
  • Bounds memory usage (fixed thread count, each ~8MB stack)
  • Provides backpressure (queue full = reject or block new tasks)
  • Simplifies concurrency management

Use a thread pool when:

  • Handling many short-lived tasks
  • Processing items in parallel without unbounded thread growth
  • Web servers handling many concurrent requests
  • Database connection pools

Typical pool size: 1-2x CPU core count for CPU-bound work, higher for I/O-bound work. Many languages have built-in thread pools: Java's ExecutorService, Python's concurrent.futures.ThreadPoolExecutor, Go's goroutine pools.

5. What is false sharing and how does it affect performance?

False sharing occurs when multiple threads modify data that happens to reside in the same CPU cache line, even though the threads are accessing different variables. This causes cache line invalidation on every write, effectively serializing access.

Example: Thread 0 modifies counter[0], Thread 1 modifies counter[1]. If both are in the same 64-byte cache line, every write by one thread invalidates the cache line for the other, forcing memory round-trips.

Solutions:

  • Pad data structures to cache line boundaries (64 bytes on x86)
  • Use per-thread local data and aggregate later
  • Use atomic operations with cache-aligned structures

False sharing can cause 10-100x performance degradation in high-frequency code paths. It's often seen in statistics counters, ring buffers, and any frequently-updated shared state.

6. What is the difference between a mutex and a spinlock?

A mutex (mutual exclusion lock) is a synchronization primitive that puts a thread to sleep when the lock is not available. When the lock is released, the kernel wakes up one of the waiting threads.

A spinlock busy-waits (spins) on a atomic variable until the lock becomes available. It does not sleep. Spinlocks are only efficient when the critical section is very short (less than the cost of a context switch) and lock contention is low.

Using a spinlock while holding a lock that might sleep (or in interrupt context) causes deadlock. In Linux kernel code, spinlocks are used in interrupt handlers because they cannot sleep, while mutexes are used in process context.

7. What is thread-local storage (TLS) and when is it useful?

Thread-local storage (TLS) provides each thread with its own independent copy of a global variable. Instead of sharing a single variable across all threads, each thread sees its own instance.

Use cases:

  • errno — each thread needs its own error code
  • Performance counters per thread
  • Caching per-thread state without locking

Implementations: GCC's __thread keyword, pthread_key_create()/pthread_getspecific(), thread_local (C++11). The compiler/linker allocates space in the thread's stack segment for each TLS variable.

8. How does a condition variable work with a mutex?

A condition variable (condvar) allows a thread to wait for some arbitrary condition to become true, atomically releasing the associated mutex and adding itself to the wait queue.

Pattern:

  1. Thread acquires mutex
  2. Checks condition — if false, calls pthread_cond_wait()
  3. pthread_cond_wait() atomically releases the mutex and adds the thread to the wait queue
  4. When another thread signals (pthread_cond_signal) or broadcasts (pthread_cond_broadcast), the waiting thread wakes up and re-acquires the mutex
  5. Thread re-checks the condition and proceeds if true

Always re-check the condition in a loop — spurious wakeups can occur and another thread may have changed the condition between the wakeup and mutex acquisition.

9. What is the producer-consumer problem and how does a semaphore solve it?

The producer-consumer problem: one or more threads produce items and place them in a bounded buffer, while one or more threads consume items from the buffer. The buffer must not overflow ( producers must wait if full) and must not underflow ( consumers must wait if empty).

A semaphore is a non-negative integer with atomic wait() and signal() operations. A semaphore initialized to N allows up to N concurrent wait() calls before blocking.

Solution: Use two semaphores — one counting the number of items in the buffer (for consumers to wait on), and one counting the number of empty slots (for producers to wait on). A mutex provides mutual exclusion for the buffer itself.

This avoids the need for explicit while loops checking buffer state — the semaphore count tracks availability.

10. What is the readers-writers problem and what are the possible solutions?

The readers-writers problem: multiple readers can access data concurrently, but writers need exclusive access. This is common for cached data that is read frequently but updated less often.

Solutions:

  • Reader-preferring: Allow readers to proceed as long as no writer holds the lock. Can starve writers indefinitely.
  • Writer-preferring: Once a writer is waiting, new readers are blocked. Can starve readers.
  • Fair/ordered: Both readers and writers are queued FIFO. First to request the lock gets it next.

pthread provides pthread_rwlock_init() for reader-writer locks with optional writer-preference scheduling via pthread_rwlockattr_setkind_np().

11. What is the difference between pthread_join() and pthread_detach()?

pthread_join() blocks the caller until the target thread terminates. It also retrieves the thread's return value (passed to pthread_exit()). After join(), the thread's resources are freed.

pthread_detach() marks the thread as detached — its resources are freed immediately when it terminates, without needing to be joined. You cannot join a detached thread.

If you neither join nor detach a thread, its resources leak (remain allocated) after termination until the process exits. Always either join or detach every created thread.

Common pattern: main thread creates worker threads, workers detach themselves with pthread_detach() if they don't return a value that main needs.

12. What is a barrier (pthread_barrier) and when would you use it?

A barrier is a synchronization primitive that blocks threads until a fixed number of threads have reached the barrier. Then all threads are released simultaneously.

Use case: parallel algorithms where work is divided into phases — all threads must complete phase 1 before any can start phase 2.

Example: parallel sort where threads sort their chunks, then all threads must merge — no thread can merge until all chunks are sorted.

pthread_barrier_init(N) creates a barrier for N threads. Each thread calls pthread_barrier_wait(). When the Nth thread calls wait(), all N threads are released.

13. What are the thread safety issues with standard library functions?

Many C standard library functions use static internal state and are not thread-safe. Examples: strtok() (uses internal pointer), rand() (uses internal state), getenv() (implementation-dependent).

The errno variable is thread-local — each thread gets its own errno, so it is thread-safe even though it looks like a global.

Linux provides reentrant versions: strtok_r(), rand_r(), and async-signal-safe alternatives for signal handlers. man 7 pthreads documents which functions are thread-safe.

In C++11 and later, most standard library functions are specified as thread-safe. std::locale, std::cout (with some synchronization), and most containers require explicit synchronization for concurrent access.

14. What is the ABA problem in lock-free programming?

The ABA problem occurs in lock-free data structures when using compare-and-swap (CAS):

  1. Thread A reads location X, sees value A
  2. Thread B changes X from A to B and back to A
  3. Thread A reads X again, sees A (same as before), and assumes no change occurred

If A's CAS checks only the value (A), it will succeed even though X changed twice. This can cause data corruption in queues, stacks, and linked lists.

Solutions: Double-width CAS (storing a counter with the pointer), tagged pointers (using extra bits as a tag that increments on each change), or using hazard pointers to track nodes currently being accessed.

15. What is a read-copy-update (RCU) mechanism?

RCU is a synchronization mechanism used in the Linux kernel for read-heavy workloads. Readers access data without locking — they just read. Writers make a private copy, modify it, then atomically replace the pointer to the old data with the new copy.

The key insight: old readers are guaranteed to finish their read before the old data is freed. This is called "grace period" — the writer waits for all pre-existing readers to complete.

RCU is used for routing tables, file system dentries, and other read-heavy structures in the Linux kernel. It provides extremely low overhead for readers (no atomic operations, no memory barriers on most architectures) at the cost of more complex writer logic.

16. How do you debug thread-related issues like deadlocks and race conditions?

Deadlock detection: use tools like helgrind (part of Valgrind), ThreadSanitizer, or the libasan thread checker. For production, use the deadlock detection options in your mutex implementation or instrument lock acquisition with stack traces.

Race conditions: ThreadSanitizer (tsan) detects races by tracking memory access and synchronization events. Compile with -fsanitize=thread. Helgrind detects lock order inversions and races in pthread programs.

For production debugging: use SIGABRT handlers that print thread backtraces (especially useful for deadlocks — pressing Ctrl+\\ sends SIGQUIT to dump all thread stacks). Use GDB's thread apply all bt command.

17. What is the difference between kernel threads and user threads from the scheduler's perspective?

From the kernel scheduler's perspective, only kernel threads (lightweight processes, LWPs) are schedulable entities. User threads exist only in user space — the kernel knows nothing about them.

In the 1:1 model, each user thread maps to a kernel thread, so each is schedulable. In the N:1 model, all user threads map to one kernel thread — only one can run at a time. In the N:M model, a pool of kernel threads serves many user threads.

The scheduler makes decisions about which kernel thread to run on which CPU core. The threading library's user-space scheduler decides which of its user threads runs on the next available kernel thread.

18. What is a futex and why was it introduced?

A futex (Fast Userspace Mutex) is a Linux-specific syscall that provides efficient synchronization without kernel involvement in the common case.

The problem with traditional mutexes: every lock acquisition and release required a syscall (to enter the kernel), even when there was no contention. On an uncontended mutex, this was pure overhead.

Futex works by keeping the lock state in userspace memory (an atomic integer). Contention is detected via atomic operations. Only when contention is detected does the thread enter the kernel to wait on a kernel queue. When the lock is released, the releaser notifies the kernel, which wakes one waiter.

Modern mutexes (pthread_mutex, std::mutex) on Linux are all implemented on top of futexes.

19. What is a lock-free data structure and when would you use one?

A lock-free data structure uses atomic operations (CAS, fetch_add, etc.) instead of locks to ensure that at least one thread can make progress in any operation, regardless of other threads' state.

Lock-free is stronger than wait-free (every operation completes in bounded steps) and lock-free (some thread always makes progress).

Use lock-free structures when:

  • Multiple threads update a shared structure with minimal contention
  • You need better latency predictability than locks provide (no priority inversion)
  • Real-time systems where priority inheritance overhead is unacceptable

Trade-offs: lock-free algorithms are significantly more complex to write correctly. Memory management is particularly hard (hazard pointers, RCU, or epoch-based reclamation are common approaches).

20. How does Go's goroutine scheduler differ from traditional thread models?

Go uses an M:N threading model — M goroutines multiplexed onto N OS threads (typically GOMAXPROCS threads). The Go runtime scheduler manages goroutine scheduling in user space, without kernel involvement for most scheduling decisions.

Unlike pthreads where each thread is a kernel thread, Go's scheduler runs goroutines on a fixed number of OS threads (configurable, defaults to number of CPU cores). When a goroutine blocks on a syscall, Go may spin up additional OS threads.

Key advantages: creating thousands of goroutines is cheap (kilobytes of stack vs megabytes for threads). The scheduler can make better decisions than the OS because it understands Go's channel synchronization.

However, CPU-bound goroutines can starve each other without yielding, unlike preemptive OS thread scheduling. Go 1.14 introduced asynchronous preemption to handle busy-waiting loops.

Further Reading


Conclusion

Threads are the unit of execution within a process — they share the process’s address space, open files, and global variables while each maintaining its own stack, program counter, and registers. This shared-everything model makes threads lightweight compared to processes and enables efficient communication between concurrent tasks.

User threads (managed in library space) are fast but cannot achieve true multi-core parallelism because the kernel sees only one schedulable entity. Kernel threads are managed by the OS scheduler and can run on different CPU cores simultaneously. The 1:1 model (used by Linux and Windows) is the most common implementation — each user thread maps to one kernel thread. The N:M model (used by some Go runtimes) multiplexes many user threads over a pool of kernel threads for best-of-both-worlds performance.

Thread pools avoid the overhead of creating and destroying threads for short-lived tasks, while bounding memory usage by limiting the number of concurrent threads. Synchronization primitives (mutexes, condition variables, semaphores) protect shared data, but incorrect usage leads to deadlocks, race conditions, and performance pathologies like false sharing.

For your next step, explore process scheduling algorithms to understand how the OS decides which thread runs on each CPU core, or synchronization primitives to master the tools that make multi-threaded programming safe and correct.

Category

Related Posts

ASLR & Stack Protection

Address Space Layout Randomization, stack canaries, and exploit mitigation techniques

#operating-systems #aslr-stack-protection #computer-science

Assembly Language Basics: Writing Code the CPU Understands

Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.

#operating-systems #assembly-language-basics #computer-science

Boolean Logic & Gates

Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.

#operating-systems #boolean-logic-gates #computer-science