Sockets & Network IPC

Learn about Unix domain sockets, TCP/UDP sockets for local and network IPC, socket pairs, and advanced socket options for high-performance inter-process communication.

published: reading time: 28 min read author: GeekWorkBench

If pipes and message queues are the local delivery trucks of the IPC world, sockets are the postal service — they can deliver data not just between processes on the same machine, but across the network to any reachable host. Sockets are the most versatile and widely-used form of IPC on Unix systems, and understanding them is essential for every systems programmer. Whether you are building a web server, a database client, a microservice communication layer, or a local daemon, sockets are the foundational building block.

Introduction

A socket is a bidirectional communication endpoint. Unlike pipes which are unidirectional and unnamed, sockets provide bidirectional, connection-oriented (TCP) or connectionless (UDP) communication that can be local (Unix domain) or network-based (TCP/IP, UDP/IP).

There are two main families of sockets:

Unix domain sockets (AF_UNIX / AF_LOCAL) — Use filesystem paths as addresses. Data never leaves the kernel. Used for local inter-process communication with TCP-like or UDP-like semantics. Typically faster than loopback TCP because the network stack is bypassed, though slower than shared memory because every message is still copied through the kernel.

Internet domain sockets (AF_INET / AF_INET6) — Use IP addresses and port numbers. Data flows through the full network stack. Used for network communication between processes on different hosts.

Within each family, there are two main socket types:

SOCK_STREAM (TCP in the internet domain) — Connection-oriented, reliable, byte-stream, no message boundaries. Similar to a pipe but bidirectional.

SOCK_DGRAM (UDP in the internet domain) — Connectionless, message-oriented with preserved boundaries, and unreliable when carried over IP. Each send delivers a discrete message.

The socket API was originally developed for BSD Unix and standardized in POSIX. It consists of socket(), bind(), listen(), accept(), connect(), send(), recv(), close(), and related functions.
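The call sequence above can be sketched in Python, whose socket module wraps the same functions. This is a minimal single-process sketch, not a production server: the helper names serve_once and round_trip and the demo.sock file name are illustrative assumptions.

```python
import os
import socket
import tempfile
import threading


def serve_once(server: socket.socket) -> None:
    """Accept one connection, read a request, and send a reply."""
    conn, _ = server.accept()          # accept(): take one queued connection
    with conn:
        data = conn.recv(1024)         # recv(): read up to 1024 bytes
        conn.sendall(b"echo:" + data)  # send(): reply on the same connection


def round_trip(message: bytes) -> bytes:
    """Run a one-shot Unix domain server and client in one process."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "demo.sock")  # hypothetical socket path
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)  # socket()
    server.bind(path)                  # bind(): attach the filesystem address
    server.listen(1)                   # listen(): start queuing connections
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(path)       # connect(): reach the listening socket
            client.sendall(message)
            reply = client.recv(1024)
    finally:
        worker.join()
        server.close()                 # close(): release the descriptor
        os.unlink(path)                # the socket file outlives close()
        os.rmdir(directory)
    return reply
```

Binding and listening happen before the client connects, so the connection waits in the accept queue until the worker thread calls accept().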

When to Use / When Not to Use

Use Unix domain sockets when:

  • You need IPC between processes on the same machine with TCP-like semantics
  • You need bidirectional communication
  • You want a simpler alternative to shared memory (with built-in synchronization at the kernel level)
  • You need to use select()/poll()/epoll for multiplexing multiple connections
  • You need a connection-oriented channel with backpressure (a full socket buffer blocks or rejects further writes)

Use TCP sockets when:

  • You need network communication between different machines
  • You need reliable, ordered, connection-oriented delivery
  • You need to handle many concurrent connections efficiently

Use UDP sockets when:

  • You need low-latency communication and can tolerate some packet loss
  • Your application already tolerates gaps in the data and handles brief disconnections gracefully
  • You are doing broadcast or multicast communication

Do not use sockets when:

  • You need maximum throughput for local communication (shared memory may be faster)
  • You need simple unidirectional streaming (pipes are simpler)
  • You need message queue semantics with priorities (message queues fit better)
  • You are communicating between threads in the same process (use condition variables or channels)

Architecture or Flow Diagram

sequenceDiagram
    participant S as Server
    participant K as Kernel (Socket Layer)
    participant C as Client

    Note over S,C: Stream socket server-client flow (Unix domain shown, TCP is analogous)

    S->>K: socket(AF_UNIX, SOCK_STREAM, 0)
    K-->>S: fd = socketfd

    S->>K: bind(socketfd, "/tmp/mysock")
    S->>K: listen(socketfd, backlog=5)

    C->>K: socket(AF_UNIX, SOCK_STREAM, 0)
    K-->>C: fd = clientfd
    C->>K: connect(clientfd, "/tmp/mysock")

    K->>K: Enqueue connection in accept queue
    Note over K: Kernel maintains accept queue

    K-->>S: Notification: new connection
    S->>K: accept(socketfd)
    K-->>S: connfd = new connection socket

    C->>K: send(clientfd, "Hello from client", 17)
    K-->>C: 17 bytes accepted

    K-->>S: data available on connfd
    S->>K: recv(connfd, buf, 1024)
    K-->>S: "Hello from client"

    S->>K: send(connfd, "Hi back", 7)
    K-->>C: "Hi back"

    Note over S,C: close both sockets when done

Core Concepts

Unix Domain Socket Creation and Binding

#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // memset, strncpy

int main() {
    int server_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server_fd == -1) {
        perror("socket");
        exit(1);
    }

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/my_socket", sizeof(addr.sun_path) - 1);

    // Remove existing socket file (avoid EADDRINUSE)
    unlink("/tmp/my_socket");

    if (bind(server_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        perror("bind");
        exit(1);
    }

    if (listen(server_fd, 5) == -1) {
        perror("listen");
        exit(1);
    }

    printf("Server listening on %s\n", addr.sun_path);

    // Accept a connection
    int client_fd = accept(server_fd, NULL, NULL);
    if (client_fd == -1) {
        perror("accept");
        exit(1);
    }

    char buf[256];
    ssize_t n = recv(client_fd, buf, sizeof(buf) - 1, 0);
    if (n > 0) {
        buf[n] = '\0';
        printf("Received: %s\n", buf);
    }

    send(client_fd, "Hello from server", 17, 0);

    close(client_fd);
    close(server_fd);
    unlink("/tmp/my_socket");

    return 0;
}

Socket Pairs (Anonymous Connected Sockets)

A socket pair is a pair of connected sockets where data written to one can be read from the other. Created with socketpair(), they are useful for creating bidirectional communication channels between related processes:

#include <sys/socket.h>
#include <sys/wait.h>   // wait
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>     // exit
#include <string.h>

int main() {
    int sv[2];  // Two connected sockets

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        exit(1);
    }

    pid_t pid = fork();
    if (pid == 0) {
        // Child: keep sv[0], close the end that belongs to the parent
        close(sv[1]);
        char buf[128];
        // Leave room for the terminating NUL
        ssize_t n = recv(sv[0], buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            printf("Child received: %s\n", buf);
        }
        close(sv[0]);
        _exit(0);
    } else {
        // Parent: keep sv[1], close the end that belongs to the child
        close(sv[0]);
        send(sv[1], "Hello from parent!", 18, 0);
        close(sv[1]);
        wait(NULL);
    }

    return 0;
}

TCP Server with select() Multiplexing

#include <sys/socket.h>
#include <sys/select.h>   // select, fd_set, FD_* macros
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PORT 8080
#define MAX_CLIENTS 10

int main() {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd == -1) {
        perror("socket");
        exit(1);
    }
    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);

    if (bind(server_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        perror("bind");
        exit(1);
    }
    if (listen(server_fd, 5) == -1) {
        perror("listen");
        exit(1);
    }

    fd_set readfds;
    int client_fds[MAX_CLIENTS] = {0};

    while (1) {
        FD_ZERO(&readfds);
        FD_SET(server_fd, &readfds);
        int maxfd = server_fd;

        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (client_fds[i] > 0) {
                FD_SET(client_fds[i], &readfds);
                if (client_fds[i] > maxfd) maxfd = client_fds[i];
            }
        }

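        int activity = select(maxfd + 1, &readfds, NULL, NULL, NULL);
        if (activity < 0) {
            perror("select");
            continue;  // readfds is undefined after an error; rebuild and retry
        }

        // New connection?
        if (FD_ISSET(server_fd, &readfds)) {
            int client_fd = accept(server_fd, NULL, NULL);
            if (client_fd == -1) {
                perror("accept");
            } else {
                int stored = 0;
                for (int i = 0; i < MAX_CLIENTS; i++) {
                    if (client_fds[i] == 0) {
                        client_fds[i] = client_fd;
                        stored = 1;
                        break;
                    }
                }
                // No free slot: close instead of leaking the descriptor
                if (!stored) close(client_fd);
            }
        }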

        // Client data?
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (client_fds[i] > 0 && FD_ISSET(client_fds[i], &readfds)) {
                char buf[1024];
                ssize_t n = recv(client_fds[i], buf, sizeof(buf), 0);
                if (n <= 0) {
                    close(client_fds[i]);
                    client_fds[i] = 0;
                } else {
                    // Echo back
                    send(client_fds[i], buf, n, 0);
                }
            }
        }
    }

    return 0;
}

Production Failure Scenarios

EADDRINUSE — Socket Already Bound

If you try to bind() to a path or port that is already in use (for example, because a previous server crashed without cleanup, or its connections are still in TIME_WAIT), bind() fails with EADDRINUSE. For TCP, set SO_REUSEADDR before bind() to allow reusing the address:

int opt = 1;
setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
// Now bind() can succeed even if previous socket on same port is in TIME_WAIT

Note: SO_REUSEADDR has no effect on Unix domain sockets. There, bind() fails with EADDRINUSE whenever the socket file still exists, so you need to unlink() the path before rebinding.
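The Unix domain case can be reproduced in a few lines of Python. In this sketch, bind_unix and stale_bind_errno are hypothetical helper names and stale.sock is an arbitrary file name; the point is that closing a socket leaves its file behind, and only unlink() clears the way for a new bind().

```python
import errno
import os
import socket
import tempfile


def bind_unix(path: str) -> socket.socket:
    """Bind a Unix domain stream socket, removing a stale socket file first."""
    try:
        os.unlink(path)            # clear a leftover file from a crashed run
    except FileNotFoundError:
        pass
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    return sock


def stale_bind_errno() -> int:
    """Return the errno raised when binding to a leftover socket file."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "stale.sock")
    first = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    first.bind(path)
    first.close()                  # closing does NOT remove the file
    second = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    code = 0
    try:
        second.bind(path)          # fails: the path still exists
    except OSError as exc:
        code = exc.errno
    finally:
        second.close()
    bind_unix(path).close()        # succeeds once the path is unlinked
    os.unlink(path)
    os.rmdir(directory)
    return code
```

Unconditionally unlinking the path also removes the socket of a server that is still running, so a real daemon would usually pair this with a lock file or a connect() probe first.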

Connection Refused Under Load

When a server is overwhelmed with connections, the listen queue (backlog) fills up. New connections are refused with ECONNREFUSED or silently dropped (depending on OS). Set an appropriate backlog and monitor queue depth.

Mitigation: Increase the listen backlog with listen(fd, backlog). The kernel caps this at somaxconn (viewable in /proc/sys/net/core/somaxconn). Also implement connection limiting at the application level.

Socket Leak — File Descriptors Not Closed

Every socket creates a file descriptor. If you do not close sockets properly (especially on error paths), file descriptors leak. Over time, you exhaust the process's file descriptor limit and new socket calls fail with EMFILE (or ENFILE if the system-wide limit is reached).

Mitigation: Always close sockets in all code paths. Use a wrapper function that handles cleanup, or use close() in every error case. Monitor file descriptor usage with lsof -p <pid> or ls /proc/<pid>/fd/.
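In Python, one way to get "close on every path" is to use the socket as a context manager. The probe function below is a hypothetical example, not part of any library: whether connect() succeeds, fails, or times out, the descriptor is released when the with block exits.

```python
import socket


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection succeeds; the socket is always closed."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return True
        except OSError:
            # Refused, unreachable, or timed out: the with block still closes sock
            return False
```

After the with block, sock.fileno() returns -1, which is a quick way to confirm in a test that nothing leaked.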

Partial Send / recv

send() may transmit fewer bytes than requested if the kernel’s send buffer is full (especially on non-blocking sockets). recv() may return fewer bytes than requested. Always check return values and handle partial operations.

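Mitigation: Loop until all data is sent, for example with a sendall()-style wrapper that retries after a partial send (Python's socket.sendall() does this for you; in C you write the loop yourself). For recv, accumulate data until a complete message is received (based on length prefix or delimiter).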
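A minimal Python sketch of the length-prefix approach follows. The 4-byte big-endian header and the names send_msg, recv_msg, and recv_exact are assumptions chosen for illustration, not a standard wire format.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping over partial recv() results."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before the message was complete")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one message; sendall() loops over partial sends internally."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock: socket.socket) -> bytes:
    """Receive one message framed by send_msg()."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

A receiver that accepts data from untrusted peers would also cap the announced length before allocating for it.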

UDP Packet Loss and Reordering

UDP does not guarantee delivery. Packets can be lost, duplicated, or arrive out of order. Applications using UDP must implement their own reliability mechanisms.

Mitigation: Design for some packet loss. Implement sequence numbers and acknowledgments at the application layer if reliability is needed. Consider using TCP instead for critical data.
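As a rough illustration of application-level sequence numbers, the sketch below prefixes each datagram with a counter and classifies what the receiver sees. The header layout and the SequenceTracker class are hypothetical; acknowledgments and retransmission are left out.

```python
import struct

SEQ = struct.Struct("!I")  # hypothetical 4-byte sequence-number header


def pack(seq: int, payload: bytes) -> bytes:
    """Prefix a payload with its sequence number."""
    return SEQ.pack(seq) + payload


def unpack(datagram: bytes):
    """Split a datagram into (sequence number, payload)."""
    (seq,) = SEQ.unpack_from(datagram)
    return seq, datagram[SEQ.size:]


class SequenceTracker:
    """Classify arriving sequence numbers on the receiving side."""

    def __init__(self) -> None:
        self.highest = -1
        self.seen = set()  # unbounded here; a real receiver would use a window

    def classify(self, seq: int) -> str:
        if seq in self.seen:
            return "duplicate"
        self.seen.add(seq)
        if seq < self.highest:
            return "reordered"     # arrived after a later packet
        status = "gap" if seq > self.highest + 1 else "in_order"
        self.highest = seq
        return status
```

A "gap" result only means something is missing so far; the packet may still arrive later and be reported as "reordered".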

Trade-off Table

| Feature | Unix Domain Socket | TCP Socket | UDP Socket | Named Pipe (FIFO) |
|---|---|---|---|---|
| Scope | Local only | Local + network | Local + network | Local only |
| Connection model | Connection-oriented (SOCK_STREAM) or datagram | Connection-oriented | Connectionless | None (opened by path) |
| Message boundaries | SOCK_STREAM: no, SOCK_DGRAM: yes | No (byte stream) | Yes (preserved) | No (byte stream) |
| Bidirectional | Yes | Yes | Yes (send/recv on same socket) | No (unidirectional) |
| Select/poll/epoll | Yes | Yes | Yes | Yes (via file descriptor) |
| Performance | Very high (kernel, no network) | High (kernel, local); moderate (network) | High (no connection setup or retransmission) | High |
| Reliability | Depends on socket type | Reliable, ordered | Unreliable, best-effort | Reliable (kernel buffer) |
| Typical use | Local daemons, high-perf IPC | Network services, clients | Streaming, low-latency | Simple cross-process streaming |

Implementation Snippet(s)

Python: Unix Domain Socket Client

import socket
import os

SOCKET_PATH = "/tmp/my_socket"

# Create Unix domain socket
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

try:
    client.connect(SOCKET_PATH)
    client.sendall(b"Hello from Python client")
    data = client.recv(1024)
    print(f"Server said: {data.decode()}")
except ConnectionRefusedError:
    print("Server not running")
except Exception as e:
    print(f"Error: {e}")
finally:
    client.close()

Bash: Using netcat for Socket Testing

# Connect to a Unix domain socket (needs a netcat build with -U support, e.g. OpenBSD netcat)
# nc -U /tmp/my_socket

# Listen on a Unix domain socket
# nc -l -U /tmp/my_socket

# Test TCP server
# nc localhost 8080

# Send HTTP request to test server
# echo -e "GET / HTTP/1.0\r\n\r\n" | nc localhost 80

# Check what is listening on TCP ports
ss -tlnp | grep :8080
netstat -tlnp | grep :8080

Observability Checklist

  • Open sockets: lsof -p <pid> shows all file descriptors including sockets
  • Listening ports: ss -tlnp (preferred over netstat on modern Linux) shows listening sockets with process info
  • Established connections: ss -tnp shows all established TCP connections
  • Socket buffers: Check with cat /proc/sys/net/core/rmem_default and /proc/sys/net/core/wmem_default
  • Connection state: ss -ti shows TCP connection state, retransmissions, congestion window
  • strace: strace -e trace=network -p <pid> to trace socket operations (on x86-64, send() and recv() appear as sendto and recvfrom)
  • perf: perf stat -e syscalls:sys_enter_bind,syscalls:sys_enter_connect to measure socket call frequency

Security Considerations

Socket permissions (Unix domain): Unix domain sockets respect filesystem permissions on the socket file path. Use appropriate permissions on the directory containing the socket. Consider using a directory with 0700 permissions for sensitive IPC.

Network socket exposure: TCP/UDP sockets bound to 0.0.0.0 or :: are accessible from the network. Always bind to localhost (127.0.0.1 or ::1) if you only want local access. Use firewall rules for additional protection.

DoS via connection flood: An attacker can exhaust server resources by opening many connections (SYN flood for TCP, connection flood for any SOCK_STREAM listener). Use connection limits, rate limiting, and proper timeout configuration. Consider using a load balancer or SYN cookies for TCP.

Socket sniffing: A local process that can reach the socket path with sufficient permissions can connect to it and talk to the server; observing other processes' traffic generally requires elevated privileges. Use filesystem permissions and separate namespaces for sensitive IPC.

Audit: Socket operations (bind, listen, connect) can be recorded by the Linux audit subsystem when syscall rules are configured. For compliance, monitor for unexpected socket creation or binding to unusual ports.

Common Pitfalls / Anti-patterns

  1. Not handling EINTR on socket calls — same as pipes and other blocking calls, socket operations can return EINTR. Handle it or use SA_RESTART.

  2. Ignoring partial send/recv — send() and recv() may process fewer bytes than requested. Loop until all data is transferred.

  3. Forgetting SO_REUSEADDR — not setting this causes “Address already in use” errors after server restart, especially during development.

  4. Buffer overflow in recv — always bounds-check the buffer size. Malicious clients can send more data than expected.

  5. Using UDP for reliable data — UDP makes no guarantee of delivery, order, or uniqueness. If you need reliability on top of UDP, implement sequence numbers, ACKs, and retransmission.

  6. Not setting socket timeouts — default socket operations may block forever. Set SO_RCVTIMEO and SO_SNDTIMEO for production code.

  7. Leaving sockets in TIME_WAIT too long — after closing a connection, the kernel holds the port in TIME_WAIT state. Use SO_REUSEADDR to allow rebinding, or design your protocol to use longer-lived connections.

  8. Not draining sockets in edge-triggered epoll — if using epoll in edge-triggered mode and not draining all pending data, you will not be notified again for the data left behind. Use level-triggered mode or read until EAGAIN.
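For the timeout pitfall in the list above, Python exposes a library-level analogue of SO_RCVTIMEO and SO_SNDTIMEO through settimeout(). The helper name recv_with_timeout is an illustrative assumption.

```python
import socket


def recv_with_timeout(sock: socket.socket, nbytes: int, seconds: float):
    """Return received bytes, or None if nothing arrives within `seconds`."""
    sock.settimeout(seconds)       # applies to later blocking calls on sock
    try:
        return sock.recv(nbytes)
    except socket.timeout:         # alias of TimeoutError on Python 3.10+
        return None
```

Note that settimeout() is implemented inside Python with a non-blocking socket and a poll, so it does not set the SO_RCVTIMEO option itself, but the effect for the caller is the same: the call cannot block forever.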

Quick Recap Checklist

  • Sockets provide bidirectional IPC for both local (AF_UNIX) and network (AF_INET) communication
  • SOCK_STREAM is connection-oriented, reliable, byte-stream (like a bidirectional pipe)
  • SOCK_DGRAM is connectionless, unreliable, message-oriented (preserves boundaries)
  • Unix domain sockets (AF_UNIX) are the fastest local IPC mechanism that still offers the full socket API
  • Set SO_REUSEADDR before bind() on TCP listening sockets to handle server restarts gracefully; for Unix domain sockets, unlink() the stale path instead
  • Handle EINTR on all socket calls and partial send/recv by looping
  • Use select()/poll()/epoll() for multiplexing many connections in a single thread
  • Monitor socket file descriptor usage to prevent leaks; always close in all code paths
  • UDP requires application-level reliability if your use case needs it
  • Socket buffer sizes affect performance — tune with SO_RCVBUF and SO_SNDBUF

Interview Questions

1. What is the difference between AF_UNIX and AF_INET sockets?

AF_UNIX (also called AF_LOCAL) Unix domain sockets use a filesystem path as the address. Data never leaves the kernel — it is copied directly from sender's buffer to receiver's buffer through the kernel's socket infrastructure. They are used for local IPC between processes on the same machine and offer the highest performance.

AF_INET (IPv4) and AF_INET6 (IPv6) are internet domain sockets that use IP address and port number pairs as addresses. Data flows through the full TCP/IP network stack — through the kernel's networking layers and potentially across a physical network. They support communication with processes on remote machines.

Both support SOCK_STREAM (reliable, connection-oriented, byte-stream) and SOCK_DGRAM (message-oriented, unreliable). Unix domain sockets are generally faster since they avoid network stack overhead, but they are limited to local communication.

2. How does select() work with sockets, and what are its limitations?

select() allows a process to monitor multiple file descriptors, blocking until one or more become "ready" (readable, writable, or have an error condition). Internally, select() copies three bitmap sets (readfds, writefds, exceptfds) into the kernel, which checks each fd's state. When any fd is ready, the kernel updates the bitmaps in-place and returns.

Limitations:

  • O(n) scanning: On return, you must iterate through all fds to find which are ready, even if only one was ready. Poor scaling with thousands of fds.
  • Bitmap limit: The fd sets use fixed-size bitmaps (typically FD_SETSIZE, often 1024), limiting the number of fds you can monitor.
  • Reset on return: The fd sets are modified by select(), so you must reinitialize them on each call.

Modern alternatives: poll() solves the fd limit issue (uses array instead of bitmap). epoll() (Linux) solves both — it uses a kernel event list and returns only ready fds, scales to millions of fds, and supports edge-triggered mode. kqueue() (BSD/macOS) provides similar functionality.
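Python's selectors module wraps whichever of these mechanisms the platform offers (epoll on Linux, kqueue on BSD/macOS, otherwise poll or select). The sketch below is an illustration with made-up labels "a" and "b": it registers two sockets and shows that only the ready ones are returned.

```python
import selectors
import socket


def demo():
    """Register two sockets and report which ones select() says are readable."""
    sel = selectors.DefaultSelector()
    a_tx, a_rx = socket.socketpair()
    b_tx, b_rx = socket.socketpair()
    try:
        sel.register(a_rx, selectors.EVENT_READ, data="a")
        sel.register(b_rx, selectors.EVENT_READ, data="b")

        b_tx.sendall(b"ping")      # only b_rx has data now
        first = sorted(key.data for key, _ in sel.select(timeout=1.0))

        a_tx.sendall(b"pong")      # b_rx was never drained, so it stays ready
        second = sorted(key.data for key, _ in sel.select(timeout=1.0))
    finally:
        sel.close()
        for sock in (a_tx, a_rx, b_tx, b_rx):
            sock.close()
    return first, second
```

The second call reports both sockets because the selector is level-triggered: the unread data on the "b" socket keeps it ready until it is consumed.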

3. What is the difference between SOCK_STREAM and SOCK_DGRAM?

SOCK_STREAM provides a connection-oriented, reliable, byte-stream channel. It behaves like a bidirectional pipe — you send bytes, they arrive in order at the other end, with no message boundaries. If you send 100 bytes then 50 bytes, the receiver might read 150 bytes at once, or 50 bytes then 100 bytes, or any other division. TCP is the protocol that implements SOCK_STREAM over IP.

SOCK_DGRAM provides a connectionless, unreliable, message-oriented channel. Each send() delivers a discrete message (datagram) that arrives as a unit. Messages have boundaries — a single recv() returns exactly one datagram. Packets can be lost, duplicated, or arrive out of order. UDP is the protocol that implements SOCK_DGRAM over IP.

Choose SOCK_STREAM when you need reliable ordered delivery and are prepared to frame messages yourself (with a length prefix or delimiter). Choose SOCK_DGRAM when you need low latency and can tolerate packet loss, or when each message is self-contained and should not be split across recv calls.
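The boundary difference is easy to observe with two local socket pairs. This Python sketch uses Unix domain pairs so it runs without a network; read_n and boundaries_demo are illustrative names.

```python
import socket


def read_n(sock: socket.socket, n: int) -> bytes:
    """Collect exactly n bytes from a stream socket."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            break
        data += chunk
    return data


def boundaries_demo():
    """Send the same two messages over a stream pair and a datagram pair."""
    s_tx, s_rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    d_tx, d_rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with s_tx, s_rx, d_tx, d_rx:
        s_tx.sendall(b"abc")
        s_tx.sendall(b"de")
        stream = read_n(s_rx, 5)   # one undivided run of bytes

        d_tx.send(b"abc")
        d_tx.send(b"de")
        datagrams = [d_rx.recv(1024), d_rx.recv(1024)]  # one message per recv
    return stream, datagrams
```

On the stream side nothing marks where "abc" ended and "de" began, while each datagram recv() returns exactly one message even though the buffer could hold both.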

4. What is the TIME_WAIT state and how does SO_REUSEADDR help?

When a TCP connection is closed, the endpoint that initiates the close (the one sending the first FIN) enters TIME_WAIT state for a duration of 2 * Maximum Segment Lifetime (MSL), typically 60 seconds on Linux. During this time, the socket pair (IP:port combination) cannot be reused. This exists so that delayed packets from the old connection are not mistaken for packets from a new connection using the same tuple.

This causes problems when restarting a server — you try to bind to port 8080 but it is still in TIME_WAIT from the previous run.

SO_REUSEADDR tells the kernel to allow binding to an address that is in TIME_WAIT. For server sockets, you set this option before calling bind(). It has no effect on Unix domain sockets; there you unlink() the stale socket path instead.

int opt = 1;
setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
bind(sockfd, ...);

This does not violate the TIME_WAIT safety property: the kernel still refuses the bind if another socket is actively listening on the same address and port. Incoming packets for the old connection will still be handled correctly.

5. How do you handle partial sends and receives with sockets?

send() and recv() may transmit or receive fewer bytes than requested when the kernel's socket buffer is full (send) or no data is immediately available (recv). Both return the number of bytes actually transferred, or -1 on error.

Handling partial sends — always loop until all data is sent:

ssize_t sendall(int sockfd, const void *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = send(sockfd, (const char *)buf + total, len - total, 0);
        if (n == -1) {
            if (errno == EINTR) continue;  // Retry
            return -1;
        }
        total += n;
    }
    return total;
}

Handling partial receives — accumulate until a complete message is received. Use a length prefix protocol (4-byte length header) so the receiver knows when a message is complete:

// Read exactly len bytes
ssize_t recvall(int sockfd, void *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = recv(sockfd, (char *)buf + total, len - total, 0);
        if (n == 0) return total;  // EOF
        if (n == -1) {
            if (errno == EINTR) continue;
            return -1;
        }
        total += n;
    }
    return total;
}

For streaming protocols, consider using a circular buffer and maintaining a receive state machine that tracks how many bytes of the current message have been received.
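One possible shape for such a receive state machine is sketched below in Python. It assumes the 4-byte length-prefix framing mentioned earlier and uses a growable bytearray instead of a true circular buffer; FrameDecoder is a hypothetical class name.

```python
import struct


class FrameDecoder:
    """Incrementally split a byte stream into length-prefixed messages."""

    HEADER = struct.Struct("!I")   # assumed 4-byte big-endian length header

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Add newly received bytes and return every message now complete."""
        self._buf += data
        messages = []
        while len(self._buf) >= self.HEADER.size:
            (length,) = self.HEADER.unpack_from(self._buf)
            end = self.HEADER.size + length
            if len(self._buf) < end:
                break              # body incomplete: wait for more data
            messages.append(bytes(self._buf[self.HEADER.size:end]))
            del self._buf[:end]    # drop the consumed frame
        return messages
```

Each chunk returned by recv() is passed to feed(), whatever its size; the decoder keeps partial headers and bodies between calls, which also makes it usable with non-blocking sockets.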

6. How does accept() behave with regard to the listen backlog queue?

The kernel maintains two queues for a listening socket: the accept queue (completed connections ready for accept()) and the SYN queue (incomplete connections: the server has received a SYN and replied with SYN-ACK, but has not yet received the final ACK). When listen(fd, backlog) is called, the backlog argument specifies the maximum size of the accept queue. The actual maximum is also bounded by /proc/sys/net/core/somaxconn.

When the accept queue is full, Linux by default ignores the final ACK of new handshakes rather than queuing the connection, causing clients to retransmit until they time out. accept() simply removes connections from the accept queue; it does not affect the SYN queue. High-performance servers must balance backlog size (memory) against the rate of new connections arriving. Linux also provides TCP_FASTOPEN, which lets returning clients carry data in the SYN and saves a round trip on connection setup.

7. What is the difference between edge-triggered and level-triggered notification in epoll?

Level-triggered mode (the default for epoll) notifies you whenever a file descriptor is ready, as long as the condition persists. If you do not drain all available data when notified, you will be notified again on the next epoll_wait call. Edge-triggered mode notifies you only when the state changes from not-ready to ready, or when new data arrives on a file descriptor that was previously empty.

Edge-triggered mode requires careful handling: when notified, you must read (or write) in a loop until the call returns EAGAIN, because data left behind produces no further notification. Edge-triggered is typically used with non-blocking sockets in high-performance servers to reduce repeated wakeups. Level-triggered is easier to program but can generate more events in high-throughput scenarios.
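The drain loop that edge-triggered mode depends on looks roughly like this in Python, where EAGAIN surfaces as BlockingIOError. The drain function is a hypothetical helper and assumes the socket is already non-blocking.

```python
import socket


def drain(sock: socket.socket, chunk_size: int = 4096):
    """Read from a non-blocking socket until it would block or the peer closes.

    Returns (data, peer_closed).
    """
    chunks = []
    peer_closed = False
    while True:
        try:
            chunk = sock.recv(chunk_size)
        except BlockingIOError:    # EAGAIN / EWOULDBLOCK: nothing left for now
            break
        if not chunk:              # zero-length read: the peer closed
            peer_closed = True
            break
        chunks.append(chunk)
    return b"".join(chunks), peer_closed
```

Calling this from an edge-triggered handler empties the socket on every notification, so the next arrival is a fresh not-ready to ready transition and triggers a new event.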

8. How does SO_KEEPALIVE work and what are its configuration parameters?

SO_KEEPALIVE enables periodic probes on idle TCP connections to detect if the peer has crashed or become unreachable. When enabled and no data has been sent or received for a configurable period, the kernel sends a keepalive probe: a segment with no payload whose sequence number is one less than the next expected, which forces the peer to reply. If the peer responds with an ACK, the connection is alive. If probes fail repeatedly, the connection is closed with ETIMEDOUT.

Configurable via socket options: TCP_KEEPIDLE (time before first probe, default 7200s on Linux), TCP_KEEPINTVL (interval between probes, default 75s), and TCP_KEEPCNT (number of failed probes before giving up, default 9). These can be set per socket with setsockopt() and override the system-wide defaults. Keepalive is useful for detecting dead peers in long-lived connections like database connections, but with the defaults a dead connection goes undetected for 7200 + 9 × 75 = 7875 seconds, a little over two hours.
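A hedged Python sketch of tightening these values per socket follows. The function name enable_keepalive and the chosen numbers are illustrative; the TCP_KEEP* constants are only set when the platform's socket module defines them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 5) -> None:
    """Turn on keepalive and shorten the probe schedule where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are platform-specific, so skip any that are missing.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these example values a dead peer is detected after roughly idle + interval × count = 60 + 10 × 5 = 110 seconds of silence instead of more than two hours.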

9. What is the difference between Unix domain sockets and named pipes (FIFOs)?

Both Unix domain sockets and FIFOs use filesystem paths as addresses and respect filesystem permissions. Key difference: a FIFO is a unidirectional byte stream with plain read()/write() semantics (write() returns once data is copied into the kernel buffer), while Unix domain sockets are bidirectional and can be stream-oriented (SOCK_STREAM) or datagram-oriented (SOCK_DGRAM).

A FIFO descriptor can be monitored with select()/poll()/epoll(), but a FIFO has no notion of separate connections: all writers share one stream, so a server cannot tell clients apart or reply to one of them the way accept() allows with sockets. A Unix domain SOCK_STREAM socket pair behaves like a bidirectional pipe, while a FIFO requires two named pipes for bidirectional communication. Unix sockets support ancillary messages (file descriptors, credentials), byte-stream ordering, and connection-oriented communication, making them more versatile than FIFOs for most IPC scenarios.

10. How does UDP connect() work and what advantages does it provide for UDP sockets?

Calling connect() on a UDP socket does not perform a handshake (unlike TCP). Instead, it records a default peer address in the kernel's socket state. Subsequent send() and recv() calls use that peer, datagrams from any other source are discarded, and an ICMP port unreachable reply is reported asynchronously as ECONNREFUSED on a later call. Without connect(), every sendto() must specify the destination.

Connected UDP provides: automatic use of the peer address (no need to specify on every send), immediate error notification when the peer is unreachable (ICMP errors delivered to socket), and on some systems, improved performance due to reduced address resolution overhead. Connected UDP still has no delivery guarantees but is more efficient for bidirectional UDP communication with a single peer.
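A small loopback sketch in Python shows the address association. The function name connected_udp_demo is an assumption, and the example relies on loopback delivery, which is not guaranteed in general; the one-second timeout keeps it from hanging if the datagram is lost.

```python
import socket


def connected_udp_demo():
    """Send one datagram over a connected UDP socket on loopback."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    with receiver, sender:
        receiver.bind(("127.0.0.1", 0))          # port 0: kernel picks a port
        receiver.settimeout(1.0)
        sender.connect(receiver.getsockname())   # no packet sent; peer recorded
        sender.send(b"ping")                     # no destination argument needed
        data, source = receiver.recvfrom(1024)
        peer_recorded = sender.getpeername() == receiver.getsockname()
        source_matches = source == sender.getsockname()
    return data, peer_recorded, source_matches
```

connect() also fixes the sender's local address, which is why the source reported by recvfrom() matches sender.getsockname().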

11. What are raw sockets and when would you use them?

Raw sockets bypass the normal protocol stack layers and allow you to construct custom network packets at the IP level or below. With socket(AF_INET, SOCK_RAW, protocol), you receive and send raw IP datagrams with full control over IP headers and payload. On Linux, AF_PACKET sockets (which replace the obsolete SOCK_PACKET type) give access to Ethernet frames directly.

Legitimate uses: network diagnostics tools (ping uses raw ICMP sockets), packet sniffers with BPF, custom tunneling protocols (like GRE or IP-in-IP), network simulation, and firewall implementations. Requires CAP_NET_RAW capability. Raw sockets are a security concern because they can be used for reconnaissance, crafting spoofed packets, and network scanning, which is why many production environments restrict or disable them.

12. How does socket buffer sizing affect network performance?

Socket buffers (SO_RCVBUF and SO_SNDBUF) are kernel-managed ring buffers for incoming and outgoing data. If the send buffer is full, a send() call blocks or returns EAGAIN (non-blocking), backpressuring the application. If the receive buffer is full, incoming data is dropped (UDP) or the sender's TCP window closes (TCP), reducing throughput.

The kernel auto-tunes these on modern Linux, but high-throughput or low-latency applications may benefit from manual tuning. Increasing the receive buffer helps when receiving bursts of data. For low-latency, smaller buffers reduce queuing delay. For high-bandwidth-delay-product links (like WAN connections), larger buffers allow more data to be in flight. Setting SO_RCVBUF or SO_SNDBUF explicitly disables auto-tuning for that socket on Linux. Linux also provides SO_RCVLOWAT, which sets the minimum number of bytes that must be available before a blocking receive returns.
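Requesting a buffer size and reading back what the kernel granted can be sketched in Python as follows. The helper name set_rcvbuf is illustrative, and the granted value is platform-dependent: Linux, for example, reports double the requested size to account for bookkeeping overhead and caps requests at net.core.rmem_max.

```python
import socket


def set_rcvbuf(sock: socket.socket, size: int) -> int:
    """Request a receive buffer size and return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Reading the value back is worth doing in practice, because a request above the system limit is silently reduced rather than rejected.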

13. What is the difference between poll() and epoll() for socket multiplexing?

poll() and select() both pass the full set of monitored file descriptors from user space to the kernel on every call, and the kernel scans all of them linearly, making each call O(n). poll() uses an array of pollfd structures (no fixed FD_SETSIZE limit like select()), but still has the linear scan problem. select() overwrites its fd sets, so they must be rebuilt before each call; poll() leaves the events field untouched and reports results in revents, so the same array can be reused.

epoll() uses a kernel-maintained red-black tree of monitored file descriptors and a separate ready list of file descriptors that have events. On epoll_wait(), it returns directly from the ready list without scanning all fds, so the cost is proportional to the number of ready descriptors rather than the number monitored. It supports edge-triggered mode and one-shot mode (one notification per event until explicitly rearmed). epoll() scales to hundreds of thousands of file descriptors efficiently, which is why nginx and other high-performance servers use it.

14. What is the purpose of TCP_NODELAY and when should you disable Nagle's algorithm?

Nagle's algorithm (enabled by default on TCP sockets) buffers outgoing data and waits for an ACK before sending more, coalescing small writes into larger segments to reduce packet overhead on low-speed links. For interactive, low-latency applications like remote shells or real-time game updates, this buffering adds unacceptable latency.

TCP_NODELAY disables Nagle's algorithm, sending data immediately without waiting for ACKs. Use it when you have small, latency-sensitive messages that should be sent immediately: keystrokes in an SSH session, game state updates, real-time chat messages. The tradeoff is more packets on the wire and potentially reduced throughput for bulk transfers. Most interactive applications disable Nagle's algorithm.
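Setting the option is a one-line call; the Python sketch below wraps it in a hypothetical disable_nagle helper and reads the value back to confirm it took effect.

```python
import socket


def disable_nagle(sock: socket.socket) -> bool:
    """Set TCP_NODELAY and report whether the kernel now has it enabled."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

The option applies per socket, so a server that wants it on accepted connections should set it on each socket returned by accept() rather than rely on inheritance from the listener.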

15. How does socketpair() differ from pipe() for IPC?

pipe() creates a unidirectional channel with two file descriptors: fd[0] for reading, fd[1] for writing. Data written to fd[1] is read from fd[0]. socketpair(AF_UNIX, SOCK_STREAM) creates a pair of bidirectional, connected stream sockets where both file descriptors can both send and receive.

socketpair() with SOCK_STREAM provides a bidirectional pipe that works with shutdown() (SHUT_RD, SHUT_WR, SHUT_RDWR) for half-close semantics, can be used with select()/poll()/epoll() for multiplexed communication, and supports file descriptor passing via sendmsg() with SCM_RIGHTS (out-of-band data via MSG_OOB is not portably available on Unix domain sockets). A bidirectional protocol is more natural to implement on a socketpair than coordinating two unidirectional pipes.
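The half-close behaviour is something a single pipe cannot express. In this Python sketch (function names are illustrative), one end finishes sending with shutdown(SHUT_WR), the other end reads until EOF and then replies over the same socket pair.

```python
import socket


def read_until_eof(sock: socket.socket) -> bytes:
    """Read until the peer shuts down its sending side."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def half_close_demo():
    """Request/response over one socket pair using half-close as the delimiter."""
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with parent, child:
        parent.sendall(b"request")
        parent.shutdown(socket.SHUT_WR)   # done sending, still able to receive
        request = read_until_eof(child)   # child sees EOF after the request
        child.sendall(request.upper())    # reverse direction is still open
        child.shutdown(socket.SHUT_WR)
        reply = read_until_eof(parent)
    return request, reply
```

Both ends run in one thread here only because the messages are small enough to sit in the socket buffers; across fork() the same calls apply with one descriptor per process.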

16. What are the CLOSE_WAIT and LAST_ACK states in TCP connection termination?

When the remote end sends a FIN (initiating the close), the local TCP acknowledges it and moves the connection to CLOSE_WAIT. The application sees this as end-of-file: recv() returns 0, meaning the peer will send no more data, although the local side can still send. When the application calls close() (or shutdown(SHUT_WR)), the local TCP sends its own FIN and moves to LAST_ACK. The connection stays in LAST_ACK until the peer acknowledges that FIN, after which it is fully closed.

Connections stuck in CLOSE_WAIT indicate the application is not calling close() after receiving the remote close. This commonly happens when the application does not properly handle half-closes from the peer. A socket in CLOSE_WAIT holds kernel resources (receive buffer) until the application calls close(). Detecting the peer's close (recv() returning 0) and then calling close() prevents these socket leaks.

17. What is the difference between recv() and read() on a socket?

For socket file descriptors, read(fd, buf, len) behaves essentially like recv(sockfd, buf, len, 0). The key difference is that recv() accepts a flags argument: MSG_PEEK (look at data without consuming it), MSG_DONTWAIT (make this one call non-blocking), MSG_OOB (receive out-of-band data), and MSG_WAITALL (block until the full requested length has arrived, unless a signal, error, or disconnect intervenes).

read() is the generic POSIX file descriptor operation and works on any file descriptor (pipes, files, sockets). recv() is socket-specific and provides socket-related control via flags. For regular sockets without special flags, they behave identically. Using recv() makes the socket-specific intent explicit and provides access to features that read() cannot express.
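The sketch below contrasts a flagged recv() with a plain read(), using Python's `socket.recv()` and `os.read()` as thin wrappers over the C calls. The function name `peek_then_read` is invented for the example.

```python
import os
import socket


def peek_then_read():
    """Show MSG_PEEK leaving data queued, then recv() and read() consuming it."""
    a, b = socket.socketpair()
    with a, b:
        a.sendall(b"HELLOWORLD")
        peeked = b.recv(5, socket.MSG_PEEK)   # inspect without consuming
        via_recv = b.recv(5)                  # the same five bytes, now removed
        via_read = os.read(b.fileno(), 5)     # plain read() gets the next five
    return peeked, via_recv, via_read


if __name__ == "__main__":
    print(peek_then_read())
```

The peek and the first recv() return the same bytes, while read() picks up where recv() left off; read() has no way to request the peek.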

18. How do getsockopt() and setsockopt() work for socket configuration?

setsockopt() modifies kernel-level socket behavior at a given protocol level: SOL_SOCKET (generic options like SO_REUSEADDR, SO_KEEPALIVE, SO_RCVBUF), IPPROTO_TCP (TCP-specific options like TCP_NODELAY and the Linux-only TCP_QUICKACK), IPPROTO_IP (IP-specific options like the Linux-only IP_MTU_DISCOVER), and other protocol-family-specific levels. Linux also defines SOL_TCP and SOL_IP, which have the same values as IPPROTO_TCP and IPPROTO_IP. Each option has a type and value that the kernel interprets according to the protocol's implementation.

getsockopt() retrieves the current value of an option. Options can be integer values, structs, or binary blobs depending on the option. Some options are read-only (determined by the kernel or connection state, such as SO_TYPE and SO_ERROR), and attempts to set them fail with an error. Ordering matters because some options take effect only at a particular step: SO_REUSEADDR must be set before bind(), and on Linux TCP buffer sizes should be set before connect() or listen() to influence the connection's window negotiation.
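As an illustrative sketch, the snippet below sets two generic options before any bind() and reads several values back. The function name `configure_listener` and the returned dictionary keys are assumptions made for the example.

```python
import socket


def configure_listener():
    """Set options on a fresh TCP socket and read the resulting values back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Set before bind(): SO_REUSEADDR only matters at bind time.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        return {
            "reuseaddr": sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR),
            "keepalive": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE),
            # SO_TYPE is read-only: it reports the type chosen at creation.
            "type": sock.getsockopt(socket.SOL_SOCKET, socket.SO_TYPE),
            # The kernel picks a default receive buffer size if none is set.
            "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        }
    finally:
        sock.close()


if __name__ == "__main__":
    print(configure_listener())
```

Boolean options read back as non-zero rather than necessarily as 1, so comparing against zero is the portable check.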

19. What is the difference between shutdown() and close() for sockets?

close() closes the file descriptor and, when the last reference to the socket is closed, initiates the TCP close sequence (FIN exchange) if it is a connected socket. It releases the file descriptor and kernel resources. If multiple processes share the socket (via fork), each close() decrements the reference count and only the last one initiates the TCP close.

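shutdown() operates on the connection itself rather than on the descriptor, shutting down one or both directions: SHUT_RD (no further receives; what happens to data already buffered or arriving later is implementation-dependent), SHUT_WR (no further sends; for TCP this sends a FIN once queued data has gone out), SHUT_RDWR (both directions). Because it acts on the underlying socket, it takes effect for every process that shares it, regardless of how many descriptors remain open. shutdown() is useful for half-close patterns where one direction closes while the other remains open, which close() cannot express. shutdown() does not release the file descriptor; close() is still required afterwards.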
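The half-close pattern can be sketched as follows: one side signals "request complete" with SHUT_WR and then waits for the reply on the same socket. The names `read_all` and `half_close_exchange` are hypothetical, and a socketpair stands in for a real client and server connection.

```python
import socket


def read_all(sock):
    """Read from a stream socket until the peer signals end-of-file."""
    data = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return data
        data += chunk


def half_close_exchange():
    """Client half-closes after its request, then still receives the reply."""
    client, server = socket.socketpair()
    with client, server:
        client.sendall(b"request")
        client.shutdown(socket.SHUT_WR)    # done sending, still able to receive

        received = read_all(server)        # server sees EOF after the request
        server.sendall(received.upper())
        server.shutdown(socket.SHUT_WR)    # server is done sending too

        reply = read_all(client)
    return received, reply                 # leaving the with block closes both


if __name__ == "__main__":
    print(half_close_exchange())
```

Calling close() on the client instead of shutdown(SHUT_WR) would end the request just as clearly, but the client would have no descriptor left to read the reply from.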

20. What is io_uring and how does it improve asynchronous socket I/O?

io_uring is a Linux kernel interface (5.1+) that enables high-performance asynchronous I/O for sockets, files, and other descriptors. select/poll/epoll only report readiness; the application must still issue a separate send() or recv() system call for each operation. io_uring instead uses a pair of ring buffers (a submission queue and a completion queue) shared between the application and the kernel. The application submits I/O operations by filling entries in the submission queue; the kernel processes them and posts results to the completion queue.

Key advantages: asynchronous send/recv with IORING_OP_SEND and IORING_OP_RECV, where submission returns immediately and the result arrives via the completion queue; batched submissions that amortize system call overhead; and features like registered (fixed) buffers that avoid per-operation buffer setup in the kernel. For high-frequency socket I/O, io_uring can significantly reduce system call overhead compared to an epoll loop that issues one send() or recv() call per ready socket. Libraries like liburing make io_uring accessible to applications.

Conclusion

Sockets are the most flexible form of IPC: bidirectional, connection-oriented or connectionless, local or network-spanning. Unix domain sockets provide fast local communication that never leaves the kernel, while TCP/UDP sockets extend the same programming model across network boundaries. The layered API (socket, bind, listen, accept, connect, send, recv) has remained remarkably stable since BSD introduced it decades ago.

Mastering socket programming means mastering edge cases: partial sends and receives requiring loops, EINTR handling for interrupted system calls, TIME_WAIT state and SO_REUSEADDR for server restarts, and appropriate use of select/poll/epoll for scalable multiplexing. These fundamentals apply whether you’re building a local daemon, a network service, or a distributed system.

Looking forward, the evolution continues with zero-copy socket APIs, io_uring integration for asynchronous I/O, and increasingly sophisticated offloading to network hardware. The socket API adapts, but the underlying principles of bidirectional communication endpoints and kernel-mediated data transfer remain constant.
