Sockets & Network IPC
Learn about Unix domain sockets, TCP/UDP sockets for local and network IPC, socket pairs, and advanced socket options for high-performance inter-process communication.
If pipes and message queues are the local delivery trucks of the IPC world, sockets are the postal service — they can deliver data not just between processes on the same machine, but across the network to any reachable host. Sockets are the most versatile and widely-used form of IPC on Unix systems, and understanding them is essential for every systems programmer. Whether you are building a web server, a database client, a microservice communication layer, or a local daemon, sockets are the foundational building block.
Introduction
A socket is a bidirectional communication endpoint. Unlike pipes which are unidirectional and unnamed, sockets provide bidirectional, connection-oriented (TCP) or connectionless (UDP) communication that can be local (Unix domain) or network-based (TCP/IP, UDP/IP).
There are two main families of sockets:
Unix domain sockets (AF_UNIX / AF_LOCAL): Use filesystem paths as addresses. Data never leaves the kernel. Used for local inter-process communication with TCP-like or UDP-like semantics. Slower than shared memory, because every byte is copied through the kernel, but fast enough for most local workloads.
Internet domain sockets (AF_INET / AF_INET6): Use IP addresses and port numbers. Data flows through the full network stack. Used for network communication between processes on different hosts.
Within each family, there are two main socket types:
SOCK_STREAM (TCP in the internet family): Connection-oriented, reliable, byte-stream, no message boundaries. Similar to a pipe but bidirectional.
SOCK_DGRAM (UDP in the internet family): Connectionless, message-oriented with preserved boundaries; unreliable over UDP. Each send delivers a discrete message.
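The difference in message boundaries can be seen on a single machine. The following is a minimal sketch, assuming a platform where socketpair() supports both types for AF_UNIX (Linux does); the variable names are illustrative only.

```python
import socket

# Datagram pair: each send() is delivered as one discrete message.
d_tx, d_rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
d_tx.send(b"abc")
d_tx.send(b"def")
first_datagram = d_rx.recv(1024)     # exactly one datagram, never b"abcdef"
second_datagram = d_rx.recv(1024)

# Stream pair: the two writes become one undifferentiated byte stream.
s_tx, s_rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
s_tx.sendall(b"abc")
s_tx.sendall(b"def")
stream_bytes = b""
while len(stream_bytes) < 6:         # a single recv() may return any prefix
    chunk = s_rx.recv(1024)
    if not chunk:
        break
    stream_bytes += chunk

for s in (d_tx, d_rx, s_tx, s_rx):
    s.close()
```

The stream reader has to loop because the kernel is free to hand back the six bytes in one piece or several; the datagram reader never needs to.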
The socket API was originally developed for BSD Unix and standardized in POSIX. It consists of socket(), bind(), listen(), accept(), connect(), send(), recv(), close(), and related functions.
When to Use / When Not to Use
Use Unix domain sockets when:
- You need IPC between processes on the same machine with TCP-like semantics
- You need bidirectional communication
- You want a simpler alternative to shared memory (with built-in synchronization at the kernel level)
- You need to use select()/poll()/epoll for multiplexing multiple connections
- You need a connection-oriented channel with backpressure (a blocking send() waits when the receiver's buffer is full)
Use TCP sockets when:
- You need network communication between different machines
- You need reliable, ordered, connection-oriented delivery
- You need to handle many concurrent connections efficiently
Use UDP sockets when:
- You need low-latency communication and can tolerate some packet loss
- Your application can tolerate or recover from gaps on its own (for example, periodic telemetry where the next update supersedes a lost one)
- You are doing broadcast or multicast communication
Do not use sockets when:
- You need maximum throughput for local communication (shared memory may be faster)
- You need simple unidirectional streaming (pipes are simpler)
- You need message queue semantics with priorities (message queues fit better)
- You are communicating between threads in the same process (use condition variables or channels)
Architecture or Flow Diagram
sequenceDiagram
participant S as Server
participant K as Kernel (Socket Layer)
participant C as Client
Note over S,C: Stream socket server-client flow (Unix domain shown, TCP is analogous)
S->>K: socket(AF_UNIX, SOCK_STREAM, 0)
K-->>S: fd = socketfd
S->>K: bind(socketfd, "/tmp/mysock")
S->>K: listen(socketfd, backlog=5)
C->>K: socket(AF_UNIX, SOCK_STREAM, 0)
K-->>C: fd = clientfd
C->>K: connect(clientfd, "/tmp/mysock")
K->>K: Enqueue connection in accept queue
Note over K: Kernel maintains accept queue
K-->>S: Notification: listening socket readable
S->>K: accept(socketfd)
K-->>S: connfd = new connection socket
C->>K: send(clientfd, "Hello from client", 17)
K-->>C: 17 bytes accepted
K-->>S: data available on connfd
S->>K: recv(connfd, buf, 1024)
K-->>S: "Hello from client"
S->>K: send(connfd, "Hi back", 7)
K-->>C: "Hi back"
Note over S,C: close both sockets when done
Core Concepts
Unix Domain Socket Creation and Binding
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // memset, strncpy

int main(void) {
    int server_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server_fd == -1) {
        perror("socket");
        exit(1);
    }

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/my_socket", sizeof(addr.sun_path) - 1);

    // Remove existing socket file (avoid EADDRINUSE)
    unlink("/tmp/my_socket");

    if (bind(server_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        perror("bind");
        exit(1);
    }
    if (listen(server_fd, 5) == -1) {
        perror("listen");
        exit(1);
    }
    printf("Server listening on %s\n", addr.sun_path);

    // Accept a connection
    int client_fd = accept(server_fd, NULL, NULL);
    if (client_fd == -1) {
        perror("accept");
        exit(1);
    }

    char buf[256];
    // Leave one byte free for the terminating NUL
    ssize_t n = recv(client_fd, buf, sizeof(buf) - 1, 0);
    if (n > 0) {
        buf[n] = '\0';
        printf("Received: %s\n", buf);
    }

    if (send(client_fd, "Hello from server", 17, 0) == -1) {
        perror("send");
    }

    close(client_fd);
    close(server_fd);
    unlink("/tmp/my_socket");
    return 0;
}
Socket Pairs (Anonymous Connected Sockets)
A socket pair is a pair of connected sockets where data written to one can be read from the other. Created with socketpair(), they are useful for creating bidirectional communication channels between related processes:
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>   // wait
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>     // exit
#include <string.h>

int main(void) {
    int sv[2]; // Two connected sockets; both ends can send and receive
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        exit(1);
    }

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(1);
    }

    if (pid == 0) {
        // Child: keep sv[0], close the parent's end
        close(sv[1]);
        char buf[128];
        // Leave one byte free for the terminating NUL
        ssize_t n = recv(sv[0], buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            printf("Child received: %s\n", buf);
        }
        close(sv[0]);
        _exit(0);
    } else {
        // Parent: keep sv[1], close the child's end
        close(sv[0]);
        if (send(sv[1], "Hello from parent!", 18, 0) == -1) {
            perror("send");
        }
        close(sv[1]);
        wait(NULL);
    }
    return 0;
}
TCP Server with select() Multiplexing
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PORT 8080
#define MAX_CLIENTS 10

int main(void) {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd == -1) { perror("socket"); exit(1); }

    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);

    if (bind(server_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) { perror("bind"); exit(1); }
    if (listen(server_fd, 5) == -1) { perror("listen"); exit(1); }

    fd_set readfds;
    int client_fds[MAX_CLIENTS];
    for (int i = 0; i < MAX_CLIENTS; i++) client_fds[i] = -1;  // -1 marks a free slot

    while (1) {
        // select() modifies the set, so rebuild it on every iteration
        FD_ZERO(&readfds);
        FD_SET(server_fd, &readfds);
        int maxfd = server_fd;
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (client_fds[i] != -1) {
                FD_SET(client_fds[i], &readfds);
                if (client_fds[i] > maxfd) maxfd = client_fds[i];
            }
        }

        int activity = select(maxfd + 1, &readfds, NULL, NULL, NULL);
        if (activity < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            perror("select");
            break;
        }

        // New connection?
        if (FD_ISSET(server_fd, &readfds)) {
            int client_fd = accept(server_fd, NULL, NULL);
            if (client_fd != -1) {
                int stored = 0;
                for (int i = 0; i < MAX_CLIENTS; i++) {
                    if (client_fds[i] == -1) {
                        client_fds[i] = client_fd;
                        stored = 1;
                        break;
                    }
                }
                if (!stored) close(client_fd);  // table full: refuse instead of leaking the fd
            }
        }

        // Client data?
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (client_fds[i] != -1 && FD_ISSET(client_fds[i], &readfds)) {
                char buf[1024];
                ssize_t n = recv(client_fds[i], buf, sizeof(buf), 0);
                if (n <= 0) {
                    // 0 means the peer closed; -1 means an error
                    close(client_fds[i]);
                    client_fds[i] = -1;
                } else {
                    // Echo back (a production server would loop on partial sends)
                    send(client_fds[i], buf, (size_t)n, 0);
                }
            }
        }
    }

    close(server_fd);
    return 0;
}
Production Failure Scenarios
EADDRINUSE — Socket Already Bound
If you try to bind() to a path/port that is already in use (previous server crashed without cleanup), you get EADDRINUSE. On Linux, use SO_REUSEADDR before bind() to allow reusing the address:
int opt = 1;
setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
// Now bind() can succeed even if previous socket on same port is in TIME_WAIT
Note: SO_REUSEADDR addresses TIME_WAIT on TCP ports; it has no effect on Unix domain sockets. For a Unix domain socket, bind() fails with EADDRINUSE whenever the path already exists, so you must unlink() the stale socket file before rebinding.
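The Unix domain case can be sketched in Python as follows. The helper name listen_unix and the temporary path are assumptions made for this illustration, not part of any library.

```python
import errno
import os
import socket
import tempfile

def listen_unix(path, backlog=5):
    """Bind a listening Unix domain socket, replacing a stale socket file."""
    try:
        os.unlink(path)              # SO_REUSEADDR would not help here
    except FileNotFoundError:
        pass
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind(path)
        srv.listen(backlog)
    except OSError:
        srv.close()                  # do not leak the fd on failure
        raise
    return srv

sock_path = os.path.join(tempfile.mkdtemp(), "demo.sock")  # hypothetical path
first = listen_unix(sock_path)
first.close()                        # simulates a crash: the file stays behind

# A plain bind() on the leftover path fails...
probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    probe.bind(sock_path)
    bind_errno = None
except OSError as exc:
    bind_errno = exc.errno
finally:
    probe.close()

# ...while the unlink-then-bind helper succeeds.
second = listen_unix(sock_path)
second.close()
os.unlink(sock_path)
```

Unlinking unconditionally also removes the path of a server that is still running, so real daemons often pair this with a lock file or a test connection first.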
Connection Refused Under Load
When a server is overwhelmed with connections, the listen queue (backlog) fills up. New connections are refused with ECONNREFUSED or silently dropped (depending on OS). Set an appropriate backlog and monitor queue depth.
Mitigation: Increase the listen backlog with listen(fd, backlog). The kernel caps this at somaxconn (viewable in /proc/sys/net/core/somaxconn). Also implement connection limiting at the application level.
Socket Leak — File Descriptors Not Closed
Every socket creates a file descriptor. If you do not close sockets properly (especially on error paths), file descriptors leak. Over time, you exhaust the system’s file descriptor limit and new socket calls fail with EMFILE.
Mitigation: Always close sockets in all code paths. Use a wrapper function that handles cleanup, or use close() in every error case. Monitor file descriptor usage with lsof -p <pid> or ls /proc/<pid>/fd/.
Partial Send / recv
send() may transmit fewer bytes than requested if the kernel’s send buffer is full (especially on non-blocking sockets). recv() may return fewer bytes than requested. Always check return values and handle partial operations.
Mitigation: Loop until all data is sent. Use sendall() wrappers that handle partial sends. For recv, accumulate data until a complete message is received (based on length prefix or delimiter).
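A length-prefixed framing layer is one common way to apply this mitigation. The sketch below assumes a 4-byte big-endian header; the function names send_msg, recv_msg, and recv_exact are illustrative, not a standard API.

```python
import socket
import struct

HEADER = struct.Struct("!I")         # 4-byte big-endian length prefix

def recv_exact(sock, n):
    """Read exactly n bytes, looping over short reads."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:                # peer closed before the message completed
            raise ConnectionError("connection closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def send_msg(sock, payload):
    # socket.sendall() already loops over partial sends internally
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_msg(sock):
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)

a, b = socket.socketpair()
send_msg(a, b"first")
send_msg(a, b"second message")
received = [recv_msg(b), recv_msg(b)]
a.close()
b.close()
```

A production receiver would also cap the accepted length so that a hostile peer cannot request an arbitrarily large allocation.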
UDP Packet Loss and Reordering
UDP does not guarantee delivery. Packets can be lost, duplicated, or arrive out of order. Applications using UDP must implement their own reliability mechanisms.
Mitigation: Design for some packet loss. Implement sequence numbers and acknowledgments at the application layer if reliability is needed. Consider using TCP instead for critical data.
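Sequence numbers are the usual starting point for detecting these conditions. The class below is a hypothetical helper written for this illustration; it only classifies arrivals and leaves acknowledgment and retransmission to the application.

```python
class SequenceTracker:
    """Classify incoming datagrams by sequence number (illustrative helper)."""

    def __init__(self):
        self.highest = -1            # highest sequence number seen so far
        self.seen = set()            # grows without bound in this sketch

    def observe(self, seq):
        if seq in self.seen:
            return "duplicate"
        self.seen.add(seq)
        if seq > self.highest:
            missing = seq - self.highest - 1
            self.highest = seq
            # A jump means earlier datagrams are lost or still in flight.
            return "gap" if missing else "in-order"
        return "reordered"           # arrived after a newer datagram

tracker = SequenceTracker()
arrivals = [0, 1, 3, 3, 2, 4]        # 2 is delayed, 3 is duplicated
verdicts = [tracker.observe(s) for s in arrivals]
```

For the arrival order shown, verdicts is in-order, in-order, gap, duplicate, reordered, in-order. A long-running receiver would replace the set with a sliding window.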
Trade-off Table
| Feature | Unix Domain Socket | TCP Socket | UDP Socket | Named Pipe (FIFO) |
|---|---|---|---|---|
| Scope | Local only | Local + network | Local + network | Local only |
| Connection model | Connection-oriented (SOCK_STREAM) or datagram | Connection-oriented | Connectionless | None (endpoints open the same path) |
| Message boundaries | SOCK_STREAM: no, SOCK_DGRAM: yes | No (byte stream) | Yes (preserved) | No (byte stream) |
| Bidirectional | Yes | Yes | Yes (send/recv on same socket) | No (unidirectional) |
| Select/poll/epoll | Yes | Yes | Yes | Yes (via file descriptor) |
| Performance | Very high (kernel, no network) | High (kernel, local); moderate (network) | High (no connection setup, but no delivery guarantees) | High |
| Reliability | Reliable for both socket types on Linux | Reliable, ordered | Unreliable, best-effort | Reliable (kernel buffer) |
| Typical use | Local daemons, high-perf IPC | Network services, clients | Streaming, low-latency | Simple cross-process streaming |
Implementation Snippet(s)
Python: Unix Domain Socket Client
import socket

SOCKET_PATH = "/tmp/my_socket"

# Create Unix domain socket
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    client.connect(SOCKET_PATH)
    client.sendall(b"Hello from Python client")
    data = client.recv(1024)
    print(f"Server said: {data.decode()}")
except (ConnectionRefusedError, FileNotFoundError):
    # Refused: socket file exists but nobody listens; not found: no socket file
    print("Server not running")
except Exception as e:
    print(f"Error: {e}")
finally:
    client.close()
Bash: Using netcat for Socket Testing
# Connect to a Unix domain socket (needs a netcat build that supports -U, such as OpenBSD netcat)
# nc -U /tmp/my_socket
# Listen on a Unix domain socket
# nc -l -U /tmp/my_socket
# Test TCP server
# nc localhost 8080
# Send HTTP request to test server
# echo -e "GET / HTTP/1.0\r\n\r\n" | nc localhost 80
# Check what is listening on TCP ports
ss -tlnp | grep :8080
netstat -tlnp | grep :8080
Observability Checklist
- Open sockets: lsof -p <pid> shows all file descriptors including sockets
- Listening ports: ss -tlnp (preferred over netstat on modern Linux) shows listening sockets with process info
- Established connections: ss -tnp shows all established TCP connections
- Socket buffers: Check with cat /proc/sys/net/core/rmem_default and /proc/sys/net/core/wmem_default
- Connection state: ss -ti shows TCP connection state, retransmissions, congestion window
- strace: strace -e trace=network -p <pid> to trace socket system calls (bind, listen, accept, connect, and the send/recv variants)
- perf: perf stat -e syscalls:sys_enter_bind,syscalls:sys_enter_connect -p <pid> to measure socket call frequency
Security Notes
Socket permissions (Unix domain): Unix domain sockets respect filesystem permissions on the socket file path. Use appropriate permissions on the directory containing the socket. Consider using a directory with 0700 permissions for sensitive IPC.
Network socket exposure: TCP/UDP sockets bound to 0.0.0.0 or :: are accessible from the network. Always bind to localhost (127.0.0.1 or ::1) if you only want local access. Use firewall rules for additional protection.
DoS via connection flood: An attacker can exhaust server resources by opening many connections (SYN floods against TCP listeners, connection floods against any stream socket). Use connection limits, rate limiting, and proper timeout configuration. Consider using a load balancer or SYN cookies for TCP.
Socket access: Unix domain socket traffic cannot be captured passively the way network packets can, but any local process with permission on the socket path can connect to the server, and a privileged process can trace either endpoint. Use filesystem permissions and separate namespaces for sensitive IPC.
Audit: Socket operations (bind, listen, connect) can be recorded by the Linux audit subsystem when auditd rules are configured for those system calls. For compliance, monitor for unexpected socket creation or binding to unusual ports.
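The socket-permission advice above can be sketched as follows. The directory prefix and socket name are assumptions for the example; tempfile.mkdtemp() is used because it creates a directory accessible only to the creating user.

```python
import os
import socket
import stat
import tempfile

# Private directory: only the owning user can traverse it to reach the socket.
runtime_dir = tempfile.mkdtemp(prefix="ipc-")          # created with mode 0700
sock_path = os.path.join(runtime_dir, "daemon.sock")   # hypothetical name

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(sock_path)
os.chmod(sock_path, 0o600)           # tighten the socket file itself as well
srv.listen(5)

dir_mode = stat.S_IMODE(os.stat(runtime_dir).st_mode)
sock_mode = stat.S_IMODE(os.stat(sock_path).st_mode)
is_socket = stat.S_ISSOCK(os.stat(sock_path).st_mode)

srv.close()
os.unlink(sock_path)
os.rmdir(runtime_dir)
```

Relying on the directory mode closes the short window between bind() and chmod() during which the socket file has its default permissions.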
Common Pitfalls / Anti-patterns
- Not handling EINTR on socket calls: as with pipes and other blocking calls, socket operations can fail with EINTR when a signal arrives. Retry the call or install handlers with SA_RESTART.
- Ignoring partial send/recv: send() and recv() may process fewer bytes than requested. Loop until all data is transferred.
- Forgetting SO_REUSEADDR: not setting this causes "Address already in use" errors after server restart, especially during development.
- Buffer overflow around recv: recv() never writes more than the length you pass, so pass the real buffer size, leave room for any terminator you append, and validate length fields sent by the peer. Malicious clients can send more data than expected.
- Using UDP for reliable data: UDP makes no guarantee of delivery, order, or uniqueness. If you need reliability on top of UDP, implement sequence numbers, ACKs, and retransmission.
- Not setting socket timeouts: by default, blocking socket operations can wait forever. Set SO_RCVTIMEO and SO_SNDTIMEO for production code.
- Accumulating TIME_WAIT sockets: after an active close, the kernel holds the connection's address pair in TIME_WAIT. Use SO_REUSEADDR to allow rebinding, or design your protocol to use longer-lived connections.
- Not draining in edge-triggered epoll: if you use epoll in edge-triggered mode and do not drain all pending data, you may miss events. Use level-triggered mode or read until EAGAIN.
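The timeout pitfall is easy to demonstrate. This sketch uses Python's settimeout(), which is the language-level counterpart of SO_RCVTIMEO and SO_SNDTIMEO (Python implements it with a non-blocking socket and a poll rather than those options); the 0.2 second value is arbitrary.

```python
import socket

a, b = socket.socketpair()
b.settimeout(0.2)                    # applies to recv(), send(), accept(), connect()

try:
    b.recv(1024)                     # nothing was sent, so this must time out
    timed_out = False
except socket.timeout:               # alias of TimeoutError on Python 3.10+
    timed_out = True

a.sendall(b"late data")
after_timeout = b.recv(1024)         # the socket is still usable afterwards
a.close()
b.close()
```

A timeout leaves the connection intact, so the caller decides whether to retry, send a heartbeat, or close.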
Quick Recap Checklist
- Sockets provide bidirectional IPC for both local (AF_UNIX) and network (AF_INET) communication
- SOCK_STREAM is connection-oriented, reliable, byte-stream (like a bidirectional pipe)
- SOCK_DGRAM is connectionless, unreliable, message-oriented (preserves boundaries)
- Unix domain sockets (AF_UNIX) are the fastest local IPC mechanism with full socket API features
- Set SO_REUSEADDR before bind() on TCP listening sockets to handle server restarts gracefully; for Unix domain sockets, unlink() the stale path instead
- Handle EINTR on all socket calls and partial send/recv by looping
- Use select()/poll()/epoll() for multiplexing many connections in a single thread
- Monitor socket file descriptor usage to prevent leaks; always close in all code paths
- UDP requires application-level reliability if your use case needs it
- Socket buffer sizes affect performance — tune with SO_RCVBUF and SO_SNDBUF
Interview Questions
AF_UNIX (also called AF_LOCAL) Unix domain sockets use a filesystem path as the address. Data never leaves the kernel — it is copied directly from sender's buffer to receiver's buffer through the kernel's socket infrastructure. They are used for local IPC between processes on the same machine and offer the highest performance.
AF_INET (IPv4) and AF_INET6 (IPv6) are internet domain sockets that use IP address and port number pairs as addresses. Data flows through the full TCP/IP network stack — through the kernel's networking layers and potentially across a physical network. They support communication with processes on remote machines.
Both support SOCK_STREAM (reliable, connection-oriented, byte-stream) and SOCK_DGRAM (message-oriented, unreliable). Unix domain sockets are generally faster since they avoid network stack overhead, but they are limited to local communication.
select() allows a process to monitor multiple file descriptors, blocking until one or more become "ready" (readable, writable, or have an error condition). Internally, select() copies three bitmap sets (readfds, writefds, exceptfds) into the kernel, which checks each fd's state. When any fd is ready, the kernel updates the bitmaps in-place and returns.
Limitations:
- O(n) scanning: On return, you must iterate through all fds to find which are ready, even if only one was ready. Poor scaling with thousands of fds.
- Bitmap limit: The fd sets use fixed-size bitmaps (typically FD_SETSIZE, often 1024), limiting the number of fds you can monitor.
- Reset on return: The fd sets are modified by
select(), so you must reinitialize them on each call.
Modern alternatives: poll() solves the fd limit issue (uses array instead of bitmap). epoll() (Linux) solves both — it uses a kernel event list and returns only ready fds, scales to millions of fds, and supports edge-triggered mode. kqueue() (BSD/macOS) provides similar functionality.
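Python's selectors module wraps whichever of these mechanisms the platform offers, which makes the "returns only ready fds" behavior easy to observe. This is a small sketch with two socket pairs standing in for client connections; the labels "a" and "b" are arbitrary.

```python
import selectors
import socket

sel = selectors.DefaultSelector()    # epoll on Linux, kqueue on BSD/macOS
a_rx, a_tx = socket.socketpair()
b_rx, b_tx = socket.socketpair()
sel.register(a_rx, selectors.EVENT_READ, data="a")
sel.register(b_rx, selectors.EVENT_READ, data="b")

b_tx.sendall(b"ping")                # only connection "b" has pending data

ready = []
for key, events in sel.select(timeout=1):
    ready.append(key.data)           # only ready descriptors are returned
    payload = key.fileobj.recv(1024)

sel.close()
for s in (a_rx, a_tx, b_rx, b_tx):
    s.close()
```

Unlike select(), the registration persists across calls, so nothing has to be rebuilt before the next wait.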
SOCK_STREAM provides a connection-oriented, reliable, byte-stream channel. It behaves like a bidirectional pipe — you send bytes, they arrive in order at the other end, with no message boundaries. If you send 100 bytes then 50 bytes, the receiver might read 150 bytes at once, or 50 bytes then 100 bytes, or any other division. TCP is the protocol that implements SOCK_STREAM over IP.
SOCK_DGRAM provides a connectionless, unreliable, message-oriented channel. Each send() delivers a discrete message (datagram) that arrives as a unit. Messages have boundaries — a single recv() returns exactly one datagram. Packets can be lost, duplicated, or arrive out of order. UDP is the protocol that implements SOCK_DGRAM over IP.
Choose SOCK_STREAM when you need reliable ordered delivery and can frame messages yourself (for example with a length prefix). Choose SOCK_DGRAM when you need low latency and can tolerate packet loss, or when each message is self-contained and should arrive as a single unit per recv call.
When a TCP connection is closed, the endpoint that initiates the close (the one sending the first FIN) enters TIME_WAIT state for a duration of 2 * Maximum Segment Lifetime (MSL), typically 60 seconds on Linux. During this time, the socket pair (IP:port combination) cannot be reused. This exists so that delayed packets from the old connection are not mistaken for packets from a new connection using the same tuple.
This causes problems when restarting a server — you try to bind to port 8080 but it is still in TIME_WAIT from the previous run.
SO_REUSEADDR tells the kernel to allow binding to an address that is in TIME_WAIT. For server sockets, you set this option before calling bind(). For Unix domain sockets, you also need to unlink() the socket path if the file still exists.
int opt = 1;
setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
bind(sockfd, ...);
This does not violate the TIME_WAIT safety property: the kernel allows the bind only when the conflicting sockets are in TIME_WAIT, and still refuses it while another socket is actively listening on the same address and port. Incoming packets for the old connection will still be handled correctly.
send() and recv() may transfer fewer bytes than requested: send() when the kernel's socket buffer has room for only part of the data, recv() when fewer bytes have arrived than the buffer could hold. Both return the number of bytes actually transferred, or -1 on error.
Handling partial sends — always loop until all data is sent:
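#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

// Send exactly len bytes; returns len on success, -1 on error
ssize_t sendall(int sockfd, const void *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = send(sockfd, (const char *)buf + total, len - total, 0);
        if (n == -1) {
            if (errno == EINTR) continue;  // Interrupted by a signal: retry
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}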
Handling partial receives — accumulate until a complete message is received. Use a length prefix protocol (4-byte length header) so the receiver knows when a message is complete:
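#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

// Read exactly len bytes; returns fewer only if the peer closed early, -1 on error
ssize_t recvall(int sockfd, void *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = recv(sockfd, (char *)buf + total, len - total, 0);
        if (n == 0) return (ssize_t)total;  // EOF: peer closed the connection
        if (n == -1) {
            if (errno == EINTR) continue;   // Interrupted by a signal: retry
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}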
For streaming protocols, consider using a circular buffer and maintaining a receive state machine that tracks how many bytes of the current message have been received.
The kernel maintains two queues for a listening socket: the SYN queue (incomplete connections, where the server has received a SYN and replied with SYN-ACK but has not yet received the final ACK) and the accept queue (completed connections ready for accept()). When listen(fd, backlog) is called, the backlog argument specifies the maximum size of the accept queue. The actual maximum is also bounded by /proc/sys/net/core/somaxconn.
When the accept queue is full, Linux by default ignores the final ACK of the handshake instead of queuing the connection, so the client retransmits until it gets through or times out. accept() simply removes connections from the accept queue; it does not affect the SYN queue. High-performance servers must balance backlog size (memory) against the rate of new connections arriving. Linux also provides TCP_FASTOPEN, which lets a returning client carry data in its SYN and saves a round trip; it does not remove the handshake itself.
Level-triggered mode (the default for epoll) notifies you whenever a file descriptor is ready, as long as the condition persists. If you do not drain all available data when notified, you will be notified again on the next epoll_wait call. Edge-triggered mode notifies you only when the state changes from not-ready to ready, or when new data arrives on a file descriptor that was previously empty.
Edge-triggered mode requires careful handling: you must drain all data when notified (or use EAGAIN returns to stop), and you must handle all file descriptors that become ready on each call or miss events. Edge-triggered is typically used with non-blocking sockets in high-performance servers to avoid spurious wakeups. Level-triggered is easier to program but can generate more events in high-throughput scenarios.
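The edge-triggered contract can be observed directly with Python's select.epoll. This is a Linux-only sketch (hence the hasattr guard); the 2-byte reads are deliberately small so that the drain loop runs several times.

```python
import select
import socket

events_seen = []
drained = b""
if hasattr(select, "epoll"):         # epoll is Linux-only
    tx, rx = socket.socketpair()
    rx.setblocking(False)            # required so the drain loop can stop
    ep = select.epoll()
    ep.register(rx.fileno(), select.EPOLLIN | select.EPOLLET)

    tx.sendall(b"hello")
    events_seen.append(len(ep.poll(1)))   # readiness edge is reported once
    events_seen.append(len(ep.poll(0)))   # no new edge, although data is unread

    # Drain until the kernel reports EAGAIN (BlockingIOError in Python).
    while True:
        try:
            chunk = rx.recv(2)       # tiny reads to force several iterations
        except BlockingIOError:
            break
        if not chunk:                # peer closed
            break
        drained += chunk

    ep.close()
    tx.close()
    rx.close()
```

With level-triggered registration (EPOLLIN alone), the second poll would report the descriptor again because unread data is still queued.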
SO_KEEPALIVE enables periodic probes on idle TCP connections to detect if the peer has crashed or become unreachable. When enabled and no data has been sent or received for a configurable period, the kernel sends a keepalive probe: a TCP segment that carries no new data. If the peer responds with an ACK, the connection is alive. If probes fail repeatedly, the connection is dropped and pending calls fail with ETIMEDOUT.
Configurable via socket options: TCP_KEEPIDLE (time before first probe, default 7200s on Linux), TCP_KEEPINTVL (interval between probes, default 75s), and TCP_KEEPCNT (number of failed probes before giving up, default 9). These can be set per socket with setsockopt(). Keepalive is useful for detecting dead peers in long-lived connections like database connections, but with the defaults a dead connection can go undetected for more than two hours (7200s + 9 × 75s = 7875s).
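Enabling keepalive and overriding the timers per socket might look like the sketch below. The three values are illustrative assumptions, not recommendations, and the TCP_KEEP* constants are platform-specific, so each one is guarded.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Per-socket overrides of the system-wide defaults (values are assumptions).
overrides = (("TCP_KEEPIDLE", 60), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 5))
for name, value in overrides:
    if hasattr(socket, name):        # not every platform defines these
        sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)

keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
idle_seconds = (sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
                if hasattr(socket, "TCP_KEEPIDLE") else None)
sock.close()
```

With these example values a dead peer would be detected after roughly 60 + 5 × 10 = 110 seconds of silence.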
Both Unix domain sockets and FIFOs use filesystem paths as addresses and respect filesystem permissions. Key difference: a FIFO is a unidirectional byte stream with plain read()/write() semantics (write() returns once the data is copied into the kernel buffer), while Unix domain sockets are bidirectional and can be stream-oriented (SOCK_STREAM) or datagram-oriented (SOCK_DGRAM).
A FIFO file descriptor works with select()/poll()/epoll, but a FIFO has no accept(): all writers share one stream, so a reader cannot tell clients apart the way a socket server can. A Unix domain SOCK_STREAM socket pair behaves like a bidirectional pipe, while a FIFO requires two named pipes for bidirectional communication. Unix sockets support ancillary messages (file descriptors, credentials), byte-stream ordering, and connection-oriented communication, making them more versatile than FIFOs for most IPC scenarios.
Calling connect() on a UDP socket does not perform a handshake (unlike TCP). Instead, it associates the socket with a specific peer address and records this association in the kernel's socket state. For subsequent send() and recv() calls, the kernel uses the connected peer address, and ECONNREFUSED is returned if the peer is unreachable (ICMP port unreachable). Without connect(), UDP sendto() must specify the destination each time.
Connected UDP provides: automatic use of the peer address (no need to specify on every send), immediate error notification when the peer is unreachable (ICMP errors delivered to socket), and on some systems, improved performance due to reduced address resolution overhead. Connected UDP still has no delivery guarantees but is more efficient for bidirectional UDP communication with a single peer.
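A connected UDP socket can be tried out on the loopback interface. In this sketch the server stays unconnected and the client calls connect(); port 0 asks the kernel for any free port, and the 2 second timeouts are only a safety net.

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))        # port 0: let the kernel pick a free port
server.settimeout(2)
server_addr = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(2)
client.connect(server_addr)          # no packets sent; just records the peer

client.send(b"ping")                 # no destination argument needed
request, client_addr = server.recvfrom(1024)
server.sendto(b"pong", client_addr)  # the unconnected side still names the target
reply = client.recv(1024)            # only datagrams from the peer are delivered

peer = client.getpeername()
client.close()
server.close()
```

The client could have used sendto() with an explicit address instead; connect() simply moves that bookkeeping into the kernel.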
Raw sockets bypass the normal protocol stack layers and allow you to construct custom network packets at the IP level or below. With socket(AF_INET, SOCK_RAW, protocol), you receive and send raw IP datagrams with full control over IP headers and payload. You can even use AF_PACKET sockets (Linux-specific; the older SOCK_PACKET type is obsolete) to access Ethernet frames directly.
Legitimate uses: network diagnostics tools (ping uses raw ICMP sockets), packet sniffers with BPF, custom tunneling protocols (like GRE or IP-in-IP), network simulation, and firewall implementations. Requires CAP_NET_RAW capability. Raw sockets are a security concern because they can be used for reconnaissance, crafting spoofed packets, and network scanning, which is why many production environments restrict or disable them.
Socket buffers (sized by SO_RCVBUF and SO_SNDBUF) are kernel-managed queues for incoming and outgoing data. If the send buffer is full, a send() call blocks or returns EAGAIN (non-blocking), backpressuring the application. If the receive buffer is full, incoming data is dropped (UDP) or the sender's TCP window closes (TCP), reducing throughput.
The kernel auto-tunes these on modern Linux, but high-throughput or low-latency applications may benefit from manual tuning. Increasing the receive buffer helps when receiving bursts of data. For low-latency, smaller buffers reduce queuing delay. For high-bandwidth-delay-product links (like WAN connections), larger buffers allow more data to be in flight. Linux also provides SO_RCVLOWAT to set the minimum number of bytes that must be available before a blocking recv() returns; SO_SNDLOWAT exists but is not changeable on Linux.
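Reading a buffer size back after setting it is worthwhile because the kernel may adjust the request. A short sketch, with 64 KiB chosen purely for illustration:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
default_rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

requested = 64 * 1024                # illustrative size, not a recommendation
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
effective_rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
# Linux reports double the requested value (bookkeeping overhead) and caps
# requests at net.core.rmem_max, so always read the value back.
sock.close()
```

The effective value differs between operating systems, so the sketch does not assume a particular result.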
poll() and select() both copy the file descriptor sets from user space to kernel space and scan all fds linearly on each call, making them O(n) for each notification. poll() uses an array of pollfd structures (no fixed FD_SETSIZE limit like select()), but still has the linear scan problem. Both pass the full descriptor set to the kernel on every call: select() overwrites its fd sets, so they must be rebuilt each time, while poll() reports results in a separate revents field, so the same array can be reused.
epoll uses a kernel-maintained red-black tree of monitored file descriptors and a separate ready list of file descriptors that have events. epoll_wait() returns entries from the ready list without scanning all fds, so its cost scales with the number of ready descriptors rather than the number monitored. It supports edge-triggered mode and one-shot mode (one notification per event until explicitly rearmed). epoll scales to hundreds of thousands of file descriptors efficiently, which is why nginx and other high-performance servers use it.
Nagle's algorithm (enabled by default on TCP sockets) holds back small outgoing segments while earlier data is still unacknowledged, coalescing small writes into larger segments to reduce packet overhead on low-speed links. For interactive, low-latency applications like remote shells or real-time game updates, this buffering can add noticeable latency.
TCP_NODELAY disables Nagle's algorithm, sending data immediately without waiting for ACKs. Use it when you have small, latency-sensitive messages that should be sent immediately: keystrokes in an SSH session, game state updates, real-time chat messages. The tradeoff is more packets on the wire and potentially reduced throughput for bulk transfers. Most interactive applications disable Nagle's algorithm.
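Disabling Nagle's algorithm is a single socket option. The sketch below sets it on an unconnected TCP socket and reads it back; in a real client it would be set on the connected socket that carries the latency-sensitive traffic.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
nodelay_before = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # disable Nagle
nodelay_after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

The option is per socket, so a server has to apply it to each accepted connection that needs it.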
pipe() creates a unidirectional channel with two file descriptors: fd[0] for reading, fd[1] for writing. Data written to fd[1] is read from fd[0]. socketpair(AF_UNIX, SOCK_STREAM) creates a pair of bidirectional, connected stream sockets where both file descriptors can both send and receive.
socketpair() with SOCK_STREAM provides a bidirectional pipe that works with shutdown() (SHUT_RD, SHUT_WR, SHUT_RDWR) for half-close semantics, can be used with select()/poll()/epoll for multiplexed communication, and supports file descriptor passing via sendmsg() with SCM_RIGHTS. Out-of-band data (MSG_OOB) is a TCP feature and is not portably available on Unix domain sockets. A bidirectional protocol is more natural to implement on a socketpair than coordinating two unidirectional pipes.
When the remote end sends FIN (initiating connection close), the local TCP acknowledges it and moves the connection to CLOSE_WAIT, and the application sees end-of-file (recv() returns 0) once buffered data has been read. The application must call close() to complete the close. When it does, the local TCP sends its own FIN and moves to LAST_ACK, where the connection stays until the peer's ACK of that FIN arrives.
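File descriptor passing with SCM_RIGHTS can be sketched in a single process. The helpers send_fd and recv_fd are hypothetical names for this example; both ends normally live in different processes, and a pipe stands in for the resource being handed over.

```python
import array
import os
import socket

def send_fd(sock, fd):
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    # One byte of ordinary data accompanies the ancillary message.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no file descriptor received")

left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pipe_r, pipe_w = os.pipe()

send_fd(left, pipe_r)                # normally the peer is another process
dup_r = recv_fd(right)               # a new descriptor for the same pipe

os.write(pipe_w, b"via passed fd")
received = os.read(dup_r, 64)

for fd in (pipe_r, pipe_w, dup_r):
    os.close(fd)
left.close()
right.close()
```

The receiver gets its own descriptor number referring to the same open file, much like dup() across process boundaries.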
Connections stuck in CLOSE_WAIT indicate the application is not calling close() after receiving the remote close. This commonly happens when the application does not properly handle half-closes from the peer. A socket in CLOSE_WAIT holds kernel resources (receive buffer) until the application calls close(). Properly handling shutdown(SHUT_RDWR) or detecting peer close with zero-length recv() prevents socket leaks.
On Linux, read(fd, buf, len) and recv(sockfd, buf, len, flags) are essentially equivalent for socket file descriptors. The key difference is recv() accepts a flags argument: MSG_PEEK (peek at data without consuming it), MSG_DONTWAIT (non-blocking), MSG_OOB (out-of-band data), and MSG_WAITALL (wait for full request).
read() is the generic POSIX file descriptor operation and works on any file descriptor (pipes, files, sockets). recv() is socket-specific and provides socket-related control via flags. For regular sockets without special flags, they behave identically. Using recv() makes the socket-specific intent explicit and provides access to features that read() cannot express.
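MSG_PEEK is one of the flags that read() cannot express. A minimal sketch on a socket pair, where the 4-byte "HDR:" prefix is just example data:

```python
import socket

a, b = socket.socketpair()
a.sendall(b"HDR:payload")

peeked = b.recv(4, socket.MSG_PEEK)  # look at the first bytes, leave them queued
consumed = b.recv(1024)              # the same bytes are returned again
a.close()
b.close()
```

Peeking is handy for sniffing a protocol header before deciding which parser should consume the data.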
setsockopt() modifies kernel-level socket behavior at various levels: SOL_SOCKET (generic options like SO_REUSEADDR, SO_KEEPALIVE, SO_RCVBUF), SOL_TCP (TCP-specific like TCP_NODELAY, TCP_QUICKACK), SOL_IP (IP-specific like IP_MTU_DISCOVER), and protocol-family-specific options. Each option has a type and value that the kernel interprets according to the protocol's implementation.
getsockopt() retrieves the current value of an option. Options can be integer values, structs, or binary blobs depending on the option. Some options are read-only (determined by the kernel or connection state) and return errors when set. Setting options before bind() or connect() is important because some options affect connection establishment and cannot be changed after the socket is fully established.
close() closes the file descriptor and, when the last reference to the socket is closed, initiates the TCP close sequence (FIN exchange) if it is a connected socket. It releases the file descriptor and kernel resources. If multiple processes share the socket (via fork), each close() decrements the reference count and only the last one initiates the TCP close.
shutdown() operates at the socket level, shutting down one or both directions of the connection: SHUT_RD (no more receptions allowed, receive buffer is discarded), SHUT_WR (no more sends, initiates FIN), SHUT_RDWR (both directions). shutdown() is useful for half-close patterns where one direction closes while the other remains open, which close() cannot express. shutdown() does not release the file descriptor itself.
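The half-close pattern can be sketched with a socket pair in one process. The names client and server are only labels for the two ends.

```python
import socket

client, server = socket.socketpair()

client.sendall(b"request body")
client.shutdown(socket.SHUT_WR)      # done sending; reading stays open

# Server side: read until EOF, which the half-close produces.
request = b""
while True:
    chunk = server.recv(1024)
    if not chunk:                    # b"" means the peer shut down its write side
        break
    request += chunk

server.sendall(b"response")          # the reverse direction still works
server.close()

response = client.recv(1024)
client.close()                       # close() finally releases the descriptor
```

Had the client called close() instead of shutdown(SHUT_WR), it would have had no descriptor left to read the response from.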
io_uring is a Linux kernel interface (5.1+) that enables high-performance asynchronous I/O for sockets, files, and other I/O operations. Unlike select/poll/epoll, which only report readiness and leave the actual I/O to separate system calls, io_uring uses a pair of ring buffers (submission queue and completion queue) shared between the application and kernel. The application submits I/O operations by filling the submission queue ring buffer; the kernel processes them and returns results in the completion queue ring buffer.
Key advantages: true async send/recv with IORING_OP_SEND and IORING_OP_RECV that return immediately and notify via completion queue, batched submissions that amortize syscall overhead, and features like registered (fixed) buffers that avoid per-operation buffer setup in the kernel. For high-frequency socket I/O, io_uring can reduce system call overhead significantly compared to epoll with non-blocking send/recv. Libraries like liburing make io_uring accessible to applications.
Further Reading
- socket(7) — Linux man page — Socket API overview and options
- unix(7) — Linux man page — Unix domain socket API
- tcp(7) — Linux man page — TCP socket options and behavior
- epoll(7) — Linux man page — Scalable I/O event notification for sockets
Conclusion
Sockets represent the pinnacle of IPC flexibility: bidirectional, connection-oriented or connectionless, local or network-spanning. Unix domain sockets provide high-performance local communication, while TCP/UDP sockets extend that capability across network boundaries. The layered API (socket, bind, listen, accept, connect, send, recv) has remained remarkably stable since BSD introduced it decades ago.
Mastering socket programming means mastering edge cases: partial sends and receives requiring loops, EINTR handling for interrupted system calls, TIME_WAIT state and SO_REUSEADDR for server restarts, and appropriate use of select/poll/epoll for scalable multiplexing. These fundamentals apply whether you’re building a local daemon, a network service, or a distributed system.
Looking forward, the evolution continues with zero-copy socket APIs, io_uring integration for asynchronous I/O, and increasingly sophisticated offloading to network hardware. The socket API adapts, but the underlying principles of bidirectional communication endpoints and kernel-mediated data transfer remain constant.