Assembly Language Basics: Writing Code the CPU Understands
Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.
Introduction
Assembly language sits at the boundary between human-readable code and machine execution. Each instruction in assembly maps almost directly to a single CPU instruction, giving programmers complete control over what the processor does. High-level languages abstract away hardware for portability, but assembly shows you every register allocation, every memory access, every branch decision.
Understanding assembly helps if you’re serious about operating systems development, performance optimization, reverse engineering, or embedded systems. Even if you never write production assembly, reading it shows you what the compiler generates, why some code patterns perform better, and how security exploits actually work.
When to Use / When Not to Use
When to Use Assembly:
- Writing critical performance kernels where every cycle counts (crypto primitives, signal processing)
- Implementing operating system primitives that must control hardware directly (context switching, interrupt handlers)
- Creating bootloaders and firmware with no operating system support
- Reverse engineering and security research
- Compiler backend work and code generation research
When Not to Use:
- Application development where portability across platforms matters
- Teams without assembly expertise—maintainability suffers dramatically
- When development speed matters more than micro-optimization
- Anywhere a well-tuned C implementation achieves sufficient performance
Core Concepts Architecture
flowchart TB
subgraph "Assembly Development Flow"
A["Assembly Source<br/>.s file"] --> B["Assembler"]
B --> C["Object Code<br/>.o file"]
C --> D["Linker"]
D --> E["Executable"]
end
subgraph "Assembly Structure"
F["Labels"] --> G["Instructions"]
G --> H["Operands"]
H --> I["Directives"]
end
subgraph "CPU Execution"
J["Registers"] --> K["ALU"]
J --> L["Memory"]
K --> M["Flags/Status"]
end
Core Concepts
The Register File
Registers are the CPU’s fastest storage—built directly into the processor silicon with access times measured in picoseconds. Understanding register usage is fundamental to assembly programming.
x86-64 General Purpose Registers:
| Register | Purpose | Caller-saved? | Notes |
|---|---|---|---|
| RAX | Accumulator, return values | Yes | Widely used for small operations |
| RBX | General purpose | No | Callee-saved |
| RCX | Counter, 4th argument | Yes | Used by shifts and loops |
| RDX | Data, 3rd argument | Yes | Multiplication output |
| RSI | Source index, 2nd argument | Yes | String operations source |
| RDI | Destination index, 1st argument | Yes | String operations destination |
| RBP | Base pointer, frame reference | No | Callee-saved, optional frame pointer |
| RSP | Stack pointer | Special | Points to top of stack |
| R8-R15 | General purpose | Yes (R8-R11), No (R12-R15) | New in x86-64 |
ARM64 General Purpose Registers:
| Register | Purpose | Notes |
|---|---|---|
| X0-X7 | Arguments, return values | X0 also return value |
| X8-X15 | Caller-saved | Temporary registers |
| X19-X28 | Callee-saved | Preserved across calls |
| SP | Stack pointer | Must be 16-byte aligned |
| PC | Program counter | Not directly readable |
| XZR | Zero register | Reads as zero, discards writes |
| W0-W30 | 32-bit views of X0-X30 | Lower half of 64-bit regs; writing a W register clears the upper 32 bits |
Basic Arithmetic Instructions
x86-64 Arithmetic:
; Addition
add rax, rbx ; rax = rax + rbx
add rax, 42 ; rax = rax + 42 (immediate)
; Subtraction
sub rsp, 8 ; rsp = rsp - 8 (allocate stack space)
sub rdi, rsi ; rdi = rdi - rsi
; Multiplication
mul rdx ; rdx:rax = rax * rdx (unsigned)
imul rcx, rbx ; rcx = rcx * rbx (signed, two-operand form)
; Division
div rcx ; rdx:rax / rcx, quotient in rax, remainder in rdx
idiv rbx ; signed division
; Increment/Decrement
inc rax ; rax++
dec rax ; rax--
ARM64 Arithmetic:
; Addition
add x0, x1, x2 ; x0 = x1 + x2 (three-operand)
add x0, x0, 42 ; x0 = x0 + 42 (immediate)
add x0, x1, x1, lsl 2 ; x0 = x1 + (x1 << 2) = x1 * 5
; Subtraction
sub sp, sp, 16 ; sp = sp - 16 (allocate stack)
subs x0, x1, x2 ; x0 = x1 - x2, set flags
; Multiplication
mul x0, x1, x2 ; x0 = x1 * x2
madd x0, x3, x4, x5 ; x0 = x5 + (x3 * x4)
; Division (available on ARMv8-A)
udiv x0, x1, x2 ; x0 = x1 / x2 (unsigned)
sdiv x0, x1, x2 ; x0 = x1 / x2 (signed)
Memory Operations
x86-64 Memory Access:
; Load from memory
mov rax, [rbx] ; rax = *rbx (64-bit load)
mov eax, [rbx] ; eax = *rbx (32-bit load, zero-extends to 64-bit)
movzx rax, byte [rbx] ; rax = *(uint8_t*)rbx (zero-extend)
movsx rax, byte [rbx] ; rax = *(int8_t*)rbx (sign-extend)
; Store to memory
mov [rdi], rax ; *rdi = rax
mov byte [rsi], 42 ; *(uint8_t*)rsi = 42
; Address calculation (LEA - Load Effective Address)
lea rax, [rbx + rcx] ; rax = rbx + rcx (no memory access)
lea rax, [rdi + rdi*4] ; rax = rdi + rdi*4 (useful for array indexing)
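The difference between movzx and movsx is easiest to see from C, where the pointer type decides which extension the compiler picks. The following is a minimal sketch; the helper names load_u8 and load_s8 are hypothetical and only mirror the two byte loads above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: a compiler typically lowers load_u8 to movzx
   and load_s8 to movsx on x86-64. */
uint64_t load_u8(const void *p) {
    assert(p != NULL);
    return *(const uint8_t *)p;   /* zero-extend: upper 56 bits become 0 */
}

int64_t load_s8(const void *p) {
    assert(p != NULL);
    return *(const int8_t *)p;    /* sign-extend: bit 7 is copied upward */
}
```

The same byte, 0x80, therefore reads back as 128 through load_u8 and as -128 through load_s8.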
ARM64 Memory Access:
; Load from memory
ldr x0, [x1] ; x0 = *x1
ldr x0, [x1, 8] ; x0 = *(x1 + 8) (base plus offset, x1 unchanged)
ldr x0, [x1], 8 ; x0 = *x1; x1 = x1 + 8 (post-index)
ldr w0, [x1] ; w0 = *(uint32_t*)x1 (upper 32 bits of x0 cleared)
; Store to memory
str x0, [x1] ; *x1 = x0
str x0, [sp, -16]! ; sp -= 16; *sp = x0 (pre-index update)
; Load pair (efficient for function prologues/epilogues)
ldp x0, x1, [sp], 16 ; x0 = *sp; x1 = *(sp+8); sp += 16
stp x0, x1, [sp, -16]! ; sp -= 16; *sp = x0; *(sp+8) = x1
Control Flow
x86-64 Branches:
; Unconditional jump
jmp label ; goto label
; Conditional jumps (read the flags set by a previous cmp or test)
je label ; if (zf) goto label (equal)
jne label ; if (!zf) goto label (not equal)
jl label ; if (sf != of) goto label (less than, signed)
jg label ; if (zf == 0 && sf == of) goto label (greater than, signed)
jb label ; if (cf) goto label (below, unsigned)
ja label ; if (cf == 0 && zf == 0) goto label (above, unsigned)
; Comparison
cmp rax, rbx ; sets flags based on rax - rbx
test rax, rax ; sets flags based on rax & rax (for zero check)
; Function call
call function_label ; push return_address; jmp function
ret ; pop address; jmp to it
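To make the flag conditions in the comments above concrete, here is a small C model of what cmp computes and how four of the conditional jumps read the result. It is a sketch for reasoning, not a description of any real API: the Flags struct and the takes_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the four flags that `cmp a, b` derives from a - b. */
typedef struct { bool zf, sf, cf, of; } Flags;

Flags cmp_flags(uint64_t a, uint64_t b) {
    uint64_t r = a - b;                       /* wraps modulo 2^64, like the ALU */
    Flags f;
    f.zf = (r == 0);
    f.sf = (r >> 63) != 0;                    /* sign bit of the result */
    f.cf = a < b;                             /* unsigned borrow */
    f.of = (((a ^ b) & (a ^ r)) >> 63) != 0;  /* signed overflow on subtraction */
    return f;
}

bool takes_jl(Flags f) { return f.sf != f.of; }           /* signed less */
bool takes_jg(Flags f) { return !f.zf && f.sf == f.of; }  /* signed greater */
bool takes_jb(Flags f) { return f.cf; }                   /* unsigned below */
bool takes_ja(Flags f) { return !f.cf && !f.zf; }         /* unsigned above */
```

Comparing the bit pattern of -1 against 1 shows why the signed and unsigned jumps differ: jl is taken because -1 < 1 as signed values, while ja is taken because the same bits are the largest unsigned value.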
ARM64 Branches:
; Unconditional
b label ; goto label
bl label ; link register = pc+4; goto label (call)
ret ; branch to x30 (return address)
; Conditional (many options)
cbnz x0, label ; if (x0 != 0) goto label
cbz x0, label ; if (x0 == 0) goto label
tbnz x0, 3, label ; if (bit 3 of x0 != 0) goto label
tbz x0, 3, label ; if (bit 3 of x0 == 0) goto label
; Compare and branch
cmp x0, x1
b.eq label ; if (x0 == x1) goto label
b.ne label ; if (x0 != x1) goto label
b.lt label ; if (x0 < x1) goto label (signed)
b.lo label ; if (x0 < x1) goto label (unsigned)
; Conditional select (no branch needed)
csel x0, x1, x2, eq ; x0 = (eq) ? x1 : x2
cset x0, eq ; x0 = (eq) ? 1 : 0
cinc x0, x0, ne ; x0 = (ne) ? x0+1 : x0 (conditional increment)
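The conditional select family can be modelled in C with a mask instead of a branch. This is only an illustration of the semantics in the comments above; the function names csel_model, cset_model, and cinc_model are hypothetical, and cond stands for the already-evaluated condition (1 or 0).

```c
#include <assert.h>
#include <stdint.h>

/* csel: pick a when cond holds, else b, without a branch.
   mask is all ones when cond is 1 and all zeros when cond is 0. */
uint64_t csel_model(uint64_t a, uint64_t b, int cond) {
    assert(cond == 0 || cond == 1);
    uint64_t mask = (uint64_t)0 - (uint64_t)cond;
    return (a & mask) | (b & ~mask);
}

/* cset and cinc are special cases of the same selection. */
uint64_t cset_model(int cond) { return csel_model(1, 0, cond); }
uint64_t cinc_model(uint64_t x, int cond) { return csel_model(x + 1, x, cond); }
```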
Stack Operations
x86-64 Function Prologue/Epilogue:
function:
push rbp ; save old base pointer
mov rbp, rsp ; establish new frame
sub rsp, 32 ; allocate 32 bytes for locals
; ... function body ...
mov rsp, rbp ; restore stack pointer
pop rbp ; restore base pointer
ret ; return to caller
ARM64 Stack Frame:
function:
; No explicit frame pointer needed in simple cases
; 16-byte alignment is required at function call
stp x29, x30, [sp, -16]! ; save FP and LR with pre-index
mov x29, sp ; x29 is frame pointer (optional)
; ... function body ...
ldp x29, x30, [sp], 16 ; restore and post-index
ret
Production Failure Scenarios
Scenario 1: Stack Buffer Overflow Exploits
Failure: A function writing beyond a local buffer’s bounds overwrites the return address on the stack. Attacker redirects execution to shellcode.
Example vulnerable pattern:
; Vulnerable: no bounds checking on buffer
vulnerable:
sub rsp, 64
mov rdi, rsp
call get_user_input ; writes beyond buffer if input > 64 bytes
add rsp, 64
ret
Mitigation:
- Use stack canaries (-fstack-protector compiler flag)
- Enable NX (No-Execute) bit so stack memory isn’t executable
- Use safe string functions (strncpy instead of strcpy)
- Implement stack smashing protection in the runtime
Scenario 2: Register Usage Violations in Mixed Assembly/C
Failure: Writing assembly that doesn’t preserve callee-saved registers corrupts caller’s state.
Example mistake:
; BROKEN: modifies RBX (callee-saved) without preserving it
broken_function:
mov rbx, rcx ; RBX is callee-saved! Must push/pop
; ... do work ...
ret
; CORRECT: preserve RBX
correct_function:
push rbx ; save it
mov rbx, rcx
; ... do work ...
pop rbx ; restore it
ret
Mitigation:
- Always know which registers are callee-saved vs caller-saved per ABI
- In inline assembly, list every register you modify in the clobber list and use constraint modifiers ("=&r" for early-clobber)
- Write assembly functions in separate files with clear documentation
- Use automated testing with valgrind or sanitizers
Scenario 3: ARM64 SP Misalignment Crashes
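Failure: On ARM64 the stack pointer must be 16-byte aligned whenever it is used as the base of a memory access. When SP alignment checking is enabled (the typical configuration on Linux), a misaligned SP raises an alignment fault and the program crashes.
Mitigation:
- Always adjust SP in multiples of 16: sub sp, sp, 16 not sub sp, sp, 8
- Use stp and ldp for paired stack operations
- If alignment has to be forced, mask through a temporary register, because the and instruction cannot read SP as a source: mov x9, sp followed by and sp, x9, -16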
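The masking step is plain bit arithmetic and can be checked in C. A minimal sketch, assuming the alignment is a power of two; align_down and is_aligned are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to a multiple of `align` (a power of two).
   With align = 16 this clears the low four bits, the same effect as
   and-ing the stack pointer with -16. */
uintptr_t align_down(uintptr_t addr, uintptr_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

int is_aligned(uintptr_t addr, uintptr_t align) {
    return align_down(addr, align) == addr;
}
```

Rounding down is the right direction for a stack that grows toward lower addresses: the result never moves back into memory that is already in use.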
Trade-off Table
| Aspect | Hand-Written Assembly | Compiler-Generated | Notes |
|---|---|---|---|
| Performance | Maximum control | Very good, often optimal | Humans rarely beat compilers at micro-optimization |
| Maintainability | Difficult | Easy | Assembly harder to modify and debug |
| Portability | Architecture-specific | Automatic via compiler | One C source → many ISAs |
| Size | Can be smaller | May include unused code | Link-time optimization helps |
| Reliability | Error-prone | Well-tested compiler | Compilers have decades of bug fixes |
| Security | Easy to make mistakes | Hardening available | Compiler can add canaries and similar protections when enabled |
Implementation Snippets
Complete x86-64 Function: strlen
; size_t strlen(const char *str)
; Input: RDI = pointer to null-terminated string
; Output: RAX = length
strlen:
xor rax, rax ; initialize counter to 0
mov rdx, rdi ; copy pointer to RDX for scanning
.loop:
mov cl, [rdx] ; load current byte into CL (low 8 bits of RCX)
test cl, cl ; test if byte is zero
jz .done ; if zero, we're done
inc rax ; increment counter
inc rdx ; advance pointer
jmp .loop ; continue
.done:
ret
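For readers more comfortable in C, the same loop can be written line for line. This is a sketch for comparison only; the function is named my_strlen (a hypothetical name) so that it does not clash with the standard library.

```c
#include <assert.h>
#include <stddef.h>

/* C rendering of the assembly loop; each statement notes the
   instruction it corresponds to. */
size_t my_strlen(const char *str) {
    assert(str != NULL);
    size_t count = 0;            /* xor rax, rax */
    const char *p = str;         /* mov rdx, rdi */
    while (*p != '\0') {         /* mov cl, [rdx]; test cl, cl; jz .done */
        count++;                 /* inc rax */
        p++;                     /* inc rdx */
    }
    return count;                /* ret, result in rax */
}
```

Compiling this with -S and comparing the output against the hand-written version is a useful exercise: optimizing compilers often produce a tighter loop by deriving the length from the final pointer instead of keeping a separate counter.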
Complete ARM64 Function: strcmp
; int strcmp(const char *s1, const char *s2)
; Input: X0 = s1, X1 = s2
; Output: X0 = comparison result (< 0, 0, or > 0)
strcmp:
ldrb w2, [x0], 1 ; load byte from s1, post-increment
ldrb w3, [x1], 1 ; load byte from s2, post-increment
subs w2, w2, w3 ; w2 = w2 - w3, set flags
cbnz w2, .done ; if difference != 0, we're done
cbnz w3, .loop ; if s2 byte != 0, continue
; both bytes were 0, strings are equal
mov x0, 0 ; return 0
ret
.loop:
ldrb w2, [x0], 1
ldrb w3, [x1], 1
subs w2, w2, w3
cbz w3, .done ; if s2 byte == 0, we're done (return diff)
cbnz w2, .done ; if difference != 0, we're done
b .loop ; continue
.done:
sxtw x0, w2 ; sign-extend 32-bit result to 64-bit
ret
High-Level Language Inline Assembly Example
// C function with inline assembly for critical loop
unsigned long long popcount(unsigned long long x) {
unsigned long long count;
__asm__(
"popcnt %1, %0"
: "=r"(count) // output: register operand
: "r"(x) // input: register operand
);
return count;
}
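// More complex: sum of an array with a counted loop
// (x86-64 only; GCC/Clang inline asm uses AT&T syntax by default)
void sum_array(const unsigned *arr, size_t len, unsigned *result) {
unsigned sum = 0;
if (len != 0) { // the asm loop runs at least once, so skip it for empty input
__asm__(
"1:\n\t"
"addl (%[arr]), %[sum]\n\t" // sum += *arr
"addq $4, %[arr]\n\t" // advance to the next element
"decq %[len]\n\t" // len--, sets ZF when it reaches zero
"jnz 1b"
: [sum] "+r"(sum), [arr] "+r"(arr), [len] "+r"(len) // all three are modified
:
: "cc", "memory"
);
}
*result = sum;
}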
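Inline assembly is easy to get subtly wrong, so it helps to keep a portable reference next to it and compare results in tests. The sketch below is one such reference in plain C; popcount_ref and sum_array_ref are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable bit count: clears the lowest set bit once per iteration. */
unsigned popcount_ref(uint64_t x) {
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* Portable array sum; unsigned addition wraps on overflow, like add. */
unsigned sum_array_ref(const unsigned *arr, size_t len) {
    assert(arr != NULL || len == 0);
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += arr[i];
    }
    return sum;
}
```

Running both the assembly and the reference over the same inputs, including zero, all-ones, and an empty array, catches most constraint and loop-bound mistakes.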
Observability Checklist
When debugging assembly code:
- Disassembly listing: Compile with -S or use objdump -d to see generated code
- Register values: Use info registers in GDB or register read in LLDB
- Memory examination: Use x/10x to examine memory in hex, x/s for strings
- Single-stepping: Use stepi (instruction level) vs step (source level)
- Breakpoints: Set at function entry, loop boundaries, and call sites
- Instruction count: Profile with perf stat to count retired instructions
- Pipeline stalls: Use performance counters to detect cache misses and mispredictions
- Stack trace: Verify frame pointers are intact for accurate backtraces
Essential Tools:
- objdump -d binary: disassemble any binary
- gdb / lldb: debug with disassemble, info registers, x/i
- perf: profile with hardware counters
- valgrind: memory access validation
- strace / ltrace: system call and library call tracing
Common Pitfalls / Anti-Patterns
Security Considerations
- Code Injection: Assembly programs must carefully validate all inputs. Unlike managed languages, there’s no bounds checking built into memory access instructions.
- Return-Oriented Programming (ROP): Even with NX stacks, attackers can chain existing code fragments (gadgets). Mitigations:
  - Stack canaries detect overwrites before return
  - ASLR randomizes memory layout
  - Control Flow Integrity (CFI) validates indirect jump targets
- Timing Attacks: Constant-time programming in assembly is critical for crypto:
; VULNERABLE: branch depends on secret data
cmp byte [key + i], 0
je .done
; SAFE: branch-free, the result is materialized without a jump
cmp byte [key + i], 0
setne al ; al = (key[i] != 0) ? 1 : 0
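The same idea applies one level up, in C. The sketch below compares two buffers without an early exit, so the number of loop iterations does not reveal where the first mismatch sits. ct_equal is a hypothetical name, and this is an illustration under the assumption that the compiler keeps the code branch-free; for real cryptographic code the generated assembly still has to be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when the buffers match, 0 otherwise. Every byte is read
   regardless of where the first difference occurs. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len) {
    assert(a != NULL && b != NULL);
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++) {
        diff |= (uint8_t)(a[i] ^ b[i]);   /* accumulate differences, no early exit */
    }
    /* diff == 0 gives 0xFFFFFFFF >> 31 == 1; any other value gives 0. */
    return (int)(((uint32_t)diff - 1u) >> 31);
}
```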
Compliance Notes
- Secure coding standards: CERT C/C++ and MISRA have rules about inline assembly
- FIPS 140-3: Cryptographic modules require constant-time implementations
- Safety standards (DO-178C, ISO 26262): Assembly in safety-critical code requires rigorous verification
Common Pitfalls / Anti-patterns
- Forgetting to Preserve Callee-Saved Registers: Modifying RBX, RBP, R12-R15 in x86-64 without saving them corrupts the caller’s state.
- Stack Misalignment: x86-64 requires 16-byte stack alignment at call. Violations typically crash inside external libraries that use aligned SSE instructions such as movaps on stack data.
- Sign-Extension Mistakes: Writing a 32-bit register zero-extends into the full 64-bit register, so a negative 32-bit value turns into a large positive 64-bit one. Use movsxd for signed 32-bit values and movsx for signed 8- and 16-bit values.
- Assuming Memory Ordering: On ARM, without explicit ordering, loads and stores to different addresses may be reordered. Use acquire/release instructions (ldar, stlr) or dmb ish barriers around shared memory operations.
- Using EBX Instead of RBX in x86-64: A write to any 32-bit register (EAX, EBX, ECX, EDX, and the rest) zeroes the upper 32 bits of the 64-bit register. Writing mov ebx, 1 clears RBX’s upper bits, potentially breaking code that expects them preserved.
- Scaling Errors in Array Indexing: Offsets are in bytes, so the index must be multiplied by the element size. Accessing arr[i] where arr is int* requires base + i*4, not base + i.
- Not Using Frame Pointer for Debugging: Omitting frame pointers (-fomit-frame-pointer) breaks frame-pointer-based stack walking in profilers and in debuggers that have no unwind information to fall back on.
Quick Recap Checklist
- Assembly language maps directly to machine instructions defined by the ISA
- Understanding register purpose (callee-saved vs caller-saved) is essential for correct function writing
- Memory access requires careful consideration of size, alignment, and signed vs unsigned extension
- Control flow in assembly uses condition codes set by previous operations; cmp and test are common
- ARM’s conditional execution and conditional select can eliminate branches entirely
- Always preserve callee-saved registers; caller-saved registers are the caller’s responsibility
- Stack must be 16-byte aligned on x86-64 and ARM64 at function call boundaries
- Inline assembly requires correct constraints to communicate with the C calling environment
- Use performance counters and disassemblers to verify what code actually executes
- Security in assembly requires explicit defenses—there’s no automatic protection
Interview Questions
Caller-saved registers (also called volatile or temporary registers) may be overwritten by the callee. If the caller needs their value after the call, it must save and restore them itself.
Callee-saved registers (also called preserved or non-volatile registers) must be preserved across a function call by the callee. If a function wants to use them, it must save on entry and restore before return.
This split optimizes for the common case. A callee can use caller-saved registers as scratch space with no save/restore cost, and a caller only spills the ones that are live across the call. Callee-saved registers let a caller keep long-lived values in registers across calls, and the callee pays the save/restore cost only for the ones it actually uses.
The stack frame is a region of stack memory allocated for a single function call. The stack pointer (SP) always points to the current top of the stack and moves as data is pushed and popped. The base pointer (BP), also called frame pointer, provides a fixed reference point within the frame.
When a function begins, it saves the old BP, sets BP to the current SP, then allocates space for locals by subtracting from SP. Throughout the function, SP might move as we allocate more stack or call other functions. BP stays fixed, so we can access parameters and locals at known, constant offsets from BP, like [rbp-8] for the first local or [rbp+16] for the first stack-passed argument (on x86-64 the first six integer arguments arrive in registers).
Modern compilers can omit the frame pointer with -fomit-frame-pointer and address locals relative to SP instead, which frees BP for general use. Debuggers and exception handling then rely on unwind tables (such as DWARF CFI) rather than the BP chain, so BP remains important for tools that walk the stack without that metadata.
LEA (Load Effective Address) computes the address of a memory location without actually accessing memory—it puts the calculated address into a register. lea rax, [rbx + rcx*4] computes rbx + rcx*4 and stores the result in RAX.
This is useful for address calculation because LEA can perform addition and multiplication by 2, 4, or 8 in a single instruction with no memory access and no flags affected. For example, to compute &array[i] where i is in RCX and array is at RBX: lea rax, [rbx + rcx*8].
Compilers often use LEA for arithmetic too: lea rax, [rbx + rbx*4] computes rbx * 5 efficiently. The instruction is also commonly used for address computation in loop-indexed array access patterns.
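The address arithmetic that LEA performs can be written out as a one-line formula. The following C model is a sketch for checking such computations by hand; lea_model is a hypothetical name, and the scale is restricted to the values the x86 addressing mode supports.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of lea dst, [base + index*scale + disp]. */
uint64_t lea_model(uint64_t base, uint64_t index, unsigned scale, int32_t disp) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    /* Wraps modulo 2^64, and no memory is touched. */
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

For example, lea_model(7, 7, 4, 0) gives 35, which is the multiply-by-5 trick, and lea_model(0x1000, 3, 8, 0) gives 0x1018, the address of element 3 in an array of 8-byte values starting at 0x1000.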
ARM64 provides multiple mechanisms for conditional execution that reduce branch dependency. The Compare and Branch instructions combine comparison and conditional jump: cbnz x0, label branches if X0 is not zero. Conditional Select instructions select between values without branching: csel x0, x1, x2, eq sets X0 to X1 if equal, otherwise X2.
In contrast, x86 normally follows a compare with a separate conditional jump, although cmovcc and setcc offer branch-free alternatives. ARM64's conditional select implements simple conditionals like x = (a == b) ? c : d as a cmp followed by a single csel, avoiding branch misprediction penalties.
32-bit ARM goes further: in A32 mode most instructions carry a condition field and execute conditionally based on the status flags, and Thumb-2 offers the same through IT blocks. AArch64 dropped this general conditional execution in favor of the conditional select family.
Both x86-64 and ARM64 require the stack pointer to be 16-byte aligned at the point of a function call. On x86-64, CALL pushes an 8-byte return address, so a stack that was aligned before CALL is off by 8 on entry to the callee, and the callee must restore alignment before calling anything else. ARM64's BL stores the return address in X30 instead of pushing it, so SP is unchanged by the call.
In x86-64, push rbp restores 16-byte alignment, so any further allocation must be a multiple of 16: sub rsp, 32 allocates 32 bytes even if you only need 24. Without the push, the adjustment must be an odd multiple of 8 instead (for example sub rsp, 24).
In ARM64, use sub sp, sp, 16 not sub sp, sp, 8. The stp and ldp instructions (store/load pair) naturally maintain alignment when used with pre-index or post-index update.
On x86-64, misalignment causes crashes when calling external libraries that assume correct alignment, for example code that uses aligned SSE loads and stores. On ARM64, a misaligned SP raises an alignment fault as soon as it is used to access memory, provided SP alignment checking is enabled.
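The x86-64 rule reduces to a small calculation. The sketch below works out how many bytes to subtract from RSP for a given amount of local storage. It assumes the System V x86-64 ABI, where RSP is 8 bytes past a 16-byte boundary on function entry, and frame_adjust is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes to subtract from RSP so that RSP is 16-byte aligned at the next call.
   Assumption: RSP % 16 == 8 on entry, because the caller's call pushed an
   8-byte return address. pushes = number of 8-byte registers pushed in the
   prologue (1 for a plain push rbp). */
size_t frame_adjust(size_t locals, size_t pushes) {
    size_t used = 8 + 8 * pushes;                 /* return address + saved registers */
    size_t total = used + locals;
    size_t padded = (total + 15) & ~(size_t)15;   /* round up to a multiple of 16 */
    assert(padded % 16 == 0);
    return padded - used;
}
```

With one push and 24 bytes of locals the answer is 32, which matches the usual sub rsp, 32. With no push and the same locals the answer is 24.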
Position-independent code can execute correctly regardless of where it's loaded in memory. Instead of absolute addresses, PIC uses PC-relative addressing or accesses Global Offset Table (GOT) entries for global data.
Uses:
- Shared libraries (.so/.dll) — loaded at different addresses in different processes
- ASLR (Address Space Layout Randomization) — security feature that randomizes memory addresses
- Kernel modules — must load at arbitrary addresses in kernel memory
In x86-64, use RIP-relative addressing: mov rax, [rel symbol]. In ARM64, PC-relative loads are the default. Compilers generate PIC with -fPIC flag.
The EFLAGS register contains condition flags set by arithmetic and logical operations:
- ZF (Zero Flag): Set when result is zero
- SF (Sign Flag): Set when result is negative (MSB=1)
- CF (Carry Flag): Set on unsigned overflow
- OF (Overflow Flag): Set on signed overflow
- PF (Parity Flag): Set when low 8 bits have even number of 1s
Common patterns:
cmp rax, rbx ; sets ZF if equal; signed rax < rbx when SF != OF
test rax, rax ; sets ZF if rax is zero, SF if negative
jo label ; jump if overflow flag set
jnc label ; jump if carry flag clear
The test instruction is preferred for checking if a register is zero because it has a shorter encoding than cmp rax, 0 and, like cmp, sets flags without changing the register value.
ldr x0, [x1] loads a 64-bit value from the address in X1 into X0. Natural (8-byte) alignment is preferred; unaligned addresses are allowed for normal memory unless alignment checking is enabled.
ldrb w0, [x1] loads a single byte (8 bits) into the low 8 bits of W0 and zero-extends it; writing W0 also clears the upper 32 bits of X0. Byte loads have no alignment requirement.
Related variants:
- ldrh: Load halfword (16 bits) into W register
- ldrsb: Load signed byte (sign-extends to the destination width, 32 or 64 bits)
- ldrsh: Load signed halfword (sign-extends)
- ldnp: Load non-temporal pair (cache hints)
- ldxp/ldaxp: Load exclusive pair (for atomic operations)
Using the wrong variant (e.g., ldrb when the byte is signed and ldrsb was needed) causes subtle bugs. Always match the load size and extension to the type of the data being read.
str (Store Register) stores a single register to memory. stm (Store Multiple) stores several registers in one instruction, writing them to consecutive memory locations starting at the base register. stm exists only in 32-bit ARM (A32/Thumb); AArch64 removed it.
str x0, [x1] writes the 64-bit value in X0 to the address in X1. In A32, stmia r0!, {r1, r2, r3} writes R1 to [r0], R2 to [r0+4], R3 to [r0+8], and increments R0 by 12 (the ! is writeback, which updates the base register).
Use str for single values or when you need precise control over addressing. On 32-bit ARM, use stm for saving/restoring multiple registers (function prologue/epilogue, context switching); it is more compact than multiple str instructions.
On AArch64, ldp/stp (load/store pair) take over that role: they handle exactly two registers per instruction and are the standard way to save and restore registers. Atomic pair access uses the separate exclusive instructions ldxp/stxp.
AT&T syntax is the default for the GNU assembler (GAS) and GCC on Linux. Intel syntax is used by MASM, NASM, and in Intel documentation.
Key differences:
- Operand order: AT&T is instruction source, destination. Intel is instruction destination, source. So mov eax, ebx in Intel (move EBX to EAX) is movl %ebx, %eax in AT&T.
- Register naming: AT&T prefixes registers with % (e.g., %eax). Intel uses plain names (eax).
- Immediate values: AT&T prefixes with $ (e.g., $42). Intel uses plain 42.
- Memory references: AT&T uses disp(base, index, scale) syntax. Intel uses [base + index*scale + disp].
- Operand size: AT&T suffixes the instruction with b, w, l, q (byte, word, long, quad). Intel infers from operands.
Example comparison:
; Intel syntax (NASM/FASM)
mov eax, [ebp+8]
add eax, [ebp+12]
# AT&T syntax (GAS)
movl 8(%ebp), %eax
addl 12(%ebp), %eax
On Linux, GCC output and most GAS sources use AT&T syntax by default. Intel syntax is available through NASM or through the GNU assembler's .intel_syntax noprefix directive.
The EFLAGS register is a 32-bit register containing status flags and control bits. Key flags:
- ZF (Zero Flag): Set when result is zero (or equal after SUB/CMP)
- SF (Sign Flag): Set when result is negative (MSB=1)
- OF (Overflow Flag): Set when signed arithmetic overflows
- CF (Carry Flag): Set when unsigned arithmetic overflows
- PF (Parity Flag): Set when low 8 bits have even number of 1s
- AF (Auxiliary Carry): BCD arithmetic; rarely used
Control bits:
- DF (Direction Flag): String operations increment (0) or decrement (1) pointers
- IF (Interrupt Enable): Enable/disable maskable interrupts
Instructions that set flags: cmp, test, arithmetic (add, sub, inc, dec), logical (and, or, xor). Note that inc and dec leave CF unchanged.
Instructions that read flags: conditional jumps (je, jne, jl, etc.), setcc (set byte if condition), cmovcc (conditional move).
The flags are set by the previous operation—always check what instruction set them and when they were set.
The red zone is a 128-byte area below the stack pointer that the System V x86-64 ABI guarantees will not be overwritten asynchronously, for example by signal handlers. Compilers use this area for local variables without adjusting the stack pointer.
; Leaf function: no sub rsp needed
mov [rsp-8], rdi ; first local, inside the red zone
mov [rsp-16], rsi ; second local
; The red zone is the 128 bytes below RSP
; No need to adjust RSP if we need at most 128 bytes
Red zone usage:
- Compiler skips sub rsp, N when the locals of a leaf function fit in 128 bytes
- Access locals as [rsp-8], [rsp-16], etc.
- If a signal arrives, the handler's frame is built below the red zone, so our locals are not corrupted
Red zone limitations:
- Only safe while nothing is pushed: a call or push writes into the area just below RSP
- Leaf functions can use the red zone freely; non-leaf functions must allocate their locals by adjusting RSP
- Windows x64 ABI has no red zone; it instead reserves 32 bytes of shadow space above the return address, where the callee can spill register arguments
This optimization reduces stack pointer adjustments, improving performance in leaf functions.
Variadic functions accept a variable number of arguments. On x86-64 System V ABI:
Argument passing: The first 6 integer arguments are in RDI, RSI, RDX, RCX, R8, R9. The rest go on the stack.
Variadic handling: Use the va_list structure in C, but in pure assembly you must handle it manually:
; long variadic_sum(long count, ...)
; RDI = count; variadic values arrive in RSI, RDX, RCX, R8, R9, then on the stack
variadic_sum:
push rbp
mov rbp, rsp
sub rsp, 48 ; register save area (5 slots, padded to a multiple of 16)
mov [rbp-40], rsi ; spill the register-passed variadic arguments
mov [rbp-32], rdx
mov [rbp-24], rcx
mov [rbp-16], r8
mov [rbp-8], r9
lea r10, [rbp-40] ; cursor into the save area (plays the role of va_start)
lea r11, [rbp+16] ; first stack-passed argument
xor eax, eax ; sum = 0
xor ecx, ecx ; i = 0
.loop:
cmp rcx, rdi
jae .done
cmp rcx, 5 ; after five register arguments, switch to the stack
jne .load
mov r10, r11
.load:
add rax, [r10] ; sum += next argument
add r10, 8
inc rcx
jmp .loop
.done:
leave
ret
Key points: the register-passed arguments must be spilled before they can be walked like an array, and arguments beyond the register set start at [rbp+16] once the frame is set up.
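For comparison, this is what the same function looks like when the compiler does the bookkeeping. The sketch uses the standard stdarg.h macros; the function name variadic_sum and the choice of long arguments are assumptions made for the example.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` long arguments. The compiler generates the register save
   area and the switch from register to stack arguments that hand-written
   assembly has to implement itself. */
long variadic_sum(int count, ...) {
    assert(count >= 0);
    va_list ap;
    va_start(ap, count);
    long sum = 0;
    for (int i = 0; i < count; i++) {
        sum += va_arg(ap, long);
    }
    va_end(ap);
    return sum;
}
```

Calling it with eight values, such as variadic_sum(8, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), exercises both the register-passed and the stack-passed arguments on x86-64. The L suffixes matter: va_arg reads each argument as a long, so the caller has to pass longs.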
PIC can execute correctly regardless of load address. Instead of absolute addresses, PIC uses PC-relative addressing. Essential for shared libraries (.so) and ASLR.
In x86-64, use RIP-relative addressing:
; Local data: RIP-relative access (NASM)
mov rdx, [rel global_data] ; load the value
lea rsi, [rel global_data] ; or take its address
; External symbol through the Global Offset Table (NASM, ELF64)
mov rax, [rel global_var wrt ..got] ; rax = address of global_var, read from the GOT
mov rax, [rax] ; load the value
The rel keyword (NASM) or the symbol(%rip) operand form (GAS) generates RIP-relative relocations. The call/pop trick for reading the instruction pointer is only needed on 32-bit x86, which has no EIP-relative addressing.
In ARM64, PC-relative loads are the default:
adr x0, .Ldata ; Load address of .Ldata relative to PC
ldr x0, [x0] ; Load the value at that address
Compilers generate PIC automatically with -fPIC. For assembly, you must manually calculate offsets or use the GOT for external symbols.
Atomic operations require special CPU support (lock prefix on x86, load-exclusive/store-conditional on ARM). Without these, you can only approximate atomicity through careful sequencing:
Test-and-set without atomicity:
; BROKEN: the check and the store are separate instructions
try_lock:
cmp dword [rdi], 0 ; Check if free
jne .busy ; If not, someone else got it
mov dword [rdi], 1 ; NOT atomic with the check: another thread can slip in here
xor eax, eax ; Success
ret
.busy:
mov eax, -1 ; Failure
ret
; With hardware support, one instruction closes the gap:
; mov eax, 1
; xchg [rdi], eax ; atomic exchange; eax == 0 afterwards means we acquired the lock
Software-only approach (broken for concurrent threads):
; With only plain loads and stores, this spin loop still races
.lock:
cmp dword [shared], 0 ; Check lock
jnz .lock ; If busy, spin
mov dword [shared], 1 ; Try to acquire - NOT ATOMIC
; Re-checking does not help: two threads can both store 1 and both read 1
cmp dword [shared], 1
jne .lock
; "Acquired" (race still exists)
True atomic read-modify-write operations require hardware support: lock prefix on x86, ldaxr/stlxr on ARM. Classic algorithms such as Peterson's achieve mutual exclusion with plain loads and stores, but on modern CPUs they still need memory barriers and do not scale beyond a few threads, so in practice there's no way around the atomic instructions.
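In C, the hardware-backed version is available through C11 atomics, which leave instruction selection to the compiler. The following is a minimal spinlock sketch built on atomic_flag; the SpinLock type and the spin_* function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock built on C11 atomic_flag. The test-and-set is a
   single atomic read-modify-write, so the check and the store cannot be
   separated by another thread. */
typedef struct { atomic_flag held; } SpinLock;

void spin_init(SpinLock *l) {
    assert(l);
    atomic_flag_clear(&l->held);
}

bool spin_try_lock(SpinLock *l) {
    /* Returns the previous value atomically: false means we took the lock. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void spin_lock(SpinLock *l) {
    while (!spin_try_lock(l)) {
        /* spin until the holder releases */
    }
}

void spin_unlock(SpinLock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Disassembling this for each target shows which atomic instruction the compiler chose, which is a good way to connect the C memory orders to the instructions named above.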
A leaf function is a function that doesn't call any other functions. A non-leaf function calls other functions (including itself recursively).
ARM64 leaf function optimization:
; Leaf function: doesn't need stack frame
leaf_func:
add x0, x0, x1 ; x0 = x0 + x1
ret ; Return to link register
No STP/LDP needed because we don't save LR (link register) — we don't call anything.
Non-leaf function must save LR because BL overwrites it:
; Non-leaf: must save link register
non_leaf:
stp x29, x30, [sp, -16]! ; Save FP and LR
mov x29, sp
bl some_function ; LR gets overwritten
ldp x29, x30, [sp], 16 ; Restore
ret
The key difference: leaf functions don't need to save LR. This is a significant optimization for small functions, since making no calls keeps LR intact and removes the need for stack frame setup.
17. How do you implement a counting loop in ARM64 assembly?
ARM64 counting loop:
; Calculate sum of 0 to N-1
; Input: X0 = N
; Output: X0 = sum
mov x1, 0 ; i = 0
mov x2, 0 ; sum = 0
.loop:
cmp x1, x0 ; i < N?
b.ge .done ; if i >= N, done (signed compare)
add x2, x2, x1 ; sum += i
add x1, x1, 1 ; i++
b .loop ; continue
.done:
mov x0, x2 ; return sum
ret
Or using decrement:
; Using decrement (common pattern)
mov x1, x0 ; Copy N to X1 as counter
mov x0, 0 ; sum = 0
.loop:
cbz x1, .done ; if counter == 0, done
add x0, x0, x1 ; sum += counter
subs x1, x1, 1 ; counter--, set flags
b.ne .loop ; if counter != 0, continue
.done:
ret
The subs (subtract and set flags) followed by b.ne is efficient. cbz (compare and branch if zero) combines comparison and branch.
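The decrementing loop translates directly to C, and the closed form n(n+1)/2 gives an independent check on the result. This is a sketch with hypothetical function names (sum_down and sum_closed).

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the decrementing loop: add the counter, subtract one, repeat. */
uint64_t sum_down(uint64_t n) {
    uint64_t sum = 0;
    while (n != 0) {      /* cbz on entry, b.ne at the bottom */
        sum += n;         /* add */
        n--;              /* subs */
    }
    return sum;
}

/* Closed form n(n+1)/2, used only to cross-check the loop. */
uint64_t sum_closed(uint64_t n) {
    assert(n < (UINT64_C(1) << 32));   /* keeps n * (n + 1) from overflowing */
    return n * (n + 1) / 2;
}
```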
18. What is the difference between the .data, .text, and .rodata sections?
.text section: Contains executable code (instructions). This section is typically marked read-only and executable. Code lives here.
.data section: Contains initialized global and static variables that can be modified at runtime. Examples: int count = 5;
.rodata section (or .rdata): Contains read-only data — constants, string literals, jump tables. Cannot be modified at runtime.
.section .data
global_var: .quad 42 ; Writable, initialized
.section .rodata
msg: .asciz "Hello, world!" ; Read-only string
table: .word 0, 1, 4, 9 ; Read-only table
.section .text
function:
; code here
ret
On Linux, .rodata is placed in a read-only segment, typically next to .text rather than with .data. The linker places these sections in the output file, and the OS sets appropriate page permissions: text (RX), data (RW), rodata (R).
19. How do you write a tail call optimization in assembly?
A tail call is when a function calls another function as its last action. The calling function could return the result of the called function directly. Compilers optimize this to avoid stack frame setup.
; Without tail call optimization:
caller:
push rbp
mov rbp, rsp
call callee ; pushes a return address that points back into caller
pop rbp
ret ; second return, back to caller's caller
; With tail call optimization:
caller:
push rbp
mov rbp, rsp
; ... work that needs the frame ...
pop rbp ; tear down the frame first
jmp callee ; Jump instead of call
; No ret here - callee's ret returns to caller's caller
In ARM64:
; No frame needed - tail call
b callee ; Branch to callee (don't link)
; LR wasn't saved because we're not calling, we're jumping
; callee's return goes to our caller
Tail call optimization prevents stack growth in recursive calls. It's critical for functional languages, where loops are usually written as recursion and a function often ends by handing its result straight to another call.
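Whether a call is in tail position is visible in the source. The sketch below contrasts a tail-recursive function with one that is not; the function names are chosen for the example. Whether the compiler actually emits a jump depends on the compiler and optimization level, so the generated assembly (for example from -O2 -S) is the place to confirm it.

```c
#include <assert.h>
#include <stdint.h>

/* Tail-recursive: the recursive call is the last action, so a compiler may
   replace the call with a jump and reuse the current stack frame. */
uint64_t gcd(uint64_t a, uint64_t b) {
    if (b == 0) return a;
    return gcd(b, a % b);          /* tail call: nothing left to do afterwards */
}

/* Not a tail call: the multiplication happens after the recursive call returns. */
uint64_t factorial(uint64_t n) {
    assert(n <= 20);               /* 21! does not fit in 64 bits */
    if (n == 0) return 1;
    return n * factorial(n - 1);
}
```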
20. What is the difference between a syscall and a library call from the CPU's perspective?
Library call (printf, malloc): A call to a function within the same process. The call is just a CALL instruction to an address in the code segment. The library code executes in the same privilege mode (user mode). It's a normal function call.
Syscall (read, write, brk): A request to the kernel. The CPU must switch from user mode to kernel mode. On x86-64 this uses the SYSCALL instruction (SYSENTER or int 0x80 on 32-bit x86) and on ARM64 the SVC instruction, which:
- Changes privilege level (user → kernel)
- Saves return address
- Loads new PC from kernel syscall entry point
- Makes kernel memory accessible (and may switch page tables when kernel page-table isolation is enabled)
The syscall number goes in a specific register (RAX on x86-64, X8 on ARM64). Arguments go in other registers. The kernel validates arguments before performing the operation.
Syscall overhead is roughly 50-100ns or more; a library call costs on the order of 1ns. Syscalls are needed only where the kernel must be involved (file I/O, process control, obtaining memory from the OS). Internal library functions are just function calls.
Further Reading
- x86 Assembly Language Programming — UVa CS216 guide to x86
- ARM Assembly Programming — Comprehensive ARM64 tutorial
- Intel x86 Developer Manual — Official Intel documentation
- x86 Assembly Guide (University of Virginia) — Register conventions and calling conventions
- Assembly Language for x86 Processors — Papers on assembly optimization
Conclusion
Assembly language sits at the boundary between what you write and what the hardware actually executes. Even if you never write production assembly, reading it helps you understand compiler output, debug performance issues, and reason about security exploits.
The key takeaways are understanding register usage (caller-saved vs callee-saved), stack frame management, and control flow via condition codes. Both x86-64 and ARM64 have their quirks—memory ordering differences can cause real bugs in concurrent code that works on one architecture but fails on another.
Continue your low-level exploration by studying boolean logic and gates to understand how transistors implement these instructions, or building a CPU simulator to see how these concepts come together in a working system.