Assembly Language Basics: Writing Code the CPU Understands

Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.

published: May 19, 2026 reading time: 45 min read author: GeekWorkBench

Introduction

Assembly language sits at the boundary between human-readable code and machine execution. Each instruction in assembly maps almost directly to a single CPU instruction, giving programmers complete control over what the processor does. High-level languages abstract away hardware for portability, but assembly shows you every register allocation, every memory access, every branch decision.

Understanding assembly helps if you’re serious about operating systems development, performance optimization, reverse engineering, or embedded systems. Even if you never write production assembly, reading it shows you what the compiler generates, why some code patterns perform better, and how security exploits actually work.

When to Use / When Not to Use

When to Use Assembly:

Writing critical performance kernels where every cycle counts (crypto primitives, signal processing)
Implementing operating system primitives that must control hardware directly (context switching, interrupt handlers)
Creating bootloaders and firmware with no operating system support
Reverse engineering and security research
Compiler backend work and code generation research

When Not to Use:

Application development where portability across platforms matters
Teams without assembly expertise—maintainability suffers dramatically
When development speed matters more than micro-optimization
Anywhere a well-tuned C implementation achieves sufficient performance

Core Concepts Architecture

Assembly Development Flow

The pipeline from human-written assembly source to a running executable has four stages. Each stage transforms the code in a specific way, and understanding this flow helps you debug problems at the right level.

Assembler (translation): The assembler reads your .s file and translates each assembly instruction into binary machine code. Directives like .section, .global, and .align tell the assembler how to organize output and handle symbols. The output is an object file (.o) containing relocatable machine code and a symbol table.

Object file (linking): Object files contain your code but may reference symbols from other files (external functions, global variables). They may also have unresolved internal references. Relocations mark places where addresses need fixing up later.

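Linker (combination): The linker combines multiple object files into a single executable. It resolves cross-file symbol references, assigns final memory addresses, and creates the executable format (ELF on Linux, Mach-O on macOS). It can also discard unreferenced sections, but only when asked: GNU ld does this with --gc-sections, usually paired with the compiler’s -ffunction-sections so that each function lands in its own section.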

Executable: The resulting binary contains machine code ready for the OS loader. When you run it, the OS loader reads the executable header, allocates memory, and copies code and data segments into place before jumping to the entry point.

flowchart LR
    A["Assembly Source<br/>.s file"] --> B["Assembler"]
    B --> C["Object Code<br/>.o file"]
    C --> D["Linker"]
    D --> E["Executable"]

Relocations are the linker’s puzzle pieces. When the assembler generates machine code, it often cannot know the final memory address of a symbol. Consider a mov eax, [symbol] instruction — the assembler writes a placeholder offset that the linker fills in once it knows where symbol lives. Each relocation entry specifies: the location in the object file needing a fix, the symbol to resolve, the kind of address computation needed (absolute, PC-relative, GOT-relative), and an addend adjustment. The ELF object file stores relocations in .rel or .rela sections, and the linker processes them in order. Understanding relocation types matters when debugging mysterious crashes or writing position-independent code — a wrong relocation type produces incorrect addresses that manifest as subtle memory corruption.
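The fix-up arithmetic described above can be sketched in a few lines of C. This is a simplified illustration, not how any particular linker is written: the `reloc` struct, the two `RELOC_*` kinds, and `apply_reloc` are hypothetical names, and values are stored in host byte order. The sketch follows the usual ELF notation, where S is the symbol address, A the addend, and P the place being patched.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified relocation kinds. Real ELF types such as
   R_X86_64_64 and R_X86_64_PC32 follow the same arithmetic but have
   more cases than this sketch covers. */
typedef enum { RELOC_ABS64, RELOC_PC32 } reloc_kind;

typedef struct {
    size_t     offset;  /* where in the section the fix-up goes   */
    reloc_kind kind;    /* which address computation to apply     */
    int64_t    addend;  /* constant A added to the symbol address */
} reloc;

/* Patch `section`, which will be loaded at `section_addr`, so the field
   at r->offset refers to a symbol whose final address is `sym_addr`. */
void apply_reloc(uint8_t *section, uint64_t section_addr,
                 const reloc *r, uint64_t sym_addr)
{
    uint64_t place = section_addr + r->offset;        /* P     */
    uint64_t value = sym_addr + (uint64_t)r->addend;  /* S + A */

    if (r->kind == RELOC_ABS64) {
        memcpy(section + r->offset, &value, sizeof value);
    } else {
        /* PC-relative: S + A - P must fit a signed 32-bit field. */
        int64_t rel = (int64_t)(value - place);
        assert(rel >= INT32_MIN && rel <= INT32_MAX);
        int32_t rel32 = (int32_t)rel;
        memcpy(section + r->offset, &rel32, sizeof rel32);
    }
}
```

The range check in the PC-relative branch corresponds to the "relocation truncated to fit" class of linker errors: the distance between the place and the symbol has to fit in the field the instruction provides.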

Symbol resolution happens in two passes. The first pass collects all symbol definitions and references across all object files, building a symbol table that maps each name to a definition. The second pass applies relocations, substituting each unresolved reference with the final address. When a symbol is defined in the same file, resolution is straightforward: the assembler already knows the offset within its section. When a symbol comes from another object file, the linker looks it up in that object’s symbol table. External symbols from static libraries require the linker to search archives (.a files), which are collections of object files. The linker extracts only the archive members that define a symbol still pending, so unused members never reach the executable. Any symbol still unresolved at the end of linking produces the classic “undefined reference to main” error.
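A minimal sketch of the two passes, assuming the definitions have already been collected into a flat array (pass one) and each reference only needs an address bound to it (pass two). The `symbol_def` and `symbol_ref` types and both function names are hypothetical; a real linker uses hash tables and handles weak symbols, duplicates, and archives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { const char *name; uint64_t addr;     } symbol_def;
typedef struct { const char *name; uint64_t resolved; } symbol_ref;

/* Look up a definition gathered during pass one; NULL if absent. */
static const symbol_def *find_def(const symbol_def *defs, size_t ndefs,
                                  const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < ndefs; i++)
        if (strcmp(defs[i].name, name) == 0)
            return &defs[i];
    return NULL;
}

/* Pass two: bind each reference to its definition's address.
   Returns how many references stayed undefined. */
size_t resolve_all(const symbol_def *defs, size_t ndefs,
                   symbol_ref *refs, size_t nrefs)
{
    size_t undefined = 0;
    for (size_t i = 0; i < nrefs; i++) {
        const symbol_def *d = find_def(defs, ndefs, refs[i].name);
        if (d != NULL)
            refs[i].resolved = d->addr;
        else
            undefined++;  /* would become an "undefined reference" error */
    }
    return undefined;
}
```

A non-zero return value is the point at which a real linker would stop and report each missing name.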

The ELF format organizes compiled output into sections and segments. Sections are the granular division of an object file: .text holds executable code, .data holds initialized writable data, .rodata holds read-only constants, .bss reserves space for zero-initialized data (it occupies no bytes in the file; the loader supplies zeroed memory), and .rel*/.rela* hold relocation information. Segments, defined in the program headers, describe how the loader maps the file into memory: the loader doesn’t care about sections, only about segments that describe memory regions with permissions (read, write, execute). This distinction matters when using tools like objcopy to manipulate sections or when debugging with readelf -S (sections) versus readelf -l (segments). The GNU linker is driven by a linker script (ld --verbose prints the default one) that controls section placement, which becomes critical when writing embedded firmware where memory regions have specific constraints.

The OS loader does more than just copy bytes into memory. When you execute a program, the kernel parses the ELF header to find the program headers describing segments, validates the executable (magic number, architecture, and segment flags), and maps memory regions according to each segment’s permissions (code segments get read+execute, data segments get read+write). For dynamically linked executables, the kernel also maps the dynamic linker named in the PT_INTERP program header (ld-linux.so on Linux) and starts it first. The dynamic linker loads the required shared libraries, applies their relocations, and resolves symbols through the GOT (Global Offset Table) and PLT (Procedure Linkage Table), often lazily on first call, before jumping to the program’s entry point. The entry point address lives in the ELF header’s e_entry field, and the kernel sets up the initial stack with argc, argv, envp, and the auxiliary vector, which carries details such as the entry address (AT_ENTRY), the page size (AT_PAGESZ), and a platform string (AT_PLATFORM).
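To make the header checks concrete, here is a small sketch that validates the first bytes of a 64-bit little-endian ELF image held in memory and extracts e_entry. It is only an illustration of the idea: `elf64_entry` is a hypothetical helper, it reads from a byte buffer rather than a file, and it checks far less than a real loader does. The offsets used (class at byte 4, data encoding at byte 5, e_entry at byte 24) are those of the ELF64 header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 0 and stores the entry point on success, -1 if the buffer
   does not look like a 64-bit little-endian ELF header. */
int elf64_entry(const uint8_t *buf, size_t len, uint64_t *entry)
{
    static const uint8_t magic[4] = {0x7f, 'E', 'L', 'F'};

    assert(buf != NULL && entry != NULL);
    if (len < 32 || memcmp(buf, magic, sizeof magic) != 0)
        return -1;                  /* too short or wrong magic number  */
    if (buf[4] != 2 || buf[5] != 1)
        return -1;                  /* not ELFCLASS64 / not ELFDATA2LSB */

    /* e_entry: 8 bytes at offset 24, least significant byte first. */
    uint64_t e = 0;
    for (int i = 7; i >= 0; i--)
        e = (e << 8) | buf[24 + i];
    *entry = e;
    return 0;
}
```

Reading the field byte by byte keeps the sketch independent of the host's own byte order and alignment rules.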

Assembly Structure

Assembly source files have four structural elements. The diagram shows how these elements relate to each other.

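Labels are symbolic names for memory addresses. They mark branch targets, function entry points, and data locations. A name followed by a colon defines a label at the current location: my_function:. Defining a label does not export it; a separate directive such as .global my_function makes it visible to the linker. Labels that begin with a dot, such as .loop:, are conventionally local (NASM scopes them to the preceding non-local label, and GAS treats names starting with .L as assembler-local). The assembler converts labels to offsets from the start of their section.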

Instructions are the actual CPU operations. Each instruction has an opcode and operands. The CPU executes instructions sequentially unless a branch or jump changes the flow. Instructions map almost one-to-one to machine code bytes the processor can decode and execute.

Operands specify the data instructions operate on. Three types exist: registers (rax, x0), immediate values (42, $0xFF), and memory references ([rbx], [sp, 8]). Memory references can use complex addressing modes combining a base register, index register, and scale factor for array access patterns.

Directives are assembler commands that never become CPU instructions. They control the assembler’s behavior: .section .text switches active sections, .global main exports symbols for linking, .align 16 pads with NOPs to reach alignment boundaries, .byte 0xFF embeds raw data. Directives vary between assemblers (GAS vs NASM vs MASM).

flowchart LR
    F["Labels"] --> G["Instructions"]
    G --> H["Operands"]
    H --> I["Directives"]

Label scope determines linkage visibility. A label becomes visible to other object files only when it is declared with .global (GAS) or global (NASM); otherwise it stays local to its file even if it appears in the object’s symbol table. Both assemblers also offer function-local labels. In GAS, names beginning with .L (like .L1:) are assembler-local and by default are not written to the symbol table at all. In NASM, a label beginning with a single dot (like .loop:) is attached to the preceding non-local label, so the .loop inside strlen and the .loop inside strcmp become strlen.loop and strcmp.loop and never collide; inside macros NASM uses the %%label form for the same purpose. Name mangling in assembly is minimal compared to high-level languages, but when writing functions that will be called from C, you must respect the platform’s naming convention: a leading underscore on macOS (_main) and no prefix on Linux (main). Mixing these conventions produces the dreaded “symbol not found” linker error.

Directives cluster into functional categories that control different aspects of the output. Section directives (.section .text, .section .data, .section .rodata) switch the active section where code and data get emitted. Alignment directives (.align n) pad the current section with zeros or NOPs until the location counter reaches the requested boundary, which matters for cache line boundaries and instruction fetch efficiency. On x86 GAS reads n as a byte count, but on some targets, including ARM, it reads n as a power of two, so .balign and .p2align are the unambiguous spellings. Data directives (.byte, .word, .quad, .ascii, .asciz) embed raw values directly into the output; these never become instructions, they become the section’s contents. Symbol directives (.global, .extern, .equ) control symbol visibility and define constants without allocating space. The .equ directive (or GAS .set) assigns a constant value to a symbol, useful for magic numbers that appear throughout code: .equ SYSCALL_READ, 0 lets you write mov rax, SYSCALL_READ instead of mov rax, 0.

GAS (GNU Assembler) and NASM have fundamentally different syntax philosophies that trip up assembly learners. GAS defaults to AT&T syntax: source operand first, registers prefixed with %, immediates prefixed with $, and memory operands written as disp(base, index, scale). NASM uses Intel syntax: destination first, no register prefixes, and memory operands written as [base + index*scale + disp]. MASM also uses Intel syntax but differs in which directives are available and in how data labels behave: a bare variable name in MASM refers to the memory contents, while NASM treats it as an address and requires brackets for a load. Because the operand order is reversed, adding EBX into EAX is add %ebx, %eax in AT&T syntax and add eax, ebx in Intel syntax. GAS can also accept Intel syntax via the .intel_syntax noprefix directive, which many developers prefer when writing Linux assembly. The choice matters when reading compiler-generated assembly: GCC and Clang both emit AT&T syntax by default on x86 and switch to Intel syntax with -masm=intel. NASM’s preprocessor and macro system are richer, but GAS’s integration with the GNU toolchain makes it the default for inline assembly and compiler output.

Directives directly shape the object file’s structure. The .section directive determines which output section receives subsequent instructions and data; switching sections mid-function is possible but rarely useful. The .align directive moves the location counter, so every label after it lands at a different offset. The .global directive controls which symbols the linker may use to resolve external references; forgetting .global on main typically makes the link fail with an undefined reference to main, because the C startup code cannot see the symbol. The .type directive (.type my_func, @function) records whether a symbol is a function or data, which matters for debug information and dynamic linking. These metadata directives don’t affect the CPU’s execution directly, but they control the toolchain’s ability to produce a working executable.

CPU Execution

The three primary resources a CPU manipulates during instruction execution are registers, the ALU, and memory. Understanding how these components interact explains why certain assembly patterns exist and why some operations are faster than others.

Registers are the CPU’s fastest storage, built into the processor silicon. Access takes a single cycle. x86-64 has 16 general-purpose registers (RSP, the stack pointer, is one of them) plus special-purpose ones such as RIP and RFLAGS. ARM64 has 31 general-purpose registers (X0-X30) plus a dedicated stack pointer. Registers hold values the CPU is actively working with: operands for arithmetic, addresses for memory access, and return addresses for function calls.

ALU (Arithmetic Logic Unit) performs operations on data from registers. It takes inputs, produces outputs, and sets condition flags. Arithmetic (add, sub, mul, div), logical (and, or, xor), and shift operations all flow through the ALU. Simple ALU operations complete in one cycle, but multiplication takes multiple cycles and division is significantly slower.

Memory is the slowest of the three, so caches sit between the registers and main memory. Load and store instructions move data between registers and memory, and most are served from cache. An L1 cache hit costs roughly 4 cycles, while a trip to main memory can cost 100 cycles or more. Code that accesses memory frequently runs dramatically slower than code that keeps values in registers.

Flags/Status register tracks the outcome of ALU operations. The Zero Flag (ZF) indicates results equal to zero. The Sign Flag (SF) shows negative results. The Carry Flag (CF) and Overflow Flag (OF) track unsigned and signed overflow respectively. Conditional branch instructions read these flags to make control flow decisions. Compare and test instructions modify flags without storing results.

flowchart TB
    J["Registers"] --> K["ALU"]
    J --> L["Memory"]
    K --> M["Flags/Status"]

The fetch-decode-execute cycle runs at the hardware level for every instruction. During fetch, the CPU reads the instruction from memory at the address in the program counter (RIP/EIP on x86, PC on ARM) and increments PC to point at the next instruction. During decode, the instruction bytes are parsed into an opcode and operand specifiers — on x86 this is complex because instructions vary from 1 to 15 bytes, while ARM64 uses a fixed 4-byte instruction width that simplifies decode. During execute, the operation happens: ALU computes the result, memory access completes, or a branch updates PC. Modern CPUs overlap this cycle using pipelining — while executing instruction N, they’re decoding N+1 and fetching N+2. Branch instructions complicate this because the CPU must guess the branch target before the branch condition is known; a mispredicted branch flushes the pipeline and wastes cycles. Assembly programmers can help the branch predictor by arranging code so branches are predictable (like loop counters that almost always go one direction) and by keeping hot code paths compact enough to fit in the instruction cache.
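The cycle is easier to see in a toy interpreter. The sketch below assumes an invented fixed-width instruction set (the `OP_*` opcodes, the `insn` record, and `vm_run` are all hypothetical) and models only the sequential behaviour: no pipelining, no branch prediction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: every instruction is one fixed-width record. */
enum { OP_HALT, OP_LOADI, OP_ADD, OP_DEC, OP_JNZ };
enum { NUM_REGS = 4 };

typedef struct {
    uint8_t op;  /* opcode                                    */
    uint8_t a;   /* destination or tested register            */
    uint8_t b;   /* source register, immediate, or branch target */
} insn;

/* Runs the program until OP_HALT or the end; returns register 0. */
uint64_t vm_run(const insn *prog, size_t count)
{
    uint64_t reg[NUM_REGS] = {0};
    size_t pc = 0;

    while (pc < count) {
        insn cur = prog[pc++];           /* fetch, then advance the PC */
        assert(cur.a < NUM_REGS);
        switch (cur.op) {                /* decode */
        case OP_LOADI: reg[cur.a] = cur.b;                break;
        case OP_ADD:   assert(cur.b < NUM_REGS);
                       reg[cur.a] += reg[cur.b];          break;
        case OP_DEC:   reg[cur.a] -= 1;                   break;
        case OP_JNZ:   if (reg[cur.a] != 0) pc = cur.b;   break;
        default:       return reg[0];    /* OP_HALT or unknown opcode */
        }
    }
    return reg[0];
}
```

Note that a taken branch simply overwrites the program counter that fetch had already advanced, which is the same thing a real jump does to RIP or PC.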

Flags are set according to the result of ALU operations, but the semantics differ between signed and unsigned interpretations. The carry flag (CF) tracks overflow in unsigned arithmetic: if you add two unsigned values and CF=1 afterward, the result wrapped around. The overflow flag (OF) tracks overflow in signed arithmetic: if you add two signed values and OF=1, the sign bit is wrong because the result overflowed the representable range. The CPU doesn’t know whether the values are signed or unsigned; it always sets both flags, and the conditional jump instruction you choose determines the interpretation. ja (jump if above) requires CF=0 and ZF=0 for unsigned comparison, while jg (jump if greater) requires ZF=0 and SF=OF for signed comparison. The cmp instruction subtracts its operands only to set flags: cmp rax, rbx computes rax - rbx, after which CF=1 means rax < rbx as unsigned values and SF≠OF means rax < rbx as signed values. Mixing signed and unsigned jump conditions is a subtle bug that can corrupt sorting algorithms or bounds checks when values are interpreted incorrectly.
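The flag rules can be checked with a small model of cmp in C. This is a sketch of the x86 semantics for 64-bit operands only; the `flags` struct and the `take_*` helper names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool cf, zf, sf, of; } flags;

/* Model of `cmp a, b`: compute a - b and keep only the flags. */
flags cmp_flags(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    flags f;
    f.cf = a < b;               /* borrow: unsigned a is below b        */
    f.zf = r == 0;              /* operands were equal                  */
    f.sf = (r >> 63) != 0;      /* sign bit of the result               */
    /* Signed overflow: a and b have different signs, and the result's
       sign differs from a's sign. */
    f.of = (((a ^ b) & (a ^ r)) >> 63) != 0;
    return f;
}

bool take_ja(flags f) { return !f.cf && !f.zf; }      /* unsigned >  */
bool take_jb(flags f) { return f.cf; }                /* unsigned <  */
bool take_jg(flags f) { return !f.zf && f.sf == f.of; } /* signed >  */
bool take_jl(flags f) { return f.sf != f.of; }        /* signed <    */
```

Comparing the bit pattern 0xFFFFFFFFFFFFFFFF with 1 shows the split: `take_ja` is true because the pattern is a huge unsigned number, while `take_jl` is also true because the same pattern is -1 when read as signed.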

Superscalar execution lets a CPU issue multiple instructions per cycle. Modern CPUs have multiple execution ports, each capable of running a different instruction type simultaneously: an add and a load can execute in the same cycle if they go to different ports. Register renaming eliminates false dependencies. When instruction 1 reads RAX and a later instruction 2 overwrites RAX with an unrelated value, the CPU assigns instruction 2 a fresh physical register, so it can run without waiting for instruction 1 to finish with the old value. True dependencies remain: an instruction that reads a result must still wait for the instruction that produces it. The reorder buffer (ROB) tracks in-flight instructions, which may execute out of order, and retires their results in program order once they are no longer speculative. The physical register file is larger than the 16 registers the ISA exposes, but it is finite: when every physical register is in use, the renamer stalls until one is freed.

Latency greater than one cycle exists because some operations require multiple steps or depend on external resources. Integer multiplication on recent x86 cores typically has a latency of about 3 cycles, but the multiplier is pipelined, so a new multiplication can usually start every cycle. Division is much worse (often tens of cycles, depending on the microarchitecture and operand size) because it uses iterative algorithms that cannot be easily pipelined. Load latency is determined by the cache hierarchy: as rough orders of magnitude, an L1 hit costs about 4 cycles, an L2 hit about 12, an L3 hit about 40, and main memory access can exceed 100. Floating-point operations on x86 often map to the XMM/YMM/ZMM registers with their own execution units, and the latency of fused multiply-add (FMA) operations depends on the specific microarchitecture. The distinction between latency (how long an operation takes from start to result) and throughput (how many such operations can start per cycle) matters for loop performance: a loop whose multiplications are independent of one another can approach one multiplication per cycle even though each takes several cycles to finish, while a chain of dependent multiplications pays the full latency every time.
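A back-of-the-envelope model makes the latency versus throughput difference concrete. The two functions below are an idealized sketch, assuming a fully pipelined unit that accepts one new operation per cycle and ignoring every other bottleneck; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Each operation needs the previous result: nothing overlaps. */
uint64_t dependent_chain_cycles(uint64_t n, uint64_t latency)
{
    return n * latency;
}

/* Operations are independent and one can start every cycle: the last
   one starts at cycle n - 1 and finishes `latency` cycles later. */
uint64_t independent_cycles(uint64_t n, uint64_t latency)
{
    assert(latency > 0);
    return n == 0 ? 0 : latency + (n - 1);
}
```

With 100 multiplications at a latency of 3 cycles, the dependent chain costs 300 cycles in this model and the independent version 102, which is why breaking a long dependency chain into several accumulators often speeds up a loop.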

Core Concepts

The Register File

Registers are the CPU’s fastest storage—built directly into the processor silicon with access times measured in picoseconds. Understanding register usage is fundamental to assembly programming.

x86-64 General Purpose Registers (System V AMD64 calling convention):

Register	Purpose	Caller-saved?	Notes
RAX	Accumulator, return value	Yes	Implicit operand of mul and div
RBX	General purpose	No	Callee-saved
RCX	Counter, 4th argument	Yes	Shift and loop count
RDX	Data, 3rd argument	Yes	High half of mul result, div remainder
RSI	Source index, 2nd argument	Yes	String operations source
RDI	Destination index, 1st argument	Yes	String operations destination
RBP	Base pointer, frame reference	No	Callee-saved, optional frame pointer
RSP	Stack pointer	Special	Points to top of stack
R8-R15	General purpose; R8 and R9 carry the 5th and 6th arguments	Yes (R8-R11), No (R12-R15)	New in x86-64

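ARM64 General Purpose Registers:

Register	Purpose	Notes
X0-X7	Arguments, return values	X0 also return value
X8	Indirect result location	Caller-saved
X9-X15	Caller-saved	Temporary registers
X16-X18	Intra-procedure-call scratch (X16, X17), platform register (X18)	Avoid X18 in portable code
X19-X28	Callee-saved	Preserved across calls
X29	Frame pointer (FP)	Callee-saved
X30	Link register (LR)	Holds the return address after bl
SP	Stack pointer	Must be 16-byte aligned
PC	Program counter	Not a general-purpose register; cannot be named as an operand
XZR	Zero register	Reads as zero, discards writes
W0-W30	32-bit views of X0-X30	Lower half of 64-bit regs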

Basic Arithmetic Instructions

x86-64 Arithmetic:

; Addition
add rax, rbx        ; rax = rax + rbx
add rax, 42          ; rax = rax + 42 (immediate)

; Subtraction
sub rsp, 8           ; rsp = rsp - 8 (allocate stack space)
sub rdi, rsi         ; rdi = rdi - rsi

; Multiplication
mul rdx              ; rdx:rax = rax * rdx (unsigned)
imul rcx, rbx        ; rcx = rcx * rbx (signed, two-operand form)

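; Division
div rcx              ; unsigned: rdx:rax / rcx, quotient in rax, remainder in rdx
                     ; (zero RDX first when the dividend fits in RAX)
idiv rbx             ; signed: same layout; run cqo first to sign-extend RAX into RDX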

; Increment/Decrement
inc rax              ; rax++
dec rax              ; rax--

ARM64 Arithmetic:

; Addition
add x0, x1, x2       ; x0 = x1 + x2 (three-operand)
add x0, x0, 42        ; x0 = x0 + 42 (immediate)
add x0, x1, x1, lsl 2 ; x0 = x1 + (x1 << 2) = x1 * 5

; Subtraction
sub sp, sp, 16       ; sp = sp - 16 (allocate stack)
subs x0, x1, x2      ; x0 = x1 - x2, set flags

; Multiplication
mul x0, x1, x2       ; x0 = x1 * x2
madd x0, x3, x4, x5  ; x0 = x5 + (x3 * x4)

; Division (available on ARMv8-A)
udiv x0, x1, x2      ; x0 = x1 / x2 (unsigned)
sdiv x0, x1, x2      ; x0 = x1 / x2 (signed)

Memory Operations

x86-64 Memory Access:

; Load from memory
mov rax, [rbx]           ; rax = *rbx (64-bit load)
mov eax, [rbx]          ; eax = *rbx (32-bit load, zero-extends to 64-bit)
movzx rax, byte [rbx]   ; rax = *(uint8_t*)rbx (zero-extend)
movsx rax, byte [rbx]   ; rax = *(int8_t*)rbx (sign-extend)

; Store to memory
mov [rdi], rax           ; *rdi = rax
mov byte [rsi], 42       ; *(uint8_t*)rsi = 42

; Address calculation (LEA - Load Effective Address)
lea rax, [rbx + rcx]     ; rax = rbx + rcx (no memory access)
lea rax, [rdi + rdi*4]   ; rax = rdi * 5 (scaled index, no multiply instruction)
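The address an x86-64 memory operand refers to, and the value lea computes, follows one formula: base + index*scale + displacement. A minimal C sketch of that formula, with `effective_address` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Computes base + index*scale + disp the way an x86-64 addressing mode
   does. The hardware only encodes scales of 1, 2, 4, and 8, and the
   displacement is a sign-extended 32-bit constant. */
uint64_t effective_address(uint64_t base, uint64_t index,
                           unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

Passing the same register as base and index with a scale of 4 reproduces the multiply-by-5 trick: `effective_address(10, 10, 4, 0)` evaluates to 50.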

ARM64 Memory Access:

; Load from memory
ldr x0, [x1]             ; x0 = *x1
ldr x0, [x1, 8]          ; x0 = *(x1 + 8) (immediate offset, x1 unchanged)
ldr x0, [x1], 8          ; x0 = *x1; x1 = x1 + 8 (post-index)
ldr w0, [x1]             ; w0 = *(uint32_t*)x1 (upper half of x0 is zeroed)

; Store to memory
str x0, [x1]             ; *x1 = x0
str x0, [sp, -16]!       ; sp -= 16; *sp = x0 (pre-index update)

; Load pair (efficient for function prologues/epilogues)
ldp x0, x1, [sp], 16     ; x0 = *sp; x1 = *(sp+8); sp += 16
stp x0, x1, [sp, -16]!   ; sp -= 16; *sp = x0; *(sp+8) = x1

Control Flow

x86-64 Branches:

; Unconditional jump
jmp label               ; goto label

; Conditional jumps (read flags set by an earlier instruction)
je label                 ; if (zf) goto label (equal)
jne label               ; if (!zf) goto label (not equal)
jl label                ; if (sf != of) goto label (less than, signed)
jg label                ; if (zf == 0 && sf == of) goto label (greater than, signed)
jb label                ; if (cf) goto label (below, unsigned)
ja label                ; if (cf == 0 && zf == 0) goto label (above, unsigned)

; Comparison
cmp rax, rbx            ; sets flags based on rax - rbx
test rax, rax           ; sets flags based on rax & rax (for zero check)

; Function call
call function_label     ; push return_address; jmp function
ret                     ; pop address; jmp to it

ARM64 Branches:

; Unconditional
b label                 ; goto label
bl label                ; link register = pc+4; goto label (call)
ret                     ; branch to x30 (return address)

; Conditional (many options)
cbnz x0, label          ; if (x0 != 0) goto label
cbz x0, label           ; if (x0 == 0) goto label
tbnz x0, 3, label       ; if (bit 3 of x0 != 0) goto label
tbz x0, 3, label        ; if (bit 3 of x0 == 0) goto label

; Compare and branch
cmp x0, x1
b.eq label              ; if (x0 == x1) goto label
b.ne label              ; if (x0 != x1) goto label
b.lt label              ; if (x0 < x1) goto label (signed)
b.lo label              ; if (x0 < x1) goto label (unsigned)

; Conditional select (no branch needed)
csel x0, x1, x2, eq     ; x0 = (eq) ? x1 : x2
cset x0, eq             ; x0 = (eq) ? 1 : 0
cinc x0, x0, ne         ; x0 = (ne) ? x0+1 : x0 (conditional increment)

Stack Operations

x86-64 Function Prologue/Epilogue:

function:
    push    rbp             ; save old base pointer
    mov     rbp, rsp        ; establish new frame
    sub     rsp, 32         ; allocate 32 bytes for locals

    ; ... function body ...

    mov     rsp, rbp        ; restore stack pointer
    pop     rbp             ; restore base pointer
    ret                     ; return to caller

ARM64 Stack Frame:

function:
    ; No explicit frame pointer needed in simple cases
    ; 16-byte alignment is required at function call
    stp x29, x30, [sp, -16]!    ; save FP and LR with pre-index
    mov x29, sp                  ; x29 is frame pointer (optional)

    ; ... function body ...

    ldp x29, x30, [sp], 16       ; restore and post-index
    ret

Production Failure Scenarios

Scenario 1: Stack Buffer Overflow Exploits

Failure: A function writing beyond a local buffer’s bounds overwrites the return address on the stack. An attacker who controls the input can then redirect execution, for example to injected shellcode.

Example vulnerable pattern:

; Vulnerable: no bounds checking on buffer
vulnerable:
    sub     rsp, 64
    mov     rdi, rsp
    call    get_user_input    ; writes beyond buffer if input > 64 bytes
    add     rsp, 64
    ret

Mitigation:

Use stack canaries (-fstack-protector compiler flag)
Enable NX (No-Execute) bit so stack memory isn’t executable
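Use length-bounded input and copy routines (fgets, snprintf) and pass the buffer size explicitly; strncpy alone can leave the destination without a terminating NUL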
Implement stack smashing protection in the runtime
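On the C side of such a boundary, the key habit is that every routine writing into a buffer receives the buffer's size. A minimal sketch, assuming a caller-supplied destination; `copy_bounded` is a hypothetical helper, not a standard library function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies src into dst without ever writing more than dst_size bytes.
   The result is always NUL-terminated. Returns 1 if src had to be
   truncated, 0 otherwise. */
int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);

    size_t n = strlen(src);
    int truncated = n >= dst_size;
    if (truncated)
        n = dst_size - 1;       /* leave room for the terminator */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return truncated;
}
```

Returning the truncation status lets the caller reject oversized input instead of silently continuing with a shortened string.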

Scenario 2: Register Usage Violations in Mixed Assembly/C

Failure: Writing assembly that doesn’t preserve callee-saved registers corrupts caller’s state.

Example mistake:

; BROKEN: modifies RBX (callee-saved) without preserving it
broken_function:
    mov     rbx, rcx         ; RBX is callee-saved! Must push/pop
    ; ... do work ...
    ret

; CORRECT: preserve RBX
correct_function:
    push    rbx             ; save it
    mov     rbx, rcx
    ; ... do work ...
    pop     rbx              ; restore it
    ret

Mitigation:

Always know which registers are callee-saved vs caller-saved per ABI
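In inline assembly, list every register the code modifies in the clobber list, and use the early-clobber modifier ("=&r") for outputs written before all inputs have been read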
Write assembly functions in separate files with clear documentation
Use automated testing with valgrind or sanitizers

Scenario 3: ARM64 SP Misalignment Crashes

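Failure: The stack pointer is not 16-byte aligned when the code calls another function. The ARM64 procedure call standard requires SP to be 16-byte aligned at every public interface, and the hardware can enforce it: with SP alignment checking enabled (as Linux does for user code), any load or store that uses a misaligned SP as its base register raises an alignment fault, typically reported as SIGBUS.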

Example vulnerable pattern:

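; BROKEN: allocates 8 bytes, breaking alignment
vulnerable_function:
    sub     sp, sp, 8          ; 8 is not a multiple of 16
    str     x30, [sp]          ; save LR: faults here when SP alignment checking is on
    mov     x0, x1
    bl      some_external_func ; otherwise the callee's SP-based accesses are at risk
    ldr     x30, [sp]
    add     sp, sp, 8
    ret

; CORRECT: allocate in multiples of 16
correct_function:
    sub     sp, sp, 16         ; maintains alignment
    str     x30, [sp]          ; save LR, which bl overwrites
    mov     x0, x1
    bl      some_external_func ; safe: stack stays aligned
    ldr     x30, [sp]          ; restore LR before returning
    add     sp, sp, 16
    ret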

Why this happens: AArch64 checks SP alignment in hardware. When checking is enabled, every load or store that uses SP as its base register, including the stp and ldp in a typical prologue and epilogue, requires SP to be a multiple of 16; accesses through other base registers are not subject to this rule. Even where the check is disabled, the procedure call standard still requires the alignment, and callees are free to rely on it.

Mitigation:

Always maintain 16-byte alignment: sub sp, sp, 16 not sub sp, sp, 8
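Use stp and ldp for paired stack operations: a pair of 64-bit registers occupies exactly 16 bytes, so pushing in pairs keeps SP aligned
Force alignment when the incoming SP is unknown (for example at a bare-metal entry point) by masking a copy: mov x9, sp followed by and sp, x9, -16 (the and instruction cannot read SP directly)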
When allocating space for locals, round up to the next multiple of 16
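The rounding rule in the last item is the standard power-of-two alignment trick. A short C sketch, with `align_up` and `align_down` as hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Rounds n up to the next multiple of align (a power of two). */
uint64_t align_up(uint64_t n, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Rounds addr down to a multiple of align, as masking SP with -16 does. */
uint64_t align_down(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}
```

A function that needs 40 bytes of locals would therefore subtract `align_up(40, 16)`, which is 48, from SP. Stacks grow downward, so aligning an address uses `align_down`, while sizing a frame uses `align_up`.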

Trade-off Table

Aspect	Hand-Written Assembly	Compiler-Generated	Notes
Performance	Maximum control	Very good, often optimal	Humans rarely beat compilers at micro-optimization
Maintainability	Difficult	Easy	Assembly harder to modify and debug
Portability	Architecture-specific	Automatic via compiler	One C source → many ISAs
Size	Can be smaller	May include unused code	Link-time optimization helps
Reliability	Error-prone	Well-tested compiler	Compilers have decades of bug fixes
Security	Easy to make mistakes	Safer by default	Compiler can insert protections such as stack canaries when enabled

Implementation Snippets

Complete x86-64 Function: strlen

; size_t strlen(const char *str)
; Input: RDI = pointer to null-terminated string
; Output: RAX = length

strlen:
    xor     rax, rax          ; initialize counter to 0
    mov     rdx, rdi          ; copy pointer to RDX for scanning

.loop:
    mov     cl, [rdx]         ; load current byte into CL (low 8 bits of RCX)
    test    cl, cl            ; test if byte is zero
    jz      .done             ; if zero, we're done
    inc     rax                ; increment counter
    inc     rdx                ; advance pointer
    jmp     .loop             ; continue

.done:
    ret

Complete ARM64 Function: strcmp

; int strcmp(const char *s1, const char *s2)
; Input: X0 = s1, X1 = s2
; Output: X0 = comparison result (< 0, 0, or > 0)

strcmp:
.loop:
    ldrb    w2, [x0], 1       ; load byte from s1, post-increment
    ldrb    w3, [x1], 1       ; load byte from s2, post-increment
    subs    w2, w2, w3        ; w2 = s1 byte - s2 byte, set flags
    b.ne    .done             ; bytes differ: return the difference
    cbnz    w3, .loop         ; bytes equal and not NUL: keep scanning
    ; fall through: both bytes were NUL, so w2 is already 0

.done:
    sxtw    x0, w2            ; sign-extend 32-bit result to 64-bit
    ret

High-Level Language Inline Assembly Example

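// Both examples assume x86-64 with GCC or Clang and the default AT&T operand order.
#include <stddef.h>

// C function wrapping a single instruction (requires a CPU with POPCNT)
unsigned long long popcount(unsigned long long x) {
    unsigned long long count;
    __asm__(
        "popcnt %1, %0"
        : "=r"(count)           // output: register operand
        : "r"(x)               // input: register operand
        : "cc"                 // popcnt rewrites the flags
    );
    return count;
}

// More complex: summing an array with a hand-written loop
void sum_array(const unsigned *arr, size_t len, unsigned *result) {
    unsigned sum = 0;
    if (len != 0) {             // the asm loop always runs at least once
        __asm__(
            "1:\n\t"
            "addl (%[arr]), %[sum]\n\t"   // sum += *arr
            "addq $4, %[arr]\n\t"         // advance to the next element
            "decq %[len]\n\t"             // one fewer element left
            "jnz 1b"                      // loop until len reaches zero
            : [sum] "+r"(sum), [arr] "+r"(arr), [len] "+r"(len)
            :                             // no input-only operands
            : "cc", "memory"
        );
    }
    *result = sum;
}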
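Inline assembly is easy to get subtly wrong, so it helps to keep a portable reference implementation to compare against. The sketch below is one way to do that for a population count; `popcount_ref` and `check_popcount` are hypothetical names, and the sample values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Portable reference: clear the lowest set bit until none remain. */
unsigned long long popcount_ref(unsigned long long x)
{
    unsigned long long count = 0;
    while (x != 0) {
        x &= x - 1;   /* drops exactly one set bit per iteration */
        count++;
    }
    return count;
}

typedef unsigned long long (*popcount_fn)(unsigned long long);

/* Aborts via assert if `candidate` disagrees with the reference. */
void check_popcount(popcount_fn candidate)
{
    static const unsigned long long samples[] = {
        0ULL, 1ULL, 0xFFULL, 0x8000000000000000ULL, ~0ULL
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(candidate(samples[i]) == popcount_ref(samples[i]));
}
```

On an x86-64 build, calling `check_popcount` with an inline-assembly implementation as the argument exercises the asm version against the reference without tying the test itself to one architecture.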

Observability Checklist

When debugging assembly code:

Disassembly listing: Compile with -S or use objdump -d to see generated code
Register values: Use info registers in GDB or register read in LLDB
Memory examination: Use x/10x to examine memory in hex, x/s for strings
Single-stepping: Use stepi (instruction level) vs step (source level)
Breakpoints: Set at function entry, loop boundaries, and call sites
Instruction count: Profile with perf stat to count retired instructions
Pipeline stalls: Use performance counters to detect cache misses and mispredictions
Stack trace: Verify frame pointers are intact for accurate backtraces

Essential Tools:

objdump -d binary — disassemble any binary
gdb / lldb — debug with disassemble, info registers, x/i
perf — profile with hardware counters
valgrind — memory access validation
strace / ltrace — system call and library call tracing

Common Pitfalls / Anti-Patterns

Security & Compliance

Code Injection: Assembly programs must carefully validate all inputs. Unlike managed languages, there’s no bounds checking built into memory access instructions.
Return-Oriented Programming (ROP): Even with NX stacks, attackers can chain existing code fragments (gadgets). Mitigations:
- Stack canaries detect overwrites before return
- ASLR randomizes memory layout
- Control Flow Integrity (CFI) validates indirect jump targets

Timing Attacks: Constant-time programming in assembly is critical for crypto:

; VULNERABLE: branch depends on secret data
cmp     byte [key + i], 0
je      .done

; SAFE: materialize the condition without branching
cmp     byte [key + i], 0
setne   al                ; al = (key[i] != 0) ? 1 : 0
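The same idea applies one level up. A comparison that returns at the first mismatching byte leaks how many leading bytes matched; accumulating the differences and testing once at the end avoids that. A minimal sketch, with `ct_equal` as a hypothetical name; note that a compiler is free to transform C code, so the generated assembly still has to be inspected before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the two n-byte buffers are equal, 0 otherwise. The loop
   always visits every byte, so its trip count does not depend on where
   (or whether) the buffers differ. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);   /* stays 0 only if all bytes match */
    return diff == 0;
}
```

Only the final result depends on the data; no branch inside the loop does.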

Compliance Requirements: Secure coding standards (CERT C/C++, MISRA), FIPS 140-3 for crypto modules, and safety standards (DO-178C, ISO 26262) have specific rules about inline assembly and require rigorous verification.

Programming Pitfalls

Forgetting to Preserve Callee-Saved Registers: Modifying RBX, RBP, R12-R15 in x86-64 without saving them corrupts the caller’s state.
Stack Misalignment: The x86-64 System V ABI requires RSP to be 16-byte aligned at each call instruction. Violations typically surface as crashes inside library code that uses aligned SSE instructions such as movaps.
Sign-Extension Mistakes: Writing a 32-bit register zero-extends into the upper 32 bits of the 64-bit register, so a negative 32-bit value moved this way becomes a large positive 64-bit value. Use movsx or movsxd when the value is signed.
Assuming Sequential Memory Ordering: ARM has a weakly ordered memory model, so loads and stores to different addresses may be observed out of program order by other cores. Use dmb ish barriers or acquire/release instructions (ldar, stlr) around accesses to shared data.
Using EBX Instead of RBX in x86-64: Writing any 32-bit register zeroes the upper 32 bits of the corresponding 64-bit register. mov ebx, 1 clears the upper half of RBX, which breaks code that expects those bits to be preserved.
Forgetting to Scale Array Indices: Addresses are byte offsets, so the index must be multiplied by the element size. Accessing arr[i] where arr is int* means base + i*4, for example [rdi + rsi*4], not base + i.
Not Using Frame Pointer for Debugging: Omitting frame pointers (-fomit-frame-pointer) breaks frame-pointer-based stack unwinding, which many profilers and some debuggers rely on when unwind tables are missing or stripped.
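
The sign-extension pitfall can be modelled in a few lines of C. This is a sketch only: the function names widen_zero and widen_sign are hypothetical, and the comments describe the x86-64 instructions each cast corresponds to.

```c
#include <stdint.h>

/* Models a 32-bit register write (mov eax, src32) followed by a read of RAX:
 * the upper 32 bits are zeroed. */
uint64_t widen_zero(int32_t v) {
    return (uint64_t)(uint32_t)v;
}

/* Models movsxd rax, src32: the sign bit is copied into the upper 32 bits. */
int64_t widen_sign(int32_t v) {
    return (int64_t)v;
}
```

For v = -1, widen_zero yields 0x00000000FFFFFFFF while widen_sign yields -1, which is the difference between a correct and a wildly wrong array offset.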

Quick Recap Checklist

Assembly language maps directly to machine instructions defined by the ISA
Understanding register purpose (callee-saved vs caller-saved) is essential for correct function writing
Memory access requires careful consideration of size, alignment, and signed vs unsigned extension
Control flow in assembly uses condition codes set by previous operations—cmp and test are common
ARM’s conditional execution and conditional select can eliminate branches entirely
Always preserve callee-saved registers; caller-saved registers are the caller’s responsibility
Stack must be 16-byte aligned on x86-64 and ARM64 at function call boundaries
Inline assembly requires correct constraints to communicate with the C calling environment
Use performance counters and disassemblers to verify what code actually executes
Security in assembly requires explicit defenses—there’s no automatic protection

Interview Questions

1. Explain the difference between caller-saved and callee-saved registers. Why does this distinction exist?

Caller-saved registers (also called volatile or scratch registers) are not preserved across a function call: the callee is free to overwrite them. If the caller needs their values after the call, it must save and restore them itself.

Callee-saved registers (also called preserved or non-volatile registers) must hold the same values on return as they did on entry. If a function wants to use them, it must save them on entry and restore them before returning.

The split keeps save/restore traffic low. Short-lived temporaries go in caller-saved registers, which a callee can use with no overhead, while values that must survive calls go in callee-saved registers, which cost a save/restore only in the functions that actually touch them.

2. How does the stack frame work, and why do we need both a stack pointer and a base pointer?

The stack frame is a region of stack memory allocated for a single function call. The stack pointer (SP) always points to the current top of the stack and moves as data is pushed and popped. The base pointer (BP), also called frame pointer, provides a fixed reference point within the frame.

When a function begins, it saves the old BP, sets BP to the current SP, then allocates space for locals by subtracting from SP. Throughout the function, SP may move as the function pushes values or allocates more stack. BP stays fixed, so parameters and locals sit at known, constant offsets from BP, such as [rbp-8] for the first local or [rbp+16] for the first stack-passed argument on x86-64.

Compilers can omit the frame pointer (-fomit-frame-pointer) whenever the frame size is known at compile time, addressing locals relative to SP and freeing BP as a general-purpose register. Unwinders then rely on unwind tables (for example DWARF CFI) instead. A frame pointer is still required for dynamically sized frames (alloca, variable-length arrays) and remains useful for profilers and debuggers that walk the frame-pointer chain.

3. What is the purpose of the LEA instruction in x86, and why is it often used instead of ADD?

LEA (Load Effective Address) computes the address of a memory location without actually accessing memory—it puts the calculated address into a register. lea rax, [rbx + rcx*4] computes rbx + rcx*4 and stores the result in RAX.

This is useful for address calculation because LEA can add a base, an index scaled by 1, 2, 4, or 8, and a constant displacement in a single instruction, with no memory access and no flags affected. For example, to compute &array[i] for 8-byte elements where i is in RCX and array is at RBX: lea rax, [rbx + rcx*8].

Compilers often use LEA for plain arithmetic too: lea rax, [rbx + rbx*4] computes rbx * 5. Unlike ADD, LEA writes its result to a separate destination register, so the source operands survive, and it leaves the flags set by an earlier cmp or test intact.
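
The arithmetic LEA performs can be written out as a small C model. This is a sketch under stated assumptions: the function names lea and times5 are illustrative, and the model does not enforce the hardware restriction that scale be 1, 2, 4, or 8.

```c
#include <stdint.h>

/* Effective address as LEA computes it: base + index*scale + disp.
 * Real encodings only allow scale = 1, 2, 4, or 8; this sketch does not check. */
uint64_t lea(uint64_t base, uint64_t index, uint64_t scale, int64_t disp) {
    return base + index * scale + (uint64_t)disp;
}

/* Models lea rax, [rbx + rbx*4], which yields rbx * 5. */
uint64_t times5(uint64_t x) {
    return lea(x, x, 4, 0);
}
```

For instance, lea(0x1000, 3, 8, 0) gives 0x1018, the address of element 3 in an array of 8-byte values starting at 0x1000.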

4. Describe ARM64's approach to conditional execution and how it differs from x86.

ARM64 provides multiple mechanisms for conditional execution that reduce branch dependency. The Compare and Branch instructions combine comparison and conditional jump: cbnz x0, label branches if X0 is not zero. Conditional Select instructions select between values without branching: csel x0, x1, x2, eq sets X0 to X1 if equal, otherwise X2.

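In contrast, x86 has no fused compare-and-branch instruction: a cmp or test sets the flags and a separate jcc consumes them. Its closest equivalent to csel is cmovcc, which overwrites one of its operands rather than selecting between two sources into a third register. On ARM64, x = (a == b) ? c : d becomes a cmp followed by a single csel, avoiding a branch misprediction penalty.

The older 32-bit ARM instruction set (A32) went further: most instructions carry a condition field and execute only if the status flags match, and Thumb-2 offers the same through IT blocks. ARM64 dropped general conditional execution and kept only a small set of conditional instructions such as csel, csinc, cset, and ccmp.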

5. What causes stack alignment issues, and how do you fix them in x86-64 and ARM64?

Both ABIs require a 16-byte aligned stack pointer, but they enforce it differently. On x86-64 (System V), RSP must be 16-byte aligned at each CALL instruction. CALL then pushes an 8-byte return address, so on function entry RSP is 8 bytes off a 16-byte boundary, and the function must restore alignment before making calls of its own.

In x86-64, the usual push rbp prologue restores 16-byte alignment, so any further allocation must be a multiple of 16 bytes: sub rsp, 32 allocates 32 bytes even if you only need 24. Allocating an odd number of 8-byte slots after that point breaks alignment.

In ARM64, BL stores the return address in the link register rather than on the stack, but SP must be 16-byte aligned whenever it is used as the base of a memory access. Use sub sp, sp, 16, not sub sp, sp, 8. The stp and ldp instructions (store/load pair) keep alignment intact when used with pre-index or post-index updates of 16 bytes.

On x86-64, misalignment typically crashes in library code that uses aligned SSE instructions such as movaps. On ARM64, a misaligned SP raises an SP alignment fault on the next stack access when alignment checking is enabled, as it typically is on Linux.
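
The "always allocate in multiples of 16" rule is the usual round-up-to-power-of-two computation. A minimal sketch in C, with align16 as a hypothetical helper name:

```c
#include <stddef.h>

/* Round a frame size in bytes up to the next multiple of 16.
 * Adding 15 and clearing the low four bits works because 16 is a power of two. */
size_t align16(size_t bytes) {
    return (bytes + 15u) & ~(size_t)15u;
}
```

A function that needs 24 bytes of locals would therefore reserve align16(24), which is 32, matching the sub rsp, 32 example above.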

6. What is position-independent code (PIC) and when is it used?

Position-independent code can execute correctly regardless of where it's loaded in memory. Instead of absolute addresses, PIC uses PC-relative addressing or accesses Global Offset Table (GOT) entries for global data.

Uses:

Shared libraries (.so/.dll) — loaded at different addresses in different processes
ASLR (Address Space Layout Randomization) — security feature that randomizes memory addresses
Kernel modules — must load at arbitrary addresses in kernel memory

In x86-64, use RIP-relative addressing: mov rax, [rel symbol]. In ARM64, PC-relative loads are the default. Compilers generate PIC with -fPIC flag.

7. How does the flags register work in x86 and what are the common flag operations?

The EFLAGS register contains condition flags set by arithmetic and logical operations:

ZF (Zero Flag): Set when result is zero
SF (Sign Flag): Set when result is negative (MSB=1)
CF (Carry Flag): Set on unsigned overflow
OF (Overflow Flag): Set on signed overflow
PF (Parity Flag): Set when low 8 bits have even number of 1s

Common patterns:


cmp rax, rbx    ; sets ZF if equal; SF != OF if rax < rbx (signed); CF if rax < rbx (unsigned)
test rax, rax    ; sets ZF if rax is zero, SF if negative
jo label         ; jump if overflow flag set
jnc label        ; jump if carry flag clear

test rax, rax is the idiomatic zero check: like cmp rax, 0 it sets the flags without modifying the register, but it has a shorter encoding because it needs no immediate operand.
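
To make the flag definitions concrete, here is a sketch in C that computes the four main flags the way cmp a, b would for 64-bit operands. The struct and function names are hypothetical; only the flag rules themselves come from the x86 definition.

```c
#include <stdbool.h>
#include <stdint.h>

struct flags { bool zf, sf, cf, of; };

/* Flags produced by cmp a, b: the subtraction a - b is performed
 * and the result is discarded. */
struct flags cmp64(uint64_t a, uint64_t b) {
    uint64_t r = a - b;
    struct flags f;
    f.zf = (r == 0);                           /* result is zero */
    f.sf = (r >> 63) != 0;                     /* top bit of the result */
    f.cf = a < b;                              /* unsigned borrow */
    f.of = (((a ^ b) & (a ^ r)) >> 63) != 0;   /* signed overflow */
    return f;
}

/* Signed "less than" as jl evaluates it: SF != OF. */
bool signed_less(struct flags f) {
    return f.sf != f.of;
}
```

The overflow case shows why jl cannot test SF alone: comparing the most negative 64-bit value with 1 leaves SF clear, yet SF != OF still reports "less than" correctly.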

8. What is the difference between ldr and ldrb in ARM64?

ldr x0, [x1] loads a 64-bit value from the address in X1 into X0. On normal memory the address does not have to be 8-byte aligned unless strict alignment checking is enabled, but aligned accesses are faster and are required for device memory.

ldrb w0, [x1] loads a single byte (8 bits) into the low 8 bits of W0 and zero-extends it. Because writing a W register clears the upper 32 bits of the X register, X0 ends up holding the byte zero-extended to 64 bits. A byte load has no alignment requirement.

Related variants:

ldrh: Load halfword (16 bits), zero-extended, into a W register
ldrsb: Load signed byte (sign-extends to 32 or 64 bits, depending on the destination register)
ldrsh: Load signed halfword (sign-extends)
ldnp: Load non-temporal pair (cache hints)
ldxp/ldaxp: Load exclusive pair (for atomic operations)

Using the wrong variant causes subtle bugs, for example ldrb on a signed char when ldrsb was needed, which turns -1 into 255. Match the load size to the data in memory and the extension to its signedness.

9. What is the difference between STR and STM on ARM, and what does ARM64 use instead of STM?

str (Store Register) stores a single register to memory. stm (Store Multiple) stores a list of registers to consecutive memory locations starting at the base register. STM and LDM exist only in the 32-bit ARM instruction sets (A32/T32); ARM64 removed them and provides stp/ldp (store/load pair) instead.

str x0, [x1] writes the 64-bit value in X0 to the address in X1. In 32-bit ARM, stmia r0!, {r1, r2, r3} writes R1 to [r0], R2 to [r0+4], R3 to [r0+8], and increments R0 by 12 (the ! is writeback, which updates the base register). The ARM64 counterpart works in pairs: stp x1, x2, [x0], 16 writes X1 to [x0] and X2 to [x0+8], then increments X0 by 16.

Use str for single values or when you need precise control over addressing. Use stp for saving and restoring registers in bulk (function prologue/epilogue, context switching); one stp replaces two str instructions.

Related: ldp/stp handle exactly two registers and are usually faster than two separate loads or stores. The exclusive variants ldxp/stxp provide paired access under the exclusive monitor for atomic sequences.

10. What is the difference between AT&T syntax and Intel syntax for x86 assembly?

AT&T syntax is used by the GNU assembler (GAS) on Linux. Intel syntax is used by MASM, NASM, and in Intel documentation.

Key differences:

Operand order: AT&T is instruction source, destination. Intel is instruction destination, source. So mov eax, ebx in Intel syntax, which copies EBX into EAX, is written movl %ebx, %eax in AT&T syntax.
Register naming: AT&T prefixes registers with % (e.g., %eax). Intel uses plain names (eax).
Immediate values: AT&T prefixes with $ (e.g., $42). Intel uses plain 42.
Memory references: AT&T uses disp(base, index, scale) syntax. Intel uses [base + index*scale + disp].
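Operand size: AT&T suffixes the instruction with b, w, l, or q (byte, word, long, quad). Intel infers the size from the register operands or takes an explicit size keyword such as dword (dword ptr in MASM).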

Example comparison:


; Intel syntax (NASM/FASM)
mov eax, [ebp+8]
add eax, [ebp+12]
; AT&T syntax (GAS)
movl 8(%ebp), %eax
addl 12(%ebp), %eax

On Linux, GCC and the GNU assembler default to AT&T syntax. Intel syntax is available through NASM, through the GNU assembler's .intel_syntax noprefix directive, or through GCC's -masm=intel option.

11. How does the condition code register (FLAGS/EFLAGS/RFLAGS) work on x86?

The EFLAGS register is a 32-bit register containing status flags and control bits. Key flags:

ZF (Zero Flag): Set when result is zero (or equal after SUB/CMP)
SF (Sign Flag): Set when result is negative (MSB=1)
OF (Overflow Flag): Set when signed arithmetic overflows
CF (Carry Flag): Set when unsigned arithmetic overflows
PF (Parity Flag): Set when low 8 bits have even number of 1s
AF (Auxiliary Carry): BCD arithmetic; rarely used

Control bits:

DF (Direction Flag): String operations increment (0) or decrement (1) pointers
IF (Interrupt Enable): Enable/disable maskable interrupts

Instructions that set flags: cmp, test, arithmetic (add, sub, inc, dec), logical (and, or, xor). Note that inc and dec leave CF unchanged.

Instructions that read flags: conditional jumps (je, jne, jl, etc.), setcc (set byte if condition), cmovcc (conditional move).

The flags are set by the previous operation—always check what instruction set them and when they were set.

12. What is the purpose of the red zone in x86-64 System V ABI?

The red zone is a 128-byte area immediately below the stack pointer that signal handlers are guaranteed not to overwrite. Leaf functions can keep local variables there without adjusting the stack pointer.


; Leaf function using the red zone: no sub rsp needed
mov [rsp-8], rdi    ; Spill a local below RSP
mov [rsp-16], rsi   ; Another local, still inside the 128-byte red zone
mov rax, [rsp-8]
add rax, [rsp-16]
ret

Red zone usage:

A leaf function with up to 128 bytes of locals skips sub rsp, N entirely
Access locals as [rsp-8], [rsp-16], etc.
If a signal arrives, the kernel builds the handler's frame below the red zone, so the locals are not corrupted

Red zone limitations:

Any call or push overwrites it, so it is only safe in leaf functions
Non-leaf functions must allocate their locals with sub rsp, N as usual
Kernel code is typically built with -mno-red-zone, because hardware interrupts push state directly below RSP
Windows x64 ABI has no red zone; its 32-byte shadow space is a separate mechanism that the caller reserves above the return address

This optimization reduces stack pointer adjustments, improving performance in leaf functions.

13. How do you write a variadic function in x86-64 assembly?

Variadic functions accept a variable number of arguments. On x86-64 System V ABI:

Argument passing: The first 6 integer arguments are in RDI, RSI, RDX, RCX, R8, R9. The rest go on the stack.

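Variadic handling: In C, the va_list machinery does this for you. In pure assembly you must spill the register-passed arguments yourself and then walk the stack-passed ones:


; long variadic_sum(long count, ...)   (integer arguments only)
; count in RDI; variadic args in RSI, RDX, RCX, R8, R9, then on the stack
variadic_sum:
    push rbp
    mov rbp, rsp
    sub rsp, 48              ; Save area for 5 register args, keeps RSP 16-byte aligned

    ; Spill the register-passed variadic args so they can be indexed
    mov [rbp-40], rsi
    mov [rbp-32], rdx
    mov [rbp-24], rcx
    mov [rbp-16], r8
    mov [rbp-8],  r9

    xor eax, eax             ; sum = 0
    xor ecx, ecx             ; i = 0
    lea rdx, [rbp-40]        ; Cursor over the spilled register args
    lea rsi, [rbp+16]        ; First stack arg (above saved RBP and return address)

.loop:
    cmp rcx, rdi
    jge .done
    cmp rcx, 5               ; After 5 register args, switch to the stack args
    cmove rdx, rsi
    add rax, [rdx]           ; sum += next argument
    add rdx, 8
    inc rcx
    jmp .loop

.done:
    leave
    ret

Key points: variadic arguments that arrive in registers must be spilled before they can be indexed, stack-passed arguments start at [rbp+16] after a standard prologue, and the caller of a variadic function must also set AL to the number of vector registers used (0 here).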
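
For comparison, this is roughly what the same function looks like in C, where stdarg.h hides the register save area and the stack walk. The function name sum_longs is hypothetical; va_start, va_arg, and va_end are the standard macros.

```c
#include <stdarg.h>

/* Hypothetical example: sums `count` arguments, each of which must be a long. */
long sum_longs(int count, ...) {
    va_list ap;
    long total = 0;
    va_start(ap, count);               /* start after the last named parameter */
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, long);     /* fetch the next argument as a long */
    }
    va_end(ap);
    return total;
}
```

A call such as sum_longs(3, 1L, 2L, 3L) returns 6. Compiling this with -S is a practical way to see the register spill and the switch to stack arguments that the hand-written version performs explicitly.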

14. What is position-independent code (PIC) and how do you write it in x86-64?

PIC can execute correctly regardless of load address. Instead of absolute addresses, PIC uses PC-relative addressing. Essential for shared libraries (.so) and ASLR.

In x86-64, use rip-relative addressing:


; Local data: RIP-relative addressing (NASM)
lea rax, [rel global_data]            ; RAX = address of global_data
mov rdx, [rel global_data]            ; Or load its value directly

; External symbol: go through the Global Offset Table (NASM, ELF64)
mov rax, [rel global_var wrt ..got]   ; RAX = address of global_var from its GOT entry
mov rax, [rax]                        ; Load the value

The rel keyword (NASM) generates a RIP-relative relocation; in GAS the equivalent is symbol(%rip) in AT&T syntax. The call/pop sequence for reading the program counter is a 32-bit x86 idiom and is unnecessary in 64-bit mode, where RIP can be used directly as a base.

In ARM64, PC-relative loads are the default:


adr x0, .Ldata         ; Load address of .Ldata relative to PC
ldr x0, [x0]            ; Load the value at that address

Compilers generate PIC automatically with -fPIC. In hand-written assembly, you must use PC-relative operands yourself and go through the GOT for external symbols.

15. How do you implement atomic operations without special instructions?

Atomic read-modify-write operations require CPU support (xchg or the lock prefix on x86, load-exclusive/store-exclusive pairs on ARM). With plain loads and stores, a check followed by a write leaves a window in which another thread can interleave:

Test-and-set without atomicity:


; BROKEN: check-then-set with plain loads and stores
; RDI = address of the lock word (0 = free, 1 = held)
try_lock:
    cmp dword [rdi], 0      ; Check if free
    jne .busy               ; Held by someone else
    ; Another thread can pass the same check right here
    mov dword [rdi], 1      ; NOT atomic with the check above
    xor eax, eax            ; Reports success, possibly to two threads at once
    ret
.busy:
    mov eax, -1             ; Failure
    ret

Spinning does not help (still broken for concurrent threads):


; Re-checking after the store does not close the race:
; both threads can store 1 and both then read back 1

.lock:
    cmp dword [shared], 0   ; Check lock
    jnz .lock               ; If busy, spin
    mov dword [shared], 1   ; Try to acquire: NOT ATOMIC
    cmp dword [shared], 1   ; True for every thread that stored
    jne .lock
    ; "Acquired" (race still exists)

True atomic operations require hardware support: xchg or lock-prefixed instructions on x86, ldxr/stxr (or the acquire/release forms ldaxr/stlxr) on ARM. Software-only algorithms such as Peterson's assume sequentially consistent memory, which modern CPUs do not provide without barrier instructions.
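
In C, the hardware primitive is reachable through C11's stdatomic.h. The sketch below builds a minimal spinlock on atomic_flag; the struct and function names are hypothetical, while the atomic_flag calls are standard. On x86-64, compilers typically lower the test-and-set to an xchg instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock built on C11 atomic_flag.
 * Initialize with: struct spinlock l = { ATOMIC_FLAG_INIT }; */
struct spinlock { atomic_flag held; };

/* Returns true if the lock was acquired. test_and_set reads the old value
 * and writes "set" as one indivisible step, which closes the race above. */
bool spin_trylock(struct spinlock *l) {
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-wait until the lock is acquired. */
void spin_lock(struct spinlock *l) {
    while (!spin_trylock(l)) {
        /* spin */
    }
}

/* Release the lock so another thread's test-and-set can succeed. */
void spin_unlock(struct spinlock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The acquire and release orderings also supply the memory barriers that the plain-store versions lack, which matters on weakly ordered CPUs such as ARM.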

16. What is the difference between a leaf function and a non-leaf function in ARM64?

A leaf function is a function that doesn't call any other functions. A non-leaf function calls other functions (including itself recursively).

ARM64 leaf function optimization:


; Leaf function: doesn't need stack frame
leaf_func:
    add x0, x0, x1      ; x0 = x0 + x1
    ret                 ; Return to link register

No STP/LDP needed because we don't save LR (link register) — we don't call anything.

Non-leaf function must save LR because BL overwrites it:


; Non-leaf: must save link register
non_leaf:
    stp x29, x30, [sp, -16]!  ; Save FP and LR
    mov x29, sp

    bl some_function         ; LR gets overwritten

    ldp x29, x30, [sp], 16   ; Restore
    ret

The key difference: leaf functions don't need to save LR. This is a significant optimization for small functions, because a function that makes no calls keeps LR intact and can skip stack frame setup entirely.

17. How do you implement a counting loop in ARM64 assembly?

ARM64 counting loop:


; Calculate sum of 1 to N (N treated as unsigned)
; Input: X0 = N
; Output: X0 = sum

    mov x1, 1           ; i = 1
    mov x2, 0           ; sum = 0

.loop:
    cmp x1, x0          ; i <= N?
    b.hi .done          ; if i > N, done
    add x2, x2, x1      ; sum += i
    add x1, x1, 1       ; i++
    b .loop             ; continue

.done:
    mov x0, x2          ; return sum
    ret

Or using decrement:


; Using decrement (common pattern)
    mov x1, x0          ; Copy N to X1 as counter
    mov x0, 0           ; sum = 0
    cbz x1, .done       ; N == 0: skip the loop

.loop:
    add x0, x0, x1      ; sum += counter
    subs x1, x1, 1      ; counter--, set flags
    b.ne .loop          ; if counter != 0, continue

.done:
    ret

The subs (subtract and set flags) followed by b.ne is efficient. cbz (compare and branch if zero) combines comparison and branch.

18. What is the difference between the .data, .text, and .rodata sections?

.text section: Contains executable code (instructions). This section is typically marked read-only and executable. Code lives here.

.data section: Contains initialized global and static variables that can be modified at runtime. Examples: int count = 5;

.rodata section (or .rdata): Contains read-only data — constants, string literals, jump tables. Cannot be modified at runtime.

.section .data
global_var: .quad 42              ; Writable, initialized

.section .rodata
msg:   .asciz "Hello, world!"     ; Read-only string
table: .word 0, 1, 4, 9           ; Read-only table

.section .text
function:
    ; code here
    ret

On Linux, the linker places .rodata in a read-only segment, historically the same one as .text, never together with .data. The OS then sets page permissions per segment: text (RX), data (RW), rodata (R).

19. How do you write a tail call optimization in assembly?

A tail call is when a function calls another function as its last action and returns that function's result unchanged. The compiler can then replace the call with a jump, letting the callee reuse the caller's return address instead of pushing a new one.


; Without tail call optimization:
caller:
    push rbp
    mov rbp, rsp
    call callee          ; Pushes a return address into caller
    pop rbp              ; Tear down caller's frame
    ret                  ; Return callee's result (already in RAX)

With tail call optimization (callee returns straight to caller's caller):


caller:
    push rbp
    mov rbp, rsp
    ; ... body ...
    pop rbp              ; Tear down the frame BEFORE the jump
    jmp callee           ; Jump instead of call
    ; No ret here: callee's ret returns to caller's caller

In ARM64:


; No frame needed: tail call
    b callee             ; Branch to callee (don't link)
    ; LR still holds our caller's return address,
    ; so callee's ret goes to our caller

Tail call optimization prevents stack growth in tail-recursive functions and long call chains. It's critical for functional languages, where loops are commonly written as tail recursion; in f(g(x)), only the outer call to f is in tail position.
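
A tail-recursive function in C gives the compiler the opportunity described above. This is a sketch with a hypothetical function name; the C standard does not guarantee the optimization, and compilers generally apply it only when optimizing (for example at -O2).

```c
/* Hypothetical example: sum of 1..n using an accumulator.
 * The recursive call is the last action and its result is returned unchanged,
 * so a compiler may replace the call with a jump. */
unsigned long sum_to(unsigned long n, unsigned long acc) {
    if (n == 0) {
        return acc;
    }
    return sum_to(n - 1, acc + n);   /* tail position: nothing left to do afterwards */
}
```

Calling sum_to(10, 0) returns 55. Comparing the disassembly with and without optimization shows the call instruction turning into a loop or a jmp.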

20. What is the difference between a syscall and a library call from the CPU's perspective?

Library call (printf, malloc): A call to a function within the same process. The call is just a CALL instruction to an address in the code segment. The library code executes in the same privilege mode (user mode). It's a normal function call.

Syscall (read, write, brk): A request to the kernel. The CPU must switch from user mode to kernel mode. This uses the SYSCALL or SYSENTER instruction (int 0x80 on older 32-bit x86, svc on ARM64), which:

Changes privilege level (user → kernel)
Saves the return address
Loads the new PC from the kernel's syscall entry point
Makes kernel memory accessible (with kernel page-table isolation, the page tables are switched as well)

The syscall number goes in a specific register (RAX on x86-64, EAX on 32-bit x86, X8 on ARM64). Arguments go in other registers. The kernel validates arguments before performing the operation.

Syscall overhead is roughly 50-100 ns and library call overhead roughly 1 ns, though exact figures depend on the CPU and on kernel mitigations. Make syscalls only when necessary (file I/O, process control, requesting memory from the OS). Library functions such as malloc are ordinary calls that enter the kernel only occasionally.

Conclusion

Assembly language sits at the boundary between what you write and what the hardware actually executes. Even if you never write production assembly, reading it helps you understand compiler output, debug performance issues, and reason about security exploits.

The key takeaways are understanding register usage (caller-saved vs callee-saved), stack frame management, and control flow via condition codes. Both x86-64 and ARM64 have their quirks—memory ordering differences can cause real bugs in concurrent code that works on one architecture but fails on another.

Continue your low-level exploration by studying boolean logic and gates to understand how transistors implement these instructions, or building a CPU simulator to see how these concepts come together in a working system.
