Building a Simple CPU Simulator: Implementing the Fetch-Decode-Execute Cycle

Implement a functional CPU simulator in Python that demonstrates the fetch-decode-execute cycle, register file, ALU operations, and memory access.

published: May 19, 2026 reading time: 42 min read author: GeekWorkBench

Quick Summary

Implement a functional CPU simulator in Python that demonstrates the fetch-decode-execute cycle, register file, ALU operations, and memory access.

Introduction

Building a CPU simulator is one of the most educational projects in computer science. By implementing one, you see how instructions actually flow through a processor, how the fetch-decode-execute cycle works, and why real architectures make the design decisions they do. Writing a simulator also develops intuition for debugging systems-level code—state management, error handling, trace output—since these same challenges appear in kernel debugging and emulator development.

The simulator we’ll build implements a simple RISC-like instruction set with a register file, ALU, and memory subsystem. It shows the fundamental concepts used in actual CPU design while staying small enough to understand completely.

When to Use / When Not to Use

When to Use a CPU Simulator:

Teaching and learning computer architecture fundamentals
Testing instruction set changes before hardware implementation
Writing and debugging code for a new or custom ISA
Researching security vulnerabilities in a controlled environment
Creating test harnesses for compiler backends

When Not to Use:

Production performance requirements—interpretive simulation is slow
Whencycle-accurate timing matters—our simulator is functional, not timing-accurate
As a replacement for actual hardware testing before deployment
When you need memory hierarchy simulation (caches, TLBs) in detail

CPU Architecture

flowchart TB
    subgraph "Fetch Unit"
        A["Program Counter<br/>PC"] --> B["Instruction<br/>Memory"]
        B --> C["Instruction<br/>Register"]
    end

    subgraph "Decode Unit"
        C --> D["Control Unit"]
        D --> E["Register File<br/>Read Ports"]
        D --> F["Immediate<br/>Generator"]
    end

    subgraph "Execute Unit"
        E --> G["ALU"]
        F --> G
        E --> H["Branch<br/>Comparator"]
    end

    subgraph "Memory Unit"
        G --> I["Data<br/>Memory"]
        I --> J["Write Back<br/>Mux"]
    end

    subgraph "Write Back"
        J --> K["Register File<br/>Write Port"]
        K --> E
        G --> K
    end

    subgraph "PC Update"
        H --> L["PC Src<br/>Mux"]
        L --> A
        G --> L
    end

Core Concepts

The Fetch-Decode-Execute Cycle

The fundamental operation of any CPU is the fetch-decode-execute cycle, repeated billions of times per second:

Fetch: Read the next instruction from memory at the address in the Program Counter (PC)
Decode: Examine the instruction bits to determine what operation to perform
Execute: Perform the computation—ALU operation, memory access, or register transfer
Write Back: Store the result back to registers or memory
Update PC: Increment PC (or set to branch target if taken)

Our simulator implements this cycle in software, using Python data structures to model the hardware components.

The PC update mechanics reveal why this cycle is structured the way it is. After fetching an instruction at address PC, the CPU must advance to the next instruction. For sequential code, PC becomes PC + 4 (adding 4 bytes because each instruction is 32 bits long and memory is byte-addressable). Branch and jump instructions override this default by loading a different address into PC. The key insight is that the PC always points to the next instruction before fetch happens—meaning the CPU has already prefetched what it believes will be sequential flow. When a branch is taken, that prefetched instruction is discarded and fetch restarts at the branch target. This is why branch mispredictions are expensive: the pipeline has already wasted cycles working on instructions that turn out to be irrelevant.

Instruction fetch from memory has alignment considerations that matter in real hardware. Our simulator uses word-aligned accesses—load_instruction reads 4 consecutive bytes starting at the PC address. If a program somehow managed to set PC to a non-word-aligned address (say, PC = 1), our simulator would crash because load_word expects addresses divisible by 4. Real CPUs typically enforce alignment in hardware and raise alignment faults on violations. Some architectures (notably certain ARM modes) allow unaligned access but at reduced performance. The alignment requirement exists because fetching from memory is typically the slowest operation in the cycle—requiring the memory subsystem to read exactly one aligned word per fetch is simpler than handling arbitrary byte combinations.

The decode phase is where the opcode field does its work. The instruction encoding places the opcode in the lowest 4 bits, so a simple mask and shift extracts it: opcode = instruction & 0xF. Once the opcode is known, the CPU knows whether this is an R-type, I-type, or J-type instruction and can parse the remaining fields accordingly. For R-type instructions, the CPU reads two source registers (rs1, rs2) and writes one destination (rd). For I-type, it reads one source register and sign-extends a 12-bit immediate to 32 bits. The format is self-describing—the opcode determines how all other fields are interpreted. This is the foundation of the stored-program computer concept: instructions are just data in memory, and the same hardware that processes arithmetic can process instructions because they share the same representation.

Register File Design

The register file is a small, fast memory within the CPU that holds working values. Our simulator uses a simple array-based implementation:

class RegisterFile:
    """Simulates the register file with read and write ports."""

    def __init__(self, num_registers=8):
        self.num_registers = num_registers
        self.registers = [0] * num_registers
        self.pc = 0  # Program counter is separate

    def read(self, reg_num):
        """Read a register value."""
        if 0 <= reg_num < self.num_registers:
            return self.registers[reg_num]
        raise ValueError(f"Invalid register number: {reg_num}")

    def write(self, reg_num, value):
        """Write a value to a register."""
        if 0 <= reg_num < self.num_registers:
            self.registers[reg_num] = value
        else:
            raise ValueError(f"Invalid register number: {reg_num}")

    def dump(self):
        """Return current register state for debugging."""
        return {
            f"R{i}": self.registers[i]
            for i in range(self.num_registers)
        }

ALU Operations

The Arithmetic Logic Unit (ALU) performs computational operations. Our simple simulator supports:

Operation	Opcode	Description
ADD	0x0	R[rd] = R[rs1] + R[rs2]
SUB	0x1	R[rd] = R[rs1] - R[rs2]
AND	0x2	R[rd] = R[rs1] & R[rs2]
OR	0x3	R[rd] = R[rs1] \| R[rs2]
XOR	0x4	R[rd] = R[rs1] ^ R[rs2]
SLT	0x5	R[rd] = R[rs1] < R[rs2] (signed)
SLL	0x6	R[rd] = R[rs1] << R[rs2]
SRL	0x7	R[rd] = R[rs1] >> R[rs2] (logical)
ADDI	0x8	R[rd] = R[rs1] + immediate
ANDI	0x9	R[rd] = R[rs1] & immediate
ORI	0xA	R[rd] = R[rs1] \| immediate

Under the hood, the ALU uses a multiplexer-based design: the opcode selects which operation result to pass through. The inputs a and b go to all operation units in parallel, and the opcode drives control lines on a mux that choose the correct result. This is why the ALU can compute multiple things simultaneously but only output one—the mux selects. In hardware, this looks like a set of AND gates, adders, and shifters all computing in parallel, with their outputs fed into a large multiplexer tree. Our Python implementation using a dictionary of lambdas is a direct functional equivalent: ops[opcode](a, b) evaluates the selected operation, discarding the others.

Shift operations (SLL, SRL) live in the ALU rather than a separate shift unit for architectural reasons. Having shifts in the ALU keeps the design simpler—a single execution unit handles all computational operations. More importantly, shifts are often used in conjunction with other operations (shifting then adding, for example), so keeping them in the ALU avoids pipeline bubbles that would occur if the CPU had to dispatch to a separate unit. The tradeoff is that a full-featured shift unit with barrel shifter capability (shifting by variable amounts in a single cycle) is complex, but for a simple simulator, Python’s built-in shift operators suffice.

Flag generation is what separates a toy ALU from a real one. Beyond computing results, a production ALU tracks condition flags: zero flag (Z, set when result is 0), negative flag (N, set when result is negative in signed interpretation), carry flag (C, set on unsigned overflow out of the MSB), and overflow flag (V, set when signed arithmetic produces a result too large or too small for the destination). Our simulator doesn’t track these flags, but a real CPU uses them for conditional branches. The SLT instruction effectively computes a comparison by setting rd = 1 if rs1 < rs2 else 0, which is how “less than” is implemented without dedicated flag-based compare instructions.

Extending the ALU for multiplication and division requires more hardware. Our current instruction set lacks MUL and DIV because they are significantly more complex: multiplication requires an array of adders (the shift-and-add algorithm), and division is even worse—iterative refinement algorithms that can take variable time. A real implementation would add a multiply/divide unit and extend the opcode space. The result is also wider: multiplying two 32-bit numbers produces a 64-bit result, so the ISA typically defines separate HI and LO registers (MIPS) or a 64-bit result register (RISC-V’s MULW in RV64) to capture the full product.

Instruction Format

Our simulator uses a simple 32-bit instruction format with three types:

Type	Format	Opcode range
R-Type	`[opcode: 4][rd: 3][rs1: 3][rs2: 3][unused: 19]`	`0x0 – 0x7`
I-Type	`[opcode: 4][rd: 3][rs1: 3][immediate: 22]`	`0x8 – 0xA`
J-Type	`[opcode: 4][rd: 3][target: 22]`	`0xB`

Key details:

The existence of multiple instruction formats reflects a fundamental tension in ISA design: encoding efficiency versus flexibility. R-type instructions need three register fields (rd, rs1, rs2), which requires 9 bits total. I-type instructions need one register plus a 12-bit immediate—also demanding, but the immediate trades register space for a constant value. J-type dispenses with source registers entirely and uses the space for a larger jump target. The 4-bit opcode field is the lever that distinguishes between these types, and the opcode range determines which format an instruction uses. A RISC philosophy would use fixed-length formats for simplicity; a CISC philosophy (like x86) uses variable-length formats to pack more operations into fewer bytes.

The CPU knows which format to use purely from the opcode. When the control unit decodes the opcode and determines it is, say, 0x0 (ADD), the decode logic immediately knows this is an R-type instruction and interprets bits [10:7] as rs2, bits [7:4] as rs1, bits [4:0] as rd. If the opcode were 0x8 (ADDI), the same bit positions would be interpreted differently: bits [10:7] still specify rs1, but bits [31:10] now form a 22-bit immediate that gets sign-extended to 32 bits. No flag or mode bit tells the CPU which format—this is pure combinational logic driven by the opcode value. Adding a new instruction format means changing the decode ROM or logic, not adding a new field.

Immediate encoding trade-offs reveal why 12 bits is a common immediate width. With 12 bits, you can encode any immediate from -2048 to +2047 (signed) or 0 to 4095 (unsigned). This covers most immediate values that appear in typical programs—the vast majority of immediates are small constants used in array indexing, sign extension, and arithmetic adjustments. Encoding a larger immediate would require using two instructions ( lui + addi in RISC-V, for example), but using 12 bits for every immediate would waste space for programs that only use small values. Some ISAs offer multiple immediate widths: MIPS has 16-bit immediates for regular instructions and 26-bit immediates for jumps. The 12-bit choice in our simulator is a reasonable middle ground.

The R-type versus I-type decision in ISA design comes down to what operations need an immediate operand. Arithmetic operations that need two register sources (ADD, SUB) are naturally R-type. Operations that need a register plus a constant (ADDI for loading small offsets, ANDI for bit masks) are I-type. If the ISA needs a jump with a large target, it gets a J-type format with a wider immediate field. This split is why opcode assignment matters—once opcodes 0x0–0x7 are assigned to R-type, the ISA designer is committed to that format for those operations. Changing later would break all existing software.

Instruction Format Design

All instructions are 32 bits, byte-addressable with word-aligned accesses
R-Type: two register operands (rs1, rs2) and a destination (rd)
I-Type: one register operand (rs1), a 12-bit sign-extended immediate, and a destination (rd)
J-Type: an absolute word-aligned target address and a destination (rd)
R0 is hardwired to zero — writes to it are ignored

The bullet list above is a quick reference. What follows is the rationale behind each format decision and the details that turn these rules into working code.

Hardwiring R0 to zero is a deliberate choice that eliminates an entire class of hazards. In a RISC pipeline, a register that always returns zero lets the CPU handle “load immediate” efficiently: ADDI R1, R0, 42 puts 42 in R1 without needing a separate LUI instruction. It also simplifies the register file because one read port can always bypass the SRAM array and return zero. The tradeoff is losing one register from the 8 available, but the hardware simplification pays for it. ARM64 does the same thing with XZR, and so does RISC-V.

Word alignment exists because memory systems are built around aligned accesses. When the CPU fetches a 32-bit instruction, the memory subsystem reads a 32-bit chunk starting at an address divisible by 4. If you tried to fetch from address 1, the hardware would need to read two separate chunks and stitch the bytes together — a real complexity cost for no real benefit. Our simulator handles this by masking jump targets: target & 0xFFFFFFFC clears the low two bits, guaranteeing alignment even if a programmer produces a misaligned address. Real hardware raises an alignment fault. Our simulator crashes or produces garbage.

The opcode assignment map tells the decode unit how to interpret every other field. Our 4-bit opcode gives 16 values. We assign 0x0–0x7 to R-type, 0x8–0xA to I-type, 0xB to J-type, and 0xC–0xF to control instructions (branch, load, store, halt). Adding a MUL instruction means picking an unused opcode and extending the format, which requires a wider encoding or repurposing an existing slot. Opcode space is limited in any real ISA. RISC-V handles this with a base opcode plus function fields, which is why it has far more instruction variants than a flat 4-bit encoding lets us.

Implementation Snippets

Complete CPU Simulator

#!/usr/bin/env python3
"""
Simple CPU Simulator implementing fetch-decode-execute cycle.
Supports R-type, I-type, and J-type instructions.
"""

from enum import IntEnum
from dataclasses import dataclass
from typing import List, Optional


class Opcode(IntEnum):
    """Opcodes for our simple instruction set."""
    ADD = 0x0
    SUB = 0x1
    AND = 0x2
    OR = 0x3
    XOR = 0x4
    SLT = 0x5
    SLL = 0x6
    SRL = 0x7
    ADDI = 0x8
    ANDI = 0x9
    ORI = 0xA
    J = 0xB
    BEQZ = 0xC
    LOAD = 0xD
    STORE = 0xE
    HALT = 0xF


@dataclass
class Instruction:
    """Parsed instruction with all fields."""
    opcode: Opcode
    rd: int
    rs1: int
    rs2: int
    immediate: int = 0


class Memory:
    """Simulates instruction and data memory."""

    def __init__(self, size=1024):
        self.size = size
        self.data = [0] * size

    def load_byte(self, address: int) -> int:
        if 0 <= address < self.size:
            return self.data[address] & 0xFF
        raise MemoryError(f"Invalid load address: {address}")

    def store_byte(self, address: int, value: int):
        if 0 <= address < self.size:
            self.data[address] = value & 0xFF
        else:
            raise MemoryError(f"Invalid store address: {address}")

    def load_word(self, address: int) -> int:
        """Load 4 bytes as a little-endian word."""
        if address + 3 < self.size:
            return (self.data[address] |
                    self.data[address + 1] << 8 |
                    self.data[address + 2] << 16 |
                    self.data[address + 3] << 24)
        raise MemoryError(f"Invalid word load address: {address}")

    def store_word(self, address: int, value: int):
        """Store 4 bytes as a little-endian word."""
        if address + 3 < self.size:
            self.data[address] = value & 0xFF
            self.data[address + 1] = (value >> 8) & 0xFF
            self.data[address + 2] = (value >> 16) & 0xFF
            self.data[address + 3] = (value >> 24) & 0xFF
        else:
            raise MemoryError(f"Invalid word store address: {address}")

    def load_instruction(self, address: int) -> int:
        """Load a 32-bit instruction (4 bytes little-endian)."""
        return self.load_word(address)


class RegisterFile:
    """Simulates the register file."""

    def __init__(self, num_registers=8):
        self.num_registers = num_registers
        self.registers = [0] * num_registers
        self.pc = 0  # Program counter
        self.registers[0] = 0  # R0 is hardwired to zero

    def read(self, reg_num: int) -> int:
        if reg_num == 0:
            return 0  # R0 is always zero
        if 0 < reg_num < self.num_registers:
            return self.registers[reg_num]
        raise ValueError(f"Invalid register: {reg_num}")

    def write(self, reg_num: int, value: int):
        if 0 < reg_num < self.num_registers:
            self.registers[reg_num] = value
        # R0 writes are ignored (it's hardwired to zero)

    def dump(self) -> dict:
        return {
            f"R{i}": self.registers[i]
            for i in range(min(8, self.num_registers))
        }


class ALU:
    """Arithmetic Logic Unit for the CPU simulator."""

    @staticmethod
    def execute(opcode: Opcode, a: int, b: int) -> int:
        """Execute ALU operation and return result."""
        ops = {
            Opcode.ADD: lambda x, y: x + y,
            Opcode.SUB: lambda x, y: x - y,
            Opcode.AND: lambda x, y: x & y,
            Opcode.OR: lambda x, y: x | y,
            Opcode.XOR: lambda x, y: x ^ y,
            Opcode.SLT: lambda x, y: 1 if x < y else 0,
            Opcode.SLL: lambda x, y: (x << y) & 0xFFFFFFFF,
            Opcode.SRL: lambda x, y: x >> y,
        }

        if opcode in ops:
            return ops[opcode](a, b)

        # For immediate operations, 'b' contains the immediate value
        # handled in execute_instruction
        raise ValueError(f"Unsupported ALU opcode: {opcode}")


class CPUSimulator:
    """Complete CPU simulator with fetch-decode-execute cycle."""

    def __init__(self, memory_size=1024):
        self.memory = Memory(memory_size)
        self.registers = RegisterFile()
        self.alu = ALU()
        self.halted = False
        self.cycles = 0
        self.trace_enabled = False

    def load_program(self, instructions: List[int], address=0):
        """Load a list of 32-bit instructions into memory."""
        for i, instr in enumerate(instructions):
            addr = address + i * 4
            self.memory.store_word(addr, instr)

    def fetch(self) -> int:
        """Fetch instruction from memory at PC."""
        if self.trace_enabled:
            print(f"[FETCH] PC={self.registers.pc:04x} ", end="")
        instr = self.memory.load_instruction(self.registers.pc)
        if self.trace_enabled:
            print(f"Instruction=0x{instr:08x}")
        return instr

    def decode(self, instruction: int) -> Instruction:
        """Decode a 32-bit instruction."""
        opcode = Opcode(instruction & 0xF)
        rd = (instruction >> 4) & 0x7
        rs1 = (instruction >> 7) & 0x7
        rs2 = (instruction >> 10) & 0x7
        immediate = (instruction >> 10) & 0xFFF  # 12-bit immediate

        # Sign-extend immediate if bit 11 is set
        if immediate & 0x800:
            immediate |= 0xFFFFF000

        return Instruction(opcode=opcode, rd=rd, rs1=rs1, rs2=rs2,
                          immediate=immediate)

    def execute(self, instr: Instruction) -> bool:
        """
        Execute a decoded instruction.
        Returns True if we should advance PC, False for branch/jump.
        """
        op = instr.opcode
        pc_src = True  # True = increment PC, False = use calculated address

        # Handle immediate operands for I-type
        if op in (Opcode.ADDI, Opcode.ANDI, Opcode.ORI):
            operand2 = instr.immediate
        else:
            operand2 = self.registers.read(instr.rs2)

        operand1 = self.registers.read(instr.rs1)

        # ALU operations
        if op in (Opcode.ADD, Opcode.SUB, Opcode.AND, Opcode.OR,
                  Opcode.XOR, Opcode.SLT, Opcode.SLL, Opcode.SRL):
            result = self.alu.execute(op, operand1, operand2)
            self.registers.write(instr.rd, result)

        # Immediate operations
        elif op == Opcode.ADDI:
            result = operand1 + operand2
            self.registers.write(instr.rd, result)

        elif op == Opcode.ANDI:
            result = operand1 & operand2
            self.registers.write(instr.rd, result)

        elif op == Opcode.ORI:
            result = operand1 | operand2
            self.registers.write(instr.rd, result)

        # Jump (unconditional)
        elif op == Opcode.J:
            target = instr.immediate & 0xFFFFFFFC  # Word-aligned target
            self.registers.pc = target - 4  # Will be incremented after
            if self.trace_enabled:
                print(f"[JUMP]  PC={self.registers.pc:04x}")
            pc_src = False

        # Branch if equal zero
        elif op == Opcode.BEQZ:
            if operand1 == 0:
                offset = instr.immediate & 0xFFFFFFFC
                self.registers.pc = self.registers.pc + offset - 4
                if self.trace_enabled:
                    print(f"[BRANCH] Taken to PC={self.registers.pc:04x}")
                pc_src = False

        # Load from memory
        elif op == Opcode.LOAD:
            addr = operand1 + instr.immediate
            value = self.memory.load_word(addr)
            self.registers.write(instr.rd, value)
            if self.trace_enabled:
                print(f"[LOAD]  R{instr.rd}=MEM[{addr:04x}]={value}")

        # Store to memory
        elif op == Opcode.STORE:
            addr = operand1 + instr.immediate
            value = self.registers.read(instr.rd)
            self.memory.store_word(addr, value)
            if self.trace_enabled:
                print(f"[STORE] MEM[{addr:04x}]=R{instr.rd}={value}")

        # Halt
        elif op == Opcode.HALT:
            self.halted = True
            if self.trace_enabled:
                print("[HALT]")
            pc_src = False

        return pc_src

    def step(self) -> bool:
        """
        Execute one complete fetch-decode-execute cycle.
        Returns True if CPU is still running, False if halted.
        """
        if self.halted:
            return False

        self.cycles += 1

        # Fetch
        instruction = self.fetch()

        # Decode
        decoded = self.decode(instruction)

        if self.trace_enabled:
            print(f"[DECODE] op={decoded.opcode.name} rd=R{decoded.rd} "
                  f"rs1=R{decoded.rs1} rs2=R{decoded.rs2} "
                  f"imm={decoded.immediate}")

        # Execute
        pc_update = self.execute(decoded)

        # Update PC
        if pc_update:
            self.registers.pc += 4

        if self.trace_enabled:
            regs = self.registers.dump()
            print(f"[STATE] Cycles={self.cycles} Regs={regs}")

        return not self.halted

    def run(self, max_cycles=10000):
        """Run until HALT or max cycles reached."""
        while not self.halted and self.cycles < max_cycles:
            if not self.step():
                break
        return self.cycles


def assemble(instr_str: str) -> int:
    """Simple assembler: convert instruction string to 32-bit binary."""
    parts = instr_str.replace(',', ' ').split()
    if not parts:
        return 0

    op = parts[0].upper()

    # R-type: OP rd, rs1, rs2
    r_ops = {'ADD', 'SUB', 'AND', 'OR', 'XOR', 'SLT', 'SLL', 'SRL'}
    # I-type: OP rd, rs1, imm
    i_ops = {'ADDI', 'ANDI', 'ORI'}
    # J-type: J target
    # Branch: BEQZ rs1, offset

    if op == 'ADD':
        rd, rs1, rs2 = int(parts[1][1]), int(parts[2][1]), int(parts[3][1])
        return (Opcode.ADD << 0) | (rd << 4) | (rs1 << 7) | (rs2 << 10)

    elif op == 'ADDI':
        rd, rs1, imm = int(parts[1][1]), int(parts[2][1]), int(parts[3])
        return (Opcode.ADDI << 0) | (rd << 4) | (rs1 << 7) | ((imm & 0xFFF) << 10)

    elif op == 'J':
        target = int(parts[1])
        return (Opcode.J << 0) | ((target & 0xFFFFFF) << 7)

    elif op == 'BEQZ':
        rs1, offset = int(parts[1][1]), int(parts[2])
        return (Opcode.BEQZ << 0) | (rs1 << 7) | ((offset & 0xFFF) << 10)

    elif op == 'HALT':
        return (Opcode.HALT << 0)

    return 0

Test Program: Sum 1 to N

def test_sum_to_n():
    """Test program: calculate sum of 1 + 2 + ... + 10 = 55"""

    # Assembly program:
    # R1 = 10        ; counter
    # R2 = 0         ; accumulator
    # R3 = 1         ; constant 1
    # R4 = 0         ; temporary for addition
    # loop:
    #   ADD R4, R2, R1   ; R4 = R2 + R1
    #   ADDI R2, R4, 0   ; R2 = R4
    #   ADDI R1, R1, -1  ; R1--
    #   BEQZ R1, done     ; if R1 == 0, exit
    #   J loop
    # done:
    #   HALT

    program = """
        ADDI R1, R0, 10      ; R1 = 10 (counter)
        ADDI R2, R0, 0       ; R2 = 0 (accumulator)
        ADDI R3, R0, 1      ; R3 = 1 (constant)
    loop:
        ADD  R4, R2, R1     ; R4 = R2 + R1
        ADDI R2, R4, 0      ; R2 = R4
        ADDI R1, R1, -1     ; R1--
        BEQZ R1, done       ; if R1 == 0, exit
        J loop
    done:
        HALT
    """

    # Assemble the program
    instructions = []
    for line in program.strip().split('\n'):
        line = line.split(';')[0].strip()  # Remove comments
        if line:
            instructions.append(assemble(line))

    # Load and run
    cpu = CPUSimulator()
    cpu.load_program(instructions)
    cpu.trace_enabled = True

    cycles = cpu.run()
    print(f"\n[RESULT] Program completed in {cycles} cycles")
    print(f"[RESULT] Sum in R2 = {cpu.registers.read(2)}")
    print(f"[RESULT] Expected: 55")


def test_memory_access():
    """Test memory load/store operations."""

    program = """
        ADDI R1, R0, 0       ; R1 = base address
        ADDI R2, R0, 42      ; R2 = 42
        STORE R2, R1, 0      ; MEM[0] = 42
        ADDI R3, R0, 100     ; R3 = 100
        STORE R3, R1, 4      ; MEM[4] = 100
        LOAD R4, R1, 0       ; R4 = MEM[0]
        LOAD R5, R1, 4       ; R5 = MEM[4]
        ADD  R6, R4, R5      ; R6 = R4 + R5 = 142
        HALT
    """

    instructions = []
    for line in program.strip().split('\n'):
        line = line.split(';')[0].strip()
        if line:
            instructions.append(assemble(line))

    cpu = CPUSimulator()
    cpu.load_program(instructions)
    cpu.trace_enabled = True

    cycles = cpu.run()
    print(f"\n[RESULT] Memory test completed in {cycles} cycles")
    print(f"[RESULT] R6 = {cpu.registers.read(6)} (expected 142)")


if __name__ == "__main__":
    test_sum_to_n()
    print("\n" + "="*50 + "\n")
    test_memory_access()

Production Failure Scenarios

Scenario 1: Infinite Loops from Branch Prediction Errors

Failure: Simulator enters infinite loop due to incorrect branch resolution logic.

Example bug:

# BUG: PC update happens incorrectly
elif op == Opcode.BEQZ:
    if operand1 == 0:
        self.registers.pc = self.registers.pc + offset  # Missing -4!

Mitigation:

Add cycle counter with hard timeout: run(max_cycles=10000)
Implement trace mode that prints every instruction
Verify program correctness against known-good reference implementation

Scenario 2: Memory Access Violations

Failure: Loading or storing beyond memory bounds causes crash or silent corruption.

Example bug:

# BUG: No bounds checking in memory access
def load_word(self, address: int) -> int:
    return (self.data[address] |           # Crashes if address out of bounds
            self.data[address + 1] << 8 |
            self.data[address + 2] << 16 |
            self.data[address + 3] << 24)

Mitigation:

Always validate addresses before access
Implement memory protection with clear error messages
Use Python’s bounds checking with try/except for graceful handling

Scenario 3: Signed/Unsigned Confusion in Comparisons

Failure: Sign bit treated incorrectly causes wrong branch decisions.

# BUG: Treating signed immediate as unsigned
immediate = (instruction >> 10) & 0xFFF  # Always positive!

# If we want -1 (0xFFF), this gives 4095 instead!

Mitigation:

Explicitly implement signed and unsigned operations
Use Python’s unlimited integer precision but simulate 32-bit wraparound
Test with edge cases: -1, 0, 1, 0x7FFFFFFF, 0x80000000

Trade-off Table

Aspect	Functional Simulator	Timing-Accurate	Cycle-Accurate
Speed	Very fast	Medium	Slow
Accuracy	Instruction correctness only	Execution order	Full timing
Complexity	Low	High	Very high
Use Case	Education, ISA design	Performance exploration	Hardware validation
Cache modeling	None	Simplified	Full hierarchy
Power modeling	None	Simplified	Accurate

Observability Checklist

When building and debugging CPU simulators:

Instruction trace: Log every instruction with PC, opcode, and operands
Register state dump: Print all registers after each cycle
Memory access log: Track all loads and stores with addresses and values
Cycle counter: Count executed cycles for performance analysis
Breakpoint support: Halt on specific PC values or conditions
Memory dump: Hex dump of memory regions for debugging
Statistics: Count of each instruction type, branches taken, etc.
Visualization: State diagram or pipeline visualization for education

Common Pitfalls / Anti-patterns

Security Considerations

Memory Safety: Our simulator implements bounds checking but production emulators must also check:
- Alignment requirements (word accesses must be word-aligned)
- Privilege levels (user mode vs kernel mode simulation)
- Valid memory regions (no mapping arbitrary addresses)

Instruction Validation: Reject invalid opcode values before attempting execution:

if opcode > Opcode.HALT:
    raise InstructionError(f"Invalid opcode: {opcode}")

Denial of Service: Infinite loops can hang simulators. Always implement:
- Maximum cycle limits
- Maximum memory access limits
- Instruction count limits per program

Compliance Notes

Before using a simulator for hardware validation or production tooling, know what compliance actually means. A simulator that runs your test programs correctly is not the same as one that passes compliance testing.

ISA Compliance Testing verifies that a simulator produces identical results to the reference architecture for every instruction variant. For RISC-V, the riscv-tests suite covers thousands of variants including all privilege modes and extensions. Intel’s Architecture Test Suite (ARTS) and ARM’s Architecture Compliance Suite (ACS) do the same for their own architectures. These test suites are the gatekeepers—hardware architects and toolchain developers will reject a simulator that fails even one test, no matter how well it runs sample programs.

IEEE 754 Compliance comes into play whenever floating-point operations are involved. The standard defines exact behavior for edge cases: denormalized numbers, NaN propagation, overflow, underflow, and rounding modes. Host CPUs do not always match the target architecture here, so production simulators sometimes implement soft float emulation for bit-exact results, taking a 10-100x speed hit in exchange for correctness.

The anti-patterns below are the usual suspects behind compliance failures. They do not crash or raise exceptions—instead the CPU quietly produces wrong answers. Registers hold corrupted values, memory contains garbage, and program output diverges from reference behavior in ways that are hard to notice without systematic test coverage.

Common Anti-patterns:

PC Update Ordering: Forgetting that PC increments after fetch, not before. The instruction at PC is the one fetched; PC+4 is the next sequential instruction.
R0 Hardwiring: In RISC architectures, R0 (or XZR in ARM64) must always read as zero. Forgetting this breaks many programs that assume ADDI R1, R0, 5 puts 5 in R1.
Immediate Sign Extension: 12-bit immediates must be sign-extended to full register width. Missing sign extension causes ADDI R1, R1, -1 to actually add 4095.
Memory Endianness: Our simulator uses little-endian (least significant byte at lowest address). This must match the host machine for debugging or external tools.
Word Alignment: Instructions must be word-aligned (multiple of 4 bytes). Jump targets must be masked to ensure this.
Register-Register vs Register-Immediate: Using the wrong operand interpretation—immediate instructions use the immediate field, not register R2.
Branch Delay Slots: Some architectures (MIPS) have a delay slot where the instruction after branch executes regardless. Our simulator doesn’t model this, but real ISAs may.

Quick Recap Checklist

The fetch-decode-execute cycle is the fundamental CPU operation: fetch instruction, decode it, execute it, update state
Register files provide fast storage with read/write ports; R0 is typically hardwired to zero in RISC designs
The ALU performs arithmetic and logical operations; everything else is either memory access or control flow
Memory systems in simulators need bounds checking, alignment validation, and error handling
Instruction formats (R-type, I-type, J-type) pack different fields into the same 32-bit word
Branch and jump instructions modify PC directly; all other instructions increment PC by 4
Sign extension converts smaller immediates to full register width while preserving signed value
Infinite loops are a real risk in simulators—always implement cycle or instruction limits
Testing with known programs (sum to N, Fibonacci, memory tests) catches bugs early
The simulator concept extends to full system simulation with devices, interrupts, and exceptions

Interview Questions

1. Explain the fetch-decode-execute cycle and why it must exist.

The fetch-decode-execute cycle is the fundamental operation of any CPU. Fetch reads the next instruction from memory at the address in the Program Counter (PC) register. Decode interprets the instruction bits to determine the opcode and operands—what operation to perform and what data to use. Execute performs the actual computation: ALU operations, memory accesses, or register transfers. Finally, the CPU updates its state (registers, memory, PC) and repeats.

The cycle must exist because instructions are stored in memory as data, and the CPU needs a systematic way to retrieve and process them. There's no way to "just know" what an instruction should do—the encoding must be interpreted. This sequential, disciplined approach ensures deterministic behavior: given the same initial state and program, the CPU always produces the same results.

2. What is the purpose of the Program Counter (PC), and why does it increment by 4 in a 32-bit architecture?

The Program Counter (PC) holds the memory address of the next instruction to execute. After an instruction is fetched, the PC must be updated to point to what comes next. In a 32-bit architecture with byte-addressable memory, each instruction is 4 bytes (32 bits), so the next instruction is at PC + 4.

It's crucial to understand that the PC points to the next instruction before fetch, but after fetch, the CPU has already read the instruction at that address. For sequential execution, we increment PC by 4. For branch or jump instructions, we set PC to a different value—the branch target instead of PC+4.

Note: Some architectures (like MIPS) use a "branch delay slot"—the instruction immediately after a branch executes regardless. Our simple simulator doesn't model this, but real RISC processors do.

3. Why do we need sign extension for immediate values, and how does it work?

Immediate values in instructions are encoded in fewer bits than the full register width. For example, our I-type instructions use 12 bits for immediates, but registers are 32 bits. When we load this 12-bit value into a 32-bit register, we must decide what goes in the upper 20 bits.

Sign extension copies the sign bit (most significant bit of the immediate) into all the upper bits. If the 12-bit immediate is 0x800 (-1 in 12-bit two's complement), we sign-extend it to 0xFFFFF800 (still -1 in 32-bit). If it were 0x001 (positive 1), we extend to 0x00000001.

This preserves the arithmetic meaning: -1 + 5 should still equal 4, not 4097. Unsigned immediates (like bitwise operations) still need sign extension to fill the register correctly—they just happen to work the same way because the upper bits end up as zeros for positive values.

4. What is the difference between R-type, I-type, and J-type instructions in our simulator?

The instruction type determines how the 32-bit encoding is interpreted.

R-type (Register-type) instructions have two source registers and a destination register—all operands are registers. The format is [opcode(4)][rd(3)][rs1(3)][rs2(3)][unused(19)]. Examples: ADD, SUB, AND, OR, XOR, SLT.

I-type (Immediate-type) instructions have one source register, one immediate value (constant), and a destination register. The format is [opcode(4)][rd(3)][rs1(3)][immediate(22)]. The immediate is sign-extended. Examples: ADDI, ANDI, ORI, LOAD.

J-type (Jump-type) instructions provide an absolute target address for jumps or a PC-relative offset for branches. STORE also uses the I-type format with register value to store. Examples: J (jump), BEQZ (branch if equal to zero).

The opcode field determines which type to use and how the CPU should interpret the remaining bits.

5. How would you extend this simulator to support function calls and returns?

Function calls require three components: a way to save the return address, a mechanism to allocate stack space, and a way to restore state on return.

Call instruction: Use a J-type with a link register (like ARM's LR). CALL target would be: save PC+4 to a designated register (R7), then jump to target. The RET instruction would jump to the address in that register.

Stack management: Reserve a register as a stack pointer (SP). ALLOCATE frame subtracts from SP. DEALLOCATE frame adds back to SP. Push/pop operations adjust SP and store/load registers at [SP].

Callee-saved registers: On entry to a function, save any callee-saved registers (R5, R6 in our 8-register design) to the stack. Restore them before return.

A complete calling convention would also define parameter passing (first few parameters in R1-R3), return values (R1), and stack frame layout. This is exactly how real RISC architectures implement function calls.

6. How would you implement an interrupt handling mechanism in the simulator?

Interrupts require several additions to the basic simulator design:

Interrupt controller: A module that receives hardware interrupt signals (timer, I/O, external) and determines if they should be presented to the CPU based on interrupt enable flags.

Interrupt enable/disable: Add status flags that can globally enable/disable interrupts and per-interrupt-type masks.

Vectored interrupts: When an interrupt fires, the CPU must:

Finish the current instruction (atomicity)
Push PC and status registers to stack
Load new PC from interrupt vector table
Switch to kernel/supervisor mode

Return from interrupt: An RTI instruction pops the saved state and resumes execution.

7. What is the difference between Harvard and von Neumann architecture?

In von Neumann architecture, instructions and data share the same memory and bus. The same pathway carries both code and data, creating the "von Neumann bottleneck"—the CPU can either fetch an instruction or access data, but not both simultaneously.

In Harvard architecture, instruction memory and data memory are completely separate. The CPU can fetch an instruction and access data simultaneously on separate buses, doubling effective bandwidth.

Modern CPUs use a modified Harvard architecture with separate L1 instruction and data caches, while presenting a unified view of memory to software. This gets close to Harvard performance while maintaining software compatibility.

DSPs and microcontrollers often use true Harvard architecture for real-time performance in constrained environments.

8. How do you handle exceptions and error conditions in a CPU simulator?

Exceptions require a mechanism to detect the error condition and transfer control to an exception handler:

Exception types:

Undefined instruction: Opcode not recognized
Illegal memory access: Address out of bounds or alignment fault
Arithmetic overflow: Signed/unsigned overflow detection
Divide by zero: Division instruction with zero divisor

Handling mechanism:

if (is_undefined_opcode(instr)):
    raise_exception(EXCEPTION_UNDEFINED_INSTRUCTION, saved_pc)
elif is_illegal_memory_access(address):
    raise_exception(EXCEPTION_MEMORY_FAULT, saved_pc)
elif is_overflow(result):
    raise_exception(EXCEPTION_ARITHMETIC_OVERFLOW, saved_pc)
def raise_exception(type, pc): # Save current state
push_to_stack(pc)
push_to_stack(processor_status) # Load exception vector
new_pc = exception_vector_table[type]
jump_to_handler(new_pc)

9. How would you implement branch prediction in a CPU simulator?

Branch prediction reduces pipeline stalls by guessing the direction of branches before they're resolved. For a simulator, you can model several strategies:

Always-taken (forward not taken): Simplest — predict backward branches are taken, forward branches are not. Useful baseline.

1-bit saturating counter: Each branch has a counter that predicts taken if 1, not taken if 0. On misprediction, flip the counter. Simple but 2 mispredictions on loop exit.

2-bit saturating counter: Requires two mispredictions to change prediction. Better for loops — only mispredicts on loop exit, not entry.

Two-level adaptive (gshare): Uses global branch history register (BHR) to index into a pattern history table (PHT). Captures correlation between branches.

On a misprediction:

Flush the pipeline of instructions after the branch
Update the branch predictor state
Fetch from the correct target address

Simulator implementation: Track predictor state per branch address, and measure misprediction rate to evaluate predictor effectiveness.

10. What is the difference between a functional simulator and a cycle-accurate simulator?

A functional simulator (like the one we built) models the behavior of a CPU without regard to timing. It executes instructions correctly but doesn't model how long each instruction takes, pipeline effects, or resource conflicts. The simulator is fast—millions of instructions per second—but not suitable for performance optimization.

A cycle-accurate simulator models the exact cycle-by-cycle behavior of the hardware. Every pipeline stage, every bus transaction, every cache access is modeled with its timing. This enables accurate performance prediction but at significant complexity and speed cost—complex cycle-accurate models may simulate only thousands of cycles per second.

Comparison:

Functional: Used for ISA design exploration, compiler toolchain development, debugging. Speed >> accuracy for correctness testing.
Cycle-accurate: Used for microarchitecture research, performance optimization, hardware validation. Accuracy >> speed.

Many industry simulators sit between these extremes: "timingSimple" models in gem5, for example, add approximate timing to functional simulation without full cycle accuracy.

11. How would you extend the simulator to support out-of-order execution?

Out-of-order execution requires several additions:

Instruction Window: A buffer that holds instructions that have been decoded but not yet executed. Instructions are dispatched to execution units as their dependencies are satisfied, not in program order.

Register Renaming: To avoid WAR (Write After Read) and WAW (Write After Write) hazards, physical registers replace architectural registers. Each instruction gets a destination physical register; older values are preserved until no longer needed.

Reservation Stations: Each execution unit has a queue of pending instructions. When operands are available and the unit is free, the instruction executes.

Reorder Buffer (ROB): Instructions are tracked in program order through the ROB. When an instruction completes, it's retired in program order—updating architectural state in the correct sequence. Speculative results are held until the branch they depend on is resolved.

The data flow: Fetch → Decode → Rename → Dispatch to reservation station → Execute → Write result → ROB Retire

Complex parts: tracking dependencies, managing the ROB size, handling mispredictions (flush the pipeline), recovering from exceptions.

12. What is the purpose of a scoreboard in CPU simulation and how does it work?

A scoreboard is a data structure used in cycle-accurate simulators to track the state of instructions in flight, manage dependencies, and detect hazards. It was popularized by the CDC 6600 scoreboard.

The scoreboard maintains:

Instruction status table: For each instruction in flight, which stage it is in (Issue, Read Operands, Execute, Write Result)
Functional unit status: Which unit is busy, what instruction it contains, how long until result
Register result status: Which instruction will write to each register, and when
Physical register file: Current values and validity flags

On each cycle, the scoreboard:

Determines which instructions can issue (no structural hazard)
ChecksRAW hazards (instruction reading register that will be written by earlier instruction)
Monitors execution completion and triggers result writes
Manages write-back order to prevent WAW and WAR hazards

The scoreboard stalls issue when a hazard exists, allowing in-order retirement semantics while enabling out-of-order execution.

13. How would you implement a cache hierarchy in the CPU simulator?

Implementing a cache hierarchy involves adding cache structures and modifying memory access:

Cache structure:

class Cache:
    def __init__(self, size, block_size, associativity):
        self.size = size
        self.block_size = block_size
        self.associativity = associativity
        self.lines = [None] * (size // block_size)
        self.hits = 0
        self.misses = 0

Memory access flow:

Check L1 cache for the address
On hit: return data, update LRU (Least Recently Used)
On miss: check L2 cache
On L2 hit: return data, update L2 and L1
On L2 miss: access main memory, fill both caches

Cache line states: Valid, Modified (dirty), Owned, Invalid. This is needed for cache coherence protocols (MESI, MOESI).

Write policies:

Write-back: Write to cache only, write to memory on eviction
Write-through: Write to cache and memory simultaneously
Write-allocate: On write miss, allocate the cache line
No-write-allocate: On write miss, write directly to memory

14. What is the difference between a unified cache and separate I-cache/D-cache?

A unified cache stores both instructions and data in the same cache structure. A separate cache has distinct L1 instruction cache (I-cache) and L1 data cache (D-cache).

Advantages of separate caches:

Harvard architecture: Can fetch instruction and access data simultaneously on different buses
No conflict misses: Instructions and data don't compete for the same cache space
Simpler timing: Different cache parameters for instructions vs data

Advantages of unified cache:

Flexibility: All resources available for either instructions or data as needed
Higher hit rates: No wasted space if one cache is underutilized while the other is full
Simpler design: One cache structure instead of two

Modern CPUs typically use separate L1 caches (I-cache and D-cache) because the fetch and data access are independent critical paths. L2 and L3 are usually unified because by that point the distinction is less important and the cache needs to handle both equally.

15. How does a TLB (Translation Lookaside Buffer) work in a simulator?

A TLB is a cache of virtual-to-physical address translations. In a simulator:

TLB structure:

class TLB:
    def __init__(self, entries=64):
        self.entries = entries  # Typically 64-128
        self.tags = [None] * entries  # Virtual page numbers
        self.translations = [None] * entries  # Physical page numbers
        self.valid = [False] * entries
        self.lru = list(range(entries))  # LRU ordering

Lookup flow:

Extract virtual page number from virtual address
Search TLB for matching tag
On hit: combine physical page number with offset
On miss: walk page tables in simulated memory, add to TLB, evict LRU entry

TLB attributes:

ASID (Address Space ID): Identifies which process's address space (prevents flushing on context switch)
Protection bits: Read/write/execute permissions
Valid bit: Entry is valid

TLB misses are expensive—on a real CPU, they require a page table walk (10-100 cycles). Simulating this accurately adds significant complexity.

16. What is the difference between a microscalar and a superscalar processor?

A scalar processor can complete at most one instruction per cycle. A superscalar processor can complete more than one instruction per cycle by having multiple execution units and dispatching instructions to them in parallel.

A microscalar processor extends scalar by having multiple pipeline lanes but limited parallelism—typically issue width of 2-4 and fewer execution units than a full superscalar.

Superscalar characteristics:

Issue width: Number of instructions that can be dispatched per cycle (2, 4, 8)
Execution units: ALU, load/store, FP units that can operate in parallel
Register files with multiple ports: Needed to support multiple reads/writes per cycle
Out-of-order execution: To keep the execution units busy despite hazards

Modern desktop and server CPUs are superscalar (Intel Haswell: 4-wide, AMD Zen: 6-wide). Simulating a superscalar is significantly more complex than scalar because you must model multiple instructions per cycle and their interactions.

17. How do you simulate branch prediction in a CPU simulator?

Branch prediction simulation models how a CPU predicts branch outcomes to avoid pipeline stalls. The simulator needs to track:

Branch History Table (BHT):


class BranchPredictor:
    def __init__(self, size=512):
        self.size = size
        # 2-bit saturating counters: 00=strong not-taken, 01=weak not-taken
        # 10=weak taken, 11=strong taken
        self.counters = [0] * size

    def predict(self, pc):
        idx = pc % self.size
        return self.counters[idx] >= 2  # Taken if counter >= 2

    def update(self, pc, taken):
        idx = pc % self.size
        if taken and self.counters[idx] < 3:
            self.counters[idx] += 1
        elif not taken and self.counters[idx] > 0:
            self.counters[idx] -= 1

Branch Target Buffer (BTB): Caches the destination address of branches to enable faster fetch on subsequent executions.

Prediction flow in simulator:

When fetching a branch instruction, check BTB for target address
Use BHT to predict taken/not-taken
If mispredicted (actual != predicted), flush pipeline and correct PC
Update BHT with actual outcome

More sophisticated predictors: local history (correlates with recent local branches), global history (correlates with overall branch behavior), tournament predictors (choose between local and global).

18. What is the purpose of a pipeline register and how does it affect simulation speed?

Pipeline registers (also called pipeline latches or flip-flops) are storage elements placed between pipeline stages. They capture the output of one stage at the end of a cycle and hold it as input to the next stage at the beginning of the next cycle.

Pipeline registers enable:

Timing closure: Each stage has one cycle to complete its work
Clocking: All stages synchronize on the same clock edge
Isolation: Stages don't interfere with each other mid-cycle

In a cycle-accurate simulator, pipeline registers are modeled explicitly:


class PipelineRegister:
    def __init__(self, name):
        self.name = name
        self.value = None

    def clock(self):
        # On clock edge, capture input
        pass

    def update(self, new_value):
        # Set value for next cycle
        self.value = new_value

Pipeline registers affect simulation speed because they determine how frequently the simulator must evaluate the pipeline. Deeply pipelined processors (20+ stages) require more frequent updates than simple ones. Cycle-accurate simulators update state every simulated cycle, which is slower than functional simulation but necessary for timing accuracy.

19. How would you add support for exceptions and interrupts to the simulator?

Adding exception and interrupt support requires several additions:

Exception types:

Synchronous: Generated by instruction (undefined opcode, overflow, page fault)
Asynchronous: External events (timer interrupt, I/O completion)

Exception handling flow:

Detect exception condition (invalid opcode, alignment fault, etc.)
Save current PC to EPC (Exception Program Counter) register
Set cause register with exception type
Disable further interrupts
Transfer to exception handler address (from vector table)

Interrupt controller:


class InterruptController:
    def __init__(self):
        self.pending = []
        self.enabled = True

    def raise_interrupt(self, source):
        if self.enabled:
            self.pending.append(source)

    def check_interrupts(self, cpu):
        if self.pending and self.enabled:
            # Highest priority interrupt
            irq = self.pending.pop(0)
            cpu.handle_interrupt(irq)

RTI (Return from Interrupt) instruction restores saved state and re-enables interrupts. The key complexity is preserving enough state to resume execution after handling.

20. What is the difference between soft float and hard float emulation in a simulator?

Hard float (hardware floating-point): The simulator leverages the host CPU's floating-point instructions. This is fast but may have subtle differences in rounding, exception handling, and NaN propagation compared to the target architecture.

Soft float: The simulator implements floating-point operations entirely in software. Each FP operation becomes a function call that emulates the operation using integer arithmetic. This is slow (100x slower than hard float) but bit-exact with the target architecture.

Soft float implementation:


def soft_float_add(a_bits, b_bits):
    # Extract sign, exponent, mantissa
    sign_a = (a_bits >> 31) & 1
    exp_a = (a_bits >> 23) & 0xFF
    mantissa_a = a_bits & 0x7FFFFF

    # Perform addition in software
    # Handle special cases (NaN, infinity, denormals)
    # Rounding according to IEEE 754 rules
    return result_bits

When to use each:

Soft float: When bit-exact behavior is required (testing, certification, debugging)
Hard float: When speed is critical and minor behavioral differences are acceptable

gem5 simulator supports both, selectable via configuration. The RISC-V GNU toolchain supports both soft and hard float ABIs.

Conclusion

Building a CPU simulator gives you concrete insight into how hardware actually works. The fetch-decode-execute cycle, register files, ALU operations, and memory access—all the abstract concepts from computer architecture become tangible when you implement them yourself.

The simulator we built demonstrates the essential concepts without getting bogged down in timing accuracy or cache modeling. It handles real instructions, executes branches correctly, and can run actual programs. Extensions like function calls, interrupts, and exception handling would turn this into a full system simulator.

Take the next step by exploring instruction set architecture to understand what instructions real processors provide, or assembly language basics to see how those instructions are actually written.

Introduction

When to Use / When Not to Use

CPU Architecture

Core Concepts

The Fetch-Decode-Execute Cycle

Register File Design

ALU Operations

Instruction Format

Instruction Format Design

Implementation Snippets

Complete CPU Simulator

Test Program: Sum 1 to N

Production Failure Scenarios

Scenario 1: Infinite Loops from Branch Prediction Errors

Scenario 2: Memory Access Violations

Scenario 3: Signed/Unsigned Confusion in Comparisons

Trade-off Table

Observability Checklist

Common Pitfalls / Anti-patterns

Security Considerations

Compliance Notes

Quick Recap Checklist

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

ASLR & Stack Protection

Assembly Language Basics: Writing Code the CPU Understands

Boolean Logic & Gates