Building a Simple CPU Simulator: Implementing the Fetch-Decode-Execute Cycle
Implement a functional CPU simulator in Python that demonstrates the fetch-decode-execute cycle, register file, ALU operations, and memory access.
Introduction
Building a CPU simulator is one of the most educational projects in computer science. By implementing one, you see how instructions actually flow through a processor, how the fetch-decode-execute cycle works, and why real architectures make the design decisions they do. Writing a simulator also develops intuition for debugging systems-level code—state management, error handling, trace output—since these same challenges appear in kernel debugging and emulator development.
The simulator we’ll build implements a simple RISC-like instruction set with a register file, ALU, and memory subsystem. It shows the fundamental concepts used in actual CPU design while staying small enough to understand completely.
When to Use / When Not to Use
When to Use a CPU Simulator:
- Teaching and learning computer architecture fundamentals
- Testing instruction set changes before hardware implementation
- Writing and debugging code for a new or custom ISA
- Researching security vulnerabilities in a controlled environment
- Creating test harnesses for compiler backends
When Not to Use:
- Production performance requirements—interpretive simulation is slow
- Whencycle-accurate timing matters—our simulator is functional, not timing-accurate
- As a replacement for actual hardware testing before deployment
- When you need memory hierarchy simulation (caches, TLBs) in detail
CPU Architecture
flowchart TB
subgraph "Fetch Unit"
A["Program Counter<br/>PC"] --> B["Instruction<br/>Memory"]
B --> C["Instruction<br/>Register"]
end
subgraph "Decode Unit"
C --> D["Control Unit"]
D --> E["Register File<br/>Read Ports"]
D --> F["Immediate<br/>Generator"]
end
subgraph "Execute Unit"
E --> G["ALU"]
F --> G
E --> H["Branch<br/>Comparator"]
end
subgraph "Memory Unit"
G --> I["Data<br/>Memory"]
I --> J["Write Back<br/>Mux"]
end
subgraph "Write Back"
J --> K["Register File<br/>Write Port"]
K --> E
G --> K
end
subgraph "PC Update"
H --> L["PC Src<br/>Mux"]
L --> A
G --> L
end
Core Concepts
The Fetch-Decode-Execute Cycle
The fundamental operation of any CPU is the fetch-decode-execute cycle, repeated billions of times per second:
- Fetch: Read the next instruction from memory at the address in the Program Counter (PC)
- Decode: Examine the instruction bits to determine what operation to perform
- Execute: Perform the computation—ALU operation, memory access, or register transfer
- Write Back: Store the result back to registers or memory
- Update PC: Increment PC (or set to branch target if taken)
Our simulator implements this cycle in software, using Python data structures to model the hardware components.
Register File Design
The register file is a small, fast memory within the CPU that holds working values. Our simulator uses a simple array-based implementation:
class RegisterFile:
"""Simulates the register file with read and write ports."""
def __init__(self, num_registers=8):
self.num_registers = num_registers
self.registers = [0] * num_registers
self.pc = 0 # Program counter is separate
def read(self, reg_num):
"""Read a register value."""
if 0 <= reg_num < self.num_registers:
return self.registers[reg_num]
raise ValueError(f"Invalid register number: {reg_num}")
def write(self, reg_num, value):
"""Write a value to a register."""
if 0 <= reg_num < self.num_registers:
self.registers[reg_num] = value
else:
raise ValueError(f"Invalid register number: {reg_num}")
def dump(self):
"""Return current register state for debugging."""
return {
f"R{i}": self.registers[i]
for i in range(self.num_registers)
}
ALU Operations
The Arithmetic Logic Unit (ALU) performs computational operations. Our simple simulator supports:
| Operation | Opcode | Description |
|---|---|---|
| ADD | 0x0 | R[rd] = R[rs1] + R[rs2] |
| SUB | 0x1 | R[rd] = R[rs1] - R[rs2] |
| AND | 0x2 | R[rd] = R[rs1] & R[rs2] |
| OR | 0x3 | R[rd] = R[rs1] | R[rs2] |
| XOR | 0x4 | R[rd] = R[rs1] ^ R[rs2] |
| SLT | 0x5 | R[rd] = R[rs1] < R[rs2] (signed) |
| SLL | 0x6 | R[rd] = R[rs1] << R[rs2] |
| SRL | 0x7 | R[rd] = R[rs1] >> R[rs2] (logical) |
| ADDI | 0x8 | R[rd] = R[rs1] + immediate |
| ANDI | 0x9 | R[rd] = R[rs1] & immediate |
| ORI | 0xA | R[rd] = R[rs1] | immediate |
Instruction Format
Our simulator uses a simple 32-bit R-type/I-type instruction format:
flowchart LR
subgraph "R-Type [opcode 0-7]"
A["rd<br/>[5 bits]"] --> B["rs1<br/>[5 bits]"] --> C["rs2<br/>[5 bits]"] --> D["opcode<br/>[7 bits]"]
end
subgraph "I-Type [opcode 8-A]"
E["rd<br/>[5 bits]"] --> F["rs1<br/>[5 bits]"] --> G["immediate<br/>[12 bits]"] --> H["opcode<br/>[7 bits]"]
end
subgraph "J-Type [opcode B: jump]"
I["rd<br/>[5 bits]"] --> J["target<br/>[20 bits]"] --> K["opcode<br/>[7 bits]"]
end
Implementation Snippets
Complete CPU Simulator
#!/usr/bin/env python3
"""
Simple CPU Simulator implementing fetch-decode-execute cycle.
Supports R-type, I-type, and J-type instructions.
"""
from enum import IntEnum
from dataclasses import dataclass
from typing import List, Optional
class Opcode(IntEnum):
"""Opcodes for our simple instruction set."""
ADD = 0x0
SUB = 0x1
AND = 0x2
OR = 0x3
XOR = 0x4
SLT = 0x5
SLL = 0x6
SRL = 0x7
ADDI = 0x8
ANDI = 0x9
ORI = 0xA
J = 0xB
BEQZ = 0xC
LOAD = 0xD
STORE = 0xE
HALT = 0xF
@dataclass
class Instruction:
"""Parsed instruction with all fields."""
opcode: Opcode
rd: int
rs1: int
rs2: int
immediate: int = 0
class Memory:
"""Simulates instruction and data memory."""
def __init__(self, size=1024):
self.size = size
self.data = [0] * size
def load_byte(self, address: int) -> int:
if 0 <= address < self.size:
return self.data[address] & 0xFF
raise MemoryError(f"Invalid load address: {address}")
def store_byte(self, address: int, value: int):
if 0 <= address < self.size:
self.data[address] = value & 0xFF
else:
raise MemoryError(f"Invalid store address: {address}")
def load_word(self, address: int) -> int:
"""Load 4 bytes as a little-endian word."""
if address + 3 < self.size:
return (self.data[address] |
self.data[address + 1] << 8 |
self.data[address + 2] << 16 |
self.data[address + 3] << 24)
raise MemoryError(f"Invalid word load address: {address}")
def store_word(self, address: int, value: int):
"""Store 4 bytes as a little-endian word."""
if address + 3 < self.size:
self.data[address] = value & 0xFF
self.data[address + 1] = (value >> 8) & 0xFF
self.data[address + 2] = (value >> 16) & 0xFF
self.data[address + 3] = (value >> 24) & 0xFF
else:
raise MemoryError(f"Invalid word store address: {address}")
def load_instruction(self, address: int) -> int:
"""Load a 32-bit instruction (4 bytes little-endian)."""
return self.load_word(address)
class RegisterFile:
"""Simulates the register file."""
def __init__(self, num_registers=8):
self.num_registers = num_registers
self.registers = [0] * num_registers
self.pc = 0 # Program counter
self.registers[0] = 0 # R0 is hardwired to zero
def read(self, reg_num: int) -> int:
if reg_num == 0:
return 0 # R0 is always zero
if 0 < reg_num < self.num_registers:
return self.registers[reg_num]
raise ValueError(f"Invalid register: {reg_num}")
def write(self, reg_num: int, value: int):
if 0 < reg_num < self.num_registers:
self.registers[reg_num] = value
# R0 writes are ignored (it's hardwired to zero)
def dump(self) -> dict:
return {
f"R{i}": self.registers[i]
for i in range(min(8, self.num_registers))
}
class ALU:
"""Arithmetic Logic Unit for the CPU simulator."""
@staticmethod
def execute(opcode: Opcode, a: int, b: int) -> int:
"""Execute ALU operation and return result."""
ops = {
Opcode.ADD: lambda x, y: x + y,
Opcode.SUB: lambda x, y: x - y,
Opcode.AND: lambda x, y: x & y,
Opcode.OR: lambda x, y: x | y,
Opcode.XOR: lambda x, y: x ^ y,
Opcode.SLT: lambda x, y: 1 if x < y else 0,
Opcode.SLL: lambda x, y: (x << y) & 0xFFFFFFFF,
Opcode.SRL: lambda x, y: x >> y,
}
if opcode in ops:
return ops[opcode](a, b)
# For immediate operations, 'b' contains the immediate value
# handled in execute_instruction
raise ValueError(f"Unsupported ALU opcode: {opcode}")
class CPUSimulator:
"""Complete CPU simulator with fetch-decode-execute cycle."""
def __init__(self, memory_size=1024):
self.memory = Memory(memory_size)
self.registers = RegisterFile()
self.alu = ALU()
self.halted = False
self.cycles = 0
self.trace_enabled = False
def load_program(self, instructions: List[int], address=0):
"""Load a list of 32-bit instructions into memory."""
for i, instr in enumerate(instructions):
addr = address + i * 4
self.memory.store_word(addr, instr)
def fetch(self) -> int:
"""Fetch instruction from memory at PC."""
if self.trace_enabled:
print(f"[FETCH] PC={self.registers.pc:04x} ", end="")
instr = self.memory.load_instruction(self.registers.pc)
if self.trace_enabled:
print(f"Instruction=0x{instr:08x}")
return instr
def decode(self, instruction: int) -> Instruction:
"""Decode a 32-bit instruction."""
opcode = Opcode(instruction & 0xF)
rd = (instruction >> 4) & 0x7
rs1 = (instruction >> 7) & 0x7
rs2 = (instruction >> 10) & 0x7
immediate = (instruction >> 10) & 0xFFF # 12-bit immediate
# Sign-extend immediate if bit 11 is set
if immediate & 0x800:
immediate |= 0xFFFFF000
return Instruction(opcode=opcode, rd=rd, rs1=rs1, rs2=rs2,
immediate=immediate)
def execute(self, instr: Instruction) -> bool:
"""
Execute a decoded instruction.
Returns True if we should advance PC, False for branch/jump.
"""
op = instr.opcode
pc_src = True # True = increment PC, False = use calculated address
# Handle immediate operands for I-type
if op in (Opcode.ADDI, Opcode.ANDI, Opcode.ORI):
operand2 = instr.immediate
else:
operand2 = self.registers.read(instr.rs2)
operand1 = self.registers.read(instr.rs1)
# ALU operations
if op in (Opcode.ADD, Opcode.SUB, Opcode.AND, Opcode.OR,
Opcode.XOR, Opcode.SLT, Opcode.SLL, Opcode.SRL):
result = self.alu.execute(op, operand1, operand2)
self.registers.write(instr.rd, result)
# Immediate operations
elif op == Opcode.ADDI:
result = operand1 + operand2
self.registers.write(instr.rd, result)
elif op == Opcode.ANDI:
result = operand1 & operand2
self.registers.write(instr.rd, result)
elif op == Opcode.ORI:
result = operand1 | operand2
self.registers.write(instr.rd, result)
# Jump (unconditional)
elif op == Opcode.J:
target = instr.immediate & 0xFFFFFFFC # Word-aligned target
self.registers.pc = target - 4 # Will be incremented after
if self.trace_enabled:
print(f"[JUMP] PC={self.registers.pc:04x}")
pc_src = False
# Branch if equal zero
elif op == Opcode.BEQZ:
if operand1 == 0:
offset = instr.immediate & 0xFFFFFFFC
self.registers.pc = self.registers.pc + offset - 4
if self.trace_enabled:
print(f"[BRANCH] Taken to PC={self.registers.pc:04x}")
pc_src = False
# Load from memory
elif op == Opcode.LOAD:
addr = operand1 + instr.immediate
value = self.memory.load_word(addr)
self.registers.write(instr.rd, value)
if self.trace_enabled:
print(f"[LOAD] R{instr.rd}=MEM[{addr:04x}]={value}")
# Store to memory
elif op == Opcode.STORE:
addr = operand1 + instr.immediate
value = self.registers.read(instr.rd)
self.memory.store_word(addr, value)
if self.trace_enabled:
print(f"[STORE] MEM[{addr:04x}]=R{instr.rd}={value}")
# Halt
elif op == Opcode.HALT:
self.halted = True
if self.trace_enabled:
print("[HALT]")
pc_src = False
return pc_src
def step(self) -> bool:
"""
Execute one complete fetch-decode-execute cycle.
Returns True if CPU is still running, False if halted.
"""
if self.halted:
return False
self.cycles += 1
# Fetch
instruction = self.fetch()
# Decode
decoded = self.decode(instruction)
if self.trace_enabled:
print(f"[DECODE] op={decoded.opcode.name} rd=R{decoded.rd} "
f"rs1=R{decoded.rs1} rs2=R{decoded.rs2} "
f"imm={decoded.immediate}")
# Execute
pc_update = self.execute(decoded)
# Update PC
if pc_update:
self.registers.pc += 4
if self.trace_enabled:
regs = self.registers.dump()
print(f"[STATE] Cycles={self.cycles} Regs={regs}")
return not self.halted
def run(self, max_cycles=10000):
"""Run until HALT or max cycles reached."""
while not self.halted and self.cycles < max_cycles:
if not self.step():
break
return self.cycles
def assemble(instr_str: str) -> int:
"""Simple assembler: convert instruction string to 32-bit binary."""
parts = instr_str.replace(',', ' ').split()
if not parts:
return 0
op = parts[0].upper()
# R-type: OP rd, rs1, rs2
r_ops = {'ADD', 'SUB', 'AND', 'OR', 'XOR', 'SLT', 'SLL', 'SRL'}
# I-type: OP rd, rs1, imm
i_ops = {'ADDI', 'ANDI', 'ORI'}
# J-type: J target
# Branch: BEQZ rs1, offset
if op == 'ADD':
rd, rs1, rs2 = int(parts[1][1]), int(parts[2][1]), int(parts[3][1])
return (Opcode.ADD << 0) | (rd << 4) | (rs1 << 7) | (rs2 << 10)
elif op == 'ADDI':
rd, rs1, imm = int(parts[1][1]), int(parts[2][1]), int(parts[3])
return (Opcode.ADDI << 0) | (rd << 4) | (rs1 << 7) | ((imm & 0xFFF) << 10)
elif op == 'J':
target = int(parts[1])
return (Opcode.J << 0) | ((target & 0xFFFFFF) << 7)
elif op == 'BEQZ':
rs1, offset = int(parts[1][1]), int(parts[2])
return (Opcode.BEQZ << 0) | (rs1 << 7) | ((offset & 0xFFF) << 10)
elif op == 'HALT':
return (Opcode.HALT << 0)
return 0
Test Program: Sum 1 to N
def test_sum_to_n():
"""Test program: calculate sum of 1 + 2 + ... + 10 = 55"""
# Assembly program:
# R1 = 10 ; counter
# R2 = 0 ; accumulator
# R3 = 1 ; constant 1
# R4 = 0 ; temporary for addition
# loop:
# ADD R4, R2, R1 ; R4 = R2 + R1
# ADDI R2, R4, 0 ; R2 = R4
# ADDI R1, R1, -1 ; R1--
# BEQZ R1, done ; if R1 == 0, exit
# J loop
# done:
# HALT
program = """
ADDI R1, R0, 10 ; R1 = 10 (counter)
ADDI R2, R0, 0 ; R2 = 0 (accumulator)
ADDI R3, R0, 1 ; R3 = 1 (constant)
loop:
ADD R4, R2, R1 ; R4 = R2 + R1
ADDI R2, R4, 0 ; R2 = R4
ADDI R1, R1, -1 ; R1--
BEQZ R1, done ; if R1 == 0, exit
J loop
done:
HALT
"""
# Assemble the program
instructions = []
for line in program.strip().split('\n'):
line = line.split(';')[0].strip() # Remove comments
if line:
instructions.append(assemble(line))
# Load and run
cpu = CPUSimulator()
cpu.load_program(instructions)
cpu.trace_enabled = True
cycles = cpu.run()
print(f"\n[RESULT] Program completed in {cycles} cycles")
print(f"[RESULT] Sum in R2 = {cpu.registers.read(2)}")
print(f"[RESULT] Expected: 55")
def test_memory_access():
"""Test memory load/store operations."""
program = """
ADDI R1, R0, 0 ; R1 = base address
ADDI R2, R0, 42 ; R2 = 42
STORE R2, R1, 0 ; MEM[0] = 42
ADDI R3, R0, 100 ; R3 = 100
STORE R3, R1, 4 ; MEM[4] = 100
LOAD R4, R1, 0 ; R4 = MEM[0]
LOAD R5, R1, 4 ; R5 = MEM[4]
ADD R6, R4, R5 ; R6 = R4 + R5 = 142
HALT
"""
instructions = []
for line in program.strip().split('\n'):
line = line.split(';')[0].strip()
if line:
instructions.append(assemble(line))
cpu = CPUSimulator()
cpu.load_program(instructions)
cpu.trace_enabled = True
cycles = cpu.run()
print(f"\n[RESULT] Memory test completed in {cycles} cycles")
print(f"[RESULT] R6 = {cpu.registers.read(6)} (expected 142)")
if __name__ == "__main__":
test_sum_to_n()
print("\n" + "="*50 + "\n")
test_memory_access()
Production Failure Scenarios
Scenario 1: Infinite Loops from Branch Prediction Errors
Failure: Simulator enters infinite loop due to incorrect branch resolution logic.
Example bug:
# BUG: PC update happens incorrectly
elif op == Opcode.BEQZ:
if operand1 == 0:
self.registers.pc = self.registers.pc + offset # Missing -4!
Mitigation:
- Add cycle counter with hard timeout:
run(max_cycles=10000) - Implement trace mode that prints every instruction
- Verify program correctness against known-good reference implementation
Scenario 2: Memory Access Violations
Failure: Loading or storing beyond memory bounds causes crash or silent corruption.
Example bug:
# BUG: No bounds checking in memory access
def load_word(self, address: int) -> int:
return (self.data[address] | # Crashes if address out of bounds
self.data[address + 1] << 8 |
self.data[address + 2] << 16 |
self.data[address + 3] << 24)
Mitigation:
- Always validate addresses before access
- Implement memory protection with clear error messages
- Use Python’s bounds checking with try/except for graceful handling
Scenario 3: Signed/Unsigned Confusion in Comparisons
Failure: Sign bit treated incorrectly causes wrong branch decisions.
# BUG: Treating signed immediate as unsigned
immediate = (instruction >> 10) & 0xFFF # Always positive!
# If we want -1 (0xFFF), this gives 4095 instead!
Mitigation:
- Explicitly implement signed and unsigned operations
- Use Python’s unlimited integer precision but simulate 32-bit wraparound
- Test with edge cases: -1, 0, 1, 0x7FFFFFFF, 0x80000000
Trade-off Table
| Aspect | Functional Simulator | Timing-Accurate | Cycle-Accurate |
|---|---|---|---|
| Speed | Very fast | Medium | Slow |
| Accuracy | Instruction correctness only | Execution order | Full timing |
| Complexity | Low | High | Very high |
| Use Case | Education, ISA design | Performance exploration | Hardware validation |
| Cache modeling | None | Simplified | Full hierarchy |
| Power modeling | None | Simplified | Accurate |
Observability Checklist
When building and debugging CPU simulators:
- Instruction trace: Log every instruction with PC, opcode, and operands
- Register state dump: Print all registers after each cycle
- Memory access log: Track all loads and stores with addresses and values
- Cycle counter: Count executed cycles for performance analysis
- Breakpoint support: Halt on specific PC values or conditions
- Memory dump: Hex dump of memory regions for debugging
- Statistics: Count of each instruction type, branches taken, etc.
- Visualization: State diagram or pipeline visualization for education
Common Pitfalls / Anti-Patterns
Security Considerations
-
Memory Safety: Our simulator implements bounds checking but production emulators must also check:
- Alignment requirements (word accesses must be word-aligned)
- Privilege levels (user mode vs kernel mode simulation)
- Valid memory regions (no mapping arbitrary addresses)
-
Instruction Validation: Reject invalid opcode values before attempting execution:
if opcode > Opcode.HALT: raise InstructionError(f"Invalid opcode: {opcode}") -
Denial of Service: Infinite loops can hang simulators. Always implement:
- Maximum cycle limits
- Maximum memory access limits
- Instruction count limits per program
Compliance Notes
-
ISA Compliance Testing: Simulators used for hardware validation must pass:
- riscv-tests for RISC-V
- Intel Architecture Test Suite (ARTS)
- ARM Architecture Compliance Suite
-
Floating-Point: Production simulators must implement IEEE 754 compliance for FP operations
Common Pitfalls / Anti-patterns
-
PC Update Ordering: Forgetting that PC increments after fetch, not before. The instruction at PC is the one fetched; PC+4 is the next sequential instruction.
-
R0 Hardwiring: In RISC architectures, R0 (or XZR in ARM64) must always read as zero. Forgetting this breaks many programs that assume
ADDI R1, R0, 5puts 5 in R1. -
Immediate Sign Extension: 12-bit immediates must be sign-extended to full register width. Missing sign extension causes
ADDI R1, R1, -1to actually add 4095. -
Memory Endianness: Our simulator uses little-endian (least significant byte at lowest address). This must match the host machine for debugging or external tools.
-
Word Alignment: Instructions must be word-aligned (multiple of 4 bytes). Jump targets must be masked to ensure this.
-
Register-Register vs Register-Immediate: Using the wrong operand interpretation—immediate instructions use the immediate field, not register R2.
-
Branch Delay Slots: Some architectures (MIPS) have a delay slot where the instruction after branch executes regardless. Our simulator doesn’t model this, but real ISAs may.
Quick Recap Checklist
- The fetch-decode-execute cycle is the fundamental CPU operation: fetch instruction, decode it, execute it, update state
- Register files provide fast storage with read/write ports; R0 is typically hardwired to zero in RISC designs
- The ALU performs arithmetic and logical operations; everything else is either memory access or control flow
- Memory systems in simulators need bounds checking, alignment validation, and error handling
- Instruction formats (R-type, I-type, J-type) pack different fields into the same 32-bit word
- Branch and jump instructions modify PC directly; all other instructions increment PC by 4
- Sign extension converts smaller immediates to full register width while preserving signed value
- Infinite loops are a real risk in simulators—always implement cycle or instruction limits
- Testing with known programs (sum to N, Fibonacci, memory tests) catches bugs early
- The simulator concept extends to full system simulation with devices, interrupts, and exceptions
Interview Questions
The fetch-decode-execute cycle is the fundamental operation of any CPU. Fetch reads the next instruction from memory at the address in the Program Counter (PC) register. Decode interprets the instruction bits to determine the opcode and operands—what operation to perform and what data to use. Execute performs the actual computation: ALU operations, memory accesses, or register transfers. Finally, the CPU updates its state (registers, memory, PC) and repeats.
The cycle must exist because instructions are stored in memory as data, and the CPU needs a systematic way to retrieve and process them. There's no way to "just know" what an instruction should do—the encoding must be interpreted. This sequential, disciplined approach ensures deterministic behavior: given the same initial state and program, the CPU always produces the same results.
The Program Counter (PC) holds the memory address of the next instruction to execute. After an instruction is fetched, the PC must be updated to point to what comes next. In a 32-bit architecture with byte-addressable memory, each instruction is 4 bytes (32 bits), so the next instruction is at PC + 4.
It's crucial to understand that the PC points to the next instruction before fetch, but after fetch, the CPU has already read the instruction at that address. For sequential execution, we increment PC by 4. For branch or jump instructions, we set PC to a different value—the branch target instead of PC+4.
Note: Some architectures (like MIPS) use a "branch delay slot"—the instruction immediately after a branch executes regardless. Our simple simulator doesn't model this, but real RISC processors do.
Immediate values in instructions are encoded in fewer bits than the full register width. For example, our I-type instructions use 12 bits for immediates, but registers are 32 bits. When we load this 12-bit value into a 32-bit register, we must decide what goes in the upper 20 bits.
Sign extension copies the sign bit (most significant bit of the immediate) into all the upper bits. If the 12-bit immediate is 0x800 (-1 in 12-bit two's complement), we sign-extend it to 0xFFFFF800 (still -1 in 32-bit). If it were 0x001 (positive 1), we extend to 0x00000001.
This preserves the arithmetic meaning: -1 + 5 should still equal 4, not 4097. Unsigned immediates (like bitwise operations) still need sign extension to fill the register correctly—they just happen to work the same way because the upper bits end up as zeros for positive values.
The instruction type determines how the 32-bit encoding is interpreted.
R-type (Register-type) instructions have two source registers and a destination register—all operands are registers. The format is [opcode(4)][rd(3)][rs1(3)][rs2(3)][unused(19)]. Examples: ADD, SUB, AND, OR, XOR, SLT.
I-type (Immediate-type) instructions have one source register, one immediate value (constant), and a destination register. The format is [opcode(4)][rd(3)][rs1(3)][immediate(22)]. The immediate is sign-extended. Examples: ADDI, ANDI, ORI, LOAD.
J-type (Jump-type) instructions provide an absolute target address for jumps or a PC-relative offset for branches. STORE also uses the I-type format with register value to store. Examples: J (jump), BEQZ (branch if equal to zero).
The opcode field determines which type to use and how the CPU should interpret the remaining bits.
Function calls require three components: a way to save the return address, a mechanism to allocate stack space, and a way to restore state on return.
Call instruction: Use a J-type with a link register (like ARM's LR). CALL target would be: save PC+4 to a designated register (R7), then jump to target. The RET instruction would jump to the address in that register.
Stack management: Reserve a register as a stack pointer (SP). ALLOCATE frame subtracts from SP. DEALLOCATE frame adds back to SP. Push/pop operations adjust SP and store/load registers at [SP].
Callee-saved registers: On entry to a function, save any callee-saved registers (R5, R6 in our 8-register design) to the stack. Restore them before return.
A complete calling convention would also define parameter passing (first few parameters in R1-R3), return values (R1), and stack frame layout. This is exactly how real RISC architectures implement function calls.
Interrupts require several additions to the basic simulator design:
Interrupt controller: A module that receives hardware interrupt signals (timer, I/O, external) and determines if they should be presented to the CPU based on interrupt enable flags.
Interrupt enable/disable: Add status flags that can globally enable/disable interrupts and per-interrupt-type masks.
Vectored interrupts: When an interrupt fires, the CPU must:
- Finish the current instruction (atomicity)
- Push PC and status registers to stack
- Load new PC from interrupt vector table
- Switch to kernel/supervisor mode
Return from interrupt: An RTI instruction pops the saved state and resumes execution.
In von Neumann architecture, instructions and data share the same memory and bus. The same pathway carries both code and data, creating the "von Neumann bottleneck"—the CPU can either fetch an instruction or access data, but not both simultaneously.
In Harvard architecture, instruction memory and data memory are completely separate. The CPU can fetch an instruction and access data simultaneously on separate buses, doubling effective bandwidth.
Modern CPUs use a modified Harvard architecture with separate L1 instruction and data caches, while presenting a unified view of memory to software. This gets close to Harvard performance while maintaining software compatibility.
DSPs and microcontrollers often use true Harvard architecture for real-time performance in constrained environments.
Exceptions require a mechanism to detect the error condition and transfer control to an exception handler:
Exception types:
- Undefined instruction: Opcode not recognized
- Illegal memory access: Address out of bounds or alignment fault
- Arithmetic overflow: Signed/unsigned overflow detection
- Divide by zero: Division instruction with zero divisor
Handling mechanism:
if (is_undefined_opcode(instr)): raise_exception(EXCEPTION_UNDEFINED_INSTRUCTION, saved_pc) elif is_illegal_memory_access(address): raise_exception(EXCEPTION_MEMORY_FAULT, saved_pc) elif is_overflow(result): raise_exception(EXCEPTION_ARITHMETIC_OVERFLOW, saved_pc)
def raise_exception(type, pc): # Save current state push_to_stack(pc) push_to_stack(processor_status) # Load exception vector new_pc = exception_vector_table[type] jump_to_handler(new_pc)
Branch prediction reduces pipeline stalls by guessing the direction of branches before they're resolved. For a simulator, you can model several strategies:
Always-taken (forward not taken): Simplest — predict backward branches are taken, forward branches are not. Useful baseline.
1-bit saturating counter: Each branch has a counter that predicts taken if 1, not taken if 0. On misprediction, flip the counter. Simple but 2 mispredictions on loop exit.
2-bit saturating counter: Requires two mispredictions to change prediction. Better for loops — only mispredicts on loop exit, not entry.
Two-level adaptive (gshare): Uses global branch history register (BHR) to index into a pattern history table (PHT). Captures correlation between branches.
On a misprediction:
- Flush the pipeline of instructions after the branch
- Update the branch predictor state
- Fetch from the correct target address
Simulator implementation: Track predictor state per branch address, and measure misprediction rate to evaluate predictor effectiveness.
A functional simulator (like the one we built) models the behavior of a CPU without regard to timing. It executes instructions correctly but doesn't model how long each instruction takes, pipeline effects, or resource conflicts. The simulator is fast—millions of instructions per second—but not suitable for performance optimization.
A cycle-accurate simulator models the exact cycle-by-cycle behavior of the hardware. Every pipeline stage, every bus transaction, every cache access is modeled with its timing. This enables accurate performance prediction but at significant complexity and speed cost—complex cycle-accurate models may simulate only thousands of cycles per second.
Comparison:
- Functional: Used for ISA design exploration, compiler toolchain development, debugging. Speed >> accuracy for correctness testing.
- Cycle-accurate: Used for microarchitecture research, performance optimization, hardware validation. Accuracy >> speed.
Many industry simulators sit between these extremes: "timingSimple" models in gem5, for example, add approximate timing to functional simulation without full cycle accuracy.
Out-of-order execution requires several additions:
Instruction Window: A buffer that holds instructions that have been decoded but not yet executed. Instructions are dispatched to execution units as their dependencies are satisfied, not in program order.
Register Renaming: To avoid WAR (Write After Read) and WAW (Write After Write) hazards, physical registers replace architectural registers. Each instruction gets a destination physical register; older values are preserved until no longer needed.
Reservation Stations: Each execution unit has a queue of pending instructions. When operands are available and the unit is free, the instruction executes.
Reorder Buffer (ROB): Instructions are tracked in program order through the ROB. When an instruction completes, it's retired in program order—updating architectural state in the correct sequence. Speculative results are held until the branch they depend on is resolved.
The data flow: Fetch → Decode → Rename → Dispatch to reservation station → Execute → Write result → ROB Retire
Complex parts: tracking dependencies, managing the ROB size, handling mispredictions (flush the pipeline), recovering from exceptions.
A scoreboard is a data structure used in cycle-accurate simulators to track the state of instructions in flight, manage dependencies, and detect hazards. It was popularized by the CDC 6600 scoreboard.
The scoreboard maintains:
- Instruction status table: For each instruction in flight, which stage it is in (Issue, Read Operands, Execute, Write Result)
- Functional unit status: Which unit is busy, what instruction it contains, how long until result
- Register result status: Which instruction will write to each register, and when
- Physical register file: Current values and validity flags
On each cycle, the scoreboard:
- Determines which instructions can issue (no structural hazard)
- ChecksRAW hazards (instruction reading register that will be written by earlier instruction)
- Monitors execution completion and triggers result writes
- Manages write-back order to prevent WAW and WAR hazards
The scoreboard stalls issue when a hazard exists, allowing in-order retirement semantics while enabling out-of-order execution.
Implementing a cache hierarchy involves adding cache structures and modifying memory access:
Cache structure:
class Cache:
def __init__(self, size, block_size, associativity):
self.size = size
self.block_size = block_size
self.associativity = associativity
self.lines = [None] * (size // block_size)
self.hits = 0
self.misses = 0
Memory access flow:
- Check L1 cache for the address
- On hit: return data, update LRU (Least Recently Used)
- On miss: check L2 cache
- On L2 hit: return data, update L2 and L1
- On L2 miss: access main memory, fill both caches
Cache line states: Valid, Modified (dirty), Owned, Invalid. This is needed for cache coherence protocols (MESI, MOESI).
Write policies:
- Write-back: Write to cache only, write to memory on eviction
- Write-through: Write to cache and memory simultaneously
- Write-allocate: On write miss, allocate the cache line
- No-write-allocate: On write miss, write directly to memory
A unified cache stores both instructions and data in the same cache structure. A separate cache has distinct L1 instruction cache (I-cache) and L1 data cache (D-cache).
Advantages of separate caches:
- Harvard architecture: Can fetch instruction and access data simultaneously on different buses
- No conflict misses: Instructions and data don't compete for the same cache space
- Simpler timing: Different cache parameters for instructions vs data
Advantages of unified cache:
- Flexibility: All resources available for either instructions or data as needed
- Higher hit rates: No wasted space if one cache is underutilized while the other is full
- Simpler design: One cache structure instead of two
Modern CPUs typically use separate L1 caches (I-cache and D-cache) because the fetch and data access are independent critical paths. L2 and L3 are usually unified because by that point the distinction is less important and the cache needs to handle both equally.
A TLB is a cache of virtual-to-physical address translations. In a simulator:
TLB structure:
class TLB:
def __init__(self, entries=64):
self.entries = entries # Typically 64-128
self.tags = [None] * entries # Virtual page numbers
self.translations = [None] * entries # Physical page numbers
self.valid = [False] * entries
self.lru = list(range(entries)) # LRU ordering
Lookup flow:
- Extract virtual page number from virtual address
- Search TLB for matching tag
- On hit: combine physical page number with offset
- On miss: walk page tables in simulated memory, add to TLB, evict LRU entry
TLB attributes:
- ASID (Address Space ID): Identifies which process's address space (prevents flushing on context switch)
- Protection bits: Read/write/execute permissions
- Valid bit: Entry is valid
TLB misses are expensive—on a real CPU, they require a page table walk (10-100 cycles). Simulating this accurately adds significant complexity.
A scalar processor can complete at most one instruction per cycle. A superscalar processor can complete more than one instruction per cycle by having multiple execution units and dispatching instructions to them in parallel.
A microscalar processor extends scalar by having multiple pipeline lanes but limited parallelism—typically issue width of 2-4 and fewer execution units than a full superscalar.
Superscalar characteristics:
- Issue width: Number of instructions that can be dispatched per cycle (2, 4, 8)
- Execution units: ALU, load/store, FP units that can operate in parallel
- Register files with multiple ports: Needed to support multiple reads/writes per cycle
- Out-of-order execution: To keep the execution units busy despite hazards
Modern desktop and server CPUs are superscalar (Intel Haswell: 4-wide, AMD Zen: 6-wide). Simulating a superscalar is significantly more complex than scalar because you must model multiple instructions per cycle and their interactions.
Branch prediction simulation models how a CPU predicts branch outcomes to avoid pipeline stalls. The simulator needs to track:
Branch History Table (BHT):
class BranchPredictor:
def __init__(self, size=512):
self.size = size
# 2-bit saturating counters: 00=strong not-taken, 01=weak not-taken
# 10=weak taken, 11=strong taken
self.counters = [0] * size
def predict(self, pc):
idx = pc % self.size
return self.counters[idx] >= 2 # Taken if counter >= 2
def update(self, pc, taken):
idx = pc % self.size
if taken and self.counters[idx] < 3:
self.counters[idx] += 1
elif not taken and self.counters[idx] > 0:
self.counters[idx] -= 1</code></pre>
<p><strong>Branch Target Buffer (BTB)</strong>: Caches the destination address of branches to enable faster fetch on subsequent executions.</p>
<p><strong>Prediction flow in simulator</strong>:</p>
<ol>
<li>When fetching a branch instruction, check BTB for target address</li>
<li>Use BHT to predict taken/not-taken</li>
<li>If mispredicted (actual != predicted), flush pipeline and correct PC</li>
<li>Update BHT with actual outcome</li>
</ol>
<p>More sophisticated predictors: local history (correlates with recent local branches), global history (correlates with overall branch behavior), tournament predictors (choose between local and global).</p>
18. What is the purpose of a pipeline register and how does it affect simulation speed?
Pipeline registers (also called pipeline latches or flip-flops) are storage elements placed between pipeline stages. They capture the output of one stage at the end of a cycle and hold it as input to the next stage at the beginning of the next cycle.
Pipeline registers enable:
- Timing closure: Each stage has one cycle to complete its work
- Clocking: All stages synchronize on the same clock edge
- Isolation: Stages don't interfere with each other mid-cycle
In a cycle-accurate simulator, pipeline registers are modeled explicitly:
class PipelineRegister:
def __init__(self, name):
self.name = name
self.value = None
def clock(self):
# On clock edge, capture input
pass
def update(self, new_value):
# Set value for next cycle
self.value = new_value</code></pre>
<p>Pipeline registers affect simulation speed because they determine how frequently the simulator must evaluate the pipeline. Deeply pipelined processors (20+ stages) require more frequent updates than simple ones. Cycle-accurate simulators update state every simulated cycle, which is slower than functional simulation but necessary for timing accuracy.</p>
19. How would you add support for exceptions and interrupts to the simulator?
Adding exception and interrupt support requires several additions:
Exception types:
- Synchronous: Generated by instruction (undefined opcode, overflow, page fault)
- Asynchronous: External events (timer interrupt, I/O completion)
Exception handling flow:
- Detect exception condition (invalid opcode, alignment fault, etc.)
- Save current PC to EPC (Exception Program Counter) register
- Set cause register with exception type
- Disable further interrupts
- Transfer to exception handler address (from vector table)
Interrupt controller:
class InterruptController:
def __init__(self):
self.pending = []
self.enabled = True
def raise_interrupt(self, source):
if self.enabled:
self.pending.append(source)
def check_interrupts(self, cpu):
if self.pending and self.enabled:
# Highest priority interrupt
irq = self.pending.pop(0)
cpu.handle_interrupt(irq)</code></pre>
<p>RTI (Return from Interrupt) instruction restores saved state and re-enables interrupts. The key complexity is preserving enough state to resume execution after handling.</p>
20. What is the difference between soft float and hard float emulation in a simulator?
Hard float (hardware floating-point): The simulator leverages the host CPU's floating-point instructions. This is fast but may have subtle differences in rounding, exception handling, and NaN propagation compared to the target architecture.
Soft float: The simulator implements floating-point operations entirely in software. Each FP operation becomes a function call that emulates the operation using integer arithmetic. This is slow (100x slower than hard float) but bit-exact with the target architecture.
Soft float implementation:
def soft_float_add(a_bits, b_bits):
# Extract sign, exponent, mantissa
sign_a = (a_bits >> 31) & 1
exp_a = (a_bits >> 23) & 0xFF
mantissa_a = a_bits & 0x7FFFFF
# Perform addition in software
# Handle special cases (NaN, infinity, denormals)
# Rounding according to IEEE 754 rules
return result_bits</code></pre>
<p>When to use each:</p>
<ul>
<li><strong>Soft float</strong>: When bit-exact behavior is required (testing, certification, debugging)</li>
<li><strong>Hard float</strong>: When speed is critical and minor behavioral differences are acceptable</li>
</ul>
<p>gem5 simulator supports both, selectable via configuration. The RISC-V GNU toolchain supports both soft and hard float ABIs.</p>
Further Reading
- Nand2Tetris — Build a complete computer from NAND gates
- LC3 Simulator — Open-source LC-3 architecture simulator
- RISC-V Spike Simulator — Official RISC-V reference simulator
- QEMU Open Source Emulator — Full system emulation for multiple architectures
- CPU Design Course ( Coursera) — Free online course on processor design
Conclusion
Building a CPU simulator gives you concrete insight into how hardware actually works. The fetch-decode-execute cycle, register files, ALU operations, and memory access—all the abstract concepts from computer architecture become tangible when you implement them yourself.
The simulator we built demonstrates the essential concepts without getting bogged down in timing accuracy or cache modeling. It handles real instructions, executes branches correctly, and can run actual programs. Extensions like function calls, interrupts, and exception handling would turn this into a full system simulator.
Take the next step by exploring instruction set architecture to understand what instructions real processors provide, or assembly language basics to see how those instructions are actually written.
Category
Related Posts
ASLR & Stack Protection
Address Space Layout Randomization, stack canaries, and exploit mitigation techniques
Assembly Language Basics: Writing Code the CPU Understands
Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.
Boolean Logic & Gates
Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.