Computer Organization & Architecture Study Notes

Introduction :

Computer Organization and Architecture forms the core foundation of computer engineering. This subject explains how a computer system is designed, how its internal components are structured, and how they interact to execute instructions. While computer architecture focuses on the functional structure and operational behavior of a computer system as seen by the programmer, computer organization deals with the structural units and their interconnections that realize the architectural specifications. This article provides an exhaustive, textbook-style guide covering the evolution of computing, fundamental architectures, performance metrics, system interconnections, and the deep mechanics of computer arithmetic.

A Brief History of Computers

The evolution of modern computers can be categorized into distinct generations, each marked by a major technological advancement that fundamentally changed how computers operate.

First Generation (1940 - 1956): Vacuum Tubes

The earliest digital computers used vacuum tubes for circuitry and magnetic drums for memory. These machines were massive, filling entire rooms, consuming enormous amounts of electricity, and generating tremendous heat. They were programmed entirely in machine language (0s and 1s). Examples include ENIAC (Electronic Numerical Integrator and Computer) and UNIVAC (Universal Automatic Computer).

Second Generation (1956 - 1963): Transistors

The invention of the transistor replaced unreliable vacuum tubes. Transistors were smaller, faster, cheaper, and more energy-efficient. This era also saw the development of early symbolic or assembly languages, allowing programmers to use abbreviated text instructions instead of raw binary numbers. High-level languages like early versions of FORTRAN and COBOL were also introduced.

Third Generation (1964 - 1971): Integrated Circuits (ICs)

The breakthrough of the Integrated Circuit (IC) allowed hundreds of transistors to be placed on a single silicon semiconductor chip. This drastically increased the speed and efficiency of computers while significantly reducing their size and cost. Keyboards, monitors, and early operating systems replaced punched cards and printouts, making computers accessible to a broader audience.

Fourth Generation (1971 - Present): Microprocessors

The Fourth Generation began with the advent of the microprocessor, which brought thousands of integrated circuits onto a single silicon chip. The Intel 4004 chip, developed in 1971, located all the components of the computer—including the central processing unit, memory, and input/output controls—on a single chip. This era led to the rise of personal computers (PCs), laptops, and handheld devices, interconnected via local networks and eventually the global internet.

Fundamental System Architectures

To design an effective computing system, engineers rely on foundational structural frameworks. The two most prominent architectural models are the Von Neumann Architecture and the Harvard Architecture.

Von Neumann Architecture

Proposed by mathematician and physicist John Von Neumann in 1945, this design is built upon the "stored-program concept," where both program instructions and data are stored together in the same physical memory unit.

Key Components

Central Processing Unit (CPU): The brain of the computer, containing the Control Unit (CU) and the Arithmetic Logic Unit (ALU).
Memory Unit: A single, shared memory space that stores both data values and program instructions.
Input/Output (I/O) Interfaces: Mechanisms to communicate with external devices like keyboards, displays, and storage.
Registers: High-speed, temporary internal storage locations within the CPU, such as the Program Counter (PC) and Accumulator (AC).

The Von Neumann Bottleneck

Because instructions and data share the same physical memory space and access pathway (the system bus), the CPU cannot read an instruction and read/write data at the exact same instant. This sequential access limitation acts as a performance restrictive barrier, widely known as the Von Neumann Bottleneck.

Hinglish Simplification: Von Neumann architecture mein data aur instructions dono ek hi memory mein store hote hain. CPU ek waqt par ya toh instruction fetch kar sakta hai ya data read/write kar sakta hai kyunki dono ke liye ek hi rasta (bus) hota hai. Is limitation ko hum Von Neumann Bottleneck kehte hain.

Harvard Architecture

The Harvard Architecture was designed to overcome the primary limitation of the Von Neumann model by physically separating the storage and pathways for program instructions and data.

Features

Separate Memory Units: It contains a distinct Instruction Memory and a separate Data Memory.
Independent Buses: It features separate sets of physical wires (buses) to access instructions and data simultaneously.
Pipelining Support: Since instruction fetching and data memory access do not conflict, the CPU can execute instructions much faster through overlapping execution stages.

Comparison Between Von Neumann and Harvard Architecture

Feature	Von Neumann Architecture	Harvard Architecture
Memory Space	Single shared memory for data and instructions.	Separate memory blocks for data and instructions.
Bus Structure	Single shared bus for data and instructions.	Separate, dedicated buses.
Execution Speed	Slower execution due to sequential access.	Faster execution due to simultaneous access.
Hardware Complexity	Simpler hardware design and less expensive.	Complex design due to duplicate buses and memory controllers.
Applications	Standard PCs, laptops, and general-purpose systems.	Microcontrollers, Digital Signal Processors (DSPs), and embedded systems.

Designing for Performance

As technology advances, users demand faster processing speeds and greater efficiency. Computer engineers maximize performance by addressing fundamental constraints using several core design strategies.

Pipelining

Pipelining is a technique where multiple instructions are overlapped in execution. Think of it like an assembly line in a factory. While one instruction is being executed, the next instruction is being decoded, and a third instruction is being fetched from memory.

Cache Memory

To prevent the fast CPU from waiting idle for data from slow main memory (RAM), designers insert a small, ultra-fast memory called Cache Memory between the CPU and RAM. It stores frequently accessed data and instructions to accelerate processing times.

Architectural Scaling Paradigms

Superscalar Execution: Equipping the CPU with multiple parallel execution units (multiple ALUs) so that more than one instruction can be executed during a single clock cycle.
Multicore Processors: Placing multiple distinct processor cores onto a single silicon die, allowing completely independent programs or threads to run truly in parallel.

Evolution of Intel Processor Architecture: 4-Bit to 64-Bit

The historical trajectory of Intel microprocessors provides an ideal case study on how word size, register layout, and clock speeds expanded over time to handle complex software applications.

4-Bit Era (Intel 4004)

Introduced in 1971, the Intel 4004 was the world’s first commercially available single-chip microprocessor. It operated with a 4-bit word size, meaning it could process binary numbers up to 4 bits wide (values from 0 to 15) in a single operation. It possessed a clock speed of around 740 kHz and was primarily used in basic electronic calculators.

8-Bit Era (Intel 8008, 8085)

Launched in the mid-1970s, processors like the Intel 8085 expanded the word size to 8 bits. This enabled the processing of characters (text) natively, as 8 bits correspond to 1 byte. These processors featured an expanded instruction set, could address up to 64 KB of memory, and powered the earliest commercial personal computing systems.

16-Bit Era (Intel 8086, 80286)

The Intel 8086, released in 1978, represented a major leap forward with its 16-bit internal architecture and a 16-bit wide data bus. It introduced an instruction cache queue and could address up to 1 MB of physical memory using a technique called memory segmentation. This architecture laid the groundwork for the famous x86 instruction set family.

32-Bit Era (Intel 80386, 80486, Pentium Series)

Starting with the Intel 386 in 1985, the architecture shifted to 32 bits. Registers, data buses, and address buses expanded to 32 bits, allowing the processor to natively address up to 4 GB of RAM (2^32 bytes). This generation brought advanced hardware-level multitasking, virtual memory management, integrated floating-point math units, and complex pipelining structures into mainstream computing.

64-Bit Era (Intel Core, Xeon, Modern x86-64)

The modern era relies on the 64-bit architecture (x86-64). This transition broke through the 4 GB memory barrier, enabling processors to theoretically address up to 16 Exabytes of structural RAM (2^64 bytes). Modern 64-bit processors feature multi-level cache systems, highly advanced branch prediction mechanisms, and integrated graphics processing units, allowing them to manage complex, resource-heavy computing operations.

Performance Assessment Metrics

To determine the efficiency of a computer system objectively, engineers use quantitative metrics to evaluate execution speeds.

Clock Speed (Frequency)

The internal clock of a computer generates regular electronic pulses that synchronize all internal operations. Clock speed is measured in Hertz (Hz), with modern systems operating in the Gigahertz (GHz) range (billions of cycles per second).

MIPS and MFLOPS

MIPS (Millions of Instructions Per Second): A metric specifying how many millions of integer instructions a processor can execute in one second.
MFLOPS (Millions of Floating-Point Operations Per Second): Used predominantly in scientific computing to measure how many millions of decimal or fractional math operations a system can process per second.

The Fundamental CPU Performance Equation

The total execution time (T) required to run a specific program is calculated using the following mathematical formula:

T = I × CPI × t

Where:

T = Total CPU Execution Time for the given program.
I = Number of Instructions in the program (Instruction Count).
CPI = Average Cycles Per Instruction (the average number of clock cycles an instruction takes to finish).
t = Clock Cycle Time (the duration of a single clock cycle, which is the reciprocal of Clock Frequency, 1 / f).

Alternatively, this can be written as:

CPU Time = (Instruction Count × CPI) / Clock Frequency

Hinglish Simplification: Kisi program ko run hone mein kitna time lagega (CPU Time), yeh teen cheezon par depend karta hai: total kitne instructions hain (Instruction Count), ek instruction ko chalne mein average kitne cycles lagte hain (CPI), aur aapke processor ki speed kya hai (Clock Frequency). Agar frequency badhegi, toh CPU time kam hoga aur computer fast chalega.

Top-Level View of Computer Function and Interconnection

A computer functions by continuously executing instructions fetched from memory. This process is coordinated by internal interconnection pathways.

Computer Components

At a macro level, a computer consists of three core hardware modules:

The CPU: Contains control logic, data paths, and high-speed temporary storage registers.
The Memory Module: Composed of an array of addressable locations, each containing binary sequences representing data or instructions.
The I/O Module: Manages communication between internal system components and external peripherals.

Instruction Cycle

The basic operation of a CPU is called the instruction cycle, which consists of two primary phases:

[ Fetch Stage ] ---> [ Decode Stage ] ---> [ Execute Stage ]

Fetch Stage: The CPU reads an instruction from memory from the address specified by the Program Counter (PC). The PC is then incremented to point to the next instruction.
Execute Stage: The CPU interprets (decodes) the binary instruction opcode and carries out the required operation (such as loading data, performing math, or altering the execution flow).

Interconnection Structures and Bus Design

Since components must exchange data, they require dedicated communication pathways. The collective collection of paths connecting the modules is called the Interconnection Structure.

The System Bus

The most common interconnection method is the System Bus, a shared pathway consisting of multiple physical wires grouped by function.

Data Bus: Wires used to transfer actual data words between the CPU, memory, and I/O devices. The number of lines determines the data bus width, which impacts system performance.
Address Bus: Wires used by the CPU to designate the specific physical memory address or I/O port it wants to read from or write to. A wider address bus allows the system to access a larger pool of physical memory.
Control Bus: Wires carrying timing and command signals to manage access to and use of the data and address lines, transmitting read/write commands, clock pulses, and interrupt requests.

Bus Arbitration

Since the system bus is shared, if multiple modules (like the CPU and a Direct Memory Access controller) attempt to transmit data simultaneously, a collision occurs. Bus Arbitration is the mechanism that determines which device gets control of the bus. It can be centralized (handled by a single hardware controller) or distributed (modules work together using shared logic to decide control).

Computer Arithmetic: The Arithmetic and Logic Unit (ALU)

The Arithmetic and Logic Unit (ALU) is the core part of the CPU that performs all mathematical and logical operations. Data is presented to the ALU in digital registers, and the result of an operation is stored back into a register.

Fixed-Point Number Representation

Computers store integers in binary form using three primary methods:

Sign-Magnitude: The most significant bit (MSB) acts as the sign bit (0 for positive, 1 for negative), while the remaining bits represent the absolute magnitude.
1's Complement: Negative numbers are formed by inverting all the bits of the corresponding positive binary number.
2's Complement: The universal standard for integer representation. To find the 2's complement of a negative number, invert all the bits of its positive equivalent (1's complement) and then add 1 to the least significant bit (LSB).

Addition and Subtraction of Signed Numbers

In 2's complement representation, addition and subtraction can be performed using the same hardware loop.

2's Complement Addition

To add two signed numbers, perform a straightforward binary addition across all bits, including the sign bit. Any carry out of the most significant bit (MSB) is discarded.

2's Complement Subtraction

To subtract a binary number B from A (i.e., A - B), convert B into its negative 2's complement form (-B) and then add it directly to A.

Operation: A - B = A + (-B)

Overflow Conditions

When two signed numbers are added, there is a risk of the result exceeding the storage capacity of the bit-width. Overflow occurs if two numbers with the same sign are added, but the resulting output has the opposite sign (e.g., Positive + Positive = Negative). In hardware circuits, an overflow condition is detected when the carry entering the sign bit does not match the carry exiting the sign bit.

Design of Adders and Fast Adders

To perform additions, the ALU relies on specific combinational logic gates.

Half Adder

A Half Adder takes two single-bit inputs (A and B) and produces two outputs: Sum (S) and Carry (C).

Sum (S) = A XOR B

Carry (C) = A AND B

Full Adder

A Full Adder handles three binary digits: two operand bits (A, B) and an incoming carry bit from a lower-order column (Cin).

Sum (S) = A XOR B XOR Cin

Carry Out (Cout) = (A AND B) OR (Cin AND (A XOR B))

Ripple Carry Adder

A Ripple Carry Adder is constructed by cascading multiple full adder stages in series. The Cout of each stage serves as the Cin for the next higher-order stage.

Performance Limitation

The primary drawback of the Ripple Carry Adder is propagation delay. The high-order bits cannot calculate their final sum until the carry signals "ripple" through every single preceding stage. This serial dependency limits processing speed in wide (32-bit or 64-bit) architectures.

Carry Lookahead Adder (Fast Adder)

The Carry Lookahead Adder (CLA) solves the propagation delay problem by calculating the carry bits for all stages simultaneously using lookahead logic gates.

It defines two specialized signals for any given bit stage i:

Generate (Gi): This stage creates a carry out regardless of the incoming carry.

Gi = Ai AND Bi

Propagate (Pi): This stage passes an incoming carry through to the next stage.

Pi = Ai XOR Bi

The ultimate carry out (Ci+1) for a stage can be determined using these signals:

Ci+1 = Gi OR (Pi AND Ci)

By unrolling this recursive equation, we can compute higher-order carries without waiting for them to ripple sequentially:

C1 = G0 + (P0 × C0)

C2 = G1 + (P1 × G0) + (P1 × P0 × C0)

C3 = G2 + (P2 × G1) + (P2 × P1 × G0) + (P2 × P1 × P0 × C0)

Since these equations depend only on the external inputs (A, B) and the initial carry (C0), the carries can be calculated simultaneously. This drastically reduces propagation delay, making it an essential component for high-speed ALUs.

Hinglish Simplification: Ripple Carry Adder mein agla bit tab tak calculate nahi ho sakta jab tak pichle bit ka carry chalte hue aage na aaye, jisse delay hota hai. Carry Lookahead Adder is dikkat ko solve karta hai. Yeh advance mein hi generate (G) aur propagate (P) equations ka use karke saare carries ek saath nikal leta hai, jisse addition bohot fast ho jata hai.

Multiplication of Positive Numbers

Multiplying unsigned binary numbers follows a shift-and-add approach similar to long multiplication in decimal numbers.

Step-by-Step Multiplication Process

Check the multiplier bits one by one, starting from the least significant bit (LSB).
If the current multiplier bit is 1, copy the multiplicand down as a partial product. If the bit is 0, write down rows of zeros.
Shift each successive partial product row one position to the left.
Add all shifted partial product rows together to find the final product.

Signed Operand Multiplication: Booth's Algorithm

Multiplying signed 2's complement numbers requires careful handling, as standard shift-and-add methods do not account for negative sign bits. Booth's Algorithm is a specialized hardware method that multiplies signed integers efficiently by treating strings of 1s as a single subtraction and addition operation.

Booth's Algorithm Flowchart and Steps

The algorithm uses an extra bit (Q-1) placed directly to the right of the least significant bit of the multiplier (Q0). Initially, Q-1 is set to 0. An accumulator register (A) is initialized to 0.

During each step, the algorithm inspects the two bits Q0 and Q-1:

If the bits are 10: The multiplier is entering a string of 1s. Subtract the multiplicand (M) from the accumulator (A). Formula: A = A - M.
If the bits are 01: The multiplier is exiting a string of 1s. Add the multiplicand (M) to the accumulator (A). Formula: A = A + M.
If the bits are 00 or 11: No arithmetic operation is performed.
After the check: Perform an Arithmetic Right Shift (ASR) on the combined registers [A, Q, Q-1]. An arithmetic right shift shifts all bits to the right by one position but keeps the original sign bit (MSB) of register A unchanged to preserve the correct sign.
Repeat this complete cycle for as many steps as there are bits in the operands.

Numerical Example Using Booth's Algorithm

Let's multiply two 4-bit signed integers:

Multiplicand (M) = +3 (Binary: 0011)
Multiplier (Q) = -4 (Binary 2's complement: 1100)

Initial Setup

M = 0011
-M = 1101 (2's complement of +3)
A = 0000 (Accumulator initialized to 0)
Q = 1100
Q-1 = 0
Count = 4 (Since we are using 4-bit numbers)

Step-by-Step Execution Trace

Step 1

Inspect Bits (Q0, Q-1): Looking at Q = 1100 and Q-1 = 0, the current combination is 00.
Operation: No arithmetic operation is needed according to the rules.
Shift Phase: Perform an Arithmetic Right Shift (ASR) across the combined registers [A, Q, Q-1].
- Before Shift: A = 0000, Q = 1100, Q-1 = 0
- After Shift: A = 0000, Q = 0110, Q-1 = 0
Count Decouple: Count decreases from 4 to 3.

Step 2

Inspect Bits (Q0, Q-1): Looking at Q = 0110 and Q-1 = 0, the current combination is 00.
Operation: No arithmetic operation is needed.
Shift Phase: Perform an Arithmetic Right Shift (ASR).
- Before Shift: A = 0000, Q = 0110, Q-1 = 0
- After Shift: A = 0000, Q = 0011, Q-1 = 0
Count Decouple: Count decreases from 3 to 2.

Step 3

Inspect Bits (Q0, Q-1): Looking at Q = 0011 and Q-1 = 0, the current combination is 10.
Operation: Perform subtraction: A = A - M (or A = A + (-M)).
- A = 0000 + 1101 = 1101
- Current State: A = 1101, Q = 0011, Q-1 = 0
Shift Phase: Perform an Arithmetic Right Shift (ASR).
- Before Shift: A = 1101, Q = 0011, Q-1 = 0
- After Shift: A = 1110, Q = 1001, Q-1 = 1
Count Decouple: Count decreases from 2 to 1.

Step 4

Inspect Bits (Q0, Q-1): Looking at Q = 1001 and Q-1 = 1, the current combination is 11.
Operation: No arithmetic operation is needed.
Shift Phase: Perform an Arithmetic Right Shift (ASR).
- Before Shift: A = 1110, Q = 1001, Q-1 = 1
- After Shift: A = 1111, Q = 0100, Q-1 = 1
Count Decouple: Count decreases from 1 to 0. Execution terminates.

Final Product Analysis

The final combined binary output in registers [A, Q] is 11110100.

To verify if this binary string is correct, let's find its decimal value. Since the sign bit (MSB) is 1, it represents a negative number in 2's complement form.

Invert all bits of 11110100 -> 00001011
Add 1 to the LSB -> 00001100
00001100 in binary equals 12 in decimal.
Therefore, the original signed value is -12.

Our original mathematical expression was (+3) × (-4) = -12. The result matches perfectly.

Core Summary and Review Takeaways

Architectural Models: The Von Neumann architecture stores data and instructions in a single shared memory, which can create a performance bottleneck. The Harvard architecture uses separate memory units and buses to allow simultaneous access and faster processing.
Processor Evolution: Intel architectures grew from 4-bit to 64-bit systems, which expanded data bus widths, increased register capacities, and shattered memory access limits.
System Bus Function: The system bus coordinates communication within the computer using three distinct pathways: the address bus, the data bus, and the control bus.
Advanced Adders: The Carry Lookahead Adder eliminates the propagation delays found in Ripple Carry Adders by using combinatorial logic to calculate carry bits simultaneously.
Booth's Algorithm: This algorithm streamlines signed binary multiplication by using sequential arithmetic shifts and conditional operations based on pairs of multiplier bits.

SEO Keywords

Computer Organization and Architecture Notes, Von Neumann Architecture vs Harvard, Intel Processor History, System Bus Interconnection, Carry Lookahead Adder Equations, Booths Algorithm Solved Example, 2s Complement Addition Overflow, CPU Performance Equation, Computer Engineering Tutorials, Academic Exam Preparation Computer Architecture, SPPU 2024 Pattern.

Unit 1 - Computer Evolution and Performance

Introduction :

A Brief History of Computers

Comments (0)