2. Performance Concepts
Microprocessor Speed
Techniques built into contemporary processors
Pipelining: The processor moves data or instructions into a conceptual pipe with all stages of the pipe processing simultaneously
Branch Prediction: The processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next
Superscalar execution: Ability to issue more than one instruction in every processor clock cycle (multiple parallel pipelines are used)
Data flow analysis: The processor analyzes which instructions are dependent on each other's results or data, in order to create an optimized schedule of instructions
Speculative execution: Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations, keeping execution engines as busy as possible.
Performance Balance
Adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components
Increase the number of bits retrieved at one time by making DRAMs wider rather than deeper and by using wide bus data paths.
Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory
Change the DRAM interface to make it more efficient by including a cache or other buffering scheme on the DRAM chip
Increase the interconnect bandwidth between processors and memory by using higher speed buses and a hierarchy of buses to buffer and structure data flow
Improvements in Chip Organization and Architecture
Increase hardware speed of the processor
Fundamentally due to shrinking logic gate size
More gates, packed more tightly, increasing clock rate
Propagation time for signals reduced
Increase the size and speed of caches
Dedicating part of the processor chip itself to the cache
Cache access times drop significantly
Change processor organization and architecture
Increase the effective speed of instruction execution
Parallelism
Problems with Clock Speed and Logic Density
Power
Power density increases with the density of logic and clock speed
Dissipating the resulting heat becomes a serious design issue
RC Delay
The speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting the transistors
Delay increases as the RC product increases
As components on the chip decrease in size, the wire interconnects become thinner, increasing resistance
Wires closer together, increasing capacitance
Memory latency and throughput
Memory access speed (latency) and transfer speed (throughput) lag processor speeds
Multi-Core
The use of multiple processors on the same chip provides the potential to increase performance without increasing the clock rate
Strategy is to use two simpler processors on the chip rather than one more complex processor
With two processors, larger caches are justified
As caches became larger, it made performance sense to create two and then three levels of cache on the chip
MIC - Many Integrated Core
The multicore and MIC strategies involve a homogeneous collection of general-purpose processors on a single chip
GPU (Graphics Processing Unit)
Designed for 2D and 3D graphics processing
Operates like a collection of vector processors
Clock
Quartz Crystal (Analog) -> A-to-D converter -> System Clock (Digital)
A computer clock runs at a constant rate and determines when events take place in hardware
The clock cycle time is the amount of time for one clock period to elapse
The clock rate (Hz) is the inverse of the clock cycle time (sec)
Clock Cycle Time * Clock Rate = 1
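A quick sanity check of the relationship above, as a minimal Python sketch (the 3 GHz clock rate is an assumed example value, not from the notes):

```python
# Clock cycle time and clock rate are reciprocals: cycle_time * clock_rate = 1
clock_rate_hz = 3.0e9             # assumed example: a 3 GHz clock
cycle_time_s = 1 / clock_rate_hz  # one clock period

print(f"Cycle time: {cycle_time_s * 1e9:.3f} ns")  # ~0.333 ns
print(f"Check: {cycle_time_s * clock_rate_hz}")    # 1.0
```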
Computing CPU Time / Execution Time
CPU TIME
CPU TIME = CPU CLOCK CYCLEs * CLOCK CYCLE TIME
CPU TIME = CPU CLOCK CYCLEs / CLOCK RATE
INSTRUCTION COUNT
INSTRUCTION COUNT = INSTRUCTIONs / PROGRAM
CPI (CLOCK CYCLEs per INSTRUCTION)
CPI = CLOCK CYCLEs / INSTRUCTION
CLOCK CYCLEs
CPU CLOCK CYCLEs = (INSTRUCTIONs / PROGRAM) * (CLOCK CYCLEs / INSTRUCTION)
CPU CLOCK CYCLEs = INSTRUCTION COUNT * CPI
CPU TIME
CPU TIME = INSTRUCTION COUNT * CPI * CLOCK CYCLE TIME
CPU TIME = INSTRUCTION COUNT * CPI / CLOCK RATE
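A worked example of the CPU time formulas above; all numbers are assumed, for illustration only:

```python
# CPU Time = Instruction Count * CPI * Clock Cycle Time
#          = Instruction Count * CPI / Clock Rate
instruction_count = 2_000_000_000  # assumed: 2 billion instructions executed
cpi = 1.5                          # assumed: average cycles per instruction
clock_rate_hz = 2.5e9              # assumed: 2.5 GHz clock

cpu_clock_cycles = instruction_count * cpi
cpu_time_s = cpu_clock_cycles / clock_rate_hz

print(f"CPU clock cycles: {cpu_clock_cycles:.3e}")  # 3.000e+09
print(f"CPU time: {cpu_time_s:.2f} s")              # 3e9 / 2.5e9 = 1.20 s
```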
Computing CPI
CPI is the average number of cycles per instruction.
For an instruction mix: CPI = sum over classes i of (CPI_i * F_i), where F_i is the frequency (fraction) of executed instructions in class i
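A small sketch of computing the average CPI from an instruction mix; the classes, per-class CPIs, and frequencies below are made-up illustration values:

```python
# Average CPI = sum of (CPI_i * F_i) over instruction classes,
# where F_i is the fraction of executed instructions in class i.
instruction_mix = [
    # (class, CPI_i, F_i)
    ("ALU",    1, 0.50),
    ("Load",   3, 0.20),
    ("Store",  2, 0.10),
    ("Branch", 2, 0.20),
]

average_cpi = sum(cpi * freq for _, cpi, freq in instruction_mix)
print(f"Average CPI: {average_cpi:.2f}")  # 0.5 + 0.6 + 0.2 + 0.4 = 1.70
```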
MIPS / MFLOPS
Marketing metrics for computer performance included MIPS and MFLOPS
MIPS
Millions of instructions per second
MIPS = INSTRUCTION COUNT / (EXECUTION TIME * 10^6)
Advantage: easy to understand and measure
Disadvantage: may not reflect actual performance, since machines with simpler instructions post higher MIPS values without necessarily doing more useful work
MFLOPS
Millions of floating-point operations per second
MFLOPS = FLOATING POINT OPERATIONS / (EXECUTION TIME * 10^6)
Advantage: easy to understand and measure
Disadvantage: only measures floating-point performance
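A brief sketch of both rate metrics, using assumed instruction counts and execution time (illustration values only):

```python
# MIPS   = instruction count / (execution time * 10^6)
# MFLOPS = floating-point operation count / (execution time * 10^6)
instruction_count = 500_000_000  # assumed total instructions executed
flop_count = 120_000_000         # assumed floating-point operations
execution_time_s = 2.0           # assumed measured execution time

mips = instruction_count / (execution_time_s * 1e6)
mflops = flop_count / (execution_time_s * 1e6)

print(f"MIPS:   {mips:.1f}")    # 250.0
print(f"MFLOPS: {mflops:.1f}")  # 60.0
```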
Benchmark Principles
Written in a high-level language - portable across different machines
Representative of a particular kind of programming domain or paradigm
Easily measured
Wide distribution