2. Performance Concepts
Microprocessor Speed
Techniques built into contemporary processors
Pipelining: The processor moves data or instructions into a conceptual pipe with all stages of the pipe processing simultaneously
Branch Prediction: The processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next
Superscalar execution: Ability to issue more than one instruction in every processor clock cycle (multiple parallel pipelines are used)
Data flow analysis: The processor analyzes which instructions are dependent on each other's results or data, in order to create an optimized schedule of instructions
Speculative execution: Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations, keeping execution engines as busy as possible.
Performance Balance
Adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components
Increase the number of bits retrieved at one time by making DRAMs wider rather than deeper and by using wide bus data paths.
Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory
Change the DRAM interface to make it more efficient by including a cache or other buffering scheme on the DRAM chip
Increase the interconnect bandwidth between processors and memory by using higher speed buses and a hierarchy of buses to buffer and structure data flow
Improvements in Chip Organization and Architecture
Increase hardware speed of the processor
Fundamentally due to shrinking logic gate size
More gates, packed more tightly, increasing clock rate
Propagation time for signals reduced
Increase the size and speed of caches
Dedicating part of the processor chip itself to the cache
Cache access times drop significantly
Change processor organization and architecture
Increase the effective speed of instruction execution
Parallelism
Problems with Clock Speed and Logic Density
Power
Power density increases with the density of logic and clock speed
Dissipating the resulting heat becomes a serious design issue
RC Delay
The speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting the transistors
Delay increases as the RC product increases
As components on the chip decrease in size, the wire interconnects become thinner, increasing resistance
Wires closer together, increasing capacitance
Memory latency and throughput
Memory access speed (latency) and transfer speed (throughput) lag processor speeds
Multi-Core
The use of multiple processors on the same chip provides the potential to increase performance without increasing the clock rate
Strategy is to use two simpler processors on the chip rather than one more complex processor
With two processors, larger caches are justified
As caches became larger, it made performance sense to create two and then three levels of cache on the chip
MIC - Many Integrated Core
The multicore and MIC strategies involve a homogeneous collection of general-purpose processors on a single chip
GPU (Graphics Processing Unit)
Designed for 2D and 3D graphics processing
Operates like a collection of vector processors
Clock
Quartz Crystal (Analog) -> A-to-D converter -> System Clock (Digital)
A computer clock runs at a constant rate and determines when events take place in hardware
The clock cycle time is the amount of time for one clock period to elapse
The clock rate (Hz) is the inverse of the clock cycle time (sec)
Clock Cycle Time * Clock Rate = 1
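A quick sanity check of the relationship above, as a minimal Python sketch (the 3 GHz clock rate is an assumed example value, not from the notes):

```python
# Clock cycle time and clock rate are reciprocals: cycle_time * clock_rate = 1
clock_rate_hz = 3.0e9             # assumed example: a 3 GHz clock
cycle_time_s = 1 / clock_rate_hz  # one clock period

print(f"Cycle time: {cycle_time_s * 1e9:.3f} ns")  # ~0.333 ns
print(f"Check: {cycle_time_s * clock_rate_hz}")    # 1.0
```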
Computing CPU Time / Execution Time
CPU TIME
CPU TIME = CPU CLOCK CYCLEs * CLOCK CYCLE TIME
CPU TIME = CPU CLOCK CYCLEs / CLOCK RATE
INSTRUCTION COUNT
INSTRUCTION COUNT = INSTRUCTIONs / PROGRAM
CPI (CLOCK CYCLEs per INSTRUCTION)
CPI = CLOCK CYCLEs / INSTRUCTION
CLOCK CYCLEs
CPU CLOCK CYCLEs = (INSTRUCTIONs / PROGRAM) * (CLOCK CYCLEs / INSTRUCTION)
CPU CLOCK CYCLEs = INSTRUCTION COUNT * CPI
CPU TIME
CPU TIME = INSTRUCTION COUNT * CPI * CLOCK CYCLE TIME
CPU TIME = INSTRUCTION COUNT * CPI / CLOCK RATE
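A worked example of the CPU time formulas above; all numbers are assumed, for illustration only:

```python
# CPU Time = Instruction Count * CPI * Clock Cycle Time
#          = Instruction Count * CPI / Clock Rate
instruction_count = 2_000_000_000  # assumed: 2 billion instructions executed
cpi = 1.5                          # assumed: average cycles per instruction
clock_rate_hz = 2.5e9              # assumed: 2.5 GHz clock

cpu_clock_cycles = instruction_count * cpi
cpu_time_s = cpu_clock_cycles / clock_rate_hz

print(f"CPU clock cycles: {cpu_clock_cycles:.3e}")  # 3.000e+09
print(f"CPU time: {cpu_time_s:.2f} s")              # 3e9 / 2.5e9 = 1.20 s
```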
Computing CPI
CPI is the average number of cycles per instruction.
For an instruction mix: CPI = sum over classes i of (CPI_i * F_i), where F_i is the frequency (fraction) of executed instructions in class i
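A small sketch of computing the average CPI from an instruction mix; the classes, per-class CPIs, and frequencies below are made-up illustration values:

```python
# Average CPI = sum of (CPI_i * F_i) over instruction classes,
# where F_i is the fraction of executed instructions in class i.
instruction_mix = [
    # (class, CPI_i, F_i)
    ("ALU",    1, 0.50),
    ("Load",   3, 0.20),
    ("Store",  2, 0.10),
    ("Branch", 2, 0.20),
]

average_cpi = sum(cpi * freq for _, cpi, freq in instruction_mix)
print(f"Average CPI: {average_cpi:.2f}")  # 0.5 + 0.6 + 0.2 + 0.4 = 1.70
```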
MIPS / MFLOPS
Marketing metrics for computer performance included MIPS and MFLOPS
MIPS
Millions of instructions per second
MIPS = INSTRUCTION COUNT / (EXECUTION TIME * 10^6)
Advantage: easy to understand and measure
Disadvantage: may not reflect actual performance, since machines with simpler instructions post higher MIPS values without necessarily doing more useful work
MFLOPS
Millions of floating-point operations per second
MFLOPS = FLOATING POINT OPERATIONS / (EXECUTION TIME * 10^6)
Advantage: easy to understand and measure
Disadvantage: only measures floating-point performance
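A brief sketch of both rate metrics, using assumed instruction counts and execution time (illustration values only):

```python
# MIPS   = instruction count / (execution time * 10^6)
# MFLOPS = floating-point operation count / (execution time * 10^6)
instruction_count = 500_000_000  # assumed total instructions executed
flop_count = 120_000_000         # assumed floating-point operations
execution_time_s = 2.0           # assumed measured execution time

mips = instruction_count / (execution_time_s * 1e6)
mflops = flop_count / (execution_time_s * 1e6)

print(f"MIPS:   {mips:.1f}")    # 250.0
print(f"MFLOPS: {mflops:.1f}")  # 60.0
```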
Benchmark Principles
Written in a high-level language - portable across different machines
Representative of a particular kind of programming domain or paradigm
Easily measured
Wide distribution