Category Archives: [UCSD] Computer Architecture

[Computer Architecture] Instruction Level Parallelism

Parallelism

Thread: running program on a piece of hardware (context)

Single-threaded program: we can exploit ILP on SISD (ex. a single core)
Multi-threaded program: we can exploit TLP on MIMD (ex. CMP)
Many-threaded program: we can exploit DLP on SIMD (ex. GPU)

What limits infinite simultaneous multithreading?
– Data hazard
– Control hazard
– Memory bandwidth
– Memory latency

Instruction Level Parallelism

Instruction Level Parallelism is the characteristic of a program that certain instructions are independent, and can potentially be executed in parallel.

Compiler optimization (SW)

Instruction scheduling: changes ILP within a basic block
Loop unrolling: allows ILP across iterations by putting instructions from multiple iterations in the same basic block (see the sketch after this list)
Trace scheduling: allows ILP across multiple basic blocks by scheduling along likely execution paths (traces) as if they were one large block
Software pipelining: interleaving of multiple iterations of loop to form a kernel that better utilizes hardware resources
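
As an illustration of loop unrolling from the list above, here is a minimal C sketch (the function and array names are made up, and n is assumed to be a multiple of 4): the unrolled version keeps four independent partial sums so the adds from different iterations can execute in parallel.

    /* Hypothetical example: unrolling a simple accumulation loop by 4. */
    float sum_rolled(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    float sum_unrolled(const float *a, int n) {
        /* Four independent partial sums expose ILP across iterations:
           the adds can issue in parallel on a superscalar machine. */
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }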


Dynamic scheduling

Issue in-order, Execute and Write back out-of-order

Scoreboarding

Instruction storage is added to each functional execution unit, but control and buffers are centralized in the scoreboard and the register file
Both source registers must be read together, so WAR/WAW hazards limit the schedule

Tomasulo

  1. Issue (get instruction from FP Instruction Queue): If reservation station is free, the Instruction Queue issues instruction and sends operands (renames registers)
  2. Execution (operate on operands): When both operands are ready, then execute; if not ready, watch Common Data Bus (CDB) for result.
  3. Write result (finish execution): Write on CDB to all waiting units; mark reservation station available.

Instruction storage is added to each functional execution unit; control and buffers are distributed as reservation stations.
Each source register is read as soon as it is available (only RAW hazards matter)
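
A minimal, non-cycle-accurate C sketch of the renaming idea in Tomasulo's issue step (structure names, field names, and sizes are assumptions, not from any particular machine): each source is replaced at issue time by either a value or a reservation-station tag, and the destination register is claimed by the issuing station, which is what removes WAR/WAW hazards.

    #include <stdbool.h>

    #define NUM_RS   4              /* reservation stations for one FU        */
    #define NUM_REGS 32

    typedef struct {
        bool busy;
        int  op;                    /* opcode                                  */
        long vj, vk;                /* operand values, when already available  */
        int  qj, qk;                /* producing RS tag; 0 means "value ready" */
    } RS;

    typedef struct {
        int  qi;                    /* RS tag that will write this reg; 0 = none */
        long value;
    } RegStat;

    static RS      rs[NUM_RS + 1];  /* index 0 unused so that tag 0 = ready    */
    static RegStat regs[NUM_REGS];

    /* Issue: sources are renamed to either a value or an RS tag, and the
       destination register is claimed by this RS (removes WAR/WAW hazards). */
    bool issue(int op, int dst, int src1, int src2) {
        for (int r = 1; r <= NUM_RS; r++) {
            if (rs[r].busy) continue;
            rs[r].busy = true;
            rs[r].op   = op;
            if (regs[src1].qi) rs[r].qj = regs[src1].qi;
            else             { rs[r].vj = regs[src1].value; rs[r].qj = 0; }
            if (regs[src2].qi) rs[r].qk = regs[src2].qi;
            else             { rs[r].vk = regs[src2].value; rs[r].qk = 0; }
            regs[dst].qi = r;       /* a later writer simply overwrites this tag */
            return true;
        }
        return false;               /* structural stall: no free reservation station */
    }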

Tomasulo + ROB

  1. Issue (get instruction from FP Instruction Queue): If reservation station is free, the Instruction Queue issues instruction and sends operands (renames registers)
  2. Execution (operate on operands): When both operands are ready, then execute; if not ready, watch Common Data Bus (CDB) for result.
  3. Write result (finish execution): Write on CDB to all waiting units; mark reservation station available.
  4. Commit (update register with reorder result): When instruction is at the head of the reorder buffer and receives the result, update register with result (or store to memory) and remove instruction from reorder buffer.

The ROB (in-order commit) provides speculative execution, precise exceptions in an out-of-order machine, and a larger window of instructions across branch boundaries.

Instruction Queue

It uses explicit register renaming. Registers are not read until the instruction dispatches (begins execution), and register renaming ensures there are no conflicts.

References

http://courses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf

[Computer Architecture] Cache

Problem

Out of speed, density (a proxy for cost), and capacity, we can only get two. With this limitation, the Memory Wall and Bandwidth Wall problems persist.

Memory Wall: memory latency is the bottleneck, hiding the benefit of a fast processor
Bandwidth Wall: memory bandwidth is the bottleneck, hiding the benefit of a fast processor

Insight

The principle of locality lets us predict which data will be accessed, so that data can be cached to reduce access time. This gives the illusion of SRAM access time with larger capacity.

Temporal Locality: likely to access the same location in the near future again
Spatial Locality: likely to access neighboring locations soon
* Value Locality: nearby locations are likely to have the same values (used heavily in approximate computing)

Cache

Organization (size, associativity, block size)


Associativity means freedom in placing data. A cache is divided into sets, and each set has a number of ways (the associativity) in which a block can be placed. The least associative cache is called direct-mapped, and the most associative is called fully associative.
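
A small sketch, with assumed (illustrative) sizes, of how this organization splits an address into offset, index, and tag bits:

    #include <stdio.h>
    #include <stdint.h>

    static unsigned log2u(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

    int main(void) {
        unsigned cache_bytes = 32 * 1024;   /* 32 KiB (assumed)      */
        unsigned block_bytes = 64;          /* block (line) size     */
        unsigned ways        = 4;           /* associativity         */

        unsigned sets        = cache_bytes / (block_bytes * ways);
        unsigned offset_bits = log2u(block_bytes);   /* 6 */
        unsigned index_bits  = log2u(sets);          /* 7 */

        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & (block_bytes - 1);
        uint32_t index  = (addr >> offset_bits) & (sets - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        printf("sets=%u offset=%u index=%u tag=0x%x\n", sets, offset, index, tag);
        return 0;
    }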

Replacement policy

The block least likely to be used in the future should be evicted to lower-level memory to make room for the one that is needed now.

Longest until next use: ideal but impossible
Least Recently Used (LRU): best practical approximation (sketched below)
Pseudo-LRU
Random
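
A minimal sketch of the LRU policy above, using access timestamps within one set (real hardware typically approximates this with pseudo-LRU bits; the names and the 4-way size are assumptions):

    #define WAYS 4

    typedef struct {
        int      valid;
        unsigned tag;
        unsigned last_used;      /* timestamp of the most recent access */
    } Line;

    /* Pick the victim way: any invalid line if one exists,
       otherwise the least recently used line. */
    int pick_victim(Line set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) return w;
            if (set[w].last_used < set[victim].last_used) victim = w;
        }
        return victim;
    }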

Write policy

Write through: the information is written to both the block in the cache and the block in the lower-level memory (a read miss does not cause a write back to DRAM, because cache and memory are always kept in sync)
* Write through is always combined with write buffers to reduce delay
Write back: the information is written only to the block in the cache, and is written to DRAM only when the block is replaced (good because repeated writes do not go to DRAM)

Write-allocate: make room for the block in the cache and fetch the rest of the block from DRAM, then write into the cache
No-write-allocate (Write-around): write to lower levels of the memory hierarchy, bypassing the cache.

Improving cache performance

Reduce miss rate
Reduce miss penalty
Reduce hit time

Reducing miss rate

Types of miss

Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache (cold-start miss, first-reference miss). = miss even in an infinite cache
Capacity: if C is the size of the cache (in blocks), more than C unique blocks have been accessed since this block was last accessed. = non-compulsory miss in a fully associative cache of the same size
Conflict: occurs because too many active blocks are mapped to the same cache set. = non-compulsory, non-capacity miss
Coherence: the block was invalidated by another processor to keep the caches coherent (multiprocessors only)

Compulsory miss: increase block size or prefetch
* Too small a block size under-exploits spatial locality; too large a block size under-exploits temporal locality (fewer blocks fit in the cache)
Capacity miss: increase size of the cache
Conflict miss: increase associativity (or place victim cache)

Prefetching

Prefetching relies on extra memory bandwidth that can be used without penalty

(HW) Instruction prefetching can be done by placing stream buffer to prefetch the extra block.
(SW) Data prefetching can be done by special prefetching instructions, or by placing “helper thread” to prefetch the data for the main thread.

Compiler optimization

Reordering procedures in memory to reduce misses (partly link time optimization)
Profiling to look at conflicts

Merging arrays: improve spatial locality by single array of compound elements vs. two arrays
Loop interchange: change the nesting of loops to access data in the order it is stored in memory (see the sketch after this list)
Loop fusion: combine two independent loops where their accessed data overlap
Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole column of rows
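
As a small illustration of loop interchange from the list above (the array size and function names are made up): because C arrays are row-major, making the column index the inner loop gives unit-stride, cache-friendly accesses.

    #define N 1024

    void scale_bad(double a[N][N], double k) {
        for (int j = 0; j < N; j++)         /* inner loop strides N doubles */
            for (int i = 0; i < N; i++)
                a[i][j] *= k;
    }

    void scale_good(double a[N][N], double k) {
        for (int i = 0; i < N; i++)         /* unit-stride inner loop */
            for (int j = 0; j < N; j++)
                a[i][j] *= k;
    }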

Victim cache (Norman Jouppi, 1990)

Add buffer to hold data recently discarded from cache.

Reducing miss penalty

Write back memory

A read miss may require a dirty block to be written back to DRAM first; instead, copy the dirty block to a write buffer, service the read, then do the write

Early restart and Critical word first

Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block.
* Not always helpful because of spatial locality

Build a multi-level cache

The higher level can be less associative for faster speed; the lower level can be more associative to reduce the global miss rate.

Local miss rate: misses in this cache divided by the total number of memory accesses to this cache
Global miss rate: product of each local miss rate
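
A worked example with made-up numbers, showing how local miss rates combine into a global miss rate and the resulting average memory access time (AMAT) for a two-level cache:

    #include <stdio.h>

    int main(void) {
        double l1_hit = 1.0,  l1_miss_rate = 0.05;   /* cycles, fraction     */
        double l2_hit = 10.0, l2_miss_rate = 0.20;   /* local L2 miss rate   */
        double mem_latency = 100.0;

        double global_l2_miss = l1_miss_rate * l2_miss_rate;   /* 1% of all accesses */
        double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency);

        printf("global L2 miss rate = %.3f, AMAT = %.2f cycles\n", global_l2_miss, amat);
        return 0;   /* prints 0.010 and 2.50 with these numbers */
    }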

The property of inclusion means that if a block is in the higher-level cache, it is also present in the lower-level cache. (Exclusion could be better than inclusion when L2 is only slightly larger than L1.)

Reduce hit time

Way prediction

Add bits to each cache line to predict which way is going to hit
Gives direct mapped hit time + associative hit rate

[Computer Architecture] Branch Prediction

No Speculation

“Bubble” insertion

Insert “No-op”s. This is the simplest method, but leads to underutilization

Hoisting

Hoist branch-invariant expressions that come after the branch into the if-else statement itself (on the premise that the if-else statement and the hoisted expressions do not conflict with each other). The compiler copies these expressions into both arms (increasing code size) so that the instructions executed before the branch resolves do not have to be flushed.

* The term “hoisting” seems to be more widely used in compilers to name some cases of Loop-invariant code motion (LICM)

Predication

Change control dependencies into data dependencies (by using conditional, i.e. predicated, instructions).
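
A small C sketch of the idea (function names are made up): the branchy version has a control dependence, while the predicated version expresses the same computation as a data dependence that a compiler can typically lower to a conditional-move/select instruction.

    int abs_branchy(int x) {
        if (x < 0)                  /* possibly hard-to-predict branch */
            x = -x;
        return x;
    }

    int abs_predicated(int x) {
        int neg = -x;
        return (x < 0) ? neg : x;   /* often becomes cmov/csel, no branch */
    }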

Speculation

Static branch prediction

All Taken/Not Taken

T/NT for all branches.

Profile based

We profile runs of the program to choose between T and NT for each branch, but the choice is made at compile (static) time.

In LLVM there exist mechanisms to weight a branch toward T or NT.

Dynamic branch prediction

Pattern History Table (PHT)

A table of saturating counters (mostly 1-bit or 2-bit) that records past outcomes; it is indexed with (a hash of) the PC to get a prediction when the branch instruction reaches the instruction fetch stage.

* 2-bit counters add transient/hysteresis states that remove some anomalies.
* Different branches can contaminate (alias in) the table
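
A minimal sketch of a 2-bit saturating-counter PHT in C; the table size and the PC hashing below are illustrative assumptions. Counter values 0 and 1 predict not-taken, 2 and 3 predict taken.

    #include <stdint.h>
    #include <stdbool.h>

    #define PHT_ENTRIES 4096

    static uint8_t pht[PHT_ENTRIES];   /* 2-bit counters stored in bytes */

    static unsigned pht_index(uint64_t pc) { return (pc >> 2) & (PHT_ENTRIES - 1); }

    bool predict(uint64_t pc) {
        return pht[pht_index(pc)] >= 2;
    }

    void update(uint64_t pc, bool taken) {
        uint8_t *c = &pht[pht_index(pc)];
        if (taken  && *c < 3) (*c)++;      /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;      /* saturate at 0 */
    }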

Branch History Table (BHT)

Stores per-branch (local) history: it is indexed by the PC, and each entry's local history is shifted in and then used as the index into a PHT.

Global History Register (GHR)

This register records the outcomes of all recent branches. The outcomes are shifted in and used as the index into PHTs, or combined with the PC as in gshare.

Tournament Predictor

The local history table here is a BHT of 1024 entries with 10 bits per entry. The prediction tables are PHTs with 3 or 2 bits per entry, and the path history is a GHR. A chooser (meta-predictor) then selects between the local and global predictions for each branch.

Gshare

Combine the PC and the GHR using XOR to index the PHT, to better capture both local and global behavior.
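
A minimal gshare sketch in C (the table size, history length, and hashing details are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define GSHARE_BITS    12
    #define GSHARE_ENTRIES (1u << GSHARE_BITS)

    static uint8_t  gpht[GSHARE_ENTRIES];     /* 2-bit counters               */
    static uint32_t ghr;                      /* global history, newest = LSB */

    static unsigned gshare_index(uint64_t pc) {
        return ((unsigned)(pc >> 2) ^ ghr) & (GSHARE_ENTRIES - 1);
    }

    bool gshare_predict(uint64_t pc) {
        return gpht[gshare_index(pc)] >= 2;
    }

    void gshare_update(uint64_t pc, bool taken) {
        uint8_t *c = &gpht[gshare_index(pc)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (GSHARE_ENTRIES - 1);
    }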

Perceptron

Use a perceptron over the branch history bits to make the prediction; its hardware cost grows only linearly with the history length, so it can exploit much longer histories than a table of counters.

https://www.cs.utexas.edu/~lin/papers/hpca01.pdf
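
A rough C sketch of the perceptron-predictor idea from the paper above (the sizes, threshold constant, and indexing here are illustrative assumptions): the prediction is the sign of a dot product between per-branch weights and the global history, and the weights are trained on mispredictions or low-confidence outputs.

    #include <stdbool.h>

    #define HIST_LEN  16
    #define TABLE     1024
    #define THRESHOLD ((int)(1.93 * HIST_LEN + 14))

    static int weights[TABLE][HIST_LEN + 1];    /* w[0] is the bias weight    */
    static int hist[HIST_LEN];                  /* +1 = taken, -1 = not taken */

    static int output(unsigned i) {
        int y = weights[i][0];
        for (int k = 0; k < HIST_LEN; k++)
            y += weights[i][k + 1] * hist[k];
        return y;                               /* predict taken if y >= 0    */
    }

    bool perceptron_predict(unsigned long pc, int *y_out) {
        *y_out = output((pc >> 2) % TABLE);
        return *y_out >= 0;
    }

    void perceptron_train(unsigned long pc, int y, bool taken) {
        unsigned i = (pc >> 2) % TABLE;
        int t = taken ? 1 : -1;
        int mag = y < 0 ? -y : y;
        /* Train on a mispredict, or when the output magnitude is small. */
        if ((y >= 0) != taken || mag <= THRESHOLD) {
            weights[i][0] += t;
            for (int k = 0; k < HIST_LEN; k++)
                weights[i][k + 1] += t * hist[k];
        }
        for (int k = HIST_LEN - 1; k > 0; k--)  /* shift in the new outcome */
            hist[k] = hist[k - 1];
        hist[0] = t;
    }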

Aliasing

Constructive: when branch contaminations occur, reinforces the prediction of branches that share the same entry
Destructive: when branch contaminations occur, the branches disrupt the predictions of other branches occupying the same entry

Branch Target Buffer (BTB)

BTB is used to reduce the branch penalty by identifying a branch and its predicted target in the IF stage.

BTB is a cache where the full PC is compared to get the prediction for the branch target. It also has a bit for preliminary branch prediction (T or NT), because branch prediction could happen over multiple cycles in modern processors and this bit can be an initial guess before the real prediction.

Reference

https://www.cs.cmu.edu/afs/cs/academic/class/15740-s17/www/lectures/L19-BranchPrediction.pdf
http://web.engr.oregonstate.edu/~benl/Projects/branch_pred/
https://courses.cs.washington.edu/courses/csep548/06au/lectures/branchPred.pdf
http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html

[Computer Architecture] Hazards

Hazards

Structural hazards: HW cannot support this combination of instructions. For example, no two instructions can be at execution stage simultaneously if there is only one execution unit.
Data hazards: instruction depends on result of prior instruction still in the pipeline. (RAW, WAW, WAR)
Control hazards: pipelining of branches and other instructions that change the PC. If there are branches, it is uncertain which instruction should be issued.

Data hazards

(Figure from the Compiler Optimization course by Seon Wook Kim)

Alleviated with forwarding

Control hazards

Branch instructions are the cause of control hazards

[Computer Architecture] Architecture and Microarchitecture

Computation Stack

  • Application: e.g., making cars drive themselves, which requires different algorithms
  • Programming Languages: express the algorithms; a virtual entity
  • Operating Systems / Compiler

Semantic Gap between virtual and physical world

  • Architecture: machine code; says nothing about how it is executed… the “WHAT”
  • Microarchitecture: Specific composition of circuits… “HOW”
  • Circuits
  • Transistors (devices in ECE): voltage controlled 2D switch with ON/OFF
  • Atoms

* Execution requires physical substrate
* OS and Compilers work in tandem, intertwined

Architecture and Microarchitecture

Architecture and Microarchitecture are what bridges the semantic gap between Operating Systems/Compiler and Circuits of the computation stack. Architecture defines “what” to accomplish, and Microarchitecture defines “how” to accomplish it.

Architecture could be an Instruction Set Architecture (ISA) where it defines “what” a processor can do or canonical set of capability. Microarchitecture could be a specific composition of circuits, where it defines “how” operations/capabilities defined in Architecture can be accomplished.

“How” is not defined in the Architecture to provide abstraction, so that we don’t have to build a new compiler every time we develop a new processor and so that we don’t need to recompile the code for new circuits.

Top-down: an application (e.g., Facebook) translates down to machine code (a bunch of instructions)
Bottom-up: capabilities in the hardware that are canonical (like orthogonal functions)

Machine model

Von Neumann’s model
– CPU: PC, ALU, Control Unit
– Memory: including both Data and Instructions

Harvard model
– CPU
– Instruction Memory + Data Memory

Modern processors
– CPU
– Instruction Cache + Data Cache
– Last Level Cache (LLC) to back them up
– Memory: including both Data and Instructions

Memory

Memory is so slow that we keep small temporary storage close by to operate on data.
– Registers: small namespace that lets us do operations on the data.
– Cache: on-chip and NOT a namespace, because you cannot address it directly
– Scratchpad: on-chip memory that is visible/available to the software. A namespace, because you can address it
– Memory: off-chip and a namespace, because you can address it

* With registers, you encode the names directly in the code. With a scratchpad (much bigger), you cannot encode names in the instruction, so you go through the same addressing mechanism as off-chip memory

Type of Instruction Set Architecture

  • Operands
    • Register-Register / Load-store (ex. ARM)
    • Register-Memory (ex. Intel/AMD)
    • Accumulator: one of the operands acts as the accumulator
    • Stack (ex. JVM): push and pop
    • Dataflow (ex. LLVM): no registers, but a unique destination for every instruction
  • Addressing modes
    • Immediate: #25
    • Register indirect: M[R3]
    • Displacement: M[R3 + 10000]
  • Operations
    • Arithmetic: Add, Subtract, Multiply, Divide
    • Logical: AND, OR, Shift left, Shift right
    • Data transfer: Load word, Store word
    • Control flow: Conditional branch, Jump, Procedure call/return
  • Branch method
    • Side-effect: condition codes
    • No side-effect: condition register, compare and branch
  • Instruction format
    • Fixed size
    • Variable size: Intel

* To reduce complexity, follow what compiler will find easier: regularity, primitives
* Static Single Assignment: Static (Compile time), Single Assignment (Uniqueness)
* Register allocation: take an unbounded number of virtual operands (the SSA values/edges) and assign them to real registers.

MIPS ISA

Designed by making trade-offs among the options given above
– Load-store
– Addressing modes: Immediate, Register indirect, Displacement
– Reasonable number of operations (primitives)
– No condition codes
– Fixed instruction encoding (regularity)

[Computer Architecture] Processor Performance and Benchmarks

Processor Performance

* VAX-11: the reference design against which real processors are compared (in the classic performance-over-time chart)
* The logarithmic Y-axis means exponential improvement until about 2011; since then the curve has been flattening

Pareto optimality

Pareto optimal point: points that represent the state of allocation of resources from which it is impossible to reallocate so as to make any one individual or preference criterion better off without making at least one individual or preference criterion worse off.
Pareto frontier: the line of all Pareto optimal points in the design space. Beyond this, you get diminishing returns.

* Regarding Power vs Area: area used to be the cost, but due to the dark silicon phenomenon and other advances in area, power is now the primary concern. You would trade area for power efficiency!
* Designing computer architecture, deciding on which technology to use is about Pareto optimality analysis (cost-benefit analysis, trade-off)

No Free Lunch!

What is performance?

  • How much time it takes to run a benchmark.
  • Instruction per unit time (MIPS, FLOPS) is a proxy definition of performance

Benchmark

  • A single program cannot represent the workloads everybody runs, so we use a suite of benchmarks.
  • Early on, people used synthetic programs: randomly generated instruction sequences to run on the processor. In reality there is no such randomness, so they do not capture the behavior of real programs.
  • It is important to capture commonality among programs to be able to design general purpose CPUs.
  • SPEC: real benchmarks from real world. ex) gcc
    • FP: floating point, including C/C++ and Fortran programs
    • INT: int, including C/C++ programs
    • Running these would give you SPEC Mark

* Moore’s law is about cost for performance, the economics of manufacturing processors. The price per transistor is no longer going down, power limits us from improving performance, and we are getting transistors that we cannot even switch on.

Dataset

Benchmarks alone are not enough. Without a clear dataset, benchmarks have no value. ex) the number of files you compile with gcc changes the time it takes.

So time, benchmark, and data are the three elements that define performance

Time

Wall clock time: start to the end, including the time it takes to load.

How do we measure performance?

There are many vague notions (core count, RAM, storage type, etc.), but they are not real metrics.

  • CT = 1/f (cycle time)
    • circuit and microarchitecture
  • Clocks Per Instruction (CPI) = 1/IPC
    • microarchitecture and architecture
    • We don’t change clock frequency nor program when we measure gains in IPC or CPI
  • Instruction Count (IC)
    • program
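
A tiny worked example (made-up numbers) of how these three factors combine, ExecutionTime = IC x CPI x CT:

    #include <stdio.h>

    int main(void) {
        double ic  = 2e9;          /* dynamic instruction count      */
        double cpi = 1.5;          /* average clocks per instruction */
        double f   = 3e9;          /* 3 GHz, so cycle time = 1/f     */

        double ct   = 1.0 / f;
        double time = ic * cpi * ct;    /* = 1.0 second here */

        printf("execution time = %.2f s\n", time);
        return 0;
    }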

Amdahl’s Law

The overall speedup you get from improving only part of the execution, for any given method (a minimal worked example follows).
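
A minimal sketch with made-up numbers: if a fraction f of execution time is improved by a factor s, the overall speedup is 1 / ((1 - f) + f / s).

    #include <stdio.h>

    double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        /* Speeding up 80% of the program by 10x gives only ~3.6x overall. */
        printf("overall speedup = %.2f\n", amdahl(0.8, 10.0));
        return 0;
    }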

Geometric mean

Even the unweighted arithmetic mean implies a weighting (toward the longer-running benchmarks), whereas ratios of geometric means never change, regardless of which machine is used as the baseline.
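
A small numeric illustration (made-up runtimes) of why the geometric mean is used: the ratio of geometric means comes out the same no matter which machine is chosen as the baseline.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a[2] = {2.0, 8.0};     /* runtimes of two benchmarks on machine A */
        double b[2] = {4.0, 4.0};     /* runtimes on machine B                   */

        double gm_a = sqrt(a[0] * a[1]);          /* 4.0 */
        double gm_b = sqrt(b[0] * b[1]);          /* 4.0 */

        /* Normalizing to A or to B gives the same geometric-mean ratio. */
        double norm_to_a = sqrt((b[0] / a[0]) * (b[1] / a[1]));   /* 1.0 */
        double norm_to_b = sqrt((a[0] / b[0]) * (a[1] / b[1]));   /* 1.0 */

        printf("GM(A)=%.2f GM(B)=%.2f B/A=%.2f A/B=%.2f\n",
               gm_a, gm_b, norm_to_a, norm_to_b);
        return 0;
    }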

[Computer Architecture] Computing Industry

What has made computing pervasive? What is the backbone of the computing industry?

  • Computation: Although we had computation in calculators before, it is about programmability!
  • Communication: Although we had phones to do this, we can do more (social media, games, news…) and it is about networking!
  • What has made computing pervasive is its programmability and networking capability

What makes computers programmable?

Model of computation gives us programmability.

Von-Neumann Model: Notion of computer + Execution model

  • Computer is a collection of components
    • Memory (Random Access Memory: go anywhere in the memory and grab data)
    • Central Processing Unit
      • Control Unit: in charge of giving the instructions
      • Arithmetic Logical Unit (ALU): does the arithmetic logic operations on the data
    • Input/Output System: provides data from the outside
  • Memory that stores program and data
  • Program instructions execute sequentially (sequential execution paradigm)
    • Program counter (PC): keeps track of the current instruction

Programmability vs Efficiency

  • Pipeline: Instruction Fetch, Decode, Execute, Memory, Write Back
  • We do a lot of extra work to support programmability, and this lets us use economies of scale to reduce the price. As a result, computing is commoditized!

* Specialized hardware is similar to an ASIC. An ASIC does one application; specialized hardware does more than one.

What is the difference between the computing industry and the paper towel industry?

  • Computing industry is an industry of new possibilities.

Can we continue being an industry of new possibilities?

  • Moore’s law
    • Doubling number of transistors every two years or 18 months
    • Build higher performance general-purpose processors
    • This itself is of no use; computer architecture offers abstraction and generalization to help programmers write code.
  • Dennard scaling: provides the physics and equations for how shrinking device sizes shrinks voltage, current, and current density, thus keeping power density constant
    • Transistor: two dimensional voltage controlled switch
    • Power = Capacitance x Frequency x Voltage^2
    • Power Density = Power / Area
    • Scaling per generation: 0.7x dimension, 0.7x capacitance, 1.4x frequency, 0.7x voltage
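
A quick check of the arithmetic with the classic 0.7x factors (illustrative, one generation of scaling): per-transistor power drops by roughly the same factor as area, so power density stays roughly constant.

    #include <stdio.h>

    int main(void) {
        double cap = 0.7, freq = 1.4, volt = 0.7;   /* scaling factors  */
        double area = 0.7 * 0.7;                    /* 2D shrink: 0.49x */

        double power         = cap * freq * volt * volt;  /* ~0.48x per transistor */
        double power_density = power / area;               /* ~0.98x, roughly 1     */

        printf("power = %.2fx, power density = %.2fx\n", power, power_density);
        return 0;
    }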

What is the catch?

  • End of Dennard scaling
    • Pushing down threshold voltage exponentially increases the leakage current. So voltage cannot get under 0.9 volts.
    • Lower capacitance is impossible because we cannot reduce the distance
    • We have to stop scaling the frequency.
  • Dark Silicon
    • If you keep shrinking transistors but their power does not shrink proportionally, and lowering the frequency does not compensate, you cannot turn all of them on at once.

Single-core era to multicore era

  • Two issues with the idea
    • Can we continuously increase the number of cores?
    • How will it be programmed?
  • It is a stopgap solution

Radical departures from conventional approaches are necessary

  • Bio computing, quantum computing
  • Specialization and co-design, approximate computing