CMU Computer Systems: Thread-Level Parallelism
Exploiting parallel execution
- Use threads to deal with I/O delays
- Multi-core/Hyperthreaded CPUs offer another opportunity
  - Spread work over threads executing in parallel
  - Happens automatically if there are many independent tasks
  - Can also write code to make one big task go faster
Typical Multicore Processor
Out-of-Order Processor Structure
Hyperthreading Implementation
Characterizing Parallel Program Performance
- p processor cores; Tk is the running time using k cores
- Speedup: Sp = T1/Tp
  - Sp is relative speedup if T1 is the running time of the parallel version of the code running on 1 core
  - Sp is absolute speedup if T1 is the running time of the sequential version of the code running on 1 core
  - Absolute speedup is a much truer measure of the benefits of parallelism
- Efficiency: Ep = Sp/p = T1/(p·Tp) (see the worked example after this list)
  - Reported as a percentage in the range (0, 100]
  - Measures the overhead due to parallelization
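
A concrete illustration of the two formulas, using hypothetical timings rather than measurements from the lecture: a task that takes 80 s sequentially and 25 s on 4 cores has absolute speedup S4 = 3.2 and efficiency E4 = 80%.

```c
#include <stdio.h>

int main(void) {
    double T1 = 80.0;  /* hypothetical sequential running time (s)  */
    double Tp = 25.0;  /* hypothetical parallel running time (s)    */
    int    p  = 4;     /* number of cores used for the parallel run */

    double Sp = T1 / Tp;         /* speedup:    80/25 = 3.2         */
    double Ep = Sp / p * 100.0;  /* efficiency: 3.2/4 = 80%         */

    printf("S%d = %.2f, E%d = %.0f%%\n", p, Sp, p, Ep);
    return 0;
}
```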
Amdahl's Law
- Captures the difficulty of using parallelism to speed things up
- Overall problem
  - T: total sequential time required
  - p: fraction of total that can be sped up (0 ≤ p ≤ 1)
  - k: speedup factor
- Resulting performance
  - Tk = pT/k + (1−p)T
  - Portion which can be sped up runs k times faster
  - Portion which cannot be sped up stays the same
- Least possible running time (see the worked example after this list)
  - k = ∞
  - T∞ = (1−p)T
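
A minimal sketch of the formula with hypothetical numbers (T = 100 s, p = 0.9): no matter how large k gets, the serial 10% of the work bounds the overall speedup at 1/(1−p) = 10×.

```c
#include <stdio.h>

/* Amdahl's Law: time remaining after speeding up fraction p by factor k. */
double amdahl_time(double T, double p, double k) {
    return p * T / k + (1.0 - p) * T;
}

int main(void) {
    double T = 100.0;  /* hypothetical total sequential time (s) */
    double p = 0.9;    /* fraction of the work that parallelizes */
    for (int k = 1; k <= 16; k *= 2)
        printf("k=%2d  Tk=%6.2f s  overall speedup=%.2f\n",
               k, amdahl_time(T, p, k), T / amdahl_time(T, p, k));
    /* As k grows without bound, Tk approaches (1-p)*T = 10 s,
       so the overall speedup can never exceed 1/(1-p) = 10x.  */
    return 0;
}
```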
Experience with Parallel Partitioning
- Could not obtain speedup
- Speculation: too much data copying
  - Could not do everything within the source array
  - Had to set up temporary space for reassembling the partition (see the sketch after this list)
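
The outline doesn't include the partitioning code itself; the sketch below is a hypothetical reconstruction of the copy-heavy step the bullets describe, with elements distributed into a temporary buffer and then copied back into the source array. The function name and element type are illustrative.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical sketch: partition src[0..n-1] around pivot using a
 * temporary buffer tmp (also of length n), then copy the result back.
 * The extra pass over tmp and the memcpy are the kind of data copying
 * the bullets above blame for the missing speedup.                   */
size_t partition_with_temp(long *src, size_t n, long pivot, long *tmp) {
    size_t lo = 0, hi = n;
    for (size_t i = 0; i < n; i++) {
        if (src[i] < pivot)
            tmp[lo++] = src[i];  /* small elements fill from the front */
        else
            tmp[--hi] = src[i];  /* large elements fill from the back  */
    }
    memcpy(src, tmp, n * sizeof(long));  /* reassemble into source array */
    return lo;  /* index of the first element >= pivot */
}
```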
Memory Consistency
- What are the possible values printed? (see the example program after this list)
- Depends on memory consistency model
  - Abstract model of how hardware handles concurrent accesses
- Sequential consistency
  - Overall effect consistent with each individual thread
  - Otherwise, arbitrary interleaving
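
The concrete program behind the question isn't reproduced in this outline; the sketch below is a representative two-thread example of the kind the question is usually posed with (initial values and variable names assumed).

```c
#include <pthread.h>
#include <stdio.h>

int a = 1;
int b = 100;

void *thread1(void *vargp) {
    a = 2;                  /* Wa: write a */
    printf("b = %d\n", b);  /* Rb: read b  */
    return NULL;
}

void *thread2(void *vargp) {
    b = 200;                /* Wb: write b */
    printf("a = %d\n", a);  /* Ra: read a  */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Under sequential consistency, each thread's own program order is preserved but the two threads' operations may interleave arbitrarily, so the printed pair can be (b=100, a=2), (b=200, a=2), or (b=200, a=1). The pair (b=100, a=1) is impossible: printing b=100 means Rb preceded Wb, which forces Wa before Ra, so Ra must see a=2.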
Snoopy Caches
- Tag each cache block with a state (transitions sketched after this list)
  - Invalid: cannot use value
  - Shared: readable copy
  - Exclusive: writeable copy
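
A minimal sketch of how a cache might update a block's state when it snoops another core's bus request, under the simplified three-state protocol above (the function and trace are illustrative, not the lecture's code).

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;
typedef enum { BUS_READ, BUS_WRITE } bus_op_t;

/* Next state of this cache's copy of a block when it snoops another
 * core's request for that same block on the bus.                    */
state_t snoop(state_t s, bus_op_t op) {
    if (s == INVALID)
        return INVALID;   /* no copy held: nothing to give up        */
    if (op == BUS_READ)
        return SHARED;    /* supply the value; readers can coexist   */
    return INVALID;       /* another core is writing: drop our copy  */
}

int main(void) {
    const char *name[] = { "Invalid", "Shared", "Exclusive" };
    state_t s = EXCLUSIVE;    /* we hold the only writeable copy     */
    s = snoop(s, BUS_READ);   /* another core reads:  -> Shared      */
    printf("after bus read:  %s\n", name[s]);
    s = snoop(s, BUS_WRITE);  /* another core writes: -> Invalid     */
    printf("after bus write: %s\n", name[s]);
    return 0;
}
```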