CMU Computer Systems: Thread-Level Parallelism
Exploiting parallel execution
- Use threads to deal with I/O delays
- Multi-core/Hyperthreaded CPUs offer another opportunity
  - Spread work over threads executing in parallel
  - Happens automatically if there are many independent tasks
  - Can also write code to make one big task go faster
Typical Multicore Processor
Out-of-Order Processor Structure
Hyperthreading Implementation
Characterizing Parallel Program Performance
- p processor cores; Tk is the running time using k cores
- Speedup: Sp = T1/Tp
  - Sp is relative speedup if T1 is the running time of the parallel version of the code running on 1 core
  - Sp is absolute speedup if T1 is the running time of the sequential version of the code running on 1 core
  - Absolute speedup is a much truer measure of the benefits of parallelism
- Efficiency: Ep = Sp/p = T1/(p·Tp) (see the worked example after this list)
  - Reported as a percentage in the range (0, 100]
  - Measures the overhead due to parallelization
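
A concrete illustration of the two formulas, using hypothetical timings rather than measurements from the lecture: a task that takes 80 s sequentially and 25 s on 4 cores has absolute speedup S4 = 3.2 and efficiency E4 = 80%.

```c
#include <stdio.h>

int main(void) {
    double T1 = 80.0;  /* hypothetical sequential running time (s)  */
    double Tp = 25.0;  /* hypothetical parallel running time (s)    */
    int    p  = 4;     /* number of cores used for the parallel run */

    double Sp = T1 / Tp;         /* speedup:    80/25 = 3.2         */
    double Ep = Sp / p * 100.0;  /* efficiency: 3.2/4 = 80%         */

    printf("S%d = %.2f, E%d = %.0f%%\n", p, Sp, p, Ep);
    return 0;
}
```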
Amdahl's Law
- Captures the difficulty of using parallelism to speed things up
- Overall problem
  - T: total sequential time required
  - p: fraction of total that can be sped up (0 ≤ p ≤ 1)
  - k: speedup factor
- Resulting performance
  - Tk = pT/k + (1−p)T
  - Portion which can be sped up runs k times faster
  - Portion which cannot be sped up stays the same
- Least possible running time (see the worked example after this list)
  - k = ∞
  - T∞ = (1−p)T
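
A minimal sketch of the formula with hypothetical numbers (T = 100 s, p = 0.9): no matter how large k gets, the serial 10% of the work bounds the overall speedup at 1/(1−p) = 10×.

```c
#include <stdio.h>

/* Amdahl's Law: time remaining after speeding up fraction p by factor k. */
double amdahl_time(double T, double p, double k) {
    return p * T / k + (1.0 - p) * T;
}

int main(void) {
    double T = 100.0;  /* hypothetical total sequential time (s) */
    double p = 0.9;    /* fraction of the work that parallelizes */
    for (int k = 1; k <= 16; k *= 2)
        printf("k=%2d  Tk=%6.2f s  overall speedup=%.2f\n",
               k, amdahl_time(T, p, k), T / amdahl_time(T, p, k));
    /* As k grows without bound, Tk approaches (1-p)*T = 10 s,
       so the overall speedup can never exceed 1/(1-p) = 10x.  */
    return 0;
}
```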
Experience with Parallel Partitioning
- Could not obtain speedup
- Speculation: too much data copying
  - Could not do everything within the source array
  - Had to set up temporary space for reassembling the partition (see the sketch after this list)
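
The outline doesn't include the partitioning code itself; the sketch below is a hypothetical reconstruction of the copy-heavy step the bullets describe, with elements distributed into a temporary buffer and then copied back into the source array. The function name and element type are illustrative.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical sketch: partition src[0..n-1] around pivot using a
 * temporary buffer tmp (also of length n), then copy the result back.
 * The extra pass over tmp and the memcpy are the kind of data copying
 * the bullets above blame for the missing speedup.                   */
size_t partition_with_temp(long *src, size_t n, long pivot, long *tmp) {
    size_t lo = 0, hi = n;
    for (size_t i = 0; i < n; i++) {
        if (src[i] < pivot)
            tmp[lo++] = src[i];  /* small elements fill from the front */
        else
            tmp[--hi] = src[i];  /* large elements fill from the back  */
    }
    memcpy(src, tmp, n * sizeof(long));  /* reassemble into source array */
    return lo;  /* index of the first element >= pivot */
}
```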
Memory Consistency
- What are the possible values printed? (see the example program after this list)
- Depends on memory consistency model
  - Abstract model of how hardware handles concurrent accesses
- Sequential consistency
  - Overall effect consistent with each individual thread
  - Otherwise, arbitrary interleaving
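
The concrete program behind the question isn't reproduced in this outline; the sketch below is a representative two-thread example of the kind the question is usually posed with (initial values and variable names assumed).

```c
#include <pthread.h>
#include <stdio.h>

int a = 1;
int b = 100;

void *thread1(void *vargp) {
    a = 2;                  /* Wa: write a */
    printf("b = %d\n", b);  /* Rb: read b  */
    return NULL;
}

void *thread2(void *vargp) {
    b = 200;                /* Wb: write b */
    printf("a = %d\n", a);  /* Ra: read a  */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Under sequential consistency, each thread's own program order is preserved but the two threads' operations may interleave arbitrarily, so the printed pair can be (b=100, a=2), (b=200, a=2), or (b=200, a=1). The pair (b=100, a=1) is impossible: printing b=100 means Rb preceded Wb, which forces Wa before Ra, so Ra must see a=2.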
Snoopy Caches
- Tag each cache block with a state (transitions sketched after this list)
  - Invalid: cannot use value
  - Shared: readable copy
  - Exclusive: writeable copy
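
A minimal sketch of how a cache might update a block's state when it snoops another core's bus request, under the simplified three-state protocol above (the function and trace are illustrative, not the lecture's code).

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;
typedef enum { BUS_READ, BUS_WRITE } bus_op_t;

/* Next state of this cache's copy of a block when it snoops another
 * core's request for that same block on the bus.                    */
state_t snoop(state_t s, bus_op_t op) {
    if (s == INVALID)
        return INVALID;   /* no copy held: nothing to give up        */
    if (op == BUS_READ)
        return SHARED;    /* supply the value; readers can coexist   */
    return INVALID;       /* another core is writing: drop our copy  */
}

int main(void) {
    const char *name[] = { "Invalid", "Shared", "Exclusive" };
    state_t s = EXCLUSIVE;    /* we hold the only writeable copy     */
    s = snoop(s, BUS_READ);   /* another core reads:  -> Shared      */
    printf("after bus read:  %s\n", name[s]);
    s = snoop(s, BUS_WRITE);  /* another core writes: -> Invalid     */
    printf("after bus write: %s\n", name[s]);
    return 0;
}
```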