Optimization
- Overview
- Generally Useful Optimizations
  - Code motion/precomputation
  - Strength reduction
  - Sharing of common subexpressions
  - Removing unnecessary procedure calls
- Optimization Blockers
  - Procedure calls
  - Memory aliasing
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals
Performance Realities
- There's more to performance than asymptotic complexity
- Constant factors matter too!
  - Easily see 10:1 performance range depending on how code is written
  - Must optimize at multiple levels: algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
  - How programs are compiled and executed
  - How modern processors + memory systems operate
  - How to measure program performance and identify bottlenecks
  - How to improve performance without destroying code modularity and generality
Optimizing Compilers
- Provide efficient mapping of program to machine
- Don’t (usually) improve asymptotic efficiency
- Have difficulty overcoming “optimization blockers”
Limitations of Optimizing Compilers
- Operate under a fundamental constraint
  - Must not cause any change in program behavior
  - Often prevents optimizations that would only affect behavior under pathological conditions
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- Most analysis is performed only within procedures
  - Whole-program analysis is too expensive in most cases
  - Newer versions of GCC do interprocedural analysis within individual files
- Most analysis is based on static information
- When in doubt, the compiler must be conservative
Generally Useful Optimizations
- Optimizations that you or the compiler should do regardless of processor / compiler
Code Motion
- Reduce frequency with which a computation is performed
  - If it will always produce the same result
  - Especially moving code out of a loop (see the sketch below)
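A minimal sketch of code motion on an index computation, in the spirit of the CS:APP set_row example (variable names here are illustrative):

    /* Before: n*i is recomputed on every iteration of the loop. */
    void set_row(double *a, double *b, long i, long n) {
        for (long j = 0; j < n; j++)
            a[n*i + j] = b[j];
    }

    /* After code motion: the loop-invariant n*i is computed once. */
    void set_row_opt(double *a, double *b, long i, long n) {
        long ni = n * i;
        for (long j = 0; j < n; j++)
            a[ni + j] = b[j];
    }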
Reduction in Strength
- Replace costly operation with simpler one
  - Shift, add instead of multiply or divide
  - Recognize sequences of products (see the sketch below)
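A sketch of strength reduction on the same kind of row-index computation as above: the per-row multiply n*i becomes a running add (function name is an assumption):

    /* ni tracks the value of n*i across iterations, so the multiply
       in the index expression is replaced by one add per row. */
    void init_rows(double *a, const double *b, long n) {
        long ni = 0;                  /* running value of n*i */
        for (long i = 0; i < n; i++) {
            for (long j = 0; j < n; j++)
                a[ni + j] = b[j];
            ni += n;                  /* an add replaces the multiply */
        }
    }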
Share Common Subexpressions
- Reuse portions of expressions
- GCC will do this with -O1
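A sketch modeled on the CS:APP neighbor-sum example (grid layout and names are assumptions; the caller is assumed to keep (i, j) away from the grid boundary):

    /* Sum the 4 neighbors of element (i, j) in a row-major n x n grid.
       Naively, (i-1)*n, i*n, and (i+1)*n each need a multiply; sharing
       the common subexpression i*n + j leaves a single multiply. */
    double neighbor_sum(const double *val, long n, long i, long j) {
        long inj = i*n + j;
        double up    = val[inj - n];
        double down  = val[inj + n];
        double left  = val[inj - 1];
        double right = val[inj + 1];
        return up + down + left + right;
    }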
Optimization Blocker #1: Procedure Calls
- Why couldn't the compiler move strlen out of the inner loop?
  - Procedure may have side effects
    - Alters global state each time called
  - Function may not return the same value for given arguments
    - Depends on other parts of global state
    - Procedure lower could interact with strlen
- Warning
  - Compiler treats a procedure call as a black box
  - Weak optimizations near them
- Remedies
  - Use of inline functions
  - Do your own code motion (see the lower sketch below)
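The strlen question above comes from the classic CS:APP lower example; a sketch of the blocker and the manual-code-motion remedy:

    #include <string.h>

    /* strlen is called on every iteration. Because the loop writes to s,
       the compiler cannot prove the length is unchanged and cannot hoist
       the call, so the loop runs in O(n^2). */
    void lower1(char *s) {
        for (size_t i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    /* The programmer knows the length never changes and hoists it: O(n). */
    void lower2(char *s) {
        size_t len = strlen(s);
        for (size_t i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }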
Optimization Blocker #2: Memory Aliasing
- Aliasing
  - Two different memory references specify a single location
  - Easy to have happen in C
    - Since allowed to do address arithmetic
    - Direct access to storage structures
- Get in the habit of introducing local variables
  - Accumulating within loops
  - Your way of telling the compiler not to check for aliasing (see the sum_rows sketch below)
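A sketch of the aliasing blocker and the local-accumulator remedy, modeled on the CS:APP sum_rows example:

    /* b[i] is read and written through memory on every iteration. If b
       happens to alias part of a, each store could change a later read,
       so the compiler must keep b[i] in memory. */
    void sum_rows1(double *a, double *b, long n) {
        for (long i = 0; i < n; i++) {
            b[i] = 0;
            for (long j = 0; j < n; j++)
                b[i] += a[i*n + j];
        }
    }

    /* A local accumulator removes the aliasing concern: val stays in a
       register and memory is written only once per row. */
    void sum_rows2(double *a, double *b, long n) {
        for (long i = 0; i < n; i++) {
            double val = 0;
            for (long j = 0; j < n; j++)
                val += a[i*n + j];
            b[i] = val;
        }
    }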
Exploiting Instruction-Level Parallelism
- Need a general understanding of modern processor design
  - Hardware can execute multiple instructions in parallel
- Performance limited by data dependencies
- Simple transformations can yield dramatic performance improvement
  - Compilers often cannot make these transformations
  - Lack of associativity and distributivity in floating-point arithmetic
Cycles Per Element (CPE)
- Convenient way to express performance of a program that operates on vectors or lists
- Length = n
- In our case: CPE = cycles per OP
- T = CPE*n + Overhead
  - CPE is the slope of the line
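An illustrative calculation (the numbers here are made up for the arithmetic): if a vector routine takes 5,060 cycles at n = 1,000 and 10,060 cycles at n = 2,000, then CPE = (10,060 - 5,060) / 1,000 = 5.0, with 60 cycles of overhead.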
Superscalar Processor
- Definition: a superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
- Benefit: without programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have
Pipelined Functional Units
- Divide computation into stages
- Pass partial computations from stage to stage
- Stage i can start a new computation once its values are passed to stage i+1
Unrolling & Accumulating
- Idea
  - Can unroll to any degree L
  - Can accumulate K results in parallel
  - L must be a multiple of K (see the sketch below)
- Limitations
  - Diminishing returns
    - Cannot go beyond throughput limitations of execution units
  - Large overhead for short lengths
    - Finish off iterations sequentially
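A sketch of unrolling by L = 2 with K = 2 parallel accumulators, for a vector sum (function and variable names are illustrative):

    /* Two independent accumulators break the sequential dependency
       chain, so additions from different iterations can overlap in
       the pipelined functional units. */
    void sum_unroll2x2(const double *d, long n, double *dest) {
        double acc0 = 0, acc1 = 0;
        long i;
        for (i = 0; i + 1 < n; i += 2) {
            acc0 += d[i];
            acc1 += d[i+1];
        }
        for (; i < n; i++)   /* finish off remaining elements sequentially */
            acc0 += d[i];
        *dest = acc0 + acc1;
    }

Note the caveat from the ILP slide: with floating point this reassociates the additions, which the compiler generally may not do on its own.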
Using Vector Instructions
- Make use of AVX instructions
  - Parallel operations on multiple data elements
  - See Web Aside OPT:SIMD on the CS:APP web page
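A hedged sketch using AVX intrinsics (compile with -mavx; the function name and structure are assumptions, not from the notes):

    #include <immintrin.h>

    /* Sum a vector 4 doubles at a time with 256-bit AVX operations. */
    double vec_sum(const double *d, long n) {
        __m256d acc = _mm256_setzero_pd();
        long i;
        for (i = 0; i + 3 < n; i += 4)
            acc = _mm256_add_pd(acc, _mm256_loadu_pd(d + i));
        double lanes[4];
        _mm256_storeu_pd(lanes, acc);
        double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; i++)   /* sequential tail for leftover elements */
            sum += d[i];
        return sum;
    }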
Branch Prediction
- Idea
  - Guess which way a branch will go
  - Begin executing instructions at the predicted position
  - But don't actually modify register or memory data
Branch Misprediction Recovery
- Performance Cost
  - Multiple clock cycles on a modern processor
  - Can be a major performance limiter (see the branch-free sketch below)
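A sketch of avoiding an unpredictable branch by computing both outcomes, in the style of the CS:APP minmax example; whether the compiler actually emits conditional moves is compiler- and flag-dependent:

    /* Branchy: on random data the comparison mispredicts roughly half
       the time, paying the recovery cost on each miss. */
    void minmax_branchy(long *a, long *b, long n) {
        for (long i = 0; i < n; i++)
            if (a[i] > b[i]) {
                long t = a[i];
                a[i] = b[i];
                b[i] = t;
            }
    }

    /* Branch-free style: both values are computed unconditionally,
       letting the compiler use conditional moves instead of a jump. */
    void minmax_cmov(long *a, long *b, long n) {
        for (long i = 0; i < n; i++) {
            long min = a[i] < b[i] ? a[i] : b[i];
            long max = a[i] < b[i] ? b[i] : a[i];
            a[i] = min;
            b[i] = max;
        }
    }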
Getting High Performance
- Good compiler and flags
- Don't do anything stupid
  - Watch out for hidden algorithmic inefficiencies
- Write compiler-friendly code
  - Watch out for optimization blockers
- Look carefully at innermost loops
- Tune code for machine
  - Exploit instruction-level parallelism
  - Avoid unpredictable branches
  - Make code cache friendly (covered later in course)