Optimization
- Overview
- Generally Useful Optimizations
  - Code motion/precomputation
  - Strength reduction
  - Sharing of common subexpressions
  - Removing unnecessary procedure calls
- Optimization Blockers
  - Procedure calls
  - Memory aliasing
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals
Performance Realities
- There's more to performance than asymptotic complexity
- Constant factors matter too!
  - Easily see 10:1 performance range depending on how code is written
  - Must optimize at multiple levels: algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
  - How programs are compiled and executed
  - How modern processors + memory systems operate
  - How to measure program performance and identify bottlenecks
  - How to improve performance without destroying code modularity and generality
Optimizing Compilers
- Provide efficient mapping of program to machine
- Don’t (usually) improve asymptotic efficiency
- Have difficulty overcoming “optimization blockers”
Limitations of Optimizing Compilers
- Operate under a fundamental constraint
  - Must not cause any change in program behavior
  - Often prevents optimizations that would only affect behavior under pathological conditions
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- Most analysis is performed only within procedures
  - Whole-program analysis is too expensive in most cases
  - Newer versions of GCC do interprocedural analysis within individual files
- Most analysis is based on static information
- When in doubt, the compiler must be conservative
Generally Useful Optimizations
- Optimizations that you or the compiler should do regardless of processor / compiler
Code Motion
- Reduce frequency with which a computation is performed
  - If it will always produce the same result
  - Especially moving code out of a loop (see the sketch below)
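A minimal sketch of code motion on an index computation, in the spirit of the CS:APP set_row example (variable names here are illustrative):

    /* Before: n*i is recomputed on every iteration of the loop. */
    void set_row(double *a, double *b, long i, long n) {
        for (long j = 0; j < n; j++)
            a[n*i + j] = b[j];
    }

    /* After code motion: the loop-invariant n*i is computed once. */
    void set_row_opt(double *a, double *b, long i, long n) {
        long ni = n * i;
        for (long j = 0; j < n; j++)
            a[ni + j] = b[j];
    }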
Reduction in Strength
- Replace costly operation with simpler one
  - Shift, add instead of multiply or divide
  - Recognize sequences of products (see the sketch below)
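A sketch of strength reduction on the same kind of row-index computation as above: the per-row multiply n*i becomes a running add (function name is an assumption):

    /* ni tracks the value of n*i across iterations, so the multiply
       in the index expression is replaced by one add per row. */
    void init_rows(double *a, const double *b, long n) {
        long ni = 0;                  /* running value of n*i */
        for (long i = 0; i < n; i++) {
            for (long j = 0; j < n; j++)
                a[ni + j] = b[j];
            ni += n;                  /* an add replaces the multiply */
        }
    }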
Share Common Subexpressions
- Reuse portions of expressions
- GCC will do this with -O1
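A sketch modeled on the CS:APP neighbor-sum example (grid layout and names are assumptions; the caller is assumed to keep (i, j) away from the grid boundary):

    /* Sum the 4 neighbors of element (i, j) in a row-major n x n grid.
       Naively, (i-1)*n, i*n, and (i+1)*n each need a multiply; sharing
       the common subexpression i*n + j leaves a single multiply. */
    double neighbor_sum(const double *val, long n, long i, long j) {
        long inj = i*n + j;
        double up    = val[inj - n];
        double down  = val[inj + n];
        double left  = val[inj - 1];
        double right = val[inj + 1];
        return up + down + left + right;
    }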
Optimization Blocker #1: Procedure Calls
- Why couldn't the compiler move strlen out of the inner loop?
  - Procedure may have side effects
    - Alters global state each time called
  - Function may not return the same value for given arguments
    - Depends on other parts of global state
    - Procedure lower could interact with strlen
- Warning
  - Compiler treats a procedure call as a black box
  - Weak optimizations near them
- Remedies
  - Use of inline functions
  - Do your own code motion (see the lower sketch below)
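The strlen question above comes from the classic CS:APP lower example; a sketch of the blocker and the manual-code-motion remedy:

    #include <string.h>

    /* strlen is called on every iteration. Because the loop writes to s,
       the compiler cannot prove the length is unchanged and cannot hoist
       the call, so the loop runs in O(n^2). */
    void lower1(char *s) {
        for (size_t i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    /* The programmer knows the length never changes and hoists it: O(n). */
    void lower2(char *s) {
        size_t len = strlen(s);
        for (size_t i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }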
Optimization Blocker #2: Memory Aliasing
- Aliasing
  - Two different memory references specify a single location
  - Easy to have happen in C
    - Since allowed to do address arithmetic
    - Direct access to storage structures
- Get in the habit of introducing local variables
  - Accumulating within loops
  - Your way of telling the compiler not to check for aliasing (see the sum_rows sketch below)
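A sketch of the aliasing blocker and the local-accumulator remedy, modeled on the CS:APP sum_rows example:

    /* b[i] is read and written through memory on every iteration. If b
       happens to alias part of a, each store could change a later read,
       so the compiler must keep b[i] in memory. */
    void sum_rows1(double *a, double *b, long n) {
        for (long i = 0; i < n; i++) {
            b[i] = 0;
            for (long j = 0; j < n; j++)
                b[i] += a[i*n + j];
        }
    }

    /* A local accumulator removes the aliasing concern: val stays in a
       register and memory is written only once per row. */
    void sum_rows2(double *a, double *b, long n) {
        for (long i = 0; i < n; i++) {
            double val = 0;
            for (long j = 0; j < n; j++)
                val += a[i*n + j];
            b[i] = val;
        }
    }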
Exploiting Instruction-Level Parallelism
- Need a general understanding of modern processor design
  - Hardware can execute multiple instructions in parallel
- Performance limited by data dependencies
- Simple transformations can yield dramatic performance improvement
  - Compilers often cannot make these transformations
  - Lack of associativity and distributivity in floating-point arithmetic
Cycles Per Element (CPE)
- Convenient way to express performance of a program that operates on vectors or lists
- Length = n
- In our case: CPE = cycles per OP
- T = CPE*n + Overhead
  - CPE is the slope of the line
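An illustrative calculation (the numbers here are made up for the arithmetic): if a vector routine takes 5,060 cycles at n = 1,000 and 10,060 cycles at n = 2,000, then CPE = (10,060 - 5,060) / 1,000 = 5.0, with 60 cycles of overhead.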
Superscalar Processor
- Definition: a superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
- Benefit: without programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have
Pipelined Functional Units
- Divide computation into stages
- Pass partial computations from stage to stage
- Stage i can start a new computation once its values are passed to stage i+1
Unrolling & Accumulating
- Idea
  - Can unroll to any degree L
  - Can accumulate K results in parallel
  - L must be a multiple of K (see the sketch below)
- Limitations
  - Diminishing returns
    - Cannot go beyond throughput limitations of execution units
  - Large overhead for short lengths
    - Finish off iterations sequentially
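A sketch of unrolling by L = 2 with K = 2 parallel accumulators, for a vector sum (function and variable names are illustrative):

    /* Two independent accumulators break the sequential dependency
       chain, so additions from different iterations can overlap in
       the pipelined functional units. */
    void sum_unroll2x2(const double *d, long n, double *dest) {
        double acc0 = 0, acc1 = 0;
        long i;
        for (i = 0; i + 1 < n; i += 2) {
            acc0 += d[i];
            acc1 += d[i+1];
        }
        for (; i < n; i++)   /* finish off remaining elements sequentially */
            acc0 += d[i];
        *dest = acc0 + acc1;
    }

Note the caveat from the ILP slide: with floating point this reassociates the additions, which the compiler generally may not do on its own.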
Using Vector Instructions
- Make use of AVX instructions
  - Parallel operations on multiple data elements
  - See Web Aside OPT:SIMD on the CS:APP web page
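A hedged sketch using AVX intrinsics (compile with -mavx; the function name and structure are assumptions, not from the notes):

    #include <immintrin.h>

    /* Sum a vector 4 doubles at a time with 256-bit AVX operations. */
    double vec_sum(const double *d, long n) {
        __m256d acc = _mm256_setzero_pd();
        long i;
        for (i = 0; i + 3 < n; i += 4)
            acc = _mm256_add_pd(acc, _mm256_loadu_pd(d + i));
        double lanes[4];
        _mm256_storeu_pd(lanes, acc);
        double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; i++)   /* sequential tail for leftover elements */
            sum += d[i];
        return sum;
    }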
Branch Prediction
- Idea
  - Guess which way a branch will go
  - Begin executing instructions at the predicted position
  - But don't actually modify register or memory data
Branch Misprediction Recovery
- Performance Cost
  - Multiple clock cycles on a modern processor
  - Can be a major performance limiter (see the branch-free sketch below)
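A sketch of avoiding an unpredictable branch by computing both outcomes, in the style of the CS:APP minmax example; whether the compiler actually emits conditional moves is compiler- and flag-dependent:

    /* Branchy: on random data the comparison mispredicts roughly half
       the time, paying the recovery cost on each miss. */
    void minmax_branchy(long *a, long *b, long n) {
        for (long i = 0; i < n; i++)
            if (a[i] > b[i]) {
                long t = a[i];
                a[i] = b[i];
                b[i] = t;
            }
    }

    /* Branch-free style: both values are computed unconditionally,
       letting the compiler use conditional moves instead of a jump. */
    void minmax_cmov(long *a, long *b, long n) {
        for (long i = 0; i < n; i++) {
            long min = a[i] < b[i] ? a[i] : b[i];
            long max = a[i] < b[i] ? b[i] : a[i];
            a[i] = min;
            b[i] = max;
        }
    }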
Getting High Performance
- Good compiler and flags
- Don't do anything stupid
  - Watch out for hidden algorithmic inefficiencies
- Write compiler-friendly code
  - Watch out for optimization blockers
- Look carefully at innermost loops
- Tune code for machine
  - Exploit instruction-level parallelism
  - Avoid unpredictable branches
  - Make code cache friendly (covered later in course)