CMU Computer Systems: Program Optimization

Optimization

  • Overview

  • Generally Useful Optimizations

    • Code motion/precomputation
    • Strength reduction
    • Sharing of common subexpressions
    • Removing unnecessary procedure calls
  • Optimization Blockers

    • Procedure calls
    • Memory aliasing
  • Exploiting Instruction-Level Parallelism

  • Dealing with Conditionals

Performance Realities

  • There’s more to performance than asymptotic complexity

  • Constant factors matter too!

    • Easily see 10:1 performance range depending on how code is written

    • Must optimize at multiple levels:

      • algorithm, data representations, procedures, and loops
  • Must understand system to optimize performance

    • How programs are compiled and executed
    • How modern processors + memory systems operate
    • How to measure program performance and identify bottlenecks
    • How to improve performance without destroying code modularity and generality

Optimizing Compilers

  • Provide efficient mapping of program to machine
  • Don’t (usually) improve asymptotic efficiency
  • Have difficulty overcoming “optimization blockers”

Limitations of Optimizing Compilers

  • Operate under fundamental constraint

    • Must not cause any change in program behavior
    • Often prevents it from making optimizations that would only affect behavior under pathological conditions
  • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles

  • Most analysis is performed only within procedures

    • Whole-program analysis is too expensive in most cases
    • Newer versions of GCC do interprocedural analysis within individual files
  • Most analysis is based on static information

  • When in doubt, the compiler must be conservative

Generally Useful Optimizations

  • Optimizations that you or the compiler should do regardless of processor / compiler

  • Code Motion

    • Reduce frequency with which computation performed

      • If it will always produce the same result
      • Especially moving code out of loop
    • Reduction in Strength

      • Replace costly operation with simpler one
      • Shift, add instead of multiply or divide
      • Recognize sequence of products
    • Share Common Subexpressions

      • Reuse portions of expressions
      • GCC will do this with -O1
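
The three optimizations above can be sketched in C (the function names and the matrix tasks here are illustrative, not from the lecture):

```c
#include <stddef.h>

/* Code motion: a[n*i + j] addresses row i of an n x n row-major
   matrix; the product n*i is loop-invariant, so hoist it. */
void set_row_slow(long *a, const long *b, long i, long n) {
    for (long j = 0; j < n; j++)
        a[n * i + j] = b[j];      /* n*i recomputed every iteration */
}

void set_row_fast(long *a, const long *b, long i, long n) {
    long ni = n * i;              /* computed once, outside the loop */
    for (long j = 0; j < n; j++)
        a[ni + j] = b[j];
}

/* Strength reduction: the outer-loop products i*n form the sequence
   0, n, 2n, ...; replace each multiply with a running addition. */
void clear_matrix(long *a, long n) {
    long in = 0;                  /* tracks i*n */
    for (long i = 0; i < n; i++) {
        for (long j = 0; j < n; j++)
            a[in + j] = 0;        /* was a[i*n + j] */
        in += n;                  /* add instead of multiply */
    }
}
```

Sharing common subexpressions works the same way: compute a repeated sub-term (like `ni` above) once, bind it to a local, and reuse it.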

Optimization Blocker #1: Procedure Calls

  • Why couldn’t the compiler move strlen out of the inner loop?

    • Procedure may have side effects

      • Alters global state each time called
    • Function may not return same value for given arguments

      • Depends on other parts of global state
      • Procedure lower could interact with strlen
  • Warning

    • Compiler treats procedure call as a black box
    • Weak optimizations near them
  • Remedies

    • Use of inline functions
    • Do your own code motion
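
The lecture’s running example is lowercasing a string. Calling strlen in the loop test makes the function quadratic: the compiler cannot hoist the call, since strlen is a black box and the loop body writes to the very string it measures. A sketch of the slow and fixed versions:

```c
#include <string.h>

/* Blocker: strlen is called on every iteration -- O(n^2) overall. */
void lower1(char *s) {
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Remedy: do your own code motion -- compute the length once. */
void lower2(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

Both produce the same result; only lower2 is linear in the string length.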

Optimization Blocker #2: Memory Aliasing

  • Aliasing

    • Two different memory references specify single location

    • Easy to have happen in C

      • Since allowed to do address arithmetic
      • Direct access to storage structures
    • Get in habit of introducing local variables

      • Accumulating within loops
      • Your way of telling compiler not to check for aliasing
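
The lecture illustrates aliasing by summing the rows of an n x n matrix into a vector b. Because b might point into a, the first version must re-read and re-write b[i] through memory on every inner iteration; the local accumulator removes that obligation:

```c
/* Aliasing blocker: b[i] is updated in memory each iteration, because
   b could alias a row of a and the compiler must assume it does. */
void sum_rows1(const double *a, double *b, long n) {
    for (long i = 0; i < n; i++) {
        b[i] = 0;
        for (long j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

/* Remedy: accumulate in a local variable; a single store per row
   tells the compiler no aliasing check is needed in the inner loop. */
void sum_rows2(const double *a, double *b, long n) {
    for (long i = 0; i < n; i++) {
        double val = 0;
        for (long j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```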

Exploiting Instruction-Level Parallelism

  • Need general understanding of modern processor design

    • Hardware can execute multiple instructions in parallel
  • Performance limited by data dependencies

  • Simple transformations can yield dramatic performance improvement

    • Compilers often cannot make these transformations
    • Lack of associativity and distributivity in floating-point arithmetic

Cycles Per Element (CPE)

  • Convenient way to express performance of a program that operates on vectors or lists

  • Length = n

  • In our case: CPE = cycles per OP

  • T = CPE*n + Overhead

    • CPE is slope of line

Superscalar Processor

  • Definition

    • A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
  • Benefit

    • Without programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have

Pipelined Functional Units

  • Divide computation into stages
  • Pass partial computations from stage to stage
  • Stage i can start on new computation once values passed to i+1

Unrolling & Accumulating

  • Idea

    • Can unroll to any degree L
    • Can accumulate K results in parallel
    • L must be multiple of K
  • Limitations

    • Diminishing returns

      • Cannot go beyond throughput limitations of execution units
    • Large overhead for short lengths

      • Finish off iterations sequentially
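
A sketch of unrolling with parallel accumulators for a product reduction, here with L = 2 and K = 2 (the function name is illustrative):

```c
/* 2x2 unrolling: unroll by L = 2 and keep K = 2 accumulators, so the
   two running products form independent dependency chains that a
   superscalar processor can execute in parallel. */
double combine_2x2(const double *d, long n) {
    double acc0 = 1, acc1 = 1;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 *= d[i];
        acc1 *= d[i+1];
    }
    for (; i < n; i++)        /* finish off leftover iterations sequentially */
        acc0 *= d[i];
    return acc0 * acc1;       /* combine the K partial results */
}
```

Note this changes the order of the floating-point multiplies, which is exactly the reassociation a compiler is not allowed to do on its own.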

Using Vector Instructions

  • Make use of AVX Instructions

    • Parallel operations on multiple data elements
    • See Web Aside OPT:SIMD on the CS:APP web page

Branch Prediction

  • Idea

    • Guess which way branch will go
    • Begin executing instructions at predicted position
      • But don’t actually modify register or memory data

Branch Misprediction Recovery

  • Performance Cost

    • Multiple clock cycles on modern processor
    • Can be a major performance limiter
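
One common remedy is to replace an unpredictable branch with data movement: compilers typically turn a simple ternary into a conditional-move instruction, which has no misprediction cost. A hypothetical absolute-difference helper shows both forms:

```c
/* Branch version: on random inputs this test is mispredicted roughly
   half the time, each miss costing multiple cycles of recovery. */
long absdiff_branch(long a, long b) {
    if (a > b)
        return a - b;
    return b - a;
}

/* Branchless version: compute both outcomes, then select one; a
   ternary this simple usually compiles to a conditional move (cmov),
   leaving no branch to mispredict. */
long absdiff_cmov(long a, long b) {
    long d1 = a - b;
    long d2 = b - a;
    return a > b ? d1 : d2;
}
```

The trade-off: both sides are always computed, so this only pays off when the work per side is small and the branch is genuinely unpredictable.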

Getting High Performance

  • Good compiler and flags

  • Don’t do anything stupid

    • Watch out for hidden algorithmic inefficiencies

    • Write compiler-friendly code

      • Watch out for optimization blockers
    • Look carefully at innermost loops

  • Tune code for machine

    • Exploit instruction-level parallelism
    • Avoid unpredictable branches
    • Make code cache friendly (Covered later in course)