SDSoC environment

reference: ug1027

Software Acceleration

  • Use existing source-code libraries of hardware functions, such as the Xilinx xfOpenCV library
  • Modify your code to make better use of the PL (programmable logic) device architecture

sds++ system compiler

  • -c option:
    • invokes HLS to compile a file into hardware IP
    • translates #pragma SDS directives into pragmas understood by HLS

HLS

  • scheduling
  • pipelining
  • dataflow
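
As a rough illustration of where pipelining and dataflow show up in source code, here is a minimal sketch. The function names, the buffer length `N`, and the arithmetic are hypothetical; `#pragma HLS PIPELINE` and `#pragma HLS DATAFLOW` are the HLS directives I assume are meant here, and scheduling itself is done by the tool without a directive. A regular host compiler ignores the pragmas, so the code also runs as plain C++.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 1024;  // hypothetical fixed buffer length

// Hypothetical hardware function: one pipelined loop.
void scale_add(const int in[N], int out[N], int gain, int offset) {
    assert(in != out);  // the sketch assumes separate input and output buffers
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * gain + offset;
    }
}

// Two hypothetical stages that HLS may overlap when DATAFLOW is applied.
static void stage_scale(const int in[N], int tmp[N], int gain) {
    for (std::size_t i = 0; i < N; ++i) tmp[i] = in[i] * gain;
}

static void stage_offset(const int tmp[N], int out[N], int offset) {
    for (std::size_t i = 0; i < N; ++i) out[i] = tmp[i] + offset;
}

void scale_add_dataflow(const int in[N], int out[N], int gain, int offset) {
#pragma HLS DATAFLOW
    int tmp[N];  // intermediate buffer between the two stages
    stage_scale(in, tmp, gain);
    stage_offset(tmp, out, offset);
}
```

Both variants compute the same result; the difference is only in how HLS is asked to overlap the work.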

sds++ linker

  • analyzes dataflow into and between hardware functions
    • identifies operations that can be shared
  • orchestrates accelerators and data transfers through data movers
    • generates software control code (stubs)
    • inserts wait barrier API calls into the stubs

Execution Model

Hardware functions are compiled into hardware accelerators. The program accesses each accelerator as a task, under the standard C runtime, simply by calling the corresponding function.

  • CPU & accelerators

    through the function arguments, after task completion

  • memory & accelerators

    through data movers, e.g. a DMA engine

    • data movers are inserted into the system automatically by the sds++ system compiler, which takes into account user data mover pragmas such as zero_copy.
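
A minimal sketch of a data mover hint, assuming the `#pragma SDS data zero_copy(array[offset:length])` form placed immediately before the function declaration. The function `invert` and the length `LEN` are hypothetical, and the exact pragma spelling should be checked against UG1027. As far as I know, zero_copy arguments also need physically contiguous buffers (for example from sds_alloc()), which this host-side sketch does not enforce.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t LEN = 256;  // hypothetical buffer length

// Hint to the system compiler: let the accelerator access src and dst
// directly in shared memory instead of copying them through a data mover.
#pragma SDS data zero_copy(src[0:LEN], dst[0:LEN])
void invert(const unsigned char src[LEN], unsigned char dst[LEN]);

// Hypothetical hardware function body; a host compiler ignores the pragma
// and runs this as ordinary software.
void invert(const unsigned char src[LEN], unsigned char dst[LEN]) {
    assert(src != dst);  // the sketch assumes separate buffers
    for (std::size_t i = 0; i < LEN; ++i) {
        dst[i] = static_cast<unsigned char>(255 - src[i]);
    }
}
```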

system compiler

  • intercepts each call to a hardware function, and
  • replaces it with a call to a generated stub function that has an identical signature but a derived name.

stub function

  • synchronizes software and accelerator hardware at the exit of the hardware function call.

  • controls all accelerators and data movers through a set of send and receive APIs provided by the sds_lib library.

      one optimization:
      when an array argument is passed between hardware function calls
      and the program does not access it after those calls
          (other than in destructors or `free()` calls),
      the data flows between the accelerators through a direct hardware stream
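
The code pattern that permits this optimization can be sketched as follows. Both functions and the length `N` are hypothetical; the point is only that the CPU never touches `tmp` between or after the two calls, apart from freeing it. Whether the tools actually build a direct stream depends on the real design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

constexpr std::size_t N = 64;  // hypothetical buffer length

// Two hypothetical hardware functions chained through `tmp`.
void scale(const int in[N], int tmp[N]) {
    for (std::size_t i = 0; i < N; ++i) tmp[i] = in[i] * 3;
}

void add_one(const int tmp[N], int out[N]) {
    for (std::size_t i = 0; i < N; ++i) out[i] = tmp[i] + 1;
}

void run_chain(const int in[N], int out[N]) {
    int* tmp = static_cast<int*>(std::malloc(N * sizeof(int)));
    assert(tmp != nullptr);

    scale(in, tmp);     // producer writes tmp
    add_one(tmp, out);  // consumer reads tmp

    // The CPU does not read or write tmp here; it is only released.
    std::free(tmp);
}
```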
    

SDSoC program execution steps

  1. Initialization of the sds_lib library occurs during the program constructor before entering main().

  2. Every call to a hardware function is replaced by a call to a stub function with the same signature (other than the name) as the original function. Within the stub function, the following steps occur:

    a. A synchronous accelerator task control command is sent to the hardware.

    b. For each argument to the hardware function, an asynchronous data transfer request is sent to the appropriate data mover, with an associated wait() handle. A non-void return value is treated as an implicit output scalar argument.

    c. A barrier wait() is issued for each transfer request. If a data transfer between accelerators is implemented as a direct hardware stream, the barrier wait() for this transfer occurs in the stub function of the last accelerator function in the chain for this argument (only the last stub needs the wait).

  3. Clean up of the sds_lib library occurs during the program destructor, upon exiting main().
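
To make the ordering inside a stub (steps a to c) concrete, here is a toy model that only records the sequence of operations. It is not the generated stub code and does not use the real sds_lib send and receive APIs; `StubModel`, its methods, and the modeled function `f` are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model: logs what a stub would do, in order. No hardware is involved.
struct StubModel {
    std::vector<std::string> log;      // ordered record of operations
    std::vector<std::string> pending;  // transfers with an outstanding wait() handle

    void send_command(const std::string& accel) { log.push_back("cmd:" + accel); }

    void request_transfer(const std::string& arg) {
        log.push_back("req:" + arg);   // asynchronous request, returns at once
        pending.push_back(arg);
    }

    void wait_all() {
        for (const std::string& arg : pending) log.push_back("wait:" + arg);
        pending.clear();
    }
};

// Hypothetical stub for a hardware function `int f(int a[], int b[])`.
void model_stub_call(StubModel& m) {
    m.send_command("f");           // step a: task control command
    m.request_transfer("a");       // step b: one request per argument
    m.request_transfer("b");
    m.request_transfer("return");  // non-void return is an implicit output scalar
    m.wait_all();                  // step c: barrier wait() per transfer request
    assert(m.pending.empty());
}
```

After one call the log reads: cmd:f, req:a, req:b, req:return, wait:a, wait:b, wait:return.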

Customized concurrent task execution

#pragma SDS async(ID): generates a stub function without any barrier wait() calls for data transfers.

The stub issues all data transfer requests and then returns control to the program, so the program can keep executing while the accelerator is running.

Your responsibility: insert #pragma SDS wait(ID) within the program at the appropriate synchronization points. These are resolved into sds_wait(ID) API calls that correctly synchronize the hardware accelerators, their implicit data movers, and the CPU.
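
A minimal sketch of the async/wait pattern, assuming the pragmas are placed immediately before the hardware function call and at the synchronization point. `accel_double`, `run_overlapped`, and `N` are hypothetical. A host compiler ignores both pragmas and simply runs the call synchronously, so the result is the same but without any overlap.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 128;  // hypothetical buffer length

// Hypothetical hardware function.
void accel_double(const int in[N], int out[N]) {
    for (std::size_t i = 0; i < N; ++i) out[i] = in[i] * 2;
}

int run_overlapped(const int in[N], int out[N]) {
    assert(in != out);
    int cpu_sum = 0;

#pragma SDS async(1)
    accel_double(in, out);  // under sds++, returns once the requests are issued

    // CPU work that overlaps with the accelerator. It must not touch `out`
    // until the matching wait below.
    for (std::size_t i = 0; i < N; ++i) cpu_sum += in[i];

#pragma SDS wait(1)
    // From here on, `out` is safe to read.
    return cpu_sum + out[0];
}
```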

Build Process

build: using sds++ system compiler

  • compilation
    • The main application (running on the Arm core) and each hardware accelerator are compiled separately
    • Application code is compiled into an object (.o) file using the standard GNU Arm compilation tools
    • Hardware-accelerated functions are run through HLS, which also produces an object (.o) file
  • linking
    • Modifying the hardware platform to accept the accelerators
    • Implementing the hardware accelerators in the PL: synthesis, implementation, and bitstream generation (Vivado)
    • Updating the software images with hardware access APIs to call hardware functions
    • Producing an integrated SD card image that can boot the board with the application in an Executable and Linkable Format (ELF) file.

PS: The Data Motion Network report lists the accelerated functions and how their arguments were mapped and connected to platform interfaces

sds++ system compiler:

  • HLS & Vivado: implement the generated hardware system
  • Arm compiler & sds++ linker: create application binaries that run on the CPU and invoke the accelerator (through stubs) for each hardware function; the output is a complete bootable system for an SD card.

Best Practices

  • streaming data
  • Reuse data
  • task-level parallelization

software-centric approach

  • Use good memory management techniques, e.g. sds_alloc()/sds_free()
  • Use system emulation to check that the design is functionally correct
  • Write or migrate hardware functions into separate C/C++ files so that incremental changes do not require recompiling the entire design
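
A small sketch of the sds_alloc()/sds_free() pairing. It assumes that sds++ defines the `__SDSCC__` macro and that sds_lib.h declares `void *sds_alloc(size_t)` and `void sds_free(void *)`; the malloc-based fallback exists only so the snippet also builds on a host without the SDSoC tools. `sum_squares` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"  // assumed to provide sds_alloc() / sds_free() under sds++
#else
// Host-only stand-ins (assumption): plain heap memory, not physically contiguous.
static void* sds_alloc(std::size_t bytes) { return std::malloc(bytes); }
static void sds_free(void* p) { std::free(p); }
#endif

// Allocates a buffer suitable for passing to a hardware function, fills it,
// and releases it with the matching free routine.
int sum_squares(std::size_t n) {
    assert(n > 0);
    int* buf = static_cast<int*>(sds_alloc(n * sizeof(int)));
    if (buf == nullptr) return -1;  // allocation failed

    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = static_cast<int>(i * i);
        total += buf[i];
    }

    sds_free(buf);  // always pair sds_alloc() with sds_free()
    return total;
}
```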

hardware-centric approach

  • Keep track of the AXI4 interface offsets for the IP and the accelerator, and of which data type each function definition parameter requires. The interfaces need to be byte aligned.
  • Maintain the original Vivado IP project so that modifications to it can be quickly implemented
  • Keep the static library (.a) file and corresponding header file together.